Paperid: 1, https://arxiv.org/pdf/2505.24878.pdf
Authors:Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen
Title: Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
Abstract:
CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that while humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates of at most 40.0% (achieved by Browser-Use OpenAI-o3), far below the human-level performance of 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and data are available at this https URL.
Chinese Summary: Open CaptchaWorld is the first web-based benchmark platform for evaluating how well multimodal LLM agents solve various kinds of CAPTCHAs; experiments show that humans perform near-perfectly while top models' success rates fall far below human level, revealing a major deficit in current interactive reasoning capabilities.
English Summary: Open CaptchaWorld is introduced as the first web-based benchmark to test multimodal LLM agents' capabilities in solving diverse CAPTCHA puzzles, revealing that while humans achieve near-perfect scores, current state-of-the-art agents significantly underperform, highlighting critical gaps in interactive reasoning.

Authors:Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan
Title: Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
Abstract:
Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries and limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at https://github.com/mbzuai-oryx/Agent-X
Chinese Summary: We introduce the Agent-X benchmark for evaluating the multi-step reasoning and tool-use capabilities of vision-centric agents in real-world multimodal settings; results show that even top models such as GPT and Gemini perform poorly on complex tasks, with full-task success rates below 50%.
English Summary: The Agent-X benchmark is introduced to evaluate vision-centric agents' multi-step reasoning and tool-use capabilities in real-world multimodal settings, revealing that top models like GPT and Gemini struggle with complex tasks, achieving under 50% full-chain success.

Authors:Zilin Xiao, Jaywon Koo, Siru Ouyang, Jefferson Hernandez, Yu Meng, Vicente Ordonez
Title: ProxyThinker: Test-Time Guidance through Small Visual Reasoners
Abstract:
Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities in large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits slow-thinking reasoning, as shown by emergent sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks on spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with the performance of their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38$\times$ faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://github.com/MrZilinXiao/ProxyThinker.
Chinese: ProxyThinker is a training-free inference-time technique that adjusts decoding dynamics so that large vision-language models inherit the visual reasoning capabilities of small slow-thinking reasoners, significantly improving performance on several challenging visual benchmarks while delivering up to a 38x inference speedup.
English: ProxyThinker is an inference-time technique that enables large vision-language models to acquire enhanced visual reasoning capabilities from smaller, specialized reasoners without additional training, significantly improving performance on complex visual benchmarks while running up to 38 times faster than previous decoding-time methods.
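The decoding-time idea in the abstract (subtracting a small base model's output distribution from that of its RFT-tuned counterpart and using the difference to steer a large model) can be illustrated with a minimal sketch. The logit-space combination, the `alpha` scaling factor, and every name below are assumptions made for illustration; the abstract does not give the exact formula, and this is not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_logits(large_base, small_rft, small_base, alpha=1.0):
    """Shift the large model's logits by the (RFT - base) difference of the small pair.

    alpha is a hypothetical guidance strength, not a parameter named in the abstract.
    """
    return [lb + alpha * (sr - sb)
            for lb, sr, sb in zip(large_base, small_rft, small_base)]

# Toy next-token logits over a 3-token vocabulary (made-up numbers).
large_base = [2.0, 1.0, 0.0]   # large untuned model prefers token 0
small_base = [1.0, 1.0, 0.0]   # small untuned model
small_rft = [1.0, 3.0, 0.0]    # small RFT reasoner prefers token 1

combined = guided_logits(large_base, small_rft, small_base)
probs = softmax(combined)
next_token = max(range(len(probs)), key=probs.__getitem__)
```

In this toy case the small reasoner's preference for token 1 is transferred to the large model, which would otherwise have picked token 0. Setting `alpha` to 0 recovers the large base model's own logits.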

Authors:Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, Gedas Bertasius
Title: SiLVR: A Simple Language-based Video Reasoning Framework
Abstract:
Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an adaptive token reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study focused on video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. Code is available at https://github.com/CeeZh/SILVR.
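Chinese: SiLVR is a simple, modular, training-free video reasoning framework that splits complex video understanding into two stages: it first converts raw video into language-based representations, then uses a powerful reasoning LLM to complete the task, achieving the best performance on multiple benchmarks.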
English: SiLVR is a simple, modular, and training-free video reasoning framework that decomposes complex video understanding into two stages—first converting raw video into language-based representations and then using a powerful reasoning LLM to solve tasks—achieving state-of-the-art results on multiple benchmarks.

Authors:Cailin Zhuang, Ailin Huang, Wei Cheng, Jingwei Wu, Yaoqi Hu, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, Chi Zhang
Title: ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Abstract:
Story visualization aims to generate coherent image sequences that faithfully depict a narrative and align with character references. Despite progress in generative models, existing benchmarks are narrow in scope, often limited to short prompts, no character reference, or single-image cases, and fall short of real-world storytelling complexity. This hinders a nuanced understanding of model capabilities and limitations. We present ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings. The benchmark features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs verified by humans to ensure coherence and fidelity. Character references are carefully curated to maintain intra-story consistency across varying artistic styles. To enable thorough evaluation, ViStoryBench introduces a set of automated metrics that assess character consistency, style similarity, prompt adherence, aesthetic quality, and generation artifacts such as copy-paste behavior. These metrics are validated through human studies, and used to benchmark a broad range of open-source and commercial models. ViStoryBench offers a high-fidelity, multi-dimensional evaluation suite that facilitates systematic analysis and fosters future progress in visual storytelling.
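Chinese: ViStoryBench is a comprehensive benchmark for evaluating story visualization models across diverse narratives, styles, and character settings; it combines human-verified scripts with automated metrics for consistency, quality, and generation artifacts, promoting systematic analysis and progress in visual storytelling.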
English: ViStoryBench is a comprehensive benchmark designed to evaluate story visualization models across diverse narratives, styles, and character settings, featuring human-verified scripts and automated metrics to assess consistency, quality, and artifacts, thereby enabling systematic analysis and advancement in visual storytelling.

Authors:Shuyao Xu, Cheng Peng, Jiangxuan Long, Weidi Xu, Wei Chu, Yuan Qi
Title: Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
Abstract:
Recent advances in model distillation demonstrate that data from advanced reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer complex reasoning abilities to smaller, efficient student models. However, standard practices employ rejection sampling, discarding incorrect reasoning examples -- valuable, yet often underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? To this end, we propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. This novel objective is a simple, reference-free loss function that outperforms established methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate REDI's superiority over baseline Rejection Sampling SFT or SFT combined with DPO/SimPO on mathematical reasoning tasks. Notably, the Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples from the open Open-R1 dataset, achieves an 83.1% score on MATH-500 (pass@1). Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a model post-trained on 800k proprietary data) across various mathematical reasoning benchmarks, establishing a new state-of-the-art for 1.5B models post-trained offline with openly available data.
Chinese: This paper proposes the Reinforcement Distillation (REDI) framework, which uses two-stage learning to exploit both positive and negative reasoning traces and strengthen model reasoning; on mathematical tasks, a 1.5B model trained on public data achieves the best performance.
English: This paper introduces Reinforcement Distillation (REDI), a two-stage framework that leverages both positive and negative reasoning traces to enhance model reasoning, achieving state-of-the-art performance for 1.5B models on mathematical tasks with openly available data.

Authors:Wanyun Xie, Francesco Tonin, Volkan Cevher
Title: Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning
Abstract:
Training data mixtures greatly impact the generalization performance of large language models. Existing domain reweighting methods often rely on costly weight computations and require retraining when new data is introduced. To this end, we introduce a flexible and efficient data mixing framework, Chameleon, that employs leverage scores to quantify domain importance within a learned embedding space. We first construct a domain affinity matrix over domain embeddings. The induced leverage scores determine a mixture that upweights domains sharing common representations in embedding space. This formulation allows direct transfer to new data by computing the new domain embeddings. In experiments, we demonstrate improvements over three key scenarios: (i) our computed weights improve performance on pretraining domains with a fraction of the compute of existing methods; (ii) Chameleon can adapt to data changes without proxy retraining, boosting few-shot reasoning accuracies when transferred to new data; (iii) our method enables efficient domain reweighting in finetuning, consistently improving test perplexity on all finetuning domains over uniform mixture. Our code is available at https://github.com/LIONS-EPFL/Chameleon.
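Chinese: The Chameleon framework uses leverage scores in embedding space to dynamically adjust training-domain weights, adapting to new data without costly retraining and consistently improving language model performance in pretraining, transfer learning, and finetuning scenarios.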
English: The Chameleon framework efficiently improves language model performance by using leverage scores to dynamically reweight training domains in embedding space, enabling seamless adaptation to new data without costly retraining across pretraining, transfer learning, and fine-tuning scenarios.

Authors:Fuyuan Lyu, Linfeng Du, Yunpeng Weng, Qiufang Ying, Zhiyan Xu, Wen Zou, Haolun Wu, Xiuqiang He, Xing Tang
Title: Timing is Important: Risk-aware Fund Allocation based on Time-Series Forecasting
Abstract:
Fund allocation has been an increasingly important problem in the financial domain. In reality, we aim to allocate the funds to buy certain assets within a certain future period. Naive solutions such as prediction-only or Predict-then-Optimize approaches suffer from goal mismatch. Additionally, the introduction of the SOTA time series forecasting model inevitably introduces additional uncertainty in the predicted result. To solve both problems mentioned above, we introduce a Risk-aware Time-Series Predict-and-Allocate (RTS-PnO) framework, which holds no prior assumption on the forecasting models. Such a framework contains three features: (i) end-to-end training with objective alignment measurement, (ii) adaptive forecasting uncertainty calibration, and (iii) agnostic towards forecasting models. The evaluation of RTS-PnO is conducted over both online and offline experiments. For offline experiments, eight datasets from three categories of financial applications are used: Currency, Stock, and Cryptos. RTS-PnO consistently outperforms other competitive baselines. The online experiment is conducted on the Cross-Border Payment business at FiT, Tencent, and an 8.4\% decrease in regret is witnessed when compared with the product-line approach. The code for the offline experiment is available at https://github.com/fuyuanlyu/RTS-PnO.
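Chinese: Through end-to-end training, adaptive forecasting-uncertainty calibration, and a model-agnostic design, the RTS-PnO framework resolves goal mismatch and forecasting uncertainty in fund allocation, showing superior performance in both offline and online financial experiments.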
English: The RTS-PnO framework addresses goal mismatch and forecasting uncertainty in fund allocation by integrating end-to-end training, adaptive uncertainty calibration, and model-agnostic features, demonstrating superior performance in both offline and online financial experiments.

Authors:Li yunhan, Wu gengshen
Title: LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text
Abstract:
As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks tend to focus mainly on factual accuracy while largely neglecting important linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we propose three steps: First, we develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of $2.7\%$ noted at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016. Third, reasoning models consistently outperform base architectures. A significant outcome of our research is the release of a ranking list and Pareto analysis, which highlight the Qwen3 series as the optimal choice for cost-performance tradeoffs. This work not only establishes standardized evaluation protocols for legal LLMs but also uncovers fundamental limitations in current training data refinement approaches. Code and models are available at: https://github.com/lyxx3rd/LegalEval-Q.
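Chinese: By developing a regression model, building a specialized legal question set, and analyzing 49 large language models, this study fills the gap in evaluating the linguistic quality of legal LLMs; it finds that model performance levels off at 14 billion parameters, that reasoning models outperform base architectures, and that the Qwen3 series offers the best cost-effectiveness.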
English: This study addresses the gap in evaluating linguistic quality of legal LLMs by developing a regression model, specialized legal questions, and analyzing 49 models, revealing that performance plateaus at 14B parameters and reasoning models outperform base architectures, with the Qwen3 series identified as optimal for cost-performance balance.

Authors:Marta López-Rauhut, Hongyu Zhou, Mathieu Aubry, Loic Landrieu
Title: Segmenting France Across Four Centuries
Abstract:
Historical maps offer an invaluable perspective into territory evolution across past centuries--long before satellite or remote sensing technologies existed. Deep learning methods have shown promising results in segmenting historical maps, but publicly available datasets typically focus on a single map type or period, require extensive and costly annotations, and are not suited for nationwide, long-term analyses. In this paper, we introduce a new dataset of historical maps tailored for analyzing large-scale, long-term land use and land cover evolution with limited annotations. Spanning metropolitan France (548,305 km^2), our dataset contains three map collections from the 18th, 19th, and 20th centuries. We provide both comprehensive modern labels and 22,878 km^2 of manually annotated historical labels for the 18th and 19th century maps. Our dataset illustrates the complexity of the segmentation task, featuring stylistic inconsistencies, interpretive ambiguities, and significant landscape changes (e.g., marshlands disappearing in favor of forests). We assess the difficulty of these challenges by benchmarking three approaches: a fully-supervised model trained with historical labels, and two weakly-supervised models that rely only on modern annotations. The latter either use the modern labels directly or first perform image-to-image translation to address the stylistic gap between historical and contemporary maps. Finally, we discuss how these methods can support long-term environment monitoring, offering insights into centuries of landscape transformation. Our official project repository is publicly available at https://github.com/Archiel19/FRAx4.git.
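Chinese: This paper presents a new dataset of historical maps of France covering three centuries, intended for analyzing long-term land-use change with limited annotations, and evaluates several segmentation methods against challenges such as stylistic inconsistency and landscape change.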
English: This paper introduces a new dataset of historical maps for France spanning three centuries, designed to analyze long-term land use changes with limited annotations, and benchmarks segmentation methods to address challenges like stylistic inconsistencies and landscape transformations.

Authors:Marc González, Rachid Guerraoui, Rafael Pinot, Geovani Rizk, John Stephan, François Taïani
Title: ByzFL: Research Framework for Robust Federated Learning
Abstract:
We present ByzFL, an open-source Python library for developing and benchmarking robust federated learning (FL) algorithms. ByzFL provides a unified and extensible framework that includes implementations of state-of-the-art robust aggregators, a suite of configurable attacks, and tools for simulating a variety of FL scenarios, including heterogeneous data distributions, multiple training algorithms, and adversarial threat models. The library enables systematic experimentation via a single JSON-based configuration file and includes built-in utilities for result visualization. Compatible with PyTorch tensors and NumPy arrays, ByzFL is designed to facilitate reproducible research and rapid prototyping of robust FL solutions. ByzFL is available at https://byzfl.epfl.ch/, with source code hosted on GitHub: https://github.com/LPD-EPFL/byzfl.
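Chinese: ByzFL is an open-source Python library that provides a unified framework for developing and evaluating robust federated learning algorithms, with configurable attacks, simulated scenarios, and visualization tools to support reproducible research.
English: ByzFL is an open-source Python library offering a unified framework for developing and benchmarking robust federated learning algorithms, featuring configurable attacks, simulations, and visualization tools to support reproducible research.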
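To make "robust aggregator" concrete, here is a minimal sketch of one classic example, the coordinate-wise trimmed mean, written in plain Python. It deliberately does not use or imitate the ByzFL API (the abstract does not describe its function names); `trimmed_mean`, `f`, and the toy updates are hypothetical names and values chosen for illustration.

```python
def trimmed_mean(updates, f):
    """Coordinate-wise trimmed mean of client updates.

    For every coordinate, the f smallest and f largest values are dropped
    before averaging, which bounds the influence of up to f outliers.
    """
    n = len(updates)
    if n <= 2 * f:
        raise ValueError("need more than 2*f updates to trim f from each side")
    dim = len(updates[0])
    result = []
    for j in range(dim):
        column = sorted(u[j] for u in updates)
        kept = column[f:n - f]
        result.append(sum(kept) / len(kept))
    return result

# Three honest clients and one adversarial client (made-up vectors).
honest = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
byzantine = [[100.0, -100.0]]

aggregate = trimmed_mean(honest + byzantine, f=1)
plain_mean = [sum(u[j] for u in honest + byzantine) / 4 for j in range(2)]
```

The plain mean is dragged far away by the single adversarial vector, while the trimmed mean stays close to the honest updates.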

Authors:Zimu Liao, Jifeng Ding, Rong Fu, Siwei Cui, Ruixuan Gong, Li Wang, Boni Hu, Yi Wang, Hengjie Li, Xingcheng Zhang, Hui Wang
Title: TC-GS: A Faster Gaussian Splatting Module Utilizing Tensor Cores
Abstract:
3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where conditional alpha-blending dominates the time cost in the rendering pipeline. This paper proposes TC-GS, an algorithm-independent universal module that expands Tensor Core (TCU) applicability for 3DGS, leading to substantial speedups and seamless integration into existing 3DGS optimization frameworks. The key innovation lies in mapping alpha computation to matrix multiplication, fully utilizing otherwise idle TCUs in existing 3DGS implementations. TC-GS provides plug-and-play acceleration for existing top-tier acceleration algorithms tightly coupled with rendering pipeline designs, like Gaussian compression and redundancy elimination algorithms. Additionally, we introduce a global-to-local coordinate transformation to mitigate rounding errors from quadratic terms of pixel coordinates caused by Tensor Core half-precision computation. Extensive experiments demonstrate that our method maintains rendering quality while providing an additional 2.18x speedup over existing Gaussian acceleration algorithms, thus reaching up to a total 5.6x acceleration. The code is currently available at the anonymous repository https://github.com/TensorCore3DGS/3DGSTensorCore.
Chinese: This paper proposes TC-GS, a universal module that maps alpha computation to matrix multiplication so that Tensor Cores can accelerate 3D Gaussian Splatting, achieving a 2.18x speedup while preserving rendering quality.
English: This paper introduces TC-GS, a universal module that leverages Tensor Cores to accelerate 3D Gaussian Splatting by mapping alpha computation to matrix multiplication, achieving an additional 2.18x speedup over existing Gaussian acceleration algorithms (up to 5.6x in total) while maintaining rendering quality.
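One way to see how alpha computation can become a matrix multiplication: the Gaussian exponent is a quadratic form in the pixel coordinates, so it can be written as a dot product between a per-pixel monomial vector and a per-Gaussian coefficient vector. The sketch below works through that algebra in plain Python. The particular monomial layout, the function names, and the toy values are assumptions for illustration, not the paper's kernel; the half-precision Tensor Core path and the global-to-local coordinate transformation are not modeled.

```python
import math

def pixel_features(x, y):
    """Monomials of the pixel coordinate: one row of the pixel matrix."""
    return [x * x, x * y, y * y, x, y, 1.0]

def gaussian_coeffs(mu_x, mu_y, a, b, c):
    """Coefficients of the expanded exponent for a 2D Gaussian.

    (a, b, c) are the entries of the symmetric inverse covariance [[a, b], [b, c]].
    """
    return [
        -0.5 * a,
        -b,
        -0.5 * c,
        a * mu_x + b * mu_y,
        c * mu_y + b * mu_x,
        -(0.5 * a * mu_x ** 2 + 0.5 * c * mu_y ** 2 + b * mu_x * mu_y),
    ]

def matmul_transposed(P, G):
    """Compute P @ G^T with plain lists (stand-in for a Tensor Core matmul)."""
    return [[sum(p * g for p, g in zip(row, col)) for col in G] for row in P]

def direct_power(x, y, mu_x, mu_y, a, b, c):
    """Reference: evaluate the exponent per pixel/Gaussian pair directly."""
    dx, dy = x - mu_x, y - mu_y
    return -0.5 * (a * dx * dx + c * dy * dy) - b * dx * dy

# Toy pixels and Gaussians (mu_x, mu_y, a, b, c); values are made up.
pixels = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
gaussians = [(1.0, 1.0, 0.5, 0.1, 0.4), (2.0, 0.5, 0.3, -0.05, 0.6)]

P = [pixel_features(x, y) for x, y in pixels]
G = [gaussian_coeffs(*g) for g in gaussians]
powers = matmul_transposed(P, G)  # exponents for every pixel/Gaussian pair

opacity = 0.8  # hypothetical shared opacity
alphas = [[min(0.99, opacity * math.exp(p)) for p in row] for row in powers]
```

The quadratic monomials `x * x`, `x * y`, and `y * y` grow with the magnitude of the pixel coordinates, which is consistent with the abstract's remark that quadratic terms are the source of half-precision rounding error.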

Authors:Yucheng Zhou, Jiahao Yuan, Qianning Wang
Title: Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation
Abstract:
Recent advancements in text-to-image (T2I) generation have enabled models to produce high-quality images from textual descriptions. However, these models often struggle with complex instructions involving multiple objects, attributes, and spatial relationships. Existing benchmarks for evaluating T2I models primarily focus on general text-image alignment and fail to capture the nuanced requirements of complex, multi-faceted prompts. Given this gap, we introduce LongBench-T2I, a comprehensive benchmark specifically designed to evaluate T2I models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions, enabling a thorough assessment of a model's ability to follow complex instructions. Beyond benchmarking, we propose an agent framework (Plan2Gen) that facilitates complex instruction-driven image generation without requiring additional model training. This framework integrates seamlessly with existing T2I models, using large language models to interpret and decompose complex prompts, thereby guiding the generation process more effectively. As existing evaluation metrics, such as CLIPScore, fail to adequately capture the nuances of complex instructions, we introduce an evaluation toolkit that automates the quality assessment of generated images using a set of multi-dimensional metrics. The data and code are released at https://github.com/yczhou001/LongBench-T2I.
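Chinese: To address the weakness of text-to-image models on complex instructions, we introduce the LongBench-T2I benchmark and the Plan2Gen agent framework, which improve generation and evaluation without additional training.
English: Text-to-image models struggle with complex prompts; this work introduces LongBench-T2I, a comprehensive benchmark for complex instructions, and Plan2Gen, an agent framework that improves instruction-driven image generation without additional model training.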

Authors:Patrick Tser Jern Kon, Jiachen Liu, Xinyi Zhu, Qiuyi Ding, Jingjia Peng, Jiarong Xing, Yibo Huang, Yiming Qiu, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Matei Zaharia, Ang Chen
Title: EXP-Bench: Can AI Conduct AI Research Experiments?
Abstract:
Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With the pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
Chinese: EXP-Bench is a novel benchmark for evaluating AI agents' ability to carry out complete research experiments; results show that current agents succeed on only 0.5% of complete experiments, despite showing partial capability on individual experimental steps.
English: EXP-Bench is a novel benchmark designed to evaluate AI agents' ability to conduct complete research experiments, revealing that current agents achieve only a 0.5% success rate in executing full experiments despite partial capabilities in individual aspects.

Authors:Max Conti, Manuel Faysse, Gautier Viaud, Antoine Bosselut, Céline Hudelot, Pierre Colombo
Title: Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings
Abstract:
A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same documents independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations. In this work, we introduce ConTEB (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose InSeNT (In-sequence Negative Training), a novel contrastive post-training approach which, combined with late chunking pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on ConTEB without sacrificing base model performance. We further find that chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes. We open-source all artifacts at https://github.com/illuin-tech/contextual-embeddings.
Chinese: Modern document retrieval embedding methods often ignore document-wide context; we therefore propose the ConTEB benchmark to evaluate context usage and develop the InSeNT post-training method, which significantly improves retrieval quality while preserving efficiency.
English: Modern document retrieval embedding methods often fail to incorporate broader document context, so we introduce ConTEB to evaluate context usage and propose InSeNT, a post-training method that improves retrieval quality while maintaining efficiency.
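The "late chunking pooling" mentioned in the abstract can be sketched as follows, under the common reading of the term: the whole document is encoded once so that every token embedding sees document-wide context, and only afterwards are the token embeddings averaged within each chunk's span. The encoder is omitted, the toy token vectors are made up, and `late_chunk_pool` is a hypothetical helper name; the InSeNT contrastive training itself is not shown.

```python
def late_chunk_pool(token_embeddings, chunk_spans):
    """Mean-pool contextualized token embeddings inside each chunk span.

    token_embeddings: list of per-token vectors produced by encoding the full
                      document in one pass (assumed to be given).
    chunk_spans:      list of (start, end) token index pairs, end exclusive.
    """
    pooled = []
    for start, end in chunk_spans:
        tokens = token_embeddings[start:end]
        if not tokens:
            raise ValueError("empty chunk span")
        dim = len(tokens[0])
        pooled.append([sum(t[d] for t in tokens) / len(tokens) for d in range(dim)])
    return pooled

# Toy 2-dimensional token embeddings for a 5-token document.
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
chunk_spans = [(0, 2), (2, 5)]  # two chunks covering the document

chunk_vectors = late_chunk_pool(token_embeddings, chunk_spans)
```

The contrast is with encoding each chunk independently, where tokens in one chunk never attend to the rest of the document.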

Authors:Yidong Luo, Chenguang Wang, Jiahao Yang, Fanzeng Xia, Tianshu Yu
Title: EVA-MILP: Towards Standardized Evaluation of MILP Instance Generation
Abstract:
Mixed-Integer Linear Programming (MILP) is fundamental to solving complex decision-making problems. The proliferation of MILP instance generation methods, driven by machine learning's demand for diverse optimization datasets and the limitations of static benchmarks, has significantly outpaced standardized evaluation techniques. Consequently, assessing the fidelity and utility of synthetic MILP instances remains a critical, multifaceted challenge. This paper introduces a comprehensive benchmark framework designed for the systematic and objective evaluation of MILP instance generation methods. Our framework provides a unified and extensible methodology, assessing instance quality across crucial dimensions: mathematical validity, structural similarity, computational hardness, and utility in downstream machine learning tasks. A key innovation is its in-depth analysis of solver-internal features -- particularly by comparing distributions of key solver outputs including root node gap, heuristic success rates, and cut plane usage -- leveraging the solver's dynamic solution behavior as an 'expert assessment' to reveal nuanced computational resemblances. By offering a structured approach with clearly defined solver-independent and solver-dependent metrics, our benchmark aims to facilitate robust comparisons among diverse generation techniques, spur the development of higher-quality instance generators, and ultimately enhance the reliability of research reliant on synthetic MILP data. The framework's effectiveness in systematically comparing the fidelity of instance sets is demonstrated using contemporary generative models.
Chinese Summary: This paper proposes a comprehensive benchmark framework that systematically evaluates the quality of mixed-integer linear programming instance generation methods along four dimensions (mathematical validity, structural similarity, computational hardness, and downstream utility), combined with an analysis of solver-internal features.
English Summary: This paper introduces a comprehensive benchmark framework for systematically evaluating Mixed-Integer Linear Programming instance generation methods, assessing mathematical validity, structural similarity, computational hardness, and downstream utility through novel solver-internal feature analysis.

Authors:Jiayu Liu, Qing Zong, Weiqi Wang, Yangqiu Song
Title: Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty?
Abstract:
As large language models (LLMs) are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans typically express confidence through epistemic markers (e.g., "fairly confident") instead of numerical values. However, because the uncertainty associated with individual markers is difficult to quantify, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence. To address this gap, we first define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker-based confidence and actual model uncertainty. Our code is available at https://github.com/HKUST-KnowComp/MarCon.
中文摘要:本研究评估了认知标记词在反映大语言模型内在置信度方面的可靠性,发现标记词在同一数据分布内能保持稳定的准确性,但在分布外场景中表现出不稳定性,这对基于标记词的置信度评估的可信度提出了重要关切。
English Summary: This study evaluates the reliability of epistemic markers in reflecting large language models' intrinsic confidence, finding that while markers maintain consistent accuracy within the same data distribution, they exhibit instability in out-of-distribution scenarios, highlighting concerns about their trustworthiness for confidence estimation.
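
The definition of marker confidence above (observed accuracy whenever a model uses a given marker) can be illustrated with a minimal sketch. The record format, the function name `marker_confidence`, and the toy data are assumptions made for illustration; they are not taken from the MarCon code or the paper's results.

```python
from collections import defaultdict

def marker_confidence(records):
    """Map each epistemic marker to the observed accuracy of answers that used it.

    `records` is an iterable of (marker, is_correct) pairs; this input
    format is an assumption made for illustration.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for marker, is_correct in records:
        totals[marker] += 1
        hits[marker] += int(is_correct)
    return {m: hits[m] / totals[m] for m in totals}

# Toy, made-up records (not results from the paper).
in_dist = [("fairly confident", True), ("fairly confident", True),
           ("fairly confident", False), ("not sure", False), ("not sure", True)]
out_dist = [("fairly confident", False), ("fairly confident", True),
            ("not sure", False), ("not sure", False)]

conf_id = marker_confidence(in_dist)
conf_ood = marker_confidence(out_dist)

# Absolute shift in marker confidence between the two settings.
gap = {m: abs(conf_id[m] - conf_ood[m]) for m in conf_id if m in conf_ood}
```

A large `gap` for a marker would indicate that the marker's meaning is not stable across distributions, which is the kind of inconsistency the abstract reports.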

Authors:Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf
Title: REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Abstract:
We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG both for evaluating reasoning models and for training them with reinforcement learning.
中文: Reasoning Gym是一个强化学习推理环境库,通过100多个可验证奖励的领域生成器实现无限可调难度的训练数据,有效支持推理模型的评估与训练。
English: Reasoning Gym is a reinforcement learning library featuring over 100 procedurally generated environments with verifiable rewards across multiple domains, enabling infinite adjustable-difficulty training data for effective model evaluation and training.
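
To make the idea of a procedural generator paired with a verifier concrete, here is a small self-contained sketch. It does not use the Reasoning Gym API; `generate_sum_task`, `verify`, and the difficulty parameters are hypothetical names chosen for illustration.

```python
import random

def generate_sum_task(seed, num_terms=2, max_value=9):
    """Procedurally generate one addition task.

    Difficulty is adjusted through `num_terms` and `max_value`
    (hypothetical knobs for this sketch).
    """
    rng = random.Random(seed)
    terms = [rng.randint(0, max_value) for _ in range(num_terms)]
    question = " + ".join(str(t) for t in terms) + " = ?"
    return {"question": question, "answer": str(sum(terms))}

def verify(task, model_output):
    """Verifiable reward: 1.0 if the output matches the reference answer, else 0.0."""
    return 1.0 if model_output.strip() == task["answer"] else 0.0

# The same seed always yields the same task; changing the knobs scales difficulty.
easy = generate_sum_task(seed=0)
hard = generate_sum_task(seed=0, num_terms=6, max_value=999)
```

Because every task is derived from a seed and a few parameters, an unlimited number of tasks can be drawn at any difficulty, and each one carries an answer that can be checked automatically.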

Authors:Yingchaojie Feng, Yiqun Sun, Yandong Sun, Minfeng Zhu, Qiang Huang, Anthony K. H. Tung, Wei Chen
Title: Don't Reinvent the Wheel: Efficient Instruction-Following Text Embedding based on Guided Space Transformation
Abstract:
In this work, we investigate an important task named instruction-following text embedding, which generates dynamic text embeddings that adapt to user instructions, highlighting specific attributes of text. Despite recent advancements, existing approaches suffer from significant computational overhead, as they require re-encoding the entire corpus for each new instruction. To address this challenge, we propose GSTransform, a novel instruction-following text embedding framework based on Guided Space Transformation. Our key observation is that instruction-relevant information is inherently encoded in generic embeddings but remains underutilized. Instead of repeatedly encoding the corpus for each instruction, GSTransform is a lightweight transformation mechanism that adapts pre-computed embeddings in real time to align with user instructions, guided by a small amount of text data with instruction-focused label annotation. We conduct extensive experiments on three instruction-awareness downstream tasks across nine real-world datasets, demonstrating that GSTransform improves instruction-following text embedding quality over state-of-the-art methods while achieving dramatic speedups of 6~300x in real-time processing on large-scale datasets. The source code is available at https://github.com/YingchaojieFeng/GSTransform.
中文: 本文提出GSTransform框架,通过引导空间变换将预计算文本嵌入动态适配用户指令,在提升嵌入质量的同时实现比现有方法快6-300倍的实时处理速度。
English: This paper introduces GSTransform, a lightweight framework that dynamically adapts pre-computed text embeddings to user instructions through guided space transformation, achieving superior embedding quality and 6-300x faster processing than existing methods.

Authors:Jiazhong Cen, Xudong Zhou, Jiemin Fang, Changsong Wen, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian
Title: Tackling View-Dependent Semantics in 3D Language Gaussian Splatting
Abstract:
Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints--a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. Then, it constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at: https://github.com/SJTU-DeepVisionLab/LaGa.
中文: LaGa通过将3D场景分解为对象并构建聚合语义表征,有效解决了3D高斯溅射中视角依赖语义的挑战,在LERF-OVS数据集上相比之前最优方法实现了18.7% mIoU的显著提升。
English: LaGa addresses the challenge of view-dependent semantics in 3D Gaussian Splatting by decomposing scenes into objects and constructing aggregated semantic representations, achieving a notable +18.7% mIoU improvement over previous methods on the LERF-OVS dataset.

Authors:Jisheng Dang, Jingze Wu, Teng Wang, Xuanhui Lin, Nannan Zhu, Hongbo Chen, Wei-Shi Zheng, Meng Wang, Tat-Seng Chua
Title: Reinforcing Video Reasoning with Focused Thinking
Abstract:
Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) models often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues and 2) binary rewarding fails to account for partially correct answers, resulting in high reward variance and inefficient learning. In this paper, we propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density (estimated by intra-group information entropy), suppressing redundant tokens like generic reasoning prefixes. Furthermore, we reformulate RL training by shifting from single-choice to multi-choice QA tasks, where soft rewards enable finer-grained gradient estimation by distinguishing partial correctness. Additionally, we propose question-answer inversion, a data augmentation strategy to generate diverse multi-choice samples from existing benchmarks. Experiments demonstrate state-of-the-art performance on several video reasoning and general understanding benchmarks. Notably, TW-GRPO achieves 50.4% accuracy on CLEVRER (18.8% improvement over Video-R1) and 65.8% on MMVU. Our code is available at https://github.com/longmalongma/TW-GRPO.
中文:TW-GRPO通过引入令牌加权机制聚焦关键信息和采用密集奖励评估部分正确性,显著提升了多模态推理能力,在CLEVRER和MMVU等基准测试中达到最优性能。
English: TW-GRPO enhances multimodal reasoning by introducing token weighting to focus on key information and dense rewards for partial correctness, achieving state-of-the-art results on benchmarks like CLEVRER and MMVU.
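
The contrast between binary and soft rewards on multi-choice questions can be sketched as follows. The specific rule used here (partial credit only when every selected option is correct) is an assumption chosen for illustration; the abstract does not state the exact reward formula used by TW-GRPO.

```python
def binary_reward(predicted, correct):
    """All-or-nothing reward: 1.0 only for an exact match of the option sets."""
    return 1.0 if set(predicted) == set(correct) else 0.0

def soft_reward(predicted, correct):
    """Partial-credit reward for multi-choice answers.

    Assumed rule for this sketch: any wrong option gives 0.0; otherwise the
    reward is the fraction of correct options that were selected.
    """
    predicted, correct = set(predicted), set(correct)
    if not predicted or not predicted <= correct:
        return 0.0
    return len(predicted) / len(correct)

# "AB" selects two of the three correct options "ABC".
partial_binary = binary_reward("AB", "ABC")
partial_soft = soft_reward("AB", "ABC")
```

Under the binary rule a partially correct answer is indistinguishable from a wrong one, whereas the soft rule separates the two cases and so gives a less sparse learning signal.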

Authors:Benjamin Holzschuh, Qiang Liu, Georg Kohl, Nils Thuerey
Title: PDE-Transformer: Efficient and Versatile Transformers for Physics Simulations
Abstract:
We introduce PDE-Transformer, an improved transformer-based architecture for surrogate modeling of physics simulations on regular grids. We combine recent architectural improvements of diffusion transformers with adjustments specific for large-scale simulations to yield a more scalable and versatile general-purpose transformer architecture, which can be used as the backbone for building large-scale foundation models in physical sciences. We demonstrate that our proposed architecture outperforms state-of-the-art transformer architectures for computer vision on a large dataset of 16 different types of PDEs. We propose to embed different physical channels individually as spatio-temporal tokens, which interact via channel-wise self-attention. This helps to maintain a consistent information density of tokens when learning multiple types of PDEs simultaneously. We demonstrate that our pre-trained models achieve improved performance on several challenging downstream tasks compared to training from scratch and also beat other foundation model architectures for physics simulations.
中文: PDE-Transformer是一种改进的Transformer架构,在规则网格上模拟多种偏微分方程时优于现有最佳模型,并通过预训练在下游任务中表现出色。
English: PDE-Transformer is an enhanced transformer architecture that outperforms state-of-the-art models in simulating various PDEs on regular grids and excels in downstream tasks through pre-training.

Authors:Junyu Luo, Zhizhuo Kou, Liming Yang, Xiao Luo, Jinsheng Huang, Zhiping Xiao, Jingshu Peng, Chengzhong Liu, Jiaming Ji, Xuanzhe Liu, Sirui Han, Ming Zhang, Yike Guo
Title: FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
Abstract:
Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. Additionally, we develop FinScore, an evaluation system incorporating hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experimental results demonstrate that even state-of-the-art models like GPT-4o exhibit unsatisfactory performance on FinMME, highlighting its challenging nature. The benchmark exhibits high robustness with prediction variations under different prompts remaining below 1%, demonstrating superior reliability compared to existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.
中文: 针对多模态大语言模型在金融领域缺乏专业评估数据集的问题,我们开发了包含逾1.1万样本的FinMME数据集及FinScore评估体系,实验表明即使GPT-4o等顶尖模型在该高鲁棒性基准上也表现不佳。
English: Multimodal Large Language Models (MLLMs) lack specialized financial evaluation datasets, prompting the creation of FinMME with over 11,000 samples and FinScore for unbiased assessment, revealing even top models like GPT-4o struggle on this robust benchmark.

Authors:Raman Jha, Adithya Lenka, Mani Ramanagopal, Aswin Sankaranarayanan, Kaushik Mitra
Title: RT-X Net: RGB-Thermal cross attention network for Low-Light Image Enhancement
Abstract:
In nighttime conditions, high noise levels and bright illumination sources degrade image quality, making low-light image enhancement challenging. Thermal images provide complementary information, offering richer textures and structural details. We propose RT-X Net, a cross-attention network that fuses RGB and thermal images for nighttime image enhancement. We leverage self-attention networks for feature extraction and a cross-attention mechanism for fusion to effectively integrate information from both modalities. To support research in this domain, we introduce the Visible-Thermal Image Enhancement Evaluation (V-TIEE) dataset, comprising 50 co-located visible and thermal images captured under diverse nighttime conditions. Extensive evaluations on the publicly available LLVIP dataset and our V-TIEE dataset demonstrate that RT-X Net outperforms state-of-the-art methods in low-light image enhancement. The code and the V-TIEE can be found here https://github.com/jhakrraman/rt-xnet.
中文摘要:RT-X Net通过交叉注意力网络融合RGB与热成像数据来增强夜间图像质量,在公开LLVIP数据集及新构建的V-TIEE数据集上均优于现有先进方法。
English Summary: RT-X Net is a cross-attention network that enhances nighttime images by fusing RGB and thermal data, outperforming existing methods on both the LLVIP and newly introduced V-TIEE datasets.

Authors:Julio Silva-Rodríguez, Ismail Ben Ayed, Jose Dolz
Title: Conformal Prediction for Zero-Shot Models
Abstract:
Vision-language models pre-trained at large scale have shown unprecedented adaptability and generalization to downstream tasks. Although their discriminative potential has been widely explored, their reliability and uncertainty are still overlooked. In this work, we investigate the capabilities of CLIP models under the split conformal prediction paradigm, which provides theoretical guarantees to black-box models based on a small, labeled calibration set. In contrast to the main body of literature on conformal predictors in vision classifiers, foundation models exhibit a particular characteristic: they are pre-trained on a one-time basis on an inaccessible source domain, different from the transferred task. This domain drift negatively affects the efficiency of the conformal sets and poses additional challenges. To alleviate this issue, we propose Conf-OT, a transfer learning setting that operates transductively over the combined calibration and query sets. By solving an optimal transport problem, the proposed method bridges the domain gap between pre-training and adaptation without requiring additional data splits while still maintaining coverage guarantees. We comprehensively explore this conformal prediction strategy on a broad span of 15 datasets and three non-conformity scores. Conf-OT provides consistent relative improvements of up to 20% on set efficiency while being 15 times faster than popular transductive approaches.
Chinese: 本研究针对大规模视觉语言模型在可靠性方面的不足,提出Conf-OT方法,通过最优传输理论桥接领域差异,在15个数据集上实现最高20%的集合效率提升,运算速度比现有方法快15倍,同时保持理论覆盖保证。
English: Large-scale vision-language models exhibit strong adaptability but face reliability challenges, which this study addresses by proposing Conf-OT, a conformal prediction method that enhances set efficiency by up to 20% and speeds up processing 15-fold while maintaining theoretical guarantees across diverse datasets.
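
For readers unfamiliar with the split conformal prediction paradigm mentioned above, the following is a minimal sketch of the standard procedure with the simple non-conformity score 1 - p(true class). It is plain split conformal prediction, not Conf-OT: the optimal transport step is not shown, and the function names and toy probabilities are assumptions made for illustration.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold targeting coverage of at least 1 - alpha.

    Uses the score 1 - p(true class) and the usual finite-sample rank
    ceil((n + 1) * (1 - alpha)) among the sorted calibration scores.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the rank exceeds n: include every class.
    return scores[k - 1] if k <= n else float("inf")

def prediction_set(probs, threshold):
    """Return all class indices whose score does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]

# Toy calibration set: 7 examples, 3 classes (made-up probabilities).
cal_probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7],
             [0.6, 0.3, 0.1], [0.25, 0.5, 0.25], [0.3, 0.3, 0.4],
             [0.3, 0.35, 0.35]]
cal_labels = [0, 1, 2, 0, 1, 2, 0]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
ambiguous_set = prediction_set([0.5, 0.45, 0.05], threshold)
confident_set = prediction_set([0.95, 0.03, 0.02], threshold)
```

Smaller prediction sets at the same coverage level correspond to better set efficiency, which is the quantity the abstract reports improving.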

Authors:Sander Land, Catherine Arnett
Title: BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
Abstract:
Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our experiments demonstrate that SCRIPT-BPE achieves competitive compression while eliminating encoding-based penalties for non-Latin-script languages.
中文: SCRIPT-BPE提出基于Unicode脚本类别的编码方案,通过遵循文字边界的预分词规则取代脆弱的正则表达式方法,并在BPE合并时保持字符完整性,在维持压缩率的同时消除了非拉丁文字的编码惩罚。
English: SCRIPT-BPE introduces a Unicode-based encoding scheme that replaces fragile regex pretokenization with script-boundary rules and enforces character integrity during BPE merging, eliminating encoding penalties for non-Latin scripts while maintaining competitive compression.
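
A rule-based pretokenizer that splits on script and category boundaries can be approximated with the Python standard library. This is only a rough sketch: `unicodedata` exposes the general category but not the Unicode Script property, so the first word of the character name is used as a crude script proxy for letters. The grouping rule and function names are assumptions and do not reproduce the SCRIPT encoding itself.

```python
import unicodedata

def script_and_category(ch):
    """Return a (script proxy, major category) key for one character."""
    category = unicodedata.category(ch)  # e.g. "Lu", "Nd", "Zs"
    if category[0] == "L":
        # First word of the character name, e.g. "LATIN" or "CYRILLIC".
        # This is an approximation of the Unicode Script property.
        script = unicodedata.name(ch, "UNKNOWN").split()[0]
    else:
        script = "COMMON"
    return script, category[0]

def pretokenize(text):
    """Split text wherever the (script proxy, major category) key changes."""
    chunks = []
    previous_key = None
    for ch in text:
        key = script_and_category(ch)
        if chunks and key == previous_key:
            chunks[-1] += ch
        else:
            chunks.append(ch)
        previous_key = key
    return chunks

sample = "Hello мир 42"
pieces = pretokenize(sample)
```

No regular expression is involved: the split points follow directly from per-character properties, which is the robustness argument made in the abstract.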

Authors:Qinglin Zhu, Runcong Zhao, Hanqi Yan, Yulan He, Yudong Chen, Lin Gui
Title: Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration
Abstract:
Large Language Models (LLMs) struggle with complex reasoning due to limited diversity and inefficient search. We propose Soft Reasoning, an embedding-based search framework that optimises the embedding of the first token to guide generation. It combines (1) embedding perturbation for controlled exploration and (2) Bayesian optimisation to refine embeddings via a verifier-guided objective, balancing exploration and exploitation. This approach improves reasoning accuracy and coherence while avoiding reliance on heuristic search. Experiments demonstrate superior correctness with minimal computation, making it a scalable, model-agnostic solution. The code is released at https://github.com/alickzhu/Soft-Reasoning.
中文摘要:Soft Reasoning是一种基于嵌入的搜索框架,通过优化初始标记嵌入并结合受控探索与贝叶斯优化,有效提升大语言模型在复杂推理中的准确性和可扩展性,同时显著降低计算成本。
English Summary: Soft Reasoning is an embedding-based search framework that enhances complex reasoning in LLMs by optimizing initial token embeddings through controlled exploration and Bayesian optimization, improving accuracy and scalability with minimal computation.

Authors:Jiahe Chen, Jiahe Ying, Shen Wang, Jianwei Zheng
Title: Decoupled Competitive Framework for Semi-supervised Medical Image Segmentation
Abstract:
Confronting the critical challenge of insufficiently annotated samples in the medical domain, semi-supervised medical image segmentation (SSMIS) emerges as a promising solution. Specifically, most methodologies following the Mean Teacher (MT) or Dual Students (DS) architecture have achieved commendable results. However, to date, these approaches face a performance bottleneck due to two inherent limitations, namely the over-coupling problem within the MT structure owing to its exponential moving average (EMA) mechanism, and the severe cognitive bias between the two students of the DS structure, both of which potentially lead to reduced efficacy, or even model collapse eventually. To mitigate these issues, a Decoupled Competitive Framework (DCF) is elaborated in this work, which utilizes a straightforward competition mechanism for the update of EMA, effectively decoupling students and teachers dynamically. In addition, the seamless exchange of invaluable and precise insights is facilitated among students, guaranteeing a better learning paradigm. The DCF is rigorously validated on three publicly accessible datasets, which encompass both 2D and 3D datasets. The results demonstrate the superiority of our method over previous cutting-edge competitors. Code will be available at https://github.com/JiaheChen2002/DCF.
中文: 针对半监督医学图像分割中因均值教师结构过耦合和双学生结构认知偏差导致的性能瓶颈,本研究提出解耦竞争框架(DCF),通过竞争机制动态解耦师生关系并促进知识无缝交换,在多个数据集上取得优越性能。
English: To overcome performance bottlenecks in semi-supervised medical image segmentation caused by over-coupling in Mean Teacher and cognitive bias in Dual Students, this study proposes a Decoupled Competitive Framework (DCF) that dynamically decouples students and teachers through a competition mechanism and enables seamless knowledge exchange, achieving superior results on multiple datasets.

Authors:Yiqun Sun, Qiang Huang, Anthony K. H. Tung, Jun Yu
Title: PRISM: A Framework for Producing Interpretable Political Bias Embeddings with Political-Aware Cross-Encoder
Abstract:
Semantic Text Embedding is a fundamental NLP task that encodes textual content into vector representations, where proximity in the embedding space reflects semantic similarity. While existing embedding models excel at capturing general meaning, they often overlook ideological nuances, limiting their effectiveness in tasks that require an understanding of political bias. To address this gap, we introduce PRISM, the first framework designed to Produce inteRpretable polItical biaS eMbeddings. PRISM operates in two key stages: (1) Controversial Topic Bias Indicator Mining, which systematically extracts fine-grained political topics and their corresponding bias indicators from weakly labeled news data, and (2) Cross-Encoder Political Bias Embedding, which assigns structured bias scores to news articles based on their alignment with these indicators. This approach ensures that embeddings are explicitly tied to bias-revealing dimensions, enhancing both interpretability and predictive power. Through extensive experiments on two large-scale datasets, we demonstrate that PRISM outperforms state-of-the-art text embedding models in political bias classification while offering highly interpretable representations that facilitate diversified retrieval and ideological analysis. The source code is available at https://github.com/dukesun99/ACL-PRISM.
中文: PRISM是首个可生成可解释政治偏见嵌入的框架,通过从新闻数据中提取细粒度偏见指标并为文章分配结构化偏见分数,在政治偏见分类任务中优于现有模型,同时提供便于意识形态分析的透明表征。
English: PRISM is a novel framework that generates interpretable political bias embeddings by extracting bias indicators from news data and assigning structured bias scores, outperforming existing models in political bias classification while providing transparent representations for ideological analysis.

Authors:Masahiro Negishi, Thomas Gärtner, Pascal Welke
Title: WILTing Trees: Interpreting the Distance Between MPNN Embeddings
Abstract:
We investigate the distance function learned by message passing neural networks (MPNNs) in specific tasks, aiming to capture the functional distance between prediction targets that MPNNs implicitly learn. This contrasts with previous work, which links MPNN distances on arbitrary tasks to structural distances on graphs that ignore task-specific information. To address this gap, we distill the distance between MPNN embeddings into an interpretable graph distance. Our method uses optimal transport on the Weisfeiler Leman Labeling Tree (WILT), where the edge weights reveal subgraphs that strongly influence the distance between embeddings. This approach generalizes two well-known graph kernels and can be computed in linear time. Through extensive experiments, we demonstrate that MPNNs define the relative position of embeddings by focusing on a small set of subgraphs that are known to be functionally important in the domain.
Chinese: 本研究探讨了消息传递神经网络如何学习预测目标间的功能距离,通过基于Weisfeiler Leman标记树的最优传输方法将其提炼为可解释的图距离,从而高效识别关键子图结构。
English: This study explores how message passing neural networks (MPNNs) learn functional distances between prediction targets, distilling these into an interpretable graph distance using optimal transport on the Weisfeiler Leman Labeling Tree to identify influential subgraphs efficiently.
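
The linear-time claim rests on a general property of optimal transport on trees: for two mass distributions with equal total mass, the transport cost is the sum over edges of the edge weight times the absolute mass imbalance in the subtree below that edge. The sketch below implements this generic closed form. How graphs are mapped to masses on the tree, and how the edge weights are learned, are specific to the paper and not shown; the node numbering convention is an assumption of this sketch.

```python
def tree_wasserstein(parent, weight, mu, nu):
    """Optimal transport cost between two distributions on the nodes of a tree.

    parent[v] is the parent of node v, with parent[0] == -1 for the root,
    and nodes are numbered so that parent[v] < v (an assumed convention).
    weight[v] is the weight of the edge between v and parent[v].
    mu and nu must carry the same total mass.
    """
    diff = [m - n for m, n in zip(mu, nu)]
    total = 0.0
    # Visit children before parents so diff[v] holds the subtree imbalance.
    for v in range(len(parent) - 1, 0, -1):
        total += weight[v] * abs(diff[v])
        diff[parent[v]] += diff[v]
    return total

# Toy tree: 0 is the root, 1 and 2 are its children, 3 is a child of 1.
parent = [-1, 0, 0, 1]
weight = [0.0, 1.0, 2.0, 0.5]

# All mass on node 3 versus all mass on node 2: the path 3-1-0-2 costs 0.5 + 1.0 + 2.0.
distance = tree_wasserstein(parent, weight, [0, 0, 0, 1], [0, 0, 1, 0])
```

Each edge is visited once, so the cost is linear in the number of tree nodes, and edges with large weights and large imbalances are the ones that dominate the distance.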

Authors:Xuzhi Wang, Wei Feng, Lingdong Kong, Liang Wan
Title: NUC-Net: Non-uniform Cylindrical Partition Network for Efficient LiDAR Semantic Segmentation
Abstract:
LiDAR semantic segmentation plays a vital role in autonomous driving. Existing voxel-based methods for LiDAR semantic segmentation apply uniform partition to the 3D LiDAR point cloud to form a structured representation based on Cartesian/cylindrical coordinates. Although these methods show impressive performance, they have two drawbacks: (1) they require a sufficiently large input voxel resolution, which incurs substantial computation cost and memory consumption; (2) they do not handle the unbalanced point distribution of LiDAR point clouds well. In this paper, we propose a non-uniform cylindrical partition network named NUC-Net to tackle the above challenges. Specifically, we propose the Arithmetic Progression of Interval (API) method to non-uniformly partition the radial axis and generate a voxel representation that is both representative and efficient. Moreover, we propose a non-uniform multi-scale aggregation method to improve contextual information. Our method achieves state-of-the-art performance on the SemanticKITTI and nuScenes datasets with much faster speed and much less training time. Our method can also serve as a general component for LiDAR semantic segmentation, significantly improving both the accuracy and efficiency of the uniform counterpart with 4× faster training, a 2× reduction in GPU memory, and a 3× inference speedup. We further provide theoretical analysis towards understanding why NUC is effective and how point distribution affects performance. Code is available at https://github.com/alanWXZ/NUC-Net.
中文: 提出的NUC-Net采用非均匀柱面分区和多尺度聚合方法,有效解决了现有激光雷达语义分割方法计算效率低和点云分布不均的问题,在保持顶尖性能的同时大幅提升了速度和内存效率。
English: The proposed NUC-Net introduces non-uniform cylindrical partitioning and multi-scale aggregation to overcome the computational inefficiency and unbalanced point distribution issues in existing LiDAR semantic segmentation methods, achieving state-of-the-art performance with significant speed and memory improvements.
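
One way to read the Arithmetic Progression of Interval idea is that radial bin widths grow by a constant increment with distance, so that nearby, densely sampled regions get fine bins and distant, sparse regions get coarse ones. The sketch below follows that reading; the parametrization (`first_width`, a derived `step`) and the function names are assumptions, and the paper's exact formulation may differ.

```python
import bisect

def api_radial_edges(r_min, r_max, num_bins, first_width):
    """Radial bin edges whose widths form an arithmetic progression.

    Widths are first_width, first_width + step, ..., with step chosen so
    that the bins exactly cover [r_min, r_max]. Choose first_width no larger
    than the uniform width so that step is non-negative.
    """
    if num_bins < 2:
        raise ValueError("num_bins must be at least 2")
    span = r_max - r_min
    step = 2.0 * (span - num_bins * first_width) / (num_bins * (num_bins - 1))
    edges = [r_min]
    for i in range(num_bins):
        edges.append(edges[-1] + first_width + i * step)
    edges[-1] = r_max  # guard against floating-point drift
    return edges

def radial_bin(r, edges):
    """Index of the bin containing radius r, clamped to the valid range."""
    index = bisect.bisect_right(edges, r) - 1
    return min(max(index, 0), len(edges) - 2)

# Four bins over 0-50 m starting at 5 m wide: widths 5, 10, 15, 20.
edges = api_radial_edges(0.0, 50.0, num_bins=4, first_width=5.0)
```

Compared with four uniform 12.5 m bins, this layout spends more resolution close to the sensor without increasing the total number of voxels along the radial axis.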

Authors:Ivan Pereira-Sánchez, Julia Navarro, Ana Belén Petro, Joan Duran
Title: Model-Guided Network with Cluster-Based Operators for Spatio-Spectral Super-Resolution
Abstract:
This paper addresses the problem of reconstructing a high-resolution hyperspectral image from a low-resolution multispectral observation. While spatial super-resolution and spectral super-resolution have been extensively studied, joint spatio-spectral super-resolution remains relatively unexplored. We propose an end-to-end model-driven framework that explicitly decomposes the joint spatio-spectral super-resolution problem into spatial super-resolution, spectral super-resolution and fusion tasks. Each sub-task is addressed by unfolding a variational-based approach, where the operators involved in the proximal gradient iterative scheme are replaced with tailored learnable modules. In particular, we design an upsampling operator for spatial super-resolution based on classical back-projection algorithms, adapted to handle arbitrary scaling factors. Spectral reconstruction is performed using learnable cluster-based upsampling and downsampling operators. For image fusion, we integrate low-frequency estimation and high-frequency injection modules to combine the spatial and spectral information from spatial super-resolution and spectral super-resolution outputs. Additionally, we introduce an efficient nonlocal post-processing step that leverages image self-similarity by combining a multi-head attention mechanism with residual connections. Extensive evaluations on several datasets and sampling factors demonstrate the effectiveness of our approach. The source code will be available at https://github.com/TAMI-UIB/JSSUNet.
中文: 本文提出了一种模型驱动的联合空谱超分辨率框架,通过将问题分解为空间超分辨率、光谱超分辨率和融合子任务,采用可学习模块和非局部后处理步骤,在多个数据集和缩放因子下实现了优越性能。
English: This paper proposes a model-driven framework for joint spatio-spectral super-resolution that decomposes the problem into spatial super-resolution, spectral super-resolution, and fusion tasks, utilizing learnable modules and a nonlocal post-processing step to achieve superior performance across various datasets and scaling factors.

Authors:Qinghe Ma, Jian Zhang, Lei Qi, Qian Yu, Yinghuan Shi, Yang Gao
Title: Unleashing the Power of Intermediate Domains for Mixed Domain Semi-Supervised Medical Image Segmentation
Abstract:
Both limited annotation and domain shift are prevalent challenges in medical image segmentation. Traditional semi-supervised segmentation and unsupervised domain adaptation methods address one of these issues separately. However, the coexistence of limited annotation and domain shift is quite common, which motivates us to introduce a novel and challenging scenario: Mixed Domain Semi-supervised medical image Segmentation (MiDSS), where limited labeled data from a single domain and a large amount of unlabeled data from multiple domains are available. To tackle this issue, we propose the UST-RUN framework, which fully leverages intermediate domain information to facilitate knowledge transfer. We employ Unified Copy-paste (UCP) to construct intermediate domains, and propose a Symmetric GuiDance training strategy (SymGD) to supervise unlabeled data by merging pseudo-labels from intermediate samples. Subsequently, we introduce a Training Process aware Random Amplitude MixUp (TP-RAM) to progressively incorporate style-transition components into intermediate samples. To generate more diverse intermediate samples, we further select reliable samples with high-quality pseudo-labels, which are then mixed with other unlabeled data. Additionally, we generate sophisticated intermediate samples with high-quality pseudo-labels for unreliable samples, ensuring effective knowledge transfer for them. Extensive experiments on four public datasets demonstrate the superiority of UST-RUN. Notably, UST-RUN achieves a 12.94% improvement in Dice score on the Prostate dataset. Our code is available at https://github.com/MQinghe/UST-RUN.
中文: 本研究针对医学图像分割中标注有限和领域偏移并存的问题,提出了MiDSS场景和UST-RUN框架,通过统一复制粘贴构建中间域、对称引导训练和训练过程感知的随机幅度混合等创新方法,有效促进知识迁移,在前列腺数据集上实现了12.94%的Dice指标提升。
English: The study introduces the MiDSS scenario to address both limited annotations and domain shifts in medical image segmentation, proposing the UST-RUN framework that utilizes intermediate domains and innovative strategies like UCP, SymGD, and TP-RAM to enhance knowledge transfer and achieve significant performance gains, such as a 12.94% Dice improvement on the Prostate dataset.

Authors:Simone Cammarasana, Giuseppe Patanè
Title: Optimal Weighted Convolution for Classification and Denoising
Abstract:
We introduce a novel weighted convolution operator that enhances traditional convolutional neural networks (CNNs) by integrating a spatial density function into the convolution operator. This extension enables the network to differentially weight neighbouring pixels based on their relative position to the reference pixel, improving spatial characterisation and feature extraction. The proposed operator maintains the same number of trainable parameters and is fully compatible with existing CNN architectures. Although developed for 2D image data, the framework is generalisable to signals on regular grids of arbitrary dimensions, such as 3D volumetric data or 1D time series. We propose an efficient implementation of the weighted convolution by pre-computing the density function and achieving execution times comparable to standard convolution layers. We evaluate our method on two deep learning tasks: image classification using the CIFAR-100 dataset [KH+09] and image denoising using the DIV2K dataset [AT17]. Experimental results with state-of-the-art classification (e.g., VGG [SZ15], ResNet [HZRS16]) and denoising (e.g., DnCNN [ZZC+17], NAFNet [CCZS22]) methods show that the weighted convolution improves performance with respect to standard convolution across different quantitative metrics. For example, VGG achieves an accuracy of 66.94% with weighted convolution versus 56.89% with standard convolution on the classification problem, while DnCNN improves the PSNR value from 20.17 to 22.63 on the denoising problem. All models were trained on the CINECA Leonardo cluster to reduce the execution time and improve the tuning of the density function values. The PyTorch implementation of the weighted convolution is publicly available at: https://github.com/cammarasana123/weightedConvolution2.0.
中文: 本文提出了一种新型加权卷积算子,通过引入空间密度函数对邻域像素进行差异化加权,在保持参数数量和架构兼容性的同时,有效提升了图像分类与去噪任务的性能表现。
English: This paper introduces a novel weighted convolution operator that enhances CNNs by incorporating a spatial density function to differentially weight pixels, improving performance in image classification and denoising tasks while maintaining parameter efficiency and architectural compatibility.
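
The idea of weighting neighbouring pixels by position can be illustrated with a minimal pure-Python sketch. It assumes a hypothetical density function `density` (value 1 at the kernel centre, decaying with squared distance); the abstract does not specify the actual density, so both the function and its `alpha` parameter are illustrative only. The density is multiplied into the kernel once, before the sliding-window pass, so no trainable parameters are added.

```python
def density(k, alpha):
    """Hypothetical k x k density: 1 at the centre, decaying with squared distance.

    This form is an assumption for illustration, not the paper's density.
    """
    c = k // 2
    return [[1.0 / (1.0 + alpha * ((i - c) ** 2 + (j - c) ** 2))
             for j in range(k)] for i in range(k)]


def weighted_conv2d(image, kernel, phi):
    """'Valid' 2D cross-correlation with the kernel scaled element-wise by phi."""
    k = len(kernel)
    # Pre-compute the effective kernel once, outside the sliding-window loop.
    eff = [[kernel[i][j] * phi[i][j] for j in range(k)] for i in range(k)]
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - k + 1):
        row = []
        for x in range(w - k + 1):
            row.append(sum(eff[i][j] * image[y + i][x + j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out
```

With `alpha = 0` the density is identically 1 and the sketch reduces to a standard convolution layer's sliding-window sum.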

Authors:Xiaoang Xu, Shuo Wang, Xu Han, Zhenghao Liu, Huijia Wu, Peipei Li, Zhiyuan Liu, Maosong Sun, Zhaofeng He
Title: A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings
Abstract:
Large Reasoning Models (LRMs) achieve superior performance by extending the thought length. However, a lengthy thinking trajectory leads to reduced efficiency. Most of the existing methods are stuck in the assumption of overthinking and attempt to reason efficiently by compressing the Chain-of-Thought, but this often leads to performance degradation. To address this problem, we introduce A*-Thought, an efficient tree search-based unified framework designed to identify and isolate the most essential thoughts from the extensive reasoning chains produced by these models. It formulates the reasoning process of LRMs as a search tree, where each node represents a reasoning span in the giant reasoning space. By combining the A* search algorithm with a cost function specific to the reasoning path, it can efficiently compress the chain of thought and determine a reasoning path with high information density and low cost. In addition, we also propose a bidirectional importance estimation mechanism, which further refines this search process and enhances its efficiency beyond uniform sampling. Extensive experiments on several advanced math tasks show that A*-Thought effectively balances performance and efficiency over a huge search space. Specifically, A*-Thought can improve the performance of QwQ-32B by 2.39$\times$ under a low budget and reduce the output token length by nearly 50% under a high budget. The proposed method is also compatible with several other LRMs, demonstrating its generalization capability. The code can be accessed at: https://github.com/AI9Stars/AStar-Thought.
中文: A*-Thought 是一种高效的树搜索框架,通过识别大型推理模型中的关键思路并利用双向估计和A*搜索压缩推理链,在性能与效率之间实现平衡。
English: A*-Thought is an efficient tree search framework that compresses reasoning chains in Large Reasoning Models by identifying essential thoughts, balancing performance and efficiency through bidirectional estimation and A* search.

Authors:Chaohui Xu, Qi Cui, Chip-Hong Chang
Title: CHIP: Chameleon Hash-based Irreversible Passport for Robust Deep Model Ownership Verification and Active Usage Control
Abstract:
The pervasion of large-scale Deep Neural Networks (DNNs) and their enormous training costs make their intellectual property (IP) protection of paramount importance. Recently introduced passport-based methods attempt to steer DNN watermarking towards strengthening ownership verification against ambiguity attacks by modulating the affine parameters of normalization layers. Unfortunately, neither watermarking nor passport-based methods provide holistic protection with robust ownership proof, high fidelity, active usage authorization and user traceability for models distributed for offline access and for multi-user Machine-Learning-as-a-Service (MLaaS) cloud models. In this paper, we propose a Chameleon Hash-based Irreversible Passport (CHIP) protection framework that utilizes the cryptographic chameleon hash function to achieve all these goals. The collision-resistant property of chameleon hash allows for strong model ownership claim upon IP infringement and liable user traceability, while the trapdoor-collision property enables hashing of multiple user passports and licensee certificates to the same immutable signature to realize active usage control. Using the owner passport as an oracle, multiple user-specific triplets, each containing a passport-aware user model, a user passport, and a licensee certificate, can be created for secure offline distribution. The watermarked master model can also be deployed for MLaaS with usage permission verifiable by the provision of any trapdoor-colliding user passports. CHIP is extensively evaluated on four datasets and two architectures to demonstrate its protection versatility and robustness. Our code is released at https://github.com/Dshm212/CHIP.
中文: 提出的CHIP框架利用变色龙哈希函数,为离线和云端模型提供全面的深度神经网络保护,包括强健的所有权验证、用户可追溯性和主动使用控制。
English: The proposed CHIP framework uses chameleon hash functions to provide comprehensive DNN protection, including robust ownership verification, user traceability, and active usage control for both offline and cloud-based models.
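
The trapdoor-collision property mentioned in the abstract can be illustrated with the classic discrete-log chameleon hash, CH(m, r) = g^m · h^r mod p with h = g^x. The sketch below is a toy under stated assumptions: the abstract does not say which chameleon hash construction CHIP uses, the parameters (p = 23, q = 11, g = 2) are far too small to be secure, and the function names are hypothetical. It requires Python 3.8+ for the modular inverse via `pow(x, -1, Q)`.

```python
P = 23  # toy safe prime, P = 2 * Q + 1 (insecure, illustration only)
Q = 11  # prime order of the subgroup generated by G
G = 2   # element of order Q modulo P


def keygen(x):
    """Return the public key h = G^x mod P for a trapdoor x in [1, Q - 1]."""
    return pow(G, x, P)


def chameleon_hash(h, m, r):
    """CH(m, r) = G^m * h^r mod P, with message m and randomness r taken mod Q."""
    return (pow(G, m % Q, P) * pow(h, r % Q, P)) % P


def find_collision(x, m, r, m_new):
    """With trapdoor x, return r_new such that CH(m_new, r_new) == CH(m, r).

    Solves m + x * r = m_new + x * r_new (mod Q) for r_new.
    """
    return (r + (m - m_new) * pow(x, -1, Q)) % Q
```

Without the trapdoor `x`, finding such an `r_new` is as hard as computing discrete logarithms in the group, which is the collision-resistance side of the same construction. With it, several different messages can be hashed to the same value.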

Authors:Yinqi Li, Jiahe Zhao, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
Title: un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
Abstract:
Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative model, unCLIP, provides a suitable framework for achieving our goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding. In other words, it inverts the CLIP image encoder. Compared to discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the conditional input space of unCLIP aligns with CLIP's original image-text embedding space. Therefore, we propose to invert unCLIP (dubbed un$^2$CLIP) to improve the CLIP model. In this way, the improved image encoder can gain unCLIP's visual detail capturing ability while preserving its alignment with the original text encoder simultaneously. We evaluate our improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un$^2$CLIP significantly improves the original CLIP and previous CLIP improvement methods. Code and models will be available at https://github.com/LiYinqi/un2CLIP.
中文: 本研究提出un$^2$CLIP改进模型,通过反演unCLIP生成框架增强视觉细节捕捉能力,同时保持图文对齐特性,在多项视觉与多模态任务中显著提升原CLIP模型性能。
English: This work proposes un$^2$CLIP, an improved CLIP model that leverages the inversion of unCLIP's generative framework to enhance visual detail capture while maintaining text-image alignment, achieving superior performance across multiple vision and multimodal tasks.

Authors:Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia
Title: Rehearsal with Auxiliary-Informed Sampling for Audio Deepfake Detection
Abstract:
The performance of existing audio deepfake detection frameworks degrades when confronted with new deepfake attacks. Rehearsal-based continual learning (CL), which updates models using a limited set of old data samples, helps preserve prior knowledge while incorporating new information. However, existing rehearsal techniques do not effectively capture the diversity of audio characteristics, introducing bias and increasing the risk of forgetting. To address this challenge, we propose Rehearsal with Auxiliary-Informed Sampling (RAIS), a rehearsal-based CL approach for audio deepfake detection. RAIS employs a label generation network to produce auxiliary labels, guiding diverse sample selection for the memory buffer. Extensive experiments show RAIS outperforms state-of-the-art methods, achieving an average Equal Error Rate (EER) of 1.953% across five experiences. The code is available at: https://github.com/falihgoz/RAIS.
中文: 提出的RAIS方法通过辅助标签选择多样化样本进行排练,提升了音频深度伪造检测性能,以1.953%的平均等错误率超越现有技术。
English: The proposed RAIS method enhances audio deepfake detection by using auxiliary labels to select diverse samples for rehearsal, outperforming existing techniques with a 1.953% average EER.

Authors:Jing Huang, Yongkang Zhao, Yuhan Li, Zhitao Dai, Cheng Chen, Qiying Lai
Title: ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation
Abstract:
The U-shaped encoder-decoder architecture with skip connections has become a prevailing paradigm in medical image segmentation due to its simplicity and effectiveness. While many recent works aim to improve this framework by designing more powerful encoders and decoders, employing advanced convolutional neural networks (CNNs) for local feature extraction, Transformers or state space models (SSMs) such as Mamba for global context modeling, or hybrid combinations of both, these methods often struggle to fully utilize pretrained vision backbones (e.g., ResNet, ViT, VMamba) due to structural mismatches. To bridge this gap, we introduce ACM-UNet, a general-purpose segmentation framework that retains a simple UNet-like design while effectively incorporating pretrained CNNs and Mamba models through a lightweight adapter mechanism. This adapter resolves architectural incompatibilities and enables the model to harness the complementary strengths of CNNs and SSMs, namely fine-grained local detail extraction and long-range dependency modeling. Additionally, we propose a hierarchical multi-scale wavelet transform module in the decoder to enhance feature fusion and reconstruction fidelity. Extensive experiments on the Synapse and ACDC benchmarks demonstrate that ACM-UNet achieves state-of-the-art performance while remaining computationally efficient. Notably, it reaches 85.12% Dice Score and 13.89mm HD95 on the Synapse dataset with 17.93G FLOPs, showcasing its effectiveness and scalability. Code is available at: https://github.com/zyklcode/ACM-UNet.
中文: ACM-UNet通过轻量级适配器有效整合预训练的CNN和Mamba模型,在保持计算效率的同时实现了最先进的医学图像分割性能。
English: ACM-UNet introduces a lightweight adapter to effectively integrate pretrained CNNs and Mamba models, achieving state-of-the-art medical image segmentation performance while maintaining computational efficiency.

Authors:Fei Bai, Yingqian Min, Beichen Zhang, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
Title: Towards Effective Code-Integrated Reasoning
Abstract:
In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL) through interactive learning. Despite its benefits, tool-augmented RL can still suffer from potential instability in the learning dynamics. In light of this challenge, we present a systematic approach to improving the training effectiveness and stability of tool-augmented RL for code-integrated reasoning. Specifically, we develop enhanced training strategies that balance exploration and stability, progressively building tool-use capabilities while improving reasoning performance. Through extensive experiments on five mainstream mathematical reasoning benchmarks, our model demonstrates significant performance improvements over multiple competitive baselines. Furthermore, we conduct an in-depth analysis of the mechanism and effect of code-integrated reasoning, revealing several key insights, such as the extension of model's capability boundaries and the simultaneous improvement of reasoning efficiency through code integration. All data and code for reproducing this work are available at: https://github.com/RUCAIBox/CIR.
中文摘要:本文提出了一种系统性方法,通过平衡探索与稳定性的增强训练策略,有效提升了代码集成推理中工具增强强化学习的训练效果,在多个数学推理基准测试中实现了显著性能提升,并揭示了代码集成扩展模型能力边界的关键机制。
English Summary: This paper introduces a systematic approach to enhance the training stability and effectiveness of tool-augmented reinforcement learning for code-integrated reasoning, demonstrating significant performance gains across mathematical benchmarks through improved strategies that balance exploration and capability development.

Authors:Yuting Zhang, Hao Lu, Qingyong Hu, Yin Wang, Kaishen Yuan, Xin Liu, Kaishun Wu
Title: Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
Abstract:
Periodic or quasi-periodic phenomena reveal intrinsic characteristics in various natural processes, such as weather patterns, movement behaviors, traffic flows, and biological signals. Given that these phenomena span multiple modalities, the capabilities of Multimodal Large Language Models (MLLMs) offer promising potential to effectively capture and understand their complex nature. However, current MLLMs struggle with periodic tasks due to two limitations: 1) a lack of temporal modelling and 2) conflicts between short and long periods. This paper introduces Period-LLM, a multimodal large language model designed to enhance the performance of periodic tasks across various modalities, and constructs a benchmark of varying difficulty for evaluating the cross-modal periodic capabilities of large models. Specifically, we adopt an "Easy to Hard Generalization" paradigm, starting with relatively simple text-based tasks and progressing to more complex visual and multimodal tasks, ensuring that the model gradually builds robust periodic reasoning capabilities. Additionally, we propose a "Resisting Logical Oblivion" optimization strategy to maintain periodic reasoning abilities during semantic alignment. Extensive experiments demonstrate the superiority of the proposed Period-LLM over existing MLLMs in periodic tasks. The code is available at https://github.com/keke-nice/Period-LLM.
中文摘要:本文提出Period-LLM多模态大语言模型,通过采用“由易到难泛化”范式和“抵抗逻辑遗忘”优化策略,有效解决了现有模型在周期性任务中的时序建模与长短周期冲突问题,在多模态周期性任务中展现出卓越性能。
English Summary: This paper introduces Period-LLM, a multimodal large language model designed to overcome current limitations in handling periodic tasks by employing an "Easy to Hard Generalization" paradigm and a "Resisting Logical Oblivion" strategy, demonstrating superior performance across various modalities.

Authors:Jingyao Li, Senqiao Yang, Sitong Wu, Han Shi, Chuanyang Zheng, Hong Xu, Jiaya Jia
Title: Logits-Based Finetuning
Abstract:
In recent years, developing compact and efficient large language models (LLMs) has emerged as a thriving area of research. Traditional Supervised Fine-Tuning (SFT), which relies on singular ground truth labels, often fails to capture token-level dependencies and linguistic diversity. To address these limitations, we propose a logits-based fine-tuning framework that integrates the strengths of supervised learning and knowledge distillation. Our approach constructs enriched training targets by combining teacher logits with ground truth labels, preserving both correctness and linguistic diversity. This ensures more reliable and effective training. We constructed a large-scale 1.2M logits dataset and trained a series of science-focused models. Experimental results demonstrate that our method achieves significant improvements, with accuracy gains of 18% on Mawps and 22.7% on TabMWP. Across nine widely used mathematical benchmarks, our method consistently outperforms prior SFT models, achieving an average improvement of 7.28%. Codes are available at https://github.com/dvlab-research/Logits-Based-Finetuning.
中文: 本研究提出了一种基于逻辑值的微调框架,将监督学习与知识蒸馏相结合,利用教师逻辑值和真实标签优化训练,在科学领域基准测试中取得了显著的准确率提升。
English: This study introduces a logits-based fine-tuning framework that merges supervised learning with knowledge distillation, using teacher logits and ground truth labels to enhance training, resulting in significant accuracy gains on science-focused benchmarks.
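
One way to picture "combining teacher logits with ground truth labels" is a convex mix of the teacher's softmax distribution with the one-hot label. The sketch below is an assumption for illustration: the abstract does not give the actual combination rule, and the function names and the mixing weight `lam` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def enriched_target(teacher_logits, label, lam=0.5):
    """Mix a one-hot label with the teacher distribution (hypothetical rule).

    lam = 1 recovers plain SFT targets; lam = 0 recovers pure distillation.
    """
    probs = softmax(teacher_logits)
    return [lam * (1.0 if i == label else 0.0) + (1.0 - lam) * p
            for i, p in enumerate(probs)]


def cross_entropy(target, student_logits):
    """Cross-entropy between a soft target and the student's softmax output."""
    student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(target, student))
```

With `lam >= 0.5` the ground-truth token always keeps the largest target probability, so correctness is preserved while the remaining mass follows the teacher's alternatives.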

Authors:Tianlong Yu, Chenghang Ye, Zheyu Yang, Ziyi Zhou, Cui Tang, Zui Tao, Jun Zhang, Kailong Wang, Liting Zhou, Yang Yang, Ting Bi
Title: SEAR: A Multimodal Dataset for Analyzing AR-LLM-Driven Social Engineering Behaviors
Abstract:
The SEAR Dataset is a novel multimodal resource designed to study the emerging threat of social engineering (SE) attacks orchestrated through augmented reality (AR) and multimodal large language models (LLMs). This dataset captures 180 annotated conversations across 60 participants in simulated adversarial scenarios, including meetings, classes and networking events. It comprises synchronized AR-captured visual/audio cues (e.g., facial expressions, vocal tones), environmental context, and curated social media profiles, alongside subjective metrics such as trust ratings and susceptibility assessments. Key findings reveal SEAR's alarming efficacy in eliciting compliance (e.g., 93.3% phishing link clicks, 85% call acceptance) and hijacking trust (76.7% post-interaction trust surge). The dataset supports research in detecting AR-driven SE attacks, designing defensive frameworks, and understanding multimodal adversarial manipulation. Rigorous ethical safeguards, including anonymization and IRB compliance, ensure responsible use. The SEAR dataset is available at https://github.com/INSLabCN/SEAR-Dataset.
中文: SEAR数据集是一个多模态资源,收录了180段模拟社交工程攻击场景的标注对话,揭示了攻击的高效性(如93.3%钓鱼链接点击率),为检测增强现实驱动的威胁提供研究基础,同时通过匿名化等伦理措施确保合规使用。
English: The SEAR Dataset is a multimodal resource capturing 180 annotated conversations in simulated social engineering scenarios, demonstrating alarming attack efficacy with high compliance rates and providing data for detecting AR-driven threats while ensuring ethical safeguards.

Authors:Heejo Kong, Sung-Jin Kim, Gunho Jung, Seong-Whan Lee
Title: Diversify and Conquer: Open-set Disagreement for Robust Semi-supervised Learning with Outliers
Abstract:
Conventional semi-supervised learning (SSL) ideally assumes that labeled and unlabeled data share an identical class distribution. In practice, however, this assumption is easily violated, as unlabeled data often includes unknown-class data, i.e., outliers. The outliers are treated as noise, considerably degrading the performance of SSL models. To address this drawback, we propose a novel framework, Diversify and Conquer (DAC), to enhance SSL robustness in the context of open-set semi-supervised learning. In particular, we note that existing open-set SSL methods rely on prediction discrepancies between inliers and outliers from a single model trained on labeled data. This approach can easily fail when the labeled data is insufficient, leading to performance degradation that is worse than that of naive SSL methods that do not account for outliers. In contrast, our approach exploits prediction disagreements among multiple models that are differently biased towards the unlabeled distribution. By leveraging the discrepancies arising from training on unlabeled data, our method enables robust outlier detection even when the labeled data is underspecified. Our key contribution is constructing a collection of differently biased models through a single training process. By encouraging divergent heads to be differently biased towards outliers while making consistent predictions for inliers, we exploit the disagreement among these heads as a measure to identify unknown concepts. Our code is available at https://github.com/heejokong/DivCon.
中文: 提出的“多样化与征服”(DAC)框架通过训练多个具有不同偏差的模型,利用其预测差异来检测异常数据,有效提升了半监督学习在开放集场景下的鲁棒性,克服了现有方法在标注数据不足时性能下降的问题。
English: The proposed Diversify and Conquer (DAC) framework enhances semi-supervised learning robustness by training multiple divergent models that detect outliers through their prediction disagreements, overcoming limitations of existing methods that fail with insufficient labeled data.
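
Using head disagreement as an outlier signal can be sketched as follows. The score used here, the mean pairwise L1 distance between the heads' class-probability vectors, is an assumption chosen for simplicity; the abstract does not state which disagreement measure DAC uses, and the function names and threshold are hypothetical.

```python
from itertools import combinations


def disagreement(head_probs):
    """Mean pairwise L1 distance between per-head probability vectors.

    head_probs: list of probability vectors, one per head, for a single sample.
    Returns 0.0 when all heads agree exactly; the maximum is 2.0.
    """
    pairs = list(combinations(head_probs, 2))
    total = sum(sum(abs(a - b) for a, b in zip(p, q)) for p, q in pairs)
    return total / len(pairs)


def is_outlier(head_probs, threshold=0.5):
    """Flag a sample as an outlier when the heads disagree strongly."""
    return disagreement(head_probs) > threshold
```

Inliers, on which the heads are trained to predict consistently, should score near zero, while samples from unknown classes should score higher.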

Authors:Zhentao Xie, Chengcheng Han, Jinxin Shi, Wenjun Cui, Xin Zhao, Xingjiao Wu, Jiabao Zhao
Title: RMoA: Optimizing Mixture-of-Agents through Diversity Maximization and Residual Compensation
Abstract:
Although multi-agent systems based on large language models show strong capabilities on multiple tasks, they are still limited by high computational overhead, information loss, and limited robustness. Inspired by ResNet's residual learning, we propose Residual Mixture-of-Agents (RMoA), integrating residual connections to optimize efficiency and reliability. To maximize information utilization from model responses while minimizing computational costs, we innovatively design an embedding-based diversity selection mechanism that greedily selects responses via vector similarity. Furthermore, to mitigate iterative information degradation, we introduce a Residual Extraction Agent to preserve cross-layer incremental information by capturing inter-layer response differences, coupled with a Residual Aggregation Agent for hierarchical information integration. Additionally, we propose an adaptive termination mechanism that dynamically halts processing based on residual convergence, further improving inference efficiency. RMoA achieves state-of-the-art performance on benchmarks across alignment, mathematical reasoning, code generation, and multitasking understanding, while significantly reducing computational overhead. Code is available at https://github.com/mindhunter01/RMoA.
中文摘要:提出的残差智能体混合框架通过引入残差连接和多样性选择机制,在提升多任务性能的同时显著降低了计算开销,实现了效率与可靠性的优化。
English Summary: The proposed Residual Mixture-of-Agents (RMoA) framework enhances multi-agent systems by incorporating residual connections and diversity selection to boost efficiency and performance across various tasks while reducing computational costs.
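
Greedy selection of diverse responses by vector similarity can be sketched as below. This is a generic farthest-first heuristic under stated assumptions, not necessarily RMoA's exact rule: it seeds with the first response, then repeatedly adds the candidate whose highest cosine similarity to the already selected set is lowest. The function names are hypothetical and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_diverse(embeddings, k):
    """Return indices of up to k mutually dissimilar embeddings (greedy)."""
    selected = [0]  # seed with the first response (an arbitrary choice)
    while len(selected) < min(k, len(embeddings)):
        best, best_score = None, None
        for i in range(len(embeddings)):
            if i in selected:
                continue
            # Similarity to the closest already-selected response.
            score = max(cosine(embeddings[i], embeddings[j]) for j in selected)
            if best_score is None or score < best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

Passing only the selected responses to the next layer is what would reduce the aggregation cost while keeping complementary information.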

Authors:Chunxu Liu, Chi Xie, Xiaxu Chen, Wei Li, Feng Zhu, Rui Zhao, Limin Wang
Title: SORCE: Small Object Retrieval in Complex Environments
Abstract:
Text-to-Image Retrieval (T2IR) is a highly valuable task that aims to match a given textual query to images in a gallery. Existing benchmarks primarily focus on textual queries describing overall image semantics or foreground salient objects, possibly overlooking inconspicuous small objects, especially in complex environments. Such small object retrieval is crucial, as in real-world applications, the targets of interest are not always prominent in the image. Thus, we introduce SORCE (Small Object Retrieval in Complex Environments), a new subfield of T2IR, focusing on retrieving small objects in complex images with textual queries. We propose a new benchmark, SORCE-1K, consisting of images with complex environments and textual queries describing less conspicuous small objects with minimal contextual cues from other salient objects. Preliminary analysis on SORCE-1K finds that existing T2IR methods struggle to capture small objects and encode all the semantics into a single embedding, leading to poor retrieval performance on SORCE-1K. Therefore, we propose to represent each image with multiple distinctive embeddings. We leverage Multimodal Large Language Models (MLLMs) to extract multiple embeddings for each image instructed by a set of Regional Prompts (ReP). Experimental results show that our multi-embedding approach through MLLM and ReP significantly outperforms existing T2IR methods on SORCE-1K. Our experiments validate the effectiveness of SORCE-1K for benchmarking SORCE performances, highlighting the potential of multi-embedding representation and text-customized MLLM features for addressing this task.
中文: 本研究提出了专注于复杂环境中小物体检索的SORCE新子领域,并采用多模态大语言模型和区域提示的多嵌入方法,在SORCE-1K基准测试中显著优于现有技术。
English: The study introduces SORCE, a new subfield of Text-to-Image Retrieval focusing on small objects in complex environments, and proposes a multi-embedding approach using MLLMs and Regional Prompts that significantly outperforms existing methods on the SORCE-1K benchmark.
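
Retrieval with several embeddings per image can be sketched as follows. The aggregation rule, scoring an image by its best-matching embedding, is an assumption for illustration; the abstract does not state how the multiple embeddings are compared with the query, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def image_score(query, image_embeddings):
    """Score an image by its best-matching embedding (assumed max aggregation)."""
    return max(dot(query, e) for e in image_embeddings)


def rank_images(query, gallery):
    """Rank image names by score, best first.

    gallery: dict mapping image name -> list of embeddings for that image.
    """
    return sorted(gallery,
                  key=lambda name: image_score(query, gallery[name]),
                  reverse=True)
```

Under this rule a small object needs to dominate only one of an image's embeddings to be retrievable, rather than competing with salient content inside a single global vector.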

Authors:Bozhong Zheng, Jinye Gan, Xiaohao Xu, Wenqiao Li, Xiaonan Huang, Na Ni, Yingna Wu
Title: Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation
Abstract:
3D point cloud anomaly detection is essential for robust vision systems but is challenged by pose variations and complex geometric anomalies. Existing patch-based methods often suffer from geometric fidelity issues due to discrete voxelization or projection-based representations, limiting fine-grained anomaly localization. We introduce Pose-Aware Signed Distance Field (PASDF), a novel framework that integrates 3D anomaly detection and repair by learning a continuous, pose-invariant shape representation. PASDF leverages a Pose Alignment Module for canonicalization and an SDF Network to dynamically incorporate pose, enabling implicit learning of high-fidelity anomaly repair templates from the continuous SDF. This facilitates precise pixel-level anomaly localization through an Anomaly-Aware Scoring Module. Crucially, the continuous 3D representation in PASDF extends beyond detection, facilitating in-situ anomaly repair. Experiments on Real3D-AD and Anomaly-ShapeNet demonstrate state-of-the-art performance, achieving high object-level AUROC scores of 80.2% and 90.0%, respectively. These results highlight the effectiveness of continuous geometric representations in advancing 3D anomaly detection and facilitating practical anomaly region repair. The code is available at https://github.com/ZZZBBBZZZ/PASDF to support further research.
中文摘要:PASDF框架通过引入姿态不变的连续有向距离场表示,实现了精确的三维异常检测与原位修复,在基准数据集上取得了领先性能。
English Summary: The PASDF framework introduces a pose-invariant continuous shape representation using signed distance fields to achieve precise 3D anomaly detection and in-situ repair, demonstrating state-of-the-art performance on benchmark datasets.
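
Scoring points against a continuous signed distance field can be sketched with an analytic stand-in. The sketch assumes a unit sphere as the "normal" template in place of a learned SDF network, and it scores each already pose-aligned point by its absolute distance to the template surface. The scoring rule, threshold and function names are hypothetical and are not taken from PASDF's scoring module.

```python
import math


def sphere_sdf(point, radius=1.0):
    """Signed distance to a sphere centred at the origin (negative inside)."""
    return math.sqrt(sum(c * c for c in point)) - radius


def anomaly_scores(points, sdf):
    """Score each point by its absolute distance to the template surface."""
    return [abs(sdf(p)) for p in points]


def flag_anomalies(points, sdf, threshold=0.1):
    """Return indices of points lying farther than threshold from the surface."""
    return [i for i, s in enumerate(anomaly_scores(points, sdf)) if s > threshold]
```

Because the template is continuous, a flagged point can in principle be moved back to the zero level set, which is the intuition behind using the same representation for repair.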

Authors:Zhiwei Liu, Lingfei Qian, Qianqian Xie, Jimin Huang, Kailai Yang, Sophia Ananiadou
Title: MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs
Abstract:
Large language models and vision-language models (which we jointly call LMs) have transformed NLP and CV, demonstrating remarkable potential across various fields. However, their capabilities in affective analysis (i.e. sentiment analysis and emotion detection) remain underexplored. This gap is largely due to the absence of comprehensive evaluation benchmarks, and the inherent complexity of affective analysis tasks. In this paper, we introduce MMAFFBen, the first extensive open-source benchmark for multilingual multimodal affective analysis. MMAFFBen encompasses text, image, and video modalities across 35 languages, covering four key affective analysis tasks: sentiment polarity, sentiment intensity, emotion classification, and emotion intensity. Moreover, we construct the MMAFFIn dataset for fine-tuning LMs on affective analysis tasks, and further develop MMAFFLM-3b and MMAFFLM-7b based on it. We evaluate various representative LMs, including GPT-4o-mini, providing a systematic comparison of their affective understanding capabilities. This project is available at https://github.com/lzw108/MMAFFBen.
中文:本文提出了首个全面的多语言多模态情感分析基准MMAFFBen,并开发了专门模型来系统评估语言模型在不同模态和语言中的情感理解能力。
English: This paper introduces MMAFFBen, the first comprehensive multilingual multimodal benchmark for affective analysis, and develops specialized models to systematically evaluate language models' capabilities in sentiment and emotion tasks across diverse modalities and languages.

Authors:Wenlong Jiao, Binglong Li, Wei Shang, Ping Wang, Dongwei Ren
Title: Efficient RAW Image Deblurring with Adaptive Frequency Modulation
Abstract:
Image deblurring plays a crucial role in enhancing visual clarity across various applications. Although most deep learning approaches primarily focus on sRGB images, which inherently lose critical information during the image signal processing pipeline, RAW images, being unprocessed and linear, possess superior restoration potential but remain underexplored. Deblurring RAW images presents unique challenges, particularly in handling frequency-dependent blur while maintaining computational efficiency. To address these issues, we propose Frequency Enhanced Network (FrENet), a framework specifically designed for RAW-to-RAW deblurring that operates directly in the frequency domain. We introduce a novel Adaptive Frequency Positional Modulation module, which dynamically adjusts frequency components according to their spectral positions, thereby enabling precise control over the deblurring process. Additionally, frequency domain skip connections are adopted to further preserve high-frequency details. Experimental results demonstrate that FrENet surpasses state-of-the-art deblurring methods in RAW image deblurring, achieving significantly better restoration quality while maintaining high efficiency in terms of reduced MACs. Furthermore, FrENet's adaptability enables it to be extended to sRGB images, where it delivers comparable or superior performance compared to methods specifically designed for sRGB data. The code will be available at https://github.com/WenlongJiao/FrENet.
中文: 提出的频率增强网络(FrENet)通过在频域直接处理RAW图像去模糊问题,以高效计算实现了卓越的恢复质量,并展现出对sRGB图像的适配能力。
English: The proposed Frequency Enhanced Network (FrENet) effectively addresses RAW image deblurring by operating directly in the frequency domain, achieving superior restoration quality with high computational efficiency and demonstrating adaptability to sRGB images.

Authors:Kanokphan Lertniphonphan, Feng Chen, Junda Xu, Fengbu Lan, Jun Xie, Tao Zhang, Zhepeng Wang
Title: PCIE_Interaction Solution for Ego4D Social Interaction Challenge
Abstract:
This report presents our team's PCIE_Interaction solution for the Ego4D Social Interaction Challenge at CVPR 2025, addressing both Looking At Me (LAM) and Talking To Me (TTM) tasks. The challenge requires accurate detection of social interactions between subjects and the camera wearer, with LAM relying exclusively on face crop sequences and TTM combining speaker face crops with synchronized audio segments. In the LAM track, we employ face quality enhancement and ensemble methods. For the TTM task, we extend visual interaction analysis by fusing audio and visual cues, weighted by a visual quality score. Our approach achieved 0.81 and 0.71 mean average precision (mAP) on the LAM and TTM challenge leaderboards, respectively. Code is available at https://github.com/KanokphanL/PCIE_Ego4D_Social_Interaction.
中文: 我们针对CVPR 2025 Ego4D社交互动挑战赛提出的PCIE_Interaction方案,通过增强面部分析处理"注视检测"任务,融合视听信息解决"对话检测"任务,分别取得了0.81和0.71的平均精度值。
English: Our PCIE_Interaction solution for CVPR 2025's Ego4D Social Interaction Challenge uses enhanced face analysis for the Looking At Me task and audio-visual fusion for Talking To Me, achieving 0.81 and 0.71 mAP respectively.
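
Fusing audio and visual cues with a weight derived from visual quality can be sketched as a simple convex combination. This is an illustrative assumption: the abstract does not give the fusion formula, and the function name and the convention that the quality score lies in [0, 1] are hypothetical.

```python
def fuse_scores(visual_score, audio_score, visual_quality):
    """Blend per-frame scores, trusting vision more when face quality is high.

    visual_quality is assumed to lie in [0, 1]; values outside are clamped.
    """
    q = min(1.0, max(0.0, visual_quality))
    return q * visual_score + (1.0 - q) * audio_score
```

When the face crop is blurred or occluded (low quality), the prediction falls back on the audio cue; a sharp frontal face shifts the weight towards the visual branch.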

Authors:Xianheng Ma, Hongchen Tan, Xiuping Liu, Yi Zhang, Huasheng Wang, Jiang Liu, Ying Chen, Hantao Liu
Title: S3CE-Net: Spike-guided Spatiotemporal Semantic Coupling and Expansion Network for Long Sequence Event Re-Identification
Abstract:
In this paper, we leverage the advantages of event cameras to resist harsh lighting conditions, reduce background interference, achieve high time resolution, and protect facial information to study the long-sequence event-based person re-identification (Re-ID) task. To this end, we propose a simple and efficient long-sequence event Re-ID model, namely the Spike-guided Spatiotemporal Semantic Coupling and Expansion Network (S3CE-Net). To better handle asynchronous event data, we build S3CE-Net based on spiking neural networks (SNNs). The S3CE-Net incorporates the Spike-guided Spatial-temporal Attention Mechanism (SSAM) and the Spatiotemporal Feature Sampling Strategy (STFS). The SSAM is designed to carry out semantic interaction and association in both spatial and temporal dimensions, leveraging the capabilities of SNNs. The STFS involves sampling spatial feature subsequences and temporal feature subsequences from the spatiotemporal dimensions, driving the Re-ID model to perceive broader and more robust effective semantics. Notably, the STFS introduces no additional parameters and is only utilized during the training stage. Therefore, S3CE-Net is a low-parameter and high-efficiency model for long-sequence event-based person Re-ID. Extensive experiments have verified that our S3CE-Net achieves outstanding performance on many mainstream long-sequence event-based person Re-ID datasets. Code is available at: https://github.com/Mhsunshine/SC3E_Net.
中文: 本文提出S3CE-Net模型,基于脉冲神经网络处理事件相机数据,通过脉冲引导的时空注意力机制和特征采样策略,以低参数量实现高效的长序列行人重识别,并在多个主流数据集上表现出色。
English: This paper introduces S3CE-Net, a spiking neural network-based model that uses event cameras for efficient long-sequence person re-identification by incorporating spike-guided attention and spatiotemporal feature sampling to achieve robust performance with minimal parameters.

Authors:Xiaoyu Wu, Yifei Pang, Terrance Liu, Zhiwei Steven Wu
Title: Rethinking Exact Unlearning under Exposure: Extracting Forgotten Data under Exact Unlearning in Large Language Model
Abstract:
Large Language Models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning -- which retrains the model from scratch without the target data -- is widely regarded as the gold standard for mitigating privacy risks in deployment. In this paper, we revisit this assumption in a practical deployment setting where both the pre- and post-unlearning logits APIs are exposed, such as in open-weight scenarios. Targeting this setting, we introduce a novel data extraction attack that leverages signals from the pre-unlearning model to guide the post-unlearning model, uncovering patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates -- doubling performance in some cases -- across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack's effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. In light of our findings, which suggest that unlearning may paradoxically increase the risk of privacy leakage during real-world deployments, we advocate that evaluations of unlearning methods consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints. Code is publicly available at: https://github.com/Nicholas0228/unlearned_data_extraction_llm.
中文: 精确遗忘虽被视为隐私保护的黄金标准,却可能因攻击者利用遗忘前模型信号来恢复已删除数据分布而矛盾地增加隐私风险,这在多个基准测试中得到了验证。
English: Exact unlearning, while considered the gold standard for privacy protection, may paradoxically increase privacy risks by enabling data extraction attacks that leverage pre-unlearning model signals to recover removed data distributions, as demonstrated across multiple benchmarks.

Authors:Xianglong Yan, Zhiteng Li, Tianao Zhang, Linghe Kong, Yulun Zhang, Xiaokang Yang
Title: ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration
Abstract:
Large language models (LLMs) have achieved remarkable performance, yet their capability on long-context reasoning is often constrained by the excessive memory required to store the Key-Value (KV) cache. This makes KV cache compression an essential step toward enabling efficient long-context reasoning. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers or suffer from significant performance degradation under high compression ratios. To address these challenges, we propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache. We develop distinct compression strategies for Keys and Values based on their different roles and varying importance in the attention mechanism. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters similar heads and applies grouped SVD to the key projection matrix, reducing additional computation while preserving accuracy. For Values, we propose Offline Calibration and Matrix Fusion (OCMF) to preserve accuracy without extra computational overhead. Experiments show that ReCalKV outperforms existing low-rank compression methods, achieving high compression ratios with minimal performance loss. The code and models will be available at: https://github.com/XIANGLONGYAN/ReCalKV.
中文: ReCalKV提出了一种新颖的KV缓存低秩压缩方法,通过键向量的头部相似性重排序和值向量的离线校准策略,在保持高压缩比的同时显著优于现有方法且性能损失最小。
English: ReCalKV introduces a novel low-rank compression method for KV cache by employing head-wise similarity reordering for Keys and offline value calibration for Values, significantly outperforming existing techniques with minimal performance loss at high compression ratios.

Authors:Xianglong Yan, Zhiteng Li, Tianao Zhang, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang
Title: ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration
Abstract:
Large language models (LLMs) have demonstrated remarkable performance, but their long-context reasoning remains constrained by the excessive memory required for the Key-Value (KV) cache. This makes KV cache compression a critical step toward efficient long-context inference. Recent methods have explored low-rank techniques to reduce the hidden size of the KV cache. However, they neglect the distinct roles and varying importance of Keys and Values, leading to significant performance drops under high compression. To address this, we propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation via grouped SVD. For Values, we propose Offline Value Calibration (OVC), which efficiently calibrates the value projection matrix using calibration data without training, ensuring an accurate representation of contextual information. Extensive experiments show that ReCalKV consistently outperforms existing low-rank compression methods, achieving high compression ratios with minimal performance loss. The code and models will be available at: https://github.com/XIANGLONGYAN/ReCalKV.
中文: ReCalKV提出了一种新颖的KV缓存低秩压缩方法,通过键向量的头部相似性重排序和值向量的离线校准策略,在保持高压缩比的同时显著优于现有方法且性能损失最小。
English: ReCalKV introduces a novel low-rank compression method for KV cache by employing head-wise similarity reordering for Keys and offline value calibration for Values, significantly outperforming existing techniques with minimal performance loss at high compression ratios.

Authors:Gilles Quentin Hacheme, Girmaw Abebe Tadesse, Caleb Robinson, Akram Zaytar, Rahul Dodhia, Juan M. Lavista Ferres
Title: GeoVision Labeler: Zero-Shot Geospatial Classification with Vision and Language Models
Abstract:
Classifying geospatial imagery remains a major bottleneck for applications such as disaster response and land-use monitoring, particularly in regions where annotated data is scarce or unavailable. Existing tools (e.g., RS-CLIP) that claim zero-shot classification capabilities for satellite imagery nonetheless rely on task-specific pretraining and adaptation to reach competitive performance. We introduce GeoVision Labeler (GVL), a strictly zero-shot classification framework: a vision Large Language Model (vLLM) generates rich, human-readable image descriptions, which are then mapped to user-defined classes by a conventional Large Language Model (LLM). This modular and interpretable pipeline enables flexible image classification for a large range of use cases. We evaluated GVL across three benchmarks: SpaceNet v7, UC Merced, and RESISC45. It achieves up to 93.2% zero-shot accuracy on the binary Buildings vs. No Buildings task on SpaceNet v7. For complex multi-class classification tasks (UC Merced, RESISC45), we implemented a recursive LLM-driven clustering to form meta-classes at successive depths, followed by hierarchical classification (first resolving coarse groups, then finer distinctions) to deliver competitive zero-shot performance. GVL is open-sourced at https://github.com/microsoft/geo-vision-labeler to catalyze adoption in real-world geospatial workflows.
Chinese: GeoVision标注器(GVL)是一种严格零样本分类框架,通过视觉大语言模型生成图像描述并由常规大语言模型将其映射至用户定义类别,无需任务特定训练即可实现高精度地理空间影像分类。
English: The GeoVision Labeler (GVL) is a strictly zero-shot classification framework that uses a vision Large Language Model to generate image descriptions and a conventional LLM to map them to user-defined classes, achieving high accuracy in geospatial imagery classification without task-specific training.

Authors:Uzair Khan, Franco Fummi, Luigi Capogrosso
Title: KairosAD: A SAM-Based Model for Industrial Anomaly Detection on Embedded Devices
Abstract:
In the era of intelligent manufacturing, anomaly detection has become essential for maintaining quality control on modern production lines. However, while many existing models show promising performance, they are often too large, computationally demanding, and impractical to deploy on resource-constrained embedded devices that can be easily installed on the production lines of Small and Medium Enterprises (SMEs). To bridge this gap, we present KairosAD, a novel supervised approach that uses the power of the Mobile Segment Anything Model (MobileSAM) for image-based anomaly detection. KairosAD has been evaluated on the two well-known industrial anomaly detection datasets, i.e., MVTec-AD and ViSA. The results show that KairosAD requires 78% fewer parameters and boasts a 4x faster inference time compared to the leading state-of-the-art model, while maintaining comparable AUROC performance. We deployed KairosAD on two embedded devices, the NVIDIA Jetson NX, and the NVIDIA Jetson AGX. Finally, KairosAD was successfully installed and tested on the real production line of the Industrial Computer Engineering Laboratory (ICE Lab) at the University of Verona. The code is available at https://github.com/intelligolabs/KairosAD.
中文摘要:KairosAD是一种基于MobileSAM的轻量级监督异常检测模型,在保持与顶尖模型相当的AUROC性能的同时,参数减少78%、推理速度提升4倍,可部署于中小企业生产线的嵌入式设备。
English Summary: KairosAD is a lightweight supervised anomaly detection model that leverages MobileSAM to achieve comparable performance with 78% fewer parameters and 4x faster inference than leading models, making it suitable for deployment on resource-constrained embedded devices in SMEs.

Authors:Yingsen Zeng, Zepeng Huang, Yujie Zhong, Chengjian Feng, Jie Hu, Lin Ma, Yang Liu
Title: DisTime: Distribution-based Time Representation for Video Large Language Models
Abstract:
Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. Code and data are released at https://github.com/josephzpng/DisTime.
中文: DisTime通过引入连续时间嵌入和基于分布的编解码器增强视频大语言模型的时间定位能力,并借助大规模InternVid-TG数据集,在时间敏感任务中取得了最优性能。
English: DisTime enhances Video-LLMs' temporal localization by introducing continuous temporal embeddings and a distribution-based decoder, supported by the large-scale InternVid-TG dataset, achieving state-of-the-art results in time-sensitive tasks.
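
The idea of decoding a timestamp from a probability distribution, rather than from a discrete token, can be sketched as follows. This is a minimal illustration and not the DisTime decoder itself: the uniform bin layout, the expectation-based readout, and the name `decode_time` are all assumptions made for this example.

```python
import math

def decode_time(logits, duration):
    """Turn per-bin logits into a continuous timestamp in seconds.

    Assumption: the video is split into len(logits) equal bins and the
    timestamp is read out as the expected bin center under the softmax.
    """
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    bins = len(probs)
    centers = [(i + 0.5) / bins for i in range(bins)]  # normalized to [0, 1]
    expected = sum(p * c for p, c in zip(probs, centers))
    return expected * duration, probs

# Hypothetical logits for a 100-second video split into 4 bins.
uniform_time, _ = decode_time([0.0, 0.0, 0.0, 0.0], duration=100.0)
peaked_time, _ = decode_time([0.0, 0.0, 50.0, 0.0], duration=100.0)
```

Because the readout is an expectation, mass spread over neighbouring bins moves the estimate smoothly between bin centers, which is one way a distribution can soften boundary ambiguity.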

Authors:Kechen Li, Yaotian Tao, Ximing Wen, Quanwei Sun, Zifei Gong, Chang Xu, Xizhe Zhang, Tianbo Ji
Title: GridRoute: A Benchmark for LLM-Based Route Planning with Cardinal Movement in Grid Environments
Abstract:
Recent advancements in Large Language Models (LLMs) have demonstrated their potential in planning and reasoning tasks, offering a flexible alternative to classical pathfinding algorithms. However, most existing studies focus on LLMs' independent reasoning capabilities and overlook the potential synergy between LLMs and traditional algorithms. To fill this gap, we propose a comprehensive evaluation benchmark, GridRoute, to assess how LLMs can take advantage of traditional algorithms. We also propose a novel hybrid prompting technique called Algorithm of Thought (AoT), which introduces traditional algorithms' guidance into prompting. Our benchmark evaluates six LLMs ranging from 7B to 72B parameters in grid environments of varying sizes, assessing their performance in correctness, optimality, and efficiency. Our results show that AoT significantly boosts performance across all model sizes, particularly in larger or more complex environments, suggesting a promising approach to addressing path planning challenges. Our code is open-sourced at https://github.com/LinChance/GridRoute.
中文: 本研究提出了GridRoute基准,用于评估大语言模型与传统算法在路径规划中的协同作用,并开发了思维算法(AoT)混合提示技术,显著提升了不同复杂度环境下大语言模型的性能表现。
English: The study introduces GridRoute, a benchmark for evaluating the synergy between Large Language Models and traditional algorithms in path planning, and proposes Algorithm of Thought (AoT), a hybrid prompting technique that significantly enhances LLM performance across various complexities.
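
For readers unfamiliar with the classical side of this comparison, a breadth-first search over a grid with cardinal moves is one example of a "traditional algorithm" for this setting. The sketch below is illustrative only: the grid encoding (0 for free, 1 for blocked) and the function name are assumptions, and it is not taken from the GridRoute code base.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """Breadth-first search on a grid with up/down/left/right moves.

    Returns the list of (row, col) cells from start to goal, or None if
    the goal cannot be reached. Assumes the start cell is free.
    """
    rows, cols = len(grid), len(grid[0])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    parent = {start: None}          # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in moves:
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None

# Hypothetical 3x3 map: the middle row is blocked except for the last column.
demo_grid = [[0, 0, 0],
             [1, 1, 0],
             [0, 0, 0]]
demo_path = shortest_path(demo_grid, (0, 0), (2, 0))
```

BFS returns a path with the fewest moves on an unweighted grid, which makes it a natural reference when judging the optimality of an LLM-proposed route.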

Authors:Enshang Zhang, Zhicheng Zhang, Takashi Hanakawa
Title: Category-aware EEG image generation based on wavelet transform and contrast semantic loss
Abstract:
Reconstructing visual stimuli from EEG signals is a crucial step in realizing brain-computer interfaces. In this paper, we propose a transformer-based EEG signal encoder integrating the Discrete Wavelet Transform (DWT) and the gating mechanism. Guided by the feature alignment and category-aware fusion losses, this encoder is used to extract features related to visual stimuli from EEG signals. Subsequently, with the aid of a pre-trained diffusion model, these features are reconstructed into visual stimuli. To verify the effectiveness of the model, we conducted EEG-to-image generation and classification tasks using the THINGS-EEG dataset. To address the limitations of quantitative analysis at the semantic level, we combined WordNet-based classification and semantic similarity metrics to propose a novel semantic-based score, emphasizing the ability of our model to transfer neural activities into visual representations. Experimental results show that our model significantly improves semantic alignment and classification accuracy, achieving a maximum single-subject accuracy of 43% and outperforming other state-of-the-art methods. The source code and supplementary material are available at https://github.com/zes0v0inn/DWT_EEG_Reconstruction/tree/main.
中文摘要:本文提出一种结合离散小波变换和门控机制的基于Transformer的脑电信号编码器,通过预训练扩散模型从脑电信号重建视觉刺激,在THINGS-EEG数据集上实现了最佳的语义对齐效果和高达43%的分类准确率。
English Summary: This paper introduces a transformer-based EEG encoder that integrates Discrete Wavelet Transform and gating mechanisms to reconstruct visual stimuli from brain signals using a diffusion model, achieving state-of-the-art performance in semantic alignment and classification accuracy on the THINGS-EEG dataset.

Authors:Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
Title: AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Abstract:
Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77× training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
Chinese: AReaL是一种异步强化学习系统,通过解耦生成与训练来提高GPU利用率并加速大语言模型训练,在推理任务上实现了最高2.77倍的加速,同时保持或提升了性能。
English: AReaL is an asynchronous reinforcement learning system that decouples generation from training to enhance GPU utilization and accelerate LLM training, achieving up to 2.77× speedup with stable or improved performance on reasoning tasks.
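
The inefficiency described in the abstract, where a synchronous batch waits for its longest rollout, can be illustrated with a toy timing model. This is only a sketch under strong assumptions: it ignores training time, staleness control, and every other part of the real system, and the rollout durations below are made-up illustrative values.

```python
import heapq

def synchronous_time(durations, workers):
    """Each batch of `workers` rollouts finishes when its slowest one does."""
    total = 0.0
    for i in range(0, len(durations), workers):
        total += max(durations[i:i + workers])
    return total

def asynchronous_time(durations, workers):
    """Each worker starts the next rollout as soon as it becomes free."""
    free_at = [0.0] * workers       # min-heap of times when workers are idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)

# Hypothetical generation times: mostly short outputs, two long ones.
rollouts = [1, 1, 1, 8, 1, 1, 8, 1]
sync_total = synchronous_time(rollouts, workers=4)
async_total = asynchronous_time(rollouts, workers=4)
```

In this toy setting the synchronous schedule takes 16 time units and the asynchronous one takes 9, because idle workers pick up new rollouts instead of waiting for the long outputs.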

Authors:James R. Golden
Title: Large Language Models are Locally Linear Mappings
Abstract:
We demonstrate that the inference operations of several open-weight large language models (LLMs) can be mapped to an exactly equivalent linear system for an input sequence without modifying the model weights or altering output predictions. Extending techniques from image diffusion models that exhibit local or piecewise linearity, we strategically alter the gradient computation with respect to a given input sequence for a next-token prediction such that the Jacobian of the model nearly exactly reproduces the forward prediction with a linear system. We demonstrate this approach across models (Llama 3, Gemma 3, Qwen 3, Phi 4, Mistral Ministral and OLMo 2, up to Llama 3.3 70B Q4) and show through the singular value decomposition of the detached Jacobian that these LLMs operate in extremely low-dimensional subspaces where many of the largest singular vectors decode to concepts related to the most-likely output token. This approach also allows us to examine the operation of each successive layer (and its attention and MLP components) as nearly-exact linear systems and observe the emergence of semantic concepts. Despite their expressive power and global nonlinearity, modern LLMs can be interpreted through nearly-exact locally linear decompositions that provide insights into their internal representations and reveal interpretable semantic structures in the next-token prediction process.
中文: 研究表明,多种大型语言模型的推理过程无需修改权重即可精确映射为线性系统,通过奇异值分解发现这些模型在低维空间中运行,其中主要向量对应预测词汇相关的语义概念。
English: This study shows that the inference processes of various large language models can be accurately represented as linear systems without changing model weights, revealing through singular value decomposition that these models operate in low-dimensional spaces where dominant vectors correspond to semantic concepts related to predicted tokens.
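
A much simpler setting shows what "locally linear" means here. For a bias-free ReLU network, the Jacobian at an input reproduces the forward pass exactly when applied to that same input. The sketch below demonstrates only this toy property. It is not the paper's detached-Jacobian procedure for real LLMs, and the layer sizes and random weights are arbitrary assumptions.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def forward(W1, W2, x):
    """Two-layer bias-free ReLU network: W2 @ relu(W1 @ x)."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def local_jacobian(W1, W2, x):
    """Jacobian at x: W2 @ diag(mask) @ W1, where mask marks active units."""
    mask = [1.0 if h > 0 else 0.0 for h in matvec(W1, x)]
    return [[sum(W2[i][k] * mask[k] * W1[k][j] for k in range(len(W1)))
             for j in range(len(x))]
            for i in range(len(W2))]

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
W2 = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(2)]
x = [0.3, -1.2, 0.7]

y_network = forward(W1, W2, x)
y_linear = matvec(local_jacobian(W1, W2, x), x)   # same values as y_network
```

The linear map changes whenever the set of active units changes, so the network is linear only locally, for this particular input.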

Authors:Liancheng Fang, Aiwei Liu, Henry Peng Zou, Yankai Chen, Hengrui Zhang, Zhongfen Deng, Philip S. Yu
Title: MUSE: Model-Agnostic Tabular Watermarking via Multi-Sample Selection
Abstract:
We introduce MUSE, a watermarking algorithm for tabular generative models. Previous approaches typically leverage DDIM invertibility to watermark tabular diffusion models, but tabular diffusion models exhibit significantly poorer invertibility compared to other modalities, compromising performance. Simultaneously, tabular diffusion models require substantially less computation than other modalities, enabling a multi-sample selection approach to tabular generative model watermarking. MUSE embeds watermarks by generating multiple candidate samples and selecting one based on a specialized scoring function, without relying on model invertibility. Our theoretical analysis establishes the relationship between watermark detectability, candidate count, and dataset size, allowing precise calibration of watermarking strength. Extensive experiments demonstrate that MUSE achieves state-of-the-art watermark detectability and robustness against various attacks while maintaining data quality, and remains compatible with any tabular generative model supporting repeated sampling, effectively addressing key challenges in tabular data watermarking. Specifically, it reduces the distortion rates on fidelity metrics by 81-89%, while achieving a 1.0 TPR@0.1%FPR detection rate. Implementation of MUSE can be found at https://github.com/fangliancheng/MUSE.
Chinese: MUSE是一种创新的表格生成模型水印算法,通过生成多个候选样本并基于专用评分函数进行选择,绕过了可逆性限制,在保持数据质量的同时实现了卓越的水印检测能力和鲁棒性。
English: MUSE is a novel watermarking algorithm for tabular generative models that bypasses invertibility limitations by generating multiple candidate samples and selecting one via a specialized scoring function, achieving superior detectability and robustness while maintaining data quality.
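
The multi-sample selection idea can be sketched in a few lines: draw several candidate rows, keep the one a keyed scoring function ranks highest, and detect the watermark by checking whether a table's average score is unusually high. The keyed-hash score, the toy row generator, and all names below are assumptions for illustration. The abstract does not specify MUSE's actual scoring function.

```python
import hashlib
import random

def score(row, key):
    """Keyed pseudo-random score in [0, 1) (hypothetical scoring function)."""
    digest = hashlib.sha256((key + "|" + repr(row)).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def watermarked_sample(generate, key, m):
    """Draw m candidates from the generator and keep the highest-scoring one."""
    candidates = [generate() for _ in range(m)]
    return max(candidates, key=lambda row: score(row, key))

def mean_score(rows, key):
    """Detection statistic: about 0.5 without a watermark, higher with one."""
    return sum(score(r, key) for r in rows) / len(rows)

rng = random.Random(0)

def generate():
    """Stand-in for any tabular generator that supports repeated sampling."""
    return (round(rng.gauss(0, 1), 4), rng.randint(18, 90))

plain_rows = [generate() for _ in range(500)]
marked_rows = [watermarked_sample(generate, "secret", m=4) for _ in range(500)]
```

With m candidates per row, the expected score of the selected row is m / (m + 1) if scores are uniform, so more candidates give a stronger signal at higher sampling cost. Without the correct key the marked rows look unremarkable.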

Authors:Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim
Title: Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games
Abstract:
Large Language Models (LLMs) have shown potential in simulating human behaviors and performing theory-of-mind (ToM) reasoning, a crucial skill for complex social interactions. In this study, we investigate the role of ToM reasoning in aligning agentic behaviors with human norms in negotiation tasks, using the ultimatum game as a controlled environment. We initialized LLM agents with different prosocial beliefs (including Greedy, Fair, and Selfless) and reasoning methods like chain-of-thought (CoT) and varying ToM levels, and examined their decision-making processes across diverse LLMs, including reasoning models like o3-mini and DeepSeek-R1 Distilled Qwen 32B. Results from 2,700 simulations indicated that ToM reasoning enhances behavior alignment, decision-making consistency, and negotiation outcomes. Consistent with previous findings, reasoning models exhibit limited capability compared to models with ToM reasoning, and different roles in the game benefit from different orders of ToM reasoning. Our findings contribute to the understanding of ToM's role in enhancing human-AI interaction and cooperative decision-making. The code used for our experiments can be found at https://github.com/Stealth-py/UltimatumToM.
中文摘要:本研究证明,在谈判任务中,心理理论推理能显著提高大型语言模型代理行为与人类规范的契合度,在不同亲社会信念设置下增强决策一致性和谈判结果。
English Summary: This study demonstrates that theory-of-mind (ToM) reasoning significantly improves the alignment of LLM agent behaviors with human norms in negotiation tasks, enhancing decision-making consistency and outcomes across various prosocial belief settings.

Authors:Prasanna Reddy Pulakurthi, Majid Rabbani, Jamison Heard, Sohail Dianat, Celso M. de Melo, Raghuveer Rao
Title: Shuffle PatchMix Augmentation with Confidence-Margin Weighted Pseudo-Labels for Enhanced Source-Free Domain Adaptation
Abstract:
This work investigates Source-Free Domain Adaptation (SFDA), where a model adapts to a target domain without access to source data. A new augmentation technique, Shuffle PatchMix (SPM), and a novel reweighting strategy are introduced to enhance performance. SPM shuffles and blends image patches to generate diverse and challenging augmentations, while the reweighting strategy prioritizes reliable pseudo-labels to mitigate label noise. These techniques are particularly effective on smaller datasets like PACS, where overfitting and pseudo-label noise pose greater risks. State-of-the-art results are achieved on three major benchmarks: PACS, VisDA-C, and DomainNet-126. Notably, on PACS, improvements of 7.3% (79.4% to 86.7%) and 7.2% are observed in single-target and multi-target settings, respectively, while gains of 2.8% and 0.7% are attained on DomainNet-126 and VisDA-C. This combination of advanced augmentation and robust pseudo-label reweighting establishes a new benchmark for SFDA. The code is available at: https://github.com/PrasannaPulakurthi/SPM
中文: 本研究提出了用于无源域自适应的Shuffle PatchMix增强技术和新型重加权策略,通过提升数据多样性和减少伪标签噪声,在多个基准测试中实现了最先进的性能。
English: This study introduces Shuffle PatchMix augmentation and a novel reweighting strategy for Source-Free Domain Adaptation, achieving state-of-the-art performance across multiple benchmarks by enhancing data diversity and reducing pseudo-label noise.
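
As a rough picture of what "shuffling and blending image patches" could look like, the sketch below permutes the patches of a grayscale image and linearly blends the shuffled result with the original. The blending rule, the `alpha` parameter, and the function name are assumptions for illustration. The abstract does not give the exact SPM formulation.

```python
import random

def shuffle_and_blend_patches(image, patch, alpha, rng):
    """Permute square patches, then blend them with the original image.

    image: 2D list of pixel values whose sides are multiples of `patch`.
    alpha: weight of the shuffled patches (0 keeps the original image).
    """
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    origins = [(r, c) for r in range(0, height, patch)
               for c in range(0, width, patch)]
    sources = origins[:]
    rng.shuffle(sources)            # random patch permutation
    out = [row[:] for row in image]
    for (r, c), (sr, sc) in zip(origins, sources):
        for i in range(patch):
            for j in range(patch):
                shuffled = image[sr + i][sc + j]
                original = image[r + i][c + j]
                out[r + i][c + j] = alpha * shuffled + (1 - alpha) * original
    return out

# Hypothetical 4x4 image with pixel values 0..15, split into 2x2 patches.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented = shuffle_and_blend_patches(toy_image, patch=2, alpha=0.5,
                                      rng=random.Random(0))
```

Intermediate values of `alpha` keep part of the original layout visible while still perturbing local structure.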

Authors:Jiwan Chung, Janghan Yoon, Junhyeong Park, Sangeyl Lee, Joowon Yang, Sooyeon Park, Youngjae Yu
Title: Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?
Abstract:
Any-to-any generative models aim to enable seamless interpretation and generation across multiple modalities within a unified framework, yet their ability to preserve relationships across modalities remains uncertain. Do unified models truly achieve cross-modal coherence, or is this coherence merely perceived? To explore this, we introduce ACON, a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs to evaluate cross-modal transfers rigorously. Using three consistency criteria (cyclic consistency, forward equivariance, and conjugated equivariance), our experiments reveal that any-to-any models do not consistently demonstrate greater cross-modal consistency than specialized models in pointwise evaluations such as cyclic consistency. However, equivariance evaluations uncover weak but observable consistency through structured analyses of the intermediate latent space enabled by multiple editing operations. We release our code and data at https://github.com/JiwanChung/ACON.
Chinese: 本研究引入ACON数据集评估多模态生成模型的跨模态一致性,发现统一模型在循环一致性方面未优于专用模型,但通过潜在空间的等变性分析可观测到微弱的一致性表现。
English: This study introduces the ACON dataset to evaluate cross-modal consistency in any-to-any generative models, finding they do not outperform specialized models in cyclic consistency but show weak consistency through equivariance analysis of latent spaces.

Authors:Zheng Tan, Weizhen Wang, Andrea L. Bertozzi, Ernest K. Ryu
Title: STORK: Improving the Fidelity of Mid-NFE Sampling for Diffusion and Flow Matching Models
Abstract:
Diffusion models (DMs) have demonstrated remarkable performance in high-fidelity image and video generation. Because high-quality generations with DMs typically require a large number of function evaluations (NFEs), resulting in slow sampling, there has been extensive research successfully reducing the NFE to a small range (<10) while maintaining acceptable image quality. However, many practical applications, such as those involving Stable Diffusion 3.5, FLUX, and SANA, commonly operate in the mid-NFE regime (20-50 NFE) to achieve superior results, and, despite the practical relevance, research on the effective sampling within this mid-NFE regime remains underexplored. In this work, we propose a novel, training-free, and structure-independent DM ODE solver called the Stabilized Taylor Orthogonal Runge--Kutta (STORK) method, based on a class of stiff ODE solvers with a Taylor expansion adaptation. Unlike prior work such as DPM-Solver, which is dependent on the semi-linear structure of the DM ODE, STORK is applicable to any DM sampling, including noise-based and flow matching-based models. Within the 20-50 NFE range, STORK achieves improved generation quality, as measured by FID scores, across unconditional pixel-level generation and conditional latent-space generation tasks using models like Stable Diffusion 3.5 and SANA. Code is available at https://github.com/ZT220501/STORK.
中文: STORK方法通过同时解决ODE刚性和结构限制问题,为扩散模型和流匹配模型实现了保持质量的快速采样,持续提升了图像和视频生成效果。
English: STORK is a training-free, structure-independent ODE solver for diffusion and flow-matching models that targets the mid-NFE regime (20-50 function evaluations), where it improves generation quality as measured by FID.

Authors:Zheng Tan, Weizhen Wang, Andrea L. Bertozzi, Ernest K. Ryu
Title: STORK: Faster Diffusion And Flow Matching Sampling By Resolving Both Stiffness And Structure-Dependence
Abstract:
Diffusion models (DMs) and flow-matching models have demonstrated remarkable performance in image and video generation. However, such models require a significant number of function evaluations (NFEs) during sampling, leading to costly inference. Consequently, quality-preserving fast sampling methods that require fewer NFEs have been an active area of research. However, prior training-free sampling methods fail to simultaneously address two key challenges: the stiffness of the ODE (i.e., the non-straightness of the velocity field) and dependence on the semi-linear structure of the DM ODE (which limits their direct applicability to flow-matching models). In this work, we introduce the Stabilized Taylor Orthogonal Runge--Kutta (STORK) method, addressing both design concerns. We demonstrate that STORK consistently improves the quality of diffusion and flow-matching sampling for image and video generation. Code is available at https://github.com/ZT220501/STORK.
中文: STORK方法通过同时解决ODE刚性和结构限制问题,为扩散模型和流匹配模型实现了保持质量的快速采样,持续提升了图像和视频生成效果。
English: The STORK method is introduced to enable quality-preserving fast sampling for diffusion and flow-matching models by addressing both ODE stiffness and structural limitations, consistently improving image and video generation.

Authors:Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu
Title: Boosting All-in-One Image Restoration via Self-Improved Privilege Learning
Abstract:
Unified image restoration models for diverse and mixed degradations often suffer from unstable optimization dynamics and inter-task conflicts. This paper introduces Self-Improved Privilege Learning (SIPL), a novel paradigm that overcomes these limitations by innovatively extending the utility of privileged information (PI) beyond training into the inference stage. Unlike conventional Privilege Learning, where ground-truth-derived guidance is typically discarded after training, SIPL empowers the model to leverage its own preliminary outputs as pseudo-privileged signals for iterative self-refinement at test time. Central to SIPL is Proxy Fusion, a lightweight module incorporating a learnable Privileged Dictionary. During training, this dictionary distills essential high-frequency and structural priors from privileged feature representations. Critically, at inference, the same learned dictionary then interacts with features derived from the model's initial restoration, facilitating a self-correction loop. SIPL can be seamlessly integrated into various backbone architectures, offering substantial performance improvements with minimal computational overhead. Extensive experiments demonstrate that SIPL significantly advances the state-of-the-art on diverse all-in-one image restoration benchmarks. For instance, when integrated with the PromptIR model, SIPL achieves remarkable PSNR improvements of +4.58 dB on composite degradation tasks and +1.28 dB on diverse five-task benchmarks, underscoring its effectiveness and broad applicability. Codes are available at our project page https://github.com/Aitical/SIPL.
中文: 本文提出的自改进特权学习(SIPL)创新地将特权信息应用扩展至推理阶段,通过自我修正机制实现迭代优化,在多种图像修复任务中以最小计算成本取得了最先进的性能突破。
English: This paper introduces Self-Improved Privilege Learning (SIPL), a novel paradigm that extends privileged information utilization into the inference stage, enabling iterative self-refinement and achieving state-of-the-art performance across diverse image restoration tasks with minimal computational overhead.

Authors:Lan-Cuong Nguyen, Quan Nguyen-Tri, Bang Tran Khanh, Dung D. Le, Long Tran-Thanh, Khoat Than
Title: Provably Improving Generalization of Few-Shot Models with Synthetic Data
Abstract:
Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising way to alleviate this issue, but models trained on synthetic samples often face performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. More importantly, our framework suggests practical ways to generate good synthetic samples and to train a predictor with high generalization ability. Building upon this framework, we propose a novel theory-based algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach outperforms state-of-the-art methods across multiple datasets.
Chinese Summary: 本研究提出了一种理论框架和算法,通过优化的原型学习有效弥合了少样本图像分类中真实与合成数据间的分布差距,在多个数据集上超越了现有最优方法的性能表现。
English Summary: This study introduces a theoretical framework and algorithm that effectively bridge the distribution gap between real and synthetic data in few-shot image classification, achieving superior performance over state-of-the-art methods through optimized prototype learning.

Authors:Katherine Tieu, Dongqi Fu, Jun Wu, Jingrui He
Title: Invariant Link Selector for Spatial-Temporal Out-of-Distribution Problem
Abstract:
In the era of foundation models, Out-of-Distribution (OOD) problems, i.e., the data discrepancy between the training environments and testing environments, hinder AI generalization. Further, relational data like graphs, which disobey the Independent and Identically Distributed (IID) condition, make the problem more challenging, especially when the data is associated with time. Motivated by this, to realize robust invariant learning over temporal graphs, we investigate which components in temporal graphs are most invariant and representative with respect to labels. With the Information Bottleneck (IB) method, we propose an error-bounded Invariant Link Selector that can distinguish invariant components and variant components during the training process to make the deep learning model generalizable for different testing scenarios. Besides deriving a series of rigorous generalizable optimization functions, we also equip the training with task-specific loss functions, e.g., temporal link prediction, to make pretrained models solve real-world application tasks like citation recommendation and merchandise recommendation, as demonstrated in our experiments with state-of-the-art (SOTA) methods. Our code is available at https://github.com/kthrn22/OOD-Linker.
中文: 针对时序图中的分布外问题,本研究基于信息瓶颈方法提出了一种误差有界的恒定链接选择器,能区分恒定与变异组件,从而提升模型在引文推荐和商品推荐等实际任务中的泛化能力。
English: In response to Out-of-Distribution challenges in temporal graphs, this study introduces an error-bounded Invariant Link Selector using the Information Bottleneck method to identify invariant components, enhancing model generalization for tasks like citation and merchandise recommendation.

Authors:Junyu Chen, Shuwen Wei, Yihao Liu, Aaron Carass, Yong Du
Title: Pretraining Deformable Image Registration Networks with Random Images
Abstract:
Recent advances in deep learning-based medical image registration have shown that training deep neural networks (DNNs) does not necessarily require medical images. Previous work showed that DNNs trained on randomly generated images with carefully designed noise and contrast properties can still generalize well to unseen medical data. Building on this insight, we propose using registration between random images as a proxy task for pretraining a foundation model for image registration. Empirical results show that our pretraining strategy improves registration accuracy, reduces the amount of domain-specific data needed to achieve competitive performance, and accelerates convergence during downstream training, thereby enhancing computational efficiency.
中文: 最新研究表明,利用随机生成的图像对医学图像配准的深度神经网络进行预训练,可有效提升配准精度、数据利用效率和计算速度。
English: Recent research demonstrates that deep neural networks for medical image registration can be effectively pretrained using randomly generated images, improving accuracy, data efficiency, and computational speed.

Authors:Shilin Xu, Yanwei Li, Rui Yang, Tao Zhang, Yueyi Sun, Wei Chow, Linfeng Li, Hang Song, Qi Xu, Yunhai Tong, Xiangtai Li, Hao Fei
Title: Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models
Abstract:
Recent works on large language models (LLMs) have successfully demonstrated the emergence of reasoning capabilities via reinforcement learning (RL). Although recent efforts leverage group relative policy optimization (GRPO) for MLLM post-training, they each explore only one specific aspect, such as grounding tasks, math problems, or chart analysis. No existing work leverages multi-source MLLM tasks for stable reinforcement learning. In this work, we present a unified perspective to solve this problem. We present Mixed-R1, a unified yet straightforward framework that contains a mixed reward function design (Mixed-Reward) and a mixed post-training dataset (Mixed-45K). We first design a data engine to select high-quality examples to build the Mixed-45K post-training dataset. Then, we present a Mixed-Reward design, which contains various reward functions for various MLLM tasks. In particular, it has four different reward functions: matching reward for binary answer or multiple-choice problems, chart reward for chart-aware datasets, IoU reward for grounding problems, and open-ended reward for long-form text responses such as caption datasets. To handle the various long-form text content, we propose a new open-ended reward named Bidirectional Max-Average Similarity (BMAS) by leveraging tokenizer embedding matching between the generated response and the ground truth. Extensive experiments show the effectiveness of our proposed method on various MLLMs, including Qwen2.5-VL and Intern-VL of various sizes. Our dataset and model are available at https://github.com/xushilin1/mixed-r1.
Chinese: 本文提出Mixed-R1框架,通过混合奖励函数设计和混合后训练数据集,解决了多模态大语言模型在多源任务中稳定强化学习的难题,并在Qwen2.5-VL和Intern-VL等模型上验证了其有效性。
English: This paper introduces Mixed-R1, a unified framework that combines a mixed reward function design and a mixed post-training dataset to enable stable reinforcement learning across diverse multimodal large language model tasks, demonstrating effectiveness on models like Qwen2.5-VL and Intern-VL.
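Of the four rewards listed in the abstract, the IoU reward for grounding has a standard definition, so it can be illustrated with a minimal sketch. The function name `iou_reward` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration; the abstract does not say how Mixed-R1 parses boxes or scales the reward.

```python
def iou_reward(pred, gt):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    The box layout is an assumption, not taken from the Mixed-R1 code.
    """
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter

    # Degenerate boxes have zero union; give them zero reward.
    return inter / union if union > 0 else 0.0
```

A perfect prediction scores 1.0 and a disjoint one scores 0.0, which gives the policy a dense signal between the two extremes.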

Authors:Chiwei Zhu, Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Zhendong Mao
Title: Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability
Abstract:
Training language models with rationale augmentation has been shown to be beneficial in many existing works. In this paper, we identify that such a prevailing view does not hold consistently. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance as well as a novel perspective of model reliability. The results lead to several key findings that add new insights upon existing understandings: 1) Rationales can, at times, deteriorate model performance; 2) Rationales can, at times, improve model reliability, even outperforming their untrained counterparts; 3) A linear correspondence exists between the performance and reliability improvements, while both are driven by the intrinsic difficulty of the task. These findings provide informative regulations on the broad utilization of rationales and raise critical implications on the procedure of explicitly aligning language models with implicit human thoughts. Codes can be found at https://github.com/Ignoramus0817/rationales.
Chinese: 本研究挑战了理性增强必然提升语言模型性能的普遍观点,揭示其有时反而会损害性能却能提高可靠性,这两种效应均受任务难度驱动,为模型与人类思维的隐性对齐提供了新见解。
English: This study challenges the prevailing view that rationales always enhance language model performance, revealing they can sometimes impair it while improving reliability, with both effects driven by task difficulty and offering new insights for aligning models with human reasoning.

Authors:Jiashuai Liu, Yingjia Shang, Yingkang Zhan, Di Zhang, Yi Niu, Dong Wei, Xian Wu, Zeyu Gao, Chen Li, Yefeng Zheng
Title: The Butterfly Effect in Pathology: Exploring Security in Pathology Foundation Models
Abstract:
With the widespread adoption of pathology foundation models in both research and clinical decision support systems, exploring their security has become a critical concern. However, despite their growing impact, the vulnerability of these models to adversarial attacks remains largely unexplored. In this work, we present the first systematic investigation into the security of pathology foundation models for whole slide image (WSI) analysis against adversarial attacks. Specifically, we introduce the principle of "local perturbation with global impact" and propose a label-free attack framework that operates without requiring access to downstream task labels. Under this attack framework, we revise four classical white-box attack methods and redefine the perturbation budget based on the characteristics of WSI. We conduct comprehensive experiments on three representative pathology foundation models across five datasets and six downstream tasks. Despite modifying only 0.1% of patches per slide with imperceptible noise, our attack leads to downstream accuracy degradation that can reach up to 20% in the worst cases. Furthermore, we analyze key factors that influence attack success, explore the relationship between patch-level vulnerability and semantic content, and conduct a preliminary investigation into potential defence strategies. These findings lay the groundwork for future research on the adversarial robustness and reliable deployment of pathology foundation models. Our code is publicly available at: https://github.com/Jiashuai-Liu-hmos/Attack-WSI-pathology-foundation-models.
Chinese: 本研究首次系统探究了病理学基础模型在全切片图像分析中的安全漏洞,提出无需下游任务标签的对抗攻击框架,仅修改0.1%图像区块即可导致下游任务准确率最高下降20%,为模型可靠性研究奠定基础。
English: This study conducts the first systematic investigation into the security vulnerabilities of pathology foundation models for whole slide image analysis, introducing a label-free adversarial attack framework that degrades downstream task accuracy by up to 20% while modifying only 0.1% of patches with imperceptible noise.

Authors:Peiran Xu, Yadong Mu
Title: Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors
Abstract:
In this work, we focus on the task of weakly supervised affordance grounding, where a model is trained to identify affordance regions on objects using human-object interaction images and egocentric object images without dense labels. Previous works are mostly built upon class activation maps, which are effective for semantic segmentation but may not be suitable for locating actions and functions. Leveraging recent advanced foundation models, we develop a supervised training pipeline based on pseudo labels. The pseudo labels are generated from an off-the-shelf part segmentation model, guided by a mapping from affordance to part names. Furthermore, we introduce three key enhancements to the baseline model: a label refining stage, a fine-grained feature alignment process, and a lightweight reasoning module. These techniques harness the semantic knowledge of static objects embedded in off-the-shelf foundation models to improve affordance learning, effectively bridging the gap between objects and actions. Extensive experiments demonstrate that the performance of the proposed model has achieved a breakthrough improvement over existing methods. Our codes are available at https://github.com/woyut/WSAG-PLSP.
Chinese: 本研究提出了一种弱监督功能定位模型,利用基础模型和伪标签来弥合物体与动作之间的鸿沟,通过标签精炼、特征对齐和推理模块实现了突破性的性能提升。
English: This study introduces a weakly supervised affordance grounding model that leverages foundation models and pseudo labels to bridge the gap between objects and actions, achieving breakthrough performance improvements through label refinement, feature alignment, and a reasoning module.

Authors:Sayed T. Nowroz, Nermeen M. Saleh, Siam Shakur, Sean Banerjee, Fathi Amsaad
Title: A Benchmark Reference for ESP32-CAM Module
Abstract:
The ESP32-CAM is one of the most widely adopted open-source modules for prototyping embedded vision applications. Since its release in 2019, it has gained popularity among both hobbyists and professional developers due to its affordability, versatility, and integrated wireless capabilities. Despite its widespread use, comprehensive documentation of the performance metrics remains limited. This study addresses this gap by collecting and analyzing over six hours of real-time video streaming logs across all supported resolutions of the OV2640 image sensor, tested under five distinct voltage conditions via an HTTP-based WiFi connection. A long-standing bug in the official Arduino ESP32 driver, responsible for inaccurate frame rate logging, was fixed. The resulting analysis includes key performance metrics such as instantaneous and average frame rate, total streamed data, transmission count, and internal chip temperature. The influence of varying power levels was evaluated to assess the reliability of the module.
中文: 本研究通过分析ESP32-CAM在五种电压条件下六小时以上的OV2640传感器实时视频流数据,填补了性能指标文档空白,提供了关键性能参数并评估了模块可靠性。
English: This study fills a documentation gap by analyzing over six hours of ESP32-CAM video streaming data across all OV2640 sensor resolutions under varying voltage conditions, providing key performance metrics and assessing module reliability.

Authors:Qiao Xiao, Alan Ansell, Boqian Wu, Lu Yin, Mykola Pechenizkiy, Shiwei Liu, Decebal Constantin Mocanu
Title: Leave it to the Specialist: Repair Sparse LLMs with Sparse Fine-Tuning via Sparsity Evolution
Abstract:
Large language models (LLMs) have achieved remarkable success across various tasks but face deployment challenges due to their massive computational demands. While post-training pruning methods like SparseGPT and Wanda can effectively reduce the model size, they struggle to maintain model performance at high sparsity levels, limiting their utility for downstream tasks. Existing fine-tuning methods, such as full fine-tuning and LoRA, fail to preserve sparsity because they require updating the whole dense matrices, which makes them poorly suited to sparse LLMs. In this paper, we propose Sparsity Evolution Fine-Tuning (SEFT), a novel method designed specifically for sparse LLMs. SEFT dynamically evolves the sparse topology of pruned models during fine-tuning, while preserving the overall sparsity throughout the process. The strengths of SEFT lie in its ability to perform task-specific adaptation through a weight drop-and-grow strategy, enabling the pruned model to self-adapt its sparse connectivity pattern based on the target dataset. Furthermore, a sensitivity-driven pruning criterion is employed to ensure that the desired sparsity level is consistently maintained throughout fine-tuning. Our experiments on various LLMs, including LLaMA families, DeepSeek, and Mistral, across a diverse set of benchmarks demonstrate that SEFT achieves stronger performance while offering superior memory and time efficiency compared to existing baselines. Our code is publicly available at: https://github.com/QiaoXiao7282/SEFT.
中文: 本文提出SEFT方法,通过动态优化剪枝后大语言模型的稀疏拓扑结构进行微调,在保持稀疏性的同时显著提升模型性能,在多项测试中展现出卓越的效率和效果优势。
English: This paper introduces SEFT, a novel fine-tuning method that dynamically adjusts the sparse structure of pruned large language models to enhance performance while maintaining sparsity, demonstrating superior efficiency and effectiveness across multiple benchmarks.
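The weight drop-and-grow idea mentioned in the abstract can be sketched generically. The version below drops the smallest-magnitude active weights and regrows the inactive positions with the largest gradient magnitude, which is the common dynamic-sparse-training recipe and not necessarily SEFT's sensitivity-driven criterion; the function name `drop_and_grow`, the flat-list layout, and both selection rules are assumptions.

```python
def drop_and_grow(weights, mask, grads, k):
    """One topology update that keeps the number of active weights fixed.

    weights, mask, grads are equal-length flat lists; mask[i] is 1 for an
    active weight and 0 for a pruned one. Selection rules are illustrative
    assumptions, not SEFT's actual criterion.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    k = min(k, len(active), len(inactive))

    # Drop: the k active weights with the smallest magnitude.
    drop = sorted(active, key=lambda i: abs(weights[i]))[:k]
    # Grow: the k previously inactive positions with the largest gradient magnitude.
    grow = sorted(inactive, key=lambda i: -abs(grads[i]))[:k]

    new_weights, new_mask = list(weights), list(mask)
    for i in drop:
        new_mask[i] = 0
        new_weights[i] = 0.0
    for i in grow:
        new_mask[i] = 1
        new_weights[i] = 0.0  # regrown weights start at zero and are then trained
    return new_weights, new_mask
```

Because exactly as many positions are grown as are dropped, the overall sparsity level is unchanged after each update.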

Authors:Amel Gader, Alsayed Algergawy
Title: GenIC: An LLM-Based Framework for Instance Completion in Knowledge Graphs
Abstract:
Knowledge graph completion aims to address the gaps of knowledge bases by adding new triples that represent facts. The complexity of this task depends on how many parts of a triple are already known. Instance completion involves predicting the relation-tail pair when only the head is given (h, ?, ?). Notably, modern knowledge bases often contain entity descriptions and types, which can provide valuable context for inferring missing facts. By leveraging these textual descriptions and the ability of large language models to extract facts from them and recognize patterns within the knowledge graph schema, we propose an LLM-powered, end-to-end instance completion approach. Specifically, we introduce GenIC: a two-step Generative Instance Completion framework. The first step focuses on property prediction, treated as a multi-label classification task. The second step is link prediction, framed as a generative sequence-to-sequence task. Experimental results on three datasets show that our method outperforms existing baselines. Our code is available at https://github.com/amal-gader/genic.
中文摘要:作者提出GenIC生成式实例补全框架,利用实体描述和大语言模型预测知识图谱中缺失的关系与实体,在实验中展现出优于基线方法的性能。
English Summary: The authors propose GenIC, a generative instance completion framework that leverages entity descriptions and large language models to predict missing relations and entities in knowledge graphs, demonstrating superior performance over baselines in experiments.

Authors:Vishal Dey, Xiao Hu, Xia Ning
Title: Large Language Models for Controllable Multi-property Multi-objective Molecule Optimization
Abstract:
In real-world drug design, molecule optimization requires selectively improving multiple molecular properties up to pharmaceutically relevant levels, while maintaining others that already meet such criteria. However, existing computational approaches and instruction-tuned LLMs fail to capture such nuanced property-specific objectives, limiting their practical applicability. To address this, we introduce C-MuMOInstruct, the first instruction-tuning dataset focused on multi-property optimization with explicit, property-specific objectives. Leveraging C-MuMOInstruct, we develop GeLLMO-Cs, a series of instruction-tuned LLMs that can perform targeted property-specific optimization. Our experiments across 5 in-distribution and 5 out-of-distribution tasks show that GeLLMO-Cs consistently outperform strong baselines, achieving up to 126% higher success rate. Notably, GeLLMO-Cs exhibit impressive 0-shot generalization to novel optimization tasks and unseen instructions. This offers a step toward a foundational LLM to support realistic, diverse optimizations with property-specific objectives. C-MuMOInstruct and code are accessible through https://github.com/ninglab/GeLLMO-C.
中文摘要:本研究提出了首个针对多属性分子优化的指令调优数据集C-MuMOInstruct,并开发了GeLLMO-Cs模型系列,该模型在保持特定属性的同时显著提升其他分子性能,成功率较基线最高提升126%,且对新任务展现出卓越的零样本泛化能力。
English Summary: This study introduces C-MuMOInstruct, the first instruction-tuning dataset for multi-property molecular optimization, and develops GeLLMO-Cs models that significantly outperform baselines with up to 126% higher success rates while demonstrating strong generalization to novel tasks.

Authors:Feiteng Fang, Ting-En Lin, Yuchuan Wu, Xiong Liu, Xiang Huang, Dingwei Chen, Jing Ye, Haonan Zhang, Liang Zhu, Hamid Alinejad-Rokny, Min Yang, Fei Huang, Yongbin Li
Title: ChARM: Character-based Act-adaptive Reward Modeling for Advanced Role-Playing Language Agents
Abstract:
Role-Playing Language Agents (RPLAs) aim to simulate characters for realistic and engaging human-computer interactions. However, traditional reward models often struggle with scalability and adapting to subjective conversational preferences. We propose ChARM, a Character-based Act-adaptive Reward Model, addressing these challenges through two innovations: (1) an act-adaptive margin that significantly enhances learning efficiency and generalizability, and (2) a self-evolution mechanism leveraging large-scale unlabeled data to improve training coverage. Additionally, we introduce RoleplayPref, the first large-scale preference dataset specifically for RPLAs, featuring 1,108 characters, 13 subcategories, and 16,888 bilingual dialogues, alongside RoleplayEval, a dedicated evaluation benchmark. Experimental results show a 13% improvement over the conventional Bradley-Terry model in preference rankings. Furthermore, applying ChARM-generated rewards to preference learning techniques (e.g., direct preference optimization) achieves state-of-the-art results on CharacterEval and RoleplayEval. Code and dataset are available at https://github.com/calubkk/ChARM.
Chinese: 提出的ChARM模型通过行为自适应边界和自我进化机制改进了角色扮演语言代理,在偏好排名上提升了13%,并在评估基准中取得了最优结果。
English: The proposed ChARM model enhances role-playing language agents with an act-adaptive margin and self-evolution mechanism, achieving a 13% improvement in preference rankings and state-of-the-art results on evaluation benchmarks.

Authors:David Ma, Huaqing Yuan, Xingjian Wang, Qianbo Zang, Tianci Liu, Xinyang He, Yanbin Wei, Jiawei Guo, Ni Jiahui, Zhenzhu Yang, Meng Cao, Shanghaoran Quan, Yizhi Li, Wangchunshu Zhou, Jiaheng Liu, Wenhao Huang, Ge Zhang, Shiwen Ni, Xiaojie Jin
Title: ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
Abstract:
Although long-video understanding demands that models capture hierarchical temporal information, from clip (seconds) and shot (tens of seconds) to event (minutes) and story (hours), existing benchmarks either neglect this multi-scale design or scatter scale-specific questions across different videos, preventing direct comparison of model performance across timescales on the same content. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales, namely clip (seconds), shot (tens of seconds), event (minutes), and story (hours), all within the same video content. This within-content multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 long videos (avg. 86 min) from 5 main categories and 36 sub-categories, with 4 to 8 carefully designed questions per video, including at least one question for each timescale. Evaluating 23 MLLMs reveals a U-shaped performance curve, with higher accuracy at the shortest and longest timescales and a dip at intermediate levels. Furthermore, ablation studies show that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong offers a fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at https://github.com/multimodal-art-projection/ScaleLong.
中文:ScaleLong基准通过在相同视频内容中嵌入多时间尺度问题,实现了模型在分层时间级别上的直接性能比较,揭示了U形准确率曲线并证明增加视觉标记容量可提升推理能力。
English: The ScaleLong benchmark introduces multi-timescale questions within the same video content to enable direct comparison of model performance across hierarchical temporal levels, revealing a U-shaped accuracy curve and demonstrating that increased visual token capacity improves reasoning.

Authors:Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Qiguang Chen, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Bernard Ghanem, Ping Luo, Guohao Li
Title: OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
Abstract:
Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain-agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain-specific tool-calling capabilities. This decoupling enables cross-domain transferability during both inference and training phases. During inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents. For training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain-agnostic planner with reinforcement learning from real-world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi-domain agentic tasks. Experimental results demonstrate Workforce achieves open-source state-of-the-art performance (69.70%), outperforming commercial systems like OpenAI's Deep Research by 2.34%. More notably, our OWL-trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT-4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general-purpose AI assistants.
中文:Workforce提出了一种分层多智能体框架,通过将规划与执行解耦,借助模块化智能体和优化训练实现跨领域适应能力,在GAIA等基准测试中取得了领先性能。
English: Workforce introduces a hierarchical multi-agent framework that decouples planning from execution, enabling cross-domain adaptability through modular agents and optimized training, achieving state-of-the-art performance on benchmarks like GAIA.

Authors:Hongrui Peng, Haolang Lu, Yuanlong Yu, Weiye Fu, Kun Wang, Guoshun Nan
Title: KGMark: A Diffusion Watermark for Knowledge Graphs
Abstract:
Knowledge graphs (KGs) are ubiquitous in numerous real-world applications, and watermarking facilitates protecting intellectual property and preventing potential harm from AI-generated content. Existing watermarking methods mainly focus on static plain text or image data, while they can hardly be applied to dynamic graphs due to spatial and temporal variations of structured data. This motivates us to propose KGMARK, the first graph watermarking framework that aims to generate robust, detectable, and transparent diffusion fingerprints for dynamic KG data. Specifically, we propose a novel clustering-based alignment method to adapt the watermark to spatial variations. Meanwhile, we present a redundant embedding strategy to harden the diffusion watermark against various attacks, facilitating the robustness of the watermark to the temporal variations. Additionally, we introduce a novel learnable mask matrix to improve the transparency of diffusion fingerprints. By doing so, our KGMARK properly tackles the variation challenges of structured data. Experiments on various public benchmarks show the effectiveness of our proposed KGMARK. Our code is available at https://github.com/phrara/kgmark.
中文摘要:KGMARK是首个为动态知识图谱设计的图水印框架,通过基于聚类的对齐方法和冗余嵌入策略,有效应对结构化数据的时空变化,生成鲁棒、可检测且透明的扩散指纹。
English Summary: KGMARK is the first graph watermarking framework designed to create robust, detectable, and transparent diffusion fingerprints for dynamic knowledge graphs, effectively addressing spatial and temporal variations through clustering-based alignment and redundant embedding strategies.

Authors:Wei Zhuo, Zhaohuan Zhan, Han Yu
Title: Personalized Subgraph Federated Learning with Differentiable Auxiliary Projections
Abstract:
Federated Learning (FL) on graph-structured data typically faces non-IID challenges, particularly in scenarios where each client holds a distinct subgraph sampled from a global graph. In this paper, we introduce Federated learning with Auxiliary projections (FedAux), a personalized subgraph FL framework that learns to align, compare, and aggregate heterogeneously distributed local models without sharing raw data or node embeddings. In FedAux, each client jointly trains (i) a local GNN and (ii) a learnable auxiliary projection vector (APV) that differentiably projects node embeddings onto a 1D space. A soft-sorting operation followed by a lightweight 1D convolution refines these embeddings in the ordered space, enabling the APV to effectively capture client-specific information. After local training, these APVs serve as compact signatures that the server uses to compute inter-client similarities and perform similarity-weighted parameter mixing, yielding personalized models while preserving cross-client knowledge transfer. Moreover, we provide rigorous theoretical analysis to establish the convergence and rationality of our design. Empirical evaluations across diverse graph benchmarks demonstrate that FedAux substantially outperforms existing baselines in both accuracy and personalization performance. The code is available at https://github.com/JhuoW/FedAux.
中文摘要:FedAux是一种面向图数据的个性化联邦学习框架,通过辅助投影向量在不共享原始数据的情况下对齐和聚合本地模型,在多个基准测试中实现了卓越的准确性和个性化性能。
English Summary: FedAux is a personalized federated learning framework for graph data that uses auxiliary projection vectors to align and aggregate local models without sharing raw data, achieving superior accuracy and personalization across benchmarks.
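The server-side step described in the abstract (compute inter-client similarities from the APVs, then mix parameters with similarity weights) can be sketched as follows. The use of cosine similarity, the softmax normalisation, the `temperature` parameter, and the function names are all assumptions for illustration; the abstract does not specify FedAux's exact weighting.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def personalized_mix(apvs, params, temperature=1.0):
    """Return one mixed parameter vector per client.

    apvs[i] is client i's auxiliary projection vector and params[i] its
    flattened model parameters. Softmax over cosine similarity is an
    assumed weighting, not necessarily the one used by FedAux.
    """
    mixed = []
    for apv_i in apvs:
        sims = [cosine(apv_i, apv_j) / temperature for apv_j in apvs]
        top = max(sims)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        mixed.append([
            sum(w * p[k] for w, p in zip(weights, params))
            for k in range(len(params[0]))
        ])
    return mixed
```

Clients with similar APVs end up with nearly the same mixture, while dissimilar clients stay closer to their own parameters, which is the personalization effect the abstract describes.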

Authors:Chang Liu, Bohao Zhao, Jingtao Ding, Huandong Wang, Yong Li
Title: Mamba Integrated with Physics Principles Masters Long-term Chaotic System Forecasting
Abstract:
Long-term forecasting of chaotic systems from short-term observations remains a fundamental and underexplored challenge due to the intrinsic sensitivity to initial conditions and the complex geometry of strange attractors. Existing approaches often rely on long-term training data or focus on short-term sequence correlations, struggling to maintain predictive stability and dynamical coherence over extended horizons. We propose PhyxMamba, a novel framework that integrates a Mamba-based state-space model with physics-informed principles to capture the underlying dynamics of chaotic systems. By reconstructing the attractor manifold from brief observations using time-delay embeddings, PhyxMamba extracts global dynamical features essential for accurate forecasting. Our generative training scheme enables Mamba to replicate the physical process, augmented by multi-token prediction and attractor geometry regularization for physical constraints, enhancing prediction accuracy and preserving key statistical invariants. Extensive evaluations on diverse simulated and real-world chaotic systems demonstrate that PhyxMamba delivers superior long-term forecasting and faithfully captures essential dynamical invariants from short-term data. This framework opens new avenues for reliably predicting chaotic systems under observation-scarce conditions, with broad implications across climate science, neuroscience, epidemiology, and beyond. Our code is open-source at https://github.com/tsinghua-fib-lab/PhyxMamba.
中文摘要:PhyxMamba是一种新颖的物理信息Mamba框架,通过重构吸引子流形并保持动力学不变量,能够基于短期观测数据实现对混沌系统的精准长期预测。
English Summary: PhyxMamba is a novel physics-informed Mamba framework that accurately forecasts chaotic systems long-term from short observations by reconstructing attractor manifolds and preserving dynamical invariants.
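The time-delay embedding that the abstract uses to reconstruct the attractor from a short scalar observation can be illustrated with a minimal sketch. The function name `delay_embed` and the parameter names `dim` and `tau` are assumptions; how PhyxMamba chooses the embedding dimension and lag is not stated in the abstract.

```python
def delay_embed(series, dim, tau):
    """Map a scalar series x to delay vectors (x[i], x[i+tau], ..., x[i+(dim-1)*tau]).

    dim is the embedding dimension and tau the lag in samples; both are
    illustrative parameters, not values taken from PhyxMamba.
    """
    n_vectors = len(series) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series too short for the requested dim and tau")
    return [
        tuple(series[i + j * tau] for j in range(dim))
        for i in range(n_vectors)
    ]
```

For example, `delay_embed([0, 1, 2, 3, 4, 5], dim=3, tau=2)` yields the two points `(0, 2, 4)` and `(1, 3, 5)`; each point is one reconstructed state on the attractor.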

Authors:Zheng Gong, Ziyi Jiang, Weihao Gao, Deng Zhuo, Lan Ma
Title: A New Deep-learning-Based Approach For mRNA Optimization: High Fidelity, Computation Efficiency, and Multiple Optimization Factors
Abstract:
mRNA optimization is critical for therapeutic and biotechnological applications, since sequence features directly govern protein expression levels and efficacy. However, current methods face significant challenges in simultaneously achieving three key objectives: (1) fidelity (preventing unintended amino acid changes), (2) computational efficiency (speed and scalability), and (3) the scope of optimization variables considered (multi-objective capability). Furthermore, existing methods often fall short of comprehensively incorporating the factors related to the mRNA lifecycle and translation process, including intrinsic mRNA sequence properties, secondary structure, translation elongation kinetics, and tRNA availability. To address these limitations, we introduce RNop, a novel deep learning-based method for mRNA optimization. We collect a large-scale dataset containing over 3 million sequences and design four specialized loss functions, the GPLoss, CAILoss, tAILoss, and MFELoss, which simultaneously enable explicit control over sequence fidelity while optimizing species-specific codon adaptation, tRNA availability, and desirable mRNA secondary structure features. Then, we demonstrate RNop's effectiveness through extensive in silico and in vivo experiments. RNop ensures high sequence fidelity, achieves significant computational throughput up to 47.32 sequences/s, and yields optimized mRNA sequences resulting in a significant increase in protein expression for functional proteins compared to controls. RNop surpasses current methodologies in both quantitative metrics and experimental validation, opening a path toward efficient and effective mRNA design. Code and models will be available at https://github.com/HudenJear/RPLoss.
Chinese: RNop是一种新型的深度学习方法,通过确保高序列保真度、计算效率及对密码子适应性、tRNA可用性和mRNA结构的多目标控制,克服了当前mRNA优化的局限,显著提高了蛋白质表达水平。
English: RNop is a novel deep learning method that overcomes current limitations in mRNA optimization by ensuring high sequence fidelity, computational efficiency, and multi-objective control over codon adaptation, tRNA availability, and mRNA structure, resulting in significantly enhanced protein expression.
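
The two quantities named in the abstract, fidelity and codon adaptation, can be illustrated with a minimal sketch. It uses the textbook Codon Adaptation Index (the geometric mean of per-codon relative-adaptiveness weights) and a plain translation check; it is not RNop's CAILoss or GPLoss, whose definitions the abstract does not give. The weight values, the partial codon table, and all function names are hypothetical.

```python
import math

# Hypothetical relative-adaptiveness weights (codon frequency divided by the
# frequency of the most-used synonymous codon). Real tables are species-specific.
WEIGHTS = {"GCU": 1.0, "GCC": 0.5, "AAA": 1.0, "AAG": 0.25}

# Partial codon table covering only the codons used in this example.
CODON_TO_AA = {"GCU": "A", "GCC": "A", "AAA": "K", "AAG": "K"}


def codons(seq):
    """Split an mRNA string into consecutive triplets (trailing bases dropped)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq):
    """Translate an mRNA string into its amino acid sequence."""
    return "".join(CODON_TO_AA[c] for c in codons(seq))


def cai(seq):
    """Codon Adaptation Index: geometric mean of the per-codon weights."""
    ws = [WEIGHTS[c] for c in codons(seq)]
    return math.exp(sum(math.log(w) for w in ws) / len(ws))


def is_faithful(original, optimized):
    """Fidelity check: both sequences must encode the same protein."""
    return translate(original) == translate(optimized)
```

Under these toy weights, replacing `GCCAAG` with the synonymous `GCUAAA` raises the index while `is_faithful` still holds, which is the kind of trade-off a fidelity-constrained optimizer has to respect.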

Authors:Renye Zhang, Mengyun Yang, Qichang Zhao, Jianxin Wang
Title: BiBLDR: Bidirectional Behavior Learning for Drug Repositioning
Abstract:
Drug repositioning aims to identify potential new indications for existing drugs to reduce the time and financial costs associated with developing new drugs. Most existing deep learning-based drug repositioning methods predominantly utilize graph-based representations. However, graph-based drug repositioning methods struggle to perform effective inference in cold-start scenarios involving novel drugs because of the lack of association information with the diseases. Unlike traditional graph-based approaches, we propose a bidirectional behavior learning strategy for drug repositioning, called BiBLDR. This framework redefines drug repositioning as a behavior sequential learning task to capture drug-disease interaction patterns. First, we construct bidirectional behavioral sequences based on drug and disease sides. The consideration of bidirectional information ensures a more meticulous and rigorous characterization of the behavioral sequences. Subsequently, we propose a two-stage strategy for drug repositioning. In the first stage, we construct prototype spaces to characterize the representational attributes of drugs and diseases. In the second stage, these refined prototypes and bidirectional behavior sequence data are leveraged to predict potential drug-disease associations. Based on this learning approach, the model can more robustly and precisely capture the interactive relationships between drug and disease features from bidirectional behavioral sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on benchmark datasets. Meanwhile, BiBLDR demonstrates significantly superior performance compared to previous methods in cold-start scenarios. Our code is available at https://github.com/Renyeeah/BiBLDR.
中文摘要:本文提出的BiBLDR框架将药物重定位重新定义为双向行为学习任务,通过构建原型空间和利用双向行为序列数据,有效解决了传统图方法在冷启动场景下对新药关联预测的局限性。
English Summary: The proposed BiBLDR framework redefines drug repositioning as a bidirectional behavior learning task to overcome cold-start limitations of graph-based methods by capturing drug-disease interaction patterns through prototype spaces and sequential data analysis.

Authors:Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh
Title: OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities
Abstract:
The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (approximately 120× faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
中文摘要:本文提出OMNIGUARD方法,通过利用大语言模型中跨语言和跨模态对齐的内部表征来检测有害提示,在多语言和跨模态场景下显著提升了分类准确率与检测效率。
English Summary: The paper introduces OMNIGUARD, a method that detects harmful prompts across languages and modalities by leveraging aligned internal representations of LLMs, significantly improving classification accuracy and efficiency over existing baselines.

Authors:Michael Shalyt, Rotem Elimelech, Ido Kaminer
Title: ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark
Abstract:
Large language models (LLMs) are rapidly approaching the level of proficiency in university-level symbolic mathematics required for applications in advanced science and technology. However, existing benchmarks fall short in assessing the core skills of LLMs in symbolic mathematics, such as integration, differential equations, and algebraic simplification. To address this gap, we introduce ASyMOB, a novel assessment framework focused exclusively on symbolic manipulation, featuring 17,092 unique math challenges, organized by similarity and complexity. ASyMOB enables analysis of LLM generalization capabilities by comparing performance in problems that differ by simple numerical or symbolic 'perturbations'. Evaluated LLMs exhibit substantial degradation in performance for all perturbation types (up to -70.3%), suggesting reliance on memorized patterns rather than deeper understanding of symbolic math, even among models achieving high baseline accuracy. Comparing LLM performance to computer algebra systems (CAS), we identify examples where CAS fail while LLMs succeed, as well as problems solved only by combining both approaches. Models capable of integrated code execution yielded higher accuracy compared to their performance without code, particularly stabilizing weaker models (up to +33.1% for certain perturbation types). Notably, the most advanced models (o4-mini, Gemini 2.5 Flash) demonstrate not only high symbolic math proficiency (scoring 96.8% and 97.6% on the unperturbed set), but also remarkable robustness against perturbations (-21.7% and -21.2% vs. average -50.4% for the other models). This may indicate a recent "phase transition" in the generalization capabilities of frontier LLMs. It remains to be seen whether the path forward lies in deeper integration with sophisticated external tools, or in developing models so capable that symbolic math systems like CAS become unnecessary.
中文: 大型语言模型在符号数学方面虽取得进展,但ASyMOB评估框架显示其泛化能力不足,主要依赖记忆而非深层理解;不过顶尖模型如o4-mini和Gemini 2.5 Flash表现出卓越的解题能力和抗干扰性。
English: Large language models are advancing in symbolic mathematics but struggle with generalization, as shown by the ASyMOB framework, which reveals their reliance on memorization rather than true understanding, though top models like o4-mini and Gemini 2.5 Flash exhibit high proficiency and robustness.

Authors:Zhenglun Kong, Zheng Zhan, Shiyue Hou, Yifan Gong, Xin Meng, Pengwei Sui, Peiyan Dong, Xuan Shen, Zifeng Wang, Pu Zhao, Hao Tang, Stratis Ioannidis, Yanzhi Wang
Title: Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation
Abstract:
Large language models (LLMs) have shown remarkable promise but remain challenging to continually improve through traditional finetuning, particularly when integrating capabilities from other specialized LLMs. Popular methods like ensemble and weight merging require substantial memory and struggle to adapt to changing data environments. Recent efforts have transferred knowledge from multiple LLMs into a single target model; however, they suffer from interference and degraded performance among tasks, largely due to limited flexibility in candidate selection and training pipelines. To address these issues, we propose a framework that adaptively selects and aggregates knowledge from diverse LLMs to build a single, stronger model, avoiding the high memory overhead of ensemble and inflexible weight merging. Specifically, we design an adaptive selection network that identifies the most relevant source LLMs based on their scores, thereby reducing knowledge interference. We further propose a dynamic weighted fusion strategy that accounts for the inherent strengths of candidate LLMs, along with a feedback-driven loss function that prevents the selector from converging on a single subset of sources. Experimental results demonstrate that our method can enable a more stable and scalable knowledge aggregation process while reducing knowledge interference by up to 50% compared to existing approaches. Code is available at https://github.com/ZLKong/LLM_Integration
中文摘要:我们提出的框架自适应地选择和整合来自不同大语言模型的知识,通过自适应选择网络和动态融合策略构建更强大的单一模型,将知识干扰降低高达50%并减少内存开销。
English Summary: Our proposed framework adaptively selects and aggregates knowledge from diverse LLMs to build a stronger single model, reducing memory overhead and knowledge interference by up to 50% through an adaptive selection network and dynamic fusion strategy.

Authors:Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, Jinho D. Choi
Title: Measuring Sycophancy of Language Models in Multi-turn Dialogues
Abstract:
Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy, that is, conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model's ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user's underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in the debate scenario. We release our code and data at https://github.com/JiseungHong/SYCON-Bench.
中文摘要:大型语言模型常表现出迎合用户观点的谄媚行为,SYCON Bench基准测试表明该现象普遍存在,同时证明推理优化和第三人称提示能显著降低谄媚倾向。
English Summary: Large Language Models frequently exhibit sycophancy by conforming to user beliefs, and the SYCON Bench benchmark reveals this behavior persists across models while showing that reasoning optimization and third-person prompting can significantly reduce it.
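
One way to read the two metrics in the abstract is sketched below. The input is assumed to be a per-turn list of booleans saying whether the model still holds its initial stance; how SYCON Bench actually labels each turn, and how it scores a model that never flips, are not stated in the abstract, so those choices (and both function names) are assumptions.

```python
def turn_of_flip(holds_stance):
    """1-based index of the first turn where the model abandons its stance.

    Returns None when the model never flips (a convention of this sketch).
    """
    for turn, holds in enumerate(holds_stance, start=1):
        if not holds:
            return turn
    return None


def number_of_flips(holds_stance):
    """Count stance changes between consecutive turns."""
    return sum(a != b for a, b in zip(holds_stance, holds_stance[1:]))
```

For a dialogue labeled `[True, True, False, True, False]`, the model first conforms at turn 3 and changes stance three times in total; a later Turn of Flip and a lower Number of Flip both indicate more resistance to user pressure.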

Authors:Zaixi Zhang, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang
Title: GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance
Abstract:
DNA, encoding genetic instructions for almost all living organisms, fuels groundbreaking advances in genomics and synthetic biology. Recently, DNA Foundation Models have achieved success in designing synthetic functional DNA sequences, even whole genomes, but their susceptibility to jailbreaking remains underexplored, raising concerns about the generation of harmful sequences such as pathogens or toxin-producing genes. In this paper, we introduce GeneBreaker, the first framework to systematically evaluate jailbreak vulnerabilities of DNA foundation models. GeneBreaker employs (1) an LLM agent with customized bioinformatic tools to design high-homology, non-pathogenic jailbreaking prompts, (2) beam search guided by PathoLM and log-probability heuristics to steer generation toward pathogen-like sequences, and (3) a BLAST-based evaluation pipeline against a curated Human Pathogen Database (JailbreakDNABench) to detect successful jailbreaks. Evaluated on our JailbreakDNABench, GeneBreaker successfully jailbreaks the latest Evo series models across 6 viral categories consistently (up to 60% Attack Success Rate for Evo2-40B). Further case studies on SARS-CoV-2 spike protein and HIV-1 envelope protein demonstrate the sequence and structural fidelity of jailbreak output, while evolutionary modeling of SARS-CoV-2 underscores biosecurity risks. Our findings also reveal that scaling DNA foundation models amplifies dual-use risks, motivating enhanced safety alignment and tracing mechanisms. Our code is at https://github.com/zaixizhang/GeneBreaker.
中文摘要:本文提出GeneBreaker框架,首次系统评估DNA基础模型的越狱漏洞,成功生成类病原体序列并揭示其生物安全风险,表明模型扩展会放大双用途风险。
English Summary: This paper introduces GeneBreaker, a framework that systematically tests and reveals jailbreak vulnerabilities in DNA foundation models, successfully generating pathogen-like sequences and highlighting significant biosecurity risks.

Authors:Trenton Chang, Tobias Schnabel, Adith Swaminathan, Jenna Wiens
Title: A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs
Abstract:
Despite advances in large language models (LLMs) on reasoning and instruction-following benchmarks, it remains unclear whether they can reliably produce outputs aligned with a broad variety of user goals, a concept we refer to as steerability. The abundance of methods proposed to modify LLM behavior makes it unclear whether current LLMs are already steerable, or require further intervention. In particular, LLMs may exhibit (i) poor coverage, where rare user goals are underrepresented; (ii) miscalibration, where models overshoot requests; and (iii) side effects, where changes to one dimension of text inadvertently affect others. To systematically evaluate these failures, we introduce a framework based on a multi-dimensional goal space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, we find that current LLMs struggle with steerability, as side effects are persistent. Interventions to improve steerability, such as prompt engineering, best-of-N sampling, and reinforcement learning fine-tuning, have varying effectiveness, yet side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient. We open-source our steerability evaluation framework at https://github.com/MLD3/steerability.
中文: 当前大型语言模型在可控性方面存在不足,表现出覆盖率低、校准偏差和副作用持续等问题,现有对齐策略虽效果各异但仍显不足。
English: Current large language models struggle with steerability, exhibiting issues like poor coverage, miscalibration, and persistent side effects, and existing alignment strategies remain insufficient despite varying effectiveness.

Authors:Abhijit Talluri
Title: DP-RTFL: Differentially Private Resilient Temporal Federated Learning for Trustworthy AI in Regulated Industries
Abstract:
Federated Learning (FL) has emerged as a critical paradigm for enabling privacy-preserving machine learning, particularly in regulated sectors such as finance and healthcare. However, standard FL strategies often encounter significant operational challenges related to fault tolerance, system resilience against concurrent client and server failures, and the provision of robust, verifiable privacy guarantees essential for handling sensitive data. These deficiencies can lead to training disruptions, data loss, compromised model integrity, and non-compliance with data protection regulations (e.g., GDPR, CCPA). This paper introduces Differentially Private Resilient Temporal Federated Learning (DP-RTFL), an advanced FL framework designed to ensure training continuity, precise state recovery, and strong data privacy. DP-RTFL integrates local Differential Privacy (LDP) at the client level with resilient temporal state management and integrity verification mechanisms, such as hash-based commitments (referred to as Zero-Knowledge Integrity Proofs or ZKIPs in this context). The framework is particularly suited for critical applications like credit risk assessment using sensitive financial data, aiming to be operationally robust, auditable, and scalable for enterprise AI deployments. The implementation of the DP-RTFL framework is available as open-source.
Chinese: 联邦学习在容错性和隐私保护方面存在操作挑战,本文提出的DP-RTFL框架通过整合本地差分隐私和弹性状态管理机制,为敏感数据应用提供了具备可审计性的鲁棒训练解决方案。
English: Federated Learning faces operational challenges in fault tolerance and privacy, which the proposed DP-RTFL framework addresses by integrating local differential privacy and resilient state management to ensure robust, auditable training for sensitive applications.
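
The two client-side ingredients named in the abstract, local noise addition and hash-based commitments, can be sketched as follows. This is a generic clip-and-add-Gaussian-noise step plus a SHA-256 commitment over a JSON-encoded state; it is not the DP-RTFL implementation, it does not calibrate the noise to a specific privacy budget, and every function name and parameter here is hypothetical.

```python
import hashlib
import json
import math
import random


def clip_and_noise(update, clip_norm, noise_multiplier, rng):
    """Clip an update to L2 norm `clip_norm`, then add Gaussian noise.

    The noise standard deviation is noise_multiplier * clip_norm.
    """
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm
    return [x * scale + rng.gauss(0.0, sigma) for x in update]


def commit(state, nonce):
    """SHA-256 commitment over a nonce and a canonical JSON encoding of the state."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(nonce + payload).hexdigest()


def verify(state, nonce, commitment):
    """Recompute the commitment and compare it with the stored one."""
    return commit(state, nonce) == commitment
```

A client would call `clip_and_noise` on its update before sending it, and a recovery routine could call `verify` on a checkpointed state before resuming training from it.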

Authors:Lin Mu, Xiaoyu Wang, Li Ni, Yang Li, Zhize Wu, Peiquan Jin, Yiwen Zhang
Title: DenseLoRA: Dense Low-Rank Adaptation of Large Language Models
Abstract:
Low-rank adaptation (LoRA) has been developed as an efficient approach for adapting large language models (LLMs) by fine-tuning two low-rank matrices, thereby reducing the number of trainable parameters. However, prior research indicates that many of the weights in these matrices are redundant, leading to inefficiencies in parameter utilization. To address this limitation, we introduce Dense Low-Rank Adaptation (DenseLoRA), a novel approach that enhances parameter efficiency while achieving superior performance compared to LoRA. DenseLoRA builds upon the concept of representation fine-tuning, incorporating a single Encoder-Decoder to refine and compress hidden representations across all adaptation layers before applying adaptation. Instead of relying on two redundant low-rank matrices as in LoRA, DenseLoRA adapts LLMs through a dense low-rank matrix, improving parameter utilization and adaptation efficiency. We evaluate DenseLoRA on various benchmarks, showing that it achieves 83.8% accuracy with only 0.01% of trainable parameters, compared to LoRA's 80.8% accuracy with 0.70% of trainable parameters on LLaMA3-8B. Additionally, we conduct extensive experiments to systematically assess the impact of DenseLoRA's components on overall model performance. Code is available at https://github.com/mulin-ahu/DenseLoRA.
中文: DenseLoRA通过使用稠密低秩矩阵替代冗余的低秩矩阵来提升大语言模型适配中的参数效率,在LLaMA3-8B上仅用0.01%可训练参数就实现了83.8%的准确率。
English: DenseLoRA enhances parameter efficiency in adapting large language models by using a dense low-rank matrix instead of redundant low-rank matrices, achieving 83.8% accuracy with only 0.01% trainable parameters on LLaMA3-8B.
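
The parameter gap reported above can be made concrete with a short calculation. Standard LoRA trains two matrices of shapes d × r and r × k for each adapted d × k weight, so r(d + k) parameters. The sketch assumes, as one reading of the abstract, that the per-layer dense matrix is r × r; the cost of the shared Encoder-Decoder is not counted because the abstract gives no dimensions for it. The function names and the example sizes are illustrative.

```python
def lora_params(d, k, r):
    """Trainable parameters of a LoRA adapter on a d x k weight: B (d x r) and A (r x k)."""
    return r * (d + k)


def dense_core_params(r):
    """Trainable parameters of a single dense r x r matrix (assumed shape)."""
    return r * r


# Example sizes chosen for illustration only.
d, k, r = 4096, 4096, 16
ratio = lora_params(d, k, r) / dense_core_params(r)
```

With these sizes the LoRA adapter has 131,072 parameters per layer against 256 for the dense core, a factor of 512 before the shared components are added back in.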

Authors:Yuli Chen, Bo Cheng, Jiale Han, Yingying Zhang, Yingting Li, Shuhao Zhang
Title: DLP: Dynamic Layerwise Pruning in Large Language Models
Abstract:
Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code at https://github.com/ironartisan/DLP to facilitate future research.
中文: 动态分层剪枝(DLP)方法通过整合模型权重和输入激活信息自适应地确定各层重要性,在高稀疏度下有效保持大语言模型性能,并能与现有压缩技术无缝兼容。
English: The proposed Dynamic Layerwise Pruning (DLP) method adaptively determines layer importance by integrating model weights and input activations, effectively preserving LLM performance at high sparsity levels while demonstrating compatibility with existing compression techniques.
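
Non-uniform layerwise pruning in general can be sketched as two steps: turn per-layer importance scores into per-layer sparsity rates whose mean matches the global target, then prune each layer by weight magnitude at its own rate. The allocation rule below is a hypothetical linear one and the importance scores are taken as given; DLP derives importance from weights and input activations in a way the abstract does not specify, so this is not the DLP algorithm.

```python
def allocate_sparsity(importance, target, spread):
    """Per-layer sparsity rates whose mean equals `target`.

    More important layers receive lower sparsity; `spread` bounds how far
    any layer's rate may deviate from the target.
    """
    mean_imp = sum(importance) / len(importance)
    max_dev = max(abs(s - mean_imp) for s in importance) or 1.0
    return [target - spread * (s - mean_imp) / max_dev for s in importance]


def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of a flat weight list."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]
```

With importance scores `[1.0, 2.0, 3.0]`, a 0.7 target and a 0.1 spread, the least important layer is pruned at roughly 0.8 and the most important at roughly 0.6, while the average stays at 0.7.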

Authors:Yuan Li, Qi Luo, Xiaonan Li, Bufan Li, Qinyuan Cheng, Bo Wang, Yining Zheng, Yuxin Wang, Zhangyue Yin, Xipeng Qiu
Title: R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning
Abstract:
Retrieval-Augmented Generation (RAG) integrates external knowledge with Large Language Models (LLMs) to enhance factual correctness and mitigate hallucination. However, dense retrievers often become the bottleneck of RAG systems due to their limited parameters compared to LLMs and their inability to perform step-by-step reasoning. While prompt-based iterative RAG attempts to address these limitations, it is constrained by human-designed workflows. To address these limitations, we propose R3-RAG, which uses Reinforcement learning to make the LLM learn how to Reason and Retrieve step by step, thus retrieving comprehensive external knowledge and leading to correct answers. R3-RAG is divided into two stages. We first use cold start to make the model learn the manner of iteratively interleaving reasoning and retrieval. Then we use reinforcement learning to further harness its ability to better explore the external retrieval environment. Specifically, we propose two rewards for R3-RAG: 1) answer correctness for outcome reward, which judges whether the trajectory leads to a correct answer; 2) relevance-based document verification for process reward, encouraging the model to retrieve documents that are relevant to the user question, through which we can let the model learn how to iteratively reason and retrieve relevant documents to get the correct answer. Experimental results show that R3-RAG significantly outperforms baselines and can transfer well to different retrievers. We release R3-RAG at https://github.com/Yuan-Li-FNLP/R3-RAG.
中文摘要:R3-RAG通过强化学习框架使大语言模型自主掌握交替推理与检索的迭代策略,在提升答案准确性和系统性能方面显著优于现有基线方法。
English Summary: R3-RAG introduces a reinforcement learning framework that enables Large Language Models to autonomously learn iterative reasoning and retrieval strategies, significantly improving answer accuracy and outperforming existing methods.

Authors:Sefik Serengil, Alper Ozpinar
Title: LightDSA: A Python-Based Hybrid Digital Signature Library and Performance Analysis of RSA, DSA, ECDSA and EdDSA in Variable Configurations, Elliptic Curve Forms and Curves
Abstract:
Digital signature algorithms (DSAs) are fundamental to cryptographic security, ensuring data integrity and authentication. While RSA, DSA, ECDSA, and EdDSA are widely used, their performance varies significantly depending on key sizes, hash functions, and elliptic curve configurations. In this paper, we introduce LightDSA, a hybrid and configurable digital signature library that supports RSA, DSA, ECDSA, and EdDSA with flexible form and curve selection, open sourced at https://github.com/serengil/LightDSA. Unlike conventional implementations that impose strict curve-form mappings, such as Weierstrass for ECDSA and Edwards for EdDSA, LightDSA allows arbitrary combinations, enabling a broader performance evaluation. We analyze the computational efficiency of these algorithms across various configurations, comparing key generation, signing, and verification times. Our results provide insights into the trade-offs between security and efficiency, guiding the selection of optimal configurations for different cryptographic needs.
Chinese: 本文介绍了LightDSA,一个灵活的开源数字签名库,支持多种算法和可定制配置,能够进行全面性能分析,并为不同安全与效率需求提供最优选择指导。
English: This paper introduces LightDSA, a flexible and open-source digital signature library that supports multiple algorithms with customizable configurations, enabling comprehensive performance analysis and optimal selection for various security and efficiency requirements.

Authors:Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem
Title: TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models
Abstract:
Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.
中文: TextRegion框架巧妙融合图文模型与SAM2,生成文本对齐的区域标记,无需训练即可在开放世界分割等任务中实现卓越性能,且兼容多种模型,便于扩展应用。
English: The proposed TextRegion framework effectively combines image-text models with SAM2 to produce text-aligned region tokens, enabling superior performance in tasks like open-world segmentation without requiring training, while remaining compatible with various models for easy extension.

Authors:Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai
Title: ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Abstract:
The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.
中文:ZeroGUI是一种无需人工干预的在线学习框架,通过基于视觉语言模型的任务生成与奖励评估,自动训练图形界面代理,显著提升了在动态环境中的性能表现。
English: ZeroGUI is an online learning framework that automates GUI agent training without human intervention by using VLM-based task generation and reward estimation, significantly enhancing performance in dynamic environments.

Authors:Amber Yijia Zheng, Cedar Site Bai, Brian Bullins, Raymond A. Yeh
Title: Model Immunization from a Condition Number Perspective
Abstract:
Model immunization aims to pre-train models that are difficult to fine-tune on harmful tasks while retaining their utility on other non-harmful tasks. Though prior work has shown empirical evidence for immunizing text-to-image models, the key understanding of when immunization is possible and a precise definition of an immunized model remain unclear. In this work, we propose a framework, based on the condition number of a Hessian matrix, to analyze model immunization for linear models. Building on this framework, we design an algorithm with regularization terms to control the resulting condition numbers after pre-training. Empirical results on linear models and non-linear deep-nets demonstrate the effectiveness of the proposed algorithm on model immunization. The code is available at https://github.com/amberyzheng/model-immunization-cond-num.
Chinese: 本研究提出了一个基于Hessian矩阵条件数的框架来分析模型免疫,并通过设计控制条件数的正则化算法,在线性模型和非线性深度网络中成功实现了对抗有害微调的模型免疫保护。
English: This study introduces a framework using the Hessian matrix's condition number to analyze and achieve model immunization, proposing an algorithm that effectively immunizes both linear and non-linear models against harmful fine-tuning while preserving utility.
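
Why a condition number is a natural handle on how hard a model is to fine-tune can be seen from a textbook fact: on a quadratic loss, gradient descent with step size 1/L needs more steps as the ratio of the largest to the smallest Hessian eigenvalue grows. For linear regression the Hessian is X^T X, so its spectrum is fixed by the features. The sketch below illustrates only this background fact on a diagonal quadratic; it is not the paper's immunization algorithm or its regularizers, and the function names are hypothetical.

```python
def condition_number(eigenvalues):
    """Ratio of the largest to the smallest Hessian eigenvalue."""
    return max(eigenvalues) / min(eigenvalues)


def gd_steps(eigenvalues, tol=1e-6, max_steps=1_000_000):
    """Gradient-descent steps needed to bring every coordinate below `tol`.

    Minimizes f(w) = 0.5 * sum(lam_i * w_i**2) from w = (1, ..., 1)
    using the step size 1 / max(eigenvalues).
    """
    lr = 1.0 / max(eigenvalues)
    w = [1.0] * len(eigenvalues)
    for step in range(max_steps):
        if max(abs(x) for x in w) < tol:
            return step
        # Gradient of f with respect to w_i is lam_i * w_i.
        w = [x - lr * lam * x for x, lam in zip(w, eigenvalues)]
    return max_steps
```

Comparing `gd_steps([1.0, 1.0])`, `gd_steps([1.0, 10.0])` and `gd_steps([1.0, 100.0])` shows the step count rising with the condition number, which is the lever a condition-number-based regularizer can act on.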

Authors:Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, Leichen Wang, Xingtao Hu, Hao Sun, Hang Zhao, Hao Zhao
Title: Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models
Abstract:
Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips drawn from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks, improving closed-loop NeuroNCAP scores and collision rates and reaching near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA.
Chinese Summary: Impromptu VLA数据集通过提供超过80,000个精心筛选的视频片段及丰富注释,解决了自动驾驶视觉语言动作模型缺乏针对性基准的问题,显著提升了模型在各项指标上的表现。
English Summary: The Impromptu VLA dataset addresses the lack of specialized benchmarks for autonomous driving Vision-Language-Action models by providing over 80,000 curated video clips with comprehensive annotations, significantly enhancing model performance across multiple metrics.

Authors:Hao Dong, Moru Liu, Jian Liang, Eleni Chatzi, Olga Fink
Title: To Trust Or Not To Trust Your Vision-Language Model's Prediction
Abstract:
Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when a VLM's predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines. By improving the reliability of the model without requiring retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code is available at https://github.com/EPFL-IMOS/TrustVLM.
Chinese: TrustVLM是一种无需训练的框架,通过基于图像嵌入的新型置信度评分函数提升视觉语言模型的可靠性,在多个数据集和架构上实现了误分类检测的最先进性能。
English: TrustVLM is a training-free framework that enhances the reliability of Vision-Language Models by introducing a novel confidence-scoring function based on image embeddings, achieving state-of-the-art performance in misclassification detection across multiple datasets and architectures.
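
The misclassification-detection metrics named in the abstract can be made concrete with a small sketch. This is not TrustVLM's confidence-scoring function; it only illustrates how AUROC and FPR95 are commonly computed from any confidence score, assuming that correct predictions are the positive class. The function names and toy scores are hypothetical.

```python
import math


def auroc(conf_correct, conf_wrong):
    """Probability that a correct prediction receives a higher confidence
    than a wrong one (ties count as 0.5)."""
    wins = 0.0
    for c in conf_correct:
        for w in conf_wrong:
            if c > w:
                wins += 1.0
            elif c == w:
                wins += 0.5
    return wins / (len(conf_correct) * len(conf_wrong))


def fpr_at_95_tpr(conf_correct, conf_wrong):
    """Fraction of wrong predictions still accepted at the threshold that
    keeps at least 95% of the correct predictions."""
    ranked = sorted(conf_correct, reverse=True)
    k = math.ceil(0.95 * len(ranked))   # correct predictions that must be kept
    threshold = ranked[k - 1]
    accepted_wrong = sum(1 for w in conf_wrong if w >= threshold)
    return accepted_wrong / len(conf_wrong)


# Toy confidence scores (illustrative only, not results from the paper).
conf_correct = [0.9, 0.8, 0.75, 0.6]
conf_wrong = [0.7, 0.4, 0.3]
print(auroc(conf_correct, conf_wrong), fpr_at_95_tpr(conf_correct, conf_wrong))
```

A higher AUROC and a lower FPR95 both indicate that the confidence score separates correct from incorrect predictions more cleanly.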

Authors:Qiang Wang, Xiang Song, Yuhang He, Jizhou Han, Chenhao Ding, Xinyuan Gao, Yihong Gong
Title: Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need
Abstract:
Deep neural networks (DNNs) often underperform in real-world, dynamic settings where data distributions change over time. Domain Incremental Learning (DIL) offers a solution by enabling continual model adaptation, with Parameter-Isolation DIL (PIDIL) emerging as a promising paradigm to reduce knowledge conflicts. However, existing PIDIL methods struggle with parameter selection accuracy, especially as the number of domains and corresponding classes grows. To address this, we propose SOYO, a lightweight framework that improves domain selection in PIDIL. SOYO introduces a Gaussian Mixture Compressor (GMC) and Domain Feature Resampler (DFR) to store and balance prior domain data efficiently, while a Multi-level Domain Feature Fusion Network (MDFN) enhances domain feature extraction. Our framework supports multiple Parameter-Efficient Fine-Tuning (PEFT) methods and is validated across tasks such as image classification, object detection, and speech enhancement. Experimental results on six benchmarks demonstrate SOYO's consistent superiority over existing baselines, showcasing its robustness and adaptability in complex, evolving environments. The code will be released at https://github.com/qwangcv/SOYO.
中文摘要:SOYO是一个轻量级框架,通过引入高斯混合压缩器、领域特征重采样器和多级领域特征融合网络,改进了参数隔离领域增量学习中的领域选择,在多个基准测试中展现出卓越性能。
English Summary: SOYO is a lightweight framework that enhances domain selection in Parameter-Isolation Domain Incremental Learning by incorporating a Gaussian Mixture Compressor, Domain Feature Resampler, and Multi-level Domain Feature Fusion Network, demonstrating superior performance across multiple benchmarks.

Authors:Yufan Deng, Xun Guo, Yuanyang Yin, Jacob Zhiyuan Fang, Yiding Yang, Yizhi Wang, Shenghai Yuan, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma
Title: MAGREF: Masked Guidance for Any-Reference Video Generation
Abstract:
Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
中文: 我们提出MAGREF这一统一框架,通过掩码引导实现任意参考视频生成:区域感知动态掩码机制使单一模型无需修改架构即可处理人物、物体和背景等多种主体,逐像素通道拼接机制更好地保留外观特征,性能超越现有开源和商业基线。
English: We introduce MAGREF, a unified framework for any-reference video generation that uses masked guidance, combining region-aware dynamic masking with pixel-wise channel concatenation, to synthesize coherent multi-subject videos from diverse reference images and a text prompt, outperforming existing open-source and commercial baselines.

Authors:Yufan Deng, Yuanyang Yin, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Shenghai Yuan, Yiding Yang, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma
Title: MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement
Abstract:
We tackle the task of any-reference video generation, which aims to synthesize videos conditioned on arbitrary types and combinations of reference subjects, together with textual prompts. This task faces persistent challenges, including identity inconsistency, entanglement among multiple reference subjects, and copy-paste artifacts. To address these issues, we introduce MAGREF, a unified and effective framework for any-reference video generation. Our approach incorporates masked guidance and a subject disentanglement mechanism, enabling flexible synthesis conditioned on diverse reference images and textual prompts. Specifically, masked guidance employs a region-aware masking mechanism combined with pixel-wise channel concatenation to preserve appearance features of multiple subjects along the channel dimension. This design preserves identity consistency and maintains the capabilities of the pre-trained backbone, without requiring any architectural changes. To mitigate subject confusion, we introduce a subject disentanglement mechanism which injects the semantic values of each subject derived from the text condition into its corresponding visual region. Additionally, we establish a four-stage data pipeline to construct diverse training pairs, effectively alleviating copy-paste artifacts. Extensive experiments on a comprehensive benchmark demonstrate that MAGREF consistently outperforms existing state-of-the-art approaches, paving the way for scalable, controllable, and high-fidelity any-reference video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
中文: 我们提出MAGREF这一统一框架,通过掩码引导和主体解耦机制解决任意参考视频生成中的身份不一致、多主体纠缠和复制粘贴伪影问题,实现了基于多样化参考图像与文本提示的高质量视频合成,性能超越现有最优方法。
English: We introduce MAGREF, a unified framework for any-reference video generation that addresses identity inconsistency, subject entanglement, and copy-paste artifacts through masked guidance and subject disentanglement mechanisms, achieving state-of-the-art performance in synthesizing videos from diverse references and text prompts.

Authors:Weijie Wang, Donny Y. Chen, Zeyu Zhang, Duochao Shi, Akide Liu, Bohan Zhuang
Title: ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS
Abstract:
Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their encoders, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight architecture-agnostic module that enables efficient compression of multi-view inputs into a compact latent state $Z$ that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into anchor views, forming the compressed latent state $Z$. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate input views and enhances robustness under dense view settings on two large-scale benchmarks DL3DV-10K and RealEstate10K. The video results, code and trained models are available on our project page: https://lhmd.top/zpressor.
中文摘要:ZPressor是一种轻量级模块,能够将多视角输入高效压缩为紧凑的潜在表示,使前馈式3D高斯溅射模型在处理密集视角输入时显著提升扩展性和性能表现。
English Summary: ZPressor is a lightweight module that enables feed-forward 3D Gaussian Splatting models to efficiently compress multi-view inputs into a compact latent representation, significantly improving scalability and performance with dense view inputs.
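
The anchor/support compression described in the abstract can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the ZPressor implementation: each view is reduced to a single feature vector, anchors are picked by even striding (a hypothetical rule), and the cross attention is single-head with no learned projections. The names `split_views` and `compress` are made up for this example.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def split_views(num_views, num_anchors):
    """Pick evenly strided anchor indices (assumption); the rest are support views."""
    step = num_views / num_anchors
    anchors = sorted({int(i * step) for i in range(num_anchors)})
    support = [i for i in range(num_views) if i not in anchors]
    return anchors, support


def compress(features, num_anchors):
    """Fold support-view features into anchor features with cross attention.

    Anchors act as queries, support views as keys and values; the attended
    result is added to each anchor to form one row of the latent state Z.
    """
    anchors, support = split_views(len(features), num_anchors)
    if not support:                       # nothing to compress
        return [list(features[a]) for a in anchors]
    dim = len(features[0])
    z = []
    for a in anchors:
        q = features[a]
        scores = [
            sum(qi * ki for qi, ki in zip(q, features[s])) / math.sqrt(dim)
            for s in support
        ]
        weights = softmax(scores)
        pooled = [
            sum(w * features[s][d] for w, s in zip(weights, support))
            for d in range(dim)
        ]
        z.append([qd + pd for qd, pd in zip(q, pooled)])
    return z


# Six toy views with 2-D features compressed into two anchor rows.
views = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
z = compress(views, num_anchors=2)
print(len(z), len(z[0]))
```

The size of `z` depends only on the number of anchors, which is why downstream memory stays bounded as more input views are added.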

Authors:Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
Title: Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models
Abstract:
Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to detect than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems. The code is available at https://github.com/MLGroupJLU/Premise_Critique.
Chinese: 大型语言模型常无法批判性评估错误前提,因此开发PCBench以评估和提升其前提批判能力,揭示关键弱点并强调增强输入有效性评估的必要性。
English: Large language models often fail to critically evaluate flawed premises, prompting the development of PCBench to assess and improve their premise critique ability, revealing key vulnerabilities and the need for enhanced input validity evaluation.

Authors:Zixiang Xu, Yanbo Wang, Yue Huang, Jiayi Ye, Haomin Zhuang, Zirui Song, Lang Gao, Chenxi Wang, Zhaorun Chen, Yujun Zhou, Sixian Li, Wang Pan, Yue Zhao, Jieyu Zhao, Xiangliang Zhang, Xiuying Chen
Title: SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models
Abstract:
Large language models (LLMs) are increasingly applied to socially grounded tasks, such as online community moderation, media content analysis, and social reasoning games. Success in these contexts depends on a model's social reasoning ability - the capacity to interpret social contexts, infer others' mental states, and assess the truthfulness of presented information. However, there is currently no systematic evaluation framework that comprehensively assesses the social reasoning capabilities of LLMs. Existing efforts often oversimplify real-world scenarios and consist of tasks that are too basic to challenge advanced models. To address this gap, we introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms. Both automated and human validation are used to ensure data quality. Our evaluation reveals several key insights: models vary substantially in their ability to handle dynamic interactions and integrate temporally evolving information; models with strong chain-of-thought reasoning perform better on tasks requiring deeper inference beyond surface-level cues; and model reasoning degrades significantly under uncertainty. Furthermore, we show that targeted fine-tuning on curated reasoning examples can greatly improve model performance in complex social scenarios. The dataset is publicly available at: https://huggingface.co/datasets/MBZUAI/SocialMaze
中文摘要:SocialMaze基准测试通过涵盖深度推理、动态交互和信息不确定性的六项任务,系统评估大语言模型的社交推理能力,揭示了模型在处理动态信息和不确定性时的表现差异,并证明针对性微调可显著提升复杂社交场景中的表现。
English Summary: The SocialMaze benchmark is introduced to systematically evaluate large language models' social reasoning capabilities through six tasks addressing deep reasoning, dynamic interaction, and information uncertainty, revealing performance variations and demonstrating that targeted fine-tuning enhances performance in complex social scenarios.

Authors:Mohamad Alansari, Sajid Javed, Iyyakutti Iyappan Ganapathi, Sara Alansari, Muzammal Naseer
Title: CLDTracker: A Comprehensive Language Description for Visual Tracking
Abstract:
Visual object tracking (VOT) remains a fundamental yet challenging task in computer vision due to dynamic appearance changes, occlusions, and background clutter. Traditional trackers, relying primarily on visual cues, often struggle in such complex scenarios. Recent advancements in vision-language models (VLMs) have shown promise in semantic understanding for tasks like open-vocabulary detection and image captioning, suggesting their potential for VOT. However, the direct application of VLMs to VOT is hindered by critical limitations: the absence of a rich and comprehensive textual representation that semantically captures the target object's nuances, limiting the effective use of language information; inefficient fusion mechanisms that fail to optimally integrate visual and textual features, preventing a holistic understanding of the target; and a lack of temporal modeling of the target's evolving appearance in the language domain, leading to a disconnect between the initial description and the object's subsequent visual changes. To bridge these gaps and unlock the full potential of VLMs for VOT, we propose CLDTracker, a novel Comprehensive Language Description framework for robust visual Tracking. Our tracker introduces a dual-branch architecture consisting of a textual and a visual branch. In the textual branch, we construct a rich bag of textual descriptions derived from powerful VLMs such as CLIP and GPT-4V, enriched with semantic and contextual cues to address the lack of rich textual representation. Experiments on six standard VOT benchmarks demonstrate that CLDTracker achieves state-of-the-art (SOTA) performance, validating the effectiveness of leveraging robust and temporally-adaptive vision-language representations for tracking. Code and models are publicly available at: https://github.com/HamadYA/CLDTracker
中文: CLDTracker提出了一种双分支框架,通过融合视觉语言模型生成的丰富自适应文本描述来增强视觉目标跟踪性能,在多个基准测试中实现了最先进的结果。
English: CLDTracker introduces a dual-branch framework that enhances visual object tracking by integrating rich, adaptive textual descriptions from vision-language models, achieving state-of-the-art performance across multiple benchmarks.

Authors:Ran Zhang, Mohannad Elhamod
Title: Data-to-Dashboard: Multi-Agent LLM Framework for Insightful Visualization in Enterprise Analytics
Abstract:
The rapid advancement of LLMs has led to the creation of diverse agentic systems in data analysis, utilizing LLMs' capabilities to improve insight generation and visualization. In this paper, we present an agentic system that automates the data-to-dashboard pipeline through modular LLM agents capable of domain detection, concept extraction, multi-perspective analysis generation, and iterative self-reflection. Unlike existing chart QA systems, our framework simulates the analytical reasoning process of business analysts by retrieving domain-relevant knowledge and adapting to diverse datasets without relying on closed ontologies or question templates. We evaluate our system on three datasets across different domains. Benchmarked against GPT-4o with a single-prompt baseline, our approach shows improved insightfulness, domain relevance, and analytical depth, as measured by tailored evaluation metrics and qualitative human assessment. This work contributes a novel modular pipeline to bridge the path from raw data to visualization, and opens new opportunities for human-in-the-loop validation by domain experts in business analytics. All code can be found here: https://github.com/77luvC/D2D_Data2Dashboard
中文: 本文提出了一种模块化LLM智能体系统,通过模拟商业分析师的分析推理过程,实现了从原始数据到可视化看板的自动化流程,在多个领域数据集上展现出比基线方法更优的洞察深度与领域相关性。
English: This paper introduces a modular LLM-based agentic system that automates the data-to-dashboard pipeline by simulating business analysts' reasoning, demonstrating superior insightfulness and domain relevance over baseline methods across diverse datasets.

Authors:Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo
Title: ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions
Abstract:
Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.
中文摘要:大型语言模型在工具使用方面表现出色,但在长期交互中表现显著不足,新推出的ToolHaystack基准测试揭示了现有模型在持续对话中的鲁棒性缺陷,这是以往评测未能发现的。
English Summary: Large language models show strong tool-use capabilities but struggle significantly in long-term interactions, as revealed by the new ToolHaystack benchmark that exposes critical gaps in their robustness not captured by previous evaluations.

Authors:Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, Chen Change Loy
Title: OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation
Abstract:
In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes the training complexity and overhead by bridging the off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a light-weight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.
中文: 本报告介绍了OpenUni,一个简单、轻量级且完全开源的基准模型,通过高效训练策略连接现有多模态大语言模型和扩散模型,以最少参数实现高质量多模态理解与生成。
English: This report introduces OpenUni, a simple, lightweight, and fully open-source baseline that unifies multimodal understanding and generation using an efficient training strategy to connect off-the-shelf models and achieves high-quality results with minimal parameters.

Authors:Ziteng Gao, Mike Zheng Shou
Title: D-AR: Diffusion via Autoregressive Models
Abstract:
This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing the tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in the streaming manner. Our pipeline naturally reveals several intriguing properties, for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures of visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR
中文: 本文提出D-AR方法,将图像扩散过程重构为标准自回归的下一令牌预测任务,通过离散令牌序列实现由粗到精的图像生成,在ImageNet基准上使用7.75亿参数Llama模型取得了2.09的FID分数。
English: This paper introduces D-AR, a novel approach that reformulates image diffusion as an autoregressive next-token prediction process, enabling coarse-to-fine image generation with properties like consistent previews and achieving 2.09 FID on ImageNet using a 775M Llama model.

Authors:Juncheol Shin, Minsang Seok, Seonggon Kim, Eunhyeok Park
Title: Merge-Friendly Post-Training Quantization for Multi-Target Domain Adaptation
Abstract:
Model merging has emerged as a powerful technique for combining task-specific weights, achieving superior performance in multi-target domain adaptation. However, when applied to practical scenarios, such as quantized models, new challenges arise. In practical scenarios, quantization is often applied to target-specific data, but this process restricts the domain of interest and introduces discretization effects, making model merging highly non-trivial. In this study, we analyze the impact of quantization on model merging through the lens of error barriers. Leveraging these insights, we propose a novel post-training quantization, HDRQ - Hessian and distant regularizing quantization - that is designed to consider model merging for multi-target domain adaptation. Our approach ensures that the quantization process incurs minimal deviation from the source pre-trained model while flattening the loss surface to facilitate smooth model merging. To our knowledge, this is the first study on this challenge, and extensive experiments confirm its effectiveness.
Chinese: 本研究提出了HDRQ,一种新颖的训练后量化方法,通过最小化与源模型的偏差并平滑损失表面,解决了量化带来的挑战,实现了多目标领域自适应中的有效模型融合。
English: This study introduces HDRQ, a novel post-training quantization method that minimizes deviation from the source model and flattens the loss surface to enable effective model merging for multi-target domain adaptation, addressing challenges posed by quantization.

Authors:Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, Santiago Zanella-Béguelin
Title: Securing AI Agents with Information-Flow Control
Abstract:
As AI agents become increasingly autonomous and capable, ensuring their security against vulnerabilities such as prompt injection becomes critical. This paper explores the use of information-flow control (IFC) to provide security guarantees for AI agents. We present a formal model to reason about the security and expressiveness of agent planners. Using this model, we characterize the class of properties enforceable by dynamic taint-tracking and construct a taxonomy of tasks to evaluate security and utility trade-offs of planner designs. Informed by this exploration, we present Fides, a planner that tracks confidentiality and integrity labels, deterministically enforces security policies, and introduces novel primitives for selectively hiding information. Its evaluation in AgentDojo demonstrates that this approach enables us to complete a broad range of tasks with security guarantees. A tutorial to walk readers through the concepts introduced in the paper can be found at https://github.com/microsoft/fides
中文: 本文提出采用信息流控制技术来防范AI代理的提示注入等安全漏洞,介绍了Fides规划器——它能在保持实用性的同时强制执行安全策略,评估表明其在多种任务中均能有效保障安全。
English: This paper proposes using information-flow control to secure AI agents against vulnerabilities like prompt injection, introducing Fides—a planner that enforces security policies while maintaining utility, with evaluation showing its effectiveness across diverse tasks.
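
Dynamic taint tracking with confidentiality and integrity labels, as mentioned in the abstract, can be illustrated with a generic sketch. This is not the Fides API or its formal model. The classes `Label` and `Tainted`, the `send_email` tool, and its policy are all hypothetical and chosen only to show how labels propagate and how a policy check can be deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    trusted: bool        # integrity: False once untrusted input has influenced the value
    readers: frozenset   # confidentiality: principals allowed to read the value

    def join(self, other):
        """Label of a value derived from two inputs: trusted only if both
        are trusted, readable only by principals allowed to read both."""
        return Label(self.trusted and other.trusted, self.readers & other.readers)


@dataclass(frozen=True)
class Tainted:
    value: str
    label: Label


def concat(a, b):
    """Any operation on labeled values propagates the joined label."""
    return Tainted(a.value + b.value, a.label.join(b.label))


def send_email(recipient, body):
    """Hypothetical tool guarded by a policy that is checked before it runs."""
    if not recipient.label.trusted:
        raise PermissionError("recipient was influenced by untrusted data")
    if recipient.value not in body.label.readers:
        raise PermissionError("recipient is not allowed to read the body")
    return f"sent to {recipient.value}"


user = Label(trusted=True, readers=frozenset({"alice", "bob"}))
web = Label(trusted=False, readers=frozenset({"alice", "bob", "mallory"}))

body = concat(Tainted("notes: ", user), Tainted("text fetched from the web", web))
print(send_email(Tainted("alice", user), body))
```

In this sketch a recipient injected through web content carries an untrusted label, so the call is rejected regardless of what the language model was persuaded to do.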

Authors:Jiaxin Bai, Wei Fan, Qi Hu, Qing Zong, Chunyang Li, Hong Ting Tsang, Hongyu Luo, Yauwai Yim, Haoyu Huang, Xiao Zhou, Feng Qin, Tianshi Zheng, Xi Peng, Xin Yao, Huiwen Yang, Leijie Wu, Yi Ji, Gong Zhang, Renhai Chen, Yangqiu Song
Title: AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
Abstract:
We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 92% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.
中文摘要:AutoSchemaKG是一个完全自主的框架,利用大语言模型从文本中构建知识图谱而无需预定义模式,通过动态模式归纳实现了与人工模式92%的语义对齐,有效增强了大型语言模型的事实性。
English Summary: AutoSchemaKG is a fully autonomous framework that uses large language models to construct knowledge graphs from text without predefined schemas, achieving high schema alignment and enhancing LLM factuality through dynamic schema induction.

Authors:Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan
Title: Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Abstract:
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
中文: Muddit作为一种统一的离散扩散变换器,通过将预训练骨干网络的强大视觉先验与轻量级文本解码器相结合,实现了快速、并行且高质量的多模态生成,在效率和性能上均优于更大的自回归模型。
English: Muddit is a unified discrete diffusion transformer that enables fast, parallel, and high-quality multimodal generation by integrating strong visual priors from a pretrained backbone with a lightweight text decoder, outperforming larger autoregressive models in efficiency and quality.

Authors:Youssef Mohamed, Noran Mohamed, Khaled Abouhashad, Feilong Tang, Sara Atito, Shoaib Jameel, Imran Razzak, Ahmed B. Zaky
Title: DeepChest: Dynamic Gradient-Free Task Weighting for Effective Multi-Task Learning in Chest X-ray Classification
Abstract:
While Multi-Task Learning (MTL) offers inherent advantages in complex domains such as medical imaging by enabling shared representation learning, effectively balancing task contributions remains a significant challenge. This paper addresses this critical issue by introducing DeepChest, a novel, computationally efficient and effective dynamic task-weighting framework specifically designed for multi-label chest X-ray (CXR) classification. Unlike existing heuristic or gradient-based methods that often incur substantial overhead, DeepChest leverages a performance-driven weighting mechanism based on effective analysis of task-specific loss trends. Given a network architecture (e.g., ResNet18), our model-agnostic approach adaptively adjusts task importance without requiring gradient access, thereby significantly reducing memory usage and achieving a threefold increase in training speed. It can be easily applied to improve various state-of-the-art methods. Extensive experiments on a large-scale CXR dataset demonstrate that DeepChest not only outperforms state-of-the-art MTL methods by 7% in overall accuracy but also yields substantial reductions in individual task losses, indicating improved generalization and effective mitigation of negative transfer. The efficiency and performance gains of DeepChest pave the way for more practical and robust deployment of deep learning in critical medical diagnostic applications. The code is publicly available at https://github.com/youssefkhalil320/DeepChest-MTL
中文: 本文提出的DeepChest框架通过动态任务加权机制,无需梯度访问即可自适应调整多标签胸片分类任务权重,在提升训练效率三倍的同时,整体准确率较现有最优方法提高7%。
English: This paper introduces DeepChest, a dynamic task-weighting framework for multi-label chest X-ray classification that enhances training efficiency and accuracy by adaptively balancing task contributions without gradient access, achieving a 7% improvement over state-of-the-art methods.

Authors:Xi Chen, Soham Jana, Christopher A. Metzler, Arian Maleki, Shirin Jalali
Title: Multilook Coherent Imaging: Theoretical Guarantees and Algorithms
Abstract:
Multilook coherent imaging is a widely used technique in applications such as digital holography, ultrasound imaging, and synthetic aperture radar. A central challenge in these systems is the presence of multiplicative noise, commonly known as speckle, which degrades image quality. Despite the widespread use of coherent imaging systems, their theoretical foundations remain relatively underexplored. In this paper, we study both the theoretical and algorithmic aspects of likelihood-based approaches for multilook coherent imaging, providing a rigorous framework for analysis and method development. Our theoretical contributions include establishing the first theoretical upper bound on the Mean Squared Error (MSE) of the maximum likelihood estimator under the deep image prior hypothesis. Our results capture the dependence of MSE on the number of parameters in the deep image prior, the number of looks, the signal dimension, and the number of measurements per look. On the algorithmic side, we employ projected gradient descent (PGD) as an efficient method for computing the maximum likelihood solution. Furthermore, we introduce two key ideas to enhance the practical performance of PGD. First, we incorporate the Newton-Schulz algorithm to compute matrix inverses within the PGD iterations, significantly reducing computational complexity. Second, we develop a bagging strategy to mitigate projection errors introduced during PGD updates. We demonstrate that combining these techniques with PGD yields state-of-the-art performance. Our code is available at https://github.com/Computational-Imaging-RU/Bagged-DIP-Speckle.
中文: 本文为多视相干成像建立了理论框架,首次推导出深度图像先验下最大似然估计的均方误差上界,并提出结合矩阵逆优化和误差抑制的投影梯度下降算法,实现了最先进的性能。
English: This paper establishes a theoretical framework for multilook coherent imaging, deriving the first MSE bound for maximum likelihood estimation under deep image priors and proposing an enhanced projected gradient descent algorithm with matrix inversion optimization and error mitigation for state-of-the-art performance.
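Note: the abstract mentions using the Newton-Schulz algorithm to compute matrix inverses inside the PGD iterations. The following is only a minimal sketch of the textbook Newton-Schulz iteration, X <- X (2I - A X), in pure Python; the function names, the starting guess, and the iteration count are illustrative assumptions and are not taken from the authors' code.

```python
def matmul(A, B):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def newton_schulz_inverse(A, num_iters=30):
    """Approximate inv(A) for a nonsingular square matrix A (hypothetical helper)."""
    n = len(A)
    # Induced 1-norm (max column sum) and infinity-norm (max row sum).
    norm_1 = max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in A)
    scale = norm_1 * norm_inf
    # Standard starting guess X0 = A^T / (||A||_1 * ||A||_inf).
    X = [[A[j][i] / scale for j in range(n)] for i in range(n)]
    for _ in range(num_iters):
        AX = matmul(A, X)
        # Form 2I - A X, then update X <- X (2I - A X).
        T = [[(2.0 if i == j else 0.0) - AX[i][j] for j in range(n)] for i in range(n)]
        X = matmul(X, T)
    return X


A = [[4.0, 1.0], [2.0, 3.0]]
X = newton_schulz_inverse(A)
```

Each step uses only matrix multiplications, which is why the iteration is attractive when an explicit inverse is needed repeatedly; how it is warm-started or truncated in the paper is not stated in the abstract.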

Authors:Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko
Title: Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
Abstract:
The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Jigsaw puzzles offer inherent ground truth, adjustable difficulty, and demand complex decision-making, making them ideal for this study. Our research reveals several key findings: Firstly, we find that MLLMs, initially performing close to random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. Secondly, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. Thirdly, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. Fourthly, we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. Finally, our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and an initial SFT cold start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece of jigsaw to the larger puzzle of collectively understanding rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.
中文摘要:本研究以拼图游戏为实验框架探索基于规则的多模态大语言模型强化学习,发现微调能使模型从接近随机猜测提升至近乎完美的准确率并泛化至复杂配置,且强化学习比监督微调具有更好的泛化效果。
English Summary: This study explores rule-based reinforcement learning in multimodal large language models using jigsaw puzzles, revealing that fine-tuning enables models to progress from random guessing to near-perfect accuracy and generalize to complex configurations, with RL outperforming supervised fine-tuning in generalization capability.

Authors:Ramit Aditya, Razvan Bunescu, Smita Nannaware, Erfan Al-Hossami
Title: Engineering Serendipity through Recommendations of Items with Atypical Aspects
Abstract:
A restaurant dinner or a hotel stay may lead to memorable experiences when guests encounter unexpected aspects that also match their interests. For example, an origami-making station in the waiting area of a restaurant may be both surprising and enjoyable for a customer who is passionate about paper crafts. Similarly, an exhibit of 18th century harpsichords would be atypical for a hotel lobby and likely pique the interest of a guest who has a passion for Baroque music. Motivated by this insight, in this paper we introduce the new task of engineering serendipity through recommendations of items with atypical aspects. We describe an LLM-based system pipeline that extracts atypical aspects from item reviews, then estimates and aggregates their user-specific utility in a measure of serendipity potential that is used to rerank a list of items recommended to the user. To facilitate system development and evaluation, we introduce a dataset of Yelp reviews that are manually annotated with atypical aspects and a dataset of artificially generated user profiles, together with crowdsourced annotations of user-aspect utility values. Furthermore, we introduce a custom procedure for dynamic selection of in-context learning examples, which is shown to improve LLM-based judgments of atypicality and utility. Experimental evaluations show that serendipity-based rankings generated by the system are highly correlated with ground truth rankings for which serendipity scores are computed from manual annotations of atypical aspects and their user-dependent utility. Overall, we hope that the new recommendation task and the associated system presented in this paper catalyze further research into recommendation approaches that go beyond accuracy in their pursuit of enhanced user satisfaction. The datasets and the code are made publicly available at https://github.com/ramituncc49er/ATARS .
中文摘要:本文提出一种新颖的推荐系统,利用大语言模型识别具有意外却符合用户兴趣特点的项目,通过计算意外惊喜潜力对推荐内容重新排序,旨在为用户创造惊喜体验。
English Summary: This paper introduces a novel recommendation system that uses LLMs to identify items with unexpected yet personally relevant aspects, aiming to create serendipitous experiences for users by reranking recommendations based on calculated serendipity potential.

Authors:Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J. Maddison, Bo Wang
Title: BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
Abstract:
Unlocking deep, interpretable biological reasoning from complex genomic data is a major AI challenge hindering scientific discovery. Current DNA foundation models, despite strong sequence representation, struggle with multi-step reasoning and lack inherent transparent, biologically intuitive explanations. We introduce BioReason, a pioneering architecture that, for the first time, deeply integrates a DNA foundation model with a Large Language Model (LLM). This novel connection enables the LLM to directly process and reason with genomic information as a fundamental input, fostering a new form of multimodal biological understanding. BioReason's sophisticated multi-step reasoning is developed through supervised fine-tuning and targeted reinforcement learning, guiding the system to generate logical, biologically coherent deductions. On biological reasoning benchmarks including KEGG-based disease pathway prediction - where accuracy improves from 88% to 97% - and variant effect prediction, BioReason demonstrates an average 15% performance gain over strong single-modality baselines. BioReason reasons over unseen biological entities and articulates decision-making through interpretable, step-by-step biological traces, offering a transformative approach for AI in biology that enables deeper mechanistic insights and accelerates testable hypothesis generation from genomic data. Data, code, and checkpoints are publicly available at https://github.com/bowang-lab/BioReason
中文: BioReason创新性地整合了DNA基础模型与大语言模型,实现了多步骤生物推理,在基准测试中性能提升15%,并提供可解释的逐步分析,推动基因组数据的深度理解和假设生成。
English: BioReason is a novel AI architecture that integrates a DNA foundation model with a Large Language Model, enabling advanced multi-step biological reasoning and achieving a 15% performance improvement on benchmarks while providing interpretable, step-by-step explanations.

Authors:Yu Li, Jin Jiang, Jianhua Zhu, Shuai Peng, Baole Wei, Yuxuan Zhou, Liangcai Gao
Title: Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
Abstract:
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER
中文: Uni-MuMER通过全微调预训练视觉语言模型,结合树感知思维链、错误驱动学习和符号计数三项任务,在手写数学表达式识别任务中实现了最先进的性能表现。
English: Uni-MuMER introduces a novel approach by fully fine-tuning a pretrained vision-language model for handwritten mathematical expression recognition, integrating three data-driven tasks to achieve state-of-the-art performance on benchmark datasets.

Authors:Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu
Title: Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
Abstract:
Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: Token-level methods (e.g., PPO) aim to provide the fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving $6$-$12$ percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving $7$-$11$ percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
中文摘要:分段策略优化(SPO)是一种新颖的强化学习框架,通过引入分段级优势估计克服了词元级和轨迹级方法的局限,在不依赖评论家模型的情况下实现了更优的推理性能。
English Summary: Segment Policy Optimization (SPO) is a novel reinforcement learning framework that introduces segment-level advantage estimation to overcome the limitations of token-level and trajectory-level methods, achieving superior reasoning performance without requiring a critic model.
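Note: the abstract describes critic-free, Monte Carlo (MC) advantage estimation at segment granularity but does not give the formula. The sketch below shows one common way such an estimate could be formed, assuming the value at each segment boundary is the mean final reward of rollouts sampled from that boundary and the segment advantage is the difference between consecutive boundary values. The function name and the input layout are hypothetical, and this is not claimed to be the SPO-chain or SPO-tree estimator.

```python
from statistics import mean


def segment_advantages(boundary_rollout_rewards):
    """Estimate one advantage per segment from MC rollouts (illustrative only).

    boundary_rollout_rewards[k] holds the final rewards of rollouts sampled
    from segment boundary k; index 0 is the prompt and the last index is the
    end of the response, so K+1 boundaries give K segments.
    """
    # MC value estimate at each boundary: average final reward of its rollouts.
    values = [mean(rewards) for rewards in boundary_rollout_rewards]
    # Advantage of segment k: value after the segment minus value before it.
    return [values[k + 1] - values[k] for k in range(len(values) - 1)]


rollouts = [[0, 1, 0, 1], [1, 1, 0, 1], [1, 1, 1, 1]]
advantages = segment_advantages(rollouts)
```

With K segments this needs K+1 value estimates, compared with one per token for token-level methods, which matches the abstract's point about requiring fewer estimation points.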

Authors:Kunlun Zhu, Jiaxun Zhang, Ziheng Qi, Nuoxing Shang, Zijia Liu, Peixuan Han, Yue Su, Haofei Yu, Jiaxuan You
Title: SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents
Abstract:
Recent advancements in large language model (LLM) agents have significantly accelerated scientific discovery automation, yet concurrently raised critical ethical and safety concerns. To systematically address these challenges, we introduce SafeScientist, an innovative AI scientist framework explicitly designed to enhance safety and ethical responsibility in AI-driven scientific exploration. SafeScientist proactively refuses ethically inappropriate or high-risk tasks and rigorously emphasizes safety throughout the research process. To achieve comprehensive safety oversight, we integrate multiple defensive mechanisms, including prompt monitoring, agent-collaboration monitoring, tool-use monitoring, and an ethical reviewer component. Complementing SafeScientist, we propose SciSafetyBench, a novel benchmark specifically designed to evaluate AI safety in scientific contexts, comprising 240 high-risk scientific tasks across 6 domains, alongside 30 specially designed scientific tools and 120 tool-related risk tasks. Extensive experiments demonstrate that SafeScientist significantly improves safety performance by 35% compared to traditional AI scientist frameworks, without compromising scientific output quality. Additionally, we rigorously validate the robustness of our safety pipeline against diverse adversarial attack methods, further confirming the effectiveness of our integrated approach. The code and data will be available at https://github.com/ulab-uiuc/SafeScientist. Warning: this paper contains example data that may be offensive or harmful.
中文摘要:SafeScientist框架通过整合多重防御机制并拒绝不道德任务,显著提升AI科研安全性达35%且不影响研究质量,其有效性已通过SciSafetyBench基准和对抗性测试验证。
English Summary: The SafeScientist framework enhances AI-driven scientific safety by integrating multiple defensive mechanisms and refusing unethical tasks, achieving 35% higher safety performance while maintaining research quality, with its effectiveness validated through SciSafetyBench and adversarial testing.

Authors:Xu Chu, Xinrong Chen, Guanyu Wang, Zhijie Tan, Kui Huang, Wenyu Lv, Tong Mo, Weiping Li
Title: Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information
Abstract:
Inference time scaling drives extended reasoning to enhance the performance of Vision-Language Models (VLMs), thus forming powerful Vision-Language Reasoning Models (VLRMs). However, long reasoning dilutes visual tokens, causing visual information to receive less attention and potentially triggering hallucinations. Although introducing text-only reflection processes shows promise in language models, we demonstrate that it is insufficient to suppress hallucinations in VLMs. To address this issue, we introduce Qwen-LookAgain (Qwen-LA), a novel VLRM designed to mitigate hallucinations by incorporating a vision-text reflection process that guides the model to re-attend to visual information during reasoning. We first propose a reinforcement learning method, Balanced Reflective Policy Optimization (BRPO), which guides the model to decide on its own when to generate a vision-text reflection and to balance the number and length of reflections. Then, we formally prove that VLRMs lose attention to visual tokens as reasoning progresses, and demonstrate that supplementing visual information during reflection enhances visual attention. Therefore, during training and inference, Visual Token COPY and Visual Token ROUTE are introduced to force the model to re-attend to visual information at the visual level, addressing the limitations of text-only reflection. Experiments on multiple visual QA datasets and hallucination metrics indicate that Qwen-LA achieves leading accuracy while reducing hallucinations. Our code is available at: https://github.com/Liar406/Look_Again
中文摘要:Qwen-LookAgain是一种新型视觉语言推理模型,通过引入视觉-文本反思过程和强化学习机制,引导模型在推理过程中重新关注视觉信息,从而有效减少幻觉现象,在多项视觉问答基准测试中取得领先准确率。
English Summary: Qwen-LookAgain is a novel vision-language reasoning model that mitigates hallucinations by incorporating a vision-text reflection process and reinforcement learning to guide the model in re-attending to visual information during reasoning, achieving leading accuracy on multiple benchmarks.

Authors:Wei Jie Yeo, Nirmalendu Prakash, Clement Neo, Roy Ka-Wei Lee, Erik Cambria, Ranjan Satapathy
Title: Understanding Refusal in Language Models with Sparse Autoencoders
Abstract:
Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationships and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing generalization for linear probes to out-of-distribution adversarial samples in classification tasks. We open-source our code at https://github.com/wj210/refusal_sae.
Chinese: 本研究利用稀疏自编码器识别并验证了对齐语言模型中因果介导拒绝行为的潜在特征,从而实现了对拒绝机制的精细分析及其在提升泛化能力和理解对抗性越狱中的应用。
English: This study uses sparse autoencoders to identify and validate latent features that causally mediate refusal behaviors in aligned language models, enabling fine-grained analysis of refusal mechanisms and their applications in improving generalization and understanding adversarial jailbreaking.

Authors:Yunqiao Yang, Houxing Ren, Zimu Lu, Ke Wang, Weikang Shi, Aojun Zhou, Junting Pan, Mingjie Zhan, Hongsheng Li
Title: Probability-Consistent Preference Optimization for Enhanced LLM Reasoning
Abstract:
Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
中文: 提出的概率一致性偏好优化(PCPO)框架通过结合答案正确性和标记级概率一致性来增强大语言模型的数学推理能力,在多种模型和基准测试中均优于现有方法。
English: The proposed Probability-Consistent Preference Optimization (PCPO) framework enhances mathematical reasoning in LLMs by incorporating both answer correctness and token-level probability consistency, outperforming existing methods across various models and benchmarks.

Authors:Jiahao Cui, Yan Chen, Mingwang Xu, Hanlin Shang, Yuxuan Chen, Yun Zhan, Zilong Dong, Yao Yao, Jingdong Wang, Siyu Zhu
Title: Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization
Abstract:
Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet and DiT-based portrait diffusion approaches, and experiments demonstrate clear improvements in lip-audio synchronization, expression vividness, and body motion coherence over baseline methods, alongside notable gains in human preference metrics. Our model and source code can be found at: https://github.com/xyz123xyz456/hallo4.
中文: 本文提出了一种基于人类偏好的扩散框架,通过优化感知对齐和解决时空分辨率不匹配问题,显著提升了肖像动画的唇音同步精度和运动连贯性。
English: This paper introduces a human-preference-aligned diffusion framework that enhances portrait animation by optimizing perceptual alignment and resolving spatiotemporal mismatches, achieving superior lip synchronization and motion fidelity.

Authors:Liyun Zhu, Qixiang Chen, Xi Shen, Xiaodong Cun
Title: VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning
Abstract:
Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). Besides, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Empirical results show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at https://github.com/GVCLab/VAU-R1.
中文摘要:VAU-R1框架基于多模态大语言模型和强化微调技术,显著提升了视频异常推理的准确性和可解释性,同时VAU-Bench基准测试为该领域提供了首个思维链评估标准。
English Summary: The VAU-R1 framework, leveraging Multimodal Large Language Models and Reinforcement Fine-Tuning, significantly enhances video anomaly reasoning accuracy and interpretability, while the VAU-Bench benchmark provides comprehensive evaluation tools for this domain.

Authors:Shi-Xue Zhang, Hongfa Wang, Duojun Huang, Xin Li, Xiaobin Zhu, Xu-Cheng Yin
Title: VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation
Abstract:
Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics (Accuracy (AR), Inconsistency Rate (IR), Coverage Rate (CR)), and an automated evaluation pipeline leveraging a large language model (LLM) to verify caption quality via contrastive QA-pair analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and code are available at: https://github.com/GXYM/VCapsBench.
中文摘要:VCapsBench作为首个大规模细粒度视频描述评估基准,包含5,677个视频和109,796组问答对,通过21个维度的系统标注和自动化评估指标,为文本生成视频任务提供可操作的优化指导。
English Summary: VCapsBench is introduced as the first large-scale fine-grained benchmark for evaluating video captions, featuring 5,677 videos and 109,796 QA-pairs across 21 dimensions to enhance text-to-video generation through actionable insights and automated metrics.

Authors:Ron Shapira Weber, Shahar Ben Ishay, Andrey Lavrinenko, Shahaf E. Finder, Oren Freifeld
Title: TimePoint: Accelerated Time Series Alignment via Self-Supervised Keypoint and Descriptor Learning
Abstract:
Fast and scalable alignment of time series is a fundamental challenge in many domains. The standard solution, Dynamic Time Warping (DTW), struggles with poor scalability and sensitivity to noise. We introduce TimePoint, a self-supervised method that dramatically accelerates DTW-based alignment while typically improving alignment accuracy by learning keypoints and descriptors from synthetic data. Inspired by 2D keypoint detection but carefully adapted to the unique challenges of 1D signals, TimePoint leverages efficient 1D diffeomorphisms, which effectively model nonlinear time warping, to generate realistic training data. This approach, along with fully convolutional and wavelet convolutional architectures, enables the extraction of informative keypoints and descriptors. Applying DTW to these sparse representations yields major speedups and typically higher alignment accuracy than standard DTW applied to the full signals. TimePoint demonstrates strong generalization to real-world time series when trained solely on synthetic data, and further improves with fine-tuning on real data. Extensive experiments demonstrate that TimePoint consistently achieves faster and more accurate alignments than standard DTW, making it a scalable solution for time-series analysis. Our code is available at https://github.com/BGU-CS-VIL/TimePoint
Chinese: TimePoint是一种自监督方法,通过从合成数据中学习关键点来加速动态时间规整(DTW)对齐,相比标准DTW实现了更快速度和更高精度。
English: TimePoint is a self-supervised method that accelerates Dynamic Time Warping (DTW) alignment by learning keypoints from synthetic data, achieving faster and more accurate results than standard DTW.
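Note: for readers unfamiliar with the baseline, the following is a minimal sketch of textbook Dynamic Time Warping on scalar sequences. It is not the TimePoint method; the function name and the absolute-difference local cost are assumptions chosen for illustration.

```python
def dtw_distance(x, y):
    """Classic DTW cost between two scalar sequences via dynamic programming."""
    n, m = len(x), len(y)
    inf = float("inf")
    # D[i][j] = minimal cumulative cost of aligning x[:i] with y[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])  # local cost (assumed choice)
            # Extend the cheapest of: step in x, step in y, or step in both.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

The table has (n+1) x (m+1) cells, so the cost grows with the product of the two sequence lengths; this is consistent with the abstract's claim that running DTW on sparse keypoint representations instead of full signals yields large speedups.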

Authors:Jun Yang, Cheng-Chi Wang, Bogdan Alexandru Stoica, Kexin Pei
Title: Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency
Abstract:
Large Language Models (LLMs) have been increasingly used to optimize code efficiency. Evaluating their effectiveness and suggesting further optimization opportunities often relies on high-quality tests that expose the performance bottlenecks present in the program. However, existing approaches rely on a limited set of hand-curated inputs or on uninteresting, length-stressing LLM-generated tests, failing to reveal more nuanced optimization opportunities. We present WEDGE, a framework for generating performance-stressing inputs given the program under test. WEDGE synthesizes explicit performance-characterizing constraints in the form of branch conditions to partition the program's execution space into performance-specific regions. When integrated with a coverage-guided fuzzer, reaching different regions introduces explicit rewards for test generation to explore inefficient implementations. Our evaluation shows that WEDGE's tests induce a significantly larger slowdown than the tests in CodeContests and those claimed to be optimized by existing approaches. From the utility perspective, integrating our tests substantially improves the existing code optimization approaches that rely on test-driven execution feedback. We release PERFFORGE, the performance tests generated by WEDGE, to benchmark future approaches for efficient code generation at https://github.com/UChiSeclab/perfforge.
中文:WEDGE框架通过合成分支条件生成性能压力测试,以探索低效代码区域,显著改进了代码优化方法并优于现有技术。
English: WEDGE is a framework that generates performance-stressing tests by synthesizing branch conditions to explore inefficient code regions, significantly improving code optimization approaches and outperforming existing methods.

Authors:Zhuodong Li, Fei Hou, Wencheng Wang, Xuequan Lu, Ying He
Title: A Divide-and-Conquer Approach for Global Orientation of Non-Watertight Scene-Level Point Clouds Using 0-1 Integer Optimization
Abstract:
Orienting point clouds is a fundamental problem in computer graphics and 3D vision, with applications in reconstruction, segmentation, and analysis. While significant progress has been made, existing approaches mainly focus on watertight, object-level 3D models. The orientation of large-scale, non-watertight 3D scenes remains an underexplored challenge. To address this gap, we propose DACPO (Divide-And-Conquer Point Orientation), a novel framework that leverages a divide-and-conquer strategy for scalable and robust point cloud orientation. Rather than attempting to orient an unbounded scene at once, DACPO segments the input point cloud into smaller, manageable blocks, processes each block independently, and integrates the results through a global optimization stage. For each block, we introduce a two-step process: estimating initial normal orientations by a randomized greedy method and refining them by an adapted iterative Poisson surface reconstruction. To achieve consistency across blocks, we model inter-block relationships using an undirected graph, where nodes represent blocks and edges connect spatially adjacent blocks. To reliably evaluate orientation consistency between adjacent blocks, we introduce the concept of the visible connected region, which defines the region over which visibility-based assessments are performed. The global integration is then formulated as a 0-1 integer-constrained optimization problem, with block flip states as binary variables. Despite the combinatorial nature of the problem, DACPO remains scalable by limiting the number of blocks (typically a few hundred for 3D scenes) involved in the optimization. Experiments on benchmark datasets demonstrate DACPO's strong performance, particularly in challenging large-scale, non-watertight scenarios where existing methods often fail. The source code is available at https://github.com/zd-lee/DACPO.
中文: DACPO是一种新颖的分治框架,通过将大规模非封闭点云分割为可处理块进行局部定向,再采用基于图的全局优化确保一致性,有效解决了复杂三维场景定向的难题。
English: DACPO is a novel divide-and-conquer framework that segments large-scale non-watertight point clouds into manageable blocks for local orientation processing, then achieves global consistency through graph-based optimization to address the challenging orientation of complex 3D scenes.
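Note: the abstract formulates global integration as a 0-1 problem with one binary flip variable per block, but it does not state the objective or the solver. The toy sketch below assumes each graph edge carries a weight and a flag saying whether the two adjacent blocks currently agree in orientation, and it minimizes the total weight of edges left inconsistent by exhaustive search. The function name, the edge format, and the brute-force search are all illustrative assumptions; exhaustive search is only feasible for a handful of blocks, not the few hundred mentioned in the abstract.

```python
from itertools import product


def solve_block_flips(num_blocks, edges):
    """Pick a 0/1 flip per block minimizing the weight of inconsistent edges.

    edges: list of (i, j, weight, agree), where agree is True if blocks i and j
    are currently oriented consistently with each other (assumed input format).
    """
    best_cost, best_flips = float("inf"), None
    for flips in product((0, 1), repeat=num_blocks):
        cost = 0.0
        for i, j, weight, agree in edges:
            same_flip = flips[i] == flips[j]
            # Agreeing blocks stay consistent only if both or neither flip;
            # disagreeing blocks become consistent only if exactly one flips.
            consistent = same_flip if agree else not same_flip
            if not consistent:
                cost += weight
        if cost < best_cost:
            best_cost, best_flips = cost, flips
    return best_flips, best_cost


edges = [(0, 1, 1.0, False), (1, 2, 1.0, True), (0, 2, 0.5, False)]
flips, cost = solve_block_flips(3, edges)
```

Flipping every block at once leaves all edges unchanged, so solutions come in complementary pairs; the search above simply returns the first minimizer it encounters.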

Authors:Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang
Title: SWE-bench Goes Live!
Abstract:
The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
中文: SWE-bench-Live作为可实时更新的基准被提出,旨在克服静态基准的局限,通过自动化流程和Docker环境实现可复现评估,用于测试大语言模型处理真实GitHub问题的能力。
English: SWE-bench-Live is introduced as a live-updatable benchmark to address the limitations of static benchmarks like SWE-bench, featuring automated curation and Docker-based reproducibility for evaluating LLMs on real GitHub issues.

Authors:Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song
Title: KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Abstract:
Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by $3$-$4\times$ and FlashAttention decoding latency by approximately $2\times$, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1, Qwen2.5, and Gemma3, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.
English Summary: KVzip is a query-agnostic KV cache eviction method that reduces memory overhead and attention latency by compressing and selectively removing less important key-value pairs, achieving up to 4× cache reduction and 2× faster decoding with minimal performance loss across diverse tasks.
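
The eviction step described in the KVzip abstract (drop the KV pairs with lower importance) can be illustrated with a minimal sketch. It assumes the per-pair importance scores are already available; how KVzip derives them from context reconstruction is not reproduced here. The function name `evict_kv`, the `keep_ratio` parameter, and the example scores are hypothetical and not taken from the paper or its code.

```python
def evict_kv(importance, keep_ratio):
    """Return the positions of the KV pairs to keep, in original order.

    importance: one score per cached KV pair (higher = more important).
    keep_ratio: fraction of the cache to retain, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(importance) * keep_ratio))
    # Rank positions by importance, highest first, and keep the top n_keep.
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    # Restore positional order so the surviving cache stays sequential.
    return sorted(ranked[:n_keep])


# Hypothetical scores for an 8-entry cache; keep 25% of it (a 4x reduction).
scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.7, 0.2, 0.3]
kept = evict_kv(scores, keep_ratio=0.25)
print(kept)
```

Because the scores in this sketch do not depend on any query, the same `kept` positions could be reused across different queries, which is the property the abstract calls query-agnostic.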

Authors:Weijia Mao, Zhenheng Yang, Mike Zheng Shou
Title: UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
Abstract:
Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only several additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released in https://github.com/showlab/UniRL.
中文: UniRL是一种自我改进的后训练方法,它让多模态模型无需外部数据即可生成图像进行训练,不仅提升生成与理解任务的表现,还缩小了二者之间的性能差距。
English: UniRL is a self-improving post-training method that enables multimodal models to generate images for training without external data, improving both generation and understanding tasks while reducing their performance gap.

Authors:Hongzhan Chen, Tao Yang, Shiping Gao, Ruijun Chen, Xiaojun Quan, Hongtao Tian, Ting Yao
Title: Discriminative Policy Optimization for Token-Level Reward Models
Abstract:
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM.
中文: Q-RM模型通过将奖励建模与语言生成解耦来优化细粒度奖励分配,在多项基准测试中均显著超越基线方法,有效提升推理能力和训练效率。
English: The Q-RM model, which decouples reward modeling from language generation to optimize token-level credit assignment, consistently outperforms baseline methods in enhancing reasoning capabilities and training efficiency across various benchmarks.

Authors:Alexandra G. Roberts, Ha M. Luu, Mert Şişman, Alexey V. Dimov, Ceren Tozlu, Ilhami Kovanlikaya, Susan A. Gauthier, Thanh D. Nguyen, Yi Wang
Title: Synthetic Generation and Latent Projection Denoising of Rim Lesions in Multiple Sclerosis
Abstract:
Quantitative susceptibility maps from magnetic resonance images can provide both prognostic and diagnostic information in multiple sclerosis, a neurodegenerative disease characterized by the formation of lesions in white matter brain tissue. In particular, susceptibility maps provide adequate contrast to distinguish between "rim" lesions, surrounded by deposited paramagnetic iron, and "non-rim" lesion types. These paramagnetic rim lesions (PRLs) are an emerging biomarker in multiple sclerosis. Much effort has been devoted to both detection and segmentation of such lesions to monitor longitudinal change. As paramagnetic rim lesions are rare, addressing this problem requires confronting the class imbalance between rim and non-rim lesions. We produce synthetic quantitative susceptibility maps of paramagnetic rim lesions and show that inclusion of such synthetic data improves classifier performance and provide a multi-channel extension to generate accompanying contrasts and probabilistic segmentation maps. We exploit the projection capability of our trained generative network to demonstrate a novel denoising approach that allows us to train on ambiguous rim cases and substantially increase the minority class. We show that both synthetic lesion synthesis and our proposed rim lesion label denoising method best approximate the unseen rim lesion distribution and improve detection in a clinically interpretable manner. We release our code and generated data at https://github.com/agr78/PRLx-GAN upon publication.
中文摘要:定量磁化率图通过生成合成数据和采用新型去噪方法,改进了多发性硬化症中顺磁性边缘病变的检测与分割,从而提升了分类器性能和临床可解释性。
English Summary: Quantitative susceptibility maps enhance the detection and segmentation of paramagnetic rim lesions in multiple sclerosis by generating synthetic data and employing a novel denoising method, which improves classifier performance and clinical interpretability.

Authors:Maya Dewhurst, Jack Collins, Justin J. H. Lo, Roy Alderton, Sam Kirkham
Title: Nosey: Open-source hardware for acoustic nasalance
Abstract:
We introduce Nosey (Nasalance Open Source Estimation sYstem), a low-cost, customizable, 3D-printed system for recording acoustic nasalance data that we have made available as open-source hardware (http://github.com/phoneticslab/nosey). We first outline the motivations and design principles behind our hardware nasalance system, and then present a comparison between Nosey and a commercial nasalance device. Nosey shows consistently higher nasalance scores than the commercial device, but the magnitude of contrast between phonological environments is comparable between systems. We also review ways of customizing the hardware to facilitate testing, such as comparison of microphones and different construction materials. We conclude that Nosey is a flexible and cost-effective alternative to commercial nasometry devices and propose some methodological considerations for its use in data collection.
中文: Nosey是一种开源、可3D打印的鼻音度记录系统,作为商业设备的低成本、可定制替代方案,尽管其鼻音度得分整体较高,但在区分语音环境对比方面表现出与商业设备相当的性能。
English: Nosey is an open-source, 3D-printed nasalance recording system that offers a cost-effective and customizable alternative to commercial devices, demonstrating comparable performance in distinguishing phonological contrasts despite higher overall nasalance scores.

Authors:Weizhe Kong, Xiao Wang, Ruichong Gao, Chenglong Li, Yu Zhang, Xing Yang, Yaowei Wang, Jin Tang
Title: Adversarial Semantic and Label Perturbation Attack for Pedestrian Attribute Recognition
Abstract:
Pedestrian Attribute Recognition (PAR) is an indispensable task in human-centered research and has made great progress in recent years with the development of deep neural networks. However, its potential vulnerability and anti-interference ability have not yet been fully explored. To bridge this gap, this paper proposes the first adversarial attack and defense framework for pedestrian attribute recognition. Specifically, we exploit both global- and patch-level attacks on the pedestrian images, based on the pre-trained CLIP-based PAR framework. It first divides the input pedestrian image into non-overlapping patches and embeds them into feature embeddings using a projection layer. Meanwhile, the attribute set is expanded into sentences using prompts and embedded into attribute features using a pre-trained CLIP text encoder. A multi-modal Transformer is adopted to fuse the obtained vision and text tokens, and a feed-forward network is utilized for attribute recognition. Based on the aforementioned PAR framework, we adopt adversarial semantic and label perturbation to generate the adversarial noise, termed ASL-PAR. We also design a semantic offset defense strategy to suppress the influence of adversarial attacks. Extensive experiments conducted on both digital domains (i.e., PETA, PA100K, MSP60K, RAPv2) and physical domains fully validate the effectiveness of our proposed adversarial attack and defense strategies for pedestrian attribute recognition. The source code of this paper will be released at https://github.com/Event-AHU/OpenPAR.
中文摘要:本文针对行人属性识别首次提出对抗攻击与防御框架,通过多模态Transformer和CLIP嵌入技术开发攻击策略及语义偏移防御方法,在数字与物理场景下的实验验证了其有效性。
English Summary: This paper introduces the first adversarial attack and defense framework for pedestrian attribute recognition, utilizing multimodal Transformers and CLIP-based embeddings to develop both attack strategies and a semantic offset defense method validated across digital and physical domains.

Authors:James Xu Zhao, Jimmy Z. J. Liu, Bryan Hooi, See-Kiong Ng
Title: How Does Response Length Affect Long-Form Factuality
Abstract:
Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.
中文: 本研究揭示大型语言模型的生成长度与事实准确性呈负相关,主要原因是知识耗尽导致可靠信息逐渐减少,而非错误传播或长上下文问题。
English: This study reveals that longer responses from large language models exhibit lower factual precision due to facts exhaustion, where the model gradually depletes its reliable knowledge, rather than error propagation or long context issues.

Authors:Xinye Li, Zunwen Zheng, Qian Zhang, Dekai Zhuang, Jiabao Kang, Liyan Xu, Qingbin Liu, Xi Chen, Zhiying Tu, Dianhui Chu, Dianbo Sui
Title: ScEdit: Script-based Assessment of Knowledge Editing
Abstract:
Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark -- ScEdit (Script-based Knowledge Editing Benchmark) -- which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based ("What"-type question) evaluation to action-based ("How"-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task. Our benchmark is available at https://github.com/asdfo123/ScEdit.
Chinese: 当前知识编辑方法在简单任务上表现优异,但在实际应用中面临挑战,为此我们推出了ScEdit新基准,综合评估事实性和行动性编辑,发现所有方法性能均下降。
English: Current knowledge editing methods perform well on simple tasks but struggle in real-world applications, prompting the introduction of ScEdit, a new benchmark that evaluates both factual and action-based edits and reveals performance drops across all methods.

Authors:Hao Li, Ju Dai, Xin Zhao, Feng Zhou, Junjun Pan, Lei Li
Title: Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation
Abstract:
In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, due to the prevalence of phonetically similar syllables with distinct lip shapes in language, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to the averaging effect in subsequent lip motion generation. To address this issue, this paper proposes a plug-and-play semantic decorrelation module, Wav2Sem. This module extracts semantic features corresponding to the entire audio sequence, leveraging the added semantic information to decorrelate audio encodings within the feature space, thereby achieving more expressive audio features. Extensive experiments across multiple speech-driven models indicate that the Wav2Sem module effectively decouples audio features, significantly alleviating the averaging effect of phonetically similar syllables in lip shape generation, thereby enhancing the precision and naturalness of facial animations. Our source code is available at https://github.com/wslh852/Wav2Sem.git.
中文: 本文提出Wav2Sem即插即用模块,通过语义信息解耦音频特征,有效缓解语音驱动面部动画中近音节的耦合效应,从而提升唇形生成的精确度与自然度。
English: This paper introduces Wav2Sem, a plug-and-play module that enhances 3D speech-driven facial animation by decorrelating audio features with semantic information, effectively reducing the averaging effect of phonetically similar syllables for more precise and natural lip movements.

Authors:Chuandong Liu, Huijiao Wang, Lei Yu, Gui-Song Xia
Title: Holistic Large-Scale Scene Reconstruction via Mixed Gaussian Splatting
Abstract:
Recent advances in 3D Gaussian Splatting have shown remarkable potential for novel view synthesis. However, most existing large-scale scene reconstruction methods rely on the divide-and-conquer paradigm, which often leads to the loss of global scene information and requires complex parameter tuning due to scene partitioning and local optimization. To address these limitations, we propose MixGS, a novel holistic optimization framework for large-scale 3D scene reconstruction. MixGS models the entire scene holistically by integrating camera pose and Gaussian attributes into a view-aware representation, which is decoded into fine-detailed Gaussians. Furthermore, a novel mixing operation combines decoded and original Gaussians to jointly preserve global coherence and local fidelity. Extensive experiments on large-scale scenes demonstrate that MixGS achieves state-of-the-art rendering quality and competitive speed, while significantly reducing computational requirements, enabling large-scale scene reconstruction training on a single 24GB VRAM GPU. The code will be released at https://github.com/azhuantou/MixGS.
Chinese: 提出的MixGS框架通过视图感知表示和混合操作,整体优化相机位姿与高斯属性,解决了大规模3D场景重建中的全局信息丢失问题,在单GPU上以更低计算成本实现了最优渲染质量。
English: The proposed MixGS framework overcomes limitations in large-scale 3D scene reconstruction by holistically optimizing camera poses and Gaussian attributes through a view-aware representation and mixing operation, achieving state-of-the-art rendering quality with reduced computational demands on a single GPU.

Authors:Yong Zhang, Yanwen Huang, Ning Cheng, Yang Guo, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao
Title: Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective
Abstract:
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5$\times$ compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: https://github.com/yzhangchuck/Sentinel.
English Summary: Sentinel introduces a lightweight framework that uses attention signals from a small proxy LLM to compress retrieved passages effectively, achieving high compression ratios while maintaining QA performance without needing dedicated training.

Authors:Mao-Lin Luo, Zi-Hao Zhou, Tong Wei, Min-Ling Zhang
Title: LADA: Scalable Label-Specific CLIP Adapter for Continual Learning
Abstract:
Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging its transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using a partial set of parameters. This requires selecting the expected parameters for input images during inference, which is prone to error that degrades performance. To address this problem, we introduce LADA (Label-specific ADApter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, preventing their features from being interfered with by new classes. Positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The implementation code is available at https://github.com/MaolinLuo/LADA.
中文: LADA通过向冻结的CLIP图像编码器添加标签特定记忆单元,实现判别性特征生成,并利用特征蒸馏防止灾难性遗忘,在持续学习中达到最优性能。
English: LADA introduces label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation and preventing catastrophic forgetting through feature distillation, achieving state-of-the-art continual learning performance.

Authors:Yixun Liang, Kunming Luo, Xiao Chen, Rui Chen, Hongyu Yan, Weiyu Li, Jiarui Liu, Ping Tan
Title: UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
Abstract:
We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based inpainting to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we propose to bypass the limitations of UV mapping by operating directly in a unified 3D functional space. Specifically, we first lift texture generation into 3D space via Texture Functions (TFs), a continuous, volumetric representation that maps any 3D point to a texture value based solely on surface proximity, independent of mesh topology. Then, we propose to predict these TFs directly from images and geometry inputs using a transformer-based Large Texturing Model (LTM). To further enhance texture quality and leverage powerful 2D priors, we develop an advanced LoRA-based strategy for efficiently adapting large-scale Diffusion Transformers (DiTs) for high-quality multi-view texture synthesis as our first stage. Extensive experiments demonstrate that UniTEX achieves superior visual quality and texture integrity compared to existing approaches, offering a generalizable and scalable solution for automated 3D texture generation. Code will be available at: https://github.com/YixunLiang/UniTEX.
中文: UniTEX提出了一种新颖的两阶段框架,通过在统一的三维功能空间中直接操作,利用纹理函数和基于Transformer的大型纹理模型绕过传统UV映射的限制,相比现有方法实现了更优越的纹理生成质量。
English: UniTEX introduces a novel two-stage framework for generating high-quality 3D textures by operating directly in a unified 3D functional space, bypassing traditional UV mapping limitations through Texture Functions and a transformer-based Large Texturing Model, achieving superior results compared to existing methods.

Authors:Wanfu Gao, Jun Gao, Qingqi Han, Hanlin Pan, Kunpeng Liu
Title: Graph Random Walk with Feature-Label Space Alignment: A Multi-Label Feature Selection Method
Abstract:
The rapid growth in feature dimension may introduce implicit associations between features and labels in multi-label datasets, making the relationships between features and labels increasingly complex. Moreover, existing methods often adopt low-dimensional linear decomposition to explore the associations between features and labels. However, linear decomposition struggles to capture complex nonlinear associations and may lead to misalignment between the feature space and the label space. To address these two critical challenges, we propose innovative solutions. First, we design a random walk graph that integrates feature-feature, label-label, and feature-label relationships to accurately capture nonlinear and implicit indirect associations, while optimizing the latent representations of associations between features and labels after low-rank decomposition. Second, we align the variable spaces by leveraging low-dimensional representation coefficients, while preserving the manifold structure between the original high-dimensional multi-label data and the low-dimensional representation space. Extensive experiments and ablation studies conducted on seven benchmark datasets and three representative datasets using various evaluation metrics demonstrate the superiority of the proposed method. Code: https://github.com/Heilong623/-GRW-
中文总结:该方法通过设计融合特征与标签关系的随机游走图来捕捉非线性关联,并利用低维表示系数对齐变量空间,有效解决了多标签数据中特征与标签的复杂关联问题。
English Summary: The proposed method addresses complex nonlinear feature-label associations in multi-label datasets by designing a random walk graph that captures implicit relationships and aligns variable spaces while preserving manifold structures.

Authors:Wenhao Xu, Shuchen Zheng, Changwei Wang, Zherui Zhang, Chuan Ren, Rongtao Xu, Shibiao Xu
Title: SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection
Abstract:
Infrared small target detection (ISTD) is vital for long-range surveillance in military, maritime, and early warning applications. ISTD is challenged by targets occupying less than 0.15% of the image and low distinguishability from complex backgrounds. Existing deep learning methods often suffer from information loss during downsampling and inefficient global context modeling. This paper presents SAMamba, a novel framework integrating SAM2's hierarchical feature learning with Mamba's selective sequence modeling. Key innovations include: (1) A Feature Selection Adapter (FS-Adapter) for efficient natural-to-infrared domain adaptation via dual-stage selection (token-level with a learnable task embedding and channel-wise adaptive transformations); (2) A Cross-Channel State-Space Interaction (CSI) module for efficient global context modeling with linear complexity using selective state space modeling; and (3) A Detail-Preserving Contextual Fusion (DPCF) module that adaptively combines multi-scale features with a gating mechanism to balance high-resolution and low-resolution feature contributions. SAMamba addresses core ISTD challenges by bridging the domain gap, maintaining fine-grained details, and efficiently modeling long-range dependencies. Experiments on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST datasets show SAMamba significantly outperforms state-of-the-art methods, especially in challenging scenarios with heterogeneous backgrounds and varying target scales. Code: https://github.com/zhengshuchen/SAMamba.
Chinese: SAMamba是一种创新框架,融合了分层特征学习与选择性序列建模,通过弥合领域差异、保留细节并高效建模长程依赖,解决了红外小目标检测的核心难题,在基准数据集上实现了卓越性能。
English: SAMamba is a novel framework that combines hierarchical feature learning with selective sequence modeling to address infrared small target detection challenges by bridging domain gaps, preserving details, and efficiently modeling long-range dependencies, achieving superior performance on benchmark datasets.

Authors:Aldino Rizaldy, Richard Gloaguen, Fabian Ewald Fassnacht, Pedram Ghamisi
Title: HyperPointFormer: Multimodal Fusion in 3D Space with Dual-Branch Cross-Attention Transformers
Abstract:
Multimodal remote sensing data, including spectral and lidar or photogrammetry, is crucial for achieving satisfactory land-use / land-cover classification results in urban scenes. So far, most studies have been conducted in a 2D context. When 3D information is available in the dataset, it is typically integrated with the 2D data by rasterizing the 3D data into 2D formats. Although this method yields satisfactory classification results, it falls short in fully exploiting the potential of 3D data by restricting the model's ability to learn 3D spatial features directly from raw point clouds. Additionally, it limits the generation of 3D predictions, as the dimensionality of the input data has been reduced. In this study, we propose a fully 3D-based method that fuses all modalities within the 3D point cloud and employs a dedicated dual-branch Transformer model to simultaneously learn geometric and spectral features. To enhance the fusion process, we introduce a cross-attention-based mechanism that fully operates on 3D points, effectively integrating features from various modalities across multiple scales. The purpose of cross-attention is to allow one modality to assess the importance of another by weighing the relevant features. We evaluated our method by comparing it against both 3D and 2D methods using the 2018 IEEE GRSS Data Fusion Contest (DFC2018) dataset. Our findings indicate that 3D fusion delivers competitive results compared to 2D methods and offers more flexibility by providing 3D predictions. These predictions can be projected onto 2D maps, a capability that is not feasible in reverse. Additionally, we evaluated our method on different datasets, specifically the ISPRS Vaihingen 3D and the IEEE 2019 Data Fusion Contest. Our code will be published here: https://github.com/aldinorizaldy/hyperpointformer.
中文摘要:本研究提出了一种全三维方法,通过多模态遥感数据和双分支Transformer直接从点云学习几何与光谱特征,实现了具有竞争力的分类结果并能灵活生成三维预测。
English Summary: This study introduces a fully 3D-based method using multimodal remote sensing data and a dual-branch Transformer to learn geometric and spectral features directly from point clouds, achieving competitive classification results and enabling flexible 3D predictions.
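
The role of cross-attention described in the HyperPointFormer abstract (one modality weighing the relevant features of another) can be sketched with plain scaled dot-product attention. This is a generic illustration, not the paper's dual-branch architecture: it omits learned projections, multiple heads, and multi-scale fusion, and the feature values, dimensions, and the name `cross_attention` are assumptions chosen for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries from one modality attend
    to keys/values from the other. Returns (outputs, attention weights)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query point to every key point.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one channel at a time.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Hypothetical toy data: 3 points with 2-dim geometric features (queries)
# and spectral features split into 2-dim keys and 3-dim values.
geom = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
spec_keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
spec_values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

fused, attn = cross_attention(geom, spec_keys, spec_values)
```

Each row of `attn` is a probability distribution over the spectral points, so every fused feature is a convex combination of spectral value vectors, weighted by how well they match the geometric query.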

Authors:Lifan Zhao, Yanyan Shen, Zhaoyang Liu, Xue Wang, Jiaji Deng
Title: Less is More: Unlocking Specialization of Time Series Foundation Models via Structured Pruning
Abstract:
Scaling laws motivate the development of Time Series Foundation Models (TSFMs) that pre-train vast parameters and achieve remarkable zero-shot forecasting performance. Surprisingly, even after fine-tuning, TSFMs cannot consistently outperform smaller, specialized models trained on full-shot downstream data. A key question is how to realize effective adaptation of TSFMs for a target forecasting task. Empirical studies on various TSFMs show that the pre-trained models often exhibit inherent sparsity and redundancy in computation, suggesting that TSFMs have learned to activate task-relevant network substructures to accommodate diverse forecasting tasks. To preserve this valuable prior knowledge, we propose a structured pruning method to regularize the subsequent fine-tuning process by focusing it on a more relevant and compact parameter space. Extensive experiments on seven TSFMs and six benchmarks demonstrate that fine-tuning a smaller, pruned TSFM significantly improves forecasting performance compared to fine-tuning original models. This prune-then-finetune paradigm often enables TSFMs to achieve state-of-the-art performance and surpass strong specialized baselines. Source code is made publicly available at https://github.com/SJTU-DMTai/Prune-then-Finetune.
Chinese: 尺度定律推动了时间序列基础模型的发展以实现卓越的零样本预测,但微调后仍无法持续超越小型专业模型,因此提出结构化剪枝方法,通过聚焦相关参数显著提升预测性能。
English: Scaling laws drive the development of Time Series Foundation Models (TSFMs) for superior zero-shot forecasting, but fine-tuning them fails to consistently outperform smaller specialized models, leading to a proposed structured pruning method that enhances performance by focusing on relevant parameters.

Authors:Shiwei Li, Xiandi Luo, Xing Tang, Haozhao Wang, Hao Chen, Weihong Luo, Yuhua Li, Xiuqiang He, Ruixuan Li
Title: Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics
Abstract:
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method. In standard LoRA layers, one of the matrices, $A$ or $B$, is initialized to zero, ensuring that fine-tuning starts from the pretrained model. However, there is no theoretical support for this practice. In this paper, we investigate the impact of non-zero initialization on LoRA's fine-tuning dynamics from an infinite-width perspective. Our analysis reveals that, compared to zero initialization, simultaneously initializing $A$ and $B$ to non-zero values improves LoRA's robustness to suboptimal learning rates, particularly smaller ones. Further analysis indicates that although the non-zero initialization of $AB$ introduces random noise into the pretrained weight, it generally does not affect fine-tuning performance. In other words, fine-tuning does not need to strictly start from the pretrained model. The validity of our findings is confirmed through extensive experiments across various models and datasets. The code is available at https://github.com/Leopold1423/non_zero_lora-icml25.
中文摘要:本研究通过理论分析和实验验证,发现对LoRA双矩阵进行非零初始化能提升其对次优学习率的鲁棒性,且不影响微调效果,从而证明无需严格从预训练模型开始微调。
English Summary: This study demonstrates that initializing both LoRA matrices with non-zero values enhances robustness to suboptimal learning rates without compromising fine-tuning performance, challenging the necessity of starting strictly from pretrained weights.
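
The difference between the two initialization schemes can be illustrated with a minimal sketch. It assumes the low-rank update is the plain product A @ B added to a pretrained weight, uses Gaussian initialization, and picks an arbitrary standard deviation of 0.01; the helper names (lora_delta, gaussian_matrix, and so on) are hypothetical and are not taken from the authors' repository.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def gaussian_matrix(rows, cols, std, rng):
    """Matrix with i.i.d. N(0, std^2) entries."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def max_abs(M):
    return max(abs(v) for row in M for v in row)

def lora_delta(d_out, d_in, rank, init, std=0.01, seed=0):
    """Return the initial low-rank update A @ B for the given scheme.

    init="zero":     A is random, B is zero, so A @ B is exactly zero.
    init="non_zero": both A and B are random, so A @ B is small noise.
    The std value is an illustrative assumption, not a value from the paper.
    """
    rng = random.Random(seed)
    A = gaussian_matrix(d_out, rank, std, rng)
    if init == "zero":
        B = zeros(rank, d_in)
    elif init == "non_zero":
        B = gaussian_matrix(rank, d_in, std, rng)
    else:
        raise ValueError(f"unknown init scheme: {init}")
    return matmul(A, B)

zero_delta = lora_delta(8, 8, 2, "zero")
noisy_delta = lora_delta(8, 8, 2, "non_zero")
print(max_abs(zero_delta))         # no perturbation: training starts at the pretrained weights
print(max_abs(noisy_delta) > 0.0)  # the starting point is shifted by the noise A @ B
```

With zero initialization the effective weight equals the pretrained weight at step zero; with non-zero initialization it is offset by a small random term, which is the noise the abstract reports as generally harmless to fine-tuning performance.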

Authors:Junyi Guo, Jingxuan Zhang, Fangyu Wu, Huanda Lu, Qiufeng Wang, Wenmian Yang, Eng Gee Lim, Dongming Lu
Title: HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image
Abstract:
Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset are available at https://github.com/Maple498/HiGarment.
中文摘要:本文提出FS2RG任务,通过结合平面草图和文本指导生成逼真服装图像,并针对织物细节表现与多模态冲突问题,开发了包含跨模态语义增强与协调注意力机制的HiGarment框架。
English Summary: This paper introduces the FS2RG task for generating realistic garment images from flat sketches and text, addressing challenges in fabric detail representation and conflicting modality guidance through the HiGarment framework with multi-modal enhancement and harmonized attention mechanisms.

Authors:Gabriele Sarti, Vilém Zouhar, Malvina Nissim, Arianna Bisazza
Title: Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
Abstract:
Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
中文: 本研究利用模型可解释性和不确定性量化探索了高效的词级质量评估方法,以检测翻译错误,揭示了无监督指标的潜力及有监督方法在标签不确定性下的局限性。
English: This study explores efficient methods for word-level quality estimation by leveraging model interpretability and uncertainty quantification to detect translation errors, revealing the potential of unsupervised metrics and the limitations of supervised approaches under label uncertainty.

Authors:Ping Wang, Lishun Wang, Gang Qu, Xiaodong Wang, Yulun Zhang, Xin Yuan
Title: Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging
Abstract:
Deep-unrolling and plug-and-play (PnP) approaches have become the de-facto standard solvers for the single-pixel imaging (SPI) inverse problem. PnP approaches, a class of iterative algorithms where regularization is implicitly performed by an off-the-shelf deep denoiser, are flexible for varying compression ratios (CRs) but are limited in reconstruction accuracy and speed. Conversely, unrolling approaches, a class of multi-stage neural networks where a truncated iterative optimization process is transformed into an end-to-end trainable network, typically achieve better accuracy with faster inference but require fine-tuning or even retraining when the CR changes. In this paper, we address the challenge of integrating the strengths of both classes of solvers. To this end, we design an efficient deep image restorer (DIR) for the unrolling of HQS (half quadratic splitting) and ADMM (alternating direction method of multipliers). More importantly, a general proximal trajectory (PT) loss function is proposed to train HQS/ADMM-unrolling networks such that the learned DIR approximates the proximal operator of an ideal explicit restoration regularizer. Extensive experiments demonstrate that the resulting proximal unrolling networks can not only flexibly handle varying CRs with a single model like PnP algorithms, but also outperform previous CR-specific unrolling networks in both reconstruction accuracy and speed. Source codes and models are available at https://github.com/pwangcs/ProxUnroll.
中文摘要:本文提出了一种新颖的近端展开网络,将即插即用方法的灵活性与展开方法的高精度和快速性相结合,实现了单一模型即可有效处理不同压缩比的单像素成像重建任务。
English Summary: This paper introduces a novel proximal unrolling network that combines the flexibility of plug-and-play methods with the superior accuracy and speed of unrolling approaches for single-pixel imaging, enabling a single model to handle varying compression ratios effectively.

Authors:Wenjing Xing, Wenke Lu, Yeheng Duan, Bing Zhao, Zhenghui Kang, Yaolong Wang, Kai Gao, Lei Qiao
Title: Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification
Abstract:
Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, "Reverse Construction" transforms code snippets into diverse programming problems. Then, through "Backfeeding Construction," keywords in programming problems are structured into a knowledge graph to reconstruct them into programming problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average performance improvement of 21.70% on 7B-parameter models and 36.95% on 32B-parameter models. Using less than one-tenth of the instruction fine-tuning data, we achieve performance comparable to Qwen-2.5-Coder-Instruct. Infinite-Instruct provides a scalable solution for LLM training in programming. We open-source the datasets used in the experiments, including both unfiltered versions and versions filtered via static analysis. The data are available at https://github.com/xingwenjing417/Infinite-Instruct-dataset
中文: Infinite-Instruct通过逆向构建和反馈构建自动生成逻辑严密的高质量代码指令数据,仅用少量数据即可大幅提升大语言模型的代码生成能力。
English: Infinite-Instruct is an automated framework that synthesizes high-quality, logically coherent code instruction data through reverse and backfeeding construction, significantly boosting LLMs' code generation performance with minimal data.

Authors:Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Shijie Xu, Weihong Luo, Yuhua Li, Xiuqiang He, Ruixuan Li
Title: The Panaceas for Improving Low-Rank Decomposition in Communication-Efficient Federated Learning
Abstract:
To improve the training efficiency of federated learning (FL), previous research has employed low-rank decomposition techniques to reduce communication overhead. In this paper, we seek to enhance the performance of these low-rank decomposition methods. Specifically, we focus on three key issues related to decomposition in FL: what to decompose, how to decompose, and how to aggregate. Subsequently, we introduce three novel techniques: Model Update Decomposition (MUD), Block-wise Kronecker Decomposition (BKD), and Aggregation-Aware Decomposition (AAD), each targeting a specific issue. These techniques are complementary and can be applied simultaneously to achieve optimal performance. Additionally, we provide a rigorous theoretical analysis to ensure the convergence of the proposed MUD. Extensive experimental results show that our approach achieves faster convergence and superior accuracy compared to relevant baseline methods. The code is available at https://github.com/Leopold1423/fedmud-icml25.
Chinese: 本文针对联邦学习中的低秩分解方法,提出了模型更新分解、分块克罗内克分解和聚合感知分解三种互补技术,在理论保证下实现了更快的收敛速度和更高的精度。
English: This paper introduces three complementary techniques—Model Update Decomposition, Block-wise Kronecker Decomposition, and Aggregation-Aware Decomposition—to enhance low-rank decomposition in federated learning, achieving faster convergence and higher accuracy with theoretical guarantees.

Authors:Shohei Enomoto
Title: Pseudo Multi-Source Domain Generalization: Bridging the Gap Between Single and Multi-Source Domain Generalization
Abstract:
Deep learning models often struggle to maintain performance when deployed on data distributions different from their training data, particularly in real-world applications where environmental conditions frequently change. While Multi-source Domain Generalization (MDG) has shown promise in addressing this challenge by leveraging multiple source domains during training, its practical application is limited by the significant costs and difficulties associated with creating multi-domain datasets. To address this limitation, we propose Pseudo Multi-source Domain Generalization (PMDG), a novel framework that enables the application of sophisticated MDG algorithms in more practical Single-source Domain Generalization (SDG) settings. PMDG generates multiple pseudo-domains from a single source domain through style transfer and data augmentation techniques, creating a synthetic multi-domain dataset that can be used with existing MDG algorithms. Through extensive experiments with PseudoDomainBed, our modified version of the DomainBed benchmark, we analyze the effectiveness of PMDG across multiple datasets and architectures. Our analysis reveals several key findings, including a positive correlation between MDG and PMDG performance and the potential of pseudo-domains to match or exceed actual multi-domain performance with sufficient data. These comprehensive empirical results provide valuable insights for future research in domain generalization. Our code is available at https://github.com/s-enmt/PseudoDomainBed.
中文摘要:本文提出的伪多源域泛化(PMDG)框架通过从单源数据生成合成多域数据集,解决了多源域泛化的实际应用限制,使MDG算法能在更现实的单源场景中有效运用。
English Summary: The proposed Pseudo Multi-source Domain Generalization (PMDG) framework addresses the limitations of multi-source domain generalization by generating synthetic multi-domain datasets from single-source data, enabling effective application of MDG algorithms in practical settings.

Authors:Xiao Yu, Yan Fang, Xiaojie Jin, Yao Zhao, Yunchao Wei
Title: PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
Abstract:
Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework, which features (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding, and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show that PreFM outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at https://github.com/XiaoYu-1123/PreFM.
中文: 本文提出On-AVEP实时音视频事件解析方法,采用PreFM框架通过预测性未来建模增强上下文理解,以更少参数实现显著优于现有方法的性能。
English: This paper introduces On-AVEP, a real-time audio-visual event parsing method using the PreFM framework that enhances context through predictive future modeling and achieves superior performance with fewer parameters.

Authors:Jinquan Guan, Qi Chen, Lizhou Liang, Yuhang Liu, Vu Minh Hieu Phan, Minh-Son To, Jian Chen, Yutong Xie
Title: Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning
Abstract:
Artificial intelligence (AI)-based chest X-ray (CXR) interpretation assistants have demonstrated significant progress and are increasingly being applied in clinical settings. However, contemporary medical AI models often adhere to a simplistic input-to-output paradigm, directly processing an image and an instruction to generate a result, where the instructions may be integral to the model's architecture. This approach overlooks the modeling of the inherent diagnostic reasoning in chest X-ray interpretation. Such reasoning is typically sequential, where each interpretive stage considers the images, the current task, and the contextual information from previous stages. This oversight leads to several shortcomings, including misalignment with clinical scenarios, contextless reasoning, and untraceable errors. To fill this gap, we construct CXRTrek, a new multi-stage visual question answering (VQA) dataset for CXR interpretation. The dataset is designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings for the first time. CXRTrek covers 8 sequential diagnostic stages, comprising 428,966 samples and over 11 million question-answer (Q&A) pairs, with an average of 26.29 Q&A pairs per sample. Building on the CXRTrek dataset, we propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the VLLM framework. CXRTrekNet effectively models the dependencies between diagnostic stages and captures reasoning patterns within the radiological context. Trained on our dataset, the model consistently outperforms existing medical VLLMs on the CXRTrek benchmarks and demonstrates superior generalization across multiple tasks on five diverse external datasets. The dataset and model can be found in our repository (https://github.com/guanjinquan/CXRTrek).
中文: 该摘要介绍了CXRTrek数据集,它模拟放射科医生在胸片诊断中的多阶段推理过程,以及基于此构建的CXRTrekNet视觉语言模型,该模型通过融入临床推理流程,在多项任务中展现出优于现有模型的性能。
English: The abstract introduces CXRTrek, a multi-stage visual question answering dataset designed to simulate radiologists' diagnostic reasoning in chest X-ray interpretation, and CXRTrekNet, a vision-language model that incorporates clinical reasoning flow, demonstrating superior performance over existing models.

Authors:Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song
Title: VERINA: Benchmarking Verifiable Code Generation
Abstract:
Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, generates only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs, with one trial per task. We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.
中文: 大语言模型在生成正确代码方面存在挑战,因此推出了Verina基准来评估可验证代码生成,揭示了当前模型在证明和规范能力上的显著不足。
English: Large language models face challenges in generating correct code, prompting the development of Verina, a benchmark for evaluating verifiable code generation that reveals significant gaps in current models' proof and specification capabilities.

Authors:Tongtong Su, Chengyu Wang, Jun Huang, Dongming Lu
Title: Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing
Abstract:
Appearance editing according to user needs is a pivotal task in video editing. Existing text-guided methods often lead to ambiguities regarding user intentions and restrict fine-grained control over editing specific aspects of objects. To overcome these limitations, this paper introduces a novel approach named Zero-to-Hero, which focuses on reference-based video editing that disentangles the editing process into two distinct problems. It achieves this by first editing an anchor frame to satisfy user requirements as a reference image and then consistently propagating its appearance across other frames. We leverage correspondence within the original frames to guide the attention mechanism, which is more robust than previously proposed optical flow or temporal modules in memory-friendly video generative models, especially when dealing with objects exhibiting large motions. It offers a solid ZERO-shot initialization that ensures both accuracy and temporal consistency. However, intervention in the attention mechanism results in compounded imaging degradation with over-saturated colors and unknown blurring issues. Starting from Zero-Stage, our Hero-Stage Holistically learns a conditional generative model for vidEo RestOration. To accurately evaluate the consistency of the appearance, we construct a set of videos with multiple appearances using Blender, enabling a fine-grained and deterministic evaluation. Our method outperforms the best-performing baseline with a PSNR improvement of 2.6 dB. The project page is at https://github.com/Tonniia/Zero2Hero.
中文摘要:本文提出Zero-to-Hero方法,通过基于参考的视频编辑先将锚帧编辑为参考图像并利用对应关系引导注意力机制在帧间传播外观,再通过条件生成修复模型解决成像退化问题,相比最佳基线方法PSNR指标提升2.6分贝。
English Summary: This paper introduces Zero-to-Hero, a reference-based video editing method that first edits an anchor frame as a reference and propagates its appearance across frames using correspondence-guided attention, then addresses imaging degradation through a conditional generative restoration model, achieving a 2.6 dB PSNR improvement over the best-performing baseline.

Authors:Shi Heng Zhang, Zhengjie Miao, Jiannan Wang
Title: LINEAGEX: A Column Lineage Extraction System for SQL
Abstract:
As enterprise data grows in size and complexity, column-level data lineage, which records the creation, transformation, and reference of each column in the warehouse, has been the key to effective data governance that assists tasks like data quality monitoring, storage refactoring, and workflow migration. Unfortunately, existing systems introduce overheads by integration with query execution or fail to achieve satisfactory accuracy for column lineage. In this paper, we demonstrate LINEAGEX, a lightweight Python library that infers column-level lineage from SQL queries and visualizes it through an interactive interface. LINEAGEX achieves high coverage and accuracy for column lineage extraction by intelligently traversing query parse trees and handling ambiguities. The demonstration walks through use cases of building lineage graphs and troubleshooting data quality issues. LINEAGEX is open sourced at https://github.com/sfu-db/lineagex and our video demonstration is at https://youtu.be/5LaBBDDitlw
中文: LINEAGEX 是一个轻量级 Python 库,通过智能遍历查询解析树并处理歧义,能够从 SQL 查询中高精度推断列级数据血缘关系,并以交互式界面可视化,解决了现有系统在开销和准确性方面的不足。
English: LINEAGEX is a lightweight Python library that accurately infers column-level data lineage from SQL queries by intelligently traversing parse trees and visualizing it interactively, addressing the limitations of existing systems in overhead and accuracy for effective data governance.

Authors:Siyuan Wang, Jiawei Liu, Wei Wang, Yeying Jin, Jinsong Du, Zhi Han
Title: MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation
Abstract:
Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging due to the diversity of different parts of the body in terms of amplitude of motion, audio relevance, and detailed features. Relying solely on audio as the control signal often fails to capture large gesture movements in video, leading to more pronounced artifacts and distortions. Existing approaches typically address this issue by introducing additional a priori information, but this can limit the practical application of the task. To address this, we propose a Motion Mask-Guided Two-Stage Network (MMGT) that uses audio, as well as motion masks and motion features generated from the audio signal, to jointly drive the generation of synchronized speech gesture videos. In the first stage, the Spatial Mask-Guided Audio Pose Generation (SMGA) Network generates high-quality pose videos and motion masks from audio, effectively capturing large movements in key regions such as the face and gestures. In the second stage, we integrate the Motion Masked Hierarchical Audio Attention (MM-HAA) into the Stabilized Diffusion Video Generation model, overcoming limitations in fine-grained motion generation and region-specific detail control found in traditional methods. This guarantees high-quality, detailed upper-body video generation with accurate texture and motion details. Evaluations show improved video quality, lip-sync, and gesture. The model and code are available at https://github.com/SIA-IDE/MMGT.
中文:MMGT模型通过两阶段处理,利用音频驱动的运动遮罩和特征生成同步的伴随语音手势视频,克服了仅依赖音频的局限,实现了高质量的上半身运动细节与精准同步。
English: The MMGT model uses audio-driven motion masks and features in a two-stage process to generate synchronized co-speech gesture videos, overcoming limitations of audio-only methods by producing high-quality upper-body movements with enhanced detail and synchronization.

Authors:Pengfei Zhou, Yunlong Liu, Junli Liang, Qi Song, Xiangyang Li
Title: CrossLinear: Plug-and-Play Cross-Correlation Embedding for Time Series Forecasting with Exogenous Variables
Abstract:
Time series forecasting with exogenous variables is a critical emerging paradigm that presents unique challenges in modeling dependencies between variables. Traditional models often struggle to differentiate between endogenous and exogenous variables, leading to inefficiencies and overfitting. In this paper, we introduce CrossLinear, a novel Linear-based forecasting model that addresses these challenges by incorporating a plug-and-play cross-correlation embedding module. This lightweight module captures the dependencies between variables with minimal computational cost and seamlessly integrates into existing neural networks. Specifically, it captures time-invariant and direct variable dependencies while disregarding time-varying or indirect dependencies, thereby mitigating the risk of overfitting in dependency modeling and contributing to consistent performance improvements. Furthermore, CrossLinear employs patch-wise processing and a global linear head to effectively capture both short-term and long-term temporal dependencies, further improving its forecasting precision. Extensive experiments on 12 real-world datasets demonstrate that CrossLinear achieves superior performance in both short-term and long-term forecasting tasks. The ablation study underscores the effectiveness of the cross-correlation embedding module. Additionally, the generalizability of this module makes it a valuable plug-in for various forecasting tasks across different domains. Codes are available at https://github.com/mumiao2000/CrossLinear.
Chinese: CrossLinear是一种基于线性预测的新模型,通过轻量化的互相关嵌入模块有效捕捉变量间依赖关系并防止过拟合,在多个数据集的短期和长期预测任务中均表现出卓越性能。
English: CrossLinear is a novel linear-based forecasting model that introduces a lightweight cross-correlation embedding module to efficiently capture dependencies between variables while mitigating overfitting, achieving superior performance in both short-term and long-term forecasting tasks across multiple datasets.

Authors:Ning Liu, Yue Yu
Title: Neural Interpretable PDEs: Harmonizing Fourier Insights with Attention for Scalable and Interpretable Physics Discovery
Abstract:
Attention mechanisms have emerged as transformative tools in core AI domains such as natural language processing and computer vision. Yet, their largely untapped potential for modeling intricate physical systems presents a compelling frontier. Learning such systems often entails discovering operators that map between functional spaces using limited instances of function pairs -- a task commonly framed as a severely ill-posed inverse PDE problem. In this work, we introduce Neural Interpretable PDEs (NIPS), a novel neural operator architecture that builds upon and enhances Nonlocal Attention Operators (NAO) in both predictive accuracy and computational efficiency. NIPS employs a linear attention mechanism to enable scalable learning and integrates a learnable kernel network that acts as a channel-independent convolution in Fourier space. As a consequence, NIPS eliminates the need to explicitly compute and store large pairwise interactions, effectively amortizing the cost of handling spatial interactions into the Fourier transform. Empirical evaluations demonstrate that NIPS consistently surpasses NAO and other baselines across diverse benchmarks, heralding a substantial leap in scalable, interpretable, and efficient physics learning. Our code and data accompanying this paper are available at https://github.com/fishmoon1234/Nonlocal-Attention-Operator.
中文摘要:本文提出神经可解释偏微分方程(NIPS),这是一种通过线性注意力机制和傅里叶空间卷积增强非局部注意力算子的新架构,在物理学习基准测试中显著提升了预测精度与计算效率。
English Summary: The paper introduces Neural Interpretable PDEs (NIPS), a neural operator architecture that enhances Nonlocal Attention Operators with improved accuracy and efficiency through linear attention and Fourier-space convolution, outperforming existing methods in physics learning benchmarks.

Authors:Yuu Jinnai
Title: Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport
Abstract:
Document-level text generation tasks are known to be more difficult than sentence-level text generation tasks as they require the understanding of longer context to generate high-quality texts. In this paper, we investigate the adaptation of Minimum Bayes Risk (MBR) decoding for document-level text generation tasks. MBR decoding makes use of a utility function to estimate the output with the highest expected utility from a set of candidate outputs. Although MBR decoding has been shown to be effective in a wide range of sentence-level text generation tasks, its performance on document-level text generation tasks is limited as many of the utility functions are designed for evaluating the utility of sentences. To this end, we propose MBR-OT, a variant of MBR decoding using Wasserstein distance to compute the utility of a document using a sentence-level utility function. Experimental results show that MBR-OT outperforms standard MBR in document-level machine translation, text simplification, and dense image captioning tasks. Our code is available at https://github.com/jinnaiyuu/mbr-optimal-transport
中文: 本文提出MBR-OT方法,通过引入Wasserstein距离改进最小贝叶斯风险解码,将句子级效用函数有效应用于文档级文本生成,在机器翻译、文本简化和密集图像描述任务中展现出优于标准方法的性能。
English: This paper introduces MBR-OT, an enhanced Minimum Bayes Risk decoding method using Wasserstein distance to improve document-level text generation by effectively applying sentence-level utility functions to longer contexts, demonstrating superior performance in translation, simplification, and captioning tasks.
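
The selection rule described in the abstract can be sketched as follows. This is a toy illustration, not the authors' implementation: it assumes uniform sentence weights and equal sentence counts (in which case an optimal transport plan is a one-to-one matching and brute force over permutations suffices), it uses token-level Jaccard overlap as a stand-in sentence utility, and it scores each candidate against the other candidates. All function names are hypothetical.

```python
from itertools import permutations

def sentence_utility(a, b):
    """Toy sentence-level utility: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def document_utility(doc_a, doc_b):
    """Document utility from the best one-to-one sentence matching.

    With uniform sentence weights and equal sentence counts, an optimal
    transport plan is a permutation, so brute force is enough for toy sizes.
    """
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch assumes equal sentence counts")
    n = len(doc_a)
    best = max(
        sum(sentence_utility(doc_a[i], doc_b[p[i]]) for i in range(n))
        for p in permutations(range(n))
    )
    return best / n

def mbr_decode(candidates):
    """Return the candidate with the highest average utility against the others."""
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(document_utility(candidates[i], o) for o in others) / len(others)
    return candidates[max(range(len(candidates)), key=expected_utility)]

# Each candidate document is a list of sentences.
candidates = [
    ["the cat sat on the mat", "it was warm"],
    ["it was warm", "the cat sat on the mat"],  # same content, reordered
    ["a dog ran in the park", "it was cold"],
]
best = mbr_decode(candidates)
print(best)
```

Because sentences are matched rather than compared position by position, the first two candidates in this toy example score each other perfectly despite the reordering, so one of them is selected over the outlier. Documents with different sentence counts would need a general optimal transport solver, which this sketch omits.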

Authors:Dohyeon Lee, Yeonseok Jeong, Seung-won Hwang
Title: From Token to Action: State Machine Reasoning to Mitigate Overthinking in Information Retrieval
Abstract:
Chain-of-Thought (CoT) prompting enables complex reasoning in large language models (LLMs), including applications in information retrieval (IR). However, it often leads to overthinking, where models produce excessively long and semantically redundant traces with little or no benefit. We identify two key challenges in IR: redundant trajectories that revisit similar states and misguided reasoning that diverges from user intent. To address these, we propose State Machine Reasoning (SMR), a transition-based reasoning framework composed of discrete actions (Refine, Rerank, Stop) that support early stopping and fine-grained control. Experiments on the BEIR and BRIGHT benchmarks show that SMR improves retrieval performance (nDCG@10) by 3.4% while reducing token usage by 74.4%. It generalizes across LLMs and retrievers without requiring task-specific tuning, offering a practical alternative to conventional CoT reasoning. The code and details are available at https://github.com/ldilab/SMR.
Chinese: 状态机推理(SMR)是一种基于状态转换的推理框架,通过减少冗余推理和降低令牌使用量来提升检索性能,为大型语言模型中的思维链提示提供了一种实用的替代方案。
English: State Machine Reasoning (SMR) is a transition-based framework that enhances retrieval performance by reducing redundant reasoning and token usage, offering a practical alternative to Chain-of-Thought prompting in large language models.
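
A transition-based loop over the three actions named in the abstract (Refine, Rerank, Stop) might look like the sketch below. Only the action names come from the source; the state representation, the repeated-state check, the step cap, and the toy policy, refine, and rerank functions are assumptions made for illustration. In the paper an LLM presumably plays the role of the policy, which this sketch replaces with hand-written rules.

```python
from enum import Enum

class Action(Enum):
    REFINE = "refine"
    RERANK = "rerank"
    STOP = "stop"

def run_state_machine(query, ranking, policy, refine, rerank, max_steps=8):
    """Apply policy-chosen actions until STOP, a repeated state, or the step cap."""
    state = (query, tuple(ranking))
    seen = {state}
    trace = []
    for _ in range(max_steps):
        action = policy(*state)
        trace.append(action)
        if action is Action.STOP:
            break
        if action is Action.REFINE:
            state = (refine(state[0]), state[1])
        else:  # Action.RERANK
            state = (state[0], tuple(rerank(state[0], state[1])))
        if state in seen:  # revisiting a state adds nothing, so stop early
            break
        seen.add(state)
    return state, trace

# Hypothetical stand-ins for the components a real system would provide.
def overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))

def toy_refine(query):
    return query if "python" in query.split() else query + " python"

def toy_rerank(query, ranking):
    return sorted(ranking, key=lambda d: -overlap(query, d))

def toy_policy(query, ranking):
    if "python" not in query.split():
        return Action.REFINE
    if list(ranking) != toy_rerank(query, ranking):
        return Action.RERANK
    return Action.STOP

docs = ["java sort tutorial", "python sort list guide"]
final_state, trace = run_state_machine("sort list", docs, toy_policy, toy_refine, toy_rerank)
print([a.value for a in trace])
print(final_state[1][0])
```

Keeping the state explicit is what makes early stopping cheap in this sketch: the loop ends as soon as the policy emits Stop or a transition returns to a state already visited, instead of generating further reasoning text.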

Authors:Zhen Xiang, Aliyah R. Hsu, Austin V. Zane, Aaron E. Kornblith, Margaret J. Lin-Martore, Jasmanpreet C. Kaur, Vasuda M. Dokiparthi, Bo Li, Bin Yu
Title: CDR-Agent: Intelligent Selection and Execution of Clinical Decision Rules Using Large Language Model Agents
Abstract:
Clinical decision-making is inherently complex and fast-paced, particularly in emergency departments (EDs) where critical, rapid and high-stakes decisions are made. Clinical Decision Rules (CDRs) are standardized evidence-based tools that combine signs, symptoms, and clinical variables into decision trees to make consistent and accurate diagnoses. CDR usage is often hindered by the clinician's cognitive load, limiting their ability to quickly recall and apply the appropriate rules. We introduce CDR-Agent, a novel LLM-based system designed to enhance ED decision-making by autonomously identifying and applying the most appropriate CDRs based on unstructured clinical notes. To validate CDR-Agent, we curated two novel ED datasets: synthetic and CDR-Bench, although CDR-Agent is applicable to non-ED clinics. CDR-Agent achieves a 56.3% (synthetic) and 8.7% (CDR-Bench) accuracy gain relative to the standalone LLM baseline in CDR selection. Moreover, CDR-Agent significantly reduces computational overhead. Using these datasets, we demonstrated that CDR-Agent not only selects relevant CDRs efficiently, but makes cautious yet effective imaging decisions by minimizing unnecessary interventions while successfully identifying most positively diagnosed cases, outperforming traditional LLM prompting approaches. Code for our work can be found at: https://github.com/zhenxianglance/medagent-cdr-agent
中文: CDR-Agent是一种基于大语言模型的新型系统,通过自主从非结构化临床记录中筛选并应用合适的临床决策规则来提升急诊科决策水平,相比基准模型显著提高了准确性并降低了计算开销。
English: CDR-Agent is a novel LLM-based system that enhances emergency department decision-making by autonomously selecting and applying appropriate Clinical Decision Rules from unstructured clinical notes, achieving significant accuracy improvements over baseline models while reducing computational overhead.

Authors:Tianteng Gu, Bei Liu, Bo Xiao, Ke Zeng, Jiacheng Liu, Yanmin Qian
Title: DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration
Abstract:
Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation - especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model's weight matrices. Our method is model-agnostic and can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Codes are available at https://github.com/Axel-gu/DenoiseRotator.
Chinese: 本文提出DenoiseRotator方法,通过正交变换重新分配参数重要性,增强大语言模型剪枝的鲁棒性并显著减少性能损失。
English: This paper introduces DenoiseRotator, a model-agnostic method that redistributes parameter importance through orthogonal transformations to enhance pruning robustness and reduce performance degradation in large language models.
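
The importance-concentration idea in the DenoiseRotator abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes squared weight magnitude as the importance score and a fixed 2-D rotation in place of a learned orthogonal matrix, and the names `importance_entropy` and `rotate` are hypothetical. It shows that rotating a weight vector and its input by the same orthogonal transform leaves the output unchanged while lowering the entropy of the normalized importance scores.

```python
import math

def importance_entropy(scores):
    """Shannon entropy (in nats) of importance scores normalized to sum to 1."""
    total = sum(scores)
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log(p) for p in probs)

def rotate(vec, theta):
    """Apply a 2-D rotation (an orthogonal transform) to a 2-vector."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

# Assumption: importance of a weight is its squared magnitude.
w = (1.0, 1.0)    # importance spread evenly over both weights
x = (0.3, -0.7)   # an arbitrary input

# Rotating weights and input by the same angle preserves their dot product.
theta = -math.pi / 4
w_rot = rotate(w, theta)
x_rot = rotate(x, theta)

before = importance_entropy([wi ** 2 for wi in w])
after = importance_entropy([wi ** 2 for wi in w_rot])
print(f"entropy before: {before:.4f}, after: {after:.4f}")
print(f"output before: {dot(w, x):.4f}, after: {dot(w_rot, x_rot):.4f}")
```

After the rotation almost all importance sits on one weight, so pruning the other weight changes the output very little. That is the intuition behind concentrating importance before pruning.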

Authors:Chuanhao Li, Wenbo Ye, Zhen Li, Yuwei Wu, Yunde Jia
Title: Multi-Sourced Compositional Generalization in Visual Question Answering
Abstract:
Compositional generalization is the ability to generalize to novel compositions of seen primitives, and has received much attention in vision-and-language (V&L) recently. Due to the multi-modal nature of V&L tasks, the primitives that form compositions come from different modalities, resulting in multi-sourced novel compositions. However, the generalization ability over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. Experimental results demonstrate the effectiveness of the proposed framework. We release GQA-MSCG at https://github.com/NeverMoreLCH/MSCG.
中文: 本文提出了一种检索增强训练框架,通过为跨模态基元学习统一表征来增强视觉问答模型的多源组合泛化能力,并基于新构建的GQA-MSCG数据集验证了其有效性。
English: This paper introduces a retrieval-augmented training framework to enhance multi-sourced compositional generalization in visual question answering models by learning unified representations for cross-modal primitives, validated through a newly constructed GQA-MSCG dataset.

Authors:Yihang Wu, Muhammad Owais, Reem Kateb, Ahmad Chaddad
Title: Deep Modeling and Optimization of Medical Image Classification
Abstract:
Deep models, such as convolutional neural networks (CNNs) and vision transformers (ViTs), demonstrate remarkable performance in image classification. However, those deep models require large amounts of data to fine-tune, which is impractical in the medical domain due to data privacy issues. Furthermore, despite the feasible performance of contrastive language image pre-training (CLIP) in the natural domain, the potential of CLIP has not been fully investigated in the medical field. To face these challenges, we considered three scenarios: 1) we introduce a novel CLIP variant using four CNNs and eight ViTs as image encoders for the classification of brain cancer and skin cancer, 2) we combine 12 deep models with two federated learning (FL) techniques to protect data privacy, and 3) we involve traditional machine learning (ML) methods to improve the generalization ability of those deep models on unseen domain data. The experimental results indicate that maxvit shows the highest averaged (AVG) test metrics (AVG = 87.03%) on the HAM10000 dataset with multimodal learning, while convnext_l achieves a test F1-score of 83.98% compared to swin_b with 81.33% in the FL model. Furthermore, the use of a support vector machine (SVM) can improve the overall test metrics by an AVG of approximately 2% for the swin transformer series on ISIC2018. Our codes are available at https://github.com/AIPMLab/SkinCancerSimulation.
中文摘要:深度学习模型在图像分类中表现优异,但在医学领域面临数据稀缺和隐私问题,为此研究提出了一种新型CLIP变体,结合联邦学习和传统机器学习方法,有效提升了医学图像分析的性能与泛化能力。
English Summary: Deep models like CNNs and ViT excel in image classification but face data scarcity and privacy issues in medicine, prompting the development of a novel CLIP variant combined with federated learning and traditional ML to enhance performance and generalization in medical image analysis.

Authors:Si Wu, Sebastian Bruch
Title: Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Space
Abstract:
Imageability (potential of text to evoke a mental image) and concreteness (perceptibility of text) are two psycholinguistic properties that link visual and semantic spaces. It is little surprise that computational methods that estimate them do so using parallel visual and semantic spaces, such as collections of image-caption pairs or multi-modal models. In this paper, we work on the supposition that text itself in an image-caption dataset offers sufficient signals to accurately estimate these properties. We hypothesize, in particular, that the peakedness of the neighborhood of a word in the semantic embedding space reflects its degree of imageability and concreteness. We then propose an unsupervised, distribution-free measure, which we call Neighborhood Stability Measure (NSM), that quantifies the sharpness of peaks. Extensive experiments show that NSM correlates more strongly with ground-truth ratings than existing unsupervised methods, and is a strong predictor of these properties for classification. Our code and data are available on GitHub (https://github.com/Artificial-Memory-Lab/imageability).
Chinese: 本文提出一种无监督的邻域稳定性度量方法,通过分析语义嵌入空间中词汇分布的峰值特征来有效评估形象性和具体性,其与人工评分的相关性优于现有方法。
English: This paper introduces an unsupervised Neighborhood Stability Measure (NSM) that effectively estimates imageability and concreteness by analyzing the peakedness of words in semantic embedding space, outperforming existing methods in correlation with human ratings.

Authors:Minh Nguyen Nhat To, Paul F. R. Wilson, Viet Nguyen, Mohamed Harmanani, Michael Cooper, Fahimeh Fooladgar, Purang Abolmaesumi, Parvin Mousavi, Rahul G. Krishnan
Title: Diverse Prototypical Ensembles Improve Robustness to Subpopulation Shift
Abstract:
Subpopulation shift, characterized by a disparity in subpopulation distribution between the training and target datasets, can significantly degrade the performance of machine learning models. Current solutions to subpopulation shift involve modifying empirical risk minimization with re-weighting strategies to improve generalization. This strategy relies on assumptions about the number and nature of subpopulations and annotations on group membership, which are unavailable for many real-world datasets. Instead, we propose using an ensemble of diverse classifiers to adaptively capture risk associated with subpopulations. Given a feature extractor network, we replace its standard linear classification layer with a mixture of prototypical classifiers, where each member is trained to classify the data while focusing on different features and samples from other members. In empirical evaluation on nine real-world datasets, covering diverse domains and kinds of subpopulation shift, our method of Diverse Prototypical Ensembles (DPEs) often outperforms the prior state-of-the-art in worst-group accuracy. The code is available at https://github.com/minhto2802/dpe4subpop
中文: 针对机器学习中的子群体分布偏移问题,本文提出的多样化原型集成方法通过用混合原型分类器替代标准分类器,自适应捕捉子群体风险,在多种数据集的最差组准确率上优于现有最优方法。
English: Subpopulation shift in machine learning models is addressed by the proposed Diverse Prototypical Ensembles method, which replaces standard classifiers with a mixture of prototypical classifiers to adaptively capture subpopulation risks and outperforms prior methods in worst-group accuracy across diverse datasets.

Authors:Haewon Park, Gyubin Choi, Minjun Kim, Yohan Jo
Title: Context-Robust Knowledge Editing for Language Models
Abstract:
Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge without any preceding contexts. In real-world applications, however, preceding contexts often trigger the retrieval of the original knowledge and undermine the intended edit. To address this issue, we develop CHED, a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that existing KE methods often fail when preceding contexts are present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing context-sensitive variance in hidden states of the model for edited knowledge. This method not only improves the editing success rate in situations where a preceding context is present but also preserves the overall capabilities of the model. We provide an in-depth analysis of the differing impacts of preceding contexts when introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success.
Chinese: 知识编辑方法在前文语境触发原有知识时常常失效,为此我们开发了CHED基准评估语境鲁棒性,并提出了CoRE方法通过减少隐藏状态中的语境敏感方差来提高编辑成功率。
English: Knowledge editing methods often fail when preceding contexts trigger original knowledge, so we developed the CHED benchmark to evaluate context robustness and introduced the CoRE method to improve editing success by reducing context-sensitive variance in hidden states.

Authors:Haoqin Sun, Xuechen Wang, Jinghua Zhao, Shiwan Zhao, Jiaming Zhou, Hui Wang, Jiabei He, Aobo Kong, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin
Title: EmotionTalk: An Interactive Chinese Multimodal Emotion Dataset With Rich Annotations
Abstract:
In recent years, emotion recognition has played a critical role in applications such as human-computer interaction, mental health monitoring, and sentiment analysis. While datasets for emotion analysis in languages such as English have proliferated, there remains a pressing need for high-quality, comprehensive datasets tailored to the unique linguistic, cultural, and multimodal characteristics of Chinese. In this work, we propose EmotionTalk, an interactive Chinese multimodal emotion dataset with rich annotations. This dataset provides multimodal information from 19 actors participating in dyadic conversational settings, incorporating acoustic, visual, and textual modalities. It includes 23.6 hours of speech (19,250 utterances), annotations for 7 utterance-level emotion categories (happy, surprise, sad, disgust, anger, fear, and neutral), 5-dimensional sentiment labels (negative, weakly negative, neutral, weakly positive, and positive) and 4-dimensional speech captions (speaker, speaking style, emotion and overall). The dataset is well-suited for research on unimodal and multimodal emotion recognition, missing modality challenges, and speech captioning tasks. To our knowledge, it represents the first high-quality and versatile Chinese dialogue multimodal emotion dataset, which is a valuable contribution to research on cross-cultural emotion analysis and recognition. Additionally, we conduct experiments on EmotionTalk to demonstrate the effectiveness and quality of the dataset. It will be open-source and freely available for all academic purposes. The dataset and codes will be made available at: https://github.com/NKU-HLT/EmotionTalk.
中文: 本文提出首个高质量多模态中文情感数据集EmotionTalk,该数据集包含丰富的多模态标注信息,旨在填补中文情感分析资源空白,推动跨文化情感识别研究。
English: This paper introduces EmotionTalk, the first high-quality multimodal Chinese emotion dataset featuring rich annotations across acoustic, visual, and textual modalities, designed to advance emotion recognition research and address the scarcity of culturally tailored resources.

Authors:Bowen Chen, Keyan Chen, Mohan Yang, Zhengxia Zou, Zhenwei Shi
Title: SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model
Abstract:
High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. However, due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition. Existing RSISR methods primarily focus on low-level characteristics in pixel space, while neglecting the high-level understanding of remote sensing scenes. This may lead to semantically inconsistent artifacts in the reconstructed results. Motivated by this observation, our work aims to explore the role of high-level semantic knowledge in improving RSISR performance. We propose a Semantic-Guided Super-Resolution framework, SeG-SR, which leverages Vision-Language Models (VLMs) to extract semantic knowledge from input images and uses it to guide the super resolution (SR) process. Specifically, we first design a Semantic Feature Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic knowledge from remote sensing images. Next, we propose a Semantic Localization Module (SLM), which derives a series of semantic guidance from the extracted semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM) that uses semantic guidance to modulate the features extracted by the SR network, effectively incorporating high-level scene understanding into the SR pipeline. We validate the effectiveness and generalizability of SeG-SR through extensive experiments: SeG-SR achieves state-of-the-art performance on two datasets and consistently delivers performance improvements across various SR architectures. Codes can be found at https://github.com/Mr-Bamboo/SeG-SR.
中文:提出的SeG-SR框架通过融合视觉语言模型的语义指导来提升遥感图像超分辨率性能,其设计的语义提取与调制模块能有效利用高层场景理解消除重建伪影,在多个数据集上实现了最优效果。
English: The proposed SeG-SR framework enhances remote sensing image super-resolution by integrating semantic guidance from vision-language models, achieving state-of-the-art performance through modules that extract and utilize high-level scene understanding to prevent artifacts.

Authors:Ruskin Raj Manku, Yuzhi Tang, Xingjian Shi, Mu Li, Alex Smola
Title: EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
Abstract:
Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on EmergentTTS, we introduce EmergentTTS-Eval, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open source the evaluation code (https://github.com/boson-ai/EmergentTTS-Eval-public) and the dataset (https://huggingface.co/datasets/bosonai/EmergentTTS-Eval).
中文:EmergentTTS-Eval基准通过自动化测试生成与评估,覆盖六种复杂TTS场景,利用大语言模型构建多样化测试用例,并采用音频大模型进行多维度评估,其评判结果与人类偏好高度一致。
English: The EmergentTTS-Eval benchmark addresses limitations in existing TTS evaluations by automating test generation and assessment across six complex scenarios, using LLMs to create diverse cases and a Large Audio Language Model for multidimensional evaluation that correlates well with human judgment.

Authors:Peixuan Han, Zijia Liu, Jiaxuan You
Title: ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind
Abstract:
Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent's current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader to learn how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents. Code is available at: https://github.com/ulab-uiuc/ToMAP.
Chinese: 针对当前大型语言模型在说服任务中缺乏心理理论推理能力的问题,研究者提出了ToMAP方法,通过心理理论模块和强化学习增强对对手心理状态的分析,仅用30亿参数就在多项指标上显著超越了GPT-4o等更大模型。
English: To address the limitations of current LLMs in Theory of Mind reasoning for persuasion, the researchers developed ToMAP, a method that enhances opponent awareness through specialized modules and reinforcement learning, achieving superior performance over larger models like GPT-4o with only 3B parameters.

Authors:Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune
Title: Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Abstract:
Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.
中文: 达尔文·哥德尔机是一种自我改进的人工智能系统,它通过迭代式代码修改和实证验证来自主提升编程能力,在基准测试中实现了显著性能提升,同时融入了安全防护措施。
English: The Darwin Gödel Machine (DGM) is a self-improving AI system that autonomously enhances its coding capabilities through iterative code modifications and empirical validation, achieving significant performance gains on benchmarks while incorporating safety measures.

Authors:Michael Sun, Orion Foo, Gang Liu, Wojciech Matusik, Jie Chen
Title: Directed Graph Grammars for Sequence-based Learning
Abstract:
Directed acyclic graphs (DAGs) are a class of graphs commonly used in practice, with examples that include electronic circuits, Bayesian networks, and neural architectures. While many effective encoders exist for DAGs, it remains challenging to decode them in a principled manner, because the nodes of a DAG can have many different topological orders. In this work, we propose a grammar-based approach to constructing a principled, compact and equivalent sequential representation of a DAG. Specifically, we view a graph as derivations over an unambiguous grammar, where the DAG corresponds to a unique sequence of production rules. Equivalently, the procedure to construct such a description can be viewed as a lossless compression of the data. Such a representation has many uses, including building a generative model for graph generation, learning a latent space for property prediction, and leveraging the sequence representational continuity for Bayesian Optimization over structured data. Code is available at https://github.com/shiningsunnyday/induction.
中文: 本文提出了一种基于语法的方法,通过将有向无环图视为无歧义语法的推导,构建出原则性、紧凑的序列表示,可用于生成建模、属性预测和贝叶斯优化等应用。
English: This paper introduces a grammar-based method to create a principled, compact sequential representation of directed acyclic graphs (DAGs) by treating them as derivations from an unambiguous grammar, enabling applications in generative modeling, property prediction, and Bayesian optimization.
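
The decoding difficulty mentioned in the abstract above, that one DAG admits many topological orders, can be checked on small examples. The sketch below is a brute-force counter written for illustration only; it is unrelated to the grammar-based method in the paper, and `count_topological_orders` is a hypothetical helper name.

```python
def count_topological_orders(nodes, edges):
    """Count linear orders of `nodes` that respect every directed edge (u, v)."""
    # preds[n] holds the nodes that must appear before n.
    preds = {n: set() for n in nodes}
    for u, v in edges:
        preds[v].add(u)

    def extend(placed):
        # All nodes placed: exactly one complete order has been built.
        if len(placed) == len(nodes):
            return 1
        total = 0
        for n in nodes:
            # n may come next only if all of its predecessors are already placed.
            if n not in placed and preds[n] <= placed:
                total += extend(placed | {n})
        return total

    return extend(frozenset())

# A diamond: a -> b, a -> c, b -> d, c -> d.
diamond = count_topological_orders(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")],
)
print(diamond)
```

Even this four-node diamond has two valid orders (b and c can swap), so a sequence decoder that emits nodes one at a time has more than one correct target for the same graph.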

Authors:Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, Jie Chen
Title: Foundation Molecular Grammar: Multi-Modal Foundation Models Induce Interpretable Molecular Graph Languages
Abstract:
Recent data-efficient molecular generation approaches exploit graph grammars to introduce interpretability into the generative models. However, grammar learning therein relies on expert annotation or unreliable heuristics for algorithmic inference. We propose Foundation Molecular Grammar (FMG), which leverages multi-modal foundation models (MMFMs) to induce an interpretable molecular language. By exploiting the chemical knowledge of an MMFM, FMG renders molecules as images, describes them as text, and aligns information across modalities using prompt learning. FMG can be used as a drop-in replacement for the prior grammar learning approaches in molecular generation and property prediction. We show that FMG not only excels in synthesizability, diversity, and data efficiency but also offers built-in chemical interpretability for automated molecular discovery workflows. Code is available at https://github.com/shiningsunnyday/induction.
中文:基础分子语法(FMG)利用多模态基础模型将分子表示为图像和文本,从而创建了一种可解释的分子语言,在分子生成和性质预测方面表现出卓越的可合成性、多样性和数据效率。
English: Foundation Molecular Grammar (FMG) introduces an interpretable molecular language by leveraging multi-modal foundation models to represent molecules as images and text, excelling in synthesizability, diversity, and data efficiency for molecular generation and property prediction.

Authors:Guilherme Adamatti Bridi, André Luis Alves Martins, Franklin de Lima Marquezino, Celina Miraglia Herrera de Figueiredo
Title: The only Class 0 Flower snark is the smallest
Abstract:
Graph pebbling is a game played on graphs with pebbles on their vertices. A pebbling move removes two pebbles from one vertex and places one pebble on an adjacent vertex. The pebbling number is the smallest $t$ so that from any initial configuration of $t$ pebbles it is possible, after a sequence of pebbling moves, to place a pebble on any given target vertex. Graphs whose pebbling number is equal to the number of vertices are called Class $0$ and provide a challenging set of graphs that resist being characterized. In this note, we answer a question recently proposed by the pioneering study on the pebbling number of snark graphs: we prove that the smallest Flower snark $J_3$ is Class $0$, establishing that $J_3$ is in fact the only Class $0$ Flower snark.
中文摘要:本研究证实了最小的花怪图J₃属于Class 0类,其卵石数等于顶点数,并确定它是花怪图中唯一具有此性质的图。
English Summary: The study confirms that the smallest Flower snark graph, J₃, is Class 0, meaning its pebbling number equals its vertex count, and uniquely identifies it as the only Class 0 member among Flower snarks.
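
The definitions in the abstract above (pebbling move, pebbling number, Class 0) can be made concrete with a brute-force sketch. It is only practical for very small graphs and is not the proof technique of the paper; the function names and the adjacency-list encoding are assumptions made for illustration.

```python
from itertools import combinations_with_replacement

def can_reach(adj, config, target):
    """Return True if some sequence of pebbling moves puts a pebble on `target`."""
    seen = set()
    stack = [tuple(config)]
    while stack:
        state = stack.pop()
        if state[target] > 0:
            return True
        if state in seen:
            continue
        seen.add(state)
        for u, count in enumerate(state):
            if count >= 2:
                for v in adj[u]:
                    nxt = list(state)
                    nxt[u] -= 2  # a move removes two pebbles from u ...
                    nxt[v] += 1  # ... and places one on an adjacent vertex v
                    stack.append(tuple(nxt))
    return False

def pebbling_number(adj):
    """Smallest t such that every placement of t pebbles can reach every target."""
    n = len(adj)
    t = 1
    while True:
        configs = []
        # Each multiset of t vertices is one way to distribute t pebbles.
        for placement in combinations_with_replacement(range(n), t):
            config = [0] * n
            for v in placement:
                config[v] += 1
            configs.append(config)
        if all(can_reach(adj, c, target) for c in configs for target in range(n)):
            return t
        t += 1

triangle = [[1, 2], [0, 2], [0, 1]]  # complete graph on 3 vertices
path3 = [[1], [0, 2], [1]]           # path on 3 vertices

print(pebbling_number(triangle), pebbling_number(path3))
```

The triangle has pebbling number 3, equal to its vertex count, so it is Class 0. The 3-vertex path needs 4 pebbles and is not. The search terminates because every move reduces the total number of pebbles by one.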

Authors:Ben Weiss
Title: Fast Isotropic Median Filtering
Abstract:
Median filtering is a cornerstone of computational image processing. It provides an effective means of image smoothing, with minimal blurring or softening of edges, invariance to monotonic transformations such as gamma adjustment, and robustness to noise and outliers. However, known algorithms have all suffered from practical limitations: the bit depth of the image data, the size of the filter kernel, or the kernel shape itself. Square-kernel implementations tend to produce streaky cross-hatching artifacts, and nearly all known efficient algorithms are in practice limited to square kernels. We present for the first time a method that overcomes all of these limitations. Our method operates efficiently on arbitrary bit-depth data, arbitrary kernel sizes, and arbitrary convex kernel shapes, including circular shapes.
中文: 该方法首次突破了以往中值滤波算法的局限,能高效处理任意位深、核尺寸及凸形核(包括圆形),避免了方形核产生的条纹伪影。
English: The proposed method overcomes the limitations of previous median filtering algorithms by efficiently handling arbitrary bit depths, kernel sizes, and convex shapes, including circular ones, without the artifacts associated with square kernels.
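
For reference, a median filter with a circular kernel can be written as a naive baseline. The sketch below is not the fast algorithm of the paper: it sorts every window from scratch, skips out-of-bounds pixels at the border (an assumption), and uses hypothetical function names. It only illustrates what an isotropic kernel is and the invariance to monotonic transformations mentioned in the abstract.

```python
from statistics import median_low

def circular_offsets(radius):
    """Integer offsets (dy, dx) that fall inside a disc of the given radius."""
    r2 = radius * radius
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= r2]

def median_filter_disc(image, radius):
    """Brute-force median filter; pixels outside the image are skipped."""
    h, w = len(image), len(image[0])
    offsets = circular_offsets(radius)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[y + dy][x + dx]
                      for dy, dx in offsets
                      if 0 <= y + dy < h and 0 <= x + dx < w]
            # median_low always returns a value that occurs in the window.
            out[y][x] = median_low(window)
    return out

# A flat 5x5 image with one outlier in the centre.
image = [[10] * 5 for _ in range(5)]
image[2][2] = 255
filtered = median_filter_disc(image, 1)
print(filtered[2][2])
```

Because the output is always an actual order statistic of the window, applying a monotonically increasing map (such as a gamma curve) before or after filtering gives the same result.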

Authors:Ruichen Chen
Title: Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
Abstract:
Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference. Further, we measure latency to show that our method can attain over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code available online here: https://github.com/cccrrrccc/Re-ttention
中文: Re-ttention通过利用历史softmax分布重塑注意力分数,在极高稀疏度下保持完整注意力的视觉质量,同时实现超过45%的端到端延迟降低。
English: Re-ttention enables high-quality visual generation with extremely sparse attention by reshaping scores using prior softmax distributions, achieving over 45% latency reduction while maintaining full-attention quality.

Authors:Donghyeon Joo, Helya Hosseini, Ramyad Hadidi, Bahar Asgari
Title: Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference
Abstract:
We demonstrate that unstructured sparsity significantly improves KV cache compression for LLMs, enabling sparsity levels up to 70% without compromising accuracy or requiring fine-tuning. We conduct a systematic exploration of pruning strategies and find per-token magnitude-based pruning to be highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes. The Key cache benefits from prominent outlier elements, while the Value cache surprisingly benefits from a simple magnitude-based pruning despite its uniform distribution. KV cache size is the major bottleneck in decode performance due to high memory overhead for large context lengths. To address this, we use a bitmap-based sparse format and a custom attention kernel capable of compressing and directly computing over compressed caches pruned to arbitrary sparsity patterns, significantly accelerating memory-bound operations in decode computations and thereby compensating for the overhead of runtime pruning and compression. Our custom attention kernel coupled with the bitmap-based format delivers substantial compression of KV cache up to 45% of dense inference and thereby enables longer context length and increased tokens/sec throughput of up to 2.23x compared to dense inference. Our pruning mechanism and sparse attention kernel are available at https://github.com/dhjoo98/mustafar.
中文摘要:非结构化稀疏性通过基于幅度的逐令牌剪枝和定制稀疏注意力内核,可在不损失精度的情况下将LLM的KV缓存压缩高达70%,实现2.23倍吞吐量提升并支持更长上下文。
English Summary: Unstructured sparsity enables up to 70% KV cache compression in LLMs without accuracy loss, using per-token magnitude pruning and a custom sparse attention kernel to achieve 2.23x throughput and support longer contexts.
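
The two ideas named in the abstract, per-token magnitude-based pruning and a bitmap-based sparse format, can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions: the function names are hypothetical, the vectors are Python lists rather than GPU tensors, and nothing here reflects the repository's actual kernel or memory layout.

```python
def prune_token(vec, sparsity):
    """Zero out the smallest-magnitude entries of one token's vector.

    `sparsity` is the fraction of entries to drop (e.g. 0.5 drops half).
    """
    n_prune = int(len(vec) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune are dropped.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else v for i, v in enumerate(vec)]


def to_bitmap(vec):
    """Pack a pruned vector as (bitmap, nonzero values).

    Bit i of the bitmap is set when position i holds a kept value.
    """
    bitmap = 0
    values = []
    for i, v in enumerate(vec):
        if v != 0.0:
            bitmap |= 1 << i
            values.append(v)
    return bitmap, values


def from_bitmap(bitmap, values, length):
    """Rebuild the dense vector from its bitmap and packed values."""
    kept = iter(values)
    return [next(kept) if (bitmap >> i) & 1 else 0.0 for i in range(length)]
```

Because pruning is applied per token, each row of the cache can keep a different set of positions, which is what makes the resulting sparsity pattern unstructured.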

Authors:Jirui Qi, Shan Chen, Zidi Xiong, Raquel Fernández, Danielle S. Bitterman, Arianna Bisazza
Title: When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy
Abstract:
Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at https://github.com/Betswish/mCoT-XReasoning.
中文摘要:近期大型推理模型在英语任务中表现优异,但在多语言推理方面存在明显不足,常出现语言回退或逻辑碎片化问题,需在答案准确性与推理可读性之间权衡,而针对性训练可部分缓解这一矛盾。
English Summary: Recent large reasoning models demonstrate strong performance in English but struggle with multilingual reasoning, often reverting to English or producing fragmented logic in other languages, revealing a significant capability gap that requires balancing between answer accuracy and reasoning trace readability.

Authors:Junbo Yin, Chao Zha, Wenjia He, Chencheng Xu, Xin Gao
Title: CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models
Abstract:
Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates the de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data available at https://github.com/yinjunbo/cfpgen.
中文: CFP-Gen是一种扩散语言模型,通过整合多模态功能、序列和结构约束,能够生成具有天然蛋白质功能的新型蛋白质,并在多功能设计上实现高成功率。
English: CFP-Gen is a diffusion language model that integrates multimodal functional, sequence, and structural constraints to generate novel proteins with natural-like functionality and a high success rate in multifunctional designs.

Authors:Iknoor Singh, Carolina Scarton, Kalina Bontcheva
Title: GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification
Abstract:
The proliferation of online news and the increasing spread of misinformation necessitate robust methods for automatic data analysis. Narrative classification is emerging as an important task, since identifying what is being said online is critical for fact-checkers, policy makers and other professionals working on information studies. This paper presents our approach to SemEval 2025 Task 10 Subtask 2, which aims to classify news articles into a pre-defined two-level taxonomy of main narratives and sub-narratives across multiple languages. We propose Hierarchical Three-Step Prompting (H3Prompt) for multilingual narrative classification. Our methodology follows a three-step Large Language Model (LLM) prompting strategy, where the model first categorises an article into one of two domains (Ukraine-Russia War or Climate Change), then identifies the most relevant main narratives, and finally assigns sub-narratives. Our approach secured the top position on the English test set among 28 competing teams worldwide. The code is available at https://github.com/GateNLP/H3Prompt.
中文摘要:本文提出的H3Prompt方法采用分层三步提示策略,利用大语言模型对多语言新闻进行叙事分类,在SemEval 2025评测中荣获全球第一名。
English Summary: This paper introduces H3Prompt, a hierarchical three-step prompting method using Large Language Models for multilingual narrative classification of news articles, which ranked first on the English test set of SemEval 2025 Task 10 Subtask 2 among 28 teams.

Authors:Yupei Li, Shuaijie Shao, Manuel Milling, Björn W. Schuller
Title: Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge
Abstract:
Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection using the DAIC-WOZ dataset. We extract audio features using the pre-trained model Wav2Vec and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs to enhance diagnostic performance, specifically using a question and answer set to grant authorised knowledge to LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) compared to a base score proposed by the related original paper. The code is available at https://github.com/myxp-lyp/Depression-detection.git
中文摘要:本研究首次将大语言模型应用于多模态抑郁症检测,通过结合Wav2Vec音频特征与心理学知识增强策略,在诊断准确性上较基线分数实现了显著提升。
English Summary: This study introduces the first multimodal depression detection method using large language models (LLMs) combined with audio features from Wav2Vec and psychological knowledge integration, achieving significant improvements in diagnostic accuracy over baseline scores.

Authors:Kostas Triaridis, Panagiotis Kaliosis, E-Ro Nguyen, Jingyi Xu, Hieu Le, Dimitris Samaras
Title: Improving Contrastive Learning for Referring Expression Counting
Abstract:
Object counting has progressed from class-specific models, which count only known categories, to class-agnostic models that generalize to unseen categories. The next challenge is Referring Expression Counting (REC), where the goal is to count objects based on fine-grained attributes and contextual differences. Existing methods struggle with distinguishing visually similar objects that belong to the same category but correspond to different referring expressions. To address this, we propose C-REX, a novel contrastive learning framework, based on supervised contrastive learning, designed to enhance discriminative representation learning. Unlike prior works, C-REX operates entirely within the image space, avoiding the misalignment issues of image-text contrastive learning, thus providing a more stable contrastive signal. It also guarantees a significantly larger pool of negative samples, leading to improved robustness in the learned representations. Moreover, we showcase that our framework is versatile and generic enough to be applied to other similar tasks like class-agnostic counting. To support our approach, we analyze the key components of state-of-the-art detection-based models and identify that detecting object centroids instead of bounding boxes is the key common factor behind their success in counting tasks. We use this insight to design a simple yet effective detection-based baseline to build upon. Our experiments show that C-REX achieves state-of-the-art results in REC, outperforming previous methods by more than 22% in MAE and more than 10% in RMSE, while also demonstrating strong performance in class-agnostic counting. Code is available at https://github.com/cvlab-stonybrook/c-rex.
中文: 物体计数从类别特定模型发展到类别无关模型,当前面临指代表达式计数(REC)的挑战,我们提出的C-REX框架通过完全在图像空间运行的对比学习方法,增强了判别性表征学习和鲁棒性,从而实现了最先进的性能。
English: Object counting has evolved from class-specific to class-agnostic models, and now faces the challenge of Referring Expression Counting (REC), which our proposed C-REX framework addresses with a contrastive learning approach that operates entirely in the image space, achieving state-of-the-art results by improving discriminative representation learning and robustness.

Authors:Zhangyi Hu, Jiemin Wu, Hua Xu, Mingqian Liao, Ninghui Feng, Bo Gao, Songning Lai, Yutao Yue
Title: IMTS is Worth Time $\times$ Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction
Abstract:
Irregular Multivariate Time Series (IMTS) forecasting is challenging due to the unaligned nature of multi-channel signals and the prevalence of extensive missing data. Existing methods struggle to capture reliable temporal patterns from such data due to significant missing values. While pre-trained foundation models show potential for addressing these challenges, they are typically designed for Regularly Sampled Time Series (RTS). Motivated by the visual Mask AutoEncoder's (MAE) powerful capability for modeling sparse multi-channel information and its success in RTS forecasting, we propose VIMTS, a framework adapting Visual MAE for IMTS forecasting. To mitigate the effect of missing values, VIMTS first processes IMTS along the timeline into feature patches at equal intervals. These patches are then complemented using learned cross-channel dependencies. Then it leverages visual MAE's capability in handling sparse multichannel data for patch reconstruction, followed by a coarse-to-fine technique to generate precise predictions from focused contexts. In addition, we integrate self-supervised learning for improved IMTS modeling by adapting the visual MAE to IMTS data. Extensive experiments demonstrate VIMTS's superior performance and few-shot capability, advancing the application of visual foundation models in more general time series tasks. Our code is available at https://github.com/WHU-HZY/VIMTS.
中文摘要:VIMTS框架通过将不规则多元时间序列处理为等间隔特征补丁,利用跨通道依赖关系和自监督学习,有效解决了多通道信号不对齐和缺失数据带来的预测挑战,显著提升了预测性能。
English Summary: The VIMTS framework adapts the Visual Mask AutoEncoder to effectively forecast Irregular Multivariate Time Series by processing data into patches, leveraging cross-channel dependencies, and employing self-supervised learning to overcome challenges posed by missing values and unaligned signals.

Authors:Andrew Zhu, Evan Osgood, Chris Callison-Burch
Title: First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay
Abstract:
Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call "overhearing agents". These overhearing agents do not actively participate in conversation -- instead, they "listen in" on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.
中文: 本文提出了“旁听智能体”的新范式,通过《龙与地下城》案例展示了多模态音频语言模型如何被动监听人类对话以执行后台任务或提供辅助,并开源相关代码以推动该领域研究。
English: This paper introduces "overhearing agents," a novel paradigm where LLM agents passively monitor human conversations to perform background tasks or offer assistance, demonstrated through a Dungeons & Dragons case study using multimodal audio-language models and released with open-source code for further research.

Authors:Anton Björklund, Mykola Zaitsev, Marta Kwiatkowska
Title: Efficient Preimage Approximation for Neural Network Certification
Abstract:
The growing reliance on artificial intelligence in safety- and security-critical applications demands effective neural network certification. A challenging real-world use case is "patch attacks", where adversarial patches or lighting conditions obscure parts of images, for example, traffic signs. A significant step towards certification against patch attacks was recently achieved using PREMAP, which uses under- and over-approximations of the preimage, the set of inputs that lead to a specified output, for the certification. While the PREMAP approach is versatile, it is currently limited to fully-connected neural networks of moderate dimensionality. In order to tackle broader real-world use cases, we present novel algorithmic extensions to PREMAP involving tighter bounds, adaptive Monte Carlo sampling, and improved branching heuristics. Firstly, we demonstrate that these efficiency improvements significantly outperform the original PREMAP and enable scaling to convolutional neural networks that were previously intractable. Secondly, we showcase the potential of preimage approximation methodology for analysing and certifying reliability and robustness on a range of use cases from computer vision and control.
中文: 该摘要介绍了对PREMAP方法的算法扩展,通过优化边界、自适应采样和分支策略,实现了对卷积神经网络的高效认证,并在计算机视觉和控制用例中展现出更强的抗补丁攻击鲁棒性。
English: This abstract introduces enhanced algorithmic extensions to the PREMAP method, enabling more efficient neural network certification against patch attacks by scaling to convolutional networks and demonstrating improved robustness in computer vision and control applications.

Authors:Mert Onur Cakiroglu, Idil Bilge Altun, Mehmet Dalkilic, Elham Buxton, Hasan Kurban
Title: Multivariate de Bruijn Graphs: A Symbolic Graph Framework for Time Series Forecasting
Abstract:
Time series forecasting remains a challenging task for foundation models due to temporal heterogeneity, high dimensionality, and the lack of inherent symbolic structure. In this work, we propose DRAGON (Discrete Representation and Augmented Graph encoding Over de BruijN Graphs), a novel encoder that introduces Multivariate de Bruijn Graphs (MdBGs) to bridge the gap between symbolic representations and neural modeling. DRAGON discretizes continuous input sequences and maps them onto a fixed graph structure, enabling dynamic context recovery via graph-based attention. Integrated as an auxiliary module within a dual-branch architecture, DRAGON augments conventional CNN-based encoders with symbolic, structure-aware representations. All code developed for this study is available at: https://github.com/KurbanIntelligenceLab/MultdBG-Time-Series-Library
中文:DRAGON编码器通过多元德布鲁因图将符号表示与神经建模相结合,在双分支架构中离散化序列并利用基于图的注意力增强传统CNN编码器,以解决时间序列预测的挑战。
English: The DRAGON encoder introduces Multivariate de Bruijn Graphs to bridge symbolic representations with neural modeling for time series forecasting, discretizing sequences and enhancing CNN-based encoders with graph-based attention in a dual-branch architecture.
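
To make the "discretize, then map onto a graph" idea concrete, here is a minimal univariate sketch: a series is quantized into equal-width bins and consecutive overlapping k-tuples of symbols become nodes of a de Bruijn-style graph. This is a simplified illustration, not DRAGON's multivariate construction; the function names, the equal-width binning, and the edge-count representation are all assumptions.

```python
from collections import Counter


def discretize(series, n_bins):
    """Map each value to an integer symbol in [0, n_bins) using equal-width bins."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant series
    return [min(int((x - lo) / width), n_bins - 1) for x in series]


def de_bruijn_edges(symbols, k):
    """Count transitions between consecutive overlapping k-tuples of symbols.

    Each node is a k-tuple; an edge joins two nodes that overlap in k - 1
    symbols because they come from adjacent positions in the sequence.
    """
    nodes = [tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1)]
    return Counter(zip(nodes, nodes[1:]))
```

For a periodic series, the edge counts concentrate on a small cycle of nodes, which gives a rough sense of how recurring patterns show up as structure in the graph.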

Authors:Marco Colussi, Dragan Ahmetovic, Sergio Mascetti
Title: MIAS-SAM: Medical Image Anomaly Segmentation without thresholding
Abstract:
This paper presents MIAS-SAM, a novel approach for the segmentation of anomalous regions in medical images. MIAS-SAM uses a patch-based memory bank to store relevant image features, which are extracted from normal data using the SAM encoder. At inference time, the embedding patches extracted from the SAM encoder are compared with those in the memory bank to obtain the anomaly map. Finally, MIAS-SAM computes the center of gravity of the anomaly map to prompt the SAM decoder, obtaining an accurate segmentation from the previously extracted features. Unlike prior works, MIAS-SAM does not require defining a threshold value to obtain the segmentation from the anomaly map. Experimental results on three publicly available datasets, each with a different imaging modality (Brain MRI, Liver CT, and Retina OCT), show accurate anomaly segmentation capabilities measured using the DICE score. The code is available at: https://github.com/warpcut/MIAS-SAM
中文: MIAS-SAM提出了一种基于图像块记忆库的新方法,利用SAM编码器从正常数据提取特征进行医学图像异常区域分割,无需设定阈值即可在多种成像模态中实现精准分割。
English: MIAS-SAM introduces a patch-based memory bank approach using SAM encoder features from normal data to segment anomalies in medical images without thresholding, achieving high accuracy across multiple imaging modalities.
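
The pipeline in the abstract (compare patch embeddings with a memory bank, then take the center of gravity of the anomaly map as a point prompt) can be sketched with plain lists. This toy version assumes squared Euclidean distance to the nearest memory-bank entry as the anomaly score; the function names and the scoring choice are illustrative assumptions, and the SAM encoder and decoder are not involved.

```python
def anomaly_map(patch_grid, memory_bank):
    """Score each patch by squared distance to its nearest memory-bank entry.

    `patch_grid` is a 2D grid (rows of patches) of embedding vectors;
    `memory_bank` is a list of embeddings taken from normal data.
    """
    def nearest(vec):
        return min(
            sum((a - b) ** 2 for a, b in zip(vec, ref)) for ref in memory_bank
        )

    return [[nearest(vec) for vec in row] for row in patch_grid]


def center_of_gravity(amap):
    """Return the score-weighted mean (row, col) of the map, or None if all zero."""
    total = sum(sum(row) for row in amap)
    if total == 0:
        return None
    cy = sum(y * v for y, row in enumerate(amap) for v in row) / total
    cx = sum(x * v for row in amap for x, v in enumerate(row)) / total
    return cy, cx
```

The returned coordinate is a single point derived from the whole map, which is why no threshold on the anomaly scores is needed to produce a prompt.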

Authors:Tian Qin, Core Francisco Park, Mujin Kwun, Aaron Walsman, Eran Malach, Nikhil Anand, Hidenori Tanaka, David Alvarez-Melis
Title: Decomposing Elements of Problem Solving: What "Math" Does RL Teach?
Abstract:
Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.
中文摘要:强化学习方法如GRPO主要提升了大型语言模型在数学推理中的执行稳健性,但由于规划能力不足而面临“覆盖墙”的局限,不过控制实验表明通过改进探索机制可能找到突破这一障碍的潜在路径。
English Summary: Reinforcement learning methods like GRPO primarily enhance LLMs' execution robustness in mathematical reasoning but face a 'coverage wall' due to insufficient planning skills, though controlled experiments suggest potential pathways to overcome this limitation through improved exploration.

Authors:Rafik Mankour, Yassine Chafai, Hamada Saleh, Ghassen Ben Hassine, Thibaud Barreau, Peter Tankov
Title: Climate Finance Bench
Abstract:
Climate Finance Bench introduces an open benchmark that targets question-answering over corporate climate disclosures using Large Language Models. We curate 33 recent sustainability reports in English drawn from companies across all 11 GICS sectors and annotate 330 expert-validated question-answer pairs that span pure extraction, numerical reasoning, and logical reasoning. Building on this dataset, we propose a comparison of RAG (retrieval-augmented generation) approaches. We show that the retriever's ability to locate passages that actually contain the answer is the chief performance bottleneck. We further argue for transparent carbon reporting in AI-for-climate applications, highlighting advantages of techniques such as Weight Quantization.
中文摘要:Climate Finance Bench推出一个针对企业气候披露的开放问答基准,通过专家验证数据集揭示检索准确性是主要性能瓶颈,并倡导在气候AI应用中采用量化权重等透明碳报告技术。
English Summary: Climate Finance Bench introduces an open benchmark for evaluating LLM-based question-answering on corporate climate reports, identifying retrieval accuracy as the key performance bottleneck while advocating for transparent carbon reporting in AI applications.

Authors:Tamas Spisak, Karl Friston
Title: Self-orthogonalizing attractor neural networks emerging from the free energy principle
Abstract:
Attractor dynamics are a hallmark of many complex systems, including the brain. Understanding how such self-organizing dynamics emerge from first principles is crucial for advancing our understanding of neuronal computations and the design of artificial intelligence systems. Here we formalize how attractor networks emerge from the free energy principle applied to a universal partitioning of random dynamical systems. Our approach obviates the need for explicitly imposed learning and inference rules and identifies emergent, but efficient and biologically plausible inference and learning dynamics for such self-organizing systems. These result in a collective, multi-level Bayesian active inference process. Attractors on the free energy landscape encode prior beliefs; inference integrates sensory data into posterior beliefs; and learning fine-tunes couplings to minimize long-term surprise. Analytically and via simulations, we establish that the proposed networks favor approximately orthogonalized attractor representations, a consequence of simultaneously optimizing predictive accuracy and model complexity. These attractors efficiently span the input subspace, enhancing generalization and the mutual information between hidden causes and observable effects. Furthermore, while random data presentation leads to symmetric and sparse couplings, sequential data fosters asymmetric couplings and non-equilibrium steady-state dynamics, offering a natural extension to conventional Boltzmann Machines. Our findings offer a unifying theory of self-organizing attractor networks, providing novel insights for AI and neuroscience.
中文摘要:本研究基于自由能原理构建了吸引子网络的形成机制,揭示了通过吸引子编码信念并优化表征的自组织系统,能够执行贝叶斯主动推理并提升泛化能力。
English Summary: This study formalizes how attractor networks emerge from the free energy principle, revealing self-organizing systems that perform Bayesian active inference through attractors encoding beliefs and optimizing representations for enhanced generalization.

Authors:Yannick Stade, Wan-Hsuan Lin, Jason Cong, Robert Wille
Title: Routing-Aware Placement for Zoned Neutral Atom-based Quantum Computing
Abstract:
Quantum computing promises to solve previously intractable problems, with neutral atoms emerging as a promising technology. Zoned neutral atom architectures allow for immense parallelism and higher coherence times by shielding idling atoms from interference with laser beams. However, in addition to hardware, successful quantum computation requires sophisticated software support, particularly compilers that optimize quantum algorithms for hardware execution. In the compilation flow for zoned neutral atom architectures, the effective interplay of the placement and routing stages decides the overhead caused by rearranging the atoms during the quantum computation. Sub-optimal placements can lead to unnecessary serialization of the rearrangements in the subsequent routing stage. Despite this, all existing compilers treat placement and routing independently thus far - focusing solely on minimizing travel distances. This work introduces the first routing-aware placement method to address this shortcoming. It groups compatible movements into parallel rearrangement steps to minimize both rearrangement steps and travel distances. The implementation utilizing the A* algorithm reduces the rearrangement time by 17% on average and by 49% in the best case compared to the state-of-the-art. The complete code is publicly available in open-source as part of the Munich Quantum Toolkit (MQT) at https://github.com/munich-quantum-toolkit/qmap.
中文摘要:本研究首次提出了针对分区中性原子量子计算机的路由感知布局方法,通过将兼容移动分组并行处理,相比现有技术平均减少17%的重排时间。
English Summary: This work introduces the first routing-aware placement method for zoned neutral atom quantum computers, which groups compatible movements to reduce rearrangement time by 17% on average compared to existing approaches.

Authors:Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, Yimeng Wang, Kai Yu, Wenxuan Chen, Ziwei Feng, Zijian Gong, Jianzhuang Pan, Yi Peng, Rui Tian, Siyu Wang, Bo Zhao, Ting Yao, Tao Mei
Title: HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer
Abstract:
Recent advancements in image generative foundation models have prioritized quality improvements but often at the cost of increased computational complexity and inference latency. To address this critical trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds. HiDream-I1 is constructed with a new sparse Diffusion Transformer (DiT) structure. Specifically, it starts with a dual-stream decoupled design of sparse DiT with dynamic Mixture-of-Experts (MoE) architecture, in which two separate encoders are first involved to independently process image and text tokens. Then, a single-stream sparse DiT structure with dynamic MoE architecture is adopted to trigger multi-model interaction for image generation in a cost-efficient manner. To support flexible accessibility with varied model capabilities, we provide HiDream-I1 in three variants: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. Furthermore, we go beyond the typical text-to-image generation and remould HiDream-I1 with additional image conditions to perform precise, instruction-based editing on given images, yielding a new instruction-based image editing model namely HiDream-E1. Ultimately, by integrating text-to-image generation and instruction-based image editing, HiDream-I1 evolves to form a comprehensive image agent (HiDream-A1) capable of fully interactive image creation and refinement. To accelerate multi-modal AIGC research, we have open-sourced all the codes and model weights of HiDream-I1-Full, HiDream-I1-Dev, HiDream-I1-Fast, HiDream-E1 through our project websites: https://github.com/HiDream-ai/HiDream-I1 and https://github.com/HiDream-ai/HiDream-E1. All features can be directly experienced via https://vivago.ai/studio.
中文: HiDream-I1是一个拥有170亿参数的开源图像生成模型,采用新型稀疏扩散变换器和动态专家混合架构,可在数秒内实现顶尖图像生成质量,并能扩展为支持文本生成图像和指令编辑的完整图像代理系统。
English: HiDream-I1 is a 17B-parameter open-source image generation model that achieves state-of-the-art quality within seconds using a novel sparse Diffusion Transformer structure with dynamic Mixture-of-Experts, and it evolves into a comprehensive image agent supporting both text-to-image generation and instruction-based editing.

Authors:Filippo Rinaldi, Giacomo Capitani, Lorenzo Bonicelli, Donato Crisostomi, Federico Bolelli, Elisa Ficarra, Emanuele Rodolà, Simone Calderara, Angelo Porrello
Title: Update Your Transformer to the Latest Release: Re-Basin of Task Vectors
Abstract:
Foundation models serve as the backbone for numerous specialized models developed through fine-tuning. However, when the underlying pretrained model is updated or retrained (e.g., on larger and more curated datasets), the fine-tuned model becomes obsolete, losing its utility and requiring retraining. This raises the question: is it possible to transfer fine-tuning to a new release of the model? In this work, we investigate how to transfer fine-tuning to a new checkpoint without having to re-train, in a data-free manner. To do so, we draw principles from model re-basin and provide a recipe based on weight permutations to re-base the modifications made to the original base model, often called task vector. In particular, our approach tailors model re-basin for Transformer models, taking into account the challenges of residual connections and multi-head attention layers. Specifically, we propose a two-level method rooted in spectral theory, initially permuting the attention heads and subsequently adjusting parameters within select pairs of heads. Through extensive experiments on visual and textual tasks, we achieve the seamless transfer of fine-tuned knowledge to new pre-trained backbones without relying on a single training step or datapoint. Code is available at https://github.com/aimagelab/TransFusion.
Chinese: 本研究提出了一种无需数据的迁移方法,通过基于模型重定基原理的权重置换,将微调知识无缝转移到更新的预训练模型上,特别针对Transformer的残差连接和多头注意力层进行了优化。
English: This study introduces a data-free method to transfer fine-tuned knowledge to updated pre-trained models by applying weight permutations based on model re-basin principles, specifically tailored for Transformers to handle residual connections and multi-head attention layers.
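
The task-vector idea in this abstract can be illustrated with a minimal sketch for a single weight matrix. It assumes the aligning permutation `perm` is already known. The paper's two-level spectral procedure for finding permutations over attention heads is not reproduced here, and the function name and toy data are hypothetical.

```python
import numpy as np

def transfer_task_vector(theta_old, theta_ft, theta_new, perm):
    """Re-base a task vector onto a new checkpoint (single weight matrix).

    perm is an ASSUMED, already-known permutation of output units (rows)
    that aligns the old base model with the new one.
    """
    tau = theta_ft - theta_old      # task vector: what fine-tuning changed
    tau_aligned = tau[perm, :]      # reorder rows to match the new checkpoint
    return theta_new + tau_aligned  # apply the aligned change, no training step

rng = np.random.default_rng(0)
theta_old = rng.normal(size=(4, 3))
theta_ft = theta_old + 0.1 * rng.normal(size=(4, 3))

# Toy "new release": the same weights with hidden units reordered.
perm = np.array([2, 0, 3, 1])
theta_new = theta_old[perm, :]

theta_transferred = transfer_task_vector(theta_old, theta_ft, theta_new, perm)
```

In this toy case the new checkpoint differs from the old one only by a reordering of units, so the transferred weights equal the fine-tuned weights under the same reordering. Real checkpoints differ by more than a permutation, which is why the alignment step is the hard part.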

Authors:Pawan Neupane, Jian Liu, Jianlin Cheng
Title: PSBench: a large-scale benchmark for estimating the accuracy of protein complex structural models
Abstract:
Predicting protein complex structures is essential for protein function analysis, protein design, and drug discovery. While AI methods like AlphaFold can predict accurate structural models for many protein complexes, reliably estimating the quality of these predicted models (estimation of model accuracy, or EMA) for model ranking and selection remains a major challenge. A key barrier to developing effective machine learning-based EMA methods is the lack of large, diverse, and well-annotated datasets for training and evaluation. To address this gap, we introduce PSBench, a benchmark suite comprising four large-scale, labeled datasets generated during the 15th and 16th community-wide Critical Assessment of Protein Structure Prediction (CASP15 and CASP16). PSBench includes over one million structural models covering a wide range of protein sequence lengths, complex stoichiometries, functional classes, and modeling difficulties. Each model is annotated with multiple complementary quality scores at the global, local, and interface levels. PSBench also provides multiple evaluation metrics and baseline EMA methods to facilitate rigorous comparisons. To demonstrate PSBench's utility, we trained and evaluated GATE, a graph transformer-based EMA method, on the CASP15 data. GATE was blindly tested in CASP16 (2024), where it ranked among the top-performing EMA methods. These results highlight PSBench as a valuable resource for advancing EMA research in protein complex modeling. PSBench is publicly available at: https://github.com/BioinfoMachineLearning/PSBench.
中文: PSBench通过提供包含超过百万标注结构模型的综合基准,解决了评估蛋白质复合物模型质量的难题,促进了如图变换器方法GATE等机器学习技术的发展,该方法在盲测中表现优异。
English: PSBench addresses the challenge of estimating protein complex model quality by providing a comprehensive benchmark with over one million annotated structural models, facilitating the development of machine learning methods like the graph transformer-based GATE, which demonstrated top performance in blind testing.

Authors:Kaiyu Yue, Vasu Singla, Menglin Jia, John Kirchenbauer, Rifaa Qadri, Zikui Cai, Abhinav Bhatele, Furong Huang, Tom Goldstein
Title: Zero-Shot Vision Encoder Grafting via LLM Surrogates
Abstract:
Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a potentially promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting -- when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by ~45% when using Llama-70B as the decoder. The code is at https://github.com/facebookresearch/zero.
中文: 视觉语言模型通过先用共享目标大语言模型嵌入空间的小型替代模型训练视觉编码器,实现直接迁移,不仅性能相当,还能将训练成本降低约45%。
English: Vision language models can reduce training costs by first training the vision encoder with a small surrogate model that shares the target large language model's embedding space, enabling direct transfer and achieving comparable performance while cutting costs by about 45%.

Authors:Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu
Title: VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Abstract:
Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4% of the original performance. Code is available at https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan.
中文: 近期大型视觉语言模型因冗长视觉标记导致计算效率低下,为此提出VScan框架,通过整合全局-局部扫描与中间层剪枝的两阶段方法减少标记冗余,在加速推理的同时保持高性能。
English: Recent Large Vision-Language Models face computational inefficiency from lengthy visual tokens, prompting the development of VScan, a two-stage framework that reduces token redundancy through integrated global-local scans and intermediate layer pruning to accelerate inference while maintaining high performance.

Authors:Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
Title: The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
Abstract:
Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as "first, I need to", without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
中文: 该研究发现大语言模型对奖励噪声具有强鲁棒性,通过仅奖励关键推理模式而不验证答案正确性即可达到相近性能,为改进预训练与后训练技术提供了新思路。
English: This study reveals that large language models exhibit strong robustness to reward noise, achieving comparable performance through reasoning pattern rewards without strict correctness verification, and provides insights for enhancing both pre-training and post-training techniques.
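
The two reward variants described in this abstract can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the phrase list, the exact-match correctness check, and the "fraction of phrases matched" scoring are all assumptions made for the example.

```python
import random

def correctness_reward(answer, reference):
    """Binary reward: 1.0 if the answer matches the reference exactly."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def flip_reward(reward, flip_prob, rng):
    """With probability flip_prob, replace a binary reward r with 1 - r."""
    return 1.0 - reward if rng.random() < flip_prob else reward

def reasoning_pattern_reward(response, key_phrases):
    """Reward the presence of reasoning phrases; correctness is never checked.

    Scoring as the fraction of phrases found is an assumption of this sketch.
    """
    text = response.lower()
    hits = sum(1 for phrase in key_phrases if phrase in text)
    return hits / len(key_phrases)

# Hypothetical phrase list; only the first phrase appears in the abstract.
KEY_PHRASES = ("first, i need to", "let me check")

rng = random.Random(0)
clean = [correctness_reward("42", "42") for _ in range(10000)]
noisy = [flip_reward(r, 0.4, rng) for r in clean]
flipped_fraction = sum(1 for c, n in zip(clean, noisy) if c != n) / len(clean)

rpr = reasoning_pattern_reward("First, I need to add the numbers.", KEY_PHRASES)
```

Here `flipped_fraction` lands close to the configured 40% noise level, and `rpr` depends only on which phrases occur in the response, not on whether the final answer is right.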

Authors:Guoxuan Chen, Lianghao Xia, Chao Huang
Title: Pre-training for Recommendation Unlearning
Abstract:
Modern recommender systems powered by Graph Neural Networks (GNNs) excel at modeling complex user-item interactions, yet increasingly face scenarios requiring selective forgetting of training data. Beyond user requests to remove specific interactions due to privacy concerns or preference changes, regulatory frameworks mandate recommender systems' ability to eliminate the influence of certain user data from models. This recommendation unlearning challenge presents unique difficulties as removing connections within interaction graphs creates ripple effects throughout the model, potentially impacting recommendations for numerous users. Traditional approaches suffer from significant drawbacks: fragmentation methods damage graph structure and diminish performance, while influence function techniques make assumptions that may not hold in complex GNNs, particularly with self-supervised or random architectures. To address these limitations, we propose UnlearnRec, a novel model-agnostic pre-training paradigm that prepares systems for efficient unlearning operations. Our Influence Encoder takes unlearning requests together with existing model parameters and directly produces the updated parameters of the unlearned model with little fine-tuning, avoiding complete retraining while preserving model performance characteristics. Extensive evaluation on public benchmarks demonstrates that our method delivers exceptional unlearning effectiveness while providing more than 10x speedup compared to retraining approaches. We release our method implementation at: https://github.com/HKUDS/UnlearnRec.
Chinese: 提出的UnlearnRec框架通过影响编码器直接更新模型参数,实现了推荐系统的高效数据遗忘,相比重新训练方法在保持性能的同时获得了超过10倍的速度提升。
English: The proposed UnlearnRec framework enables efficient data removal in recommender systems by using an Influence Encoder to directly update model parameters, achieving over 10x faster unlearning while maintaining performance compared to retraining.

Authors:Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Title: WebDancer: Towards Autonomous Information Seeking Agency
Abstract:
Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end information-seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectory sampling, (3) supervised fine-tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in WebDancer, a web agent based on the ReAct framework. Empirical evaluations on the challenging information-seeking benchmarks GAIA and WebWalkerQA demonstrate the strong performance of WebDancer and highlight the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models. The code and demo will be released at https://github.com/Alibaba-NLP/WebAgent.
中文: 本文提出了一种构建端到端自主信息检索智能体的完整训练范式,通过四阶段训练流程在基准测试中表现出色,为开发更强大的智能体模型提供了系统路径。
English: This paper introduces a comprehensive training framework for developing autonomous information-seeking agents, which achieves strong performance on benchmarks through a four-stage process including data construction and reinforcement learning.

Authors:Zhe Kong, Feng Gao, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Xunliang Cai, Guanying Chen, Wenhan Luo
Title: Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Abstract:
Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and appealing visual quality videos. However, existing methods primarily focus on single human animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit limitations in instruction-following capabilities. To solve this problem, in this paper, we propose a novel task: Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges during multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio and person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.
中文摘要:本文提出MultiTalk框架,通过标签旋转位置嵌入解决多人对话视频生成中的音画绑定问题,并采用部分参数训练和多任务训练保持基础模型的指令跟随能力,在多个数据集上展现出优于现有方法的生成性能。
English Summary: The paper introduces MultiTalk, a novel framework for multi-person conversational video generation that resolves audio-person binding issues with Label Rotary Position Embedding and maintains instruction-following ability through specialized training techniques, outperforming existing methods across multiple datasets.

Authors:Hanjia Lyu, Jiebo Luo, Jian Kang, Allison Koenecke
Title: Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese
Abstract:
While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models -- spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).
中文:本研究揭示了大型语言模型在处理简体与繁体中文时存在任务依赖性的性能差异,这些偏差受训练数据和语言特征影响,并提供了开源基准以促进未来跨中文变体的模型评估。
English: This study investigates performance disparities in Large Language Models (LLMs) when processing Simplified versus Traditional Chinese, revealing task-dependent biases influenced by training data and linguistic differences, and provides an open-source benchmark for future evaluations.

Authors:Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
Title: RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
Abstract:
Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness and outperforms most baselines by approximately 10% on both CapsBench and CompreCap. Code released at https://github.com/wangyuchi369/RICO.
中文: RICO框架通过将描述重构为参考图像并识别差异来迭代优化图像描述,提升准确性和完整性,同时RICO-Flash利用DPO提高效率,在多个基准测试中显著优于现有方法。
English: The proposed RICO framework iteratively refines image captions by reconstructing them into reference images and identifying discrepancies to enhance accuracy and completeness, with RICO-Flash optimizing efficiency through DPO, achieving significant improvements over existing methods.
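
The reconstruct-compare-refine loop described in this abstract can be sketched as below. The three callables stand in for a text-to-image model and an MLLM and are hypothetical placeholders; stopping early when no discrepancy is found, and the `max_iters` default, are assumptions of this sketch rather than details from the paper.

```python
def refine_caption(image, caption, text_to_image, find_discrepancies, revise,
                   max_iters=3):
    """Iteratively refine a caption via visual reconstruction.

    text_to_image, find_discrepancies and revise are placeholder callables.
    """
    for _ in range(max_iters):
        reconstruction = text_to_image(caption)          # caption -> image
        diffs = find_discrepancies(image, reconstruction)
        if not diffs:                                    # assumed early stop
            break
        caption = revise(caption, diffs)                 # fix the caption
    return caption

# Toy stand-ins: an "image" is a set of object names, a caption is a
# space-separated list of the objects it mentions.
toy_image = {"cat", "sofa", "lamp"}
toy_t2i = lambda caption: set(caption.split())
toy_diff = lambda image, recon: sorted(image - recon)
toy_revise = lambda caption, diffs: caption + " " + diffs[0]

final_caption = refine_caption(toy_image, "cat", toy_t2i, toy_diff, toy_revise)
```

In the toy run each iteration adds one missing object, so the caption grows from mentioning only the cat to mentioning all three objects.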

Authors:Dmitrii Sorokin, Maksim Nakhodnov, Andrey Kuznetsov, Aibek Alanov
Title: ImageReFL: Balancing Quality and Diversity in Human-Aligned Diffusion Models
Abstract:
Recent advances in diffusion models have led to impressive image generation capabilities, but aligning these models with human preferences remains challenging. Reward-based fine-tuning using models trained on human feedback improves alignment but often harms diversity, producing less varied outputs. In this work, we address this trade-off with two contributions. First, we introduce combined generation, a novel sampling strategy that applies a reward-tuned diffusion model only in the later stages of the generation process, while preserving the base model for earlier steps. This approach mitigates early-stage overfitting and helps retain global structure and diversity. Second, we propose ImageReFL, a fine-tuning method that improves image diversity with minimal loss in quality by training on real images and incorporating multiple regularizers, including diffusion and ReFL losses. Our approach outperforms conventional reward tuning methods on standard quality and diversity metrics. A user study further confirms that our method better balances human preference alignment and visual diversity. The source code can be found at https://github.com/ControlGenAI/ImageReFL.
中文摘要:本文提出了一种结合新型采样策略和微调方法的两部分方案,有效提升了扩散模型在人类偏好对齐与图像多样性之间的平衡能力,其表现优于传统的奖励调优方法。
English Summary: This paper introduces a two-part approach combining a novel sampling strategy and a fine-tuning method to enhance diffusion models by better balancing human preference alignment with image diversity, outperforming traditional reward tuning methods.
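
The combined generation strategy amounts to switching models partway through sampling. A minimal sketch follows, assuming each model is exposed as a single-step function. The step functions, the `switch_step` parameter, and the toy arithmetic "denoisers" are hypothetical and only show the control flow.

```python
def combined_generation(x, num_steps, switch_step, base_step, tuned_step):
    """Use the base model for early steps and the reward-tuned model later.

    base_step and tuned_step are placeholder callables (x, t) -> x.
    Returns the final sample and a log of which model handled each step.
    """
    used = []
    for t in range(num_steps):
        if t < switch_step:
            x = base_step(x, t)     # early steps: base model keeps diversity
            used.append("base")
        else:
            x = tuned_step(x, t)    # late steps: reward-tuned model refines
            used.append("tuned")
    return x, used

# Toy stand-ins for denoising steps, chosen so the result is easy to trace.
toy_base = lambda x, t: x + 1
toy_tuned = lambda x, t: x * 2

sample, schedule = combined_generation(0, 5, 3, toy_base, toy_tuned)
```

Where to place the switch is the main design choice: a later switch preserves more of the base model's global structure, while an earlier one gives the reward-tuned model more influence.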

Authors:Ethan Chern, Zhulin Hu, Steffi Chern, Siqi Kou, Jiadi Su, Yan Ma, Zhijie Deng, Pengfei Liu
Title: Thinking with Generated Images
Abstract:
We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to 50% (from 38% to 57%) relative improvement in handling complex multi-object scenarios. From biochemists exploring novel protein structures, and architects iterating on spatial designs, to forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.
中文摘要:本文提出“生成图像思维”新范式,通过让大型多模态模型自主生成并优化中间视觉思维步骤,从根本上改变了视觉推理方式,在复杂场景中实现高达50%的相对性能提升。
English Summary: This paper introduces "Thinking with Generated Images," a paradigm that enhances large multimodal models' visual reasoning by enabling them to spontaneously generate and refine intermediate visual thoughts, achieving up to 50% relative improvement in complex scenarios.
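
The "up to 50% (from 38% to 57%)" figure in this abstract is a relative gain, not a gain in percentage points. A short worked check, using only the two numbers quoted in the abstract:

```python
def relative_improvement(before, after):
    """Relative gain of `after` over `before`, as a fraction of `before`."""
    return (after - before) / before

absolute_gain_points = 57 - 38            # gain in percentage points
relative_gain = relative_improvement(38, 57)
```

The absolute gain is 19 percentage points, and 19 / 38 gives the 50% relative improvement reported.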

Authors:Wenjie Sun, Bingzhe Wu, Zhile Yang, Chengke Wu
Title: Sparsification and Reconstruction from the Perspective of Representation Geometry
Abstract:
Sparse Autoencoders (SAEs) have emerged as a predominant tool in mechanistic interpretability, aiming to identify interpretable monosemantic features. However, how does sparse encoding organize the representations of activation vectors from language models? What is the relationship between this organizational paradigm and feature disentanglement as well as reconstruction performance? To address these questions, we propose the SAEMA, which validates the stratified structure of the representation by observing the variability of the rank of the symmetric semipositive definite (SSPD) matrix corresponding to the modal tensor unfolded along the latent tensor with the level of noise added to the residual stream. To systematically investigate how sparse encoding alters representational structures, we define local and global representations, demonstrating that they amplify inter-feature distinctions by merging similar semantic features and introducing additional dimensionality. Furthermore, we intervene on the global representation from an optimization perspective, proving a significant causal relationship between their separability and the reconstruction performance. This study explains the principles of sparsity from the perspective of representational geometry and demonstrates the impact of changes in representational structure on reconstruction performance. It particularly emphasizes the necessity of understanding representations and incorporating representational constraints, providing empirical references for developing new interpretable tools and improving SAEs. The code is available at https://github.com/wenjie1835/SAERepGeo.
中文摘要:本研究提出SAEMA分析稀疏自编码器如何组织语言模型表征,发现稀疏编码通过几何重构增强特征区分度,并直接影响重建性能。
English Summary: This study introduces SAEMA to analyze how sparse autoencoders organize language model representations, revealing that sparse encoding enhances feature distinctiveness through geometric restructuring and directly impacts reconstruction performance.

Authors:Zobia Batool, Huseyin Ozkan, Erchan Aptoula
Title: Single Domain Generalization for Alzheimer's Detection from 3D MRIs with Pseudo-Morphological Augmentations and Contrastive Learning
Abstract:
Although Alzheimer's disease detection via MRIs has advanced significantly thanks to contemporary deep learning models, challenges such as class imbalance, protocol variations, and limited dataset diversity often hinder their generalization capacity. To address this issue, this article focuses on the single domain generalization setting, where given the data of one domain, a model is designed and developed with maximal performance w.r.t. an unseen domain of distinct distribution. Since brain morphology is known to play a crucial role in Alzheimer's diagnosis, we propose the use of learnable pseudo-morphological modules aimed at producing shape-aware, anatomically meaningful class-specific augmentations in combination with a supervised contrastive learning module to extract robust class-specific representations. Experiments conducted across three datasets show improved performance and generalization capacity, especially under class imbalance and imaging protocol variations. The source code will be made available upon acceptance at https://github.com/zobia111/SDG-Alzheimer.
中文: 本研究通过引入可学习的伪形态学模块和监督对比学习,解决了阿尔茨海默病MRI检测中的泛化难题,在类别不平衡和成像协议差异下显著提升了模型性能与跨域适应能力。
English: This study tackles Alzheimer's disease detection challenges in MRI analysis by introducing learnable pseudo-morphological modules and supervised contrastive learning to enhance generalization across diverse domains, demonstrating improved performance despite class imbalance and protocol variations.

Authors:Qiucheng Yu, Yuan Xie, Xin Tan
Title: SHTOcc: Effective 3D Occupancy Prediction with Sparse Head and Tail Voxels
Abstract:
3D occupancy prediction has attracted much attention in the field of autonomous driving due to its powerful geometric perception and object recognition capabilities. However, existing methods have not explored the most essential distribution patterns of voxels, resulting in unsatisfactory results. This paper first explores the inter-class distribution and geometric distribution of voxels, thereby solving the long-tail problem caused by the inter-class distribution and the poor performance caused by the geometric distribution. Specifically, this paper proposes SHTOcc (Sparse Head-Tail Occupancy), which uses sparse head-tail voxel construction to accurately identify and balance key voxels in the head and tail classes, while using decoupled learning to reduce the model's bias towards the dominant (head) category and enhance the focus on the tail class. Experiments show that significant improvements have been made on multiple baselines: SHTOcc reduces GPU memory usage by 42.2%, increases inference speed by 58.6%, and improves accuracy by about 7%, verifying its effectiveness and efficiency. The code is available at https://github.com/ge95net/SHTOcc
中文摘要:本文提出SHTOcc方法,通过稀疏头尾体素构建和解耦学习解决体素分布不平衡问题,在内存效率、推理速度和精度方面均实现显著提升。
English Summary: This paper introduces SHTOcc, a 3D occupancy prediction method that addresses voxel distribution imbalances through sparse head-tail voxel construction and decoupled learning, achieving significant improvements in memory efficiency, inference speed, and accuracy.

Authors:Seun-An Choe, Keon-Hee Park, Jinwoo Choi, Gyeong-Moon Park
Title: Universal Domain Adaptation for Semantic Segmentation
Abstract:
Unsupervised domain adaptation for semantic segmentation (UDA-SS) aims to transfer knowledge from labeled source data to unlabeled target data. However, traditional UDA-SS methods assume that category settings between source and target domains are known, which is unrealistic in real-world scenarios. This leads to performance degradation if private classes exist. To address this limitation, we propose Universal Domain Adaptation for Semantic Segmentation (UniDA-SS), achieving robust adaptation even without prior knowledge of category settings. We define the problem in the UniDA-SS scenario as low confidence scores of common classes in the target domain, which leads to confusion with private classes. To solve this problem, we propose UniMAP: UniDA-SS with Image Matching and Prototype-based Distinction, a novel framework composed of two key components. First, Domain-Specific Prototype-based Distinction (DSPD) divides each class into two domain-specific prototypes, enabling finer separation of domain-specific features and enhancing the identification of common classes across domains. Second, Target-based Image Matching (TIM) selects a source image containing the most common-class pixels based on the target pseudo-label and pairs it in a batch to promote effective learning of common classes. We also introduce a new UniDA-SS benchmark and demonstrate through various experiments that UniMAP significantly outperforms baselines. The code is available at https://github.com/KU-VGI/UniMAP.
中文摘要:提出的UniMAP框架通过引入领域特定原型和目标图像匹配方法,有效解决了语义分割中未知类别设置的通用领域自适应问题,显著超越了现有基准方法。
English Summary: The proposed UniMAP framework addresses universal domain adaptation for semantic segmentation by introducing domain-specific prototypes and target-based image matching to effectively handle unknown category settings between domains, significantly outperforming existing methods.

Authors:Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
Title: Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Abstract:
Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data, an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% → 72.9% on MathVista, 62.9% → 68.7% on We-Math), using standard datasets without ground-truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
中文: MM-UPT提出了一种基于GRPO和无监督自奖励机制的后训练框架,无需外部监督即可显著提升多模态大语言模型的推理能力,其效果甚至接近有监督方法。
English: MM-UPT introduces an unsupervised post-training framework using GRPO with a self-rewarding mechanism, significantly enhancing MLLMs' reasoning without external supervision and even approaching supervised method performance.
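
The self-rewarding mechanism described in this abstract (majority voting over sampled responses, used in place of ground-truth rewards) can be sketched as below. The binary reward values, the tie-breaking behaviour, and the mean/std advantage normalization are assumptions of this sketch and are not taken from the authors' code.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Treat the most frequent sampled answer as a pseudo-label.

    Returns the pseudo-label and a binary reward per sampled answer.
    Ties are broken by first occurrence (Counter.most_common behaviour).
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style, assumed form)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Five answers sampled for one unlabeled question (toy data).
sampled_answers = ["12", "12", "7", "12", "15"]
label, rewards = majority_vote_rewards(sampled_answers)
advantages = group_relative_advantages(rewards)
```

Responses that agree with the majority receive positive advantages and the rest receive negative ones, so no external label is needed. If every sample agrees, all advantages are zero and the group contributes no learning signal.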

Authors:Mattie Fellows, Clarisse Wibault, Uljad Berdica, Johannes Forkel, Michael A. Osborne, Jakob N. Foerster
Title: SOReL and TOReL: Two Methods for Fully Offline Reinforcement Learning
Abstract:
Sample efficiency remains a major obstacle for real world adoption of reinforcement learning (RL): success has been limited to settings where simulators provide access to essentially unlimited environment interactions, which in reality are typically costly or dangerous to obtain. Offline RL in principle offers a solution by exploiting offline data to learn a near-optimal policy before deployment. In practice, however, current offline RL methods rely on extensive online interactions for hyperparameter tuning, and have no reliable bound on their initial online performance. To address these two issues, we introduce two algorithms. Firstly, SOReL: an algorithm for safe offline reinforcement learning. Using only offline data, our Bayesian approach infers a posterior over environment dynamics to obtain a reliable estimate of the online performance via the posterior predictive uncertainty. Crucially, all hyperparameters are also tuned fully offline. Secondly, we introduce TOReL: a tuning for offline reinforcement learning algorithm that extends our information rate based offline hyperparameter tuning methods to general offline RL approaches. Our empirical evaluation confirms SOReL's ability to accurately estimate regret in the Bayesian setting whilst TOReL's offline hyperparameter tuning achieves competitive performance with the best online hyperparameter tuning methods using only offline data. Thus, SOReL and TOReL make a significant step towards safe and reliable offline RL, unlocking the potential for RL in the real world. Our implementations are publicly available: https://github.com/CWibault/sorel_torel.
Chinese: 本文提出了SOReL和TOReL两种算法,通过实现安全的性能评估和完全离线的超参数调优,解决了离线强化学习中的关键限制,从而提升了强化学习在现实世界应用中的可行性。
English: The paper introduces two algorithms, SOReL and TOReL, which address key limitations in offline reinforcement learning by enabling safe performance estimation and fully offline hyperparameter tuning, thereby enhancing the practicality of RL for real-world applications.

Authors:Van-Tin Luu, Yon-Lin Cai, Vu-Hoang Tran, Wei-Chen Chiu, Yi-Ting Chen, Ching-Chun Huang
Title: RC-AutoCalib: An End-to-End Radar-Camera Automatic Calibration Network
Abstract:
This paper presents a groundbreaking approach - the first online automatic geometric calibration method for radar and camera systems. Given the significant data sparsity and measurement uncertainty in radar height data, achieving automatic calibration during system operation has long been a challenge. To address the sparsity issue, we propose a Dual-Perspective representation that gathers features from both frontal and bird's-eye views. The frontal view contains rich but sensitive height information, whereas the bird's-eye view provides robust features against height uncertainty. We thereby propose a novel Selective Fusion Mechanism to identify and fuse reliable features from both perspectives, reducing the effect of height uncertainty. Moreover, for each view, we incorporate a Multi-Modal Cross-Attention Mechanism to explicitly find location correspondences through cross-modal matching. During the training phase, we also design a Noise-Resistant Matcher to provide better supervision and enhance the robustness of the matching mechanism against sparsity and height uncertainty. Our experimental results, tested on the nuScenes dataset, demonstrate that our method significantly outperforms previous radar-camera auto-calibration methods, as well as existing state-of-the-art LiDAR-camera calibration techniques, establishing a new benchmark for future research. The code is available at https://github.com/nycu-acm/RC-AutoCalib.
中文摘要:本文提出了首个雷达-相机系统的在线自动标定方法,通过双视角特征融合和跨模态注意力机制有效解决了数据稀疏性和高度不确定性问题,在nuScenes数据集上实现了超越现有最佳方法的性能。
English Summary: This paper introduces the first online automatic calibration method for radar-camera systems, employing dual-perspective feature fusion and cross-modal attention to overcome data sparsity and height uncertainty, achieving state-of-the-art performance on the nuScenes dataset.

Authors:Václav Voráček, Francesco Orabona
Title: STaR-Bets: Sequential Target-Recalculating Bets for Tighter Confidence Intervals
Abstract:
The construction of confidence intervals for the mean of a bounded random variable is a classical problem in statistics with numerous applications in machine learning and virtually all scientific fields. In particular, obtaining the tightest possible confidence intervals is vital whenever sampling the random variables is expensive. The current state-of-the-art method to construct confidence intervals is by using betting algorithms. This is a very successful approach for deriving optimal confidence sequences, even matching the rate of the law of the iterated logarithm. However, in the fixed horizon setting, these approaches are either sub-optimal or based on heuristic solutions with strong empirical performance but without a finite-time guarantee. Hence, no betting-based algorithm guaranteeing the optimal $\mathcal{O}\left(\sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{n}}\right)$ width of the confidence intervals is known. This work bridges this gap. We propose a betting-based algorithm to compute confidence intervals that empirically outperforms the competitors. Our betting strategy uses the optimal strategy in every step (in a certain sense), whereas the standard betting methods choose a constant strategy in advance. Leveraging this fact results in strict improvements even for classical concentration inequalities, such as the ones of Hoeffding or Bernstein. Moreover, we also prove that the width of our confidence intervals is optimal up to a $1+o(1)$ factor diminishing with $n$. The code is available at https://github.com/vvoracek/STaR-bets-confidence-interval.
Chinese: 本研究提出了一种基于博弈策略的新算法,用于构建有界随机变量的置信区间,在保证理论最优性的同时,其实际表现也优于现有方法。
English: This study introduces a novel betting-based algorithm that constructs optimal confidence intervals for bounded random variables, achieving theoretically guaranteed tightness and empirical superiority over existing methods.
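
For orientation, the "constant strategy chosen in advance" that the abstract contrasts itself with can be sketched as follows. This is a baseline illustration only, not the STaR-Bets algorithm: the function name, the fixed bet of 0.5, the grid search over candidate means, and the mixing of two mirrored bets are all assumptions of this sketch. Observations are assumed to lie in [0, 1].

```python
import math


def betting_interval(xs, delta=0.05, bet=0.5, grid=1000):
    """Fixed-horizon confidence interval for the mean of data in [0, 1].

    For each candidate mean m, wealth is multiplied by (1 + bet * (x - m)) on
    every observation and averaged with the mirrored bet (-bet). Under the
    hypothesis "true mean = m" the wealth has expectation 1, so by Markov's
    inequality it reaches 1/delta with probability at most delta.
    """
    assert 0 < bet < 1, "keeps every wealth factor strictly positive"
    threshold = math.log(1.0 / delta)
    kept = []
    for i in range(grid + 1):
        m = i / grid
        log_up = sum(math.log1p(bet * (x - m)) for x in xs)
        log_down = sum(math.log1p(-bet * (x - m)) for x in xs)
        # log of the average of the two wealth processes, computed stably
        top = max(log_up, log_down)
        log_wealth = top + math.log(
            0.5 * (math.exp(log_up - top) + math.exp(log_down - top))
        )
        if log_wealth < threshold:
            kept.append(m)  # candidate mean m is not rejected
    # Reporting the hull of the kept candidates can only widen the set,
    # so the coverage guarantee is preserved.
    return min(kept), max(kept)


# Hypothetical sample: 100 bounded observations with empirical mean 0.5.
xs = [0.2, 0.4, 0.6, 0.8] * 25
lo, hi = betting_interval(xs)
print(lo, hi)
```

The bet here never adapts to the data or to the remaining horizon, which is the limitation the abstract says its per-step strategy addresses.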

Authors:Anthony Chen, Wenzhao Zheng, Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Kurt Keutzer, Shanghang Zhang
Title: GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control
Abstract:
Recent advancements in world models have revolutionized dynamic environment simulation, allowing systems to foresee future states and assess potential actions. In autonomous driving, these capabilities help vehicles anticipate the behavior of other road users, perform risk-aware planning, accelerate training in simulation, and adapt to novel scenarios, thereby enhancing safety and reliability. Current approaches either fail to maintain robust 3D geometric consistency or accumulate artifacts during occlusion handling, yet both capabilities are critical for reliable safety assessment in autonomous navigation tasks. To address this, we introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models to enhance spatial understanding and action controllability. Specifically, we first extract a 3D representation from the input frame and then obtain its 2D rendering based on the user-specified ego-car trajectory. To enable dynamic modeling, we propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Extensive experiments demonstrate that our method significantly outperforms existing models in both action accuracy and 3D spatial awareness, leading to more realistic, adaptable, and reliable scene modeling for safer autonomous driving. Additionally, our model can generalize to novel trajectories and offers interactive scene editing capabilities, such as object editing and object trajectory control.
中文: GeoDrive通过将稳健的3D几何条件融入驾驶世界模型,提升了自动驾驶的空间感知和行动可控性,从而实现更真实、适应性更强的仿真场景建模。
English: GeoDrive enhances autonomous driving safety by integrating robust 3D geometry into world models, improving spatial awareness and action controllability for more realistic and adaptable simulations.

Authors:Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong
Title: Mitigating Overthinking in Large Reasoning Models via Manifold Steering
Abstract:
Recent advances in Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex tasks such as mathematics and coding. However, these models frequently exhibit a phenomenon known as overthinking during inference, characterized by excessive validation loops and redundant deliberation, leading to substantial computational overheads. In this paper, we aim to mitigate overthinking by investigating the underlying mechanisms from the perspective of mechanistic interpretability. We first showcase that the tendency of overthinking can be effectively captured by a single direction in the model's activation space and the issue can be eased by intervening on the activations along this direction. However, this efficacy soon reaches a plateau and even deteriorates as the intervention strength increases. We therefore systematically explore the activation space and find that the overthinking phenomenon is actually tied to a low-dimensional manifold, which indicates that the limited effect stems from the noise introduced by the high-dimensional steering direction. Based on this insight, we propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold given the theoretical approximation of the interference noise. Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks. Code is available at: https://github.com/Aries-iai/Manifold_Steering.
中文: 本文提出流形导向方法,通过将激活干预投影到低维流形上减少大型推理模型的计算开销,在数学和编程任务中最多减少71%的生成标记数,同时保持甚至提高准确性。
English: This paper introduces Manifold Steering, a method that reduces computational overhead in Large Reasoning Models by projecting activation interventions onto a low-dimensional manifold, achieving up to 71% fewer tokens while maintaining or improving accuracy across mathematical and coding tasks.
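
The core geometric step named in the abstract, projecting a steering direction onto a low-dimensional subspace of activation space, can be sketched in a few lines. This is a toy linear-algebra illustration, not the paper's method: treating the manifold as a linear subspace, obtaining its basis by Gram-Schmidt from given spanning vectors, and subtracting the projected direction from an activation are all assumptions of this sketch, as are the function names.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: turn spanning vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis


def project_onto_subspace(direction, basis):
    """Keep only the component of `direction` that lies in span(basis)."""
    out = [0.0] * len(direction)
    for b in basis:
        coeff = dot(direction, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out


def steer(activation, direction, strength):
    """Shift an activation against the steering direction (sign is assumed)."""
    return [a - strength * d for a, d in zip(activation, direction)]


# Hypothetical 4-dimensional activation space with a 2-dimensional subspace.
basis = orthonormalize([[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]])
raw_direction = [1.0, 2.0, 3.0, 4.0]
projected = project_onto_subspace(raw_direction, basis)
steered = steer([0.5, 0.5, 0.5, 0.5], projected, strength=0.1)
print(projected, steered)
```

In this toy example the components of the raw direction outside the subspace (the last two coordinates) are discarded, which mirrors the abstract's point that off-manifold components act as noise.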

Authors:Enfang Cui, Yujun Cheng, Rui She, Dan Liu, Zhiyuan Liang, Minxin Guo, Tianzheng Li, Qian Wei, Wenjuan Xing, Zhijie Zhong
Title: AgentDNS: A Root Domain Naming System for LLM Agents
Abstract:
The rapid evolution of Large Language Model (LLM) agents has highlighted critical challenges in cross-vendor service discovery, interoperability, and communication. Existing protocols like model context protocol and agent-to-agent protocol have made significant strides in standardizing interoperability between agents and tools, as well as communication among multi-agents. However, there remains a lack of standardized protocols and solutions for service discovery across different agent and tool vendors. In this paper, we propose AgentDNS, a root domain naming and service discovery system designed to enable LLM agents to autonomously discover, resolve, and securely invoke third-party agent and tool services across organizational and technological boundaries. Inspired by the principles of the traditional DNS, AgentDNS introduces a structured mechanism for service registration, semantic service discovery, secure invocation, and unified billing. We detail the architecture, core functionalities, and use cases of AgentDNS, demonstrating its potential to streamline multi-agent collaboration in real-world scenarios. The source code will be published on https://github.com/agentdns.
中文: 摘要指出跨供应商大语言模型代理在服务发现方面缺乏标准化协议,并提出了AgentDNS系统,实现第三方服务的自主发现与安全调用。
English: The abstract identifies a gap in standardized service discovery protocols for cross-vendor LLM agents and introduces AgentDNS, a system enabling autonomous discovery and secure invocation of third-party services.

Authors:Yongkang Liu, Xingle Xu, Ercong Nie, Zijing Wang, Shi Feng, Daling Wang, Qian Li, Hinrich Schütze
Title: Look Within or Look Beyond? A Theoretical Comparison Between Parameter-Efficient and Full Fine-Tuning
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) methods achieve performance comparable to Full Fine-Tuning (FFT) while requiring significantly fewer computing resources, making them the go-to choice for researchers. We find that although PEFT can achieve competitive results on some benchmarks, its performance falls short of FFT in complex tasks, such as reasoning and instruction-based fine-tuning. In this paper, we compare the characteristics of PEFT and FFT in terms of representational capacity and robustness based on optimization theory. We theoretically demonstrate that PEFT is a strict subset of FFT. By providing theoretical upper bounds for PEFT, we show that the limited parameter space constrains the model's representational ability, making it more susceptible to perturbations. Experiments on 15 datasets encompassing classification, generation, reasoning, instruction fine-tuning tasks and 11 adversarial test sets validate our theories. We hope that these results spark further research beyond the realms of well-established PEFT. The source code is in the anonymous GitHub repository: https://github.com/misonsky/PEFTEval.
Chinese: 参数高效微调方法虽然在计算上更经济,但由于参数空间受限导致表示能力和鲁棒性不足,在复杂任务中表现不及全参数微调,理论和实验均证实了这一点。
English: Parameter-Efficient Fine-Tuning (PEFT) methods, while computationally economical, fall short of Full Fine-Tuning (FFT) in complex tasks due to limited representational capacity and robustness, as proven theoretically and validated across diverse datasets.

Authors:Anjie Xu, Ruiqing Ding, Leye Wang
Title: ChatPD: An LLM-driven Paper-Dataset Networking System
Abstract:
Scientific research heavily depends on suitable datasets for method validation, but existing academic platforms with dataset management like PapersWithCode suffer from inefficiencies in their manual workflow. To overcome this bottleneck, we present a system, called ChatPD, that utilizes Large Language Models (LLMs) to automate dataset information extraction from academic papers and construct a structured paper-dataset network. Our system consists of three key modules: paper collection, dataset information extraction, and dataset entity resolution to construct paper-dataset networks. Specifically, we propose a Graph Completion and Inference strategy to map dataset descriptions to their corresponding entities. Through extensive experiments, we demonstrate that ChatPD not only outperforms the existing platform PapersWithCode in dataset usage extraction but also achieves about 90% precision and recall in entity resolution tasks. Moreover, we have deployed ChatPD to continuously extract which datasets are used in papers, and provide a dataset discovery service, such as task-specific dataset queries and similar dataset recommendations. We open source ChatPD and the current paper-dataset network in this GitHub repository: https://github.com/ChatPD-web/ChatPD.
Chinese: ChatPD 是一个利用大语言模型自动从学术论文中提取数据集信息并构建结构化论文-数据集网络的系统,其性能超越人工平台PapersWithCode,在实体解析任务中达到约90%的精确率和召回率,同时提供数据集发现服务。
English: ChatPD is an automated system using Large Language Models to efficiently extract and structure dataset information from academic papers, outperforming manual platforms like PapersWithCode with high precision in entity resolution and enabling dataset discovery services.

Authors:Shriram M S, Xinyue Hao, Shihao Hou, Yang Lu, Laura Sevilla-Lara, Anurag Arnab, Shreyank N Gowda
Title: Progressive Data Dropout: An Embarrassingly Simple Approach to Faster Training
Abstract:
The success of the machine learning field has reliably depended on training on large datasets. While effective, this trend comes at an extraordinary cost. This is due to two deeply intertwined factors: the size of models and the size of datasets. While promising research efforts focus on reducing the size of models, the other half of the equation remains fairly mysterious. Indeed, it is surprising that the standard approach to training remains to iterate over and over, uniformly sampling the training dataset. In this paper we explore a series of alternative training paradigms that leverage insights from hard-data-mining and dropout, and that are simple enough to implement and use that they can become the new training standard. The proposed Progressive Data Dropout reduces the number of effective epochs to as little as 12.4% of the baseline. These savings do not come at any cost to accuracy. Surprisingly, the proposed method improves accuracy by up to 4.82%. Our approach requires no changes to model architecture or optimizer, and can be applied across standard training pipelines, thus posing an excellent opportunity for wide adoption. Code can be found here: https://github.com/bazyagami/LearningWithRevision
Chinese: 本文提出的渐进式数据丢弃训练方法,在无需改变模型结构或优化器的前提下,显著减少了所需训练轮次,同时保持甚至提升了模型精度。
English: The paper introduces Progressive Data Dropout, a training method that significantly reduces the number of epochs needed while maintaining or even improving accuracy, without requiring changes to model architecture or optimizer.

Authors:Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Title: Text2Grad: Reinforcement Learning from Natural Language Feedback
Abstract:
Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model's policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level rewards on answers while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization. The code for our method is available at https://github.com/microsoft/Text2Grad
中文摘要:Text2Grad提出了一种创新的强化学习方法,将自由形式的文本反馈转化为精确的片段级梯度,在多个任务中超越传统标量奖励方法的同时,实现了细粒度的模型优化并增强了可解释性。
English Summary: Text2Grad introduces a novel reinforcement learning approach that converts free-form textual critiques into precise span-level gradients, enabling fine-grained model optimization that outperforms traditional scalar-reward methods across multiple tasks while providing enhanced interpretability.

Authors:Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang
Title: Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
Abstract:
Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% $\rightarrow$ 73.4% on MathVista, 62.9% $\rightarrow$ 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
中文: 本研究表明多模态大语言模型中的自我修正模式在强化学习训练前就已存在,并提出结合监督微调与强化学习的两阶段方法,在推理基准测试中实现了最优性能。
English: This study demonstrates that self-correction patterns in multimodal LLMs exist before RL training and proposes a two-stage approach combining supervised fine-tuning and reinforcement learning, achieving state-of-the-art performance on reasoning benchmarks.

Authors:Ganlin Xu, Zhoujia Zhang, Wangyi Mei, Jiaqing Liang, Weijia Lu, Xiaodong Zhang, Zhifei Yang, Xiaofeng Ma, Yanghua Xiao, Deqing Yang
Title: Logical Consistency is Vital: Neural-Symbolic Information Retrieval for Negative-Constraint Queries
Abstract:
Information retrieval plays a crucial role in resource localization. Current dense retrievers retrieve the relevant documents within a corpus via embedding similarities, which compute similarities between dense vectors mainly depending on word co-occurrence between queries and documents, but overlook the real query intents. Thus, they often retrieve numerous irrelevant documents. Particularly in the scenarios of complex queries such as negative-constraint queries, their retrieval performance could be catastrophic. To address the issue, we propose a neuro-symbolic information retrieval method, namely NS-IR, that leverages first-order logic (FOL) to optimize the embeddings of naive natural language by considering the logical consistency between queries and documents. Specifically, we introduce two novel techniques, logic alignment and connective constraint, to rerank candidate documents, thereby enhancing retrieval relevance. Furthermore, we construct a new dataset, NegConstraint, including negative-constraint queries to evaluate our NS-IR's performance on such complex IR scenarios. Our extensive experiments demonstrate that NS-IR not only achieves superior zero-shot retrieval performance on web search and low-resource retrieval tasks, but also performs better on negative-constraint queries. Our source code and dataset are available at https://github.com/xgl-git/NS-IR-main.
中文: 现有密集检索器常忽略查询真实意图导致结果不相关,而提出的神经符号方法NS-IR利用一阶逻辑优化嵌入,通过逻辑对齐和连接约束提升检索相关性。
English: Current dense retrievers often fail to capture query intents, leading to irrelevant results, but the proposed neuro-symbolic method NS-IR uses first-order logic to optimize embeddings and improve retrieval relevance through logic alignment and connective constraints.

Authors:Haosheng Zou, Xiaowei Lv, Shousheng Jia, Xiangzheng Zhang
Title: 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training
Abstract:
Adding sequence parallelism into LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and has been used in models such as Light-R1 (arXiv:2503.10460), TinyR1 (arXiv:2503.04872), and Kaggle AIMO math models, and also in large companies' training frameworks. This technical report delves deeper into the different sequence parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.
Chinese: 我们开源了集成序列并行技术的360-LLaMA-Factory,该框架已在多个模型及企业训练系统中获得广泛应用。
English: We have open-sourced 360-LLaMA-Factory with sequence parallelism, which has been widely adopted in various models and corporate training frameworks.

Authors:Haosheng Zou, Xiaowei Lv, Shousheng Jia, Lin Li, Xiaochun Gong, Xiangzheng Zhang
Title: 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training
Abstract:
Adding sequence parallelism into LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and has been used in models such as Light-R1 (arXiv:2503.10460), TinyR1 (arXiv:2503.04872), and Kaggle AIMO math models, and also in large companies' training frameworks. This technical report delves deeper into the different sequence parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.
Chinese: 我们开源了集成序列并行技术的360-LLaMA-Factory,该框架已在多个模型及企业训练系统中获得广泛应用。
English: We have open-sourced 360-LLaMA-Factory with sequence parallelism, which has been widely adopted in various models and corporate training frameworks.

Authors:Haibin He, Jing Zhang, Maoyuan Ye, Juhua Liu, Bo Du, Dacheng Tao
Title: GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking
Abstract:
Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter's ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.
中文: 本文提出GoMatching++方法,通过冻结图像文本识别器并引入轻量级跟踪器,有效将现成的图像文本识别器转化为视频专用系统,在多个基准测试中创下性能记录并显著降低训练成本。
English: This paper introduces GoMatching++, a method that efficiently converts an image text spotter into a specialized video text spotter by adding a lightweight tracker and addressing domain gaps, achieving state-of-the-art results on benchmarks while reducing training costs.

Authors:Xuchen Ma, Jianxiang Yu, Wenming Shao, Bo Pang, Xiang Li
Title: Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon
Abstract:
Social media platforms have experienced a significant rise in toxic content, including abusive language and discriminatory remarks, presenting growing challenges for content moderation. Some users evade censorship by deliberately disguising toxic words through homophonic cloak, which necessitates the task of unveiling cloaked toxicity. Existing methods are mostly designed for English texts, while Chinese cloaked toxicity unveiling has not been solved yet. To tackle the issue, we propose C$^2$TU, a novel training-free and prompt-free method for Chinese cloaked toxic content unveiling. It first employs substring matching to identify candidate toxic words based on a Chinese homophone graph and a toxic lexicon. Then it filters out those candidates that are non-toxic and corrects the cloaks to their corresponding toxic words. Specifically, we develop two model variants for filtering, which are based on BERT and LLMs, respectively. For LLMs, we address the auto-regressive limitation in computing word occurrence probability and utilize the full semantic contexts of a text sequence to reveal cloaked toxic words. Extensive experiments demonstrate that C$^2$TU can achieve superior performance on two Chinese toxic datasets. In particular, our method outperforms the best competitor by up to 71% on the F1 score and 35% on accuracy, respectively. Our code and data are available at https://github.com/XDxc-cuber/C2TU-Chinese-cloaked-toxicity-unveiling.
中文摘要:本研究提出C$^2$TU方法,通过识别中文同音替换词并结合BERT与大语言模型过滤非毒性内容,有效检测中文伪装毒性文本,在多项指标上显著超越现有最佳方法。
English Summary: The study introduces C$^2$TU, a training-free method for detecting disguised toxic content in Chinese text by identifying homophonic substitutions and filtering non-toxic candidates using BERT and LLMs, achieving significant performance improvements over existing approaches.

Authors:Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu
Title: Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
Abstract:
Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code available at https://github.com/AI9Stars/SpecMQuant.
中文总结:推测解码与量化相结合可加速大语言模型推理,但实验发现4位量化的内存优势会被推测解码的计算负载抵消,因此提出分层框架,通过小模型将树状草案转为序列草案,显著提升量化模型性能。
English Summary: Speculative decoding and quantization are combined to accelerate large language model inference, but their integration reveals that 4-bit quantization's memory benefits are offset by computational overhead, leading to a new hierarchical framework that achieves significant speedup improvements.
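
The draft-then-verify loop that speculative decoding relies on can be sketched with toy models. This is a generic greedy illustration on a sequence draft, not EAGLE-2 and not the paper's hierarchical framework; the toy `draft_next` and `target_next` functions and the token arithmetic are hypothetical, and tree-style drafts and quantization are not modeled.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative-decoding step on a sequence draft.

    The draft model proposes k tokens; the target model checks them, and we
    keep the longest agreeing prefix plus one token chosen by the target.
    """
    # 1) Draft k tokens autoregressively with the cheap model.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2) Verify. A real system scores all k positions in one target forward
    #    pass; this toy calls the target once per position for clarity.
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's correction ends the step
            return accepted
        accepted.append(tok)
    # Every draft token matched: the same pass also yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Hypothetical toy models over integer tokens 0..9.
def target_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    # Agrees with the target except right after token 3.
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


step = speculative_step([0], draft_next, target_next, k=4)
print(step)
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone; the benefit comes only from how many tokens each target pass yields.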

Authors:Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, Michele Magno
Title: Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers
Abstract:
Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9$\times$. Code will be available at https://github.com/cantbebetter2/Q-VDiT.
中文: 扩散变换器在视频生成中表现卓越,但高计算需求限制了其在边缘设备上的部署;为此开发的Q-VDiT量化框架通过令牌感知误差补偿和时间蒸馏技术,显著提升了性能并设定了新基准。
English: Diffusion transformers excel in video generation but face deployment challenges on edge devices due to high computational demands, prompting the development of Q-VDiT, a specialized quantization framework that enhances performance through token-aware error compensation and temporal distillation.

Authors:Tiantian Feng, Thanathai Lertpetchpun, Dani Byrd, Shrikanth Narayanan
Title: Developing a Top-tier Framework in Naturalistic Conditions Challenge for Categorized Emotion Prediction: From Speech Foundation Models and Learning Objective to Data Augmentation and Engineering Choices
Abstract:
Speech emotion recognition (SER), particularly for naturally expressed emotions, remains a challenging computational task. Key challenges include the inherent subjectivity in emotion annotation and the imbalanced distribution of emotion labels in datasets. This paper introduces the SAILER system developed for participation in the INTERSPEECH 2025 Emotion Recognition Challenge (Task 1). The challenge dataset, which contains natural emotional speech from podcasts, serves as a valuable resource for studying imbalanced and subjective emotion annotations. Our system is designed to be simple, reproducible, and effective, highlighting critical choices in modeling, learning objectives, data augmentation, and engineering. Results show that even a single system (without ensembling) can outperform more than 95% of the submissions, with a Macro-F1 score exceeding 0.4. Moreover, an ensemble of three systems further improves performance, achieving a competitively ranked score (top-3 performing team). Our model is at: https://github.com/tiantiaf0627/vox-profile-release.
中文: SAILER系统针对语音情感识别中的主观标注和数据不平衡等难题,通过简洁高效的设计,在INTERSPEECH 2025挑战赛中取得了领先性能。
English: The SAILER system addresses challenges in speech emotion recognition, such as subjective annotations and imbalanced data, with a simple yet effective design: a single system outperforms more than 95% of submissions to the INTERSPEECH 2025 challenge, and a three-system ensemble places among the top-3 teams.

Authors:Zhuoyang Wu, Xinze Li, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Minghe Yu, Cheng Yang, Yu Gu, Ge Yu, Maosong Sun
Title: EULER: Enhancing the Reasoning Ability of Large Language Models through Error-Induced Learning
Abstract:
Large Language Models (LLMs) have demonstrated strong reasoning capabilities and achieved promising results in mathematical problem-solving tasks. Learning from errors offers the potential to further enhance the performance of LLMs during Supervised Fine-Tuning (SFT). However, the errors in synthesized solutions are typically gathered from sampling trials, making it challenging to generate solution errors for each mathematical problem. This paper introduces the Error-IndUced LEaRning (EULER) model, which aims to develop an error exposure model that generates high-quality solution errors to enhance the mathematical reasoning capabilities of LLMs. Specifically, EULER optimizes the error exposure model to increase the generation probability of self-made solution errors while utilizing solutions produced by a superior LLM to regularize the generation quality. Our experiments across various mathematical problem datasets demonstrate the effectiveness of the EULER model, achieving an improvement of over 4% compared to all baseline models. Further analysis reveals that EULER is capable of synthesizing more challenging and educational solution errors, which facilitate both the training and inference processes of LLMs. All code is available at https://github.com/NEUIR/EULER.
中文: EULER模型通过生成高质量解题错误来增强大语言模型的数学推理能力,在监督微调中实现超过4%的性能提升。
English: The EULER model enhances LLMs' mathematical reasoning by generating high-quality solution errors during supervised fine-tuning, achieving over 4% improvement across datasets.

Authors:Haidong Xin, Qiushi Xiong, Zhenghao Liu, Sen Mei, Yukun Yan, Shi Yu, Shuo Wang, Yu Gu, Ge Yu, Chenyan Xiong
Title: ConsRec: Denoising Sequential Recommendation through User-Consistent Preference Modeling
Abstract:
User-item interaction histories are pivotal for sequential recommendation systems but often include noise, such as unintended clicks or actions that fail to reflect genuine user preferences. To address this issue, we propose the User-Consistent Preference-based Sequential Recommendation System (ConsRec), designed to capture stable user preferences and filter noisy items from interaction histories. Specifically, ConsRec constructs a user-interacted item graph, learns item similarities from their text representations, and then extracts the maximum connected subgraph from the user-interacted item graph to denoise the items. Experimental results on the Yelp and Amazon Product datasets show that ConsRec achieves a 13% improvement over baseline recommendation models, demonstrating its effectiveness in denoising user-interacted items. Further analysis reveals that the denoised interaction histories form semantically tighter clusters of user-preferred items, leading to higher relevance scores for ground-truth targets and more accurate recommendations. All code is available at https://github.com/NEUIR/ConsRec.
中文:提出的ConsRec系统通过过滤用户交互中的噪声并捕捉稳定偏好来改进序列推荐,在基准数据集上实现了13%的性能提升。
English: The proposed ConsRec system enhances sequential recommendations by filtering noise from user interactions and capturing stable preferences, achieving a 13% performance improvement on benchmark datasets.
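
The denoising step described in the ConsRec abstract (build an item graph from text-based similarities, then keep the largest connected subgraph) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity threshold, the function names `cosine` and `largest_component`, and the toy embeddings are all assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def largest_component(embeddings, threshold=0.8):
    """Return sorted indices of the items in the largest connected component.

    Two items are linked when their cosine similarity is >= threshold
    (the threshold value is an assumption, not taken from the paper).
    """
    n = len(embeddings)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                adj[i].append(j)
                adj[j].append(i)

    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if len(comp) > len(best):
            best = comp
    return sorted(best)


# Toy interaction history: item 2 points in a different direction (a "noisy" click).
history = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.95, 0.05]]
kept = largest_component(history)
```

In this toy history, items 0, 1, and 3 are mutually similar and survive, while item 2 is isolated and dropped.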

Authors:Runyu Wang, Peng Ping, Zhengyu Guo, Xiaoye Zhang, Quan Shi, Liting Zhou, Tianbo Ji
Title: LoKI: Low-damage Knowledge Implanting of Large Language Models
Abstract:
Fine-tuning adapts pretrained models for specific tasks but poses the risk of catastrophic forgetting (CF), where critical knowledge from pre-training is overwritten. Current Parameter-Efficient Fine-Tuning (PEFT) methods for Large Language Models (LLMs), while efficient, often sacrifice general capabilities. To address the issue of CF in a general-purpose PEFT framework, we propose Low-damage Knowledge Implanting (LoKI), a PEFT technique that is based on a mechanistic understanding of how knowledge is stored in transformer architectures. In two real-world scenarios, LoKI demonstrates task-specific performance that is comparable to or even surpasses that of full fine-tuning and LoRA-based methods across various model types, while significantly better preserving general capabilities. Our work connects mechanistic insights into LLM knowledge storage with practical fine-tuning objectives, achieving state-of-the-art trade-offs between task specialization and the preservation of general capabilities. Our implementation is publicly available as ready-to-use code at https://github.com/Nexround/LoKI.
中文: LoKI是一种基于Transformer知识存储机制理解的参数高效微调方法,通过低损伤知识植入技术,在保证任务性能的同时显著减少灾难性遗忘,有效平衡了专业化与通用能力。
English: LoKI is a parameter-efficient fine-tuning method that mitigates catastrophic forgetting by leveraging mechanistic insights into transformer knowledge storage, achieving superior task performance while preserving general capabilities across various models.

Authors:Jintao Zhang, Zirui Liu, Mingyue Cheng, Shilong Zhang, Tingyue Pan, Yitong zhou, Qi Liu, Yanhu Xie
Title: Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model
Abstract:
Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose IOHFuseLM, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we leverage a two-stage training strategy. The first stage involves domain-adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model's sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance the ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at https://github.com/zjt-gpu/IOHFuseLM.
中文:IOHFuseLM框架采用多模态语言模型和两阶段训练策略,通过整合静态与动态患者数据精确预测术中低血压,在临床评估中展现出卓越性能。
English: The IOHFuseLM framework uses a multimodal language model with a two-stage training strategy to predict intraoperative hypotension by integrating static and dynamic patient data, outperforming established baselines on two intraoperative datasets.

Authors:Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan
Title: Curse of High Dimensionality Issue in Transformer for Long-context Modeling
Abstract:
Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at https://github.com/bolixinyu/DynamicGroupAttention.
中文: Transformer架构的大型语言模型因冗余注意力计算存在效率问题,本文提出的动态分组注意力(DGA)方法通过聚合次要标记显著降低计算成本,同时保持模型性能竞争力。
English: Transformer-based LLMs face computational inefficiency from redundant attention computations, which the proposed Dynamic Group Attention (DGA) method addresses by aggregating less important tokens to reduce costs while maintaining performance.

Authors:Hang Chen, Maoyuan Ye, Peng Yang, Haibin He, Juhua Liu, Bo Du
Title: Adapting Segment Anything Model for Power Transmission Corridor Hazard Segmentation
Abstract:
Power transmission corridor hazard segmentation (PTCHS) aims to separate transmission equipment and surrounding hazards from complex backgrounds, which is of great significance for maintaining electric power transmission safety. Recently, the Segment Anything Model (SAM) has emerged as a foundational vision model and pushed the boundaries of segmentation tasks. However, SAM struggles with target objects in complex transmission corridor scenarios, especially those with fine structure. In this paper, we propose ELE-SAM, which adapts SAM to the PTCHS task. Technically, we develop a Context-Aware Prompt Adapter that produces better prompt tokens by incorporating global-local features and focusing more on key regions. Subsequently, to tackle hazard objects with fine structure in complex backgrounds, we design a High-Fidelity Mask Decoder that leverages multi-granularity mask features and then scales them to a higher resolution. Moreover, to train ELE-SAM and advance this field, we construct the ELE-40K benchmark, the first large-scale, real-world dataset for PTCHS, comprising 44,094 image-mask pairs. Experimental results on ELE-40K show that ELE-SAM outperforms the baseline model by an average of 16.8% mIoU and 20.6% mBIoU. Moreover, compared with the state-of-the-art method on HQSeg-44K, average absolute improvements of 2.9% mIoU and 3.8% mBIoU further validate the effectiveness of our method on high-quality generic object segmentation. The source code and dataset are available at https://github.com/Hhaizee/ELE-SAM.
中文: 本文提出了ELE-SAM模型,基于Segment Anything Model改进用于电力传输走廊危险目标分割,通过上下文感知提示适配器和高保真掩码解码器增强复杂背景下精细结构目标的识别能力,并在新构建的ELE-40K数据集上验证了其显著性能提升。
English: This paper introduces ELE-SAM, an enhanced model based on the Segment Anything Model (SAM) for power transmission corridor hazard segmentation, featuring a Context-Aware Prompt Adapter and a High-Fidelity Mask Decoder to improve segmentation of fine-structured objects in complex backgrounds, validated by the newly constructed ELE-40K dataset with significant performance gains.

Authors:Junhuan Liu, San Jiang, Wei Ge, Wei Huang, Bingxuan Guo, Qingquan Li
Title: UAVPairs: A Challenging Benchmark for Match Pair Retrieval of Large-scale UAV Images
Abstract:
The primary contribution of this paper is a challenging benchmark dataset, UAVPairs, and a training pipeline designed for match pair retrieval of large-scale UAV images. First, the UAVPairs dataset, comprising 21,622 high-resolution images across 30 diverse scenes, is constructed; the 3D points and tracks generated by SfM-based 3D reconstruction are employed to define the geometric similarity of image pairs, ensuring that genuinely matchable image pairs are used for training. Second, to address the high cost of global hard negative mining, a batched nontrivial sample mining strategy is proposed, leveraging the geometric similarity and multi-scene structure of UAVPairs to generate training samples so as to accelerate training. Third, recognizing the limitation of pair-based losses, a ranked list loss is designed to improve the discrimination of image retrieval models; it optimizes the global similarity structure constructed from the positive set and negative set. Finally, the effectiveness of the UAVPairs dataset and training pipeline is validated through comprehensive experiments on three distinct large-scale UAV datasets. The experimental results demonstrate that models trained with the UAVPairs dataset and the ranked list loss achieve significantly improved retrieval accuracy compared to models trained on existing datasets or with conventional losses. Furthermore, these improvements translate to enhanced view graph connectivity and higher quality of reconstructed 3D models. The models trained by the proposed approach perform more robustly than hand-crafted global features, particularly in challenging repetitively textured and weakly textured scenes. For match pair retrieval of large-scale UAV images, the trained image retrieval models offer an effective solution. The dataset will be made publicly available at https://github.com/json87/UAVPairs.
中文: 本文提出了UAVPairs基准数据集,包含21,622张高分辨率无人机图像,以及配套的训练流程,通过批量非平凡样本挖掘策略和排序列表损失函数,显著提升了图像检索精度和三维重建质量。
English: This paper introduces UAVPairs, a challenging benchmark dataset with 21,622 high-resolution UAV images, and a training pipeline that includes a batched nontrivial sample mining strategy and ranked list loss to significantly improve retrieval accuracy and 3D reconstruction quality.

Authors:Valentin Cuzin-Rambaud, Emilien Komlenovic, Alexandre Faure, Bruno Yun
Title: VIRAL: Vision-grounded Integration for Reward design And Learning
Abstract:
The alignment between humans and machines is a critical challenge in artificial intelligence today. Reinforcement learning, which aims to maximize a reward function, is particularly vulnerable to the risks associated with poorly designed reward functions. Recent advancements have shown that Large Language Models (LLMs) used for reward generation can outperform humans in this context. We introduce VIRAL, a pipeline for generating and refining reward functions through the use of multi-modal LLMs. VIRAL autonomously creates and interactively improves reward functions based on a given environment and a goal prompt or annotated image. The refinement process can incorporate human feedback or be guided by a description generated by a video LLM, which explains the agent's policy in video form. We evaluated VIRAL in five Gymnasium environments, demonstrating that it accelerates the learning of new behaviors while ensuring improved alignment with user intent. The source code and demo video are available at: https://github.com/VIRAL-UCBL1/VIRAL and https://youtu.be/Hqo82CxVT38.
中文: VIRAL是一种创新流程,通过多模态大语言模型自主生成并优化奖励函数,在多种环境中加速行为学习的同时,显著提升了与人类意图的对齐效果。
English: VIRAL is an innovative pipeline that uses multi-modal large language models to autonomously create and refine reward functions, accelerating behavior learning while enhancing alignment with human intent across various environments.

Authors:Ruxiao Chen, Dezheng Han, Wenjie Han, Shuaishuai Guo
Title: Cognitively-Inspired Emergent Communication via Knowledge Graphs for Assisting the Visually Impaired
Abstract:
Assistive systems for visually impaired individuals must deliver rapid, interpretable, and adaptive feedback to facilitate real-time navigation. Current approaches face a trade-off between latency and semantic richness: natural language-based systems provide detailed guidance but are too slow for dynamic scenarios, while emergent communication frameworks offer low-latency symbolic languages but lack semantic depth, limiting their utility in tactile modalities like vibration. To address these limitations, we introduce a novel framework, Cognitively-Inspired Emergent Communication via Knowledge Graphs (VAG-EC), which emulates human visual perception and cognitive mapping. Our method constructs knowledge graphs to represent objects and their relationships, incorporating attention mechanisms to prioritize task-relevant entities, thereby mirroring human selective attention. This structured approach enables the emergence of compact, interpretable, and context-sensitive symbolic languages. Extensive experiments across varying vocabulary sizes and message lengths demonstrate that VAG-EC outperforms traditional emergent communication methods in Topographic Similarity (TopSim) and Context Independence (CI). These findings underscore the potential of cognitively grounded emergent communication as a fast, adaptive, and human-aligned solution for real-time assistive technologies. Code is available at https://github.com/Anonymous-NLPcode/Anonymous_submission/tree/main.
中文摘要:VAG-EC框架通过知识图谱模拟人类认知机制,为视障导航开发出兼具低延迟与语义深度的新型符号通信系统,在语义质量指标上显著优于传统方法。
English Summary: The VAG-EC framework introduces a cognitively-inspired emergent communication system using knowledge graphs to enable fast, interpretable feedback for visually impaired navigation, outperforming existing methods in semantic quality metrics while maintaining low latency.

Authors:Shun Sato, Issei Sato
Title: Can Test-time Computation Mitigate Memorization Bias in Neural Symbolic Regression?
Abstract:
Symbolic regression aims to discover mathematical equations that fit given numerical data. It has been applied in various fields of scientific research, such as producing human-readable expressions that explain physical phenomena. Recently, neural symbolic regression (NSR) methods that involve Transformers pre-trained on large-scale synthetic datasets have gained attention. While these methods offer advantages such as short inference time, they suffer from low performance, particularly when the number of input variables is large. In this study, we hypothesized that this limitation stems from the memorization bias of Transformers in symbolic regression. We conducted a quantitative evaluation of this bias in Transformers using a synthetic dataset and found that Transformers rarely generate expressions not present in the training data. Additional theoretical analysis reveals that this bias arises from the Transformer's inability to construct expressions compositionally while verifying their numerical validity. We finally examined if tailoring test-time strategies can lead to reduced memorization bias and better performance. We empirically demonstrate that providing additional information to the model at test time can significantly mitigate memorization bias. On the other hand, we also find that reducing memorization bias does not necessarily correlate with improved performance. These findings contribute to a deeper understanding of the limitations of NSR approaches and offer a foundation for designing more robust, generalizable symbolic regression methods. Code is available at https://github.com/Shun-0922/Mem-Bias-NSR.
Chinese: 本研究识别并量化了神经符号回归中Transformer的记忆偏差,发现其难以生成训练数据外的新表达式,且测试时提供额外信息虽可减轻偏差,但未必提升性能。
English: This study identifies and quantifies the memorization bias in Transformers used for neural symbolic regression, revealing their limited ability to generate novel expressions and showing that while test-time information can reduce bias, it doesn't always improve performance.

Authors:Ran Li, Shimin Di, Yuchen Liu, Chen Jing, Yu Qiu, Lei Chen
Title: Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced(R$^2$)GRPO
Abstract:
Previous studies suggest that powerful Large Language Models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR) only refine the reasoning path without improving reasoning capacity in math tasks, whereas supervised fine-tuning (SFT) with distillation can improve it. We study this from the perspective of scientific information extraction (SciIE), where LLMs and reasoning LLMs underperform small BERT-based models. SciIE requires both reasoning and memorization. We argue that both SFT and RLVR can refine the reasoning path and improve reasoning capacity in a simple way based on SciIE. We propose a two-stage training procedure: (1) MimicSFT, which uses structured reasoning templates without needing high-quality chain-of-thought data, and (2) R$^2$GRPO, with relevance and rule-induced rewards. Experiments on scientific IE benchmarks show that both methods can improve reasoning capacity. R$^2$GRPO with MimicSFT surpasses baseline LLMs and specialized supervised models in relation extraction. Our code is available at https://github.com/ranlislz/R2GRPO.
中文摘要:研究表明,结合监督微调(MimicSFT)和强化学习(R²GRPO)能提升科学信息抽取中的推理能力,超越基线模型表现。
English Summary: The study demonstrates that combining supervised fine-tuning (MimicSFT) and reinforcement learning (R²GRPO) enhances reasoning capacity in scientific information extraction, outperforming baseline models.

Authors:Jinming Zhang, Xuanru Zhou, Jiachen Lian, Shuhe Li, William Li, Zoe Ezzes, Rian Bogley, Lisa Wauters, Zachary Miller, Jet Vonk, Brittany Morin, Maria Gorno-Tempini, Gopala Anumanchipalli
Title: Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection
Abstract:
Speech dysfluency detection is crucial for clinical diagnosis and language assessment, but existing methods are limited by the scarcity of high-quality annotated data. Although recent advances in TTS models have enabled synthetic dysfluency generation, existing synthetic datasets suffer from unnatural prosody and limited contextual diversity. To address these limitations, we propose LLM-Dys, the most comprehensive dysfluent speech corpus with LLM-enhanced dysfluency simulation. This dataset captures 11 dysfluency categories spanning both word and phoneme levels. Building upon this resource, we improve an end-to-end dysfluency detection framework. Experimental validation demonstrates state-of-the-art performance. All data, models, and code are open-sourced at https://github.com/Berkeley-Speech-Group/LLM-Dys.
Chinese: 作者提出了LLM-Dys,这是一个通过大语言模型增强的全面口吃语音语料库,旨在解决高质量标注数据稀缺和现有合成数据集不自然的问题,并实现了最先进的检测性能,所有资源均已开源。
English: The authors introduce LLM-Dys, a comprehensive dysfluent speech corpus enhanced by LLM simulation to overcome the limitations of scarce annotated data and unnatural synthetic datasets, achieving state-of-the-art detection performance with open-sourced resources.

Authors:Zi-Hao Zhou, Jun-Jie Wang, Tong Wei, Min-Ling Zhang
Title: Weakly-Supervised Contrastive Learning for Imprecise Class Labels
Abstract:
Contrastive learning has achieved remarkable success in learning effective representations, with supervised contrastive learning often outperforming self-supervised approaches. However, in real-world scenarios, data annotations are often ambiguous or inaccurate, meaning that class labels may not reliably indicate whether two examples belong to the same class. This limitation restricts the applicability of supervised contrastive learning. To address this challenge, we introduce the concept of "continuous semantic similarity" to define positive and negative pairs. Instead of directly relying on imprecise class labels, we measure the semantic similarity between example pairs, which quantifies how closely they belong to the same category by iteratively refining weak supervisory signals. Based on this concept, we propose a graph-theoretic framework for weakly-supervised contrastive learning, where semantic similarity serves as the graph weights. Our framework is highly versatile and can be applied to many weakly-supervised learning scenarios. We demonstrate its effectiveness through experiments in two common settings, i.e., noisy label and partial label learning, where existing methods can be easily integrated to significantly improve performance. Theoretically, we establish an error bound for our approach, showing that it can approximate supervised contrastive learning under mild conditions. The implementation code is available at https://github.com/Speechless-10308/WSC.
中文: 本文提出了一种基于图的弱监督对比学习框架,通过连续语义相似性替代不可靠的类别标签来定义正负样本对,在噪声标签和部分标签场景中验证了有效性,并提供了理论保证。
English: This paper introduces a graph-based framework for weakly-supervised contrastive learning that uses continuous semantic similarity instead of unreliable class labels to define positive and negative pairs, demonstrating effectiveness in noisy and partial label settings while providing theoretical guarantees.

Authors:Qiuchen Wang, Ruixue Ding, Yu Zeng, Zehui Chen, Lin Chen, Shihang Wang, Pengjun Xie, Fei Huang, Feng Zhao
Title: VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
Abstract:
Effectively retrieving, reasoning over, and understanding visually rich information remains a challenge for RAG methods. Traditional text-based methods cannot handle visual-related information. On the other hand, current vision-based RAG approaches are often limited by fixed pipelines and frequently struggle to reason effectively due to the insufficient activation of the fundamental capabilities of models. As RL has been proven to be beneficial for model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) prior multi-modal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) when models interact with search engines, their queries often fail to retrieve relevant information due to the inability to articulate requirements, thereby leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users' original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. The code is available at https://github.com/Alibaba-NLP/VRAG.
中文摘要:VRAG-RL框架通过让视觉语言模型自主采样并优化推理轨迹,结合交互式搜索引擎查询和视觉感知标记,解决了当前视觉RAG方法在复杂视觉信息推理中的固有限制。
English Summary: The VRAG-RL framework addresses limitations in visual reasoning for RAG systems by enabling VLMs to autonomously sample and optimize reasoning trajectories through interactive search engine queries and visual perception tokens.

Authors:Ruicheng Yin, Xuan Gao, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang
Title: Improving Continual Pre-training Through Seamless Data Packing
Abstract:
Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information more effectively and enhancing model performance. Our approach employs a sliding window technique in the first stage that synchronizes overlapping tokens across consecutive sequences, ensuring better continuity and contextual coherence. In the second stage, we adopt a First-Fit-Decreasing algorithm to pack shorter texts into bins slightly larger than the target sequence length, thereby minimizing padding and truncation. Empirical evaluations across various model architectures and corpus domains demonstrate the effectiveness of our method, which outperforms the baseline method in 99% of all settings. Code is available at https://github.com/Infernus-WIND/Seamless-Packing.
中文: Seamless Packing (SP) 是一种新颖的数据打包策略,通过滑动窗口和首次适应递减算法在持续预训练中有效保持上下文连贯性,在99%的评估场景中显著超越基线方法。
English: Seamless Packing (SP) is a novel data packing strategy that uses a sliding window and First-Fit-Decreasing algorithm to preserve contextual integrity during continual pre-training, significantly outperforming baseline methods in 99% of evaluations.
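
The second stage of Seamless Packing relies on the First-Fit-Decreasing heuristic. The sketch below shows the generic textbook form of that heuristic only; how the paper sizes its bins "slightly larger than the target sequence length" and how it trims the overflow are not specified in the abstract, so the `capacity` argument and the function name `first_fit_decreasing` are assumptions for illustration.

```python
def first_fit_decreasing(lengths, capacity):
    """Pack text lengths into bins of at most `capacity` tokens.

    Texts are visited from longest to shortest, and each one goes into the
    first bin that still has room; a new bin is opened when none fits.
    Returns a list of bins, each a list of the lengths placed in it.
    """
    bins = []  # lengths stored in each bin
    free = []  # remaining room in each bin, aligned with `bins`
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError(f"text of length {length} exceeds capacity {capacity}")
        for i, room in enumerate(free):
            if length <= room:
                bins[i].append(length)
                free[i] -= length
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([length])
            free.append(capacity - length)
    return bins


# Hypothetical short texts (in tokens) packed into bins of 8 tokens.
packed = first_fit_decreasing([5, 4, 3, 2, 2], capacity=8)
```

Here both bins end up completely full, so no padding is needed for this toy input.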

Authors:Ziyang Zheng, Kezhi Li, Zhengyuan Shi, Qiang Xu
Title: Functional Matching of Logic Subgraphs: Beyond Structural Isomorphism
Abstract:
Subgraph matching in logic circuits is foundational for numerous Electronic Design Automation (EDA) applications, including datapath optimization, arithmetic verification, and hardware trojan detection. However, existing techniques rely primarily on structural graph isomorphism and thus fail to identify function-related subgraphs when synthesis transformations substantially alter circuit topology. To overcome this critical limitation, we introduce the concept of functional subgraph matching, a novel approach that identifies whether a given logic function is implicitly present within a larger circuit, irrespective of structural variations induced by synthesis or technology mapping. Specifically, we propose a two-stage multi-modal framework: (1) learning robust functional embeddings across AIG and post-mapping netlists for functional subgraph detection, and (2) identifying fuzzy boundaries using a graph segmentation approach. Evaluations on standard benchmarks (ITC99, OpenABCD, ForgeEDA) demonstrate significant performance improvements over existing structural methods, with an average accuracy of 93.8% in functional subgraph detection and a Dice score of 91.3% in fuzzy boundary identification. The source code and implementation details can be found at https://github.com/zyzheng17/Functional_Subgraph_Matching-Neurips25.
中文: 本文提出功能子图匹配的新方法,通过两阶段多模态框架识别电路中的逻辑功能而不受结构变化影响,在功能检测和模糊边界识别中分别达到93.8%和91.3%的准确率。
English: This paper introduces functional subgraph matching, a novel approach that identifies logic functions in circuits regardless of structural changes; its two-stage multimodal framework achieves 93.8% average accuracy in functional subgraph detection and a 91.3% Dice score in fuzzy boundary identification.
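
The boundary result above is reported as a Dice score. As a reminder of what that metric measures, here is a minimal sketch that scores a predicted node set against a reference node set. Treating the boundary as a set of node IDs is an assumption made for illustration; the abstract does not describe the exact evaluation protocol, and the node IDs below are made up.

```python
def dice_score(predicted, reference):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two collections of items."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # two empty sets are treated as a perfect match
    overlap = len(predicted & reference)
    return 2.0 * overlap / (len(predicted) + len(reference))


pred_nodes = {1, 2, 3, 4}     # hypothetical nodes predicted to be inside the subgraph
true_nodes = {2, 3, 4, 5, 6}  # hypothetical ground-truth nodes
score = dice_score(pred_nodes, true_nodes)
print(round(score, 3))
```

Here three nodes overlap out of 4 predicted and 5 reference nodes, so the score is 2·3 / (4 + 5).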

Authors:Fakhraddin Alwajih, Samar Mohamed Magdy, Abdellah El Mekki, Omer Nacar, Youssef Nafea, Safaa Taher Abdelfadil, Abdulfattah Mohammed Yahya, Hamzah Luqman, Nada Almarwani, Samah Aloufi, Baraah Qawasmeh, Houdaifa Atou, Serry Sibaee, Hamzah A. Alsayadi, Walid Al-Dhabyani, Maged S. Al-shaibani, Aya El aatar, Nour Qandos, Rahaf Alhamouri, Samar Ahmad, Razan Khassib, Lina Hamad, Mohammed Anwar AL-Ghrawi, Fatimah Alshamari, Cheikh Malainine, Doaa Qawasmeh, Aminetou Yacoub, Tfeil moilid, Ruwa AbuHweidi, Ahmed Aboeitta, Vatimetou Mohamed Lemin, Reem Abdel-Salam, Ahlam Bashiti, Adel Ammar, Aisha Alansari, Ahmed Ashraf, Nora Alturayeif, Sara Shatnawi, Alcides Alcoba Inciarte, AbdelRahim A. Elmadany, Mohamedou cheikh tourad, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed
Title: Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset
Abstract:
Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce Pearl, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 45 annotators from across the Arab world, Pearl comprises over 309K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks, Pearl and Pearl-Lite, along with a specialized subset, Pearl-X, explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models' cultural grounding compared to conventional scaling methods. Pearl establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.
中文: 为解决主流视觉语言模型中的文化偏见问题,PEARL数据集作为包含30.9万条阿拉伯文化多模态样本的基准资源应运而生,其评估表明以推理为核心的指令对齐方法相比传统扩展方式能更有效提升模型的文化认知能力。
English: To address cultural biases in large vision-language models, the PEARL dataset was developed as a large-scale Arabic multimodal resource with over 309K culturally grounded examples; evaluations show that reasoning-centric instruction alignment improves cultural grounding substantially more than conventional scaling methods.

Authors:Fakhraddin Alwajih, Samar M. Magdy, Abdellah El Mekki, Omer Nacar, Youssef Nafea, Safaa Taher Abdelfadil, Abdulfattah Mohammed Yahya, Hamzah Luqman, Nada Almarwani, Samah Aloufi, Baraah Qawasmen, Houdaifa Atou, Serry Sibaee, Hamzah A. Alsayadi, Walid Al-Dhabyani, Maged S. Al-shaibani, Aya El Aatar, Nour Qandos, Rahaf Alhamouri, Samar Ahmad, Mohammed Anwar Al-Ghrawi, Aminetou Yacoub, Ruwa AbuHweidi, Vatimetou Mohamed Lemin, Reem Abdel-Salam, Ahlam Bashiti, Aisha Alansari, Ahmed Ashraf, Nora Alturayeif, Alcides Alcoba Inciarte, Adel Ammar, Abdelrahim A. Elmadany, Mohamedou Cheikh Tourad, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed
Title: Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset
Abstract:
Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce PEARL, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 37 annotators from across the Arab world, PEARL comprises over 309K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks (PEARL and PEARL-LITE) along with a specialized subset (PEARL-X) explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models' cultural grounding compared to conventional scaling methods. PEARL establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.
中文: 为解决主流视觉语言模型中的文化偏见问题,PEARL数据集作为包含30.9万条阿拉伯文化多模态样本的基准资源应运而生,其评估表明以推理为核心的指令对齐方法相比传统扩展方式能更有效提升模型的文化认知能力。
English: To address cultural biases in large vision-language models, the PEARL dataset was developed as a large-scale Arabic multimodal resource with over 309K culturally grounded examples; evaluations show that reasoning-centric instruction alignment improves cultural grounding substantially more than conventional scaling methods.

Authors:Weiguang Zhang, Huangcheng Lu, Maizhen Ning, Xiaowei Huang, Wei Wang, Kaizhu Huang, Qiufeng Wang
Title: DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model
Abstract:
Document dewarping aims to rectify deformations in photographic document images, thus improving text readability, which has attracted much attention and made great progress, but it is still challenging to preserve document structures. Given recent advances in diffusion models, it is natural for us to consider their potential applicability to document dewarping. However, it is far from straightforward to adopt diffusion models in document dewarping due to their unfaithful control on highly complex document images (e.g., 2000$\times$3000 resolution). In this paper, we propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework. To be specific, DvD introduces a coordinate-level denoising instead of typical pixel-level denoising, generating a mapping for deformation rectification. In addition, we further propose a time-variant condition refinement mechanism to enhance the preservation of document structures. In experiments, we find that current document dewarping benchmarks cannot evaluate dewarping models comprehensively. To this end, we present AnyPhotoDoc6300, a rigorously designed large-scale document dewarping benchmark comprising 6,300 real image pairs across three distinct domains, enabling fine-grained evaluation of dewarping models. Comprehensive experiments demonstrate that our proposed DvD can achieve state-of-the-art performance with acceptable computational efficiency on multiple metrics across various benchmarks, including DocUNet, DIR300, and AnyPhotoDoc6300. The new benchmark and code will be publicly available at https://github.com/hanquansanren/DvD.
中文: 本文提出DvD,首个基于扩散模型的文档去扭曲生成方法,通过坐标级去噪和时间变体条件优化机制,在矫正文档形变的同时有效保持结构完整性,在包括新构建的AnyPhotoDoc6300数据集在内的多个基准测试中均达到最优性能。
English: This paper introduces DvD, the first diffusion-based generative model for document dewarping that uses coordinate-level denoising and a time-variant condition refinement mechanism to effectively rectify document deformations while preserving structural integrity, achieving state-of-the-art performance across multiple benchmarks including the newly proposed AnyPhotoDoc6300 dataset.

Authors:Senmao Li, Lei Wang, Kai Wang, Tao Liu, Jiehang Xie, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang
Title: One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models
Abstract:
Text-to-Image (T2I) diffusion models have made remarkable advancements in generative modeling; however, they face a trade-off between inference speed and image quality, posing challenges for efficient deployment. Existing distilled T2I models can generate high-fidelity images with fewer sampling steps, but often struggle with diversity and quality, especially in one-step models. From our analysis, we observe redundant computations in the UNet encoders. Our findings suggest that, for T2I diffusion models, decoders are more adept at capturing richer and more explicit semantic information, while encoders can be effectively shared across decoders from diverse time steps. Based on these observations, we introduce the first Time-independent Unified Encoder (TiUE) for the student model UNet architecture, which is a loop-free image generation approach for distilling T2I diffusion models. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling and significantly reducing inference time complexity. In addition, we incorporate a KL divergence term to regularize noise prediction, which enhances the perceptual realism and diversity of the generated images. Experimental results demonstrate that TiUE outperforms state-of-the-art methods, including LCM, SD-Turbo, and SwiftBrushv2, producing more diverse and realistic results while maintaining computational efficiency.
中文: 本研究提出TiUE时间无关统一编码器,通过在多时间步解码器间共享编码器特征实现并行采样,大幅降低文生图扩散模型推理耗时,并借助KL散度正则化提升生成图像的多样性和真实感。
English: The study introduces TiUE, a time-independent unified encoder that enables parallel sampling in text-to-image diffusion models by sharing encoder features across decoders, significantly reducing inference time while enhancing image diversity and realism through KL divergence regularization.

Authors:Le Thien Phuc Nguyen, Zhuoran Yu, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Soochahn Lee, Yong Jae Lee
Title: UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios
Abstract:
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes - such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models. Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD Code: https://github.com/plnguyen2908/UniTalk-ASD-code
中文: UniTalk是一个专为主动说话人检测设计的新数据集,专注于多样化的现实场景,如多语言和拥挤环境,以提升模型泛化能力,它作为一个新基准揭示了当前先进模型在真实条件下的不足。
English: UniTalk is a novel dataset designed for active speaker detection, focusing on challenging real-world scenarios like diverse languages and crowded scenes to improve model generalization, and it serves as a new benchmark that reveals the limitations of current state-of-the-art models under realistic conditions.

Authors:Wei Lin, Chenyang Zhao, Antoni B. Chan
Title: Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting
Abstract:
Point detection has been developed to locate pedestrians in crowded scenes by training a counter through a point-to-point (P2P) supervision scheme. Despite its excellent localization and counting performance, training a point-based counter still faces challenges concerning annotation labor: hundreds to thousands of points are required to annotate a single sample capturing a dense crowd. In this paper, we integrate point-based methods into a semi-supervised counting framework based on pseudo-labeling, enabling the training of a counter with only a few annotated samples supplemented by a large volume of pseudo-labeled data. However, during implementation, the training encounters issues as the confidence for pseudo-labels fails to be propagated to background pixels via the P2P. To tackle this challenge, we devise a point-specific activation map (PSAM) to visually interpret the phenomena occurring during the ill-posed training. Observations from the PSAM suggest that the feature map is excessively activated by the loss for unlabeled data, causing the decoder to misinterpret these over-activations as pedestrians. To mitigate this issue, we propose a point-to-region (P2R) scheme to substitute P2P, which segments out local regions rather than detects a point corresponding to a pedestrian for supervision. Consequently, pixels in the local region can share the same confidence with the corresponding pseudo points. Experimental results in both semi-supervised counting and unsupervised domain adaptation highlight the advantages of our method, illustrating P2R can resolve issues identified in PSAM. The code is available at https://github.com/Elin24/P2RLoss.
中文: 本文提出了一种点对区域(P2R)方案替代点对点(P2P)方法,通过局部区域共享置信度解决半监督行人计数中的过度激活问题,有效提升了伪标签数据的训练效果。
English: This paper introduces a point-to-region (P2R) scheme to replace the point-to-point (P2P) method, addressing over-activation issues in semi-supervised pedestrian counting by enabling confidence sharing in local regions and improving training with pseudo-labels.

Authors:Chenfeng Wei, Qi Wu, Si Zuo, Jiahua Xu, Boyang Zhao, Zeyu Yang, Guotao Xie, Shenhong Wang
Title: LiDARDustX: A LiDAR Dataset for Dusty Unstructured Road Environments
Abstract:
Autonomous driving datasets are essential for validating the progress of intelligent vehicle algorithms, which include localization, perception, and prediction. However, existing datasets are predominantly focused on structured urban environments, which limits the exploration of unstructured and specialized scenarios, particularly those characterized by significant dust levels. This paper introduces the LiDARDustX dataset, which is specifically designed for perception tasks under high-dust conditions, such as those encountered in mining areas. The LiDARDustX dataset consists of 30,000 LiDAR frames captured by six different LiDAR sensors, each accompanied by 3D bounding box annotations and point cloud semantic segmentation. Notably, over 80% of the dataset comprises dust-affected scenes. By utilizing this dataset, we have established a benchmark for evaluating the performance of state-of-the-art 3D detection and segmentation algorithms. Additionally, we have analyzed the impact of dust on perception accuracy and delved into the causes of these effects. The data and further information can be accessed at: https://github.com/vincentweikey/LiDARDustX.
Chinese: LiDARDustX数据集专门针对高粉尘环境下的感知任务,提供了30,000帧带标注的LiDAR数据,建立了三维检测与分割算法的性能基准,并深入分析了粉尘对感知精度的影响机制。
English: The LiDARDustX dataset addresses the lack of autonomous driving data in dusty environments by providing 30,000 LiDAR frames with annotations, enabling benchmark evaluation of 3D detection and segmentation algorithms while analyzing dust's impact on perception accuracy.

Authors:Jianchao Jiang, Haofeng Zhang
Title: Concentrate on Weakness: Mining Hard Prototypes for Few-Shot Medical Image Segmentation
Abstract:
Few-Shot Medical Image Segmentation (FSMIS) has been widely used to train a model that can perform segmentation from only a few annotated images. However, most existing prototype-based FSMIS methods generate multiple prototypes from the support image solely by random sampling or local averaging, which can cause particularly severe boundary blurring because normal features tend to account for the majority of the features of a specific category. Consequently, we propose to pay more attention to those weaker features that are crucial for a clear segmentation boundary. Specifically, we design a Support Self-Prediction (SSP) module to identify such weak features by comparing the true support mask with the one predicted by the global support prototype. Then, a Hard Prototypes Generation (HPG) module is employed to generate multiple hard prototypes based on these weak features. Subsequently, a Multiple Similarity Maps Fusion (MSMF) module is devised to generate the final segmentation mask in a dual-path fashion to mitigate the imbalance between foreground and background in medical images. Furthermore, we introduce a boundary loss to further constrain the edge of the segmentation. Extensive experiments on three publicly available medical image datasets demonstrate that our method achieves state-of-the-art performance. Code is available at https://github.com/jcjiang99/CoW.
中文: 本文提出了一种新颖的少样本医学图像分割方法,通过支持自预测模块和硬原型生成聚焦于弱特征以提升边界清晰度,在三个公开数据集上实现了最优性能。
English: This paper introduces a novel few-shot medical image segmentation method that enhances boundary clarity by focusing on weak features through a Support Self-Prediction module and Hard Prototypes Generation, achieving state-of-the-art results on three datasets.

Authors:Tianyu Guo, Hande Dong, Yichong Leng, Feng Liu, Cheater Lin, Nong Xiao, Xianwei Zhang
Title: EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
Abstract:
Large language models (LLMs) are often used for infilling tasks, which involve predicting or generating missing information in a given text. These tasks typically require multiple interactions with similar context. To reduce the computation of repeated historical tokens, cross-request key-value (KV) cache reuse, a technique that stores and reuses intermediate computations, has become a crucial method in multi-round interactive services. However, in infilling tasks, the KV cache reuse is often hindered by the structure of the prompt format, which typically consists of a prefix and suffix relative to the insertion point. Specifically, the KV cache of the prefix or suffix part is frequently invalidated as the other part (suffix or prefix) is incrementally generated. To address the issue, we propose EFIM, a transformed prompt format of FIM to unleash the performance potential of KV cache reuse. Although the transformed prompt can solve the inefficiency, it exposes subtoken generation problems in current LLMs, where they have difficulty generating partial words accurately. Therefore, we introduce a fragment tokenization training method which splits text into multiple fragments before tokenization during data processing. Experiments on two representative LLMs show that LLM serving with EFIM can lower the latency by 52% and improve the throughput by 98% while maintaining the original infilling capability. EFIM's source code is publicly available at https://github.com/gty111/EFIM.
中文摘要:EFIM通过改进提示格式和引入分段标记化训练,有效提升大语言模型中KV缓存的复用效率,在保持原有填充能力的同时,将延迟降低52%、吞吐量提高98%。
English Summary: EFIM introduces a transformed prompt format and fragment tokenization training to enhance KV cache reuse in large language models, significantly reducing latency by 52% and boosting throughput by 98% while preserving infilling performance.

Authors:Ruijie Li, Xiang Zhao, Qiao Ning, Shikai Guo
Title: HydraNet: Momentum-Driven State Space Duality for Multi-Granularity Tennis Tournaments Analysis
Abstract:
In tennis tournaments, momentum, a critical yet elusive phenomenon, reflects the dynamic shifts in performance of athletes that can decisively influence match outcomes. Despite its significance, effective modeling and multi-granularity analysis of momentum across points, games, sets, and matches in tennis tournaments remain underexplored. In this study, we define a novel Momentum Score (MS) metric to quantify a player's momentum level in multi-granularity tennis tournaments, and design HydraNet, a momentum-driven state-space duality-based framework, to model MS by integrating thirty-two heterogeneous dimensions of athletes' performance in serve, return, psychology and fatigue. HydraNet integrates a Hydra module, which builds upon a state-space duality (SSD) framework, capturing explicit momentum with a sliding-window mechanism and implicit momentum through cross-game state propagation. It also introduces a novel Versus Learning method to better enhance the adversarial nature of momentum between the two athletes at a macro level, along with a Collaborative-Adversarial Attention Mechanism (CAAM) for capturing and integrating intra-player and inter-player dynamic momentum at a micro level. Additionally, we construct a million-level tennis cross-tournament dataset spanning 2012-2023 Wimbledon and 2013-2023 US Open, and validate the multi-granularity modeling capability of HydraNet for the MS metric on this dataset. Extensive experimental evaluations demonstrate that the MS metric constructed by the HydraNet framework provides actionable insights into how momentum impacts outcomes at different granularities, establishing a new foundation for momentum modeling and sports analysis. To the best of our knowledge, this is the first work to explore and effectively model momentum across multiple granularities in professional tennis tournaments.
Chinese: 本研究首次在职业网球赛事中引入动量评分指标和HydraNet框架,通过多粒度分析和对抗学习机制,有效量化并建模了运动员在比赛不同层级的动量变化,为体育分析建立了新基础。
English: This study introduces a novel Momentum Score (MS) metric and the HydraNet framework to quantify and model momentum across multiple granularities in tennis tournaments, using a comprehensive dataset and advanced mechanisms to capture both explicit and implicit momentum dynamics.

Authors:Guiping Cao, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang, Yaowei Wang
Title: Cross-DINO: Cross the Deep MLP and Transformer for Small Object Detection
Abstract:
Small Object Detection (SOD) poses significant challenges due to limited information and the model's low class prediction score. While Transformer-based detectors have shown promising performance, their potential for SOD remains largely unexplored. In typical DETR-like frameworks, the CNN backbone network, specialized in aggregating local information, struggles to capture the necessary contextual information for SOD. The multiple attention layers in the Transformer Encoder face difficulties in effectively attending to small objects and can also lead to blurring of features. Furthermore, the model's lower class prediction score for small objects compared to large objects further increases the difficulty of SOD. To address these challenges, we introduce a novel approach called Cross-DINO. This approach incorporates a deep MLP network to aggregate initial feature representations with both short- and long-range information for SOD. Then, a new Cross Coding Twice Module (CCTM) is applied to integrate these initial representations into the Transformer Encoder feature, enhancing the details of small objects. Additionally, we introduce a new kind of soft label named Category-Size (CS), integrating the Category and Size of objects. By treating CS as new ground truth, we propose a new loss function called Boost Loss to improve the class prediction score of the model. Extensive experimental results on COCO, WiderPerson, VisDrone, AI-TOD, and SODA-D datasets demonstrate that Cross-DINO efficiently improves the performance of DETR-like models on SOD. Specifically, our model achieves 36.4% AP_S on COCO for SOD with only 45M parameters, outperforming DINO by +4.4% AP_S (36.4% vs. 32.0%) with fewer parameters and FLOPs, under the 12-epoch training setting. The source codes will be available at https://github.com/Med-Process/Cross-DINO.
中文摘要:Cross-DINO通过结合深度MLP网络和交叉二次编码模块增强小物体特征细节,同时引入类别-尺寸软标签及提升损失函数来提高分类预测分数,以更少参数实现了优于现有模型的检测性能。
English Summary: Cross-DINO enhances small object detection by integrating deep MLP networks and a Cross Coding Twice Module to improve feature details, while introducing Category-Size soft labels with Boost Loss to boost class prediction scores, achieving superior performance over existing models with fewer parameters.

Authors:Chenhui Zhao, Yiwei Lyu, Asadur Chowdury, Edward Harake, Akhil Kondepudi, Akshay Rao, Xinhai Hou, Honglak Lee, Todd Hollon
Title: Towards Scalable Language-Image Pre-training for 3D Medical Imaging
Abstract:
Language-image pre-training has demonstrated strong performance in 2D medical imaging, but its success in 3D modalities such as CT and MRI remains limited due to the high computational demands of volumetric data, which pose a significant barrier to training on large-scale, uncurated clinical studies. In this study, we introduce Hierarchical attention for Language-Image Pre-training (HLIP), a scalable pre-training framework for 3D medical imaging. HLIP adopts a lightweight hierarchical attention mechanism inspired by the natural hierarchy of radiology data: slice, scan, and study. This mechanism exhibits strong generalizability, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. Moreover, the computational efficiency of HLIP enables direct training on uncurated datasets. Trained on 220K patients with 3.13 million scans for brain MRI and 240K patients with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +32.4% balanced ACC on the proposed publicly available brain MRI benchmark Pub-Brain-5; +1.4% and +6.9% macro AUC on head CT benchmarks RSNA and CQ500, respectively. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at https://github.com/Zch0414/hlip
中文: HLIP框架采用分层注意力机制,实现了对大规模未筛选3D医学影像的高效语言-图像预训练,在多个CT和MRI基准测试中均取得了最先进的性能表现。
English: The HLIP framework introduces a hierarchical attention mechanism for 3D medical imaging that enables efficient language-image pre-training on large-scale uncurated datasets, achieving state-of-the-art performance across multiple CT and MRI benchmarks.

Authors:Chenhui Zhao, Yiwei Lyu, Asadur Chowdury, Edward Harake, Akhil Kondepudi, Akshay Rao, Xinhai Hou, Honglak Lee, Todd Hollon
Title: Towards Scalable Language-Image Pre-training for 3D Medical Imaging
Abstract:
The scalability of current language-image pre-training for 3D medical imaging, such as CT and MRI, is constrained by the need for radiologists to manually curate raw clinical studies. In this work, we pioneer pre-training directly on uncurated studies, which both aligns more closely with the radiologist's workflow and provides a natural path to scalability. However, the unique structure of such data presents new challenges for existing model architectures, which were originally designed for 2D slices or single 3D scans. To address this, we introduce a novel hierarchical attention mechanism inspired by the intrinsic hierarchy of radiology data: slice, scan, and study. We denote our framework as Hierarchical attention for Language-Image Pre-training (HLIP). Trained on 220K studies with 3.13 million scans for brain MRI and 240K studies with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +10.5% balanced ACC on the proposed publicly available brain MRI benchmark Pub-Brain-5; +8.3% and +1.7% macro AUC on head CT benchmarks CQ500 and RSNA, respectively. HLIP also exhibits strong generalizability on existing 3D medical language-image pre-training benchmarks, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at https://github.com/Zch0414/hlip.
中文: HLIP框架采用分层注意力机制,实现了对大规模未筛选3D医学影像的高效语言-图像预训练,在多个CT和MRI基准测试中均取得了最先进的性能表现。
English: The HLIP framework introduces a hierarchical attention mechanism for 3D medical imaging that enables efficient language-image pre-training on large-scale uncurated datasets, achieving state-of-the-art performance across multiple CT and MRI benchmarks.

Authors:Xuwei Xu, Yang Li, Yudong Chen, Jiajun Liu, Sen Wang
Title: RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers
Abstract:
We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact growing as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of ReParameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model sizes, demonstrating greater speed improvements and progressively narrowing accuracy gaps or even higher accuracies on larger models. In particular, RePa-ViT-Large and RePa-ViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. To the best of our knowledge, RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs, and we believe that it represents an auspicious direction for efficient ViTs. Source code is available at https://github.com/Ackesnal/RePaViT.
中文: 本研究揭示前馈网络层是视觉Transformer推理延迟的主要瓶颈,并提出通道空闲机制实现结构重参数化,由此开发的RePaViT模型在保持精度的同时大幅提升推理速度。
English: This study identifies feedforward network layers as the main bottleneck in Vision Transformer inference latency and introduces a channel idle mechanism enabling structural reparameterization to create efficient RePaViT models that achieve significant speed gains with minimal accuracy loss.
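
The channel idle idea above can be illustrated with a toy FFN in which some hidden channels skip the activation, so that their two linear maps collapse into one matrix at inference. This is a minimal sketch under simplifying assumptions, not the RePaViT code: it omits biases, normalization, and the residual shortcut, uses ReLU in place of the activation a real ViT would use, and all function names and weight values are hypothetical.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu(v):
    return [max(0.0, x) for x in v]

def ffn_training_form(x, w1, w2, n_active):
    """Two-layer FFN where only the first n_active hidden channels are activated."""
    hidden = matvec(w1, x)
    hidden = relu(hidden[:n_active]) + hidden[n_active:]  # idle channels bypass ReLU
    return matvec(w2, hidden)

def reparameterize(w1, w2, n_active):
    """Fold the idle (purely linear) path into a single input-to-output matrix."""
    w1_act, w1_idle = w1[:n_active], w1[n_active:]
    w2_act = [row[:n_active] for row in w2]
    w2_idle = [row[n_active:] for row in w2]
    w_linear = matmul(w2_idle, w1_idle)  # (out x idle) @ (idle x in) -> (out x in)
    return w1_act, w2_act, w_linear

def ffn_inference_form(x, w1_act, w2_act, w_linear):
    """Narrow nonlinear branch plus the merged linear branch."""
    nonlinear = matvec(w2_act, relu(matvec(w1_act, x)))
    linear = matvec(w_linear, x)
    return [a + b for a, b in zip(nonlinear, linear)]

# Hypothetical weights: input dim 2, hidden dim 4, of which 2 channels are idle.
w1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0], [2.0, 1.0]]
w2 = [[1.0, 0.0, 2.0, -1.0], [0.0, 1.0, -1.0, 0.5]]
x = [1.0, -2.0]

y_train = ffn_training_form(x, w1, w2, n_active=2)
y_infer = ffn_inference_form(x, *reparameterize(w1, w2, n_active=2))
print(y_train, y_infer)
```

Both forms give the same output for this input, while the inference form replaces the two idle-channel matrices with one input-sized matrix; how much latency this saves in practice depends on the idle ratio and hardware, which this sketch does not model.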

Authors:Chen Yueh-Han, Guy Davidson, Brenden M. Lake
Title: SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts
Abstract:
Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions. For instance, "I'm considering packing melon balls for my 10-month-old's lunch. What other foods would be good to include?" Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to provide such warnings could result in serious injuries or even death. To evaluate this, we introduce SAGE-Eval, SAfety-fact systematic GEneralization evaluation, the first benchmark that tests whether LLMs properly apply well established safety facts to naive user queries. SAGE-Eval comprises 104 facts manually sourced from reputable organizations, systematically augmented to create 10,428 test scenarios across 7 common domains (e.g., Outdoor Activities, Medicine). We find that the top model, Claude-3.7-sonnet, passes only 58% of all the safety facts tested. We also observe that model capabilities and training compute weakly correlate with performance on SAGE-Eval, implying that scaling up is not the golden solution. Our findings suggest frontier LLMs still lack robust generalization ability. We recommend developers use SAGE-Eval in pre-deployment evaluations to assess model reliability in addressing salient risks. We publicly release SAGE-Eval at https://huggingface.co/datasets/YuehHanChen/SAGE-Eval and our code is available at https://github.com/YuehHanChen/SAGE-Eval/tree/main.
中文:SAGE-Eval评估基准显示,大型语言模型(包括仅达58%通过率的顶尖模型Claude-3.7-sonnet)无法将既定安全知识可靠应用于新场景,表明单纯扩大模型规模无法解决这一关键安全隐患。
English: The SAGE-Eval benchmark reveals that large language models, including the top-performing Claude-3.7-sonnet, which passes only 58% of the safety facts tested, fail to reliably apply established safety facts to novel scenarios, indicating that scaling alone cannot resolve this critical safety gap.

Authors:Clark Mingxuan Ju, Leonardo Neves, Bhuvesh Kumar, Liam Collins, Tong Zhao, Yuwei Qiu, Qing Dou, Sohail Nizam, Sen Yang, Neil Shah
Title: Revisiting Self-attention for Cross-domain Sequential Recommendation
Abstract:
Sequential recommendation is a popular paradigm in modern recommender systems. In particular, one challenging problem in this space is cross-domain sequential recommendation (CDSR), which aims to predict future behaviors given user interactions across multiple domains. Existing CDSR frameworks are mostly built on the self-attention transformer and seek to improve by explicitly injecting additional domain-specific components (e.g. domain-aware module blocks). While these additional components help, we argue they overlook the core self-attention module already present in the transformer, a naturally powerful tool to learn correlations among behaviors. In this work, we aim to improve the CDSR performance for simple models from a novel perspective of enhancing the self-attention. Specifically, we introduce a Pareto-optimal self-attention and formulate the cross-domain learning as a multi-objective problem, where we optimize the recommendation task while dynamically minimizing the cross-domain attention scores. Our approach automates knowledge transfer in CDSR (dubbed as AutoCDSR) -- it not only mitigates negative transfer but also encourages complementary knowledge exchange among auxiliary domains. Based on the idea, we further introduce AutoCDSR+, a more performant variant with slight additional cost. Our proposal is easy to implement and works as a plug-and-play module that can be incorporated into existing transformer-based recommenders. Besides flexibility, it is practical to deploy because it brings little extra computational overheads without heavy hyper-parameter tuning. AutoCDSR on average improves Recall@10 for SASRec and Bert4Rec by 9.8% and 16.0% and NDCG@10 by 12.0% and 16.7%, respectively. Code is available at https://github.com/snap-research/AutoCDSR.
中文: 本文提出AutoCDSR方法,通过将跨域序列推荐建模为多目标优化问题来增强自注意力机制,实现自动知识迁移,在提升推荐性能的同时保持较低计算成本。
English: This paper introduces AutoCDSR, a novel approach that enhances self-attention in transformers for cross-domain sequential recommendation by formulating it as a multi-objective optimization problem to automate knowledge transfer and improve performance with minimal computational overhead.

Authors:Claudia Cuttano, Gabriele Trivigno, Giuseppe Averta, Carlo Masone
Title: SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation
Abstract:
Few-shot segmentation aims to segment unseen object categories from just a handful of annotated examples. This requires mechanisms that can both identify semantically related objects across images and accurately produce segmentation masks. We note that Segment Anything 2 (SAM2), with its prompt-and-propagate mechanism, offers both strong segmentation capabilities and a built-in feature matching process. However, we show that its representations are entangled with task-specific cues optimized for object tracking, which impairs its use for tasks requiring higher-level semantic understanding. Our key insight is that, despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features. We propose SANSA (Semantically AligNed Segment Anything 2), a framework that makes this latent structure explicit, and repurposes SAM2 for few-shot segmentation through minimal task-specific modifications. SANSA achieves state-of-the-art performance on few-shot segmentation benchmarks specifically designed to assess generalization, outperforms generalist methods in the popular in-context setting, supports flexible interaction via various prompts such as points, boxes, or scribbles, and remains significantly faster and more compact than prior approaches. Code is available at https://github.com/ClaudiaCuttano/SANSA.
中文: SANSA框架通过显式对齐Segment Anything 2(SAM2)的潜在语义特征,以最小改动实现了最先进的少样本分割性能,在保持高效性的同时支持多种交互提示方式。
English: The SANSA framework enhances the Segment Anything 2 (SAM2) model by aligning its latent semantic features, enabling state-of-the-art few-shot segmentation with minimal modifications while maintaining efficiency and flexibility across various prompts.

Authors:Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, Zsolt Kira
Title: FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
Abstract:
Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or particular to some types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at https://github.com/chengyuehuang511/FRAMES-VQA .
中文: 本文提出了FRAMES-VQA新基准,用于评估视觉问答任务中针对多模态分布变化的鲁棒微调方法,通过将数据集分类为分布内和分布外类型并分析模态交互作用,为开发更鲁棒的微调方法提供指导。
English: This paper introduces FRAMES-VQA, a new benchmark for evaluating robust fine-tuning in visual question answering across multi-modal distribution shifts, categorizing datasets into in-distribution and out-of-distribution types while analyzing modality interactions to guide future method development.
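
The abstract above mentions quantifying distribution shift with the Mahalanobis distance over embeddings. A minimal sketch of that idea follows, using only the Python standard library. The synthetic Gaussian "embeddings", the helper names (`mean_shift`, `sample`), and the averaging over samples are illustrative assumptions, not the benchmark's actual pipeline.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows, mu, eps=1e-6):
    """Sample covariance matrix, with a small ridge (eps) for stability."""
    n, d = len(rows), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [x - m for x, m in zip(r, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    for i in range(d):
        cov[i][i] += eps
    return cov

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, d):
            f = m[r][col] / m[col][col]
            for c in range(col, d + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * d
    for i in reversed(range(d)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, d))
        x[i] = (m[i][d] - s) / m[i][i]
    return x

def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)), computed via a linear solve."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(cov, diff)
    return max(sum(di * zi for di, zi in zip(diff, z)), 0.0) ** 0.5

def mean_shift(id_rows, other_rows):
    """Average Mahalanobis distance of other_rows to the ID distribution."""
    mu = mean_vector(id_rows)
    cov = covariance(id_rows, mu)
    return sum(mahalanobis(x, mu, cov) for x in other_rows) / len(other_rows)

# Synthetic stand-ins for embeddings (assumption: real ones come from a model).
rng = random.Random(0)

def sample(center, n, d=4):
    return [[rng.gauss(center, 1.0) for _ in range(d)] for _ in range(n)]

id_emb = sample(0.0, 300)
near_shift = mean_shift(id_emb, sample(0.5, 100))
far_shift = mean_shift(id_emb, sample(3.0, 100))
```

Under these assumptions, a dataset whose embeddings sit farther from the in-distribution mean receives a larger score, which is the property that would let datasets be ordered as near or far OOD.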

Authors:Yitong Li, Morteza Ghahremani, Christian Wachinger
Title: MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis
Abstract:
Recent vision-language foundation models deliver state-of-the-art results on natural image classification but falter on medical images due to pronounced domain shifts. At the same time, training a medical foundation model requires substantial resources, including extensive annotated data and high computational capacity. To bridge this gap with minimal overhead, we introduce MedBridge, a lightweight multimodal adaptation framework that re-purposes pretrained VLMs for accurate medical image diagnosis. MedBridge comprises three key components. First, a Focal Sampling module extracts high-resolution local regions to capture subtle pathological features and compensate for the limited input resolution of general-purpose VLMs. Second, a Query Encoder (QEncoder) injects a small set of learnable queries that attend to the frozen feature maps of the VLM, aligning them with medical semantics without retraining the entire backbone. Third, a Mixture of Experts mechanism, driven by learnable queries, harnesses the complementary strength of diverse VLMs to maximize diagnostic performance. We evaluate MedBridge on five medical imaging benchmarks across three key adaptation tasks, demonstrating its superior performance in both cross-domain and in-domain adaptation settings, even under varying levels of training data availability. Notably, MedBridge achieved a 6-15% improvement in AUC compared to state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis, underscoring its effectiveness in leveraging foundation models for accurate and data-efficient medical diagnosis. Our code is available at https://github.com/ai-med/MedBridge.
中文摘要:MedBridge是一种轻量级多模态适配框架,通过聚焦采样、查询编码和专家混合机制,将预训练的视觉语言模型高效应用于医学图像诊断,在多项基准测试中显著提升性能且资源消耗极低。
English Summary: MedBridge is a lightweight multimodal framework that adapts pretrained vision-language models for medical image diagnosis by incorporating focal sampling, query encoding, and mixture of experts to achieve superior performance with minimal computational overhead.

Authors:Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, Chuchu Fan
Title: R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
Abstract:
Despite advances in reasoning and planning of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, in which textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), with the emergent self-checking behavior via code generation. Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
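中文摘要:本文提出R1-Code-Interpreter,通过多轮监督微调和强化学习训练纯文本大语言模型在逐步推理中自主生成代码查询;最终模型R1-CI-14B在37项测试任务上的平均准确率从44.0%提升至64.1%,超过纯文本GPT-4o(58.6%),并接近带代码解释器的GPT-4o(70.9%)。
English Summary: This paper introduces R1-Code-Interpreter, which trains a text-only LLM via multi-turn supervised fine-tuning and reinforcement learning to autonomously generate code queries during step-by-step reasoning; the final R1-CI-14B model raises average accuracy on 37 test tasks from 44.0% to 64.1%, outperforming text-only GPT-4o (58.6%) and approaching GPT-4o with Code Interpreter (70.9%).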

Authors:Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, Na Li, Chuchu Fan
Title: R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning
Abstract:
Practical guidance on training Large Language Models (LLMs) to leverage Code Interpreter across diverse tasks remains lacking. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. Unlike prior RL + tool-use efforts focused on narrow domains such as math or retrieval, we curate 144 diverse reasoning and planning tasks and show that training a general-purpose Code Interpreter across them presents significant challenges due to task heterogeneity and scarcity of effective samples. To address this, we introduce a multi-stage curriculum learning approach that partitions training samples by measured improvement potential. The RL training prioritizes samples with higher potential and gradually shifts to lower-potential ones, increasing the average RL gains from merely +3.4% to +9.3% across Qwen-2.5 models (3/7/14B). Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.1% to 72.4%, outperforming text-only GPT-4o (58.6%) and GPT-4o with Code Interpreter (70.9%). Notably, R1-CI-14B also exhibits emergent self-checking behavior through code generation. Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
中文摘要:本文提出R1-Code-Interpreter,通过多阶段课程学习方法训练大语言模型跨领域自主生成代码查询,在多项测试任务中显著超越了GPT-4o模型的性能表现。
English Summary: This paper introduces R1-Code-Interpreter, a multi-stage curriculum learning approach that trains LLMs to autonomously generate code queries across diverse tasks, achieving significant performance improvements over GPT-4o models.

Authors:Owen Oertell, Shikun Sun, Yiding Chen, Jin Peng Zhou, Zhiyong Wang, Wen Sun
Title: Efficient Controllable Diffusion via Optimal Classifier Guidance
Abstract:
The controllable generation of diffusion models aims to steer the model to generate samples that optimize some given objective functions. It is desirable for a variety of applications including image generation, molecule generation, and DNA/sequence generation. Reinforcement Learning (RL) based fine-tuning of the base model is a popular approach but it can overfit the reward function while requiring significant resources. We frame controllable generation as a problem of finding a distribution that optimizes a KL-regularized objective function. We present SLCD -- Supervised Learning based Controllable Diffusion, which iteratively generates online data and trains a small classifier to guide the generation of the diffusion model. Similar to the standard classifier-guided diffusion, SLCD's key computation primitive is classification and does not involve any complex concepts from RL or control. Via a reduction to no-regret online learning analysis, we show that under KL divergence, the output from SLCD provably converges to the optimal solution of the KL-regularized objective. Further, we empirically demonstrate that SLCD can generate high quality samples with nearly the same inference time as the base model in both image generation with continuous diffusion and biological sequence generation with discrete diffusion. Our code is available at https://github.com/Owen-Oertell/slcd
中文摘要:SLCD是一种基于监督学习的可控扩散模型方法,通过迭代生成在线数据并训练小型分类器来引导生成过程,在理论上保证收敛的同时保持与基础模型相近的快速推理速度。
English Summary: SLCD is a supervised learning method for controllable diffusion models that uses iterative online data generation and classifier training to guide sampling, achieving provable convergence and maintaining fast inference times comparable to the base model.
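
The KL-regularized objective mentioned in the abstract above is commonly written in the following standard form. This is a sketch of that textbook formulation, not necessarily the paper's notation: the symbols $q$ (base diffusion model distribution), $r$ (reward or objective function), and $\eta > 0$ (regularization strength) are assumptions introduced here.

```latex
% Standard KL-regularized objective over distributions p (assumed notation):
\[
  p^{\star} \;=\; \arg\max_{p}\; \mathbb{E}_{x \sim p}\bigl[r(x)\bigr]
  \;-\; \frac{1}{\eta}\,\mathrm{KL}\bigl(p \,\|\, q\bigr)
\]
% Its maximizer is the base distribution reweighted by the exponentiated reward:
\[
  p^{\star}(x) \;=\; \frac{q(x)\,\exp\bigl(\eta\, r(x)\bigr)}
                          {\mathbb{E}_{x' \sim q}\bigl[\exp\bigl(\eta\, r(x')\bigr)\bigr]}
\]
```

In this form, a larger $\eta$ weights the reward more heavily, while $\eta \to 0$ recovers the base distribution $q$.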

Authors:Xiaole Tang, Xiaoyi He, Xiang Gu, Jian Sun
Title: BaryIR: Learning Multi-Source Unified Representation in Continuous Barycenter Space for Generalizable All-in-One Image Restoration
Abstract:
Despite remarkable advances made in all-in-one image restoration (AIR) for handling different types of degradations simultaneously, existing methods remain vulnerable to out-of-distribution degradations and images, limiting their real-world applicability. In this paper, we propose a multi-source representation learning framework BaryIR, which decomposes the latent space of multi-source degraded images into a continuous barycenter space for unified feature encoding and source-specific subspaces for specific semantic encoding. Specifically, we seek the multi-source unified representation by introducing a multi-source latent optimal transport barycenter problem, in which a continuous barycenter map is learned to transport the latent representations to the barycenter space. The transport cost is designed such that the representations from source-specific subspaces are contrasted with each other while maintaining orthogonality to those from the barycenter space. This enables BaryIR to learn compact representations with unified degradation-agnostic information from the barycenter space, as well as degradation-specific semantics from source-specific subspaces, capturing the inherent geometry of multi-source data manifold for generalizable AIR. Extensive experiments demonstrate that BaryIR achieves competitive performance compared to state-of-the-art all-in-one methods. Particularly, BaryIR exhibits superior generalization ability to real-world data and unseen degradations. The code will be publicly available at https://github.com/xl-tang3/BaryIR.
中文: BaryIR提出了一种多源表示学习框架,通过将图像退化分解为统一的重心空间和特定源子空间,显著提升了全功能图像恢复方法对现实场景和未知退化的泛化能力。
English: BaryIR introduces a multi-source representation learning framework that decomposes image degradations into a unified barycenter space and source-specific subspaces, enhancing generalization for all-in-one image restoration across diverse real-world scenarios.

Authors:Chengyu Yang, Chengjun Liu
Title: Laparoscopic Image Desmoking Using the U-Net with New Loss Function and Integrated Differentiable Wiener Filter
Abstract:
Laparoscopic surgeries often suffer from reduced visual clarity due to surgical smoke generated by surgical instruments, which poses significant challenges for both surgeons and vision-based computer-assisted technologies. To remove the surgical smoke, a novel method that combines a U-Net with a new loss function and an integrated differentiable Wiener filter (ULW) is presented. Specifically, the new loss function integrates pixel, structural, and perceptual properties. Thus, the new loss function, which combines the structural similarity index measure loss, the perceptual loss, and the mean squared error loss, is able to enhance the quality and realism of the reconstructed images. Furthermore, the learnable Wiener filter is capable of effectively modelling the degradation process caused by the surgical smoke. The effectiveness of the proposed ULW method is evaluated using a publicly available paired laparoscopic smoke and smoke-free image dataset, which provides reliable benchmarking and quantitative comparisons. Experimental results show that the proposed ULW method excels in both visual clarity and metric-based evaluation. As a result, the proposed ULW method offers a promising solution for real-time enhancement of laparoscopic imagery. The code is available at https://github.com/chengyuyang-njit/ImageDesmoke.
中文摘要:ULW方法结合新型U-Net网络、综合损失函数和可微维纳滤波器,能有效消除腹腔镜图像中的手术烟雾,显著提升图像清晰度,为实时手术应用提供可靠解决方案。
English Summary: The ULW method, combining a novel U-Net with an integrated loss function and differentiable Wiener filter, effectively removes surgical smoke from laparoscopic images to enhance visual clarity and support real-time surgical applications.

Authors:Zhengyuan Jiang, Moyang Guo, Kecen Li, Yuepeng Hu, Yupu Wang, Zhicong Huang, Cheng Hong, Neil Zhenqiang Gong
Title: VideoMarkBench: Benchmarking Robustness of Video Watermarking
Abstract:
The rapid development of video generative models has led to a surge in highly realistic synthetic videos, raising ethical concerns related to disinformation and copyright infringement. Recently, video watermarking has been proposed as a mitigation strategy by embedding invisible marks into AI-generated videos to enable subsequent detection. However, the robustness of existing video watermarking methods against both common and adversarial perturbations remains underexplored. In this work, we introduce VideoMarkBench, the first systematic benchmark designed to evaluate the robustness of video watermarks under watermark removal and watermark forgery attacks. Our study encompasses a unified dataset generated by three state-of-the-art video generative models, across three video styles, incorporating four watermarking methods and seven aggregation strategies used during detection. We comprehensively evaluate 12 types of perturbations under white-box, black-box, and no-box threat models. Our findings reveal significant vulnerabilities in current watermarking approaches and highlight the urgent need for more robust solutions. Our code is available at https://github.com/zhengyuan-jiang/VideoMarkBench.
Chinese: 本文提出了首个系统性评估视频水印在去除和伪造攻击下鲁棒性的基准VideoMarkBench,揭示了现有方法的显著脆弱性,并强调了开发更强健解决方案的迫切性。
English: This paper introduces VideoMarkBench, the first benchmark to systematically evaluate the vulnerability of video watermarks against removal and forgery attacks, revealing significant weaknesses in current methods and underscoring the need for more robust solutions.

Authors:Miao Peng, Nuo Chen, Jianheng Tang, Jia Li
Title: How does Misinformation Affect Large Language Model Behaviors and Preferences?
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in knowledge-intensive tasks, while they remain vulnerable when encountering misinformation. Existing studies have explored the role of LLMs in combating misinformation, but there is still a lack of fine-grained analysis on the specific aspects and extent to which LLMs are influenced by misinformation. To bridge this gap, we present MisBench, the current largest and most comprehensive benchmark for evaluating LLMs' behavior and knowledge preference toward misinformation. MisBench consists of 10,346,712 pieces of misinformation, which uniquely considers both knowledge-based conflicts and stylistic variations in misinformation. Empirical results reveal that while LLMs demonstrate comparable abilities in discerning misinformation, they still remain susceptible to knowledge conflicts and stylistic variations. Based on these findings, we further propose a novel approach called Reconstruct to Discriminate (RtD) to strengthen LLMs' ability to detect misinformation. Our study provides valuable insights into LLMs' interactions with misinformation, and we believe MisBench can serve as an effective benchmark for evaluating LLM-based detectors and enhancing their reliability in real-world applications. Codes and data are available at https://github.com/GKNL/MisBench.
Chinese: 大型语言模型在知识冲突和风格变化方面对错误信息表现出显著脆弱性,为此我们构建了MisBench这一全面基准,通过提出的重构判别方法(RtD)来评估并增强其检测能力。
English: Large Language Models (LLMs) demonstrate significant vulnerability to misinformation, particularly in knowledge conflicts and stylistic variations, leading to the creation of MisBench—a comprehensive benchmark to evaluate and enhance their detection capabilities through the proposed Reconstruct to Discriminate (RtD) method.

Authors:Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Title: R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
Abstract:
Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs' reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce Roads to Rome (R2R), a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
中文: Roads to Rome (R2R)方法通过神经令牌路由技术,仅将关键推理令牌交由大模型处理,其余生成任务由小模型承担,在保持相当性能的同时实现2.8倍加速,显著提升了推理效率边界。
English: The Roads to Rome (R2R) method selectively routes only critical reasoning tokens to large language models while delegating most generation to small models, achieving comparable performance with 2.8x speedup and advancing efficiency frontiers.
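
The token-routing idea described above can be sketched as a simple decoding loop: the small model proposes every token, and a router decides whether to replace the proposal with the large model's token. Everything in this sketch is a toy assumption: `toy_slm`, `toy_llm`, and `toy_router` are hypothetical stand-ins (a hand-written rule replaces the trained neural router), and the loop is not the authors' implementation.

```python
from typing import Callable, List, Tuple

Token = str
Model = Callable[[List[Token]], Token]          # context -> next token
Router = Callable[[List[Token], Token], bool]   # (context, SLM token) -> use LLM?

def routed_generate(
    prompt: List[Token],
    slm: Model,
    llm: Model,
    router: Router,
    max_new_tokens: int = 16,
    eos: Token = "<eos>",
) -> Tuple[List[Token], int]:
    """Generate with the SLM by default; defer to the LLM when the router fires."""
    tokens = list(prompt)
    llm_calls = 0
    for _ in range(max_new_tokens):
        candidate = slm(tokens)            # cheap proposal from the small model
        if router(tokens, candidate):      # flagged as potentially path-divergent
            candidate = llm(tokens)        # pay for the large model on this token
            llm_calls += 1
        tokens.append(candidate)
        if candidate == eos:
            break
    return tokens, llm_calls

# Toy stand-ins (hypothetical): the SLM gets the token after "=" wrong.
PROMPT = ["2", "+", "2", "="]
SLM_OUT = ["5", "<eos>"]
LLM_OUT = ["4", "<eos>"]

def toy_slm(ctx: List[Token]) -> Token:
    return SLM_OUT[len(ctx) - len(PROMPT)]

def toy_llm(ctx: List[Token]) -> Token:
    return LLM_OUT[len(ctx) - len(PROMPT)]

def toy_router(ctx: List[Token], candidate: Token) -> bool:
    # Hand-written rule in place of a learned router: only the token
    # immediately after "=" is treated as critical.
    return ctx[-1] == "="

output, llm_calls = routed_generate(PROMPT, toy_slm, toy_llm, toy_router)
```

In this toy run only one of the two generated tokens is routed to the large model, which illustrates why the average activated parameter count can stay well below that of the large model.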

Authors:Shreyas Gururaj, Lars Grüne, Wojciech Samek, Sebastian Lapuschkin, Leander Weber
Title: Relevance-driven Input Dropout: an Explanation-guided Regularization Technique
Abstract:
Overfitting is a well-known issue extending even to state-of-the-art (SOTA) Machine Learning (ML) models, resulting in reduced generalization, and a significant train-test performance gap. Mitigation measures include a combination of dropout, data augmentation, weight decay, and other regularization techniques. Among the various data augmentation strategies, occlusion is a prominent technique that typically focuses on randomly masking regions of the input during training. Most of the existing literature emphasizes randomness in selecting and modifying the input features instead of regions that strongly influence model decisions. We propose Relevance-driven Input Dropout (RelDrop), a novel data augmentation method which selectively occludes the most relevant regions of the input, nudging the model to use other important features in the prediction process, thus improving model generalization through informed regularization. We further conduct qualitative and quantitative analyses to study how Relevance-driven Input Dropout (RelDrop) affects model decision-making. Through a series of experiments on benchmark datasets, we demonstrate that our approach improves robustness towards occlusion, results in models utilizing more features within the region of interest, and boosts inference time generalization performance. Our code is available at https://github.com/Shreyas-Gururaj/LRP_Relevance_Dropout.
Chinese: 针对机器学习模型中的过拟合问题,RelDrop作为一种新颖的数据增强方法,通过选择性遮挡输入中最相关的区域,促使模型利用其他重要特征,从而提升泛化能力和鲁棒性。
English: Overfitting in machine learning models can be mitigated by RelDrop, a novel data augmentation technique that selectively occludes the most relevant input regions to encourage the use of other important features, thereby enhancing generalization and robustness.
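
The core augmentation step described above, occluding the inputs that an attribution map marks as most relevant, can be sketched as follows. This is a simplified, deterministic per-pixel variant under stated assumptions: the relevance map is taken as given (how it is computed is outside this sketch), and `relevance_dropout`, `drop_fraction`, and `fill` are hypothetical names, not the authors' API.

```python
from typing import List

Grid = List[List[float]]

def relevance_dropout(
    image: Grid,
    relevance: Grid,
    drop_fraction: float = 0.25,
    fill: float = 0.0,
) -> Grid:
    """Return a copy of `image` with the top `drop_fraction` most relevant
    pixels replaced by `fill`. The input image is left unmodified."""
    scored = [
        (relevance[r][c], r, c)
        for r in range(len(image))
        for c in range(len(image[0]))
    ]
    k = int(len(scored) * drop_fraction)            # number of pixels to occlude
    most_relevant = sorted(scored, reverse=True)[:k]
    occluded = [row[:] for row in image]
    for _, r, c in most_relevant:
        occluded[r][c] = fill
    return occluded

# Tiny example: the pixel at (0, 1) has the highest relevance and is masked.
image = [[1.0, 2.0], [3.0, 4.0]]
relevance = [[0.1, 0.9], [0.2, 0.3]]
occluded = relevance_dropout(image, relevance, drop_fraction=0.25)
```

Masking the highest-relevance inputs, rather than random ones, is what pushes the model to rely on other informative features during training.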

Authors:Carina Newen, Luca Hinkamp, Maria Ntonti, Emmanuel Müller
Title: Do you see what I see? An Ambiguous Optical Illusion Dataset exposing limitations of Explainable AI
Abstract:
From uncertainty quantification to real-world object detection, we recognize the importance of machine learning algorithms, particularly in safety-critical domains such as autonomous driving or medical diagnostics. Ambiguous data plays an important role in various machine learning domains. Optical illusions present a compelling area of study in this context, as they offer insight into the limitations of both human and machine perception. Despite this relevance, optical illusion datasets remain scarce. In this work, we introduce a novel dataset of optical illusions featuring intermingled animal pairs designed to evoke perceptual ambiguity. We identify generalizable visual concepts, particularly gaze direction and eye cues, as subtle yet impactful features that significantly influence model accuracy. By confronting models with perceptual ambiguity, our findings underscore the importance of concepts in visual learning and provide a foundation for studying bias and alignment between human and machine vision. To make this dataset useful for general purposes, we generate optical illusions systematically with different concepts discussed in our bias mitigation section. The dataset is accessible on Kaggle via https://kaggle.com/datasets/693bf7c6dd2cb45c8a863f9177350c8f9849a9508e9d50526e2ffcc5559a8333. Our source code can be found at https://github.com/KDD-OpenSource/Ambivision.git.
中文: 本研究引入了一个新颖的视觉错觉数据集,旨在探究机器学习中的感知模糊性,揭示了如注视方向等细微视觉线索对模型准确性和偏差缓解的重要影响。
English: This study introduces a novel dataset of optical illusions to explore perceptual ambiguity in machine learning, highlighting how subtle visual cues like gaze direction impact model accuracy and bias mitigation.

Authors:Huacan Wang, Ziyi Ni, Shuo Zhang, Shuo Lu, Sen Hu, Ziyang He, Chen Hu, Jiaye Lin, Yifu Guo, Ronghao Chen, Xin Li, Daxin Jiang, Yuntao Du, Pin Lyu
Title: RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving
Abstract:
The ultimate goal of code agents is to solve complex tasks autonomously. Although large language models (LLMs) have made substantial progress in code generation, real-world tasks typically demand full-fledged code repositories rather than simple scripts. Building such repositories from scratch remains a major challenge. Fortunately, GitHub hosts a vast, evolving collection of open-source repositories, which developers frequently reuse as modular components for complex tasks. Yet, existing frameworks like OpenHands and SWE-Agent still struggle to effectively leverage these valuable resources. Relying solely on README files provides insufficient guidance, and deeper exploration reveals two core obstacles: overwhelming information and tangled dependencies of repositories, both constrained by the limited context windows of current LLMs. To tackle these issues, we propose RepoMaster, an autonomous agent framework designed to explore and reuse GitHub repositories for solving complex tasks. For efficient understanding, RepoMaster constructs function-call graphs, module-dependency graphs, and hierarchical code trees to identify essential components, providing only identified core elements to the LLMs rather than the entire repository. During autonomous execution, it progressively explores related components using our exploration tools and prunes information to optimize context usage. Evaluated on the adjusted MLE-bench, RepoMaster achieves a 110% relative boost in valid submissions over the strongest baseline OpenHands. On our newly released GitTaskBench, RepoMaster lifts the task-pass rate from 40.7% to 62.9% while reducing token usage by 95%. Our code and demonstration materials are publicly available at https://github.com/QuantaAlpha/RepoMaster.
中文摘要:RepoMaster是一种自主代理框架,通过分析代码结构和依赖关系有效利用GitHub仓库资源,在复杂编程任务中实现了任务通过率大幅提升和令牌使用量显著降低。
English Summary: RepoMaster is an autonomous agent framework that overcomes challenges in reusing GitHub repositories by analyzing code structures and dependencies, achieving significant performance improvements and reduced token usage in complex coding tasks.

Authors:Jiawei Tang, Yuheng Jia
Title: Concentration Distribution Learning from Label Distributions
Abstract:
Label distribution learning (LDL) is an effective method to predict the relative label description degree (a.k.a. label distribution) of a sample. However, the label distribution is not a complete representation of an instance because it overlooks the absolute intensity of each label. Specifically, it is impossible to obtain the total description degree of hidden labels that are not in the label space, which leads to information loss and confusion between instances. To solve this problem, we introduce a new concept named background concentration to serve as the absolute description degree term of the label distribution and incorporate it into the LDL process, forming the improved paradigm of concentration distribution learning. Moreover, we propose a novel model based on probabilistic methods and neural networks to learn label distributions and background concentrations from existing LDL datasets. Extensive experiments show that the proposed approach is able to extract background concentrations from label distributions while producing more accurate prediction results than the state-of-the-art LDL methods. The code is available at https://github.com/seutjw/CDL-LD.
中文: 标签分布学习通过引入背景浓度的概念,解决了标签绝对强度缺失的问题,提出了一种结合概率方法和神经网络的新模型,有效提升了预测精度。
English: Label distribution learning is enhanced by introducing the concept of background concentration to address the limitation of missing absolute label intensity, leading to a new model that improves prediction accuracy through probabilistic and neural network methods.

Authors:Yao Lu, Tengfei Ma, Zeyu Wang, Zhuangzhi Chen, Dongwei Xu, Yun Lin, Qi Xuan, Guan Gui
Title: FCOS: A Two-Stage Recoverable Model Pruning Framework for Automatic Modulation Recognition
Abstract:
With the rapid development of wireless communications and the growing complexity of digital modulation schemes, traditional manual modulation recognition methods struggle to extract reliable signal features and meet real-time requirements in modern scenarios. Recently, deep learning based Automatic Modulation Recognition (AMR) approaches have greatly improved classification accuracy. However, their large model sizes and high computational demands hinder deployment on resource-constrained devices. Model pruning provides a general approach to reduce model complexity, but existing weight, channel, and layer pruning techniques each present a trade-off between compression rate, hardware acceleration, and accuracy preservation. To this end, in this paper, we introduce FCOS, a novel Fine-to-COarse two-Stage pruning framework that combines channel-level pruning with layer-level collapse diagnosis to achieve extreme compression, high performance and efficient inference. In the first stage of FCOS, hierarchical clustering and parameter fusion are applied to channel weights to achieve channel-level pruning. Then a Layer Collapse Diagnosis (LaCD) module uses linear probing to identify layer collapse and removes the collapsed layers due to high channel compression ratio. Experiments on multiple AMR benchmarks demonstrate that FCOS outperforms existing channel and layer pruning methods. Specifically, FCOS achieves 95.51% FLOPs reduction and 95.31% parameter reduction while still maintaining performance close to the original ResNet56, with only a 0.46% drop in accuracy on Sig2019-12. Code is available at https://github.com/yaolu-zjut/FCOS.
中文:FCOS框架提出了一种两阶段剪枝方法,通过通道级剪枝与层坍塌诊断相结合,在保持高精度的同时实现极致模型压缩,在自动调制识别基准测试中显著优于现有方法。
English: The FCOS framework introduces a two-stage pruning method combining channel-level pruning with layer collapse diagnosis to achieve extreme model compression while maintaining high accuracy, significantly outperforming existing methods on automatic modulation recognition benchmarks.

Authors:Weixing Wang, Zifeng Ding, Jindong Gu, Rui Cao, Christoph Meinel, Gerard de Melo, Haojin Yang
Title: Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing
Abstract:
Large Vision-Language Models (LVLMs) with discrete image tokenizers unify multimodal representations by encoding visual inputs into a finite set of tokens. Despite their effectiveness, we find that these models still hallucinate non-existent objects. We hypothesize that this may be due to visual priors induced during training: When certain image tokens frequently co-occur in the same spatial regions and represent shared objects, they become strongly associated with the verbalizations of those objects. As a result, the model may hallucinate by evoking visually absent tokens that often co-occur with present ones. To test this assumption, we construct a co-occurrence graph of image tokens using a segmentation dataset and employ a Graph Neural Network (GNN) with contrastive learning followed by a clustering method to group tokens that frequently co-occur in similar visual contexts. We find that hallucinations predominantly correspond to clusters whose tokens dominate the input, and more specifically, that the visually absent tokens in those clusters show much higher correlation with hallucinated objects compared to tokens present in the image. Based on this observation, we propose a hallucination mitigation method that suppresses the influence of visually absent tokens by modifying latent image embeddings during generation. Experiments show our method reduces hallucinations while preserving expressivity. Code is available at https://github.com/weixingW/CGC-VTD/tree/main
中文: 大型视觉语言模型因图像标记共现关联产生幻觉,本文通过抑制生成过程中视觉缺失标记的影响提出缓解方法,在保持表达力的同时有效减少幻觉现象。
English: Large Vision-Language Models often hallucinate objects due to strong associations between co-occurring image tokens, prompting the development of a mitigation method that suppresses visually absent tokens during generation to reduce errors while maintaining expressivity.

Authors:Chika Maduabuchi, Hao Chen, Yujin Han, Jindong Wang
Title: Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation
Abstract:
Latent Video Diffusion Models (LVDMs) achieve high-quality generation but are sensitive to imperfect conditioning, which causes semantic drift and temporal incoherence on noisy, web-scale video-text datasets. We introduce CAT-LVDM, the first corruption-aware training framework for LVDMs that improves robustness through structured, data-aligned noise injection. Our method includes Batch-Centered Noise Injection (BCNI), which perturbs embeddings along intra-batch semantic directions to preserve temporal consistency. BCNI is especially effective on caption-rich datasets like WebVid-2M, MSR-VTT, and MSVD. We also propose Spectrum-Aware Contextual Noise (SACN), which injects noise along dominant spectral directions to improve low-frequency smoothness, showing strong results on UCF-101. On average, BCNI reduces FVD by 31.9% across WebVid-2M, MSR-VTT, and MSVD, while SACN yields a 12.3% improvement on UCF-101. Ablation studies confirm the benefit of low-rank, data-aligned noise. Our theoretical analysis further explains how such perturbations tighten entropy, Wasserstein, score-drift, mixing-time, and generalization bounds. CAT-LVDM establishes a principled, scalable training approach for robust video diffusion under multimodal noise. Code and models: https://github.com/chikap421/catlvdm
中文: CAT-LVDM提出首个针对隐视频扩散模型的抗干扰训练框架,通过结构化的数据对齐噪声注入增强模型鲁棒性,在多个数据集上有效提升时间一致性并减少语义偏移。
English: CAT-LVDM introduces a corruption-aware training framework that enhances the robustness of Latent Video Diffusion Models against imperfect conditioning through structured noise injection, significantly improving temporal consistency and reducing semantic drift across multiple datasets.

Authors:Zitong Wang, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xiangtai Li, Yiren Song
Title: DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers
Abstract:
Diffusion models have recently achieved great success in many generation tasks, such as object removal. Nevertheless, existing image decomposition methods struggle to disentangle semi-transparent or transparent layer occlusions due to mask prior dependencies, static object assumptions, and the lack of datasets. In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. To address challenges in layer ambiguity, generalization, and data scarcity, we first introduce AlphaBlend, the first large-scale and high-quality dataset for transparent and semi-transparent layer decomposition, supporting six real-world subtasks (e.g., translucent flare removal, semi-transparent cell decomposition, glassware decomposition). Building on this dataset, we present DiffDecompose, a diffusion Transformer-based framework that learns the posterior over possible layer decompositions conditioned on the input image, semantic prompts, and blending type. Rather than regressing alpha mattes directly, DiffDecompose performs In-Context Decomposition, enabling the model to predict one or multiple layers without per-layer supervision, and introduces Layer Position Encoding Cloning to maintain pixel-level correspondence across layers. Extensive experiments on the proposed AlphaBlend dataset and public LOGO dataset verify the effectiveness of DiffDecompose. The code and dataset will be made available upon paper acceptance at https://github.com/Wangzt1121/DiffDecompose.
中文: 本文提出DiffDecompose框架,基于扩散Transformer和新型AlphaBlend数据集,通过上下文分解方法解决半透明/透明图层分解中的层次模糊和数据稀缺问题。
English: This paper introduces DiffDecompose, a diffusion Transformer framework that addresses layer decomposition in alpha-composited images using a novel AlphaBlend dataset and in-context learning to overcome challenges like layer ambiguity and data scarcity.

Authors:Thalles Silva, Helio Pedrini, Adín Ramírez Rivera
Title: Self-Organizing Visual Prototypes for Non-Parametric Representation Learning
Abstract:
We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders.
中文: 本文提出自组织视觉原型(SOP)方法,通过多个互补的支持嵌入替代单一原型来更好表征数据簇,在多项基准测试中实现了最先进的性能。
English: This paper introduces Self-Organizing Visual Prototypes (SOP), an unsupervised visual feature learning method that uses multiple complementary support embeddings instead of single prototypes to better characterize data clusters and achieve state-of-the-art performance on various benchmarks.

Authors:Mokai Pan, Kaizhen Zhu, Yuexin Ma, Yanwei Fu, Jingyi Yu, Jingya Wang, Ye Shi
Title: UniDB++: Fast Sampling of Unified Diffusion Bridge
Abstract:
Diffusion Bridges enable transitions between arbitrary distributions, with the Unified Diffusion Bridge (UniDB) framework achieving high-fidelity image generation via a Stochastic Optimal Control (SOC) formulation. However, UniDB's reliance on iterative Euler sampling methods results in slow, computationally expensive inference, while existing acceleration techniques for diffusion or diffusion bridge models fail to address its unique challenges: missing terminal mean constraints and SOC-specific penalty coefficients in its SDEs. We present UniDB++, a training-free sampling algorithm that significantly improves upon these limitations. The method's key advancement comes from deriving exact closed-form solutions for UniDB's reverse-time SDEs, effectively reducing the error accumulation inherent in Euler approximations and enabling high-quality generation with up to 20× fewer sampling steps. This method is further complemented by replacing conventional noise prediction with a more stable data prediction model, along with an SDE-Corrector mechanism that maintains perceptual quality for low-step regimes (5-10 steps). Additionally, we demonstrate that UniDB++ aligns with existing diffusion bridge acceleration methods by evaluating their update rules, and UniDB++ can recover DBIMs as special cases under some theoretical conditions. Experiments demonstrate UniDB++'s state-of-the-art performance in image restoration tasks, outperforming Euler-based methods in fidelity and speed while reducing inference time significantly. This work bridges the gap between theoretical generality and practical efficiency in SOC-driven diffusion bridge models. Our code is available at https://github.com/2769433owo/UniDB-plusplus.
中文摘要:UniDB++ 作为一种免训练采样算法,通过推导 UniDB 随机微分方程的精确闭式解,解决了原有模型推理缓慢的问题,在保持感知质量的同时实现了20倍加速的高质量图像生成,并在图像修复任务中达到领先性能。
English Summary: UniDB++ is a training-free sampling algorithm that overcomes UniDB's slow inference by deriving exact closed-form solutions for its SDEs, enabling 20× faster high-quality image generation while maintaining perceptual quality through data prediction and SDE-corrector mechanisms.

Authors:Amitai Yacobi, Nir Ben-Ari, Ronen Talmon, Uri Shaham
Title: Learning Shared Representations from Unpaired Data
Abstract:
Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support its potential, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations, demonstrating high capabilities in retrieval tasks, generation, arithmetics, zero-shot, and cross-domain classification. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our code is publicly available at https://github.com/shaham-lab/SUE.
中文: 本研究证明,通过单模态表示的光谱嵌入,可以主要从未配对数据中有效学习共享的多模态表示,在各种跨模态任务中表现出色,并建立了一种可能通用的嵌入方法。
English: This research demonstrates that shared multimodal representations can be effectively learned primarily from unpaired data using spectral embeddings from unimodal representations, achieving strong performance across various cross-modal tasks and establishing a potentially universal embedding method.

Authors:Haowei Wang, Junjie Wang, Xiaojun Jia, Rupeng Zhang, Mingyang Li, Zhe Liu, Yang Liu, Qing Wang
Title: AdInject: Real-World Black-Box Attacks on Web Agents via Advertising Delivery
Abstract:
Vision-Language Model (VLM) based Web Agents represent a significant step towards automating complex tasks by simulating human-like interaction with websites. However, their deployment in uncontrolled web environments introduces significant security vulnerabilities. Existing research on adversarial environmental injection attacks often relies on unrealistic assumptions, such as direct HTML manipulation, knowledge of user intent, or access to agent model parameters, limiting their practical applicability. In this paper, we propose AdInject, a novel and real-world black-box attack method that leverages internet advertising delivery to inject malicious content into the Web Agent's environment. AdInject operates under a significantly more realistic threat model than prior work, assuming a black-box agent, static malicious content constraints, and no specific knowledge of user intent. AdInject includes strategies for designing malicious ad content aimed at misleading agents into clicking, and a VLM-based ad content optimization technique that infers potential user intents from the target website's context and integrates these intents into the ad content to make it appear more relevant or critical to the agent's task, thus enhancing attack effectiveness. Experimental evaluations demonstrate the effectiveness of AdInject, with attack success rates exceeding 60% in most scenarios and approaching 100% in certain cases. This strongly demonstrates that prevalent advertising delivery constitutes a potent and real-world vector for environment injection attacks against Web Agents. This work highlights a critical vulnerability in Web Agent security arising from real-world environment manipulation channels, underscoring the urgent need for developing robust defense mechanisms against such threats. Our code is available at https://github.com/NicerWang/AdInject.
中文摘要:AdInject是一种利用互联网广告向网络智能体注入恶意内容的新型黑盒攻击方法,通过优化广告相关性显著提高攻击成功率,无需了解智能体参数或用户意图。
English Summary: AdInject is a practical black-box attack that exploits internet advertising to inject malicious content into Web Agents, achieving high success rates by optimizing ad relevance without requiring agent details or user intent.

Authors:Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr
Title: Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Abstract:
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster - all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
中文总结:本研究提出了首个学术海报生成基准和PosterAgent流程,在视觉文本连贯性和成本效益上超越现有方法,同时揭示了自动化设计中的关键瓶颈。
English Summary: This research introduces the first benchmark and PosterAgent pipeline for academic poster generation, which outperforms existing methods in visual-textual coherence and cost-efficiency while identifying key bottlenecks in automated design.

Authors:Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li
Title: UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
Abstract:
In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. To support the training of UI-Genie-RM, we develop deliberately-designed data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the model, we generate UI-Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory generation without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research at https://github.com/Euphoria16/UI-Genie.
中文: 本文提出UI-Genie自改进框架,通过专用奖励模型和自动化数据生成解决GUI智能体的两大难题,在无需人工标注的情况下实现了最优性能。
English: This paper presents UI-Genie, a self-improving framework that tackles GUI agent challenges through a specialized reward model and automated data generation, achieving state-of-the-art results without manual annotation.

Authors:Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, Yang Liu
Title: Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
Abstract:
Multimodal large language models (MLLMs) remain vulnerable to transferable adversarial examples. While existing methods typically achieve targeted attacks by aligning global features, such as CLIP's [CLS] token, between adversarial and target samples, they often overlook the rich local information encoded in patch tokens. This leads to suboptimal alignment and limited transferability, particularly for closed-source models. To address this limitation, we propose a targeted transferable adversarial attack method based on feature optimal alignment, called FOA-Attack, to improve adversarial transfer capability. Specifically, at the global level, we introduce a global feature loss based on cosine similarity to align the coarse-grained features of adversarial samples with those of target samples. At the local level, given the rich local representations within Transformers, we leverage clustering techniques to extract compact local patterns to alleviate redundant local features. We then formulate local feature alignment between adversarial and target samples as an optimal transport (OT) problem and propose a local clustering optimal transport loss to refine fine-grained feature alignment. Additionally, we propose a dynamic ensemble model weighting strategy to adaptively balance the influence of multiple models during adversarial example generation, thereby further improving transferability. Extensive experiments across various models demonstrate the superiority of the proposed method, outperforming state-of-the-art methods, especially in transferring to closed-source MLLMs. The code is released at https://github.com/jiaxiaojunQAQ/FOA-Attack.
Chinese: 提出的FOA-Attack方法通过结合全局余弦相似度对齐与基于聚类的局部最优传输,增强了多模态大语言模型的可迁移对抗攻击能力,并利用动态集成加权策略显著提升了对闭源模型的攻击效果。
English: The proposed FOA-Attack method enhances transferable adversarial attacks on multimodal large language models by combining global cosine similarity alignment with local clustering-based optimal transport, significantly improving performance against closed-source models through dynamic ensemble weighting.
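
The global-level term described in the abstract is a loss based on cosine similarity between adversarial and target features. The sketch below shows one common way to turn cosine similarity into a loss, `1 - cos`. That exact form, the function names, and the use of plain Python lists as feature vectors are assumptions for illustration, not the paper's definition.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def global_alignment_loss(adv_feature, target_feature):
    """Assumed loss form: 0 when the features point the same way, 2 when opposite."""
    return 1.0 - cosine_similarity(adv_feature, target_feature)


aligned = global_alignment_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = global_alignment_loss([1.0, 0.0], [0.0, 1.0])
```

Minimizing such a term pushes the adversarial sample's coarse-grained feature toward the target's direction regardless of magnitude. The local optimal-transport term and the dynamic ensemble weighting are not covered by this sketch.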

Authors:Xiangxin Zhou, Zichen Liu, Anya Sims, Haonan Wang, Tianyu Pang, Chongxuan Li, Liang Wang, Min Lin, Chao Du
Title: Reinforcing General Reasoning without Verifiers
Abstract:
The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
Chinese: 提出的VeriFree方法通过直接最大化生成参考答案的概率,绕过了强化学习中的答案验证环节,在保持与验证器方法相当甚至更优性能的同时,显著提升了实用性和计算效率。
English: The proposed VeriFree method eliminates the need for answer verification in reinforcement learning by directly maximizing the probability of generating reference answers, demonstrating comparable or superior performance to verifier-based approaches while offering significant practical and computational advantages.
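
The core idea in the abstract, scoring a sampled reasoning trace by how likely the policy is to produce the reference answer after it, can be sketched as follows. The function names and the assumption that per-token log-probabilities of the reference answer are already available from the policy are hypothetical. The abstract does not specify the estimator or the training objective actually used.

```python
import math


def reference_answer_probability(token_logprobs):
    """Probability of the full reference answer given its per-token log-probs.

    Assumption: `token_logprobs` holds log p(token | question, reasoning,
    previous answer tokens) for each reference-answer token.
    """
    return math.exp(sum(token_logprobs))


def mean_reference_likelihood(per_trace_logprobs):
    """Average reference-answer probability over several sampled reasoning traces."""
    if not per_trace_logprobs:
        raise ValueError("need at least one reasoning trace")
    probs = [reference_answer_probability(lp) for lp in per_trace_logprobs]
    return sum(probs) / len(probs)


# Two hypothetical reasoning traces for the same question: after the first,
# the policy assigns 0.5 to each of two answer tokens; after the second, it
# is certain about a one-token answer.
trace_a = [math.log(0.5), math.log(0.5)]
trace_b = [math.log(1.0)]
reward_a = reference_answer_probability(trace_a)
objective = mean_reference_likelihood([trace_a, trace_b])
```

Used as a reward, this quantity needs no rule-based checker and no separate verifier model, which is the practical benefit the abstract emphasizes.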

Authors:Keenan Samway, Max Kleiman-Weiner, David Guzman Piedrahita, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
Title: Are Language Models Consequentialist or Deontological Moral Reasoners?
Abstract:
As AI systems increasingly navigate applications in healthcare, law, and governance, understanding how they handle ethically complex scenarios becomes critical. Previous work has mainly examined the moral judgments in large language models (LLMs), rather than their underlying moral reasoning process. In contrast, we focus on a large-scale analysis of the moral reasoning traces provided by LLMs. Furthermore, unlike prior work that attempted to draw inferences from only a handful of moral dilemmas, our study leverages over 600 distinct trolley problems as probes for revealing the reasoning patterns that emerge within different LLMs. We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology. Our analysis reveals that LLM chains-of-thought tend to favor deontological principles based on moral obligations, while post-hoc explanations shift notably toward consequentialist rationales that emphasize utility. Our framework provides a foundation for understanding how LLMs process and articulate ethical considerations, an important step toward safe and interpretable deployment of LLMs in high-stakes decision-making environments. Our code is available at https://github.com/keenansamway/moral-lens.
中文摘要:本研究通过600多个电车难题对大型语言模型进行大规模道德推理分析,发现其思维链推理倾向于义务论原则,而事后解释则明显转向强调效用的功利主义理据。
English Summary: This study conducts a large-scale analysis of moral reasoning in large language models using over 600 trolley problems, revealing that while their chain-of-thought reasoning favors deontological principles, their post-hoc explanations shift toward consequentialist rationales.

Authors:Zijun Liu, Zhennan Wan, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Title: Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration
Abstract:
With the rapid advancement of post-training techniques for reasoning and information seeking, large language models (LLMs) can incorporate a large quantity of retrieved knowledge to solve complex tasks. However, the limited context window of LLMs obstructs scaling the amount of external knowledge input, prohibiting further improvement, especially for tasks requiring a significant amount of external knowledge. Existing context window extension methods inevitably cause information loss. LLM-based multi-agent methods emerge as a new paradigm to handle massive input in a distributional manner, where we identify two core bottlenecks in existing knowledge synchronization and reasoning processes. In this work, we develop a multi-agent framework, ExtAgents, to overcome the bottlenecks and enable better scalability in inference-time knowledge integration without longer-context training. Benchmarked with our enhanced multi-hop question answering test, ∞Bench+, and other public test sets including long survey generation, ExtAgents significantly enhances the performance over existing non-training methods with the same amount of external knowledge input, regardless of whether it falls within or exceeds the context window. Moreover, the method maintains high efficiency due to high parallelism. Further study in the coordination of LLM agents on increasing external knowledge input could benefit real-world applications.
中文:ExtAgents多智能体框架通过分布式知识集成突破了上下文窗口限制,无需长文本训练即可实现高效扩展,在处理海量外部知识的任务中显著优于现有方法,同时保持高度并行化的运行效率。
English: The ExtAgents multi-agent framework overcomes context window limitations by enabling scalable, parallel knowledge integration without extended training, significantly outperforming existing methods on tasks requiring extensive external data while maintaining high efficiency.

Authors:Bozhou Li, Wentao Zhang
Title: ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
Abstract:
Currently, a prevalent approach for enhancing the performance of Vision-Language Models (VLMs) is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), its long-term decay property hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail token while constraining the overexpansion of positional indices. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench's relation reasoning tasks and notable gains across multiple benchmarks. Our code is available at the following link: https://github.com/zooblastlbz/ID-Align.
Chinese: 针对视觉语言模型中同时编码高分辨率和缩略图图像导致的效率低下及交互限制问题,ID-Align方法通过重新排列位置标识来增强标记间交互,在多个基准测试中实现了显著性能提升。
English: To address the inefficiency and interaction limitations caused by encoding both high-resolution and thumbnail images in Vision-Language Models, the proposed ID-Align method reorders position IDs to enhance token interaction and achieves significant performance improvements across multiple benchmarks.
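
The position-ID remapping described in the abstract, where high-resolution tokens inherit the ID of their corresponding thumbnail token, can be illustrated with a small grid example. The row-major ID layout, the proportional mapping from high-resolution cells to thumbnail cells, and the function names below are assumptions for this sketch. The abstract does not say how correspondence is defined.

```python
def thumbnail_ids(rows, cols, start=0):
    """Assign consecutive row-major position IDs to a thumbnail token grid."""
    return [[start + r * cols + c for c in range(cols)] for r in range(rows)]


def inherit_ids(thumb_ids, hi_rows, hi_cols):
    """Give each high-resolution token the ID of the thumbnail cell covering it.

    Assumption: a high-resolution token at (r, c) corresponds to the thumbnail
    cell at the same relative location in the image.
    """
    t_rows, t_cols = len(thumb_ids), len(thumb_ids[0])
    return [
        [thumb_ids[r * t_rows // hi_rows][c * t_cols // hi_cols]
         for c in range(hi_cols)]
        for r in range(hi_rows)
    ]


thumb = thumbnail_ids(2, 2)            # 2x2 thumbnail, IDs 0..3
hi_res = inherit_ids(thumb, 4, 4)      # 4x4 high-resolution grid
max_id = max(max(row) for row in hi_res)
```

With sequential numbering, the 16 high-resolution tokens would push the largest index to 19. With inherited IDs, the largest index stays at 3. This is the sense in which remapping limits the expansion of positional indices and keeps related tokens close under RoPE's distance decay.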

Authors:Xiao Liu, Da Yin, Zirui Wu, Yansong Feng
Title: RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation
Abstract:
Tools enhance the reasoning capabilities of large language models (LLMs) in complex problem-solving tasks, but not all tasks have available tools. In the absence of predefined tools, prior works have explored instructing LLMs to generate tools on their own. However, such approaches rely heavily on the models' internal knowledge and would fail in domains beyond the LLMs' knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages structured external materials such as textbooks. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 11.3% on average accuracy, while being cost-efficient and broadly generalizable. Analyses reveal that grounding tool creation in references produces accurate and faithful tools, and that the hierarchical structure facilitates effective tool selection. RefTool enables LLMs to overcome knowledge limitations, demonstrating the value of grounding tool creation in external references for enhanced and generalizable reasoning.
中文摘要:RefTool框架通过利用教科书等结构化外部参考资料,指导大语言模型生成可执行工具并分层管理,有效突破了模型的知识局限,在因果推理、物理和化学任务中平均准确率提升11.3%。
English Summary: RefTool is a framework that enables large language models to create executable tools from structured external references such as textbooks, validate them, and organize them hierarchically, overcoming knowledge limitations and improving average accuracy by 11.3% over existing tool-creation and domain-specific reasoning methods.

Authors:Maria Cristina Carrisi, Mirko Marras, Sara Vergallo
Title: A Structured Unplugged Approach for Foundational AI Literacy in Primary Education
Abstract:
Younger generations are growing up in a world increasingly shaped by intelligent technologies, making early AI literacy crucial for developing the skills to critically understand and navigate them. However, education in this field often emphasizes tool-based learning, prioritizing usage over understanding the underlying concepts. This lack of knowledge leaves non-experts, especially children, prone to misconceptions, unrealistic expectations, and difficulties in recognizing biases and stereotypes. In this paper, we propose a structured and replicable teaching approach that fosters foundational AI literacy in primary students, by building upon core mathematical elements closely connected to and of interest in primary curricula, to strengthen conceptualization, data representation, classification reasoning, and evaluation of AI. To assess the effectiveness of our approach, we conducted an empirical study with thirty-one fifth-grade students across two classes, evaluating their progress through a post-test and a satisfaction survey. Our results indicate improvements in terminology understanding and usage, features description, logical reasoning, and evaluative skills, with students showing a deeper comprehension of decision-making processes and their limitations. Moreover, the approach proved engaging, with students particularly enjoying activities that linked AI concepts to real-world reasoning. Materials: https://github.com/tail-unica/ai-literacy-primary-ed.
中文摘要:本文提出了一种结构化教学方法,通过结合核心数学概念来提升小学生的AI素养,实证研究表明该方法有效增强了学生对术语理解、逻辑推理和评估能力,并激发了学习兴趣。
English Summary: This paper introduces a structured teaching approach to enhance AI literacy in primary students by integrating core mathematical concepts, which improved their understanding of terminology, reasoning, and evaluative skills in an engaging, real-world context.

Authors:Xiusi Chen, Shanyong Wang, Cheng Qian, Hongru Wang, Peixuan Han, Heng Ji
Title: DecisionFlow: Advancing Large Language Model as Principled Decision Maker
Abstract:
In high-stakes domains such as healthcare and finance, effective decision-making demands not just accurate outcomes but transparent and explainable reasoning. However, current language models often lack the structured deliberation needed for such tasks, instead generating decisions and justifications in a disconnected, post-hoc manner. To address this, we propose DecisionFlow, a novel decision modeling framework that guides models to reason over structured representations of actions, attributes, and constraints. Rather than predicting answers directly from prompts, DecisionFlow builds a semantically grounded decision space and infers a latent utility function to evaluate trade-offs in a transparent, utility-driven manner. This process produces decisions tightly coupled with interpretable rationales reflecting the model's reasoning. Empirical results on two high-stakes benchmarks show that DecisionFlow not only achieves up to 30% accuracy gains over strong prompting baselines but also enhances alignment in outcomes. Our work is a critical step toward integrating symbolic reasoning with LLMs, enabling more accountable, explainable, and reliable LLM decision support systems. Code and data are at https://github.com/xiusic/DecisionFlow.
Chinese: DecisionFlow提出了一种结构化推理框架,通过构建语义决策空间并评估权衡来指导语言模型进行透明、效用驱动的决策,在关键领域实现了显著准确性提升与结果一致性增强。
English: DecisionFlow introduces a structured reasoning framework that enables language models to make transparent, utility-driven decisions by evaluating trade-offs within a semantically grounded decision space, achieving significant accuracy improvements and enhanced outcome alignment in high-stakes domains.
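The utility-driven selection described in the abstract can be illustrated with a minimal sketch. It assumes a fixed linear utility over attributes and hard constraints, whereas the paper infers a latent utility function; the function name `choose_action`, the attribute names, the weights, and the example options are all hypothetical and are not taken from the DecisionFlow code.

```python
from typing import Callable

# Hypothetical representation: each action maps attribute names to values.
Action = dict[str, float]


def choose_action(
    actions: dict[str, Action],
    weights: dict[str, float],
    constraints: list[Callable[[Action], bool]],
) -> tuple[str, dict[str, float]]:
    """Return the feasible action with the highest weighted utility,
    plus the per-attribute contributions that act as a simple rationale."""
    best_name, best_score, best_parts = None, float("-inf"), {}
    for name, attrs in actions.items():
        # Hard constraints remove infeasible actions before any scoring.
        if not all(check(attrs) for check in constraints):
            continue
        parts = {key: weights[key] * attrs[key] for key in weights}
        score = sum(parts.values())
        if score > best_score:
            best_name, best_score, best_parts = name, score, parts
    if best_name is None:
        raise ValueError("no action satisfies the constraints")
    return best_name, best_parts


# Illustrative data only (assumed, not from the paper).
actions = {
    "option_a": {"benefit": 0.9, "risk": 0.6},
    "option_b": {"benefit": 0.7, "risk": 0.2},
    "option_c": {"benefit": 0.95, "risk": 0.9},
}
weights = {"benefit": 1.0, "risk": -1.0}
constraints = [lambda a: a["risk"] <= 0.8]  # reject options that are too risky

name, rationale = choose_action(actions, weights, constraints)
```

Returning the per-attribute contributions alongside the choice keeps the decision and its justification coupled, which is the property the abstract emphasizes.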

Authors:Jingyuan Huang, Xi Zhu, Minghao Guo, Yongfeng Zhang
Title: DeSocial: Blockchain-based Decentralized Social Networks
Abstract:
Web 2.0 social platforms are inherently centralized, with user data and algorithmic decisions controlled by the platform. As a result, users can only passively receive social predictions without being able to choose the underlying algorithm, which limits personalization. With the emergence of blockchain, users can instead choose algorithms that are tailored to their local situation, improving prediction results in a personalized way. In a blockchain environment, each user possesses its own model to perform the social prediction, capturing different perspectives on social interactions. In our work, we propose DeSocial, a decentralized social network learning framework deployed on an Ethereum (ETH) local development chain that integrates distributed data storage, node-level consensus, and user-driven model selection through Ganache. In the first stage, each user leverages DeSocial to evaluate multiple backbone models on their local subgraph. DeSocial coordinates the execution and returns model-wise prediction results, enabling the user to select the most suitable backbone for personalized social prediction. Then, DeSocial uniformly selects several validation nodes that possess the algorithm specified by each user, and aggregates the prediction results by majority voting, to prevent errors caused by any single model's misjudgment. Extensive experiments show that DeSocial clearly improves over five classical centralized social network learning models, promoting user empowerment in blockchain-based decentralized social networks and showing the importance of multi-node validation and personalized algorithm selection based on blockchain. Our implementation is available at: https://github.com/agiresearch/DeSocial.
中文摘要:DeSocial是以太坊上的去中心化框架,通过本地模型评估和多节点验证让用户选择个性化社交预测算法,其性能优于中心化模型。
English Summary: DeSocial is a decentralized framework on Ethereum that enables users to select personalized algorithms for social predictions through local model evaluation and multi-node validation, outperforming centralized models.
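The majority-voting aggregation mentioned in the abstract can be sketched as follows. This is a generic illustration, not the DeSocial implementation: the function name `majority_vote`, the binary link-prediction labels, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def majority_vote(votes: list[int]) -> int:
    """Return the label predicted by the most validators.

    Ties go to the label that appears first in `votes`; this is an
    arbitrary choice for the sketch, not a rule stated in the abstract.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    return Counter(votes).most_common(1)[0][0]


# Five hypothetical validation nodes predict whether a link exists (1) or not (0).
validator_predictions = [1, 0, 1, 1, 0]
decision = majority_vote(validator_predictions)
```

With an odd number of validators and binary labels, ties cannot occur, so a single mistaken model cannot flip the outcome on its own.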

Authors:Xihong Yang, Siwei Wang, Fangdi Wang, Jiaqi Jin, Suyuan Liu, Yue Liu, En Zhu, Xinwang Liu, Yueming Jin
Title: Automatically Identify and Rectify: Robust Deep Contrastive Multi-view Clustering in Noisy Scenarios
Abstract:
Leveraging powerful representation learning capabilities, deep multi-view clustering methods have demonstrated reliable performance in recent years by effectively integrating multi-source information from diverse views. Most existing methods rely on the assumption of clean views. However, noise is pervasive in real-world scenarios, leading to a significant degradation in performance. To tackle this problem, we propose a novel multi-view clustering framework for the automatic identification and rectification of noisy data, termed AIRMVC. Specifically, we reformulate noise identification as an anomaly identification problem using a Gaussian mixture model (GMM). We then design a hybrid rectification strategy to mitigate the adverse effects of noisy data based on the identification results. Furthermore, we introduce a noise-robust contrastive mechanism to generate reliable representations. Additionally, we provide a theoretical proof demonstrating that these representations can discard noisy information, thereby improving the performance of downstream tasks. Extensive experiments on six benchmark datasets demonstrate that AIRMVC outperforms state-of-the-art algorithms in terms of robustness in noisy scenarios. The code of AIRMVC is available on GitHub at https://github.com/xihongyang1999/AIRMVC.
中文: AIRMVC框架通过高斯混合模型识别异常并采用混合校正策略,有效处理多视图聚类中的噪声问题,在嘈杂环境下展现出优于现有方法的鲁棒性。
English: The AIRMVC framework addresses noise in multi-view clustering by identifying anomalies with GMM and applying a hybrid rectification strategy, enhancing robustness and outperforming existing methods in noisy environments.

Authors:Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, Ying Shan
Title: Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Abstract:
Recent advances in CoT reasoning and RL post-training have been reported to enhance video reasoning capabilities of MLLMs. This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? However, existing video benchmarks primarily evaluate visual perception and grounding abilities, with questions that can be answered based on explicit prompts or isolated visual cues. Such benchmarks do not fully capture the intricacies of real-world reasoning, where humans must actively search for, integrate, and analyze multiple clues before reaching a conclusion. To address this issue, we present Video-Holmes, a benchmark inspired by the reasoning process of Sherlock Holmes, designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films, which span seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information and often miss critical clues. For example, the best-performing model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models scoring below 40%. We hope that Video-Holmes can serve as a "Holmes-test" for multimodal reasoning, motivating models to reason more like humans and emphasizing the ongoing challenges in this field. The benchmark is released at https://github.com/TencentARC/Video-Holmes.
中文: 近期在思维链推理和强化学习后训练方面的进展提升了多模态大模型的视频推理能力,但现有基准无法充分评估复杂的人类式推理,因此我们开发了Video-Holmes基准,发现模型在整合分散线索方面存在困难,最高准确率仅达45%。
English: Recent progress in CoT reasoning and RL post-training has improved MLLMs' video reasoning, but existing benchmarks fall short in evaluating complex human-like reasoning, leading to the creation of Video-Holmes, a benchmark that reveals models struggle with integrating scattered clues, achieving only up to 45% accuracy.

Authors:Qi Yu, Zhichen Zeng, Yuchen Yan, Zhining Liu, Baoyu Jing, Ruizhong Qiu, Ariful Azad, Hanghang Tong
Title: PLANETALIGN: A Comprehensive Python Library for Benchmarking Network Alignment
Abstract:
Network alignment (NA) aims to identify node correspondence across different networks and serves as a critical cornerstone behind various downstream multi-network learning tasks. Despite growing research in NA, the field still lacks a comprehensive library that facilitates the systematic development and benchmarking of NA methods. In this work, we introduce PLANETALIGN, a comprehensive Python library for network alignment that features a rich collection of built-in datasets, methods, and evaluation pipelines with easy-to-use APIs. Specifically, PLANETALIGN integrates 18 datasets and 14 NA methods with extensible APIs for easy use and development of NA methods. Our standardized evaluation pipeline encompasses a wide range of metrics, enabling a systematic assessment of the effectiveness, scalability, and robustness of NA methods. Through extensive comparative studies, we reveal practical insights into the strengths and limitations of existing NA methods. We hope that PLANETALIGN can foster a deeper understanding of the NA problem and facilitate the development and benchmarking of more effective, scalable, and robust methods in the future. The source code of PLANETALIGN is available at https://github.com/yq-leo/PlanetAlign.
中文: 网络对齐旨在识别不同网络间的节点对应关系,而PLANETALIGN库作为一个全面的Python工具包,集成了丰富的数据集、方法和评估流程,以促进该领域方法的开发与性能评估。
English: Network alignment is a fundamental task for identifying node correspondences across networks, and the PLANETALIGN library provides a comprehensive Python toolkit with extensive datasets, methods, and evaluation pipelines to advance its development and benchmarking.

Authors:James Oldfield, Shawn Im, Yixuan Li, Mihalis A. Nicolaou, Ioannis Patras, Grigorios G Chrysos
Title: Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
Abstract:
Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping--significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights--preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language--opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is available at: https://github.com/james-oldfield/MxD/.
中文: 多层感知机在大型语言模型中难以解释,但提出的解码器混合方法通过层级稀疏性克服了精度权衡,在保持表达能力的同时实现了更优的性能和可解释性。
English: Multilayer perceptrons in large language models are challenging to interpret, but the proposed Mixture of Decoders (MxDs) overcomes accuracy trade-offs through layer-level sparsity, achieving superior performance and interpretability while preserving expressive capacity.

Authors:Wenyuan Li, Shunlin Liang, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Husheng Fang, Zhenwei Shi
Title: AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Crop Mapping
Abstract:
Accurate crop mapping fundamentally relies on modeling multi-scale spatiotemporal patterns, where spatial scales range from individual field textures to landscape-level context, and temporal scales capture both short-term phenological transitions and full growing-season dynamics. Transformer-based remote sensing foundation models (RSFMs) offer promising potential for crop mapping due to their innate ability for unified spatiotemporal processing. However, current RSFMs remain suboptimal for crop mapping: they either employ fixed spatiotemporal windows that ignore the multi-scale nature of crop systems or completely disregard temporal information by focusing solely on spatial patterns. To bridge these gaps, we present AgriFM, a multi-source remote sensing foundation model specifically designed for agricultural crop mapping. Our approach begins by establishing the necessity of simultaneous hierarchical spatiotemporal feature extraction, leading to the development of a modified Video Swin Transformer architecture where temporal down-sampling is synchronized with spatial scaling operations. This modified backbone enables efficient unified processing of long time-series satellite inputs. AgriFM leverages temporally rich data streams from three satellite sources including MODIS, Landsat-8/9 and Sentinel-2, and is pre-trained on a global representative dataset comprising over 25 million image samples supervised by land cover products. The resulting framework incorporates a versatile decoder architecture that dynamically fuses these learned spatiotemporal representations, supporting diverse downstream tasks. Comprehensive evaluations demonstrate AgriFM's superior performance over conventional deep learning approaches and state-of-the-art general-purpose RSFMs across all downstream tasks. Codes will be available at https://github.com/flyakon/AgriFM.
中文: 该摘要介绍了AgriFM,这是一个专为农作物制图设计的模型,通过改进的Video Swin Transformer架构整合多源卫星数据的多尺度时空特征,在各项任务中均优于现有先进方法。
English: AgriFM is a specialized foundation model that enhances crop mapping by integrating multi-scale spatiotemporal features from multi-source satellite data using a modified Video Swin Transformer, achieving superior performance over existing methods.

Authors:Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
Title: HoliTom: Holistic Token Merging for Fast Video Large Language Models
Abstract:
Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods performing token pruning before the LLM (outer-LLM pruning) primarily address spatial redundancy within individual frames or limited temporal windows, neglecting the crucial global temporal dynamics and correlations across longer video sequences. This leads to sub-optimal spatio-temporal reduction and does not fully leverage video compressibility. Crucially, the synergistic potential and mutual influence of combining these strategies remain unexplored. To further reduce redundancy, we introduce HoliTom, a novel training-free holistic token merging framework. HoliTom employs outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatial-temporal merging to reduce visual tokens by over 90%, significantly alleviating the LLM's computational burden. Complementing this, we introduce a robust inner-LLM token similarity-based merging approach, designed for superior performance and compatibility with outer-LLM pruning. Evaluations demonstrate our method's promising efficiency-performance trade-off on LLaVA-OneVision-7B, reducing computational costs to 6.9% of FLOPs while maintaining 99.1% of the original performance. Furthermore, we achieve a 2.28x reduction in Time-To-First-Token (TTFT) and a 1.32x acceleration in decoding throughput, highlighting the practical benefits of our integrated pruning approach for efficient video LLM inference.
中文: HoliTom是一种无需训练的整体令牌合并框架,通过外层LLM剪枝与内层令牌相似度合并相结合,将计算成本降低90%以上同时保持99.1%的原始性能,显著加速了视频大语言模型的推理效率。
English: HoliTom is a training-free holistic token merging framework that combines outer-LLM pruning with inner-LLM token similarity-based merging to reduce computational costs by over 90% while maintaining 99.1% of original performance, significantly accelerating video LLM inference.
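Similarity-based token merging, the general idea behind the approach in the abstract, can be sketched in a few lines. This is a generic illustration and not the HoliTom algorithm: it assumes tokens are plain feature vectors, merges only consecutive tokens, and uses a hypothetical cosine-similarity threshold of 0.9.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_similar_tokens(
    tokens: list[list[float]], threshold: float = 0.9
) -> list[list[float]]:
    """Greedily fold each token into the current group when it is similar
    to that group's mean; otherwise start a new group. Each group is then
    replaced by its mean vector."""
    groups: list[tuple[list[float], int]] = []  # (running sum, token count)
    for tok in tokens:
        if groups:
            total, count = groups[-1]
            mean = [x / count for x in total]
            if cosine(mean, tok) >= threshold:
                groups[-1] = ([x + y for x, y in zip(total, tok)], count + 1)
                continue
        groups.append((list(tok), 1))
    return [[x / count for x in total] for total, count in groups]


# Five toy tokens: two redundant runs followed by a change.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
merged = merge_similar_tokens(tokens)
```

Here the five input tokens collapse to three, because each run of near-identical consecutive tokens is replaced by a single averaged token.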

Authors:Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, Mao Yang
Title: rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset
Abstract:
Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems and 580K long-reasoning solutions, along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples the generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of the rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at https://github.com/microsoft/rStar.
中文: rStar-Coder数据集通过提供41.8万个经过验证的竞赛级编程问题、测试用例和详细推理解决方案,显著提升了大型语言模型的代码推理能力,在多个基准测试中实现了突破性性能提升。
English: The rStar-Coder dataset enhances code reasoning in large language models by providing 418K verified competition-level problems with test cases and long-reasoning solutions, significantly boosting performance on benchmarks like LiveCodeBench and USA Computing Olympiad even with smaller models.

Authors:Shaoqing Zhang, Kehai Chen, Zhuosheng Zhang, Rumei Li, Rongxiang Weng, Yang Xiang, Liqiang Nie, Min Zhang
Title: XBOUND: Exploring the Capability Boundaries of Device-Control Agents through Trajectory Tree Exploration
Abstract:
Recent advancements in vision-language models (VLMs) have spurred increased interest in Device-Control Agents (DC agents), such as utilizing in-the-wild device control to manage graphical user interfaces. Conventional methods for assessing the capabilities of DC agents, such as computing step-wise action accuracy and overall task success rates, provide a macroscopic view of DC agents' performance; however, they fail to offer microscopic insights into potential errors that may occur in real-world applications. Conducting a finer-grained performance evaluation of DC agents presents significant challenges. This study introduces a new perspective on evaluation methods for DC agents by proposing the XBOUND evaluation method, which employs the calculation of a novel Explore Metric to delineate the capability boundaries of DC agents. Compared to previous evaluation methods, XBOUND focuses on individual states to assess the proficiency of DC agents in mastering these states. Furthermore, we have developed a "pseudo" episode tree dataset derived from Android Control test data. Utilizing this dataset and XBOUND, we comprehensively evaluate the OS-Atlas and UI-TARS series, examining both the overall and specific performance across five common tasks. Additionally, we select representative cases to highlight the current deficiencies and limitations inherent in both series. Code is available at https://github.com/sqzhang-lazy/XBOUND.
Chinese: 随着视觉语言模型的进步,设备控制代理在图形用户界面管理中的应用日益增多,为此我们提出了XBOUND这一状态级评估方法,揭示了代理性能的关键发现,如UI-TARS是最强的7B模型,而次7B模型在状态掌握上仍受限。
English: Recent advancements in vision-language models have spurred interest in Device-Control Agents for GUI management, leading to the development of XBOUND, a state-level evaluation method that reveals key insights about agent performance, including UI-TARS as the top 7B model and the limitations of sub-7B models.

Authors:Yao Huang, Yitong Sun, Shouwei Ruan, Yichi Zhang, Yinpeng Dong, Xingxing Wei
Title: Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space
Abstract:
Large Language Models (LLMs), despite advanced general capabilities, still suffer from numerous safety risks, especially jailbreak attacks that bypass safety protocols. Understanding these vulnerabilities through black-box jailbreak attacks, which better reflect real-world scenarios, offers critical insights into model robustness. While existing methods have shown improvements through various prompt engineering techniques, their success remains limited against safety-aligned models, overlooking a more fundamental problem: the effectiveness is inherently bounded by the predefined strategy spaces. However, expanding this space presents significant challenges in both systematically capturing essential attack patterns and efficiently navigating the increased complexity. To better explore the potential of expanding the strategy space, we address these challenges through a novel framework that decomposes jailbreak strategies into essential components based on the Elaboration Likelihood Model (ELM) theory and develops genetic-based optimization with intention evaluation mechanisms. Strikingly, our experiments reveal unprecedented jailbreak capabilities by expanding the strategy space: we achieve over 90% success rate on Claude-3.5 where prior methods completely fail, while demonstrating strong cross-model transferability and surpassing specialized safeguard models in evaluation accuracy. The code is open-sourced at: https://github.com/Aries-iai/CL-GSO.
中文: 该研究提出基于精细加工可能性模型和遗传优化的新框架,通过扩展越狱策略空间实现了对Claude-3.5等先进模型超过90%的攻击成功率,同时展现出强大的跨模型迁移能力并在评估准确率上超越专业防护模型。
English: This study introduces a novel framework that expands jailbreak strategy spaces using ELM theory and genetic optimization, achieving over 90% success against advanced models like Claude-3.5 while demonstrating strong transferability and surpassing safeguard models in accuracy.

Authors:Eve Le Guillou, Pierre Fortin, Julien Tierny
Title: Distributed Discrete Morse Sandwich: Efficient Computation of Persistence Diagrams for Massive Scalar Data
Abstract:
The persistence diagram, which describes the topological features of a dataset, is a key descriptor in Topological Data Analysis. The "Discrete Morse Sandwich" (DMS) method has been reported to be the most efficient algorithm for computing persistence diagrams of 3D scalar fields on a single node, using shared-memory parallelism. In this work, we extend DMS to distributed-memory parallelism for the efficient and scalable computation of persistence diagrams for massive datasets across multiple compute nodes. On the one hand, we can leverage the embarrassingly parallel procedure of the first and most time-consuming step of DMS (namely the discrete gradient computation). On the other hand, the efficient distributed computations of the subsequent DMS steps are much more challenging. To address this, we have extensively revised the DMS routines by contributing a new self-correcting distributed pairing algorithm, redesigning key data structures and introducing computation tokens to coordinate distributed computations. We have also introduced a dedicated communication thread to overlap communication and computation. Detailed performance analyses show the scalability of our hybrid MPI+thread approach for strong and weak scaling using up to 16 nodes of 32 cores (512 cores total). Our algorithm outperforms DIPHA, a reference method for the distributed computation of persistence diagrams, with an average speedup of x8 on 512 cores. We show the practical capabilities of our approach by computing the persistence diagram of a public 3D scalar field of 6 billion vertices in 174 seconds on 512 cores. Finally, we provide a usage example of our open-source implementation at https://github.com/eve-le-guillou/DDMS-example.
中文: 本研究将离散莫尔斯三明治方法扩展到分布式内存并行计算,显著提升了大规模三维数据集持久图的计算效率和可扩展性,性能远超现有方法。
English: The study extends the Discrete Morse Sandwich method to distributed-memory parallelism, enhancing scalability for computing persistence diagrams of massive 3D datasets and achieving significant speed improvements over existing methods.

Authors:M. Akin Yilmaz, Ahmet Bilican, A. Murat Tekalp
Title: DiMoSR: Feature Modulation via Multi-Branch Dilated Convolutions for Efficient Image Super-Resolution
Abstract:
Balancing reconstruction quality versus model efficiency remains a critical challenge in lightweight single image super-resolution (SISR). Despite the prevalence of attention mechanisms in recent state-of-the-art SISR approaches that primarily emphasize or suppress feature maps, alternative architectural paradigms warrant further exploration. This paper introduces DiMoSR (Dilated Modulation Super-Resolution), a novel architecture that enhances feature representation through modulation to complement attention in lightweight SISR networks. The proposed approach leverages multi-branch dilated convolutions to capture rich contextual information over a wider receptive field while maintaining computational efficiency. Experimental results demonstrate that DiMoSR outperforms state-of-the-art lightweight methods across diverse benchmark datasets, achieving superior PSNR and SSIM metrics with comparable or reduced computational complexity. Through comprehensive ablation studies, this work not only validates the effectiveness of DiMoSR but also provides critical insights into the interplay between attention mechanisms and feature modulation to guide future research in efficient network design. The code and model weights to reproduce our results are available at: https://github.com/makinyilmaz/DiMoSR
中文: DiMoSR提出了一种新颖的轻量级超分辨率架构,利用空洞卷积和特征调制技术,在保持高效计算的同时实现了领先的性能表现。
English: DiMoSR introduces a novel lightweight super-resolution architecture using dilated convolutions and feature modulation to achieve state-of-the-art performance with improved efficiency.
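The claim that dilated convolutions widen the receptive field at no extra parameter cost follows from the standard relation k_eff = d * (k - 1) + 1 for kernel size k and dilation d. A minimal sketch of that arithmetic is below; the 3x3 kernel and the dilation rates 1, 2, and 3 are assumptions for illustration, since the abstract does not state the branch configuration used in DiMoSR.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1


KERNEL_SIZE = 3          # assumed 3x3 kernels
DILATIONS = [1, 2, 3]    # assumed dilation rate per branch

# Extent seen by each parallel branch, keyed by dilation rate.
spans = {d: effective_kernel_size(KERNEL_SIZE, d) for d in DILATIONS}

# Every branch still has the same number of weights per channel pair,
# because dilation spaces the taps apart without adding any.
weights_per_branch = KERNEL_SIZE * KERNEL_SIZE
```

Under these assumptions the three branches cover 3x3, 5x5, and 7x7 neighborhoods while each uses only nine weights per input-output channel pair.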

Authors:Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
Title: Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings
Abstract:
Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail. They typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs exhibit a significant performance drop for data points with a high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue where the LLM vocabulary gets updated with certain expert domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces an over-fragmentation issue with medical words. To that end, we show vocabulary adaptation helps improve the LLM summarization performance even in difficult settings. Through extensive experimentation with multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies for customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts where they found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at https://github.com/gb-kgp/LLM-MedicalSummarization-Benchmark.
中文: 大型语言模型在涉及生僻词或高新颖性的困难场景中处理医学文本摘要时表现不佳,但通过词汇适应策略融入领域专业术语可显著提升其摘要质量。
English: Large Language Models (LLMs) struggle with medical text summarization in challenging scenarios involving out-of-vocabulary words or high novelty, but vocabulary adaptation significantly improves their performance by incorporating domain-specific terms.

Authors:Yu He, Zihan Yao, Chentao Song, Tianyu Qi, Jun Liu, Ming Li, Qing Huang
Title: LMCD: Language Models are Zeroshot Cognitive Diagnosis Learners
Abstract:
Cognitive Diagnosis (CD) has become a critical task in AI-empowered education, supporting personalized learning by accurately assessing students' cognitive states. However, traditional CD models often struggle in cold-start scenarios due to the lack of student-exercise interaction data. Recent NLP-based approaches leveraging pre-trained language models (PLMs) have shown promise by utilizing textual features but fail to fully bridge the gap between semantic understanding and cognitive profiling. In this work, we propose Language Models as Zeroshot Cognitive Diagnosis Learners (LMCD), a novel framework designed to handle cold-start challenges by harnessing large language models (LLMs). LMCD operates via two primary phases: (1) Knowledge Diffusion, where LLMs generate enriched contents of exercises and knowledge concepts (KCs), establishing stronger semantic links; and (2) Semantic-Cognitive Fusion, where LLMs employ causal attention mechanisms to integrate textual information and student cognitive states, creating comprehensive profiles for both students and exercises. These representations are efficiently trained with off-the-shelf CD models. Experiments on two real-world datasets demonstrate that LMCD significantly outperforms state-of-the-art methods in both exercise-cold and domain-cold settings. The code is publicly available at https://github.com/TAL-auroraX/LMCD
中文:提出的LMCD框架通过大语言模型增强习题语义并融合认知状态,解决了认知诊断中的冷启动问题,在冷启动场景下表现卓越。
English: The proposed LMCD framework overcomes cold-start challenges in cognitive diagnosis by using large language models to enrich exercise semantics and fuse them with cognitive states, achieving superior performance in cold-start scenarios.

Authors:Zijing Wang, Xingle Xu, Yongkang Liu, Yiqun Zhang, Peiqin Lin, Shi Feng, Xiaocui Yang, Daling Wang, Hinrich Schütze
Title: Why Do More Experts Fail? A Theoretical Analysis of Model Merging
Abstract:
Model merging dramatically reduces storage and computational resources by combining multiple expert models into a single multi-task model. Although recent model merging methods have shown promising results, they struggle to maintain performance gains as the number of merged models increases. In this paper, we investigate the key obstacles that limit the scalability of model merging when integrating a large number of expert models. First, we prove that there is an upper bound on model merging. Further theoretical analysis reveals that the limited effective parameter space imposes a strict constraint on the number of models that can be successfully merged. Gaussian Width shows that the marginal benefit of merging additional models diminishes according to a strictly concave function. This implies that the effective parameter space becomes rapidly saturated as the number of merged models increases. Furthermore, using Approximate Kinematics Theory, we prove the existence of a unique optimal threshold beyond which adding more models does not yield significant performance improvements. At the same time, we introduce a straightforward Reparameterized Heavy-Tailed method (RHT) to extend the coverage of the merged model, thereby enhancing its performance. Empirical results on 12 benchmarks, including both knowledge-intensive and general-purpose tasks, validate our theoretical analysis. We believe that these results spark further research beyond the current scope of model merging. The source code is in the Github repository: https://github.com/wzj1718/ModelMergingAnalysis.
中文: 模型合并通过整合专家模型减少资源消耗,但受限于有效参数空间而难以扩展,我们引入新方法扩展覆盖范围并提升性能。
English: Model merging reduces resource usage by combining expert models but faces scalability issues due to limited effective parameter space, which we address with a new method that extends coverage and improves performance.

Authors:M. Mebratu, W. L. K. Wu
Title: Wavelet Flow For Extragalactic Foreground Simulations
Abstract:
Extragalactic foregrounds in cosmic microwave background (CMB) observations are both a source of cosmological and astrophysical information and a nuisance to the CMB. Effective field-level modeling that captures their non-Gaussian statistical distributions is increasingly important for optimal information extraction, particularly given the precise and low-noise observations from current and upcoming experiments. We explore the use of Wavelet Flow (WF) models to tackle the novel task of modeling the field-level probability distributions of multi-component CMB secondaries and foreground. Specifically, we jointly train correlated CMB lensing convergence ($κ$) and cosmic infrared background (CIB) maps with a WF model and obtain a network that statistically recovers the input to high accuracy -- the trained network generates samples of $κ$ and CIB fields whose average power spectra are within a few percent of the inputs across all scales, and whose Minkowski functionals are similarly accurate compared to the inputs. Leveraging the multiscale architecture of these models, we fine-tune both the model parameters and the priors at each scale independently, optimizing performance across different resolutions. These results demonstrate that WF models can accurately simulate correlated components of CMB secondaries, supporting improved analysis of cosmological data. Our code and trained models can be found here (https://github.com/matiwosm/HybridPriorWavletFlow.git).
中文: 小波流模型能有效模拟宇宙微波背景次级相关成分,如透镜收敛和宇宙红外背景,实现精确的场级概率分布建模,从而提升宇宙学数据分析的准确性。
English: Wavelet Flow models effectively simulate correlated cosmic microwave background secondaries like lensing convergence and cosmic infrared background, enabling accurate field-level probability distribution modeling for enhanced cosmological data analysis.

Authors:Jong Hak Moon, Geon Choi, Paloma Rabaey, Min Gwan Kim, Hyuk Gi Hong, Jung-Oh Lee, Hangyul Yoon, Eun Woo Doe, Jiyoun Kim, Harshita Sharma, Daniel C. Castro, Javier Alvarez-Valle, Edward Choi
Title: Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation
Abstract:
Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 80 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage framework that transforms generated reports into fine-grained, schema-aligned structured representations, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: https://github.com/SuperSupermoon/Lunguage
中文: 我们推出了LUNGUAGE,这是一个用于结构化放射学报告生成的基准数据集和框架,支持细粒度和纵向评估,同时提出了可解释的LUNGUAGESCORE指标,用于评估临床语义和时间一致性。
English: We introduce LUNGUAGE, a benchmark dataset and framework for structured radiology report generation that enables fine-grained and longitudinal evaluation, along with LUNGUAGESCORE, an interpretable metric for assessing clinical semantics and temporal consistency.

Authors:Hesam Araghi, Jan van Gemert, Nergis Tomen
Title: Making Every Event Count: Balancing Data Efficiency and Accuracy in Event Camera Subsampling
Abstract:
Event cameras offer high temporal resolution and power efficiency, making them well-suited for edge AI applications. However, their high event rates present challenges for data transmission and processing. Subsampling methods provide a practical solution, but their effect on downstream visual tasks remains underexplored. In this work, we systematically evaluate six hardware-friendly subsampling methods using convolutional neural networks for event video classification on various benchmark datasets. We hypothesize that events from high-density regions carry more task-relevant information and are therefore better suited for subsampling. To test this, we introduce a simple causal density-based subsampling method, demonstrating improved classification accuracy in sparse regimes. Our analysis further highlights key factors affecting subsampling performance, including sensitivity to hyperparameters and failure cases in scenarios with large event count variance. These findings provide insights for utilization of hardware-efficient subsampling strategies that balance data efficiency and task accuracy. The code for this paper will be released at: https://github.com/hesamaraghi/event-camera-subsampling-methods.
Chinese: 本研究系统评估了事件相机的六种硬件友好型子采样方法,提出了一种基于事件密度的子采样策略,在稀疏场景下提升分类精度,同时分析了影响性能的关键因素,以平衡数据效率与任务准确性。
English: This study systematically evaluates six hardware-friendly subsampling methods for event cameras, introducing a density-based approach that improves classification accuracy in sparse regimes while analyzing key factors affecting performance to balance data efficiency and task accuracy.
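
The idea of causal density-based subsampling can be sketched as follows: an event is kept only if enough strictly earlier events fall within a small spatial neighborhood and a recent time window. This is a hedged illustration, not the authors' implementation; the function name, the `(t, x, y)` event layout, and the `radius`, `window`, and `min_neighbors` parameters are assumptions.

```python
def density_subsample(events, radius=1, window=10, min_neighbors=2):
    """Keep an event only if enough earlier events fall nearby in space and time.

    events: list of (t, x, y) tuples sorted by timestamp t.
    Causal: only strictly earlier events in the list are inspected.
    """
    kept = []
    for i, (t, x, y) in enumerate(events):
        neighbors = 0
        # Walk backwards through past events until they leave the time window.
        for t0, x0, y0 in reversed(events[:i]):
            if t - t0 > window:
                break
            if abs(x - x0) <= radius and abs(y - y0) <= radius:
                neighbors += 1
        if neighbors >= min_neighbors:
            kept.append((t, x, y))
    return kept


# Four events cluster around pixel (5, 5); one isolated event sits at (40, 40).
events = [(0, 5, 5), (1, 5, 6), (2, 6, 5), (3, 40, 40), (4, 5, 5)]
print(density_subsample(events))  # [(2, 6, 5), (4, 5, 5)]
```

The isolated event and the first events of the cluster are dropped because they have too few past neighbors, so the retained events come from the high-density region.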

Authors:Xurui Li, Zhonesheng Jiang, Tingxuan Ai, Yu Zhou
Title: RoBiS: Robust Binary Segmentation for High-Resolution Industrial Images
Abstract:
Robust unsupervised anomaly detection (AD) in real-world scenarios is an important task. Current methods exhibit severe performance degradation on the MVTec AD 2 benchmark due to its complex real-world challenges. To solve this problem, we propose a robust framework, RoBiS, which consists of three core modules: (1) Swin-Cropping, a high-resolution image pre-processing strategy that preserves the information of small anomalies through overlapping window cropping. (2) Data augmentation with noise addition and lighting simulation, applied to the training data to improve the robustness of the AD model. We use INP-Former as our baseline, which can generate better results on the various sub-images. (3) Joint adaptive binarization, which combines the traditional statistics-based binarization strategy (mean+3std) with our previous work, MEBin (published in CVPR 2025). SAM is then employed to refine the segmentation results. Compared with methods reported on MVTec AD 2, RoBiS achieves a 29.2% SegF1 improvement (from 21.8% to 51.00%) on Test_private and a 29.82% SegF1 gain (from 16.7% to 46.52%) on Test_private_mixed. Code is available at https://github.com/xrli-U/RoBiS.
Chinese: 提出的RoBiS框架通过集成Swin-Cropping保留小异常信息、数据增强提升鲁棒性以及自适应二值化结合SAM优化,在MVTec AD 2基准测试中实现了SegF1指标的显著提升。
English: The proposed RoBiS framework enhances unsupervised anomaly detection by integrating Swin-Cropping for small anomaly preservation, data augmentation for robustness, and adaptive binarization with SAM refinement, achieving significant SegF1 improvements on the MVTec AD 2 benchmark.
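
Two of the ingredients named in the abstract above, overlapping window cropping and mean+3std binarization, can be illustrated with a short sketch. The helper names, the stride handling at the image border, and the use of the population standard deviation are assumptions made for this example; the actual Swin-Cropping and MEBin procedures are not reproduced here.

```python
import statistics


def window_starts(length, window, stride):
    """Start offsets of overlapping windows covering [0, length)."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    # Add a final window flush with the border so no pixels are skipped.
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def binarize(scores, k=3.0):
    """Threshold anomaly scores at mean + k * std (population std)."""
    threshold = statistics.fmean(scores) + k * statistics.pstdev(scores)
    return [1 if s > threshold else 0 for s in scores]


# Windows of width 4 with stride 3 over 11 pixels; the last one is shifted to fit.
print(window_starts(11, 4, 3))  # [0, 3, 6, 7]

# Nineteen normal scores and one outlier: only the outlier exceeds mean + 3 * std.
scores = [0.0] * 19 + [1.0]
print(binarize(scores))
```

Applying `window_starts` along both image axes would give the top-left corners of the overlapping crops.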

Authors:Sergey Karpukhin, Vadim Titov, Andrey Kuznetsov, Aibek Alanov
Title: FastFace: Tuning Identity Preservation in Distilled Diffusion via Guidance and Attention
Abstract:
In recent years, a plethora of identity-preserving adapters for personalized generation with diffusion models have been released. Their main disadvantage is that they are predominantly trained jointly with base diffusion models, which suffer from slow multi-step inference. This work tackles the challenge of training-free adaptation of pretrained ID-adapters to diffusion models accelerated via distillation. Through careful redesign of classifier-free guidance for few-step stylistic generation and attention manipulation mechanisms in decoupled blocks to improve identity similarity and fidelity, we propose the universal FastFace framework. Additionally, we develop a disentangled public evaluation protocol for ID-preserving adapters.
中文摘要:本文提出FastFace框架,通过重新设计无分类器引导机制和解耦注意力操作,实现了预训练身份适配器在加速扩散模型中的免训练适配,并开发了新的分离式公共评估方案。
English Summary: The FastFace framework is proposed to enable training-free adaptation of identity-preserving adapters to accelerated diffusion models through redesigned classifier-free guidance and attention mechanisms, while also introducing a new evaluation protocol.

Authors:Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, Jianfei Chen
Title: SageAttention2++: A More Efficient Implementation of SageAttention2
Abstract:
The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster instruction of FP8 Matmul accumulated in FP16. The instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention.
中文: SageAttention2++ 通过采用更快的FP8矩阵乘法指令,在保持多种模型精度的同时,将注意力机制速度提升至FlashAttention的3.9倍。
English: SageAttention2++ significantly accelerates attention mechanisms by using faster FP8 matrix multiplication instructions, achieving a 3.9x speedup over FlashAttention while maintaining accuracy across multiple model types.

Authors:Badr Moufad, Yazid Janati, Alain Durmus, Ahmed Ghorbel, Eric Moulines, Jimmy Olsson
Title: Conditional Diffusion Models with Classifier-Free Gibbs-like Guidance
Abstract:
Classifier-Free Guidance (CFG) is a widely used technique for improving conditional diffusion models by linearly combining the outputs of conditional and unconditional denoisers. While CFG enhances visual quality and improves alignment with prompts, it often reduces sample diversity, leading to a challenging trade-off between quality and diversity. To address this issue, we make two key contributions. First, CFG generally does not correspond to a well-defined denoising diffusion model (DDM). In particular, contrary to common intuition, CFG does not yield samples from the target distribution associated with the limiting CFG score as the noise level approaches zero -- where the data distribution is tilted by a power $w \gt 1$ of the conditional distribution. We identify the missing component: a Rényi divergence term that acts as a repulsive force and is required to correct CFG and render it consistent with a proper DDM. Our analysis shows that this correction term vanishes in the low-noise limit. Second, motivated by this insight, we propose a Gibbs-like sampling procedure to draw samples from the desired tilted distribution. This method starts with an initial sample from the conditional diffusion model without CFG and iteratively refines it, preserving diversity while progressively enhancing sample quality. We evaluate our approach on both image and text-to-audio generation tasks, demonstrating substantial improvements over CFG across all considered metrics. The code is available at https://github.com/yazidjanati/cfgig
中文: 本文指出无分类器引导(CFG)并不对应一个标准的扩散模型,并提出了一种吉布斯式采样方法进行修正,在条件生成任务中同时提升了样本质量和多样性。
English: This paper identifies that Classifier-Free Guidance (CFG) does not correspond to a proper diffusion model and proposes a Gibbs-like sampling method to correct it, enhancing both quality and diversity in conditional generation tasks.
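
For readers unfamiliar with the linear combination that the abstract above refers to, standard classifier-free guidance can be sketched as below. This shows only the usual CFG rule, not the paper's Gibbs-like sampler; the function name and the plain-list representation of denoiser outputs are assumptions for illustration.

```python
def cfg_combine(cond, uncond, w):
    """Classifier-free guidance: uncond + w * (cond - uncond), elementwise.

    cond, uncond: denoiser outputs as equal-length lists of floats.
    w: guidance scale; w = 1 recovers the conditional output, w > 1 extrapolates.
    """
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]
uncond = [0.0, 1.0]
print(cfg_combine(cond, uncond, 1.0))  # [1.0, 2.0]
print(cfg_combine(cond, uncond, 2.0))  # [2.0, 3.0]
```

With a scale above 1 the output moves past the conditional prediction, away from the unconditional one, which is the source of the quality versus diversity trade-off described in the abstract.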

Authors:Tianhao Peng, Ho Man Kwan, Yuxuan Jiang, Ge Gao, Fan Zhang, Xiaozhong Xu, Shan Liu, David Bull
Title: Instance Data Condensation for Image Super-Resolution
Abstract:
Deep learning based image Super-Resolution (ISR) relies on large training datasets to optimize model generalization; this requires substantial computational and storage resources during training. While dataset condensation has shown potential in improving data efficiency and privacy for high-level computer vision tasks, it has not yet been fully exploited for ISR. In this paper, we propose a novel Instance Data Condensation (IDC) framework specifically for ISR, which achieves instance-level data condensation through Random Local Fourier Feature Extraction and Multi-level Feature Distribution Matching. This aims to optimize feature distributions at both global and local levels and obtain high-quality synthesized training content with fine detail. This framework has been utilized to condense the most commonly used training dataset for ISR, DIV2K, with a 10% condensation rate. The resulting synthetic dataset offers comparable or (in certain cases) even better performance compared to the original full dataset and excellent training stability when used to train various popular ISR models. To the best of our knowledge, this is the first time that a condensed/synthetic dataset (with a 10% data volume) has demonstrated such performance. The source code and the synthetic dataset have been made available at https://github.com/.
中文: 本文提出了一种新颖的实例数据压缩框架,通过随机局部傅里叶特征提取和多级特征分布匹配,将DIV2K数据集压缩至10%体积,在保持与原数据集相当甚至更优性能的同时显著提升了训练效率和稳定性。
English: This paper introduces a novel Instance Data Condensation framework for image super-resolution that synthesizes a 10% condensed dataset from DIV2K, achieving comparable or superior performance to the full dataset while enhancing training efficiency and stability.

Authors:Zeqing Wang, Bowen Zheng, Xingyi Yang, Zhenxiong Tan, Yuecong Xu, Xinchao Wang
Title: Minute-Long Videos with Dual Parallelisms
Abstract:
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54$\times$ lower latency and 1.48$\times$ lower memory cost on 8$\times$RTX 4090 GPUs.
中文: 提出的DualParal策略通过分块去噪、特征缓存和协调噪声初始化,将视频帧和模型层处理分布到多个GPU上,实现了长视频的高效生成,并显著降低了延迟和内存开销。
English: The proposed DualParal strategy distributes video frame and model layer processing across GPUs using block-wise denoising with feature caching and coordinated noise initialization, enabling efficient generation of long videos with significantly reduced latency and memory costs.

Authors:Fatemeh Pesaran Zadeh, Yoojin Oh, Gunhee Kim
Title: LPOI: Listwise Preference Optimization for Vision Language Models
Abstract:
Aligning large VLMs with human preferences is a challenging task, as methods like RLHF and DPO often overfit to textual information or exacerbate hallucinations. Although augmenting negative image samples partially addresses these pitfalls, no prior work has employed listwise preference optimization for VLMs, due to the complexity and cost of constructing listwise image samples. In this work, we propose LPOI, the first object-aware listwise preference optimization developed for reducing hallucinations in VLMs. LPOI identifies and masks a critical object in the image, and then interpolates the masked region between the positive and negative images to form a sequence of incrementally more complete images. The model is trained to rank these images in ascending order of object visibility, effectively reducing hallucinations while retaining visual fidelity. LPOI requires no extra annotations beyond standard pairwise preference data, as it automatically constructs the ranked lists through object masking and interpolation. Comprehensive experiments on MMHalBench, AMBER, and Object HalBench confirm that LPOI outperforms existing preference optimization methods in reducing hallucinations and enhancing VLM performance. We make the code available at https://github.com/fatemehpesaran310/lpoi.
中文: 本文提出LPOI方法,通过对象感知的列表偏好优化技术,对逐步显示物体的图像序列进行排序来减少视觉语言模型的幻觉现象,在无需额外标注的情况下超越了现有方法。
English: This paper introduces LPOI, an object-aware listwise preference optimization method that reduces hallucinations in vision-language models by ranking images with incrementally revealed objects, outperforming existing approaches without requiring additional annotations.
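
The list construction described in the abstract above, interpolating a masked object region between a negative and a positive image, might look roughly like this. It is a sketch under stated assumptions: images are 2D lists of floats, the object region is an axis-aligned box, pixels outside the box are copied from the positive image, and the blend is linear. None of these details are confirmed by the source.

```python
def interpolate_masked(positive, negative, box, steps):
    """Return `steps` images blending negative -> positive inside `box` only.

    positive, negative: 2D lists of floats with equal shape.
    box: (row0, row1, col0, col1), half-open ranges of the masked object region.
    steps: number of images in the list; must be at least 2.
    """
    r0, r1, c0, c1 = box
    images = []
    for k in range(steps):
        alpha = k / (steps - 1)  # 0.0 = object hidden, 1.0 = fully visible
        img = [row[:] for row in positive]
        for r in range(r0, r1):
            for c in range(c0, c1):
                img[r][c] = (1 - alpha) * negative[r][c] + alpha * positive[r][c]
        images.append(img)
    return images


positive = [[1.0, 1.0], [1.0, 1.0]]
negative = [[0.0, 0.0], [0.0, 0.0]]
ranked = interpolate_masked(positive, negative, box=(0, 1, 0, 1), steps=3)
print([img[0][0] for img in ranked])  # [0.0, 0.5, 1.0]
```

The resulting list is already ordered by increasing object visibility, which is the ordering a listwise ranking objective would be trained to reproduce.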

Authors:Kaiming Liu, Xuanyu Lei, Ziyue Wang, Peng Li, Yang Liu
Title: Agent-Environment Alignment via Automated Interface Generation
Abstract:
Large language model (LLM) agents have shown impressive reasoning capabilities in interactive decision-making tasks. These agents interact with the environment through intermediate interfaces, such as predefined action spaces and interaction rules, which mediate perception and action. However, mismatches often happen between the internal expectations of the agent regarding the influence of its issued actions and the actual state transitions in the environment, a phenomenon referred to as agent-environment misalignment. While prior work has invested substantially in improving agent strategies and environment design, the critical role of the interface still remains underexplored. In this work, we empirically demonstrate that agent-environment misalignment poses a significant bottleneck to agent performance. To mitigate this issue, we propose ALIGN, an Auto-Aligned Interface Generation framework that alleviates the misalignment by enriching the interface. Specifically, the ALIGN-generated interface enhances both the static information of the environment and the step-wise observations returned to the agent. Implemented as a lightweight wrapper, this interface achieves the alignment without modifying either the agent logic or the environment code. Experiments across multiple domains, including embodied tasks, web navigation, and tool use, show consistent performance improvements, with up to a 45.67% success rate improvement observed in ALFWorld. Meanwhile, the ALIGN-generated interface can generalize across different agent architectures and LLM backbones without interface regeneration. Code and experimental results are available at https://github.com/THUNLP-MT/ALIGN.
中文:该研究指出智能体与环境之间的错位是大型语言模型智能体性能的主要瓶颈,并提出了ALIGN自动对齐接口生成框架,通过增强环境信息和观察数据来解决这一错位问题,无需修改智能体逻辑或环境代码,在多个领域实现了显著的性能提升。
English: The study identifies agent-environment misalignment as a key performance bottleneck in LLM agents and introduces ALIGN, an auto-aligned interface generation framework that enhances environmental information and observations to resolve this issue without altering agent or environment code, achieving significant performance gains across various domains.
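
The "lightweight wrapper" idea in the abstract above can be illustrated with a toy example: a wrapper sits between agent and environment and enriches uninformative observations without touching either side's code. Everything here is hypothetical, including the `ToyEnv` class, its `step` method, and the hand-written hints; in ALIGN the interface is generated automatically, which this sketch does not attempt.

```python
class ToyEnv:
    """Hypothetical environment: a drawer must be opened before taking the key."""

    def __init__(self):
        self.drawer_open = False

    def step(self, action):
        if action == "open drawer":
            self.drawer_open = True
            return "You open the drawer."
        if action == "take key" and self.drawer_open:
            return "You take the key."
        return "Nothing happens."


class EnrichedInterface:
    """Wrapper that adds rule hints to uninformative observations."""

    def __init__(self, env, hints):
        self.env = env
        self.hints = hints  # maps action -> explanation of its precondition

    def step(self, action):
        obs = self.env.step(action)
        # Only enrich the observation when the environment gave no useful feedback.
        if obs == "Nothing happens." and action in self.hints:
            obs += " Hint: " + self.hints[action]
        return obs


env = EnrichedInterface(ToyEnv(), {"take key": "open the drawer first."})
print(env.step("take key"))     # Nothing happens. Hint: open the drawer first.
print(env.step("open drawer"))  # You open the drawer.
print(env.step("take key"))     # You take the key.
```

Because the wrapper exposes the same `step` method as the environment, an agent written against `ToyEnv` can use `EnrichedInterface` unchanged.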

Authors:Wei Chen, Zhao Zhang, Meng Yuan, Kepeng Xu, Fuzhen Zhuang
Title: FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis
Abstract:
In this paper, we address the task of targeted sentiment analysis (TSA), which involves two sub-tasks, i.e., identifying specific aspects from reviews and determining their corresponding sentiments. Aspect extraction forms the foundation for sentiment prediction, highlighting the critical dependency between these two tasks for effective cross-task knowledge transfer. While most existing studies adopt a multi-task learning paradigm to align task-specific features in the latent space, they predominantly rely on coarse-grained knowledge transfer. Such approaches lack fine-grained control over aspect-sentiment relationships, often assuming uniform sentiment polarity within related aspects. This oversimplification neglects contextual cues that differentiate sentiments, leading to negative transfer. To overcome these limitations, we propose FCKT, a fine-grained cross-task knowledge transfer framework tailored for TSA. By explicitly incorporating aspect-level information into sentiment prediction, FCKT achieves fine-grained knowledge transfer, effectively mitigating negative transfer and enhancing task performance. Experiments on three datasets, including comparisons with various baselines and large language models (LLMs), demonstrate the effectiveness of FCKT. The source code is available on https://github.com/cwei01/FCKT.
中文摘要:本文提出FCKT框架,通过细粒度的跨任务知识迁移,在目标情感分析中结合方面级信息优化情感预测,有效缓解负迁移问题并提升任务性能。
English Summary: This paper introduces FCKT, a fine-grained cross-task knowledge transfer framework for targeted sentiment analysis that addresses limitations in existing methods by incorporating aspect-level information to improve sentiment prediction and reduce negative transfer.

Authors:Wenhu Li, Niki van Stein, Thomas Bäck, Elena Raponi
Title: LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms
Abstract:
Bayesian optimization (BO) is a powerful class of algorithms for optimizing expensive black-box functions, but designing effective BO algorithms remains a manual, expertise-driven task. Recent advancements in Large Language Models (LLMs) have opened new avenues for automating scientific discovery, including the automatic design of optimization algorithms. While prior work has used LLMs within optimization loops or to generate non-BO algorithms, we tackle a new challenge: Using LLMs to automatically generate full BO algorithm code. Our framework uses an evolution strategy to guide an LLM in generating Python code that preserves the key components of BO algorithms: An initial design, a surrogate model, and an acquisition function. The LLM is prompted to produce multiple candidate algorithms, which are evaluated on the established Black-Box Optimization Benchmarking (BBOB) test suite from the COmparing Continuous Optimizers (COCO) platform. Based on their performance, top candidates are selected, combined, and mutated via controlled prompt variations, enabling iterative refinement. Despite no additional fine-tuning, the LLM-generated algorithms outperform state-of-the-art BO baselines in 19 (out of 24) BBOB functions in dimension 5 and generalize well to higher dimensions, and different tasks (from the Bayesmark framework). This work demonstrates that LLMs can serve as algorithmic co-designers, offering a new paradigm for automating BO development and accelerating the discovery of novel algorithmic combinations. The source code is provided at https://github.com/Ewendawi/LLaMEA-BO.
中文摘要:本研究提出一种利用大型语言模型通过进化策略自动生成并优化贝叶斯优化算法的框架,在多数基准测试中表现优于现有方法。
English Summary: This study introduces a framework using Large Language Models to automatically generate and refine Bayesian optimization algorithms through evolutionary strategies, achieving superior performance over existing methods in most benchmark tests.

Authors:Nils Neukirch, Johanna Vielhaben, Nils Strodthoff
Title: FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models
Abstract:
Internal representations are crucial for understanding deep neural networks, such as their properties and reasoning patterns, but remain difficult to interpret. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn such a mapping in a probabilistic manner. We demonstrate the feasibility of this approach across various pretrained image classifiers from CNNs to ViTs, showing excellent reconstruction capabilities. Through qualitative comparisons and robustness analysis, we validate our method and showcase possible applications, such as the visualization of concept steering in input space or investigations of the composite nature of the feature space. This approach has broad potential for improving feature space understanding in computer vision models.
中文摘要:本文提出了一种条件扩散模型,以概率方式将特征空间映射到输入空间,有效提升了多种分类器中深度神经网络的可解释性,并展示了卓越的重建能力和应用前景。
English Summary: This paper introduces a conditional diffusion model to probabilistically map feature spaces to input spaces, enhancing the interpretability of deep neural networks across various classifiers with strong reconstruction and application potential.

Authors:Yuan Gao, Ruiqi Shu, Hao Wu, Fan Xu, Yanfei Xiang, Ruijian Gou, Qingsong Wen, Xian Wu, Kun Wang, Xiaomeng Huang
Title: NeuralOM: Neural Ocean Model for Subseasonal-to-Seasonal Simulation
Abstract:
Long-term, high-fidelity simulation of slow-changing physical systems, such as the ocean and climate, presents a fundamental challenge in scientific computing. Traditional autoregressive machine learning models often fail in these tasks as minor errors accumulate and lead to rapid forecast degradation. To address this problem, we propose NeuralOM, a general neural operator framework designed for simulating complex, slow-changing dynamics. NeuralOM's core consists of two key innovations: (1) a Progressive Residual Correction Framework that decomposes the forecasting task into a series of fine-grained refinement steps, effectively suppressing long-term error accumulation; and (2) a Physics-Guided Graph Network whose built-in adaptive messaging mechanism explicitly models multi-scale physical interactions, such as gradient-driven flows and multiplicative couplings, thereby enhancing physical consistency while maintaining computational efficiency. We validate NeuralOM on the challenging task of global Subseasonal-to-Seasonal (S2S) ocean simulation. Extensive experiments demonstrate that NeuralOM not only surpasses state-of-the-art models in forecast accuracy and long-term stability, but also excels in simulating extreme events. For instance, at a 60-day lead time, NeuralOM achieves a 13.3% lower RMSE compared to the best-performing baseline, offering a stable, efficient, and physically-aware paradigm for data-driven scientific computing. Code link: https://github.com/YuanGao-YG/NeuralOM.
中文: NeuralOM提出了一种神经算子框架,通过渐进残差修正和物理引导图网络有效抑制慢变系统(如海洋)长期模拟中的误差累积,提升物理一致性,在预测精度和稳定性上均超越现有最佳模型。
English: NeuralOM introduces a neural operator framework with progressive residual correction and physics-guided graph networks to effectively suppress error accumulation and enhance physical consistency in long-term simulations of slow-changing systems like oceans, achieving superior accuracy and stability over existing models.

Authors:Devran Ugurlu, Shuang Qian, Elliot Fairweather, Charlene Mauger, Bram Ruijsink, Laura Dal Toso, Yu Deng, Marina Strocchi, Reza Razavi, Alistair Young, Pablo Lamata, Steven Niederer, Martin Bishop
Title: Cardiac Digital Twins at Scale from MRI: Open Tools and Representative Models from ~55000 UK Biobank Participants
Abstract:
A cardiac digital twin is a virtual replica of a patient's heart for screening, diagnosis, prognosis, risk assessment, and treatment planning of cardiovascular diseases. This requires an anatomically accurate patient-specific 3D structural representation of the heart, suitable for electro-mechanical simulations or study of disease mechanisms. However, generation of cardiac digital twins at scale is demanding and there are no public repositories of models across demographic groups. We describe an automatic open-source pipeline for creating patient-specific left and right ventricular meshes from cardiovascular magnetic resonance images, its application to a large cohort of ~55000 participants from UK Biobank, and the construction of the most comprehensive cohort of adult heart models to date, comprising 1423 representative meshes across sex (male, female), body mass index (range: 16 - 42 kg/m$^2$) and age (range: 49 - 80 years). Our code is available at https://github.com/cdttk/biv-volumetric-meshing/tree/plos2025 , and pre-trained networks, representative volumetric meshes with fibers and UVCs will be made available soon.
中文: 心脏数字孪生是一种用于疾病管理的虚拟心脏模型,本研究开发了一种自动化流程,能从心脏磁共振图像生成患者特异的心室网格,并应用于英国生物银行的大规模队列,构建了包含1423个涵盖不同人口统计学特征的成人心脏模型的最全面数据集。
English: A cardiac digital twin is a virtual heart model used for disease management, and this study presents an automated pipeline to generate patient-specific ventricular meshes from MRI data, applied to a large UK Biobank cohort to create a comprehensive set of 1423 representative adult heart models across demographics.

Authors:Cainan Davidson, Deva Ramanan, Neehar Peri
Title: RefAV: Towards Planning-Centric Scenario Mining
Abstract:
Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recent CVPR 2025 competition and share insights from the community. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html
中文: 本研究提出RefAV,一个包含10,000条自然语言查询的大规模数据集,用于在自动驾驶日志中进行时空场景挖掘,并通过实证分析发现直接套用现成的视觉语言模型效果不佳。
English: This study introduces RefAV, a large-scale dataset of 10,000 natural language queries for spatio-temporal scenario mining in autonomous driving logs, evaluates several referential multi-object trackers as baselines, and finds that naively repurposing off-the-shelf vision-language models performs poorly.

Authors:Shamil Ayupov, Maksim Nakhodnov, Anastasia Yaschenko, Andrey Kuznetsov, Aibek Alanov
Title: DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization
Abstract:
Personalized diffusion models have shown remarkable success in Text-to-Image (T2I) generation by enabling the injection of user-defined concepts into diverse contexts. However, balancing concept fidelity with contextual alignment remains a challenging open problem. In this work, we propose an RL-based approach that leverages the diverse outputs of T2I models to address this issue. Our method eliminates the need for human-annotated scores by generating a synthetic paired dataset for DPO-like training using external quality metrics. These better-worse pairs are specifically constructed to improve both concept fidelity and prompt adherence. Moreover, our approach supports flexible adjustment of the trade-off between image fidelity and textual alignment. Through multi-step training, our approach outperforms a naive baseline in convergence speed and output quality. We conduct extensive qualitative and quantitative analysis, demonstrating the effectiveness of our method across various architectures and fine-tuning techniques. The source code can be found at https://github.com/ControlGenAI/DreamBoothDPO.
中文: 本文提出一种基于强化学习的方法,通过自动生成训练数据对来提升个性化文生图的概念保真度和提示对齐效果,无需人工标注即可灵活调节图像质量与文本匹配的平衡。
English: This paper introduces a reinforcement learning-based method that enhances personalized text-to-image generation by automatically creating training pairs to improve concept fidelity and prompt alignment without human annotation, while allowing flexible adjustment of the trade-off between image fidelity and textual alignment.
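The pair-construction step mentioned in the abstract can be pictured with the minimal sketch below. It is not the authors' implementation: the `Candidate` record, its two score fields, the weight `w`, and the `margin` threshold are all hypothetical names introduced here for illustration. The abstract only states that external quality metrics are used to build better-worse pairs and that the fidelity/alignment trade-off is adjustable.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    """One generated image with two external metric scores (hypothetical fields)."""
    image_id: str
    concept_fidelity: float   # e.g. similarity to the reference concept images
    prompt_adherence: float   # e.g. similarity between the image and the text prompt


def combined_score(c: Candidate, w: float) -> float:
    # w in [0, 1] trades concept fidelity against prompt adherence (assumed knob).
    return w * c.concept_fidelity + (1.0 - w) * c.prompt_adherence


def build_pairs(cands, w=0.5, margin=0.05):
    """Return (better, worse) pairs whose combined scores differ by at least `margin`."""
    pairs = []
    for a, b in combinations(cands, 2):
        sa, sb = combined_score(a, w), combined_score(b, w)
        if abs(sa - sb) < margin:
            continue  # too close to call; skip ambiguous pairs
        pairs.append((a, b) if sa > sb else (b, a))
    return pairs


if __name__ == "__main__":
    samples = [
        Candidate("img0", 0.80, 0.60),
        Candidate("img1", 0.50, 0.55),
        Candidate("img2", 0.79, 0.62),
    ]
    for better, worse in build_pairs(samples):
        print(better.image_id, ">", worse.image_id)
```

Raising `w` in this sketch favors concept fidelity when ranking candidates, and lowering it favors prompt adherence. This is one simple way to expose the trade-off the abstract describes.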

Authors:Zhibo Wang, Xiaoze Jiang, Zhiheng Qin, Enyun Yu, Han Li
Title: Personalized Query Auto-Completion for Long and Short-Term Interests with Adaptive Detoxification Generation
Abstract:
Query auto-completion (QAC) plays a crucial role in modern search systems. However, in real-world applications, there are two pressing challenges that still need to be addressed. First, there is a need for hierarchical personalized representations for users. Previous approaches have typically used users' search behavior as a single, overall representation, which proves inadequate in more nuanced generative scenarios. Additionally, query prefixes are typically short and may contain typos or sensitive information, increasing the likelihood of generating toxic content compared to traditional text generation tasks. Such toxic content can degrade user experience and lead to public relations issues. Therefore, the second critical challenge is detoxifying QAC systems. To address these two limitations, we propose a novel model (LaD) that captures personalized information from both long-term and short-term interests, incorporating adaptive detoxification. In LaD, personalized information is captured hierarchically at both coarse-grained and fine-grained levels. This approach preserves as much personalized information as possible while enabling online generation within time constraints. To go a step further, we propose an online training method based on Reject Preference Optimization (RPO). By incorporating a special token [Reject] during both the training and inference processes, the model achieves adaptive detoxification. Consequently, the generated text presented to users is both non-toxic and relevant to the given prefix. We conduct comprehensive experiments on industrial-scale datasets and perform online A/B tests, delivering the largest single-experiment metric improvement for our product in nearly two years. Our model has been deployed on Kuaishou search, driving the primary traffic for hundreds of millions of active users. The code is available at https://github.com/JXZe/LaD.
中文: LaD模型通过分层捕捉用户长短期兴趣并采用基于拒绝偏好优化的自适应去毒方法,解决了查询自动补全中的个性化表示和内容去毒两大挑战,在实际应用中取得了显著成效。
English: The LaD model addresses hierarchical personalization and detoxification in query auto-completion by capturing users' long-term and short-term interests and incorporating adaptive detoxification through Reject Preference Optimization, achieving significant improvements in real-world deployment.

Authors:Yaohua Zha, Yanzi Wang, Hang Guo, Jinpeng Wang, Tao Dai, Bin Chen, Zhihao Ouyang, Xue Yuerong, Ke Chen, Shu-Tao Xia
Title: PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter
Abstract:
Applying pre-trained models to assist point cloud understanding has recently become a mainstream paradigm in 3D perception. However, existing application strategies are straightforward, utilizing only the final output of the pre-trained model for various task heads. It neglects the rich complementary information in the intermediate layer, thereby failing to fully unlock the potential of pre-trained models. To overcome this limitation, we propose an orthogonal solution: Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model and leverages Mamba to fuse all complementary semantics, thereby promoting comprehensive point cloud understanding. Constructing this ordered sequence is non-trivial due to the inherent isotropy of 3D space. Therefore, we further propose a geometry-constrained gate prompt generator (G2PG) shared across different layers, which applies shared geometric constraints to the output gates of the Mamba and dynamically optimizes the spatial order, thus enabling more effective integration of multi-layer information. Extensive experiments conducted on challenging point cloud datasets across various tasks demonstrate that our PMA elevates the capability for point cloud understanding to a new level by fusing diverse complementary intermediate features. Code is available at https://github.com/zyh16143998882/PMA.
中文摘要:提出的点云曼巴适配器(PMA)通过利用曼巴模型融合预训练模型各层特征,并结合几何约束门控提示生成器,有效整合中间层互补信息,将点云理解能力提升至新水平。
English Summary: The Point Mamba Adapter (PMA) is introduced to enhance point cloud understanding by integrating multi-layer features from pre-trained models using a Mamba-based fusion approach and a geometry-constrained gate prompt generator, overcoming limitations of existing methods that only utilize final outputs.

Authors:Pingrui Zhang, Yifei Su, Pengyuan Wu, Dong An, Li Zhang, Zhigang Wang, Dong Wang, Yan Ding, Bin Zhao, Xuelong Li
Title: Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
Abstract:
Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics via language form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is available at https://github.com/zhangpingrui/Adaptive-Text-Dreamer.
中文: 提出的自适应文本想象器(ATD)采用双分支大语言模型架构,通过语言形式想象关键环境语义,以更低计算成本和参数量实现了领先的导航性能。
English: The proposed Adaptive Text Dreamer (ATD) uses a dual-branch LLM architecture to imagine future environmental semantics through language, reducing computational costs while achieving state-of-the-art navigation performance with fewer parameters.

Authors:Weichao Pan, Bohan Xu, Xu Wang, Chengze Lv, Shuoyang Wang, Zhenke Duan, Zhen Tian
Title: YOLO-FireAD: Efficient Fire Detection via Attention-Guided Inverted Residual Learning and Dual-Pooling Feature Preservation
Abstract:
Fire detection in dynamic environments faces persistent challenges, including interference from illumination changes, frequent false and missed detections, and the difficulty of achieving both efficiency and accuracy. To address limited feature extraction and information loss in existing YOLO-based models, this study proposes You Only Look Once for Fire Detection with Attention-guided Inverted Residual and Dual-pooling Downscale Fusion (YOLO-FireAD) with two core innovations: (1) the Attention-guided Inverted Residual Block (AIR) integrates hybrid channel-spatial attention with inverted residuals to adaptively enhance fire features and suppress environmental noise; (2) the Dual Pool Downscale Fusion Block (DPDF) preserves multi-scale fire patterns through learnable fusion of max-average pooling outputs, mitigating small-fire detection failures. Extensive evaluation on two public datasets shows the efficiency of our model. It keeps the parameter count (1.45M, 51.8% lower than YOLOv8n) and computational cost (4.6G, 43.2% lower than YOLOv8n) low, and its mAP75 is 1.3-5.5% higher than that of mainstream real-time object detection models, including YOLOv8n, YOLOv9t, YOLOv10n, YOLO11n, YOLOv12n, and other YOLOv8 variants. For more details, please visit our repository: https://github.com/JEFfersusu/YOLO-FireAD
Chinese: 本研究提出YOLO-FireAD模型,通过注意力引导的倒残差结构和双池化下采样融合技术,在降低参数量的同时显著提升了动态环境中火灾检测的准确率和效率。
English: The study introduces YOLO-FireAD, which enhances fire detection by integrating attention mechanisms and dual-pooling fusion to improve accuracy and efficiency while reducing parameters and computational costs.
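The "learnable fusion of max-average pooling outputs" attributed to the DPDF block can be illustrated with a small pure-Python sketch. This is an assumption-laden toy, not the paper's block: it works on a single 2D feature map, uses 2x2 stride-2 pooling, and replaces the learnable fusion with one sigmoid-gated scalar (`fusion_logit`, a hypothetical parameter).

```python
import math


def pool2x2(feature, reduce_fn):
    """Apply 2x2, stride-2 pooling with `reduce_fn` to a 2D list (even height/width)."""
    rows, cols = len(feature), len(feature[0])
    return [
        [
            reduce_fn([feature[r][c], feature[r][c + 1],
                       feature[r + 1][c], feature[r + 1][c + 1]])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def mean(values):
    return sum(values) / len(values)


def dual_pool_fuse(feature, fusion_logit=0.0):
    """Blend max- and average-pooled maps with a sigmoid-gated weight.

    `fusion_logit` stands in for a learnable scalar (an assumption of this sketch).
    """
    alpha = 1.0 / (1.0 + math.exp(-fusion_logit))  # weight on the max-pooled branch
    max_map = pool2x2(feature, max)
    avg_map = pool2x2(feature, mean)
    return [
        [alpha * m + (1.0 - alpha) * a for m, a in zip(max_row, avg_row)]
        for max_row, avg_row in zip(max_map, avg_map)
    ]
```

The intuition is that max pooling keeps a small, bright activation (such as a small flame) that average pooling would dilute, while average pooling keeps the surrounding context. A learned weight lets training decide the balance.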

Authors:Jiyoung Lee, Seungho Kim, Jieun Han, Jun-Min Lee, Kitaek Kim, Alice Oh, Edward Choi
Title: Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties
Abstract:
Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties. These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity. Our code and datasets are publicly available at https://github.com/jiyounglee-0523/TransEnV and https://huggingface.co/collections/jiyounglee0523/transenv-681eadb3c0c8cf363b363fb1.
中文摘要:当前大语言模型评估主要针对标准美式英语,可能引发公平性问题,因此开发了Trans-EnV框架来自动将数据集转换为38种英语变体,发现非标准变体性能下降高达46.3%。
English Summary: Current LLM evaluations primarily focus on Standard American English, potentially causing fairness issues, so the Trans-EnV framework was developed to automatically transform datasets into 38 English varieties, revealing performance drops of up to 46.3% in non-standard varieties.

Authors:Chaeyoung Jung, Youngjoon Jang, Joon Son Chung
Title: AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding
Abstract:
Hallucination remains a major challenge in multimodal large language models (MLLMs). To address this, various contrastive decoding (CD) methods have been proposed that contrast original logits with hallucinated logits generated from perturbed inputs. While CD has shown promise in vision-language models (VLMs), it is not well-suited for AV-LLMs, where hallucinations often emerge from both unimodal and cross-modal combinations involving audio, video, and language. These intricate interactions call for a more adaptive and modality-aware decoding strategy. In this paper, we propose Audio-Visual Contrastive Decoding (AVCD), a novel, training-free decoding framework designed to model trimodal interactions and suppress modality-induced hallucinations in AV-LLMs. Unlike previous CD methods in VLMs that corrupt a fixed modality, AVCD leverages attention distributions to dynamically identify less dominant modalities and applies attentive masking to generate perturbed output logits. To support CD in a trimodal setting, we also reformulate the original CD framework to jointly handle audio, visual, and textual inputs. Finally, to improve efficiency, we introduce entropy-guided adaptive decoding, which selectively skips unnecessary decoding steps based on the model's confidence in its predictions. Extensive experiments demonstrate that AVCD consistently outperforms existing decoding methods. In particular, on the AVHBench dataset, it improves accuracy by 2% for VideoLLaMA2 and 7% for video-SALMONN, demonstrating strong robustness and generalizability. Our code is available at https://github.com/kaistmm/AVCD.
中文: 本文提出视听对比解码(AVCD),这一无需训练的框架通过动态利用注意力分布和熵引导自适应解码,有效抑制视听大语言模型中的模态引发幻觉,在基准测试中显著提升了准确率。
English: The paper introduces Audio-Visual Contrastive Decoding (AVCD), a training-free framework that dynamically suppresses modality-induced hallucinations in AV-LLMs by leveraging attention distributions and entropy-guided adaptive decoding, achieving significant accuracy improvements on benchmarks.
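For readers unfamiliar with contrastive decoding, the sketch below shows the generic single-perturbation form commonly used in vision-language work, combined with a simple entropy gate. It does not reproduce AVCD's trimodal reformulation or its attention-based masking. The names `decode_step`, `perturbed_fn`, `alpha`, and `entropy_threshold` are assumptions made for this illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def contrastive_logits(original, perturbed, alpha=1.0):
    """Amplify what the original input supports and the perturbed input does not."""
    return [(1.0 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


def decode_step(original, perturbed_fn, alpha=1.0, entropy_threshold=0.5):
    """Pick the next token id; skip the perturbed pass when the model is confident."""
    if entropy(softmax(original)) < entropy_threshold:
        scores = original                      # confident: no extra forward pass
    else:
        scores = contrastive_logits(original, perturbed_fn(), alpha)
    return max(range(len(scores)), key=scores.__getitem__)
```

Passing the perturbed logits as a callable means the extra forward pass is only paid for when the entropy gate decides the step is uncertain. This mirrors the efficiency motivation in the abstract.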

Authors:Shuo Wang, Shunyang Huang, Jinghui Yuan, Zhixiang Shen, Zhao Kang
Title: Cooperation of Experts: Fusing Heterogeneous Information with Large Margin
Abstract:
Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns across different semantic spaces. To address this limitation, we propose the Cooperation of Experts (CoE) framework, which encodes multi-typed information into unified heterogeneous multiplex networks. By overcoming modality and connection differences, CoE provides a powerful and flexible model for capturing the intricate structures of real-world complex data. In our framework, dedicated encoders act as domain-specific experts, each specializing in learning distinct relational patterns in specific semantic spaces. To enhance robustness and extract complementary knowledge, these experts collaborate through a novel large margin mechanism supported by a tailored optimization strategy. Rigorous theoretical analyses guarantee the framework's feasibility and stability, while extensive experiments across diverse benchmarks demonstrate its superior performance and broad applicability. Our code is available at https://github.com/strangeAlan/CoE.
Chinese: 专家协作(CoE)框架通过将多类型信息编码为统一的异构多重网络,解决了异构数据融合的挑战,并利用新颖的大间隔机制和定制优化策略,实现了领域特定编码器之间的稳健协作。
English: The Cooperation of Experts (CoE) framework addresses the challenge of fusing heterogeneous data by encoding multi-typed information into unified heterogeneous multiplex networks, enabling robust collaboration among domain-specific encoders through a novel large margin mechanism and tailored optimization strategy.

Authors:Dooho Lee, Myeong Kong, Sagad Hamid, Cheonwoo Lee, Jaemin Yoo
Title: Aggregation Buffer: Revisiting DropEdge with a New Parameter Block
Abstract:
We revisit DropEdge, a data augmentation technique for GNNs which randomly removes edges to expose diverse graph structures during training. Although it is a promising approach to reducing overfitting on specific connections in the graph, we observe that its performance gain in supervised learning tasks is significantly limited. To understand why, we provide a theoretical analysis showing that the limited performance of DropEdge comes from a fundamental limitation that exists in many GNN architectures. Based on this analysis, we propose Aggregation Buffer, a parameter block specifically designed to improve the robustness of GNNs by addressing the limitation of DropEdge. Our method is compatible with any GNN model, and shows consistent performance improvements on multiple datasets. Moreover, our method effectively addresses well-known problems such as degree bias or structural disparity as a unifying solution. Code and datasets are available at https://github.com/dooho00/agg-buffer.
Chinese: 本研究分析了DropEdge在图神经网络中因架构限制导致的性能局限,并提出通用参数模块Aggregation Buffer,该模块通过增强模型鲁棒性在多个数据集上实现持续性能提升,同时有效解决度偏差等已知问题。
English: This study analyzes DropEdge's limited performance in GNNs due to architectural constraints and introduces Aggregation Buffer, a universally compatible parameter block that enhances robustness and consistently improves performance across datasets while addressing issues like degree bias.
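As background, the DropEdge augmentation that the paper revisits amounts to independently removing each edge with some probability at every training step. A minimal sketch follows. The function name `drop_edge` and the edge-list representation are choices made here, and the Aggregation Buffer block itself is not sketched because the abstract does not describe its internals.

```python
import random


def drop_edge(edges, drop_prob, rng=None):
    """Return a copy of `edges` with each edge independently removed with `drop_prob`."""
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    if rng is None:
        rng = random.Random()
    # rng.random() is in [0, 1), so drop_prob=0 keeps everything and 1 drops everything.
    return [e for e in edges if rng.random() >= drop_prob]


if __name__ == "__main__":
    graph = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    for epoch in range(3):
        # A fresh random subgraph is sampled for every epoch.
        print(epoch, drop_edge(graph, drop_prob=0.4, rng=random.Random(epoch)))
```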

Authors:Soichiro Murakami, Peinan Zhang, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
Title: AdParaphrase v2.0: Generating Attractive Ad Texts Using a Preference-Annotated Paraphrase Dataset
Abstract:
Identifying factors that make ad text attractive is essential for advertising success. This study proposes AdParaphrase v2.0, a dataset for ad text paraphrasing, containing human preference data, to enable the analysis of the linguistic factors and to support the development of methods for generating attractive ad texts. Compared with v1.0, this dataset is 20 times larger, comprising 16,460 ad text paraphrase pairs, each annotated with preference data from ten evaluators, thereby enabling a more comprehensive and reliable analysis. Through the experiments, we identified multiple linguistic features of engaging ad texts that were not observed in v1.0 and explored various methods for generating attractive ad texts. Furthermore, our analysis demonstrated the relationships between human preference and ad performance, and highlighted the potential of reference-free metrics based on large language models for evaluating ad text attractiveness. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase-v2.0.
中文摘要:本研究推出的AdParaphrase v2.0数据集通过大规模人工偏好标注,能够识别吸引人的广告文本语言特征,并探索了广告文本生成方法与评估指标的有效性。
English Summary: This study introduces AdParaphrase v2.0, a significantly expanded dataset with human preference annotations that enables identification of linguistic features for creating engaging ad texts and explores methods for their generation and evaluation.

Authors:Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, Haifeng Wang
Title: Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation
Abstract:
Long-form question answering (LFQA) presents unique challenges for large language models, requiring the synthesis of coherent, paragraph-length answers. While retrieval-augmented generation (RAG) systems have emerged as a promising solution, existing research struggles with key limitations: the scarcity of high-quality training data for long-form generation, the compounding risk of hallucination in extended outputs, and the absence of reliable evaluation metrics for factual completeness. In this paper, we propose RioRAG, a novel reinforcement learning (RL) framework that advances long-form RAG through reinforced informativeness optimization. Our approach introduces two fundamental innovations to address the core challenges. First, we develop an RL training paradigm of reinforced informativeness optimization that directly optimizes informativeness and effectively addresses the slow-thinking deficit in conventional RAG systems, bypassing the need for expensive supervised data. Second, we propose a nugget-centric hierarchical reward modeling approach that enables precise assessment of long-form answers through a three-stage process: extracting the nugget from every source webpage, constructing a nugget claim checklist, and computing rewards based on factual alignment. Extensive experiments on two LFQA benchmarks LongFact and RAGChecker demonstrate the effectiveness of the proposed method. Our codes are available at https://github.com/RUCAIBox/RioRAG.
中文: RioRAG框架通过强化学习优化信息丰富性和采用以要点为中心的分层奖励模型,提升了长形式问答的事实准确性,有效解决了检索增强生成系统中数据稀缺和评估困难等核心问题。
English: The RioRAG framework introduces a reinforcement learning approach with reinforced informativeness optimization and a nugget-centric hierarchical reward model to enhance long-form question answering by improving factual accuracy and addressing data scarcity and evaluation challenges in retrieval-augmented generation systems.
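The three-stage reward described in the abstract (per-page nuggets, a merged checklist, a reward from factual alignment) can be pictured with the toy sketch below. It is only an illustration under strong assumptions. Nuggets are taken as already extracted, and the factual-alignment judgment is replaced by a naive case-insensitive substring check, which is a stand-in chosen here and not the paper's method. The function names are hypothetical.

```python
def build_checklist(nuggets_per_page):
    """Merge per-page nuggets into one de-duplicated checklist, preserving order."""
    seen, checklist = set(), []
    for page_nuggets in nuggets_per_page:
        for nugget in page_nuggets:
            key = nugget.strip().lower()
            if key and key not in seen:
                seen.add(key)
                checklist.append(key)
    return checklist


def nugget_reward(answer, checklist):
    """Fraction of checklist nuggets found in the answer (naive substring check)."""
    if not checklist:
        return 0.0
    text = answer.lower()
    covered = sum(1 for nugget in checklist if nugget in text)
    return covered / len(checklist)
```

A scalar of this kind could serve as the reward signal in an RL loop. A realistic system would need a much more robust alignment check than substring matching.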

Authors:Jiaping Xiao, Cheng Wen Tsao, Yuhang Zhang, Mir Feroskhan
Title: FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation
Abstract:
Path planning is a critical component in autonomous drone operations, enabling safe and efficient navigation through complex environments. Recent advances in foundation models, particularly large language models (LLMs) and vision-language models (VLMs), have opened new opportunities for enhanced perception and intelligent decision-making in robotics. However, their practical applicability and effectiveness in global path planning remain relatively unexplored. This paper proposes foundation model-guided path planners (FM-Planner) and presents a comprehensive benchmarking study and practical validation for drone path planning. Specifically, we first systematically evaluate eight representative LLM and VLM approaches using standardized simulation scenarios. To enable effective real-time navigation, we then design an integrated LLM-Vision planner that combines semantic reasoning with visual perception. Furthermore, we deploy and validate the proposed path planner through real-world experiments under multiple configurations. Our findings provide valuable insights into the strengths, limitations, and feasibility of deploying foundation models in real-world drone applications and provide practical implementations for autonomous flight. Project site: https://github.com/NTU-ICG/FM-Planner.
Chinese: 本文提出基于基础模型的无人机路径规划系统FM-Planner,通过仿真和实际部署验证了该系统在自主飞行应用中的可行性、优势与局限性。
English: This paper introduces FM-Planner, a foundation model-based path planning system for drones, which is evaluated through simulations and real-world experiments to assess its practical viability and limitations in autonomous navigation.

Authors:Noy Sternlicht, Tom Hope
Title: CHIMERA: A Knowledge Base of Scientific Idea Recombinations for Research Analysis and Ideation
Abstract:
A hallmark of human innovation is recombination -- the creation of novel ideas by integrating elements from existing concepts and mechanisms. In this work, we introduce CHIMERA, a large-scale Knowledge Base (KB) of over 28K recombination examples automatically mined from the scientific literature. CHIMERA enables large-scale empirical analysis of how scientists recombine concepts and draw inspiration from different areas, and enables training models that propose novel, cross-disciplinary research directions. To construct this KB, we define a new information extraction task: identifying recombination instances in scientific abstracts. We curate a high-quality, expert-annotated dataset and use it to fine-tune a large language model, which we apply to a broad corpus of AI papers. We showcase the utility of CHIMERA through two applications. First, we analyze patterns of recombination across AI subfields. Second, we train a scientific hypothesis generation model using the KB, showing that it can propose novel research directions that researchers rate as inspiring. We release our data and code at https://github.com/noy-sternlicht/CHIMERA-KB.
Chinese: 本文介绍了CHIMERA,一个从科学文献中自动挖掘超过2.8万个重组案例的大规模知识库,可用于分析跨学科创新模式并训练生成新颖研究方向的人工智能模型。
English: This paper introduces CHIMERA, a large-scale Knowledge Base of over 28K recombination examples mined from scientific literature, enabling analysis of cross-disciplinary innovation and training models for novel research direction proposals.

Authors:Jungyoub Cha, Hyunjong Kim, Sungzoon Cho
Title: SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
Abstract:
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), but its performance degrades on long inputs due to increased attention cost and reduced draft accuracy. We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences without any additional training. First, SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models. To improve draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval, a novel KV cache eviction strategy that uses the target model's attention scores to dynamically select relevant context for the draft model. Extensive evaluations on three long-context understanding datasets show that SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens, providing an effective solution for speculative decoding of long sequences. Our code is available at https://github.com/jycha98/SpecExtend .
中文: 推测解码在长输入下性能下降,而SpecExtend通过高效注意力机制和新型KV缓存策略,无需额外训练即可在最长16K词元的输入上将基于树的推测解码速度提升最高2.22倍。
English: Speculative decoding performance declines with longer inputs, but SpecExtend enhances it for long sequences using efficient attention and a novel KV cache strategy, achieving up to 2.22x speedup on inputs of up to 16K tokens without extra training.

Authors:Jungyoub Cha, Hyunjong Kim, Sungzoon Cho
Title: SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
Abstract:
Speculative decoding is a widely used technique for accelerating inference in large language models (LLMs), but its performance degrades as input length grows, with significant drops even at moderate lengths. Yet, this early degradation has remained largely underexplored. We introduce SpecExtend, a drop-in enhancement that improves speculative decoding on long sequences without additional training. SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention to accelerate prefill and verification steps. To improve both draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval, a novel KV cache eviction strategy that leverages the target model's attention scores to dynamically select relevant context for the smaller draft model. Extensive evaluations show that SpecExtend accelerates speculative decoding by up to 2.84x on 16K-token long summarization and up to 3.86x on long reasoning, while preserving the short-input performance of state-of-the-art frameworks. Our code is available at https://github.com/jycha98/SpecExtend .
中文: 推测解码在长输入下性能下降,而SpecExtend通过高效注意力机制和新型KV缓存策略,无需额外训练即可将长序列处理速度提升最高3.86倍。
English: Speculative decoding performance declines with longer inputs, but SpecExtend enhances it for long sequences using efficient attention and a novel KV cache strategy, achieving up to 3.86x speedup without extra training.
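Illustrative sketch: the abstract describes Cross-model Retrieval as using the target model's attention scores to choose which context the draft model keeps in its KV cache. The toy code below shows one way such a selection could look. The chunking scheme, the mean-attention chunk score, and the names `select_draft_context`, `chunk_size`, and `top_k` are assumptions made for illustration and are not taken from the SpecExtend code.

```python
import numpy as np

def select_draft_context(attn_scores, chunk_size, top_k):
    """Return sorted token indices to keep in the draft model's KV cache.

    attn_scores: 1-D array of target-model attention mass per context token
                 (assumed to be already aggregated over heads and layers).
    """
    n = len(attn_scores)
    n_chunks = (n + chunk_size - 1) // chunk_size
    # Score each chunk by the mean attention its tokens received.
    chunk_scores = np.array([
        attn_scores[i * chunk_size:(i + 1) * chunk_size].mean()
        for i in range(n_chunks)
    ])
    # Keep the top_k highest-scoring chunks; everything else is evicted.
    keep_chunks = np.argsort(chunk_scores)[::-1][:top_k]
    keep = [
        idx
        for c in sorted(int(c) for c in keep_chunks)
        for idx in range(c * chunk_size, min((c + 1) * chunk_size, n))
    ]
    return np.array(keep)

# Hypothetical attention scores over an 8-token context.
scores = np.array([0.01, 0.02, 0.30, 0.25, 0.01, 0.01, 0.20, 0.20])
kept = select_draft_context(scores, chunk_size=2, top_k=2)
```

In this toy input the second and fourth chunks receive most of the attention, so only their tokens would remain visible to the draft model.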

Authors:Xiaowen Ma, Zhenliang Ni, Shuai Xiao, Xinghao Chen
Title: TimePro: Efficient Multivariate Long-term Time Series Forecasting with Variable- and Time-Aware Hyper-state
Abstract:
In long-term time series forecasting, different variables often influence the target variable over distinct time intervals, a challenge known as the multi-delay issue. Traditional models typically process all variables or time points uniformly, which limits their ability to capture complex variable relationships and obtain non-trivial time representations. To address this issue, we propose TimePro, an innovative Mamba-based model that constructs variate- and time-aware hyper-states. Unlike conventional approaches that merely transfer plain states across variable or time dimensions, TimePro preserves the fine-grained temporal features of each variate token and adaptively selects the focused time points to tune the plain state. The reconstructed hyper-state can perceive both variable relationships and salient temporal information, which helps the model make accurate forecasts. In experiments, TimePro performs competitively on eight real-world long-term forecasting benchmarks with satisfactory linear complexity. Code is available at https://github.com/xwmaxwma/TimePro.
中文: TimePro是一种基于Mamba的创新模型,通过构建变量和时间感知的超状态来解决长期时间序列预测中的多延迟问题,在真实世界基准测试中以线性复杂度实现了优异性能。
English: TimePro is a novel Mamba-based model that addresses the multi-delay issue in long-term time series forecasting by constructing variate- and time-aware hyper-states, achieving competitive performance with linear complexity on real-world benchmarks.

Authors:Xiaqiang Tang, Jian Li, Keyu Hu, Du Nan, Xiaolong Li, Xi Zhang, Weigao Sun, Sihong Xie
Title: CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
Abstract:
Faithfulness hallucinations are claims generated by a Large Language Model (LLM) not supported by contexts provided to the LLM. Lacking assessment standards, existing benchmarks focus on "factual statements" that rephrase source materials while overlooking "cognitive statements" that involve making inferences from the given context. Consequently, evaluating and detecting the hallucination of cognitive statements remains challenging. Inspired by how evidence is assessed in the legal domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and introduce the CogniBench dataset where we reveal insightful statistics. To keep pace with rapidly evolving LLMs, we further develop an automatic annotation pipeline that scales easily across different models. This results in a large-scale CogniBench-L dataset, which facilitates training accurate detectors for both factual and cognitive hallucinations. We release our model and datasets at: https://github.com/FUTUREEEEEE/CogniBench
中文摘要:本研究提出了CogniBench框架和数据集,专门评估大型语言模型中对认知陈述的忠实度幻觉,并通过自动化流程生成大规模训练数据,以提升事实性和认知性幻觉的检测能力。
English Summary: The study introduces CogniBench, a framework and dataset for evaluating faithfulness hallucinations in LLMs, particularly focusing on cognitive statements, and develops an automated pipeline to create large-scale training data for detecting both factual and cognitive inaccuracies.

Authors:Eric Xing, Pranavi Kolouju, Robert Pless, Abby Stylianou, Nathan Jacobs
Title: ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval
Abstract:
Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state-of-the-art in composed image retrieval in both the supervised and zero-shot settings on multiple benchmark datasets, including CIRR and CIRCO. Source code, model checkpoints, and our new datasets are available at https://github.com/mvrl/ConText-CIR.
中文:提出的ConText-CIR框架通过文本概念一致性损失和合成数据生成技术,在多个基准测试中实现了组合图像检索的最新性能突破。
English: The proposed ConText-CIR framework improves composed image retrieval by using a Text Concept-Consistency loss and synthetic data generation, achieving state-of-the-art performance on multiple benchmarks.

Authors:Ryota Ushio, Takashi Ishida, Masashi Sugiyama
Title: Practical estimation of the optimal classification error with soft labels and calibration
Abstract:
While the performance of machine learning systems has experienced significant improvement in recent years, relatively little attention has been paid to the fundamental question: to what extent can we improve our models? This paper provides a means of answering this question in the setting of binary classification, which is practical and theoretically supported. We extend a previous work that utilizes soft labels for estimating the Bayes error, the optimal error rate, in two important ways. First, we theoretically investigate the properties of the bias of the hard-label-based estimator discussed in the original work. We reveal that the decay rate of the bias is adaptive to how well the two class-conditional distributions are separated, and it can decay significantly faster than the previous result suggested as the number of hard labels per instance grows. Second, we tackle a more challenging problem setting: estimation with corrupted soft labels. One might be tempted to use calibrated soft labels instead of clean ones. However, we reveal that calibration guarantee is not enough, that is, even perfectly calibrated soft labels can result in a substantially inaccurate estimate. Then, we show that isotonic calibration can provide a statistically consistent estimator under an assumption weaker than that of the previous work. Our method is instance-free, i.e., we do not assume access to any input instances. This feature allows it to be adopted in practical scenarios where the instances are not available due to privacy issues. Experiments with synthetic and real-world datasets show the validity of our methods and theory.
中文: 本文通过理论分析硬标签偏差衰减特性,并针对含噪软标签提出基于保序校准的一致性估计方法,在多种数据集上验证了二元分类中贝叶斯误差估计的有效性。
English: This paper refines Bayes error estimation for binary classification by theoretically analyzing how the bias of the hard-label-based estimator decays and by introducing a statistically consistent estimator based on isotonic calibration for corrupted soft labels, validated on synthetic and real-world datasets.
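Illustrative sketch: for binary classification the Bayes error equals the expectation of min(η(x), 1 − η(x)), where η(x) is the posterior probability of the positive class. That is why soft labels alone, without the input instances, are enough for a plug-in estimate. The code below shows this instance-free estimate and a hard-label variant that averages several 0/1 labels per instance. The function names are hypothetical, and the hard-label variant is an assumption about the general shape of such an estimator, not the paper's exact method; the isotonic calibration step is not shown.

```python
import numpy as np

def bayes_error_from_soft_labels(soft_labels):
    """Plug-in estimate: mean of min(eta_i, 1 - eta_i) over instances."""
    eta = np.asarray(soft_labels, dtype=float)
    return float(np.mean(np.minimum(eta, 1.0 - eta)))

def bayes_error_from_hard_labels(hard_labels):
    """hard_labels: (n_instances, m) array of 0/1 labels per instance.

    Averaging the m labels approximates each soft label; the estimate is
    biased for finite m, which is the bias the abstract refers to.
    """
    eta_hat = np.asarray(hard_labels, dtype=float).mean(axis=1)
    return bayes_error_from_soft_labels(eta_hat)

est_soft = bayes_error_from_soft_labels([0.1, 0.9, 0.5, 0.0])
est_hard = bayes_error_from_hard_labels([[1, 1, 0, 1], [0, 0, 0, 0]])
```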

Authors:Yufei Zhan, Hongyin Zhao, Yousong Zhu, Shurong Zheng, Fan Yang, Ming Tang, Jinqiao Wang
Title: Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
Abstract:
Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. However, they often fall short in integrating advanced, task-specific capabilities for compositional reasoning, which hinders their progress toward truly competent general vision models. To address this, we present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems by leveraging their intrinsic capabilities (e.g. grounding and visual understanding capabilities). Different from the previous shortcut learning mechanism, our approach introduces a human-like understanding-thinking-answering process, allowing the model to complete all steps in a single forward pass without the need for multiple inferences or external tools. This design bridges the gap between foundational visual capabilities and general question answering, encouraging LMMs to generate faithful and traceable responses for complex visual reasoning. Meanwhile, we curate 334K visual instruction samples covering both general scenes and text-rich scenes and involving multiple foundational visual capabilities. Our trained model, Griffon-R, is capable of end-to-end automatic understanding, self-thinking, and reasoning out answers. Comprehensive experiments show that Griffon-R not only achieves improved performance on complex visual reasoning benchmarks including VSR and CLEVR, but also enhances multimodal capabilities across various benchmarks like MMBench and ScienceQA. Data, models, and code will be released at https://github.com/jefferyZhan/Griffon/tree/master/Griffon-R soon.
中文摘要:该研究提出了一种统一视觉推理机制,通过类人认知流程实现大型多模态模型的单次推理组合问题求解,在复杂推理基准测试中展现出卓越性能。
English Summary: The study introduces a unified visual reasoning mechanism for Large Multimodal Models that enables single-pass compositional problem-solving through human-like cognitive processes, achieving superior performance on complex reasoning benchmarks.

Authors:Yanran Tang, Ruihong Qiu, Zi Huang
Title: UQLegalAI@COLIEE2025: Advancing Legal Case Retrieval with Large Language Models and Graph Neural Networks
Abstract:
Legal case retrieval plays a pivotal role in the legal domain by facilitating the efficient identification of relevant cases, helping legal professionals and researchers propose legal arguments and make informed decisions. To improve retrieval accuracy, the Competition on Legal Information Extraction and Entailment (COLIEE) is held annually, offering updated benchmark datasets for evaluation. This paper presents a detailed description of CaseLink, the method employed by UQLegalAI, the second-highest-ranked team in Task 1 of COLIEE 2025. The CaseLink model utilises inductive graph learning and Global Case Graphs to capture the intrinsic case connectivity to improve the accuracy of legal case retrieval. Specifically, a large language model specialized in text embedding is employed to transform legal texts into embeddings, which serve as the feature representations of the nodes in the constructed case graph. A new contrastive objective, incorporating a regularization on the degree of case nodes, is proposed to leverage the information within the case reference relationship for model optimization. The main codebase used in our method is based on the open-source repository of CaseLink: https://github.com/yanran-tang/CaseLink.
中文: CaseLink模型通过归纳图学习和全局案例图捕捉案例内在关联性,利用专业大语言模型生成文本嵌入并采用带节点度正则化的新型对比目标,有效提升了法律案例检索的准确性。
English: The CaseLink model enhances legal case retrieval accuracy by using inductive graph learning and Global Case Graphs to capture case connectivity, employing a specialized large language model for text embeddings and a novel contrastive objective with node degree regularization.

Authors:Sunwoo Kim, Soo Yong Lee, Jaemin Yoo, Kijung Shin
Title: 'Hello, World!': Making GNNs Talk with LLMs
Abstract:
While graph neural networks (GNNs) have shown remarkable performance across diverse graph-related tasks, their high-dimensional hidden representations render them black boxes. In this work, we propose Graph Lingual Network (GLN), a GNN built on large language models (LLMs), with hidden representations in the form of human-readable text. Through careful prompt design, GLN incorporates not only the message passing module of GNNs but also advanced GNN techniques, including graph attention and initial residual connection. The comprehensibility of GLN's hidden representations enables an intuitive analysis of how node representations change (1) across layers and (2) under advanced GNN techniques, shedding light on the inner workings of GNNs. Furthermore, we demonstrate that GLN achieves strong zero-shot performance on node classification and link prediction, outperforming existing LLM-based baseline methods.
中文: 图语言网络(GLN)将大语言模型与图神经网络相结合,生成可解释的文本形式隐藏表示,不仅直观地揭示了GNN的内部工作机制,还在节点分类和链接预测任务中实现了卓越的零样本性能。
English: The Graph Lingual Network (GLN) integrates large language models with graph neural networks to create interpretable, text-based hidden representations, enabling intuitive analysis of GNN operations and achieving superior zero-shot performance in node classification and link prediction tasks.

Authors:Xiangyu Zhao, Wanghan Xu, Bo Liu, Yuhao Zhou, Fenghua Ling, Ben Fei, Xiaoyu Yue, Lei Bai, Wenlong Zhang, Xiao-Ming Wu
Title: MSEarth: A Benchmark for Multimodal Scientific Comprehension of Earth Science
Abstract:
The rapid advancement of multimodal large language models (MLLMs) has unlocked new opportunities to tackle complex scientific challenges. Despite this progress, their application in addressing earth science problems, especially at the graduate level, remains underexplored. A significant barrier is the absence of benchmarks that capture the depth and contextual complexity of geoscientific reasoning. Current benchmarks often rely on synthetic datasets or simplistic figure-caption pairs, which do not adequately reflect the intricate reasoning and domain-specific insights required for real-world scientific applications. To address these gaps, we introduce MSEarth, a multimodal scientific benchmark curated from high-quality, open-access scientific publications. MSEarth encompasses the five major spheres of Earth science: atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere, featuring over 7K figures with refined captions. These captions are crafted from the original figure captions and enriched with discussions and reasoning from the papers, ensuring the benchmark captures the nuanced reasoning and knowledge-intensive content essential for advanced scientific tasks. MSEarth supports a variety of tasks, including scientific figure captioning, multiple choice questions, and open-ended reasoning challenges. By bridging the gap in graduate-level benchmarks, MSEarth provides a scalable and high-fidelity resource to enhance the development and evaluation of MLLMs in scientific reasoning. The benchmark is publicly available to foster further research and innovation in this field. Resources related to this benchmark can be found at https://huggingface.co/MSEarth and https://github.com/xiangyu-mm/MSEarth.
中文: MSEarth的推出填补了地球科学领域多模态大语言模型在研究生级别基准测试上的空白,通过基于科学文献构建的全面数据集,提升了科学推理与评估能力。
English: The introduction of MSEarth addresses the lack of graduate-level benchmarks for multimodal large language models in earth science by providing a comprehensive dataset derived from scientific publications to enhance scientific reasoning and evaluation.

Authors:Hanlin Wang, Chak Tou Leong, Jiashuo Wang, Jian Wang, Wenjie Li
Title: SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution
Abstract:
Reinforcement learning (RL) holds significant promise for training LLM agents to handle complex, goal-oriented tasks that require multi-step interactions with external environments. However, a critical challenge when applying RL to these agentic tasks arises from delayed rewards: feedback signals are typically available only after the entire task is completed. This makes it non-trivial to assign delayed rewards to earlier actions, providing insufficient guidance regarding environmental constraints and hindering agent training. In this work, we draw on the insight that the ultimate completion of a task emerges from the cumulative progress an agent makes across individual steps. We propose Stepwise Progress Attribution (SPA), a general reward redistribution framework that decomposes the final reward into stepwise contributions, each reflecting its incremental progress toward overall task completion. To achieve this, we train a progress estimator that accumulates stepwise contributions over a trajectory to match the task completion. During policy optimization, we combine the estimated per-step contribution with a grounding signal for actions executed in the environment as the fine-grained, intermediate reward for effective agent training. Extensive experiments on common agent benchmarks (including Webshop, ALFWorld, and VirtualHome) demonstrate that SPA consistently outperforms the state-of-the-art method in both success rate (+2.5% on average) and grounding accuracy (+1.9% on average). Further analyses demonstrate that our method provides more effective intermediate rewards for RL training. Our code is available at https://github.com/WangHanLinHenry/SPA-RL-Agent.
中文摘要:强化学习在训练LLM智能体时面临延迟奖励的挑战,而提出的逐步进展归因(SPA)框架通过将最终奖励分解为逐步贡献来解决这一问题,有效提升训练效果和性能表现。
English Summary: Reinforcement learning faces delayed reward challenges in training LLM agents, which the proposed Stepwise Progress Attribution (SPA) framework addresses by decomposing final rewards into stepwise contributions to improve training effectiveness and performance.
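Illustrative sketch: the abstract describes a progress estimator whose per-step contributions accumulate to match task completion, and an intermediate reward that combines each contribution with a grounding signal. The toy code below expresses both ideas numerically. The squared-error loss, the additive combination, the weight `beta`, and the function names are assumptions made for illustration and are not taken from the SPA-RL code.

```python
import numpy as np

def progress_loss(step_contribs, task_completion):
    """Squared error between the summed contributions and the final outcome."""
    return float((np.sum(step_contribs) - task_completion) ** 2)

def intermediate_rewards(step_contribs, grounding, beta=0.5):
    """Per-step reward: estimated contribution plus a weighted grounding signal.

    grounding[t] is assumed to be 1 if action t executed successfully in the
    environment and 0 otherwise.
    """
    return np.asarray(step_contribs, dtype=float) + beta * np.asarray(grounding, dtype=float)

# Hypothetical 3-step trajectory that ends in success (completion = 1.0).
contribs = np.array([0.1, 0.3, 0.6])
loss = progress_loss(contribs, task_completion=1.0)
rewards = intermediate_rewards(contribs, grounding=[1, 0, 1], beta=0.5)
```

Because the contributions already sum to the final outcome, the loss is essentially zero, and each step receives its own dense reward instead of a single delayed one.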

Authors:Reza Nematirad, Anil Pahwa, Balasubramaniam Natarajan
Title: Are Data Embeddings effective in time series forecasting?
Abstract:
Time series forecasting plays a crucial role in many real-world applications, and numerous complex forecasting models have been proposed in recent years. Despite their architectural innovations, most state-of-the-art models report only marginal improvements -- typically just a few thousandths in standard error metrics. These models often incorporate complex data embedding layers to transform raw inputs into higher-dimensional representations to enhance accuracy. But are data embedding techniques actually effective in time series forecasting? Through extensive ablation studies across fifteen state-of-the-art models and four benchmark datasets, we find that removing data embedding layers from many state-of-the-art models does not degrade forecasting performance. In many cases, it improves both accuracy and computational efficiency. The gains from removing embedding layers often exceed the performance differences typically reported between competing models. Code available at: https://github.com/neuripsdataembedidng/DataEmbedding
Chinese: 最新研究表明,去除许多先进时间序列预测模型中的复杂数据嵌入层不会降低预测性能,在许多情况下还能同时提升预测精度和计算效率,其增益往往超过竞争模型之间通常报告的性能差异。
English: Recent research reveals that removing complex data embedding layers from many advanced time series forecasting models does not degrade forecasting performance and in many cases improves both accuracy and computational efficiency, with gains that often exceed the performance differences typically reported between competing models.

Authors:Fuwen Luo, Shengfeng Lou, Chi Chen, Ziyue Wang, Chenliang Li, Weizhou Shen, Jiyue Guo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Title: MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
Abstract:
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. Despite recent advances in general video understanding, current MLLMs still struggle with fine-grained temporal reasoning. While reinforcement learning (RL) has been explored to address this issue recently, existing RL approaches remain limited in effectiveness. In this work, we propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios. View our project at https://github.com/THUNLP-MT/MUSEG.
中文:MUSEG是一种基于强化学习的新方法,通过引入时间戳感知的多片段定位和分阶段奖励训练,显著提升了多模态大语言模型的时序理解能力,在时序推理任务中明显优于现有方法。
English: MUSEG is a novel reinforcement learning method that enhances multimodal large language models' temporal understanding by enabling timestamp-aware multi-segment grounding and phased reward training, significantly outperforming existing approaches in temporal reasoning tasks.

Authors:Zechen Li, Lanqing Yang, Yiheng Bian, Hao Pan, Yongjian Fu, Yezhou Wang, Yi-Chao Chen, Guangtao Xue, Ju Ren
Title: Wideband RF Radiance Field Modeling Using Frequency-embedded 3D Gaussian Splatting
Abstract:
This paper presents an innovative frequency-embedded 3D Gaussian splatting (3DGS) algorithm for wideband radio-frequency (RF) radiance field modeling, offering an advancement over the existing works limited to single-frequency modeling. Grounded in fundamental physics, we uncover the complex relationship between EM wave propagation behaviors and RF frequencies. Inspired by this, we design an EM feature network with attenuation and radiance modules to learn the complex relationships between RF frequencies and the key properties of each 3D Gaussian, specifically the attenuation factor and RF signal intensity. By training the frequency-embedded 3DGS model, we can efficiently reconstruct RF radiance fields at arbitrary unknown frequencies within a given 3D environment. Finally, we propose a large-scale power angular spectrum (PAS) dataset containing 50000 samples ranging from 1 to 100 GHz in 6 indoor environments, and conduct extensive experiments to verify the effectiveness of our method. Our approach achieves an average Structural Similarity Index Measure (SSIM) up to 0.72, and a significant improvement up to 17.8% compared to the current state-of-the-art (SOTA) methods trained on individual test frequencies. Additionally, our method achieves an SSIM of 0.70 without prior training on these frequencies, which represents only a 2.8% performance drop compared to models trained with full PAS data. This demonstrates our model's capability to estimate PAS at unknown frequencies. For related code and datasets, please refer to https://github.com/sim-2-real/Wideband3DGS.
中文: 本文提出了一种频率嵌入的3D高斯泼溅算法,通过建立电磁特征网络实现了宽带射频辐射场的任意频率重建,相比现有方法性能提升最高达17.8%,并在未训练频率上仍保持优异性能。
English: This paper introduces a frequency-embedded 3D Gaussian splatting algorithm that advances wideband RF radiance field modeling by enabling reconstruction at arbitrary frequencies, achieving up to 17.8% improvement over existing methods and demonstrating robust performance even on untrained frequencies.

Authors:Woomin Song, Jihoon Tack, Sangwoo Mo, Seunghyuk Oh, Jinwoo Shin
Title: Sparsified State-Space Models are Efficient Highway Networks
Abstract:
State-space models (SSMs) offer a promising architecture for sequence modeling, providing an alternative to Transformers by replacing expensive self-attention with linear recurrences. In this paper, we propose a simple yet effective trick to enhance SSMs within given computational budgets by sparsifying them. Our intuition is that tokens in SSMs are highly redundant due to gradual recurrent updates, and dense recurrence operations block the delivery of past information. In particular, we observe that upper layers of SSMs tend to be more redundant as they encode global information, while lower layers encode local information. Motivated by this, we introduce Simba, a hierarchical sparsification method for SSMs based on token pruning. Simba sparsifies upper layers more than lower layers, encouraging the upper layers to behave like highways. To achieve this, we propose a novel token pruning criterion for SSMs, measuring the global impact of tokens on the final output by accumulating local recurrences. We demonstrate that Simba outperforms the baseline model, Mamba, with the same FLOPS in various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences. Code is available at https://github.com/woominsong/Simba.
中文: 本文提出Simba方法,通过对状态空间模型进行分层稀疏化处理,在高层实施更激进的令牌剪枝以构建信息高速通道,在相同计算量下显著提升了Mamba基线模型在多项自然语言任务中的性能表现。
English: This paper introduces Simba, a hierarchical sparsification method for state-space models that prunes tokens more aggressively in upper layers to create information highways, improving both efficiency and performance over the baseline Mamba model across various language tasks.

Authors:Mingxuan Sun, Juntao Jiang, Zhiqiang Yang, Shenao Kong, Jiamin Qi, Jianru Shang, Shuangling Luo, Wanfa Sun, Tianyi Wang, Yanqi Wang, Qixuan Wang, Tingjian Dai, Tianxiang Chen, Jinming Zhang, Xuerui Zhang, Yuepeng He, Pengcheng Fu, Qiu Guan, Shizheng Zhou, Yanbo Yu, Qigui Jiang, Teng Zhou, Liuyong Shi, Hong Yan
Title: VisAlgae 2023: A Dataset and Challenge for Algae Detection in Microscopy Images
Abstract:
Microalgae, vital for ecological balance and economic sectors, present challenges in detection due to their diverse sizes and conditions. This paper summarizes the second "Vision Meets Algae" (VisAlgae 2023) Challenge, aiming to enhance high-throughput microalgae cell detection. The challenge, which attracted 369 participating teams, includes a dataset of 1000 images across six classes, featuring microalgae of varying sizes and distinct features. Participants faced tasks such as detecting small targets, handling motion blur, and complex backgrounds. The top 10 methods, outlined here, offer insights into overcoming these challenges and maximizing detection accuracy. This intersection of algae research and computer vision offers promise for ecological understanding and technological advancement. The dataset can be accessed at: https://github.com/juntaoJianggavin/Visalgae2023/.
中文:VisAlgae 2023挑战赛通过369支参赛团队利用多样化数据集应对小目标识别和复杂背景等难题,推动了高通量微藻检测技术的发展,其优胜方案为生态研究和技术创新提供了重要参考。
English: The VisAlgae 2023 Challenge advanced high-throughput microalgae detection by engaging 369 teams to address challenges like small target identification and complex backgrounds using a diverse dataset, with top methods providing valuable insights for ecological and technological progress.

Authors:Kianté Brantley, Mingyu Chen, Zhaolin Gao, Jason D. Lee, Wen Sun, Wenhao Zhan, Xuezhou Zhang
Title: Accelerating RL for LLM Reasoning with Optimal Advantage Regression
Abstract:
Reinforcement learning (RL) has emerged as a powerful tool for fine-tuning large language models (LLMs) to improve complex reasoning abilities. However, state-of-the-art policy optimization methods often suffer from high computational overhead and memory consumption, primarily due to the need for multiple generations per prompt and the reliance on critic networks or advantage estimates of the current policy. In this paper, we propose A*-PO, a novel two-stage policy optimization framework that directly approximates the optimal advantage function and enables efficient training of LLMs for reasoning tasks. In the first stage, we leverage offline sampling from a reference policy to estimate the optimal value function V*, eliminating the need for costly online value estimation. In the second stage, we perform on-policy updates using a simple least-squares regression loss with only a single generation per prompt. Theoretically, we establish performance guarantees and prove that the KL-regularized RL objective can be optimized without requiring complex exploration strategies. Empirically, A*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks, while reducing training time by up to 2x and peak memory usage by over 30% compared to PPO, GRPO, and REBEL. Implementation of A*-PO can be found at https://github.com/ZhaolinGao/A-PO.
中文: 提出的A*-PO框架通过离线采样估计最优优势函数并进行在线策略更新,有效优化大语言模型的推理能力,在保持竞争力的同时显著降低了计算和内存开销。
English: The proposed A*-PO framework efficiently optimizes large language models for reasoning tasks by estimating the optimal advantage function through offline sampling and on-policy updates, achieving competitive performance while significantly reducing computational and memory costs compared to existing methods.
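Illustrative sketch: in KL-regularized RL with coefficient β, the optimal value function has the standard closed form V*(x) = β log E_{y ~ π_ref}[exp(r(x, y) / β)], which can be approximated from offline samples of the reference policy. The toy code below estimates V* this way and then evaluates a least-squares loss that regresses β log(π/π_ref) onto the optimal advantage r − V*. The exact loss form and all function names are assumptions made for illustration and are not taken from the A*-PO implementation.

```python
import numpy as np

def estimate_v_star(rewards, beta):
    """Monte Carlo estimate of beta * log E[exp(r / beta)] from offline samples."""
    scaled = np.asarray(rewards, dtype=float) / beta
    m = scaled.max()  # subtract the max for numerical stability
    return float(beta * (m + np.log(np.mean(np.exp(scaled - m)))))

def advantage_regression_loss(logp, logp_ref, reward, v_star, beta):
    """Squared error between beta * log-ratio and the optimal advantage."""
    return float((beta * (logp - logp_ref) - (reward - v_star)) ** 2)

# Stage 1 (offline): rewards of several reference-policy samples for one prompt.
v_star = estimate_v_star([0.0, 1.0, 1.0, 0.0], beta=1.0)

# Stage 2 (online): a single generation per prompt, with hypothetical log-probs.
loss = advantage_regression_loss(
    logp=-3.0, logp_ref=-3.5, reward=1.0, v_star=v_star, beta=1.0
)
```

The estimate of V* always lies between the mean and the maximum of the sampled rewards, approaching the maximum as β shrinks.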

Authors:Danush Khanna, Pratinav Seth, Sidhaarth Sredharan Murali, Aditya Kumar Guru, Siddharth Shukla, Tanuj Tyagi, Sandeep Chaurasia, Kripabandhu Ghosh
Title: SELF-PERCEPT: Introspection Improves Large Language Models' Detection of Multi-Person Mental Manipulation in Conversations
Abstract:
Mental manipulation is a subtle yet pervasive form of abuse in interpersonal communication, making its detection critical for safeguarding potential victims. However, due to manipulation's nuanced and context-specific nature, identifying manipulative language in complex, multi-turn, and multi-person conversations remains a significant challenge for large language models (LLMs). To address this gap, we introduce the MultiManip dataset, comprising 220 multi-turn, multi-person dialogues balanced between manipulative and non-manipulative interactions, all drawn from reality shows that mimic real-world scenarios. For manipulative interactions, it includes 11 distinct manipulations depicting real-life scenarios. We conduct extensive evaluations of state-of-the-art LLMs, such as GPT-4o and Llama-3.1-8B, employing various prompting strategies. Despite their capabilities, these models often struggle to detect manipulation effectively. To overcome this limitation, we propose SELF-PERCEPT, a novel, two-stage prompting framework inspired by Self-Perception Theory, demonstrating strong performance in detecting multi-person, multi-turn mental manipulation. Our code and data are publicly available at https://github.com/danushkhanna/self-percept .
中文: 本研究提出了MultiManip数据集和SELF-PERCEPT框架,以解决大型语言模型在多轮、多人对话中检测微妙心理操纵的难题,该框架在这一任务上展现出较强的检测性能。
English: The study introduces the MultiManip dataset and the SELF-PERCEPT framework to address LLMs' difficulty in detecting nuanced mental manipulation in multi-turn, multi-person dialogues, with SELF-PERCEPT demonstrating strong detection performance where models such as GPT-4o and Llama-3.1-8B often struggle under standard prompting strategies.

Authors:Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, Zhouhan Lin
Title: Pretraining Language Models to Ponder in Continuous Space
Abstract:
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. For language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice the number of parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, PonderingPythia-2.8B surpasses Pythia-6.9B, and PonderingPythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at https://github.com/LUMIA-Group/PonderingLM.
中文摘要:本研究通过单次生成步骤中迭代处理词嵌入的方式,为语言模型引入“深思”机制,使模型能通过自监督学习以更少参数实现与大型模型相当的性能。
English Summary: This study introduces a "pondering" mechanism into language models by iteratively processing token embeddings within a single generation step, enabling models to achieve performance comparable to larger counterparts with fewer parameters through self-supervised learning.
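The pondering step described in this abstract (a probability-weighted sum of all token embeddings, fed back for another forward pass) can be illustrated with a minimal sketch. It is not the authors' implementation: `forward` is a hypothetical stand-in for a language model that maps one input embedding to vocabulary logits, and the two-token vocabulary and toy model are invented for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pondering_embedding(logits, embedding_table):
    """Probability-weighted sum of all token embeddings (no token is sampled)."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)]

def ponder(forward, input_embedding, embedding_table, steps=1):
    """Run `steps` pondering passes, then one final forward pass.

    `forward` is a hypothetical callable: embedding -> list of logits.
    """
    current = input_embedding
    for _ in range(steps):
        logits = forward(current)
        current = pondering_embedding(logits, embedding_table)
    return forward(current)

# Toy setup (assumed): one-hot embeddings and a made-up "model"
# whose logits are simply twice its input embedding.
table = [[1.0, 0.0], [0.0, 1.0]]
toy_forward = lambda emb: [2.0 * x for x in emb]
final_logits = ponder(toy_forward, [0.0, 0.0], table, steps=1)
```

How the generated embedding is combined with the original input inside the real model is not specified in the abstract, so the sketch simply replaces the input with it.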

Authors:Xuanle Zhao, Zilin Sang, Yuxuan Li, Qi Shi, Weilun Zhao, Shuo Wang, Duzhen Zhang, Xu Han, Zhiyuan Liu, Maosong Sun
Title: AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage
Abstract:
Efficient experiment reproduction is critical to accelerating progress in artificial intelligence. However, the inherent complexity of method design and training procedures presents substantial challenges for automation. Notably, reproducing experiments often requires implicit domain-specific knowledge not explicitly documented in the original papers. To address this, we introduce the paper lineage algorithm, which identifies and extracts implicit knowledge from the relevant references cited by the target paper. Building on this idea, we propose AutoReproduce, a multi-agent framework capable of automatically reproducing experiments described in research papers in an end-to-end manner. AutoReproduce enhances code executability by generating unit tests alongside the reproduction process. To evaluate the reproduction capability, we construct ReproduceBench, a benchmark annotated with verified implementations, and introduce novel evaluation metrics to assess both the reproduction and execution fidelity. Experimental results demonstrate that AutoReproduce outperforms the existing strong agent baselines on all five evaluation metrics by a peak margin of over 70%. In particular, compared to the official implementations, AutoReproduce achieves an average performance gap of 22.1% on 89.74% of the executable experiment runs. The code will be available at https://github.com/AI9Stars/AutoReproduce.
Chinese: 针对人工智能实验复现中因隐含领域知识带来的挑战,本文提出了AutoReproduce多智能体框架,能够端到端自动复现实验,并在关键指标上以最高70%的优势超越现有基线方法。
English: To address the challenges of reproducing AI experiments due to implicit domain knowledge, this paper introduces AutoReproduce, a multi-agent framework that automatically reproduces experiments end-to-end and outperforms existing baselines by up to 70% on key metrics.

Authors:Lingyu Qiu, Ke Jiang, Xiaoyang Tan
Title: RoGA: Towards Generalizable Deepfake Detection through Robust Gradient Alignment
Abstract:
Recent advancements in domain generalization for deepfake detection have attracted significant attention, with previous methods often incorporating additional modules to prevent overfitting to domain-specific patterns. However, such regularization can hinder the optimization of the empirical risk minimization (ERM) objective, ultimately degrading model performance. In this paper, we propose a novel learning objective that aligns generalization gradient updates with ERM gradient updates. The key innovation is the application of perturbations to model parameters, aligning the ascending points across domains, which specifically enhances the robustness of deepfake detection models to domain shifts. This approach effectively preserves domain-invariant features while managing domain-specific characteristics, without introducing additional regularization. Experimental results on multiple challenging deepfake detection datasets demonstrate that our gradient alignment strategy outperforms state-of-the-art domain generalization techniques, confirming the efficacy of our method. The code is available at https://github.com/Lynn0925/RoGA.
中文: 本文提出一种新颖的梯度对齐策略,通过协调泛化与经验风险最小化的更新来增强深度伪造检测能力,无需额外正则化即可超越现有领域泛化方法。
English: This paper introduces a novel gradient alignment strategy that enhances deepfake detection by aligning generalization and empirical risk minimization updates, outperforming existing domain generalization methods without extra regularization.

Authors:Mengmeng Chen, Xiaohu Wu, Qiqi Liu, Tiantian He, Yew-Soon Ong, Yaochu Jin, Qicheng Lao, Han Yu
Title: Voronoi-grid-based Pareto Front Learning and Its Application to Collaborative Federated Learning
Abstract:
Multi-objective optimization (MOO) is pervasive in machine learning and aims to find a set of Pareto-optimal solutions, called the Pareto front; for example, it is fundamental to multiple avenues of research in federated learning (FL). Pareto-Front Learning (PFL) is a powerful method implemented using Hypernetworks (PHNs) to approximate the Pareto front. This method enables the acquisition of a mapping function from a given preference vector to the solutions on the Pareto front. However, most existing PFL approaches still face two challenges: (a) sampling rays in high-dimensional spaces; (b) failing to cover the entire Pareto front which has a convex shape. Here, we introduce a novel PFL framework, called PHN-HVVS, which decomposes the design space into Voronoi grids and deploys a genetic algorithm (GA) for Voronoi grid partitioning within high-dimensional space. We put forward a new loss function, which effectively contributes to more extensive coverage of the resultant Pareto front and maximizes the HV Indicator. Experimental results on multiple MOO machine learning tasks demonstrate that PHN-HVVS significantly outperforms the baselines in generating the Pareto front. Also, we illustrate that PHN-HVVS advances the methodologies of several recent problems in the FL field. The code is available at https://github.com/buptcmm/phnhvvs.
中文:PHN-HVVS框架通过引入Voronoi网格分解和遗传算法,有效解决了帕累托前沿学习中的覆盖难题,在多项多目标优化任务中展现出优于基准方法的性能。
English: The PHN-HVVS framework addresses challenges in Pareto-Front Learning by employing Voronoi grid decomposition and a genetic algorithm to enhance coverage of the Pareto front, demonstrating superior performance in multi-objective optimization tasks.
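For readers unfamiliar with the term, the notion of a Pareto front used in this abstract can be made concrete with a small sketch. It shows only the generic definition of non-dominated points under minimization, not PHN-HVVS itself; the candidate points are invented for illustration.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical two-objective losses for four candidate solutions.
candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_front(candidates)
# (3.0, 3.0) is dropped because (2.0, 2.0) is better on both objectives.
```

This brute-force filter is quadratic in the number of points, which is fine for small illustrative sets; a learned mapping from preference vectors to solutions, as in the abstract, addresses a different and harder problem.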

Authors:Haoyu Zhang, Yisen Feng, Qiaohui Chu, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
Title: HCQA-1.5 @ Ego4D EgoSchema Challenge 2025
Abstract:
In this report, we present the method that achieves third place for Ego4D EgoSchema Challenge in CVPR 2025. To improve the reliability of answer prediction in egocentric video question answering, we propose an effective extension to the previously proposed HCQA framework. Our approach introduces a multi-source aggregation strategy to generate diverse predictions, followed by a confidence-based filtering mechanism that selects high-confidence answers directly. For low-confidence cases, we incorporate a fine-grained reasoning module that performs additional visual and contextual analysis to refine the predictions. Evaluated on the EgoSchema blind test set, our method achieves 77% accuracy on over 5,000 human-curated multiple-choice questions, outperforming last year's winning solution and the majority of participating teams. Our code will be added at https://github.com/Hyu-Zhang/HCQA.
中文: 我们通过改进HCQA框架,采用多源预测聚合与置信度筛选机制,在CVPR 2025 Ego4D EgoSchema挑战赛中荣获第三名,在以自我为中心的视频问答任务上达到77%的准确率。
English: Our method, an enhanced HCQA framework with multi-source prediction aggregation and confidence-based filtering, secured third place in the CVPR 2025 Ego4D EgoSchema Challenge by achieving 77% accuracy on egocentric video QA.
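The control flow outlined in this abstract (aggregate several predictions, accept high-confidence answers directly, send the rest to a finer reasoning step) might look roughly like the sketch below. The confidence definition (share of votes for the top answer), the threshold value, and the `refine` callback are all assumptions made for illustration; the report's actual mechanism may differ.

```python
from collections import Counter

def aggregate_with_confidence(predictions, threshold, refine):
    """Majority-vote over `predictions`; fall back to `refine` when the
    winning answer's vote share is below `threshold`.

    `refine` is a hypothetical callable standing in for a fine-grained
    reasoning module: list of predictions -> final answer.
    """
    votes = Counter(predictions)
    answer, count = votes.most_common(1)[0]
    confidence = count / len(predictions)
    if confidence >= threshold:
        return answer, confidence, "direct"
    return refine(predictions), confidence, "refined"

# Toy usage with a placeholder refinement step that always answers "C".
placeholder_refine = lambda preds: "C"
easy = aggregate_with_confidence(["B", "B", "B", "A"], 0.7, placeholder_refine)
hard = aggregate_with_confidence(["A", "B", "C", "B"], 0.7, placeholder_refine)
```

Here `easy` is resolved directly because three of four sources agree, while `hard` falls below the threshold and is routed to the placeholder refinement step.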

Authors:Yuan Wu, Zhiqiang Yan, Yigong Zhang, Xiang Li, Jian Yang
Title: See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction
Abstract:
Occupancy prediction aims to estimate the 3D spatial distribution of occupied regions along with their corresponding semantic labels. Existing vision-based methods perform well on daytime benchmarks but struggle in nighttime scenarios due to limited visibility and challenging lighting conditions. To address these challenges, we propose LIAR, a novel framework that learns illumination-affined representations. LIAR first introduces Selective Low-light Image Enhancement (SLLIE), which leverages the illumination priors from daytime scenes to adaptively determine whether a nighttime image is genuinely dark or sufficiently well-lit, enabling more targeted global enhancement. Building on the illumination maps generated by SLLIE, LIAR further incorporates two illumination-aware components: 2D Illumination-guided Sampling (2D-IGS) and 3D Illumination-driven Projection (3D-IDP), to respectively tackle local underexposure and overexposure. Specifically, 2D-IGS modulates feature sampling positions according to illumination maps, assigning larger offsets to darker regions and smaller ones to brighter regions, thereby alleviating feature degradation in underexposed areas. Subsequently, 3D-IDP enhances semantic understanding in overexposed regions by constructing illumination intensity fields and supplying refined residual queries to the BEV context refinement process. Extensive experiments on both real and synthetic datasets demonstrate the superior performance of LIAR under challenging nighttime scenarios. The source code and pretrained models are available at https://github.com/yanzq95/LIAR.
中文: LIAR是一种创新框架,通过学习光照适应表征,利用选择性低光图像增强和光照感知组件解决夜间场景中的欠曝光和过曝光问题,在挑战性条件下表现出卓越性能。
English: LIAR is a novel framework that enhances nighttime occupancy prediction by learning illumination-affined representations, using selective low-light image enhancement and illumination-aware components to address underexposure and overexposure, achieving superior performance in challenging conditions.

Authors:Guiping Cao, Tao Wang, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang
Title: Open-Det: An Efficient Learning Framework for Open-Ended Detection
Abstract:
Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, the existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation process by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, combined with a Prompts Distiller that transfers knowledge from the VLM into VL-prompts, enabling accurate object name generation for the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det, using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100), achieves even higher performance (+1.0% in APr). The source code is available at: https://github.com/Med-Process/Open-Det.
中文: Open-Det框架通过加速训练和弥合视觉与语言间的语义差距,有效解决了现有开放目标检测模型的不足,仅用少量数据和计算资源便实现了更优性能。
English: The Open-Det framework efficiently addresses the limitations of existing Open-Ended object Detection models by accelerating training and bridging the vision-language gap, achieving superior performance with significantly reduced data and computational resources.
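The efficiency ratios quoted in this abstract follow directly from the raw figures it gives; a short check, using only the numbers stated there:

```python
# Figures as quoted in the abstract (training data in millions, epochs).
open_det_data, generateu_data = 0.077, 5.077
open_det_epochs, generateu_epochs = 31, 149

data_fraction = open_det_data / generateu_data
epoch_fraction = open_det_epochs / generateu_epochs

data_pct = f"{data_fraction:.1%}"    # share of training data used
epoch_pct = f"{epoch_fraction:.1%}"  # share of training epochs used
print(data_pct, epoch_pct)
```

Both values round to the percentages reported in the abstract (1.5% and 20.8%).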

Authors:Wenhao You, Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Zhongyu Ouyang, Chiyu Ma, Tingxuan Wu, Noah Wei, Zong Ke, Ming Cheng, Soroush Vosoughi, Jiang Gui
Title: Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs
Abstract:
While recent Multimodal Large Language Models exhibit impressive capabilities for general multimodal tasks, specialized domains like music necessitate tailored approaches. Music Audio-Visual Question Answering (Music AVQA) particularly underscores this, presenting unique challenges with its continuous, densely layered audio-visual content, intricate temporal dynamics, and the critical need for domain-specific knowledge. Through a systematic analysis of Music AVQA datasets and methods, this position paper identifies that specialized input processing, architectures incorporating dedicated spatial-temporal designs, and music-specific modeling strategies are critical for success in this domain. Our study provides valuable insights for researchers by highlighting effective design patterns empirically linked to strong performance, proposing concrete future directions for incorporating musical priors, and aiming to establish a robust foundation for advancing multimodal musical understanding. This work is intended to inspire broader attention and further research, supported by a continuously updated anonymous GitHub repository of relevant papers: https://github.com/xid32/Survey4MusicAVQA.
中文摘要:本文强调音乐视听问答领域需要专门的多模态方法,指出必须采用定制化输入处理、时空架构设计和音乐特定建模策略,以应对该领域连续密集的视听内容及复杂时序动态等独特挑战。
English Summary: This position paper emphasizes the need for specialized multimodal approaches in music audio-visual question answering, highlighting the importance of tailored input processing, spatiotemporal architectures, and music-specific modeling strategies to address the domain's unique challenges.

Authors:Juan Ramirez, Meraj Hashemizadeh, Simon Lacoste-Julien
Title: Position: Adopt Constraints Over Penalties in Deep Learning
Abstract:
Recent efforts to develop trustworthy AI systems with accountability guarantees have led to widespread use of machine learning formulations incorporating external requirements, or constraints. These requirements are often enforced via penalization, that is, adding fixed-weight terms to the task loss. We argue this approach is fundamentally ill-suited since there may be no penalty coefficient that simultaneously ensures constraint satisfaction and optimal constrained performance, i.e., that truly solves the constrained problem. Moreover, tuning these coefficients requires costly trial-and-error, incurring significant time and computational overhead. We, therefore, advocate for broader adoption of tailored constrained optimization methods, such as the Lagrangian approach, which jointly optimizes the penalization "coefficients" (the Lagrange multipliers) and the model parameters. Such methods (i) truly solve the constrained problem and do so accountably, by clearly defining feasibility and verifying when it is achieved, (ii) eliminate the need for extensive penalty tuning, and (iii) integrate seamlessly with modern deep learning pipelines.
中文摘要:当前AI系统常采用固定惩罚项来施加约束,但这种方法既无法确保约束满足与最优性能,又需耗费大量调参成本,因此采用拉格朗日法等定制化约束优化方法更能有效实现可信AI的可问责发展。
English summary: Current AI systems often use fixed penalty terms to enforce constraints, but this approach fails to guarantee both constraint satisfaction and optimal performance while requiring costly tuning, making tailored constrained optimization methods like the Lagrangian approach more effective for accountable AI development.
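The Lagrangian approach mentioned in this abstract, in which the multiplier is optimized jointly with the parameters instead of being fixed by hand, can be illustrated on a one-dimensional toy problem. The problem (minimize (x - 2)^2 subject to x <= 1), the step size, and the iteration count are assumptions chosen for illustration; this is plain gradient descent-ascent, not code from the paper.

```python
def lagrangian_descent_ascent(lr=0.05, steps=2000):
    """Solve  min (x - 2)^2  s.t.  x - 1 <= 0  by gradient descent on x
    and projected gradient ascent on the multiplier lam >= 0."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        constraint = x - 1.0                 # g(x); feasible when <= 0
        grad_x = 2.0 * (x - 2.0) + lam       # d/dx [f(x) + lam * g(x)]
        x_next = x - lr * grad_x             # descent step on the parameter
        lam = max(0.0, lam + lr * constraint)  # ascent step, kept non-negative
        x = x_next
    return x, lam

x_star, lam_star = lagrangian_descent_ascent()
# The iterates settle at the constrained optimum x = 1 with multiplier 2,
# where the objective gradient (-2) is balanced by the multiplier.
```

No penalty weight had to be guessed in advance: the multiplier grows only while the constraint is violated and stops once feasibility and stationarity hold together.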

Authors:Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
Title: Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
Abstract:
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 16.8 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl/ and https://universe.roboflow.com/rf100-vl/
中文: 视觉语言模型在常见物体检测上表现出色,但难以泛化到分布外概念,为此我们提出Roboflow100-VL基准,通过多模态指令实现少样本概念对齐以解决这一局限。
English: Vision-language models excel at detecting common objects but struggle with out-of-distribution concepts, prompting the introduction of Roboflow100-VL for few-shot alignment using multimodal instructions to address this limitation.

Authors:Mahdi Pourmirzaei, Farzaneh Esmaili, Salhuldin Alqarghuli, Mohammadreza Pourmirzaei, Ye Han, Kai Chen, Mohsen Rezaei, Duolin Wang, Dong Xu
Title: Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction
Abstract:
The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions, from sequence-level properties and residue-specific attributes to complex inter-protein interactions, into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi-task learning, enabling a single model to master numerous tasks with improved efficiency. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's strong predictive power in different types of protein-prediction tasks. Key results include significant speedups (e.g., near 1000x over AlphaFold2 with MSA) and performance often matching or exceeding specialized approaches. Beyond that, we introduce an auxiliary self-supervised decoder pre-training approach to improve spatially sensitive task performance. Prot2Token thus offers a significant step towards a versatile, high-throughput paradigm for protein modeling, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token.
中文: Prot2Token提出了一种统一框架,将多种蛋白质预测任务转化为标准化的下一标记预测格式,通过多任务学习实现了与专业模型相当或更优的性能,并显著提升了计算效率。
English: Prot2Token introduces a unified framework that transforms various protein prediction tasks into a standardized next-token prediction format, enabling efficient multi-task learning with performance matching or exceeding specialized models while achieving significant speedups.

Authors:Can Polat, Mehmet Tuncel, Mustafa Kurban, Erchin Serpedin, Hasan Kurban
Title: xChemAgents: Agentic AI for Explainable Quantum Chemistry
Abstract:
Recent progress in multimodal graph neural networks has demonstrated that augmenting atomic XYZ geometries with textual chemical descriptors can enhance predictive accuracy across a range of electronic and thermodynamic properties. However, naively appending large sets of heterogeneous descriptors often degrades performance on tasks sensitive to molecular shape or symmetry, and undermines interpretability. xChemAgents proposes a cooperative agent framework that injects physics-aware reasoning into multimodal property prediction. xChemAgents comprises two language-model-based agents: a Selector, which adaptively identifies a sparse, weighted subset of descriptors relevant to each target, and provides a natural language rationale; and a Validator, which enforces physical constraints such as unit consistency and scaling laws through iterative dialogue. On standard benchmark datasets, xChemAgents achieves up to a 22% reduction in mean absolute error over the state-of-the-art baselines, while producing faithful, human-interpretable explanations. Experiment results highlight the potential of cooperative, self-verifying agents to enhance both accuracy and transparency in foundation-model-driven materials science. The implementation and accompanying dataset are available at https://github.com/KurbanIntelligenceLab/xChemAgents.
中文摘要:xChemAgents提出了一种协作代理框架,通过自适应选择相关化学描述符并强制执行物理约束,在多模态性质预测中实现了22%的误差降低,同时提高了模型的可解释性。
English Summary: xChemAgents introduces a cooperative agent framework that enhances multimodal property prediction by adaptively selecting relevant chemical descriptors and enforcing physical constraints, achieving a 22% error reduction while improving interpretability.

Authors:Jihoon Lee, Min Song
Title: Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models
Abstract:
Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.
中文: RVCD是一种先进的对比解码方法,通过在逻辑层面同时利用负面和正面的AI生成图像,有效抑制视觉语言模型中的物体幻觉问题,相比现有方法展现出显著改进。
English: RVCD is an advanced contrastive decoding method that suppresses object hallucination in vision-language models by leveraging both negative and positive AI-generated images at the logit level, showing significant improvements over existing approaches.

Authors:Shenao Zhang, Yaqing Wang, Yinxiao Liu, Tianqi Liu, Peter Grabowski, Eugene Ie, Zhaoran Wang, Yunxuan Li
Title: Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
Abstract:
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as backtracking and error correction. However, conventional Markovian RL confines exploration to the training phase to learn an optimal deterministic policy and depends on the history contexts only through the current state. Therefore, it remains unclear whether reflective reasoning will emerge during Markovian RL training, or why it is beneficial at test time. To remedy this, we recast reflective exploration within the Bayes-Adaptive RL framework, which explicitly optimizes the expected return under a posterior distribution over Markov decision processes. This Bayesian formulation inherently incentivizes both reward-maximizing exploitation and information-gathering exploration via belief updates. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on the observed outcomes, offering principled guidance on when and how the model should reflectively explore. Empirical results on both synthetic and mathematical reasoning tasks demonstrate that BARL outperforms standard Markovian RL approaches at test time, achieving superior token efficiency with improved exploration effectiveness. Our code is available at https://github.com/shenao-zhang/BARL.
Chinese Summary: 本研究提出BARL这一贝叶斯强化学习框架,通过增强大语言模型的反思性探索与策略适应能力,在推理任务中实现了优于传统方法的性能与更高的效率。
English Summary: The study introduces BARL, a Bayesian reinforcement learning framework that enhances large language models' reflective exploration and strategy adaptation, outperforming standard methods in reasoning tasks with greater efficiency.

Authors:Elias Arbash, Ahmed Jamal Afifi, Ymane Belahsen, Margret Fuchs, Pedram Ghamisi, Paul Scheunders, Richard Gloaguen
Title: Electrolyzers-HSI: Close-Range Multi-Scene Hyperspectral Imaging Benchmark Dataset
Abstract:
The global challenge of sustainable recycling demands automated, fast, and accurate, state-of-the-art (SOTA) material detection systems that act as a bedrock for a circular economy. Democratizing access to these cutting-edge solutions that enable real-time waste analysis is essential for scaling up recycling efforts and fostering the Green Deal. In response, we introduce Electrolyzers-HSI, a novel multimodal benchmark dataset designed to accelerate the recovery of critical raw materials through accurate electrolyzer materials classification. The dataset comprises 55 co-registered high-resolution RGB images and hyperspectral imaging (HSI) data cubes spanning the 400-2500 nm spectral range, yielding over 4.2 million pixel vectors and 424,169 labeled ones. This enables non-invasive spectral analysis of shredded electrolyzer samples, supporting quantitative and qualitative material classification and spectral properties investigation. We evaluate a suite of baseline machine learning (ML) methods alongside SOTA transformer-based deep learning (DL) architectures, including Vision Transformer, SpectralFormer, and the Multimodal Fusion Transformer, to investigate architectural bottlenecks for further efficiency optimisation when deploying transformers in material identification. We implement zero-shot detection techniques and majority voting across pixel-level predictions to establish object-level classification robustness. In adherence to the FAIR data principles, the Electrolyzers-HSI dataset and accompanying codebase are openly available at https://github.com/hifexplo/Electrolyzers-HSI and https://rodare.hzdr.de/record/3668, supporting reproducible research and facilitating the broader adoption of smart and sustainable e-waste recycling solutions.
中文: 该研究推出了Electrolyzers-HSI数据集,这一多模态资源结合了RGB和高光谱数据,旨在提升关键原材料的回收效率,并通过评估机器学习模型优化材料分类技术,推动可持续电子废物回收的发展。
English: The study introduces the Electrolyzers-HSI dataset, a multimodal resource with RGB and hyperspectral data to enhance critical raw material recovery, and evaluates machine learning models for efficient material classification to advance sustainable e-waste recycling.
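The step of turning pixel-level predictions into object-level labels by majority voting, as mentioned in this abstract, can be sketched as follows. The grouping of pixels by an object id and the material names are hypothetical; the dataset's actual label set and pipeline are not given in the abstract.

```python
from collections import Counter

def classify_objects(pixel_predictions, object_ids):
    """Assign each object the most frequent label among its pixels.

    `pixel_predictions[i]` is the predicted class of pixel i and
    `object_ids[i]` identifies the object that pixel belongs to.
    """
    grouped = {}
    for label, obj in zip(pixel_predictions, object_ids):
        grouped.setdefault(obj, []).append(label)
    return {obj: Counter(labels).most_common(1)[0][0]
            for obj, labels in grouped.items()}

# Hypothetical material names, two objects with three pixels each.
pixels = ["steel", "steel", "nickel", "nickel", "nickel", "steel"]
objects = [0, 0, 0, 1, 1, 1]
object_labels = classify_objects(pixels, objects)
```

A single misclassified pixel in each object is outvoted, which is why voting tends to make object-level results more robust than the underlying per-pixel predictions.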

Authors:Tal Gonen, Itai Pemper, Ilan Naiman, Nimrod Berman, Omri Azencot
Title: Time Series Generation Under Data Scarcity: A Unified Generative Modeling Approach
Abstract:
Generative modeling of time series is a central challenge in time series analysis, particularly under data-scarce conditions. Despite recent advances in generative modeling, a comprehensive understanding of how state-of-the-art generative models perform under limited supervision remains lacking. In this work, we conduct the first large-scale study evaluating leading generative models in data-scarce settings, revealing a substantial performance gap between full-data and data-scarce regimes. To close this gap, we propose a unified diffusion-based generative framework that can synthesize high-fidelity time series across diverse domains using just a few examples. Our model is pre-trained on a large, heterogeneous collection of time series datasets, enabling it to learn generalizable temporal representations. It further incorporates architectural innovations such as dynamic convolutional layers for flexible channel adaptation and dataset token conditioning for domain-aware generation. Without requiring abundant supervision, our unified model achieves state-of-the-art performance in few-shot settings, outperforming domain-specific baselines across a wide range of subset sizes. Remarkably, it also surpasses all baselines even when tested on full-dataset benchmarks, highlighting the strength of pre-training and cross-domain generalization. We hope this work encourages the community to revisit few-shot generative modeling as a key problem in time series research and pursue unified solutions that scale efficiently across domains. Code is available at https://github.com/azencot-group/ImagenFew.
中文摘要:本研究提出了一种基于扩散的统一框架,通过跨领域预训练和架构创新,在极少数据条件下实现了生成高质量时间序列的最先进性能。
English Summary: This study introduces a unified diffusion-based framework that achieves state-of-the-art performance in generating high-fidelity time series with minimal data by leveraging cross-domain pre-training and architectural innovations.

Authors:Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu
Title: HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long context, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.
中文: 提出的HoPE方法通过混合频率分配和动态时间缩放机制增强视觉语言模型的长上下文处理能力,在长视频理解和检索任务中相比现有方法展现出更优性能。
English: The proposed HoPE method enhances Vision-Language Models' long-context capabilities through hybrid frequency allocation and dynamic temporal scaling, demonstrating superior performance in long video understanding and retrieval tasks compared to existing approaches.

Authors:Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu
Title: HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models
Abstract:
Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Our code is available at https://github.com/hrlics/HoPE.
中文: 提出的HoPE方法通过混合频率分配和动态时间缩放机制增强视觉语言模型的长上下文处理能力,在长视频理解和检索任务中相比现有方法展现出更优性能。
English: The proposed HoPE method enhances Vision-Language Models' long-context capabilities through hybrid frequency allocation and dynamic temporal scaling, demonstrating superior performance in long video understanding and retrieval tasks compared to existing approaches.

Authors:Zihong Chen, Wanli Jiang, Jinzhe Li, Zhonghang Yuan, Huanjun Kong, Wanli Ouyang, Nanqing Dong
Title: GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
Abstract:
Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.
中文摘要:GraphGen是一个知识图谱引导的框架,通过识别大语言模型的知识缺口并采用多跳采样技术生成高质量问答数据,在解决监督微调数据稀缺问题上优于传统合成方法。
English Summary: GraphGen is a knowledge graph-guided framework that generates high-quality synthetic question-answering data by identifying knowledge gaps in LLMs and incorporating multi-hop sampling, outperforming conventional methods in addressing data scarcity for fine-tuning.
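
The abstract above says GraphGen uses the expected calibration error (ECE) metric to locate knowledge gaps. As a rough illustration only, the sketch below computes the textbook binned form of ECE from a list of confidences and correctness flags. The function name, the bin count, and the equal-width binning are assumptions made for this example; how GraphGen actually obtains confidences or applies the metric is not described here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Textbook binned ECE: sample-weighted mean of |accuracy - confidence| per bin.

    Hypothetical helper for illustration; not taken from the GraphGen code base.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin on [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Two confident correct answers and two unconfident wrong ones.
    print(expected_calibration_error([0.9, 0.9, 0.1, 0.1], [True, True, False, False]))
```

A low value means stated confidence tracks accuracy; a high value flags questions where the model is over- or under-confident, which is one plausible signal for a knowledge gap.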

Authors:Royden Wagner, Omer Sahin Tas, Felix Hauser, Marlon Steiner, Dominik Strutz, Abhishek Vivekanandan, Carlos Fernandez, Christoph Stiller
Title: RetroMotion: Retrocausal Motion Forecasting Models are Instructable
Abstract:
Motion forecasts of road users (i.e., agents) vary in complexity as a function of scene constraints and interactive behavior. We address this with a multi-task learning method for motion forecasting that includes a retrocausal flow of information. The corresponding tasks are to forecast (1) marginal trajectory distributions for all modeled agents and (2) joint trajectory distributions for interacting agents. Using a transformer model, we generate the joint distributions by re-encoding marginal distributions followed by pairwise modeling. This incorporates a retrocausal flow of information from later points in marginal trajectories to earlier points in joint trajectories. Per trajectory point, we model positional uncertainty using compressed exponential power distributions. Notably, our method achieves state-of-the-art results in the Waymo Interaction Prediction dataset and generalizes well to the Argoverse 2 dataset. Additionally, our method provides an interface for issuing instructions through trajectory modifications. Our experiments show that regular training of motion forecasting leads to the ability to follow goal-based instructions and to adapt basic directional instructions to the scene context. Code: https://github.com/kit-mrt/future-motion
中文摘要:本文提出一种多任务学习的运动预测方法,通过逆向因果信息流和Transformer模型生成个体与交互轨迹分布,在基准数据集上实现最优性能,同时支持基于目标的轨迹指令修改。
English Summary: This paper introduces a multi-task learning method for motion forecasting that uses retrocausal information flow and transformer models to generate both marginal and joint trajectory distributions, achieving state-of-the-art results on benchmark datasets while enabling goal-based trajectory instructions.

Authors:Jaeyoung Choe, Jihoon Kim, Woohwan Jung
Title: Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents
Abstract:
Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their excellent performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate text and similar table structures. This similarity causes traditional RAG methods to confuse near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach first performs hierarchical retrieval to reduce confusion among similar texts: it retrieves related documents and then selects the most relevant passages from those documents. The evidence curation process then removes irrelevant passages and, when necessary, automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at https://github.com/deep-over/LOFin-bench-HiREC.
中文摘要:针对金融领域标准化文档重复内容导致检索精度下降的问题,提出HiREC框架,通过分层检索与证据筛选机制有效消除近似重复文本,并构建LOFin基准验证其优越性。
English Summary: The HiREC framework is introduced to enhance retrieval-augmented generation in finance by using hierarchical retrieval and evidence curation to eliminate duplicate and irrelevant text, improving accuracy in processing standardized documents.
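
The two-stage idea in the abstract above (retrieve documents first, then pick passages only from those documents, then drop irrelevant passages) can be sketched in a few lines. This is a toy illustration, not the HiREC implementation: the token-overlap scorer, the function names, and the "drop zero-score passages" curation rule are all stand-in assumptions, and the query-generation step is omitted.

```python
def overlap(query, text):
    """Toy relevance score: number of distinct query tokens found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def hierarchical_retrieve(query, corpus, top_docs=1, top_passages=2):
    """Stage 1: rank documents. Stage 2: rank passages within the kept documents.

    `corpus` maps a document id to its list of passages (hypothetical layout).
    """
    doc_scores = {
        doc_id: overlap(query, " ".join(passages))
        for doc_id, passages in corpus.items()
    }
    best_docs = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_docs]

    candidates = [
        (overlap(query, passage), doc_id, passage)
        for doc_id in best_docs
        for passage in corpus[doc_id]
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)

    # Stand-in for evidence curation: discard passages with no query overlap.
    return [(doc_id, p) for score, doc_id, p in candidates[:top_passages] if score > 0]


if __name__ == "__main__":
    corpus = {
        "acme_10k_2023": ["Acme revenue in 2023 was 5 billion",
                          "Risk factors include competition"],
        "beta_10k_2023": ["Beta revenue in 2023 was 2 billion",
                          "Risk factors include competition"],
    }
    print(hierarchical_retrieve("Acme revenue 2023", corpus))
```

Restricting passage search to the top-ranked documents is what keeps the near-identical "Beta" passage out of the result, even though it shares most of its tokens with the query.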

Authors:Jiahui Geng, Qing Li, Zongxiong Chen, Yuxia Wang, Derui Zhu, Zhuohan Xie, Chenyang Lyu, Xiuying Chen, Preslav Nakov, Fakhri Karray
Title: VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration
Abstract:
The rapid advancement of vision-language models (VLMs) has brought a lot of attention to their safety alignment. However, existing methods have primarily focused on model undersafety, where the model responds to hazardous queries, while neglecting oversafety, where the model refuses to answer safe queries. In this paper, we introduce the concept of safety calibration, which systematically addresses both undersafety and oversafety. Specifically, we present VSCBench, a novel dataset of 3,600 image-text pairs that are visually or textually similar but differ in terms of safety, which is designed to evaluate safety calibration across image-centric and text-centric scenarios. Based on our benchmark, we evaluate safety calibration across eleven widely used VLMs. Our extensive experiments revealed major issues with both undersafety and oversafety. We further investigated four approaches to improve the model's safety calibration. We found that even though some methods effectively calibrated the models' safety problems, these methods also lead to the degradation of models' utility. This trade-off underscores the urgent need for advanced calibration methods, and our benchmark provides a valuable tool for evaluating future approaches. Our code and data are available at https://github.com/jiahuigeng/VSCBench.git.
中文摘要:本文提出安全校准概念以解决视觉语言模型中的欠安全与过安全问题,通过VSCBench基准测试揭示了现有模型在安全性与实用性之间存在显著权衡。
English Summary: This paper introduces safety calibration to address both undersafety and oversafety in vision-language models, using the VSCBench dataset to reveal critical safety-utility trade-offs in current models.

Authors:Lijun Zhang, Lin Li, Yajie Qi, Huizhong Song, Yaodong Yang, Jun Wang, Wei Wei
Title: Risk-aware Direct Preference Optimization under Nested Risk Measure
Abstract:
When fine-tuning pre-trained Large Language Models (LLMs) to align with human values and intentions, maximizing the estimated reward can lead to superior performance, but it also introduces potential risks due to deviations from the reference model's intended behavior. Most existing methods introduce KL divergence to constrain deviations between the trained model and the reference model; however, this may not be sufficient in certain applications that require tight risk control. In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk-awareness by employing a class of nested risk measures. This approach formulates a constrained risk-aware advantage function maximization problem and then converts the Bradley-Terry model into a token-level representation. The objective function maximizes the likelihood of the policy while suppressing the deviation between a trained model and the reference model using a sequential risk ratio, thereby enhancing the model's risk-awareness. Experimental results across three open-source datasets (IMDb Dataset, Anthropic HH Dataset, and AlpacaEval) demonstrate the proposed method's superior performance in balancing alignment performance and model drift. Our code is open-sourced at https://github.com/zlj123-max/Ra-DPO.
中文:Ra-DPO通过引入风险感知度量来优化大型语言模型的性能并控制模型偏移,在多个基准数据集上实现了更优的对齐效果。
English: Ra-DPO enhances LLM alignment by integrating risk-aware measures to optimize performance while controlling model drift, achieving superior results on benchmark datasets.

Authors:Yeonjoon Jung, Daehyun Ahn, Hyungjun Kim, Taesu Kim, Eunhyeok Park
Title: GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
Abstract:
Low-Rank Adaptation (LoRA) is a popular method for parameter-efficient fine-tuning (PEFT) of generative models, valued for its simplicity and effectiveness. Despite recent enhancements, LoRA still suffers from a fundamental limitation: overfitting when the bottleneck is widened. It performs best at ranks 32-64, yet its accuracy stagnates or declines at higher ranks, still falling short of full fine-tuning (FFT) performance. We identify the root cause as LoRA's structural bottleneck, which introduces gradient entanglement to the unrelated input channels and distorts gradient propagation. To address this, we introduce a novel structure, Granular Low-Rank Adaptation (GraLoRA), that partitions weight matrices into sub-blocks, each with its own low-rank adapter. With negligible computational or storage cost, GraLoRA overcomes LoRA's limitations, effectively increases the representational capacity, and more closely approximates FFT behavior. Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank settings, making GraLoRA a scalable and robust solution for PEFT. Code, data, and scripts are available at https://github.com/SqueezeBits/GraLoRA.git
Chinese: 低秩适应(LoRA)因梯度纠缠问题在较高秩时存在过拟合,而粒度低秩适应(GraLoRA)通过将权重矩阵划分为子块有效解决了这一局限,在代码生成和常识推理基准测试中表现更优。
English: Low-Rank Adaptation (LoRA) faces overfitting issues at higher ranks due to gradient entanglement, but Granular Low-Rank Adaptation (GraLoRA) overcomes this by partitioning weight matrices into sub-blocks, achieving superior performance in benchmarks like code generation and commonsense reasoning.
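
To make the "sub-blocks, each with its own low-rank adapter" idea concrete, here is a minimal pure-Python sketch that assembles a weight update from a k x k grid of independent low-rank products. It is an illustration under stated assumptions, not the GraLoRA code: the helper names are invented, both factors are randomly initialised purely so the output is non-trivial, and the choice of a per-block rank of r / k (so the parameter count matches a rank-r LoRA) is an assumption of this example.

```python
import random


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def blockwise_delta(out_dim, in_dim, k, block_rank, rng):
    """Build an (out_dim x in_dim) update from k*k independent low-rank blocks.

    Returns the assembled update and the number of adapter parameters used.
    """
    if out_dim % k or in_dim % k:
        raise ValueError("dimensions must be divisible by k")
    block_out, block_in = out_dim // k, in_dim // k
    delta = [[0.0] * in_dim for _ in range(out_dim)]
    n_params = 0
    for i in range(k):
        for j in range(k):
            # Each sub-block gets its own pair of low-rank factors.
            b_factor = rand_matrix(block_out, block_rank, rng)
            a_factor = rand_matrix(block_rank, block_in, rng)
            block = matmul(b_factor, a_factor)
            for r in range(block_out):
                delta[i * block_out + r][j * block_in:(j + 1) * block_in] = block[r]
            n_params += block_out * block_rank + block_rank * block_in
    return delta, n_params


if __name__ == "__main__":
    delta, n_params = blockwise_delta(8, 8, k=2, block_rank=2, rng=random.Random(0))
    lora_params = 4 * (8 + 8)  # a single rank-4 LoRA on the same 8x8 matrix
    print(len(delta), len(delta[0]), n_params, lora_params)
```

With k = 2 and a per-block rank of 2, the four adapters together use as many parameters as one rank-4 LoRA, while each block of the update is fitted independently of the others.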

Authors:Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, Yu Li
Title: Rethinking Text-based Protein Understanding: Retrieval or LLM?
Abstract:
In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model's performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data can be seen at https://github.com/IDEA-XL/RAPM.
中文: 该研究揭示了当前蛋白质-文本模型存在数据泄露和评估指标不足的问题,并提出了一种检索增强方法,在无需训练的情况下显著优于微调大语言模型,实现了更高的准确性和效率。
English: The study identifies data leakage and inadequate evaluation metrics in current protein-text models, proposing a retrieval-enhanced method that surpasses fine-tuned LLMs in protein-to-text generation with improved accuracy and efficiency.

Authors:Dong Liu, Yanxuan Yu, Jiayi Zhang, Yifan Li, Ben Lengerich, Ying Nian Wu
Title: FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation
Abstract:
Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. To alleviate this inefficiency, we propose FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model's internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes are statistically insignificant. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, with the best generation quality among the compared cache methods, as measured by FID and t-FID. Code implementation of FastCache is available on GitHub at https://github.com/NoakLiu/FastCache-xDiT.
中文摘要:FastCache是一种创新的缓存压缩框架,通过自适应过滤冗余令牌和复用潜在激活来加速扩散Transformer推理,在保持生成质量的同时显著降低计算成本。
English Summary: FastCache is a novel caching and compression framework that accelerates Diffusion Transformer inference by adaptively filtering redundant tokens and reusing latent activations, significantly reducing computational costs while maintaining generation quality.

Authors:Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, Dacheng Tao
Title: SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data
Abstract:
Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing robust online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions, eliminating the need for external annotations. Finally, SeRL performs conventional RL based on the generated data, facilitating iterative self-play learning. Extensive experiments on various reasoning benchmarks and across different LLM backbones demonstrate that the proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards. Our code is available at https://github.com/wantbook-book/SeRL.
中文: 本文提出自对弈强化学习(SeRL)方法,通过自我生成指令和奖励机制,使大语言模型能够在缺乏外部高质量数据的情况下实现有效训练,在多项推理基准测试中取得了优于同类方法的性能表现。
English: This paper introduces Self-play Reinforcement Learning (SeRL), a method that enables large language models to generate their own instructions and rewards for effective training without relying on external high-quality data, achieving superior reasoning performance across various benchmarks.
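
The self-rewarding module described above estimates rewards by majority voting over sampled responses. A minimal sketch of that idea follows, assuming each sampled response has already been reduced to a final-answer string and that the reward is binary (1.0 for agreeing with the majority, 0.0 otherwise). The function name and the binary reward are assumptions for illustration; the abstract does not specify these details.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Treat the most frequent answer as a pseudo-label and reward agreement with it.

    Hypothetical helper; ties are resolved in favour of the answer seen first.
    """
    if not answers:
        return None, []
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if answer == majority else 0.0 for answer in answers]
    return majority, rewards


if __name__ == "__main__":
    sampled_answers = ["42", "42", "41", "42"]
    print(majority_vote_rewards(sampled_answers))
```

No external annotation enters the computation: the pseudo-label comes entirely from the model's own samples, which is what lets this kind of reward work on newly generated instructions.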

Authors:Rui Liu, Pu Gao, Jiatian Xi, Berrak Sisman, Carlos Busso, Haizhou Li
Title: Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset
Abstract:
Text-based speech editing (TSE) modifies speech using only text, eliminating re-recording. However, existing TSE methods mainly focus on the content accuracy and acoustic consistency of synthetic speech segments, and often overlook the emotional shifts or inconsistency issues introduced by text changes. To address this issue, we propose EmoCorrector, a novel post-correction scheme for TSE. EmoCorrector leverages Retrieval-Augmented Generation (RAG) by extracting the edited text's emotional features, retrieving speech samples with matching emotions, and synthesizing speech that aligns with the desired emotion while preserving the speaker's identity and quality. To support the training and evaluation of emotional consistency modeling in TSE, we pioneer the benchmarking Emotion Correction Dataset for TSE (ECD-TSE). The prominent aspect of ECD-TSE is its inclusion of <text, speech> paired data featuring diverse text variations and a range of emotional expressions. Subjective and objective experiments and comprehensive analysis on ECD-TSE confirm that EmoCorrector significantly enhances the expression of intended emotion while addressing emotion inconsistency limitations in current TSE methods. Code and audio examples are available at https://github.com/AI-S2-Lab/EmoCorrector.
中文: EmoCorrector采用检索增强生成技术,通过提取文本情感特征并匹配情感语音样本,在保持说话人特征和音质的同时,有效解决了文本语音编辑中的情感不一致问题。
English: EmoCorrector introduces a post-correction scheme using Retrieval-Augmented Generation to address emotional inconsistencies in text-based speech editing by aligning synthesized speech with desired emotions while preserving speaker identity and quality.

Authors:Mathew J. Koretsky, Maya Willey, Adi Asija, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri
Title: BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Abstract:
Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.
中文: BiomedSQL是首个专门评估生物医学知识库中文本转SQL系统科学推理能力的基准,结果显示现有模型性能与专家水平存在显著差距,为提升结构化数据推理支持科学发现奠定了基础。
English: BiomedSQL is a new benchmark designed to assess scientific reasoning in text-to-SQL systems for biomedical databases, revealing a significant performance gap where even the best models fall well below expert accuracy.

Authors:Mathew J. Koretsky, Maya Willey, Adi Asija, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri
Title: BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Abstract:
Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples generated from templates and grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.
中文: BiomedSQL是首个专门评估生物医学知识库中文本转SQL系统科学推理能力的基准,结果显示现有模型性能与专家水平存在显著差距,为提升结构化数据推理支持科学发现奠定了基础。
English: BiomedSQL is a new benchmark designed to assess scientific reasoning in text-to-SQL systems for biomedical databases, revealing a significant performance gap where even the best models fall well below expert accuracy.

Authors:Joseph Maffetone, Julia Gersey, Pei Zhang
Title: ZV-Sim: Probabilistic Simulation Framework for Pre-emergent Novel Zoonose Tracking
Abstract:
ZV-Sim is an open-source, modular Python framework for probabilistic simulation and analysis of pre-emergent novel zoonotic diseases using pervasive sensing data. It incorporates customizable Human and Animal Presence agents that leverage known and simulated location data, contact networks, and illness reports to assess and predict disease origins and spread. The framework supports Monte Carlo experiments to analyze outcomes with various user-defined movement and probability models. Although initial models are basic and illustrative, ZV-Sim's extensible design facilitates the integration of more sophisticated models as richer data become available, enhancing future capabilities in zoonotic disease tracking. The source code is publicly available at https://github.com/jmaff/zv-sim.
中文: ZV-Sim 是一个开源的 Python 框架,通过可定制的智能体和蒙特卡洛实验模拟分析人畜共患病的起源与传播,其可扩展设计支持未来集成更复杂的模型。
English: ZV-Sim is an open-source Python framework for simulating and analyzing the origins and spread of zoonotic diseases through customizable agents and Monte Carlo experiments, with an extensible design for future enhancements.

Authors:Patrick Yubeaton, Andre Nakkab, Weihua Xiao, Luca Collini, Ramesh Karri, Chinmay Hegde, Siddharth Garg
Title: VeriThoughts: Enabling Automated Verilog Code Generation using Reasoning and Formal Verification
Abstract:
This paper introduces VeriThoughts, a novel dataset designed for reasoning-based Verilog code generation. We establish a new benchmark framework grounded in formal verification methods to evaluate the quality and correctness of generated hardware descriptions. Additionally, we present a suite of specialized small-scale models optimized specifically for Verilog generation. Our work addresses the growing need for automated hardware design tools that can produce verifiably correct implementations from high-level specifications, potentially accelerating the hardware development process while maintaining rigorous correctness guarantees. Our code and data are available at https://github.com/wilyub/VeriThoughts.
中文: 本文介绍了VeriThoughts,这是一个基于形式化验证的Verilog代码生成数据集和评估框架,并提供了专用模型来自动化硬件设计,确保正确性。
English: This paper presents VeriThoughts, a new dataset and benchmark for generating and verifying Verilog code using formal methods, along with specialized models to automate hardware design with guaranteed correctness.

Authors:Jianpeng Chen, Wangzhi Zhan, Haohui Wang, Zian Jia, Jingru Gan, Junkai Zhang, Jingyuan Qi, Tingwei Chen, Lifu Huang, Muhao Chen, Ling Li, Wei Wang, Dawei Zhou
Title: MetamatBench: Integrating Heterogeneous Data, Computational Tools, and Visual Interface for Metamaterial Discovery
Abstract:
Metamaterials, engineered materials with architected structures across multiple length scales, offer unprecedented and tunable mechanical properties that surpass those of conventional materials. However, leveraging advanced machine learning (ML) for metamaterial discovery is hindered by three fundamental challenges: (C1) Data Heterogeneity Challenge arises from heterogeneous data sources, heterogeneous composition scales, and heterogeneous structure categories; (C2) Model Complexity Challenge stems from the intricate geometric constraints of ML models, which complicate their adaptation to metamaterial structures; and (C3) Human-AI Collaboration Challenge comes from the "dual black-box" nature of sophisticated ML models and the need for intuitive user interfaces. To tackle these challenges, we introduce a unified framework, named MetamatBench, that operates on three levels. (1) At the data level, we integrate and standardize 5 heterogeneous, multi-modal metamaterial datasets. (2) The ML level provides a comprehensive toolkit that adapts 17 state-of-the-art ML methods for metamaterial discovery. It also includes a comprehensive evaluation suite with 12 novel performance metrics with finite element-based assessments to ensure accurate and reliable model validation. (3) The user level features a visual-interactive interface that bridges the gap between complex ML techniques and non-ML researchers, advancing property prediction and inverse design of metamaterials for research and applications. MetamatBench offers a unified platform deployed at http://zhoulab-1.cs.vt.edu:5550 that enables machine learning researchers and practitioners to develop and evaluate new methodologies in metamaterial discovery. For accessibility and reproducibility, we open-source our benchmark and the codebase at https://github.com/cjpcool/Metamaterial-Benchmark.
中文:MetamatBench是一个统一框架,通过整合多模态数据集、适配机器学习工具并提供直观界面,解决了超材料发现中的关键挑战,弥合了复杂机器学习技术与研究人员之间的鸿沟。
English: MetamatBench is a unified framework that addresses key challenges in metamaterial discovery by integrating multi-modal datasets, adapting machine learning tools, and providing an intuitive interface to bridge the gap between complex ML techniques and researchers.

Authors:Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng
Title: DiSA: Diffusion Step Annealing in Autoregressive Image Generation
Abstract:
An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. Intuitively, if a model has generated part of a dog, the remaining tokens must complete the dog and thus are more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on this finding, we introduce diffusion step annealing (DiSA), a training-free method which gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models, and albeit simple, achieves 5-10x faster inference for MAR and Harmon and 1.4-2.5x for FlowAR and xAR, while maintaining the generation quality.
Chinese: 本文提出扩散步长退火(DiSA)方法,通过随生成令牌增多而逐步减少扩散步数,在保持图像生成质量的同时,将自回归模型的推理速度最高提升10倍。
English: This paper introduces diffusion step annealing (DiSA), a training-free method that accelerates autoregressive image generation by progressively reducing diffusion steps as more tokens are generated, achieving up to 10x faster inference while preserving quality.
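Note: the annealing idea described in the abstract can be sketched as a per-token step schedule. This is a minimal illustration, not the authors' implementation. The linear decay, the function names `disa_steps` and `total_diffusion_steps`, and the 256-token sequence length are assumptions; only the 50-to-5 endpoints come from the abstract.

```python
def disa_steps(token_index, total_tokens, start_steps=50, end_steps=5):
    """Return the number of diffusion steps to use for one token.

    Assumption: a linear decay between the two endpoints; the abstract only
    says that the step count decreases gradually as more tokens are generated.
    """
    if total_tokens <= 1:
        return start_steps
    progress = token_index / (total_tokens - 1)  # 0.0 for the first token, 1.0 for the last
    return round(start_steps + progress * (end_steps - start_steps))


def total_diffusion_steps(total_tokens, **kwargs):
    """Sum of per-token diffusion steps over the whole sequence."""
    return sum(disa_steps(i, total_tokens, **kwargs) for i in range(total_tokens))


if __name__ == "__main__":
    n = 256  # hypothetical number of tokens per image
    annealed = total_diffusion_steps(n)
    constant = 50 * n  # baseline: 50 steps for every token
    print(f"annealed: {annealed} steps, constant: {constant} steps")
```

The step count is only a proxy for cost; the speedups quoted in the abstract are measured on the full models and also depend on the autoregressive backbone.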

Authors:Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Sinead Williamson
Title: Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
Abstract:
To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue that in the output space of LLMs, the space of strings, exist strings expressive enough to summarize the distribution over output strings the LLM deems possible. We lay a foundation for this new avenue of uncertainty explication and present SelfReflect, a theoretically-motivated metric to assess how faithfully a string summarizes an LLM's internal answer distribution. We show that SelfReflect is able to discriminate even subtle differences of candidate summary strings and that it aligns with human judgement, outperforming alternative metrics such as LLM judges and embedding comparisons. With SelfReflect, we investigate a number of self-summarization methods and find that even state-of-the-art reasoning models struggle to explicate their internal uncertainty. But we find that faithful summarizations can be generated by sampling and summarizing. To support the development of this universal form of LLM uncertainties, we publish our metric at https://github.com/apple/ml-selfreflect
English Summary: This research introduces SelfReflect, a novel metric for evaluating how accurately a string summarizes a large language model's internal uncertainty distribution over possible outputs, demonstrating its superiority over existing methods and its alignment with human judgment.

Authors:Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, Li Yuan
Title: OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
Abstract:
Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 18 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.
中文: 本研究提出了OpenS2V-Nexus这一完整基础设施,通过细粒度基准和大规模数据集来评估和提升视频生成中的主体一致性与自然度。
English: The study introduces OpenS2V-Nexus, a comprehensive infrastructure for Subject-to-Video generation featuring a fine-grained benchmark and a large-scale dataset to evaluate and enhance subject consistency and naturalness in videos.

Authors:Di Wu, Yixin Wan, Kai-Wei Chang
Title: Visualized Text-to-Image Retrieval
Abstract:
We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval that mitigates the limitations of cross-modal similarity alignment of existing multi-modal embeddings. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Experiments on three knowledge-intensive T2I retrieval benchmarks, including a newly introduced multi-entity benchmark, demonstrate that VisRet consistently improves T2I retrieval by 24.5% to 32.7% NDCG@10 across different embedding models. VisRet also significantly benefits downstream visual question answering accuracy when used in retrieval-augmented generation pipelines. The method is plug-and-play and compatible with off-the-shelf retrievers, making it an effective module for knowledge-intensive multi-modal systems. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.
Chinese: VisRet提出了一种新的文本到图像检索范式,通过先将文本查询转换为图像,然后在图像模态内进行检索,显著提升了检索性能24.5%至32.7%,并有效增强了下游任务的准确性。
English: VisRet introduces a novel Text-to-Image retrieval approach by first converting text queries into images and then retrieving within the image modality, significantly improving performance by 24.5% to 32.7% across benchmarks and benefiting downstream tasks.

Authors:Guangting Zheng, Yehao Li, Yingwei Pan, Jiajun Deng, Ting Yao, Yanyong Zhang, Tao Mei
Title: Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots
Abstract:
Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de-facto standard of next token prediction commonly operates over a single-scale sequence of dense image tokens, and is incapable of utilizing global context especially for early tokens prediction. In this paper, we introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and delve into a thorough hierarchical dependency across multi-scale image tokens. Technically, we present a Hierarchical Masked Autoregressive models (Hi-MAR) that pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. Hi-MAR learns to predict a few image tokens in low resolution, functioning as intermediary pivots to reflect global structure, in the first phase. Such pivots act as the additional guidance to strengthen the next autoregressive modeling phase by shaping global structural awareness of typical dense image tokens. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for mask token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines, while requiring fewer computational costs. Code is available at https://github.com/HiDream-ai/himar.
中文: 本文提出分层掩码自回归模型(Hi-MAR),通过先预测低分辨率图像标记来捕捉全局结构,再将其作为密集标记生成的引导,以更高效率超越了标准自回归方法的视觉生成性能。
English: This paper introduces Hierarchical Masked Autoregressive models (Hi-MAR), which enhance visual generation by predicting low-resolution image tokens first to capture global structure, then using them as guidance for dense token generation, outperforming standard autoregressive methods with greater efficiency.

Authors:Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, Xing Zhou, Dongrui Liu, Ling Yang, Yue Wu, Kaixuan Huang, Shilong Liu, Hongru Wang, Mengdi Wang
Title: Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution
Abstract:
Recent advances in large language models (LLMs) have enabled agents to autonomously perform complex, open-ended tasks. However, many existing frameworks depend heavily on manually predefined tools and workflows, which hinder their adaptability, scalability, and generalization across domains. In this work, we introduce Alita--a generalist agent designed with the principle of "Simplicity is the ultimate sophistication," enabling scalable agentic reasoning through minimal predefinition and maximal self-evolution. For minimal predefinition, Alita is equipped with only one component for direct problem-solving, making it much simpler and neater than previous approaches that relied heavily on hand-crafted, elaborate tools and workflows. This clean design enhances its potential to generalize to challenging questions, without being limited by tools. For maximal self-evolution, we enable the creativity of Alita by providing a suite of general-purpose components to autonomously construct, refine, and reuse external capabilities by generating task-related model context protocols (MCPs) from open source, which contributes to scalable agentic reasoning. Notably, Alita achieves 75.15% pass@1 and 87.27% pass@3 accuracy, which is top-ranking among general-purpose agents, on the GAIA benchmark validation dataset, and 74.00% and 52.00% pass@1, respectively, on Mathvista and PathVQA, outperforming many agent systems with far greater complexity. More details will be updated at https://github.com/CharlesQ9/Alita.
Chinese: Alita是一种通用智能体,通过最小化预定义工具和最大化自我进化,提升了适应性和可扩展性,在GAIA、Mathvista和PathVQA等基准测试中取得了顶尖性能。
English: Alita is a generalist agent that enhances adaptability and scalability by minimizing predefined tools and maximizing self-evolution, achieving top-tier performance on benchmarks like GAIA, Mathvista, and PathVQA.

Authors:Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, Jingren Zhou
Title: MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability
Abstract:
Retrieval-Augmented Language Models (RALMs) represent a classic paradigm where models enhance generative capabilities using external knowledge retrieved via a specialized module. Recent advancements in Agent techniques enable Large Language Models (LLMs) to autonomously utilize tools for retrieval, planning, and reasoning. While existing training-based methods show promise, their agentic abilities are limited by inherent characteristics of the task-specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre-training framework, MaskSearch. In the pre-training stage, we introduce the Retrieval Augmented Mask Prediction (RAMP) task, where the model learns to leverage search tools to fill masked spans on a large number of pre-training data, thus acquiring universal retrieval and reasoning capabilities for LLMs. After that, the model is trained on downstream tasks to achieve further improvement. We apply both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent-based and distillation-based methods to generate training data, starting with a multi-agent system consisting of a planner, rewriter, observer, and followed by a self-evolving teacher model. While for RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that allows the model to learn progressively from easier to more challenging instances based on the number of masked spans. We evaluate the effectiveness of our framework in the scenario of open-domain multi-hop question answering. Through extensive experiments, we demonstrate that MaskSearch significantly enhances the performance of LLM-based search agents on both in-domain and out-of-domain downstream tasks.
中文: 提出的MaskSearch框架通过名为检索增强掩码预测的新型预训练任务,结合监督微调和强化学习,显著提升了大型语言模型在领域内和跨域任务中的通用搜索能力。
English: The proposed MaskSearch framework enhances large language models' universal search capabilities through a novel pre-training task called Retrieval Augmented Mask Prediction, which combines supervised fine-tuning and reinforcement learning to significantly improve performance on both in-domain and out-of-domain tasks.

Authors:Zitian Gao, Lynx Chen, Haoming Luo, Joey Zhou, Bryan Dai
Title: One-shot Entropy Minimization
Abstract:
We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled data point and 10 optimization steps to achieve performance improvements comparable to or even greater than those obtained using thousands of data points and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is available at https://github.com/zitian-gao/one-shot-em.
中文摘要:通过对13,440个大语言模型的训练发现,仅需一个未标记数据和10步优化,熵最小化方法就能达到甚至超越基于规则的强化学习使用数千数据和精心设计奖励的效果,这一突破性成果可能促使人们重新思考大语言模型的后训练范式。
English Summary: Training 13,440 large language models revealed that entropy minimization with just one unlabeled data point and 10 optimization steps can match or surpass rule-based reinforcement learning using thousands of data points, potentially reshaping post-training approaches for such models.
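Note: the objective named in the abstract, entropy minimization, can be illustrated on a single logit vector. This is a toy sketch under stated assumptions, not the authors' training code: it applies plain gradient descent to the logits of one categorical distribution, whereas the paper optimizes model parameters over generated tokens. The function names, the learning rate, and the example logits are hypothetical; the 10 steps mirror the count in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


def entropy_grad(logits):
    """Gradient of the entropy w.r.t. each logit: dH/dz_i = -p_i * (log p_i + H)."""
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return [-p * (math.log(p) + h) for p in probs]


def minimize_entropy(logits, steps=10, lr=0.5):
    """Run a few gradient-descent steps that lower the entropy of the distribution."""
    z = list(logits)
    for _ in range(steps):
        grad = entropy_grad(z)
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


if __name__ == "__main__":
    start = [1.0, 0.5, 0.2]  # hypothetical next-token logits
    end = minimize_entropy(start)
    print(f"entropy before: {entropy(start):.4f}, after: {entropy(end):.4f}")
```

Each step pushes probability mass toward the option that is already most likely, which is why the objective needs no labels.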

Authors:Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, Jingkuan Song, Fei Huang, Yongbin Li
Title: OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction
Abstract:
Role-Playing Agents (RPAs), powered by large language models, are emerging interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form and neglect the role's voice traits (e.g., voice style and emotions), which play a crucial part in interaction and make for more immersive experiences in realistic scenarios. Towards this goal, we propose OmniCharacter, a first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction, enabling a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which involves more distinctive characters (20), richly contextualized multi-round dialogue (10K), and dynamic speech response (135K). Experimental results showcase that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms. Code and dataset are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.
中文:OmniCharacter是一种首创的语音-语言个性交互模型,通过低延迟使角色扮演智能体持续展现角色特定特质和声音特征,借助丰富数据集创造沉浸式体验,在内容和风格上均优于现有方法。
English: OmniCharacter is a pioneering speech-language personality interaction model that enables Role-Playing Agents to consistently exhibit role-specific traits and vocal characteristics with low latency, creating immersive experiences through a comprehensive dataset and outperforming existing methods in both content and style.

Authors:Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, Li Yuan
Title: ImgEdit: A Unified Image Editing Dataset and Benchmark
Abstract:
Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks. To overcome these limitations, we introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks. To ensure the data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, a segmentation model, alongside task-specific in-painting procedures and strict post-processing. ImgEdit surpasses existing datasets in both task novelty and data quality. Using ImgEdit, we train ImgEdit-E1, an editing model using Vision Language Model to process the reference image and editing prompt, which outperforms existing open-source models on multiple tasks, highlighting the value of ImgEdit and model design. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance in terms of instruction adherence, editing quality, and detail preservation. It includes a basic testsuite, a challenging single-turn suite, and a dedicated multi-turn suite. We evaluate both open-source and proprietary models, as well as ImgEdit-E1, providing deep analysis and actionable insights into the current behavior of image-editing models. The source data are publicly available on https://github.com/PKU-YuanGroup/ImgEdit.
中文: 摘要介绍了ImgEdit,这是一个大规模高质量的图像编辑数据集和基准,旨在通过提供精心筛选的编辑对和全面评估工具来弥补开源模型的不足,其训练的ImgEdit-E1模型展现出卓越性能。
English: The abstract introduces ImgEdit, a large-scale, high-quality image-editing dataset and benchmark designed to address limitations in open-source models by providing curated edit pairs and comprehensive evaluation tools, with the trained model ImgEdit-E1 demonstrating superior performance.

Authors:Jinsheng Quan, Chunshi Wang, Yawei Luo
Title: ParticleGS: Particle-Based Dynamics Modeling of 3D Gaussians for Prior-free Motion Extrapolation
Abstract:
This paper aims to model the dynamics of 3D Gaussians from visual observations to support temporal extrapolation. Existing dynamic 3D reconstruction methods often struggle to effectively learn underlying dynamics or rely heavily on manually defined physical priors, which limits their extrapolation capabilities. To address this issue, we propose a novel dynamic 3D Gaussian Splatting prior-free motion extrapolation framework based on particle dynamics systems. The core advantage of our method lies in its ability to learn differential equations that describe the dynamics of 3D Gaussians, and follow them during future frame extrapolation. Instead of simply fitting to the observed visual frame sequence, we aim to more effectively model the gaussian particle dynamics system. To this end, we introduce a dynamics latent state vector into the standard Gaussian kernel and design a dynamics latent space encoder to extract initial state. Subsequently, we introduce a Neural ODEs-based dynamics module that models the temporal evolution of Gaussian in dynamics latent space. Finally, a Gaussian kernel space decoder is used to decode latent state at the specific time step into the deformation. Experimental results demonstrate that the proposed method achieves comparable rendering quality with existing approaches in reconstruction tasks, and significantly outperforms them in future frame extrapolation. Our code is available at https://github.com/QuanJinSheng/ParticleGS.
中文: 本文提出了一种新颖的动态3D高斯泼溅框架,通过神经微分方程学习粒子动力学系统,在保持重建质量的同时实现了显著优于现有方法的时序外推能力。
English: This paper introduces a novel dynamic 3D Gaussian Splatting framework that learns underlying particle dynamics through Neural ODEs, enabling superior temporal extrapolation while maintaining competitive reconstruction quality.

Authors:Haoyu Wang, Zeyu Qin, Yifei Zhao, Chao Du, Min Lin, Xueqian Wang, Tianyu Pang
Title: Lifelong Safety Alignment for Language Models
Abstract:
LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at https://github.com/sail-sg/LifelongSafetyAlignment.
中文: 本文提出了一种终身安全对齐框架,通过元攻击器与防御器的竞争机制,使大语言模型持续适应新型越狱攻击,初始攻击成功率高达73%,经迭代训练后降至7%,从而提升开放环境下的部署安全性。
English: This paper introduces a lifelong safety alignment framework that uses a competitive Meta-Attacker and Defender to continuously adapt LLMs to evolving jailbreaking attacks, achieving a 73% attack success rate initially and reducing it to 7% after iterative training for safer deployment.

Authors:Muyao Niu, Mingdeng Cao, Yifan Zhan, Qingtian Zhu, Mingze Ma, Jiancheng Zhao, Yanhong Zeng, Zhihang Zhong, Xiao Sun, Yinqiang Zheng
Title: AniCrafter: Customizing Realistic Human-Centric Animation via Avatar-Background Conditioning in Video Diffusion Models
Abstract:
Recent advances in video diffusion models have significantly improved character animation techniques. However, current approaches rely on basic structural conditions such as DWPose or SMPL-X to animate character images, limiting their effectiveness in open-domain scenarios with dynamic backgrounds or challenging human poses. In this paper, we introduce AniCrafter, a diffusion-based human-centric animation model that can seamlessly integrate and animate a given character into open-domain dynamic backgrounds while following given human motion sequences. Built on cutting-edge Image-to-Video (I2V) diffusion architectures, our model incorporates an innovative "avatar-background" conditioning mechanism that reframes open-domain human-centric animation as a restoration task, enabling more stable and versatile animation outputs. Experimental results demonstrate the superior performance of our method. Codes are available at https://github.com/MyNiuuu/AniCrafter.
Chinese: AniCrafter提出了一种基于扩散的人物动画模型,通过创新的“角色-背景”调节机制,能在动态开放域场景中无缝生成人物动画,其性能优于现有方法。
English: AniCrafter introduces a diffusion-based human animation model that uses an innovative avatar-background conditioning mechanism to seamlessly animate characters in dynamic open-domain scenes, outperforming existing methods.

Authors:Qi Cao, Ruiyi Wang, Ruiyi Zhang, Sai Ashish Somayajula, Pengtao Xie
Title: DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning
Abstract:
Reasoning has improved the performance of large language models (LLMs) on complicated tasks. Central to the current reasoning studies, Process Reward Models (PRMs) offer a fine-grained evaluation of intermediate reasoning steps and guide the reasoning process. However, extending PRMs to multimodal large language models (MLLMs) introduces challenges. Since multimodal reasoning covers a wider range of tasks compared to text-only scenarios, the resulting distribution shift from the training to testing sets is more severe, leading to greater generalization difficulty. Training a reliable multimodal PRM, therefore, demands large and diverse datasets to ensure sufficient coverage. However, current multimodal reasoning datasets suffer from quality imbalance, which degrades PRM performance and highlights the need for data selection strategy. To address the issues, we introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs which employs bi-level optimization. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights, allowing the PRM to prioritize high-quality reasoning signals and alleviating the impact of dataset quality imbalance. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset; this feedback updates the domain weights through an aggregation loss function, thereby improving the generalization capability of trained PRM. Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves performance of state-of-the-art MLLMs. Further comparisons reveal that DreamPRM's domain-reweighting strategy surpasses data selection methods and yields higher accuracy gains than existing test-time scaling approaches. Codes are available at https://github.com/coder-qicao/DreamPRM.
中文:DreamPRM是一种领域重加权训练框架,通过双层优化优先处理高质量数据,提升了多模态推理的泛化能力和性能。
English: DreamPRM is a domain-reweighted training framework that enhances multimodal reasoning by prioritizing high-quality data through bi-level optimization, improving generalization and performance across various benchmarks.

Authors:Pranav Poudel, Aavash Chhetri, Prashnna Gyawali, Georgios Leontidis, Binod Bhattarai
Title: Multimodal Federated Learning With Missing Modalities through Feature Imputation Network
Abstract:
Multimodal federated learning holds immense potential for collaboratively training models from multiple sources without sharing raw data, addressing both data scarcity and privacy concerns, two key challenges in healthcare. A major challenge in training multimodal federated models in healthcare is the presence of missing modalities due to multiple reasons, including variations in clinical practice, cost and accessibility constraints, retrospective data collection, privacy concerns, and occasional technical or human errors. Previous methods typically rely on publicly available real datasets or synthetic data to compensate for missing modalities. However, obtaining real datasets for every disease is impractical, and training generative models to synthesize missing modalities is computationally expensive and prone to errors due to the high dimensionality of medical data. In this paper, we propose a novel, lightweight, low-dimensional feature translator to reconstruct bottleneck features of the missing modalities. Our experiments on three different datasets (MIMIC-CXR, NIH Open-I, and CheXpert), in both homogeneous and heterogeneous settings consistently improve the performance of competitive baselines. The code and implementation details are available at: https://github.com/bhattarailab/FedFeatGen
中文摘要:本文提出了一种轻量级特征转换器,用于在医疗多模态联邦学习中重构缺失模态,在三个医疗数据集上均持续优于基线方法。
English Summary: The paper introduces a lightweight feature translator for reconstructing missing modalities in multimodal federated learning for healthcare, which consistently outperforms baseline methods across three medical datasets.

Authors:Maximilian Dreyer, Lorenz Hufe, Jim Berend, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Title: From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance
Abstract:
Transformer-based CLIP models are widely used for text-image probing and feature extraction, making it relevant to understand the internal mechanisms behind their predictions. While recent works show that Sparse Autoencoders (SAEs) yield interpretable latent components, they focus on what these encode and miss how they drive predictions. We introduce a scalable framework that reveals what latent components activate for, how they align with expected semantics, and how important they are to predictions. To achieve this, we adapt attribution patching for instance-wise component attributions in CLIP and highlight key faithfulness limitations of the widely used Logit Lens technique. By combining attributions with semantic alignment scores, we can automatically uncover reliance on components that encode semantically unexpected or spurious concepts. Applied across multiple CLIP variants, our method uncovers hundreds of surprising components linked to polysemous words, compound nouns, visual typography and dataset artifacts. While text embeddings remain prone to semantic ambiguity, they are more robust to spurious correlations compared to linear classifiers trained on image embeddings. A case study on skin lesion detection highlights how such classifiers can amplify hidden shortcuts, underscoring the need for holistic, mechanistic interpretability. We provide code at https://github.com/maxdreyer/attributing-clip.
中文摘要:本文提出了一种可扩展的框架,通过结合归因修补和语义对齐来解读CLIP模型的内部机制,揭示了潜在组件如何驱动预测,并在多个模型变体中发现了令人惊讶的语义关联和隐藏的捷径。
English Summary: This paper introduces a scalable framework for interpreting CLIP models by combining attribution patching with semantic alignment to reveal how latent components drive predictions, uncovering surprising semantic associations and hidden shortcuts across multiple model variants.

Authors:Hao Kang, Zichun Yu, Chenyan Xiong
Title: FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models
Abstract:
Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture--64 experts with top-8 gating and 2 shared experts--closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at https://github.com/cmu-flame/FLAME-MoE.
中文:FLAME-MoE是一个完全开源的研究平台,提供采用专家混合架构的仅解码器模型,在比密集基线准确率提升最高3.4个百分点的同时,实现了对专家专业化和路由行为的透明化分析。
English: FLAME-MoE is a fully open-source research suite that provides decoder-only models with Mixture-of-Experts architecture, improving accuracy by up to 3.4 points over dense baselines while enabling transparent analysis of expert specialization and routing behavior.
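Note: the routing pattern mentioned in the abstract (64 experts with top-8 gating) can be sketched as follows. This is a generic top-k gate written for illustration, not code from the FLAME-MoE repository. The function name `top_k_gate`, the choice to renormalize with a softmax over only the selected experts, and the example logits are assumptions.

```python
import math


def top_k_gate(router_logits, k=8):
    """Select the k highest-scoring experts and return {expert_index: weight}.

    Assumption: weights are a softmax over the selected logits only, so they
    sum to 1 across the k active experts.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    m = max(router_logits[i] for i in chosen)  # subtract the max for stability
    exps = {i: math.exp(router_logits[i] - m) for i in chosen}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


if __name__ == "__main__":
    # Hypothetical router scores for one token over 64 experts.
    logits = [0.1 * i for i in range(64)]
    gates = top_k_gate(logits, k=8)
    for expert, weight in sorted(gates.items()):
        print(f"expert {expert}: weight {weight:.3f}")
```

Only the selected experts run for a given token, which is the source of the efficiency gain the abstract attributes to activating a fraction of the model. Shared experts, which the abstract also mentions, would be applied to every token in addition to the routed ones and are omitted here.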

Authors:Yixin Cui, Haotian Lin, Shuo Yang, Yixiao Wang, Yanjun Huang, Hong Chen
Title: Chain-of-Thought for Autonomous Driving: A Comprehensive Survey and Future Prospects
Abstract:
The rapid evolution of large language models in natural language processing has substantially elevated their semantic understanding and logical reasoning capabilities. Such proficiencies have been leveraged in autonomous driving systems, contributing to significant improvements in system performance. Models such as OpenAI o1 and DeepSeek-R1 leverage Chain-of-Thought (CoT) reasoning, an advanced cognitive method that simulates human thinking processes, demonstrating remarkable reasoning capabilities in complex tasks. By structuring complex driving scenarios within a systematic reasoning framework, this approach has emerged as a prominent research focus in autonomous driving, substantially improving the system's ability to handle challenging cases. This paper investigates how CoT methods improve the reasoning abilities of autonomous driving models. Based on a comprehensive literature review, we present a systematic analysis of the motivations, methodologies, challenges, and future research directions of CoT in autonomous driving. Furthermore, we propose the insight of combining CoT with self-learning to facilitate self-evolution in driving systems. To ensure the relevance and timeliness of this study, we have compiled a dynamic repository of literature and open-source projects, diligently updated to incorporate forefront developments. The repository is publicly available at https://github.com/cuiyx1720/Awesome-CoT4AD.
中文: 本文探讨了思维链推理如何通过提升语义理解和逻辑推理能力来增强自动驾驶系统,同时提出结合自学习以实现自我进化,并提供了持续更新的相关研究资源库。
English: This paper explores how Chain-of-Thought reasoning enhances autonomous driving systems by improving their semantic understanding and logical reasoning, while also proposing a self-learning integration for self-evolution and providing an updated repository of related research.

Authors:Chenxiao Fan, Chongming Gao, Wentao Shi, Yaxin Gong, Zihao Zhao, Fuli Feng
Title: Fine-grained List-wise Alignment for Generative Medication Recommendation
Abstract:
Accurate and safe medication recommendations are critical for effective clinical decision-making, especially in multimorbidity cases. However, existing systems rely on point-wise prediction paradigms that overlook synergistic drug effects and potential adverse drug-drug interactions (DDIs). We propose FLAME, a fine-grained list-wise alignment framework for large language models (LLMs), enabling drug-by-drug generation of drug lists. FLAME formulates recommendation as a sequential decision process, where each step adds or removes a single drug. To provide fine-grained learning signals, we devise step-wise Group Relative Policy Optimization (GRPO) with potential-based reward shaping, which explicitly models DDIs and optimizes the contribution of each drug to the overall prescription. Furthermore, FLAME enhances patient modeling by integrating structured clinical knowledge and collaborative information into the representation space of LLMs. Experiments on benchmark datasets demonstrate that FLAME achieves state-of-the-art performance, delivering superior accuracy, controllable safety-accuracy trade-offs, and strong generalization across diverse clinical scenarios. Our code is available at https://github.com/cxfann/Flame.
中文摘要:FLAME提出了一种细粒度列表对齐框架,通过顺序决策建模和基于潜在奖励的逐步优化,在大语言模型中实现了精准可控的药物推荐,在临床场景中展现出卓越的准确性和安全性。
English Summary: FLAME introduces a fine-grained list-wise alignment framework for large language models that optimizes medication recommendations through sequential decision-making and step-wise reward shaping, achieving state-of-the-art accuracy and safety in clinical applications.
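
The abstract mentions potential-based reward shaping over a process in which each step adds or removes one drug. A minimal sketch of that general idea follows, using the standard shaping form r' = r + γΦ(s') − Φ(s). The potential function here (correct drugs minus incorrect drugs minus a penalty per interacting pair) is purely hypothetical and is not FLAME's actual reward; the drug names and DDI pairs are made up for illustration.

```python
def potential(drugs, target, ddi_pairs, ddi_weight=0.5):
    """Hypothetical potential of a partial prescription (a set of drug names)."""
    hits = len(drugs & target)      # drugs that belong in the prescription
    misses = len(drugs - target)    # drugs that do not
    ddis = sum(1 for a, b in ddi_pairs if a in drugs and b in drugs)
    return hits - misses - ddi_weight * ddis


def shaped_reward(state, next_state, target, ddi_pairs, base_reward=0.0, gamma=1.0):
    """Standard potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return (base_reward
            + gamma * potential(next_state, target, ddi_pairs)
            - potential(state, target, ddi_pairs))


# Toy episode: add A, add C (interacts with A), remove C, add B.
target = {"A", "B"}
ddi_pairs = [("A", "C")]
trajectory = [set(), {"A"}, {"A", "C"}, {"A"}, {"A", "B"}]

step_rewards = [
    shaped_reward(s, s_next, target, ddi_pairs)
    for s, s_next in zip(trajectory, trajectory[1:])
]
```

Each step receives its own signal (adding the interacting drug C is penalized, removing it is rewarded), while with γ = 1 the shaped rewards telescope to Φ(final) − Φ(initial), so the shaping does not change which final prescription is best.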

Authors:Pengxiang Li, Shilin Yan, Joey Tsai, Renrui Zhang, Ruichuan An, Ziyu Guo, Xiaowei Gao
Title: Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking
Abstract:
Classifier-Free Guidance (CFG) significantly enhances controllability in generative models by interpolating conditional and unconditional predictions. However, standard CFG often employs a static unconditional input, which can be suboptimal for iterative generation processes where model uncertainty varies dynamically. We introduce Adaptive Classifier-Free Guidance (A-CFG), a novel method that tailors the unconditional input by leveraging the model's instantaneous predictive confidence. At each step of an iterative (masked) diffusion language model, A-CFG identifies tokens in the currently generated sequence for which the model exhibits low confidence. These tokens are temporarily re-masked to create a dynamic, localized unconditional input. This focuses CFG's corrective influence precisely on areas of ambiguity, leading to more effective guidance. We integrate A-CFG into a state-of-the-art masked diffusion language model and demonstrate its efficacy. Experiments on diverse language generation benchmarks show that A-CFG yields substantial improvements over standard CFG, achieving, for instance, a 3.9 point gain on GPQA. Our work highlights the benefit of dynamically adapting guidance mechanisms to model uncertainty in iterative generation.
Chinese: 自适应无分类器引导(A-CFG)通过基于模型实时预测置信度动态调整无条件输入,在迭代生成过程中聚焦于不确定标记进行校正引导,从而显著提升生成性能,如在GPQA上实现了3.9分的提升。
English: Adaptive Classifier-Free Guidance (A-CFG) improves generative model controllability by dynamically adjusting unconditional inputs based on the model's real-time predictive confidence, focusing corrective guidance on uncertain tokens during iterative generation and achieving significant performance gains, such as a 3.9-point increase on GPQA.
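
The per-step procedure described above (find low-confidence generated tokens, re-mask them to form the unconditional input, then apply CFG) can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the `MASK` token, the `ratio` parameter, the use of the maximum softmax probability as the confidence score, and the particular CFG formula are all assumptions, and the model forward passes are left out.

```python
import math

MASK = "<mask>"  # hypothetical mask token


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def low_confidence_positions(cond_logits, tokens, ratio):
    """Positions of already-generated tokens with the lowest max probability."""
    generated = [i for i, t in enumerate(tokens) if t != MASK]
    confidence = {i: max(softmax(cond_logits[i])) for i in generated}
    k = int(len(generated) * ratio)
    return sorted(generated, key=lambda i: confidence[i])[:k]


def build_unconditional_input(tokens, positions):
    """Re-mask the chosen positions to form the dynamic unconditional input."""
    remask = set(positions)
    return [MASK if i in remask else t for i, t in enumerate(tokens)]


def cfg_combine(cond_logits, uncond_logits, scale):
    """One common CFG form: uncond + scale * (cond - uncond), per logit."""
    return [
        [u + scale * (c - u) for c, u in zip(c_row, u_row)]
        for c_row, u_row in zip(cond_logits, uncond_logits)
    ]


# Toy step: position 1 has nearly flat logits, so it is the least confident.
tokens = ["a", "b", MASK, "c"]
cond_logits = [[5.0, 0.0], [0.1, 0.0], [0.0, 0.0], [2.0, 0.0]]
positions = low_confidence_positions(cond_logits, tokens, ratio=0.34)
uncond_input = build_unconditional_input(tokens, positions)
guided = cfg_combine([[1.0, 2.0]], [[0.0, 1.0]], scale=2.0)
```

In a full loop, the model would be run once on the current sequence and once on `uncond_input`, and the two sets of logits would be passed to `cfg_combine` before the next unmasking step.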

Authors:Bingguang Hao, Maolin Wang, Zengzhuang Xu, Cunyin Peng, Yicheng Chen, Xiangyu Zhao, Jinjie Gu, Chenyi Zhuang
Title: FunReason: Enhancing Large Language Models' Function Calling via Self-Refinement Multiscale Loss and Automated Data Refinement
Abstract:
The integration of large language models (LLMs) with function calling has emerged as a crucial capability for enhancing their practical utility in real-world applications. However, effectively combining reasoning processes with accurate function execution remains a significant challenge. Traditional training approaches often struggle to balance the detailed reasoning steps with the precision of function calls, leading to suboptimal performance. To address these limitations, we introduce FunReason, a novel framework that enhances LLMs' function calling capabilities through an automated data refinement strategy and a Self-Refinement Multiscale Loss (SRML) approach. FunReason leverages LLMs' natural reasoning abilities to generate high-quality training examples, focusing on query parseability, reasoning coherence, and function call precision. The SRML approach dynamically balances the contribution of reasoning processes and function call accuracy during training, addressing the inherent trade-off between these two critical aspects. FunReason achieves performance comparable to GPT-4o while effectively mitigating catastrophic forgetting during fine-tuning. FunReason provides a comprehensive solution for enhancing LLMs' function calling capabilities by introducing a balanced training methodology and a data refinement pipeline. For code and dataset, please refer to our GitHub repository at https://github.com/BingguangHao/FunReason.
中文: FunReason是一个通过自动化数据优化和自优化多尺度损失方法增强大语言模型函数调用能力的新框架,在保持推理过程与函数调用准确性平衡的同时,实现了与GPT-4o相当的性能。
English: FunReason is a novel framework that enhances large language models' function calling capabilities through automated data refinement and a Self-Refinement Multiscale Loss approach, achieving performance comparable to GPT-4o while balancing reasoning processes with function call accuracy.

Authors:Shubham Gandhi, Atharva Naik, Yiqing Xie, Carolyn Rose
Title: An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation
Abstract:
We study cost-efficient collaboration between strong and weak language models for repository-level code generation, where the weak model handles simpler tasks at lower cost, and the most challenging tasks are delegated to the strong model. While many works propose architectures for this task, few analyze performance relative to cost. We evaluate a broad spectrum of collaboration strategies (context-based, pipeline-based, and dynamic) on GitHub issue resolution. Our most effective collaborative strategy achieves equivalent performance to the strong model while reducing the cost by 40%. Based on our findings, we offer actionable guidelines for choosing collaboration strategies under varying budget and performance constraints. Our results show that strong-weak collaboration substantially boosts the weak model's performance at a fraction of the cost, with pipeline and context-based methods being the most efficient. We release the code for our work at https://github.com/shubhamrgandhi/codegen-strong-weak-collab.
Chinese Summary: 研究表明,强弱语言模型的策略性协作能在保持强模型性能的同时降低40%成本,其中流水线和基于上下文的方法对仓库级代码生成最为高效。
English Summary: Our research demonstrates that strategic collaboration between strong and weak language models can achieve performance equivalent to using only the strong model while reducing costs by 40%, with pipeline and context-based methods proving most efficient for repository-level code generation.

Authors:Kai Sun, Yushi Bai, Zhen Yang, Jiajie Zhang, Ji Qi, Lei Hou, Juanzi Li
Title: MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models
Abstract:
Large Multimodal Models (LMMs) typically build on ViTs (e.g., CLIP), yet their training with simple random in-batch negatives limits the ability to capture fine-grained visual differences, particularly in geometric scenarios. To address this challenge, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train a vision encoder (CLIP) using our hard negative training method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further conduct ablation studies to analyze three key factors: hard negative types, the efficiency of image-based negatives, and training configurations. These analyses yield important insights into optimizing the training pipeline of the vision encoder for fine-grained geometric reasoning tasks. The project repository is at https://github.com/THU-KEG/MMGeoLM.
Chinese: 本文提出MMCLIP框架,通过结合图像和文本的困难负样本对比学习增强视觉编码器对几何细节的捕捉能力,最终训练的MMGeoLM模型在几何推理基准上显著优于开源模型,并能与GPT-4o相媲美。
English: This paper introduces MMCLIP, a hard negative contrastive learning framework that enhances vision encoders by combining image-based and text-based negatives to improve fine-grained geometric reasoning, resulting in the MMGeoLM model which outperforms open-source models and rivals GPT-4o on benchmarks.
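
To illustrate why hard negatives give a stronger training signal than easy ones, here is a minimal sketch of a contrastive (InfoNCE-style) loss for one anchor, one positive, and an explicit list of negatives. It is a generic formulation, not MMCLIP's training objective: the cosine similarity, the temperature value, and the toy 2-D embeddings are assumptions chosen only for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.07):
    """-log softmax score of the positive among the positive and all negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy 2-D embeddings: the hard negative lies close to the anchor.
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [-1.0, 0.0]
hard_negative = [0.9, 0.1]

easy_loss = info_nce(anchor, positive, [easy_negative])
hard_loss = info_nce(anchor, positive, [hard_negative])
```

The easy negative contributes almost nothing to the loss, whereas the hard negative keeps the loss large until the encoder learns to separate it from the positive. This is the intuition behind replacing random in-batch negatives with deliberately perturbed ones.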

Authors:Cédric Goemaere, Gaspard Oliviers, Rafal Bogacz, Thomas Demeester
Title: Error Optimization: Overcoming Exponential Signal Decay in Deep Predictive Coding Networks
Abstract:
Predictive Coding (PC) offers a biologically plausible alternative to backpropagation for neural network training, yet struggles with deeper architectures. This paper identifies the root cause: an inherent signal decay problem where gradients attenuate exponentially with depth, becoming computationally negligible due to numerical precision constraints. To address this fundamental limitation, we introduce Error Optimization (EO), a novel reparameterization that preserves PC's theoretical properties while eliminating signal decay. By optimizing over prediction errors rather than states, EO enables signals to reach all layers simultaneously and without attenuation, converging orders of magnitude faster than standard PC. Experiments across multiple architectures and datasets demonstrate that EO matches backpropagation's performance even for deeper models where conventional PC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling biologically-inspired learning to deeper architectures on digital hardware and beyond.
Chinese: 本文提出了误差优化(EO),这是对预测编码的一种重新参数化,解决了标准预测编码中的指数级信号衰减问题,使其在深层架构中收敛更快且性能与反向传播相当。
English: This paper introduces Error Optimization (EO), a reparameterization of Predictive Coding that overcomes the exponential signal decay of standard PC, enabling faster convergence and matching backpropagation's performance in deep architectures.

Authors:Cédric Goemaere, Gaspard Oliviers, Rafal Bogacz, Thomas Demeester
Title: ePC: Overcoming Exponential Signal Decay in Deep Predictive Coding Networks
Abstract:
Predictive Coding (PC) offers a biologically plausible alternative to backpropagation for neural network training, yet struggles with deeper architectures. This paper identifies the root cause and provides a principled solution. We uncover that the canonical state-based formulation of PC (sPC) is, by design, deeply inefficient on digital hardware, due to an inherent signal decay problem that scales exponentially with depth. To address this fundamental limitation, we introduce a novel reparameterization of PC, named error-based PC (ePC), which does not suffer from signal decay. By optimizing over prediction errors rather than states, ePC enables signals to reach all layers simultaneously and unattenuated, converging orders of magnitude faster than sPC. Experiments across multiple architectures and datasets demonstrate that ePC matches backpropagation's performance even for deeper models where sPC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling bio-inspired learning to deeper architectures on digital hardware and beyond.
Chinese: 本文提出了基于误差的预测编码(ePC),通过重新参数化解决了状态预测编码中的信号衰减问题,使其在深层架构中收敛更快且性能与反向传播相当。
English: This paper introduces error-based Predictive Coding (ePC), a reparameterization that overcomes the signal decay issue in state-based PC, enabling faster convergence and matching backpropagation's performance in deep architectures.

Authors:Jin Zhu, Jingyi Li, Hongyi Zhou, Yinan Lin, Zhenhua Lin, Chengchun Shi
Title: Balancing Interference and Correlation in Spatial Experimental Designs: A Causal Graph Cut Approach
Abstract:
This paper focuses on the design of spatial experiments to optimize the amount of information derived from the experimental data and enhance the accuracy of the resulting causal effect estimator. We propose a surrogate function for the mean squared error (MSE) of the estimator, which facilitates the use of classical graph cut algorithms to learn the optimal design. Our proposal offers three key advances: (1) it accommodates moderate to large spatial interference effects; (2) it adapts to different spatial covariance functions; (3) it is computationally efficient. Theoretical results and numerical experiments based on synthetic environments and a dispatch simulator that models a city-scale ridesharing market, further validate the effectiveness of our design. A python implementation of our method is available at https://github.com/Mamba413/CausalGraphCut.
中文: 本文提出了一种新的均方误差替代函数,通过图割算法优化空间实验设计,有效处理空间干扰并适应不同协方差结构,从而显著提高因果效应估计的准确性。
English: This paper introduces a novel surrogate function for the mean squared error to optimize spatial experimental designs, enabling efficient graph cut algorithms to enhance estimator accuracy while accommodating spatial interference and various covariance structures.

Authors:Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao, Yunqi Qiu, Huawei Shen, Xueqi Cheng
Title: Inference-time Alignment in Continuous Space
Abstract:
Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility. Existing methods rely on generating multiple responses from the base policy for search using a reward model, which can be considered as searching in a discrete response space. However, these methods struggle to explore informative candidates when the base policy is weak or the candidate set is small, resulting in limited effectiveness. In this paper, to address this problem, we propose Simple Energy Adaptation (SEA), a simple yet effective algorithm for inference-time alignment. In contrast to expensive search over the discrete space, SEA directly adapts original responses from the base policy toward the optimal one via gradient-based sampling in continuous latent space. Specifically, SEA formulates inference as an iterative optimization procedure on an energy function over actions in the continuous space defined by the optimal policy, enabling simple and effective alignment. For instance, despite its simplicity, SEA outperforms the second-best baseline with a relative improvement of up to 77.51% on AdvBench and 16.36% on MATH. Our code is publicly available at https://github.com/yuanyige/sea.
Chinese: 本文提出简单能量适应(SEA)方法,通过在连续潜空间中进行梯度采样来优化基础策略的原始响应,实现了推理时的高效对齐,在AdvBench和MATH基准测试中显著优于现有基线方法。
English: This paper introduces Simple Energy Adaptation (SEA), a gradient-based sampling method that aligns large language models during inference by optimizing responses in continuous latent space, significantly outperforming existing baselines on benchmarks like AdvBench and MATH.
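
The abstract describes iterative, gradient-based sampling on an energy function in a continuous space. One standard sampler of that kind is Langevin dynamics, sketched below on a toy quadratic energy. Using Langevin dynamics here is an assumption for illustration, not a claim about SEA's exact update rule. The function names, step size, and the quadratic energy (standing in for a negative reward) are all hypothetical.

```python
import math
import random


def langevin_refine(x0, grad_energy, steps=100, step_size=0.1, noise_scale=1.0, seed=0):
    """Iterate x <- x - step_size * grad E(x) + noise_scale * sqrt(2 * step_size) * eps."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        grad = grad_energy(x)
        x = [
            xi - step_size * gi + noise_scale * math.sqrt(2 * step_size) * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, grad)
        ]
    return x


# Toy energy E(x) = 0.5 * ||x - target||^2, whose gradient is x - target.
target = [1.0, -2.0]


def grad_energy(x):
    return [xi - ti for xi, ti in zip(x, target)]


# With the noise switched off, the update reduces to plain gradient descent.
refined = langevin_refine([0.0, 0.0], grad_energy, steps=200, noise_scale=0.0)

# With noise, the same seed reproduces the same sample.
sample_a = langevin_refine([0.0, 0.0], grad_energy, steps=50, seed=1)
sample_b = langevin_refine([0.0, 0.0], grad_energy, steps=50, seed=1)
```

The contrast with discrete search is that a single initial response representation is moved toward low-energy regions step by step, instead of drawing many candidates and picking the best one.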

Authors:Florian Eichin, Yupei Du, Philipp Mondorf, Barbara Plank, Michael A. Hedderich
Title: Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior
Abstract:
Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, these approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all three perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to more realistic training settings. Empirically, we find that both a CNN and a Transformer model are replicated accurately by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. We show their effectiveness in parameter pruning that is comparable to existing methods, reinforcing their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Among other things, our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.
Chinese: 本文提出ExPLAIND统一框架,整合模型组件、数据和训练轨迹视角,为模型行为和训练动态提供理论支持的解读,并通过分析Transformer中的"顿悟"现象等应用验证其有效性。
English: The paper introduces ExPLAIND, a unified framework that integrates model components, data, and training trajectory perspectives to provide theoretically grounded interpretations of model behavior and training dynamics, validated through applications like analyzing Grokking in Transformers.

Authors:Florian Eichin, Yupei Du, Philipp Mondorf, Maria Matveev, Barbara Plank, Michael A. Hedderich
Title: ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior
Abstract:
Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, such approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all these perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to realistic settings like AdamW. We empirically validate that a CNN and a Transformer are accurately replicated by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. Their effectiveness for parameter pruning is comparable to existing methods, demonstrating their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.
Chinese: 本文提出ExPLAIND统一框架,整合模型组件、数据和训练轨迹视角,为模型行为和训练动态提供理论支持的解读,并通过分析Transformer中的"顿悟"现象等应用验证其有效性。
English: The paper introduces ExPLAIND, a unified framework that integrates model components, data, and training trajectory perspectives to provide theoretically grounded interpretations of model behavior and training dynamics, validated through applications like analyzing Grokking in Transformers.

Authors:Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao, Bolin Ding, Bingbing Xu
Title: Incentivizing Strong Reasoning from Weak Supervision
Abstract:
Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. The code is publicly available at https://github.com/yuanyige/w2sr.
中文摘要:本研究证明,利用显著弱化模型的监督可有效提升大型语言模型的推理能力,以远低于强化学习的成本实现了后者94%的性能增益。
English summary: This study demonstrates that using supervision from significantly weaker models can effectively enhance the reasoning capabilities of large language models, achieving nearly 94% of the performance gains of expensive reinforcement learning methods at a much lower cost.

Authors:Hongsong Wang, Ao Sun, Jie Gui, Liang Wang
Title: Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay
Abstract:
Gesture recognition is an important research area in the field of computer vision. Most gesture recognition efforts focus on closed-set scenarios, thereby limiting the capacity to effectively handle unseen or novel gestures. We aim to address class-incremental gesture recognition, which entails the ability to accommodate new and previously unseen gestures over time. Specifically, we introduce a Prototype-Guided Pseudo Feature Replay (PGPFR) framework for data-free class-incremental gesture recognition. This framework comprises four components: Pseudo Feature Generation with Batch Prototypes (PFGBP), Variational Prototype Replay (VPR) for old classes, Truncated Cross-Entropy (TCE) for new classes, and Continual Classifier Re-Training (CCRT). To tackle the issue of catastrophic forgetting, the PFGBP dynamically generates a diversity of pseudo features in an online manner, leveraging class prototypes of old classes along with batch class prototypes of new classes. Furthermore, the VPR enforces consistency between the classifier's weights and the prototypes of old classes, leveraging class prototypes and covariance matrices to enhance robustness and generalization capabilities. The TCE mitigates the impact of domain differences of the classifier caused by pseudo features. Finally, the CCRT training strategy is designed to prevent overfitting to new classes and ensure the stability of features extracted from old classes. Extensive experiments conducted on two widely used gesture recognition datasets, namely SHREC 2017 3D and EgoGesture 3D, demonstrate that our approach outperforms existing state-of-the-art methods by 11.8% and 12.8% in terms of mean global accuracy, respectively. The code is available at https://github.com/sunao-101/PGPFR-3/.
中文: 本研究提出了一种原型引导的伪特征回放框架,用于解决类别增量手势识别问题,有效防止灾难性遗忘,并在基准数据集上以超过11%的准确率优势超越了现有最优方法。
English: This study introduces a Prototype-Guided Pseudo Feature Replay framework to address class-incremental gesture recognition, effectively preventing catastrophic forgetting and outperforming state-of-the-art methods by over 11% in accuracy on benchmark datasets.

Authors:Chang Liu, Haomin Zhang, Shiyu Xia, Zihao Chen, Chaofan Ding, Xin Yue, Huizhe Chen, Xinhan Di
Title: Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
Abstract:
Generating high-quality piano audio from video requires precise synchronization between visual cues and musical output, ensuring accurate semantic and temporal alignment. However, existing evaluation datasets do not fully capture the intricate synchronization required for piano music generation. A comprehensive benchmark is essential for two primary reasons: (1) existing metrics fail to reflect the complexity of video-to-piano music interactions, and (2) a dedicated benchmark dataset can provide valuable insights to accelerate progress in high-quality piano music generation. To address these challenges, we introduce the CoP Benchmark Dataset, a fully open-sourced, multimodal benchmark designed specifically for video-guided piano music generation. The proposed Chain-of-Perform (CoP) benchmark offers several compelling features: (1) detailed multimodal annotations, enabling precise semantic and temporal alignment between video content and piano audio via step-by-step Chain-of-Perform guidance; (2) a versatile evaluation framework for rigorous assessment of both general-purpose and specialized video-to-piano generation tasks; and (3) full open-sourcing of the dataset, annotations, and evaluation protocols. The dataset is publicly available at https://github.com/acappemin/Video-to-Audio-and-Piano, with a continuously updated leaderboard to promote ongoing research in this domain.
中文摘要:CoP基准数据集作为一个完全开源的多模态基准被提出,旨在解决视频生成钢琴音乐中精确同步评估的缺失,其具备详细的多模态注释和灵活评估框架。
English Summary: The CoP Benchmark Dataset is introduced as a fully open-sourced, multimodal benchmark designed to address the lack of precise synchronization evaluation in video-to-piano music generation, featuring detailed annotations and a versatile evaluation framework.

Authors:Xueyi Liu, Zuodong Zhong, Yuxin Guo, Yun-Fu Liu, Zhiguo Su, Qichao Zhang, Junli Wang, Yinfeng Gao, Yupeng Zheng, Qiao Lin, Huiyong Chen, Dongbin Zhao
Title: ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving
Abstract:
Owing to their powerful vision-language reasoning and generalization abilities, multimodal large language models (MLLMs) have garnered significant attention in the field of end-to-end (E2E) autonomous driving. However, their application to closed-loop systems remains underexplored, and current MLLM-based methods have not shown clear superiority over mainstream E2E imitation learning approaches. In this work, we propose ReasonPlan, a novel MLLM fine-tuning framework designed for closed-loop driving through holistic reasoning with a self-supervised Next Scene Prediction task and a supervised Decision Chain-of-Thought process. This dual mechanism encourages the model to align visual representations with actionable driving context, while promoting interpretable and causally grounded decision making. We curate a planning-oriented decision reasoning dataset, namely PDR, comprising 210k diverse and high-quality samples. Our method outperforms the mainstream E2E imitation learning method by a large margin of 19% L2 and 16.1 driving score on the Bench2Drive benchmark. Furthermore, ReasonPlan demonstrates strong zero-shot generalization on the unseen DOS benchmark, highlighting its adaptability in handling zero-shot corner cases. Code and dataset will be available at https://github.com/Liuxueyi/ReasonPlan.
中文:ReasonPlan框架通过自监督下一场景预测和监督决策思维链的双重机制,增强了多模态大语言模型在闭环自动驾驶中的应用,显著超越主流方法并展现出强大的零样本泛化能力。
English: The ReasonPlan framework enhances multimodal large language models for closed-loop autonomous driving by integrating self-supervised next scene prediction and supervised decision chain-of-thought, achieving significant performance improvements over mainstream methods and demonstrating strong zero-shot generalization.

Authors:Qiong Zhang, Yan Shuo Tan, Qinglong Tian, Pengfei Li
Title: TabPFN: One Model to Rule Them All?
Abstract:
Hollmann et al. (Nature 637 (2025) 319-326) recently introduced TabPFN, a transformer-based deep learning model for regression and classification on tabular data, which they claim "outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time." Furthermore, they have called TabPFN a "foundation model" for tabular data, as it can support "data generation, density estimation, learning reusable embeddings and fine-tuning". If these statements are well-supported, TabPFN may have the potential to supersede existing modeling approaches on a wide range of statistical tasks, mirroring a similar revolution in other areas of artificial intelligence that began with the advent of large language models. In this paper, we provide a tailored explanation of how TabPFN works for a statistics audience, by emphasizing its interpretation as approximate Bayesian inference. We also provide more evidence of TabPFN's "foundation model" capabilities: We show that an out-of-the-box application of TabPFN vastly outperforms specialized state-of-the-art methods for semi-supervised parameter estimation, prediction under covariate shift, and heterogeneous treatment effect estimation. We further show that TabPFN can outperform LASSO at sparse regression and can break a robustness-efficiency trade-off in classification. All experiments can be reproduced using the code provided at https://github.com/qinglong-tian/tabpfn_study.
中文: TabPFN是一种基于Transformer的深度学习模型,专为表格数据设计,在回归和分类任务中表现卓越,以更少训练时间大幅超越现有方法,并具备数据生成和微调等基础模型能力。
English: TabPFN is a transformer-based deep learning model for tabular data that excels in regression and classification tasks, outperforming existing methods with less training time and demonstrating foundation model capabilities like data generation and fine-tuning.

Authors:Bilel Cherif, Tamas Bisztray, Richard A. Dubniczky, Aaesha Aldahmani, Saeed Alshehhi, Norbert Tihanyi
Title: DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
Abstract:
Digital Forensics and Incident Response (DFIR) involves analyzing digital evidence to support legal investigations. Large Language Models (LLMs) offer new opportunities in DFIR tasks such as log analysis and memory forensics, but their susceptibility to errors and hallucinations raises concerns in high-stakes contexts. Despite growing interest, there is no comprehensive benchmark to evaluate LLMs across both theoretical and practical DFIR domains. To address this gap, we present DFIR-Metric, a benchmark with three components: (1) Knowledge Assessment: a set of 700 expert-reviewed multiple-choice questions sourced from industry-standard certifications and official documentation; (2) Realistic Forensic Challenges: 150 CTF-style tasks testing multi-step reasoning and evidence correlation; and (3) Practical Analysis: 500 disk and memory forensics cases from the NIST Computer Forensics Tool Testing Program (CFTT). We evaluated 14 LLMs using DFIR-Metric, analyzing both their accuracy and consistency across trials. We also introduce a new metric, the Task Understanding Score (TUS), designed to more effectively evaluate models in scenarios where they achieve near-zero accuracy. This benchmark offers a rigorous, reproducible foundation for advancing AI in digital forensics. All scripts, artifacts, and results are available on the project website at https://github.com/DFIR-Metric.
中文:DFIR-Metric是一个综合性基准,通过知识评估、实战挑战和实际案例分析来严格评估大语言模型在数字取证中的表现,填补了该领域缺乏标准化测试的空白。
English: DFIR-Metric is a comprehensive benchmark designed to rigorously evaluate Large Language Models in digital forensics through knowledge assessment, realistic challenges, and practical analysis, addressing the lack of standardized testing in the field.

Authors:Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin
Title: MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
Abstract:
Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often incur prohibitively high evaluation costs, such as testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which makes evaluation inefficient. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. In an empirical analysis of over 60 LLMs, MiniLongBench reduces the average evaluation cost to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, our MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs. See https://github.com/MilkThink-Lab/MiniLongBench for our code, data and tutorial.
Chinese: 本文提出MiniLongBench,通过压缩LongBench基准,将评估成本降至原版的4.5%,同时保持0.97的排名相关性,为大型语言模型的长文本理解研究提供了高效评估方案。
English: This paper introduces MiniLongBench, a compressed version of the LongBench benchmark that reduces evaluation costs to just 4.5% of the original while maintaining a 0.97 rank correlation, enabling more efficient long context understanding research in large language models.
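
Note on the rank correlation figure: the 0.97 value measures how well the reduced benchmark preserves the ordering of models produced by the full one. The abstract does not say which coefficient is used, so the sketch below assumes Spearman's rank correlation, computed as the Pearson correlation of average ranks. The function names and the score lists are hypothetical and only illustrate the calculation; inputs are assumed to be non-constant.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical per-model scores on the full and the reduced benchmark.
full_scores = [41.2, 35.7, 48.9, 29.4]
mini_scores = [40.1, 36.3, 47.5, 30.8]
print(spearman(full_scores, mini_scores))
```

A value close to 1 means the reduced benchmark ranks the models in nearly the same order as the full one, which is the property the abstract reports.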

Authors:Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, Fei Wang
Title: UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space
Abstract:
Diffusion models have shown great potential in generating realistic image detail. However, adapting these models to video super-resolution (VSR) remains challenging due to their inherent stochasticity and lack of temporal modeling. Previous methods have attempted to mitigate this issue by incorporating motion information and temporal layers. However, unreliable motion estimation from low-resolution videos and costly multiple sampling steps with deep temporal layers limit them to short sequences. In this paper, we propose UltraVSR, a novel framework that enables ultra-realistic and temporally-coherent VSR through an efficient one-step diffusion space. A central component of UltraVSR is the Degradation-aware Reconstruction Scheduling (DRS), which estimates a degradation factor from the low-resolution input and transforms the iterative denoising process into a single-step reconstruction from low-resolution to high-resolution videos. To ensure temporal consistency, we propose a lightweight Recurrent Temporal Shift (RTS) module, including an RTS-convolution unit and an RTS-attention unit. By partially shifting feature components along the temporal dimension, it enables effective propagation, fusion, and alignment across frames without explicit temporal layers. The RTS module is integrated into a pretrained text-to-image diffusion model and is further enhanced through Spatio-temporal Joint Distillation (SJD), which improves temporal coherence while preserving realistic details. Additionally, we introduce a Temporally Asynchronous Inference (TAI) strategy to capture long-range temporal dependencies under limited memory constraints. Extensive experiments show that UltraVSR achieves state-of-the-art performance, both qualitatively and quantitatively, in a single sampling step. Code is available at https://github.com/yongliuy/UltraVSR.
中文: UltraVSR通过退化感知重建调度和轻量级循环时序偏移模块,在单步扩散框架中实现了超逼真且时序一致的视频超分辨率。
English: UltraVSR introduces a one-step diffusion framework with Degradation-aware Reconstruction Scheduling and a lightweight Recurrent Temporal Shift module to achieve ultra-realistic, temporally-coherent video super-resolution efficiently.
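
Note on temporal shifting: the abstract describes the RTS module as partially shifting feature components along the temporal dimension. The sketch below shows only that general idea on plain Python lists, with each frame represented as a flat list of channel values. It is an assumption-laden illustration, not the paper's RTS-convolution or RTS-attention unit: the function name, the one-frame forward shift, and the zero padding for the first frame are all hypothetical choices.

```python
def temporal_shift(frames, num_shifted):
    """Shift the first num_shifted channels one frame forward in time.

    frames is a list of per-frame channel lists. Frame t receives the
    first num_shifted channels of frame t-1 and keeps its own remaining
    channels. The first frame is zero-padded (an assumption).
    """
    shifted = []
    for t, frame in enumerate(frames):
        if t == 0:
            head = [0.0] * num_shifted
        else:
            head = frames[t - 1][:num_shifted]
        shifted.append(list(head) + list(frame[num_shifted:]))
    return shifted


# Three frames with four channels each; half of the channels are shifted.
frames = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(temporal_shift(frames, 2))
```

After the shift, every frame mixes its own channels with channels from the previous frame, so a per-frame layer applied afterwards sees information from neighbouring frames without any dedicated temporal layer.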

Authors:Jihyung Lee, Jin-Seop Lee, Jaehoon Lee, YunSeok Choi, Jee-Hyong Lee
Title: DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph
Abstract:
Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationship between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. The code is available at https://github.com/jjklle/DCG-SQL.
Chinese: 本文提出了一种新方法,通过构建深度上下文模式链接图来有效检索演示并生成SQL查询,在Spider基准测试中证明该方法能提升超大规模和小型大语言模型的SQL生成性能与效率。
English: This paper introduces a novel approach that constructs a Deep Contextual Schema Link Graph to effectively retrieve demonstrations and generate SQL queries, improving performance and efficiency across both hyper-scaled and small LLMs as demonstrated on the Spider benchmark.

Authors:Herbert Woisetschläger, Ryan Zhang, Shiqiang Wang, Hans-Arno Jacobsen
Title: MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees
Abstract:
Open-weight large language model (LLM) zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of 2x cost savings compared to existing LLM routing techniques.
Chinese: MESS+ 是一种随机优化算法,通过实时学习模型性能并解决单请求优化问题,在保证严格服务等级协议合规的同时,实现了大型语言模型请求的成本最优路由。
English: MESS+ is a stochastic optimization algorithm that enables cost-effective routing of large language model requests while ensuring strict compliance with service level agreements by learning model performance in real-time and solving per-request optimization problems.
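
Note on virtual queues: the abstract mentions a per-request optimization combined with virtual queues. The sketch below shows a generic drift-plus-penalty rule from Lyapunov optimization, which is one common way such a queue can trade cost against a long-run constraint. It is not the MESS+ algorithm itself. The function names, the trade-off weight `v`, the target satisfaction rate `alpha`, and the model entries are all hypothetical.

```python
def choose_model(models, queue, v):
    """Pick the model minimizing v * cost - queue * p_satisfy.

    A larger queue (accumulated constraint violation) pushes the choice
    toward models with a higher predicted satisfaction probability.
    """
    return min(models, key=lambda m: v * m["cost"] - queue * m["p_satisfy"])


def update_queue(queue, alpha, satisfied):
    """Grow the queue when a request falls short of the target rate alpha."""
    return max(0.0, queue + alpha - (1.0 if satisfied else 0.0))


# Hypothetical model zoo with per-request cost and predicted satisfaction.
models = [
    {"name": "small", "cost": 1.0, "p_satisfy": 0.60},
    {"name": "large", "cost": 5.0, "p_satisfy": 0.95},
]

queue = 0.0
for satisfied in [False, False, True]:
    chosen = choose_model(models, queue, v=1.0)
    queue = update_queue(queue, alpha=0.9, satisfied=satisfied)
    print(chosen["name"], round(queue, 2))
```

With an empty queue the rule picks the cheapest model. As unsatisfied requests accumulate, the queue grows and the rule shifts toward the more expensive but more reliable model.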

Authors:Huan Zhang, Fan Lyu, Shuyu Dong, Shenghua Fan, Yujin Zheng, Dingwen Wang
Title: Beyond Freezing: Sparse Tuning Enhances Plasticity in Continual Learning with Pre-Trained Models
Abstract:
Continual Learning with Pre-trained Models holds great promise for efficient adaptation across sequential tasks. However, most existing approaches freeze PTMs and rely on auxiliary modules like prompts or adapters, limiting model plasticity and leading to suboptimal generalization when facing significant distribution shifts. While full fine-tuning can improve adaptability, it risks disrupting crucial pre-trained knowledge. In this paper, we propose Mutual Information-guided Sparse Tuning (MIST), a plug-and-play method that selectively updates a small subset of PTM parameters, less than 5%, based on sensitivity to mutual information objectives. MIST enables effective task-specific adaptation while preserving generalization. To further reduce interference, we introduce strong sparsity regularization by randomly dropping gradients during tuning, resulting in fewer than 0.5% of parameters being updated per step. Applied before standard freeze-based methods, MIST consistently boosts performance across diverse continual learning benchmarks. Experiments show that integrating our method into multiple baselines yields significant performance gains. Our code is available at https://github.com/zhwhu/MIST.
中文: 本文提出互信息指导的稀疏调优方法MIST,通过选择性更新预训练模型中不足5%的参数,在持续学习场景下既能有效适应任务又能保持模型的泛化能力。
English: This paper introduces Mutual Information-guided Sparse Tuning (MIST), a plug-and-play method that selectively updates under 5% of pre-trained model parameters to enhance task adaptation while preserving generalization in continual learning scenarios.
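
Note on sparse updates with gradient dropping: the abstract describes updating only a small selected subset of parameters and randomly dropping gradients so that even fewer change per step. The sketch below illustrates that update pattern with a plain SGD step on Python lists. It does not implement the mutual-information-based selection: the `selected` indices are taken as given, and the function name, drop probability, and learning rate are hypothetical.

```python
import random


def sparse_step(params, grads, selected, drop_prob, lr, rng):
    """Apply an SGD step to selected indices, dropping each gradient with drop_prob.

    Returns the new parameter list and the indices that were actually updated.
    Parameters outside `selected` are never modified.
    """
    new_params = list(params)
    updated = []
    for i in selected:
        if rng.random() >= drop_prob:  # keep this gradient
            new_params[i] -= lr * grads[i]
            updated.append(i)
    return new_params, updated


# 100 parameters, 4 of them (4%) pre-selected for tuning.
params = [1.0] * 100
grads = [0.5] * 100
selected = [3, 17, 42, 58]
rng = random.Random(0)
new_params, updated = sparse_step(params, grads, selected, drop_prob=0.9, lr=0.1, rng=rng)
print(len(updated), "of", len(params), "parameters updated this step")
```

With a drop probability of 0.9 and a 4% selection, the expected fraction updated per step is about 0.4%, which is in the same range as the per-step figure quoted in the abstract.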

Authors:Alejandro Carrasco, Victor Rodriguez-Fernandez, Richard Linares
Title: Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program
Abstract:
Recent trends are emerging in the use of Large Language Models (LLMs) as autonomous agents that take actions based on the content of the user text prompts. We intend to apply these concepts to the field of Control in space, enabling LLMs to play a significant role in the decision-making process for autonomous satellite operations. As a first step towards this goal, we have developed a pure LLM-based solution for the Kerbal Space Program Differential Games (KSPDG) challenge, a public software design competition where participants create autonomous agents for maneuvering satellites involved in non-cooperative space operations, running on the KSP game engine. Our approach leverages prompt engineering, few-shot prompting, and fine-tuning techniques to create an effective LLM-based agent that ranked 2nd in the competition. To the best of our knowledge, this work pioneers the integration of LLM agents into space research. The project comprises several open repositories to facilitate replication and further research. The codebase is accessible on GitHub (https://github.com/ARCLab-MIT/kspdg), while the trained models and datasets are available on Hugging Face (https://huggingface.co/OhhTuRnz). Additionally, experiment tracking and detailed results can be reviewed on Weights & Biases (https://wandb.ai/carrusk/huggingface).
中文: 本研究开创性地将大型语言模型作为自主代理应用于空间控制领域,通过提示工程和微调技术在卫星机动竞赛中获得第二名。
English: This research pioneers the use of Large Language Models as autonomous agents for space control, achieving second place in a satellite maneuvering competition through prompt engineering and fine-tuning techniques.

Authors:David Schneider, Zdravko Marinov, Rafael Baur, Zeyun Zhong, Rodi Düger, Rainer Stiefelhagen
Title: OmniFall: A Unified Staged-to-Wild Benchmark for Human Fall Detection
Abstract:
Current video-based fall detection research mostly relies on small, staged datasets with significant domain biases concerning background, lighting, and camera setup resulting in unknown real-world performance. We introduce OmniFall, unifying eight public fall detection datasets (roughly 14 h of recordings, roughly 42 h of multiview data, 101 subjects, 29 camera views) under a consistent ten-class taxonomy with standardized evaluation protocols. Our benchmark provides complete video segmentation labels and enables fair cross-dataset comparison previously impossible with incompatible annotation schemes. For real-world evaluation we curate OOPS-Fall from genuine accident videos and establish a staged-to-wild protocol measuring generalization from controlled to uncontrolled environments. Experiments with frozen pre-trained backbones such as I3D or VideoMAE reveal significant performance gaps between in-distribution and in-the-wild scenarios, highlighting critical challenges in developing robust fall detection systems. OmniFall Dataset at https://huggingface.co/datasets/simplexsigil2/omnifall , Code at https://github.com/simplexsigil/omnifall-experiments
中文: 当前跌倒检测研究因数据集有限和领域偏差而受限,OmniFall通过整合多个数据集并建立标准化评估框架,揭示了受控与真实场景间的显著性能差距。
English: Current fall detection research suffers from limited datasets and domain biases, so OmniFall unifies multiple datasets under a standardized framework and introduces a real-world evaluation protocol, revealing significant performance gaps between controlled and uncontrolled environments.

Authors:Chao Huang, Benfeng Wang, Jie Wen, Chengliang Liu, Wei Wang, Li Shen, Xiaochun Cao
Title: Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought
Abstract:
Recent advancements in the reasoning capability of Multimodal Large Language Models (MLLMs) demonstrate their effectiveness in tackling complex visual tasks. However, existing MLLM-based Video Anomaly Detection (VAD) methods remain limited to shallow anomaly descriptions without deep reasoning. In this paper, we propose a new task named Video Anomaly Reasoning (VAR), which aims to enable deep analysis and understanding of anomalies in the video by requiring MLLMs to think explicitly before answering. To this end, we propose Vad-R1, an end-to-end MLLM-based framework for VAR. Specifically, we design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies, guiding the MLLM to reason about anomalies step by step. Based on the structured P2C-CoT, we construct Vad-Reasoning, a dedicated dataset for VAR. Furthermore, we propose an improved reinforcement learning algorithm AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs through a self-verification mechanism with limited annotations. Experimental results demonstrate that Vad-R1 achieves superior performance, outperforming both open-source and proprietary models on VAD and VAR tasks. Codes and datasets will be released at https://github.com/wbfwonderful/Vad-R1.
中文: 本文提出了视频异常推理新任务及Vad-R1框架,通过感知-认知思维链和强化学习算法,增强多模态大语言模型对视频异常的深度推理能力,在多项基准测试中表现优异。
English: This paper introduces a novel Video Anomaly Reasoning (VAR) task and proposes Vad-R1, an end-to-end framework that enhances Multimodal Large Language Models' deep reasoning for video anomalies through a Perception-to-Cognition Chain-of-Thought and improved reinforcement learning, achieving superior performance on VAD and VAR benchmarks.

Authors:Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang
Title: REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Large Reasoning Models
Abstract:
Large Reasoning Models (LRMs) demonstrate strong performance in complex tasks but often face the challenge of overthinking, leading to substantially high inference costs. Existing approaches synthesize shorter reasoning responses for LRMs to learn, but are inefficient for online usage due to the time-consuming data generation and filtering processes. Meanwhile, online reinforcement learning mainly adopts a length reward to encourage short reasoning responses, but tends to lose the reflection ability and harm the performance. To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision. Besides, a reflection reward is designed to further prevent LRMs from favoring short yet non-reflective responses. Experiments show that both methods maintain or enhance performance while significantly improving inference efficiency. Their combination achieves a good balance between performance and efficiency, reducing inference costs by 35% without compromising performance. Further analysis demonstrates that our methods are effective by maintaining reflection frequency for hard problems while appropriately reducing it for simpler ones without losing reflection ability. Codes are available at https://github.com/hexuandeng/REA-RL.
中文摘要:提出的REA-RL方法通过引入反思模型进行高效在线训练并设计反思奖励,解决了大型推理模型的过度思考问题,在保持性能的同时显著提升35%的推理效率。
English Summary: The proposed REA-RL method addresses overthinking in Large Reasoning Models by combining a reflection model for efficient online training with a reflection reward, achieving 35% inference cost reduction while maintaining performance.

Authors:Haoqiang Yang, Congde Yuan, Kun Bai, Mengzhuo Guo, Wei Yang, Chao Zhou
Title: HIT Model: A Hierarchical Interaction-Enhanced Two-Tower Model for Pre-Ranking Systems
Abstract:
Online display advertising platforms rely on pre-ranking systems to efficiently filter and prioritize candidate ads from large corpora, balancing relevance to users with strict computational constraints. The prevailing two-tower architecture, though highly efficient due to its decoupled design and pre-caching, suffers from limited cross-domain interaction and coarse similarity metrics, undermining its capacity to model complex user-ad relationships. In this study, we propose the Hierarchical Interaction-Enhanced Two-Tower (HIT) model, a new architecture that augments the two-tower paradigm with two key components: generators that pre-generate holistic vectors incorporating coarse-grained user-ad interactions through a dual-generator framework with a cosine-similarity-based generation loss as the training objective, and multi-head representers that project embeddings into multiple latent subspaces to capture fine-grained, multi-faceted user interests and multi-dimensional ad attributes. This design enhances modeling effectiveness without compromising inference efficiency. Extensive experiments on public datasets and large-scale online A/B testing on Tencent's advertising platform demonstrate that HIT significantly outperforms several baselines in relevance metrics, yielding a 1.66% increase in Gross Merchandise Volume and a 1.55% improvement in Return on Investment, alongside similar serving latency to the vanilla two-tower models. The HIT model has been successfully deployed in Tencent's online display advertising system, serving billions of impressions daily. The code is available at https://github.com/HarveyYang123/HIT_model.
中文摘要:本研究提出的分层交互增强双塔(HIT)模型通过生成器实现粗粒度交互建模和多头表征器捕捉细粒度特征,在保持推理效率的同时显著提升了广告相关性指标和商业效益,并已在腾讯广告平台成功部署。
English Summary: The study introduces the Hierarchical Interaction-Enhanced Two-Tower (HIT) model, which enhances the traditional two-tower architecture by incorporating generators for coarse-grained interactions and multi-head representers for fine-grained feature capture, significantly improving ad relevance and business metrics while maintaining efficient inference.

Authors:Dannong Wang, Jaisal Patel, Daochen Zha, Steve Y. Yang, Xiao-Yang Liu
Title: FinLoRA: Benchmarking LoRA Methods for Fine-Tuning LLMs on Financial Datasets
Abstract:
Low-rank adaptation (LoRA) methods show great potential for scaling pre-trained general-purpose Large Language Models (LLMs) to hundreds or thousands of use scenarios. However, their efficacy in high-stakes domains like finance (e.g., passing CFA exams and analyzing SEC filings) is rarely explored. In this paper, we present the open-source FinLoRA project that benchmarks LoRA methods on both general and highly professional financial tasks. First, we curated 19 datasets covering diverse financial applications; in particular, we created four novel XBRL analysis datasets based on 150 SEC filings. Second, we evaluated five LoRA methods and five base LLMs. Finally, we provide extensive experimental results in terms of accuracy, F1, and BERTScore and report computational cost in terms of time and GPU memory during fine-tuning and inference stages. We find that LoRA methods achieved substantial performance gains of 36% on average over base models. Our FinLoRA project provides an affordable and scalable approach to democratize financial intelligence to the general public. Datasets, LoRA adapters, code, and documentation are available at https://github.com/Open-Finance-Lab/FinLoRA
中文:FinLoRA项目表明,LoRA方法在金融任务中显著提升了大型语言模型的性能,平均改进达36%,为普及金融智能提供了经济高效的解决方案。
English: The FinLoRA project demonstrates that LoRA methods significantly enhance LLMs' performance in financial tasks, achieving a 36% average improvement and offering an affordable solution for democratizing financial intelligence.

Authors:You Wang, Li Fang, Hao Zhu, Fei Hu, Long Ye, Zhan Ma
Title: GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis
Abstract:
Neural Radiance Fields (NeRF) have transformed novel view synthesis by modeling scene-specific volumetric representations directly from images. While generalizable NeRF models can generate novel views across unknown scenes by learning latent ray representations, their performance heavily depends on a large number of multi-view observations. However, with limited input views, these methods experience significant degradation in rendering quality. To address this limitation, we propose GoLF-NRT: a Global and Local feature Fusion-based Neural Rendering Transformer. GoLF-NRT enhances generalizable neural rendering from few input views by leveraging a 3D transformer with efficient sparse attention to capture global scene context. In parallel, it integrates local geometric features extracted along the epipolar line, enabling high-quality scene reconstruction from as few as 1 to 3 input views. Furthermore, we introduce an adaptive sampling strategy based on attention weights and kernel regression, improving the accuracy of transformer-based neural rendering. Extensive experiments on public datasets show that GoLF-NRT achieves state-of-the-art performance across varying numbers of input views, highlighting the effectiveness and superiority of our approach. Code is available at https://github.com/KLMAV-CUC/GoLF-NRT.
中文:GoLF-NRT是一种新颖的神经渲染变换器,通过融合全局场景上下文与局部几何特征,仅需1-3张输入视图即可实现高质量新视角合成,其自适应采样策略使该方法在性能上超越现有技术。
English: GoLF-NRT is a novel neural rendering transformer that combines global scene context and local geometric features to achieve high-quality novel view synthesis from as few as 1-3 input views, outperforming existing methods through its adaptive sampling strategy.

Authors:Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian
Title: Efficient Multi-modal Long Context Learning for Training-free Adaptation
Abstract:
Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. Codes are publicly available at https://github.com/Zehong-Ma/EMLoC.
中文: 本文提出EMLoC方法,通过将示范示例嵌入模型输入并采用分块压缩与自适应剪枝技术,无需训练即可高效处理多模态长上下文学习,且保持性能无损。
English: This paper introduces EMLoC, a training-free method that embeds demonstration examples into model inputs and uses chunk-wise compression with adaptive pruning to efficiently handle multi-modal long-context learning without performance loss.

Authors:Sirui Chen, Shuqin Ma, Shu Yu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
Title: Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks
Abstract:
Consciousness stands as one of the most profound and distinguishing features of the human mind, fundamentally shaping our understanding of existence and agency. As large language models (LLMs) develop at an unprecedented pace, questions concerning intelligence and consciousness have become increasingly significant. However, discourse on LLM consciousness remains largely unexplored territory. In this paper, we first clarify frequently conflated terminologies (e.g., LLM consciousness and LLM awareness). Then, we systematically organize and synthesize existing research on LLM consciousness from both theoretical and empirical perspectives. Furthermore, we highlight potential frontier risks that conscious LLMs might introduce. Finally, we discuss current challenges and outline future directions in this emerging field. The references discussed in this paper are organized at https://github.com/OpenCausaLab/Awesome-LLM-Consciousness.
中文: 本文系统探讨了大型语言模型意识这一新兴领域,厘清了相关术语,整合了现有研究,并指出了潜在风险及未来研究方向。
English: This paper systematically explores the largely uncharted territory of LLM consciousness, clarifying terminology, synthesizing research, and addressing potential risks and future directions in the field.

Authors:Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem
Title: MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs
Abstract:
Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al., 2021) laid the groundwork for extracting a wide range of metadata attributes from Arabic NLP datasets' scholarly articles, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate the research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, while highlighting the need for further improvements to ensure consistent and reliable performance. We release the code: https://github.com/IVUL-KAUST/MOLE and dataset: https://huggingface.co/datasets/IVUL-KAUST/MOLE for the research community.
中文摘要:MOLE框架利用大语言模型自动从非阿拉伯语数据集的科学论文中提取元数据,采用模式驱动处理和验证机制,并建立了新的评估基准。
English Summary: MOLE is a framework that uses Large Language Models to automatically extract metadata from scientific papers for non-Arabic datasets, employing schema-driven processing and validation while introducing a new evaluation benchmark.

Authors:Sajjad Shahabodini, Mobina Mansoori, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
Title: The Missing Point in Vision Transformers for Universal Image Segmentation
Abstract:
Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad-sh33/ViT-P.
中文: 本文提出ViT-P,一种两阶段分割框架,将掩码生成与分类解耦,利用基于视觉变换器的模型提高精度和适应性,无需预训练,在多个数据集上实现了最先进的性能。
English: This paper introduces ViT-P, a two-stage segmentation framework that decouples mask generation from classification, using a Vision Transformer-based model to enhance accuracy and adaptability without pre-training, achieving state-of-the-art results across multiple datasets.

Authors:Li Fang, Hao Zhu, Longlong Chen, Fei Hu, Long Ye, Zhan Ma
Title: Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction
Abstract:
Recent advancements in generalizable novel view synthesis have achieved impressive quality through interpolation between nearby views. However, rendering high-resolution images remains computationally intensive due to the need for dense sampling of all rays. Recognizing that natural scenes are typically piecewise smooth and sampling all rays is often redundant, we propose a novel depth-guided bundle sampling strategy to accelerate rendering. By grouping adjacent rays into a bundle and sampling them collectively, a shared representation is generated for decoding all rays within the bundle. To further optimize efficiency, our adaptive sampling strategy dynamically allocates samples based on depth confidence, concentrating more samples in complex regions while reducing them in smoother areas. When applied to ENeRF, our method achieves up to a 1.27 dB PSNR improvement and a 47% increase in FPS on the DTU dataset. Extensive experiments on synthetic and real-world datasets demonstrate state-of-the-art rendering quality and up to 2x faster rendering compared to existing generalizable methods. Code is available at https://github.com/KLMAV-CUC/GDB-NeRF.
Chinese: 本文提出了一种深度引导的束采样策略,通过自适应地分组光线和动态分配样本,在多个数据集上实现了更优的渲染质量和最高达两倍的加速效果。
English: This paper introduces a depth-guided bundle sampling strategy that accelerates novel view synthesis by adaptively grouping rays and dynamically allocating samples, achieving superior rendering quality and up to twice the speed on multiple datasets.
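
The adaptive allocation described in this abstract (more samples where depth confidence is low, fewer where it is high) can be illustrated with a toy sketch. The function name `allocate_samples`, the sample bounds, and the linear mapping from confidence to sample count are assumptions made for illustration; the abstract does not specify the actual rule used by the authors.

```python
def allocate_samples(confidences, min_samples=2, max_samples=8):
    """Map per-bundle depth confidence in [0, 1] to a sample count.

    Hypothetical rule: low confidence (complex region) gets more samples,
    high confidence (smooth region) gets fewer.
    """
    counts = []
    for c in confidences:
        c = min(max(c, 0.0), 1.0)  # clamp out-of-range confidences
        extra = round((1.0 - c) * (max_samples - min_samples))
        counts.append(min_samples + extra)
    return counts


# Each value is the (assumed) depth confidence of one bundle of adjacent rays.
bundle_confidence = [1.0, 0.5, 0.0]
samples_per_bundle = allocate_samples(bundle_confidence)
```

In this sketch every ray in a bundle would share the samples allocated to that bundle, which is where the saving over per-ray dense sampling would come from.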

Authors:Mobina Mansoori, Sajjad Shahabodini, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
Title: Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models
Abstract:
Foundation models are large-scale models, pre-trained on massive datasets, that perform a wide range of tasks. These models have shown consistently improved results with the introduction of new methods. It is crucial to analyze how these trends impact the medical field and determine whether these advancements can drive meaningful change. This study investigates the application of recent state-of-the-art foundation models, DINOv2, MAE, VMamba, CoCa, SAM2, and AIMv2, for medical image classification. We explore their effectiveness on datasets including CBIS-DDSM for mammography, ISIC2019 for skin lesions, APTOS2019 for diabetic retinopathy, and CHEXPERT for chest radiographs. By fine-tuning these models and evaluating their configurations, we aim to understand the potential of these advancements in medical image classification. The results indicate that these advanced models significantly enhance classification outcomes, demonstrating robust performance despite limited labeled data. Based on our results, AIMv2, DINOv2, and SAM2 models outperformed others, demonstrating that progress in natural domain training has positively impacted the medical domain and improved classification outcomes. Our code is publicly available at: https://github.com/sajjad-sh33/Medical-Transfer-Learning.
中文: 本研究评估了DINOv2和AIMv2等先进基础模型在医学图像数据集上的表现,结果表明这些模型显著提升了分类准确性和鲁棒性,即使在标注数据有限的情况下也表现优异。
English: This study evaluates advanced foundation models like DINOv2 and AIMv2 on medical image datasets, showing they significantly improve classification accuracy and robustness, even with limited labeled data.

Authors:Patara Trirat, Wonyong Jeong, Sung Ju Hwang
Title: Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This paper proposes Agentic Predictor, a lightweight predictor for efficient agentic workflow evaluation. Agentic Predictor is equipped with a multi-view workflow encoding technique that leverages multi-view representation learning of agentic systems by incorporating code architecture, textual prompts, and interaction graph features. To achieve high predictive accuracy while significantly reducing the number of required workflow evaluations for training a predictor, Agentic Predictor employs cross-domain unsupervised pretraining. By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations for a given task, significantly reducing the need for expensive trial-and-error evaluations. Experiments on a carefully curated benchmark spanning three domains show that our predictor outperforms state-of-the-art methods in both predictive accuracy and workflow utility, highlighting the potential of performance predictors in streamlining the design of LLM-based agentic workflows.
Chinese: 本文提出Agentic Predictor,一种轻量级预测器,通过多视图编码和跨域预训练技术,能高效评估并选择最优的大语言模型智能体工作流配置,显著减少昂贵试错评估需求,同时提升预测准确性和工作流效用。
English: This paper introduces Agentic Predictor, a lightweight tool that uses multi-view encoding and cross-domain pretraining to efficiently evaluate and select optimal LLM-based agentic workflow configurations, reducing the need for costly trial-and-error evaluations while improving accuracy and utility.

Authors:Ruihan Gong, Yue Liu, Wenjie Qu, Mingzhe Du, Yufei He, Yingwei Ma, Yulin Chen, Xiang Liu, Yi Wen, Xinfeng Li, Ruidong Wang, Xinzhong Zhu, Bryan Hooi, Jiaheng Zhang
Title: Efficient Reasoning via Chain of Unconscious Thought
Abstract:
Large Reasoning Models (LRMs) achieve promising performance but compromise token efficiency due to verbose reasoning processes. Unconscious Thought Theory (UTT) posits that complex problems can be solved more efficiently through internalized cognitive processes. Inspired by UTT, we propose a new reasoning paradigm, termed Chain of Unconscious Thought (CoUT), to improve the token efficiency of LRMs by guiding them to mimic human unconscious thought and internalize reasoning processes. Concretely, we first prompt the model to internalize the reasoning by thinking in the hidden layer. Then, we design a bag of token-efficient strategies to further help models reduce unnecessary tokens yet preserve the performance. Our work reveals that models may possess beneficial unconscious thought, enabling improved efficiency without sacrificing performance. Extensive experiments demonstrate the effectiveness of CoUT. Remarkably, it surpasses CoT by reducing token usage by 47.62% while maintaining comparable accuracy, as shown in Figure 1. The code of CoUT is available at this link: https://github.com/Rohan-GRH/CoUT
中文摘要:无意识思维链(CoUT)通过内化推理过程,显著提升大型推理模型的令牌效率,在保持与思维链方法相近准确率的同时,将令牌使用量减少47.62%。
English Summary: The Chain of Unconscious Thought (CoUT) paradigm enhances token efficiency in Large Reasoning Models by internalizing reasoning processes, reducing token usage by 47.62% while maintaining accuracy comparable to Chain of Thought methods.

Authors:Ruisheng Cao, Hanchong Zhang, Tiancheng Huang, Zhangyi Kang, Yuxin Zhang, Liangtai Sun, Hanqi Li, Yuxun Miao, Shuai Fan, Lu Chen, Kai Yu
Title: NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering
Abstract:
The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both a relational database and a vectorstore, enabling LLM agents to iteratively gather context until it is sufficient to generate answers. Experiments on three full PDF-based QA datasets, including the self-annotated AIRQA-REAL, show that NeuSym-RAG consistently outperforms both the vector-based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views. Code and data are publicly available at https://github.com/X-LANCE/NeuSym-RAG.
Chinese: NeuSym-RAG 是一种混合神经符号检索框架,通过多视图分块和基于模式的解析将两种检索范式结合,使LLM代理能从PDF中迭代收集上下文,并在多个问答数据集上稳定超越现有方法。
English: NeuSym-RAG is a hybrid neural-symbolic retrieval framework that integrates both retrieval paradigms through multi-view chunking and schema-based parsing, enabling LLM agents to iteratively gather context from PDFs and outperforming existing methods on multiple QA datasets.

Authors:Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung
Title: Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models
Abstract:
With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO often require direct fine-tuning on LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose the Micro token-level Accept-Reject Aligning (MARA) approach, which is designed to operate independently of the language models. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are "Accepted" or "Rejected" as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs. The source code and implementation details are publicly available at https://github.com/IAAR-Shanghai/MARA, and the trained models are released at https://huggingface.co/IAAR-Shanghai/MARA_AGENTS.
中文: 提出的微令牌级接受-拒绝对齐(MARA)方法通过将句子级偏好学习转化为令牌级二元分类,在多个模型和数据集上显著提升了对齐性能并降低了计算成本。
English: The proposed Micro token-level Accept-Reject Aligning (MARA) method efficiently aligns large language models with human preferences by converting sentence-level learning into token-level classification, significantly improving performance while reducing computational costs across multiple models and datasets.
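
The token-level accept/reject idea in this abstract can be sketched as a decoding loop in which a frozen base model proposes ranked candidate tokens and a separate small classifier accepts or rejects each one. Everything below is a hypothetical illustration: `pick_token`, `decode`, the scripted proposer, and the blocklist "classifier" are stand-ins, and the fallback to the top candidate when all are rejected is an assumption, not something the abstract states. In the paper the classifier is a three-layer fully-connected network.

```python
from typing import Callable, List, Sequence

Context = List[str]


def pick_token(candidates: Sequence[str], accept: Callable[[str], bool]) -> str:
    """Return the first accepted candidate, in the base model's ranking order."""
    for token in candidates:
        if accept(token):
            return token
    return candidates[0]  # assumption: fall back to the base model's top choice


def decode(propose, accept, max_tokens: int) -> Context:
    """Generate a response; the base model itself is never fine-tuned."""
    context: Context = []
    for _ in range(max_tokens):
        candidates = propose(context)  # ranked candidates from the base model
        if not candidates:
            break
        context.append(pick_token(candidates, lambda t: accept(context, t)))
    return context


# Toy stand-ins (hypothetical): a scripted proposer and a blocklist classifier.
SCRIPT = [["bad", "good"], ["fine"], []]
BLOCKED = {"bad"}


def toy_propose(context: Context) -> List[str]:
    return SCRIPT[len(context)]


def toy_accept(context: Context, token: str) -> bool:
    return token not in BLOCKED


response = decode(toy_propose, toy_accept, max_tokens=3)
```

Because the accept/reject decision is made outside the language model, the same classifier could in principle be attached to different base models, which matches the abstract's claim that MARA operates independently of the language models.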

Authors:Jue Gong, Tingyu Yang, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, Xiaokang Yang
Title: HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance
Abstract:
Human-centered images often suffer from severe generic degradation during transmission and are prone to human motion blur (HMB), making restoration challenging. Existing research lacks sufficient focus on these issues, as both problems often coexist in practice. To address this, we design a degradation pipeline that simulates the coexistence of HMB and generic noise, generating synthetic degraded data to train our proposed HAODiff, a human-aware one-step diffusion. Specifically, we propose a triple-branch dual-prompt guidance (DPG), which leverages high-quality images, residual noise (LQ minus HQ), and HMB segmentation masks as training targets. It produces a positive-negative prompt pair for classifier-free guidance (CFG) in a single diffusion step. The resulting adaptive dual prompts let HAODiff exploit CFG more effectively, boosting robustness against diverse degradations. For fair evaluation, we introduce MPII-Test, a benchmark rich in combined noise and HMB cases. Extensive experiments show that our HAODiff surpasses existing state-of-the-art (SOTA) methods in terms of both quantitative metrics and visual quality on synthetic and real-world datasets, including our introduced MPII-Test. Code is available at: https://github.com/gobunu/HAODiff.
中文摘要:本研究提出HAODiff模型,通过三重分支双提示引导技术,在单步扩散中有效修复同时存在运动模糊和通用噪声的人像图像,在合成与真实数据集上均超越现有最优方法。
English Summary: The study introduces HAODiff, a human-aware one-step diffusion model enhanced with triple-branch dual-prompt guidance, which effectively restores human-centered images degraded by both motion blur and generic noise, outperforming existing methods on synthetic and real-world benchmarks.

Authors:Yifan Wu, Jingze Shi, Bingheng Wu, Jiayi Zhang, Xiaotian Lin, Nan Tang, Yuyu Luo
Title: Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting
Abstract:
Existing chain-of-thought (CoT) distillation methods can effectively transfer reasoning abilities to base models but suffer from two major limitations: excessive verbosity of reasoning traces and inadequate adaptability to problem difficulty. Long reasoning traces significantly increase inference costs, and uniform-length solutions prevent base models from learning adaptive reasoning strategies. To address these issues, we propose a difficulty-aware prompting (DAP) method to dynamically shorten reasoning traces without performance loss. In our approach, a large teacher model first judges each problem's difficulty and then rewrites its reasoning traces to an appropriate shorter length, yielding concise yet complete reasoning traces. Leveraging the DAP pipeline, we curate a distilled dataset called LiteCoT consisting of 100K concise reasoning examples, with solutions averaging only 720 tokens (an order of magnitude shorter than typical CoTs). Using LiteCoT, we distilled a new family of reasoning models called Liter (1.5B, 7B, and 32B) based on the Qwen2.5 architecture. Experiments show that a student model fine-tuned on just 100K of these difficulty-pruned CoT samples outperforms a model distilled on 800K original Long CoT samples, while significantly reducing training and inference costs. Our method also generalizes well: across 11 diverse benchmarks, the shorter difficulty-aware CoTs achieve equal or better accuracy than Long chains, using far fewer tokens. For example, on the challenging AIME24 exam, our approach reaches 74.2% Pass@1 using only about 5K inference tokens, surpassing other methods that consume many more tokens. Our code and data are available at https://github.com/Evanwu1125/LiteCoT.
中文: 本研究提出难度感知提示方法,通过动态缩短推理轨迹生成精简而完整的训练数据,使蒸馏模型在多项基准测试中以更低计算成本获得更优性能。
English: This study introduces a difficulty-aware prompting method that dynamically shortens reasoning traces to create concise yet effective training data, enabling distilled models to achieve superior performance with significantly reduced computational costs across multiple benchmarks.

Authors:Tej Deep Pala, Panshul Sharma, Amir Zadeh, Chuan Li, Soujanya Poria
Title: Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision
Abstract:
Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify only final answers, Process Reward Models (PRMs) score each intermediate step to steer generation toward coherent solutions. We introduce PathFinder-PRM, a novel hierarchical, error-aware discriminative PRM that first classifies math and consistency errors at each step, then combines these fine-grained signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by enriching the human-annotated PRM800K corpus and RLHFlow Mistral traces with three-dimensional step-level labels. On PRMBench, PathFinder-PRM achieves a new state-of-the-art PRMScore of 67.7, outperforming the prior best (65.5) while using 3 times less data. When applied to reward guided greedy search, our model yields prm@8 48.3, a +1.5 point gain over the strongest baseline. These results demonstrate that decoupled error detection and reward estimation not only boost fine-grained error detection but also substantially improve end-to-end, reward-guided mathematical reasoning with greater data efficiency.
中文: PathFinder-PRM作为一种分层过程奖励模型,通过检测步骤级错误并整合信号进行奖励估计,显著提升了数学推理的精确性和数据效率,实现了最优性能。
English: PathFinder-PRM is a hierarchical process reward model that enhances mathematical reasoning by detecting step-level errors and combining them for reward estimation, achieving state-of-the-art performance with improved data efficiency.
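
The two ideas in this abstract, scoring each step from separate error signals and using that score in reward-guided greedy search, can be illustrated with a toy sketch. The names `step_score` and `greedy_search`, the example probabilities, and in particular the product rule that combines the two error probabilities are assumptions for illustration; in the paper the combination is done by the trained model, not by a fixed formula.

```python
def step_score(p_math_error, p_consistency_error):
    """Hypothetical combination: a step counts as correct only if neither
    a math error nor a consistency error is present."""
    return (1.0 - p_math_error) * (1.0 - p_consistency_error)


def greedy_search(candidates_per_step, error_probs):
    """Pick, at every step, the candidate with the highest step score.

    candidates_per_step: list of lists of candidate step strings.
    error_probs: dict mapping a step to (p_math_error, p_consistency_error),
                 standing in for the outputs of an error-aware reward model.
    """
    path = []
    for candidates in candidates_per_step:
        best = max(candidates, key=lambda s: step_score(*error_probs[s]))
        path.append(best)
    return path


# Made-up error probabilities for two candidate steps at each of two positions.
toy_probs = {
    "2 + 2 = 4": (0.1, 0.1),
    "2 + 2 = 5": (0.9, 0.1),
    "so x = 4": (0.2, 0.0),
    "so x = 7": (0.2, 0.8),
}
solution = greedy_search(
    [["2 + 2 = 5", "2 + 2 = 4"], ["so x = 7", "so x = 4"]], toy_probs
)
```

Keeping the two error types separate means a step can be penalized for contradicting earlier steps even when its arithmetic is locally fine, as in the last candidate above.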

Authors:Junming Liu, Yanting Gao, Siyuan Meng, Yifei Sun, Aoqi Wu, Yufei Jin, Yirong Chen, Ding Wang, Guosun Zeng
Title: Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments
Abstract:
Federated Learning (FL) is a decentralized machine learning paradigm that enables clients to collaboratively train models while preserving data privacy. However, the coexistence of model and data heterogeneity gives rise to inconsistent representations and divergent optimization dynamics across clients, ultimately hindering robust global performance. To transcend these challenges, we propose Mosaic, a novel data-free knowledge distillation framework tailored for heterogeneous distributed environments. Mosaic first trains local generative models to approximate each client's personalized distribution, enabling synthetic data generation that safeguards privacy through strict separation from real data. Subsequently, Mosaic forms a Mixture-of-Experts (MoE) from client models based on their specialized knowledge, and distills it into a global model using the generated data. To further enhance the MoE architecture, Mosaic integrates expert predictions via a lightweight meta model trained on a few representative prototypes. Extensive experiments on standard image classification benchmarks demonstrate that Mosaic consistently outperforms state-of-the-art approaches under both model and data heterogeneity. The source code has been published at https://github.com/Wings-Of-Disaster/Mosaic.
中文: 该摘要提出了Mosaic框架,通过训练本地生成模型合成数据并构建专家混合模型蒸馏为全局模型,有效解决了联邦学习中的模型与数据异构性问题,实验证明其性能优于现有先进方法。
English: This abstract introduces Mosaic, a data-free knowledge distillation framework designed to address model and data heterogeneity in Federated Learning by training local generative models for synthetic data generation and forming a Mixture-of-Experts distilled into a global model, which outperforms state-of-the-art methods in experiments.

Authors:Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Yining Shi, Chuang Zhang, Sifa Zheng
Title: DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving
Abstract:
Camera sensor simulation plays a critical role in autonomous driving (AD), e.g. in evaluating vision-based AD algorithms. While existing approaches have leveraged generative models for controllable image/video generation, they remain constrained to generating multi-view video sequences with fixed camera viewpoints and video frequency, significantly limiting their downstream applications. To address this, we present a generalizable camera simulation framework DriveCamSim, whose core innovation lies in the proposed Explicit Camera Modeling (ECM) mechanism. Instead of implicit interaction through vanilla attention, ECM establishes explicit pixel-wise correspondences across multi-view and multi-frame dimensions, decoupling the model from overfitting to the specific camera configurations (intrinsic/extrinsic parameters, number of views) and temporal sampling rates presented in the training data. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines, proposing an information-preserving control mechanism. This control mechanism not only improves conditional controllability, but also can be extended to be identity-aware to enhance temporal consistency in foreground object rendering. With the above designs, our model demonstrates superior performance in both visual quality and controllability, as well as generalization capability across spatial-level (camera parameters variations) and temporal-level (video frame rate variations), enabling flexible user-customizable camera simulation tailored to diverse application scenarios. Code will be available at https://github.com/swc-17/DriveCamSim for facilitating future research.
Chinese: DriveCamSim提出了一种创新的相机仿真框架,通过显式相机建模和信息保留控制机制,实现了可适应不同相机参数与帧率的高质量灵活视频生成,突破了现有方法的局限性。
English: DriveCamSim introduces a novel camera simulation framework with Explicit Camera Modeling and an information-preserving control mechanism to enable flexible, high-quality video generation adaptable to various camera parameters and frame rates, overcoming limitations of existing methods.

Authors:Xinrui Wang, Shao-yuan Li, Jiaqiang Zhang, Songcan Chen
Title: Cut out and Replay: A Simple yet Versatile Strategy for Multi-Label Online Continual Learning
Abstract:
Multi-Label Online Continual Learning (MOCL) requires models to learn continuously from endless multi-label data streams, facing complex challenges including persistent catastrophic forgetting, potential missing labels, and uncontrollable imbalanced class distributions. While existing MOCL methods attempt to address these challenges through various techniques, they all overlook label-specific region identification and feature learning, a fundamental solution rooted in multi-label learning but challenging to achieve in the online setting with incremental and partial supervision. To this end, we first leverage the inherent structural information of input data to evaluate and verify the innate localization capability of different pre-trained models. Then, we propose CUTER (CUT-out-and-Experience-Replay), a simple yet versatile strategy that provides fine-grained supervision signals by further identifying, strengthening and cutting out label-specific regions for efficient experience replay. It not only enables models to simultaneously address catastrophic forgetting, missing labels, and class imbalance challenges, but also serves as an orthogonal solution that seamlessly integrates with existing approaches. Extensive experiments on multiple multi-label image benchmarks demonstrate the superiority of our proposed method. The code is available at https://github.com/wxr99/Cut-Replay.
中文: 本文提出CUTER策略,通过识别和回放标签特定区域,有效解决了多标签在线持续学习中的灾难性遗忘、标签缺失和类别不平衡问题,在多个基准测试中表现出优越性能。
English: This paper introduces CUTER, a novel strategy for Multi-Label Online Continual Learning that addresses catastrophic forgetting, missing labels, and class imbalance by identifying and replaying label-specific regions, demonstrating superior performance across multiple benchmarks.

Authors:Tingjia Shen, Hao Wang, Chuan Qin, Ruijun Sun, Yang Song, Defu Lian, Hengshu Zhu, Enhong Chen
Title: GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models
Abstract:
Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve the OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieved knowledge into instructions during fine-tuning to strengthen the model. Furthermore, to enable controllable generation in LLMs, we leverage a fine-tuned LLM and an ensemble based on text consistency that accounts for coherence, fluency, and answer format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the effectiveness of GenKI in comparison with state-of-the-art baselines. Moreover, ablation studies have disclosed a linear relationship between the frequency of retrieved knowledge and the model's ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at https://github.com/USTC-StarTeam/GenKI
中文摘要:GenKI框架通过将知识库中的知识整合到大型语言模型中并实现可控生成,有效提升了开放域问答性能,在多个数据集上表现优异,并揭示了知识检索频率与准确回忆能力之间的线性关系。
English Summary: The GenKI framework enhances OpenQA by integrating knowledge from a knowledge base into LLMs and enabling controllable generation, demonstrating superior performance across diverse datasets and revealing a linear relationship between knowledge frequency and recall accuracy.

Authors:Tingjia Shen, Hao Wang, Chuan Qin, Ruijun Sun, Yang Song, Defu Lian, Hengshu Zhu, Enhong Chen
Title: Prompting is not Enough: Exploring Knowledge Integration and Controllable Generation on Large Language Models
Abstract:
Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve the OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieved knowledge into instructions during fine-tuning to strengthen the model. Furthermore, to enable controllable generation in LLMs, we leverage a fine-tuned LLM and an ensemble based on text consistency that accounts for coherence, fluency, and answer format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the effectiveness of GenKI in comparison with state-of-the-art baselines. Moreover, ablation studies have disclosed a linear relationship between the frequency of retrieved knowledge and the model's ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at https://github.com/USTC-StarTeam/GenKI
中文摘要:GenKI框架通过将知识库中的知识整合到大型语言模型中并实现可控生成,有效提升了开放域问答性能,在多个数据集上表现优异,并揭示了知识检索频率与准确回忆能力之间的线性关系。
English Summary: The GenKI framework enhances OpenQA by integrating knowledge from a knowledge base into LLMs and enabling controllable generation, demonstrating superior performance across diverse datasets and revealing a linear relationship between knowledge frequency and recall accuracy.

Authors:Piyush Tiwary, Kinjawl Bhattacharyya, Prathosh A. P
Title: LangDAug: Langevin Data Augmentation for Multi-Source Domain Generalization in Medical Image Segmentation
Abstract:
Medical image segmentation models often struggle to generalize across different domains due to various reasons. Domain Generalization (DG) methods overcome this either through representation learning or data augmentation (DAug). While representation learning methods seek domain-invariant features, they often rely on ad-hoc techniques and lack formal guarantees. DAug methods, which enrich model representations through synthetic samples, have shown comparable or superior performance to representation learning approaches. We propose LangDAug, a novel Langevin Data Augmentation method for multi-source domain generalization in 2D medical image segmentation. LangDAug leverages Energy-Based Models (EBMs) trained via contrastive divergence to traverse between source domains, generating intermediate samples through Langevin dynamics. Theoretical analysis shows that LangDAug induces a regularization effect, and for GLMs, it upper-bounds the Rademacher complexity by the intrinsic dimensionality of the data manifold. Through extensive experiments on Fundus segmentation and 2D MRI prostate segmentation benchmarks, we show that LangDAug outperforms state-of-the-art domain generalization methods and effectively complements existing domain-randomization approaches. The codebase for our method is available at https://github.com/backpropagator/LangDAug.
中文:LangDAug是一种新颖的医学图像分割领域泛化方法,通过朗之万动力学和基于能量的模型在源域间生成中间样本,在实验中表现出优越性能并具有理论正则化保证。
English: LangDAug is a novel domain generalization method for medical image segmentation that uses Langevin dynamics and energy-based models to generate intermediate samples between source domains, demonstrating superior performance and providing theoretical regularization guarantees.
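
The Langevin dynamics mentioned in this abstract can be illustrated in one dimension. The sketch below uses the standard unadjusted Langevin update, x' = x - (step/2) * grad E(x) + sqrt(step) * noise, on a hand-written quadratic energy. The energy function, step size, starting point, and the `noise_scale` switch are all assumptions for illustration; in the paper the energy comes from a trained Energy-Based Model over images, not from a closed-form expression.

```python
import math
import random


def langevin_step(x, grad_energy, step_size, rng, noise_scale=1.0):
    """One unadjusted Langevin update on a scalar state."""
    noise = rng.gauss(0.0, 1.0)
    drift = -0.5 * step_size * grad_energy(x)
    return x + drift + noise_scale * math.sqrt(step_size) * noise


def langevin_chain(x0, grad_energy, step_size, n_steps, seed=0, noise_scale=1.0):
    """Run a chain and keep every intermediate sample."""
    rng = random.Random(seed)
    samples = [x0]
    for _ in range(n_steps):
        samples.append(
            langevin_step(samples[-1], grad_energy, step_size, rng, noise_scale)
        )
    return samples


# Toy setup: start at a "source domain" sample x = 0 and follow an energy
# E(x) = (x - 3)^2 / 2 whose minimum, x = 3, plays the role of the target domain.
def grad_toy_energy(x):
    return x - 3.0


# noise_scale=0.0 switches the noise off so the drift alone is visible.
chain = langevin_chain(0.0, grad_toy_energy, step_size=0.1, n_steps=200,
                       noise_scale=0.0)
```

The intermediate entries of `chain` lie between the start and the energy minimum, which is the toy analogue of the intermediate samples that the abstract describes as augmentation data. With the default `noise_scale=1.0` the chain fluctuates around the minimum instead of converging to it.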

Authors:Xiaochuan Liu, Ruihua Song, Xiting Wang, Xu Chen
Title: Select, Read, and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation
Abstract:
Automatic related work generation (RWG) can save time and effort when drafting a related work section (RWS) for further revision. However, existing RWG methods suffer from shallow comprehension, because they take only limited portions of the reference papers as input, and from isolated explanations of each reference, because they fail to capture the relationships among references. To address these issues, we focus on the full-text-based RWG task and propose a novel multi-agent framework. Our framework consists of three agents: a selector that decides which section of the papers to read next, a reader that digests the selected section and updates a shared working memory, and a writer that generates the RWS based on the final curated memory. To better capture the relationships among references, we also propose two graph-aware strategies for the selector, which optimize the reading order under the constraints of the graph structure. Extensive experiments demonstrate that our framework consistently improves performance across three base models and various input configurations. The graph-aware selectors outperform alternative selectors, achieving state-of-the-art results. The code and data are available at https://github.com/1190200817/Full_Text_RWG.
中文: 本研究提出了一种基于全文的多智能体框架,通过图感知策略优化参考文献间的关联理解,在自动生成相关工作部分中实现了最优性能。
English: This study introduces a multi-agent framework for automatic related work generation that utilizes full-text analysis and graph-aware strategies to enhance comprehension and relationship mapping among references, achieving state-of-the-art performance.

Authors:Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, Junxian He
Title: SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond
Abstract:
Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs. We open-source both the data synthesis pipeline and the SynLogic dataset at https://github.com/MiniMax-AI/SynLogic.
中文: SynLogic框架生成可扩展且可验证的逻辑推理数据,通过强化学习训练提升大语言模型的推理能力,在结合数学和编程任务时不仅实现了最优性能,还显著增强了推理的泛化能力。
English: The SynLogic framework generates scalable, verifiable logical reasoning data that enhances LLMs' reasoning through RL training, achieving state-of-the-art performance and improving generalization when combined with mathematical and coding tasks.
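
The abstract's point that every example "can be verified by simple rules" can be illustrated with a minimal sketch of a rule-based verifier that doubles as a binary RL reward. The task (completing a Latin square) and the function name `verify_latin_square` are hypothetical choices made for illustration; they are not taken from the SynLogic code base.

```python
def verify_latin_square(puzzle, answer):
    """Return 1.0 if `answer` is a valid completion of `puzzle`, else 0.0.

    Both grids are n x n lists of ints; `puzzle` uses 0 for empty cells.
    (Hypothetical task format, assumed for illustration.)
    """
    n = len(puzzle)
    if len(answer) != n or any(len(row) != n for row in answer):
        return 0.0
    target = set(range(1, n + 1))
    for r in range(n):
        for c in range(n):
            # Clues given in the puzzle must be preserved in the answer.
            if puzzle[r][c] != 0 and puzzle[r][c] != answer[r][c]:
                return 0.0
    # Every row and every column must contain each of 1..n exactly once.
    rows_ok = all(set(row) == target for row in answer)
    cols_ok = all({answer[r][c] for r in range(n)} == target for c in range(n))
    return 1.0 if rows_ok and cols_ok else 0.0
```

Because the check is deterministic and cheap, the returned value can serve directly as a verifiable reward for a sampled model answer.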

Authors:Zihong Zhang, Liqi He, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du
Title: Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models
Abstract:
Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA (Large Language Model-Inspired Aho-Corasick Automaton). Leveraging the advanced pattern recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic n-gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at https://github.com/hkr04/LLACA
中文摘要:本研究提出了一种基于大型语言模型的无监督分词新框架,并开发了LLACA方法,通过结合Aho-Corasick自动机与LLM的深度理解能力,实现了能根据上下文动态调整的n-gram模型,在多语言分词任务上显著优于传统方法。
English Summary: This study introduces a novel unsupervised word segmentation framework leveraging Large Language Models (LLMs) and proposes LLACA, a method combining Aho-Corasick automata with LLM insights to dynamically adapt n-gram models for enhanced performance across multiple languages.

Authors:Yichun Feng, Jiawei Wang, Lu Zhou, Zhen Lei, Yixue Li
Title: DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue
Abstract:
Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Single-round consultation systems require patients to describe all symptoms upfront, leading to vague diagnosis with unclear complaints. Traditional multi-turn dialogue models, constrained by static supervised learning, lack flexibility and fail to intelligently extract key clinical information. To address these limitations, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that DoctorAgent-RL outperforms existing models in both multi-turn reasoning capability and final diagnostic performance. This approach shows immense practical value by reducing misdiagnosis risks in time-pressured settings, freeing clinicians for complex cases, and pioneering a strategy to optimize medical resource allocation and alleviate workforce shortages. Code and data are available at https://github.com/JarvisUSTC/DoctorAgent-RL
中文摘要:该研究提出的基于强化学习的多智能体协作框架,通过动态优化问诊策略提升临床诊断准确性,为优化医疗资源配置开辟了新途径。
English Summary: The proposed reinforcement learning-based multi-agent framework enhances clinical consultations by enabling dynamic questioning strategies that improve diagnostic accuracy and optimize medical resource allocation.

Authors:Silin Li, Yuhang Guo, Jiashu Yao, Zeming Liu, Haifeng Wang
Title: HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices
Abstract:
Large language models (LLMs) have the potential to revolutionize smart home assistants by enhancing their ability to accurately understand user needs and respond appropriately, which is extremely beneficial for building a smarter home environment. While recent studies have explored integrating LLMs into smart home systems, they primarily focus on handling straightforward, valid single-device operation instructions. However, real-world scenarios are far more complex and often involve users issuing invalid instructions or controlling multiple devices simultaneously. These scenarios pose two main challenges: LLMs must accurately identify and rectify errors in user instructions, and they must execute multiple user instructions perfectly. To address these challenges and advance the development of LLM-based smart home assistants, we introduce HomeBench, the first smart home dataset with valid and invalid instructions across single and multiple devices. We report experimental results on 13 distinct LLMs; e.g., GPT-4o achieves only a 0.0% success rate in the scenario of invalid multi-device instructions, revealing that existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning, retrieval-augmented generation, and fine-tuning. Our code and dataset are publicly available at https://github.com/BITHLP/HomeBench.
中文摘要:本文提出了首个针对智能家居助手中无效和多设备指令挑战的数据集HomeBench,揭示了即使如GPT-4o等先进模型在现有增强技术下仍难以应对复杂现实场景。
English Summary: This paper introduces HomeBench, the first dataset addressing the challenges of invalid and multi-device instructions for LLM-based smart home assistants, revealing that even advanced models like GPT-4o struggle with complex real-world scenarios despite existing enhancement techniques.

Authors:Jiawen Chen, Qi Shao, Duxin Chen, Wenwu Yu
Title: Decoupling Spatio-Temporal Prediction: When Lightweight Large Models Meet Adaptive Hypergraphs
Abstract:
Spatio-temporal prediction is a pivotal task with broad applications in traffic management, climate monitoring, energy scheduling, etc. However, existing methodologies often struggle to balance model expressiveness and computational efficiency, especially when scaling to large real-world datasets. To tackle these challenges, we propose STH-SepNet (Spatio-Temporal Hypergraph Separation Networks), a novel framework that decouples temporal and spatial modeling to enhance both efficiency and precision. Therein, the temporal dimension is modeled using lightweight large language models, which effectively capture low-rank temporal dynamics. Concurrently, the spatial dimension is addressed through an adaptive hypergraph neural network, which dynamically constructs hyperedges to model intricate, higher-order interactions. A carefully designed gating mechanism is integrated to seamlessly fuse temporal and spatial representations. By leveraging the fundamental principles of low-rank temporal dynamics and spatial interactions, STH-SepNet offers a pragmatic and scalable solution for spatio-temporal prediction in real-world applications. Extensive experiments on large-scale real-world datasets across multiple benchmarks demonstrate the effectiveness of STH-SepNet in boosting predictive performance while maintaining computational efficiency. This work may provide a promising lightweight framework for spatio-temporal prediction, aiming to reduce computational demands while enhancing predictive performance. Our code is available at https://github.com/SEU-WENJIA/ST-SepNet-Lightweight-LLMs-Meet-Adaptive-Hypergraphs.
中文: STH-SepNet提出了一种创新框架,通过轻量级大语言模型处理时间动态和自适应超图神经网络捕捉空间交互,实现了时空解耦建模,在多个基准测试中显著提升了预测性能并保持了计算效率。
English: STH-SepNet introduces a novel framework that decouples temporal and spatial modeling using lightweight large language models for temporal dynamics and adaptive hypergraph neural networks for spatial interactions, achieving enhanced predictive performance and computational efficiency across multiple benchmarks.

Authors:Ruolin Shen, Xiaozhong Ji, Kai WU, Jiangning Zhang, Yijun He, HaiHua Yang, Xiaobin Hu, Xiaoyu Sun
Title: Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning
Abstract:
Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. Our observations reveal that these multi-modal models cannot distinguish concealed objects, demonstrating an inability to emulate human cognitive processes which effectively utilize foreground-background similarity principles for visual analysis. To analyze this hidden human-model visual thinking discrepancy, we build a visual system that mimics human visual camouflaged perception to progressively and iteratively 'refocus' on visually concealed content. The refocus is a progressive guidance mechanism enabling models to logically localize objects in visual images through stepwise reasoning. The localization process of concealed objects requires hierarchical attention shifting with dynamic adjustment and refinement of prior cognitive knowledge. In this paper, we propose a visual refocus reinforcement framework via the policy optimization algorithm to encourage multi-modal models to think and refocus more before answering, and achieve excellent reasoning abilities to align and even surpass human camouflaged perception systems. Our extensive experiments on camouflaged perception successfully demonstrate the emergence of refocus visual phenomena, characterized by multiple reasoning tokens and dynamic adjustment of the detection box. In addition, experimental results on both camouflaged object classification and detection tasks exhibit significantly superior performance compared to Supervised Fine-Tuning (SFT) baselines.
中文: 当前多模态模型在识别与背景融合的伪装物体时与人类视觉系统存在显著差异,为此提出的视觉重聚焦强化框架通过逐步推理机制提升了模型性能,在伪装感知任务中甚至超越了人类表现。
English: Current multi-modal models struggle to detect camouflaged objects due to misalignment with human visual cognition, prompting the development of a refocus reinforcement framework that enhances reasoning and surpasses human performance in camouflaged perception tasks.

Authors:Ho Hin Lee, Quan Liu, Shunxing Bao, Yuankai Huo, Bennett A. Landman
Title: Rep3D: Re-parameterize Large 3D Kernels with Low-Rank Receptive Modeling for Medical Imaging
Abstract:
In contrast to vision transformers, which model long-range dependencies through global self-attention, large kernel convolutions provide a more efficient and scalable alternative, particularly in high-resolution 3D volumetric settings. However, naively increasing kernel size often leads to optimization instability and degradation in performance. Motivated by the spatial bias observed in effective receptive fields (ERFs), we hypothesize that different kernel elements converge at variable rates during training. To support this, we derive a theoretical connection between element-wise gradients and first-order optimization, showing that structurally re-parameterized convolution blocks inherently induce spatially varying learning rates. Building on this insight, we introduce Rep3D, a 3D convolutional framework that incorporates a learnable spatial prior into large kernel training. A lightweight two-stage modulation network generates a receptive-biased scaling mask, adaptively re-weighting kernel updates and enabling local-to-global convergence behavior. Rep3D adopts a plain encoder design with large depthwise convolutions, avoiding the architectural complexity of multi-branch compositions. We evaluate Rep3D on five challenging 3D segmentation benchmarks and demonstrate consistent improvements over state-of-the-art baselines, including transformer-based and fixed-prior re-parameterization methods. By unifying spatial inductive bias with optimization-aware learning, Rep3D offers an interpretable and scalable solution for 3D medical image analysis. The source code is publicly available at https://github.com/leeh43/Rep3D.
Chinese: Rep3D提出了一种带有可学习空间先验的3D卷积框架,通过自适应重加权大核训练中的内核更新,在3D医学图像分割任务中实现了优于现有方法的性能与可扩展性。
English: Rep3D introduces a 3D convolutional framework with a learnable spatial prior that adaptively re-weights kernel updates during large kernel training, achieving improved performance and scalability over existing methods in 3D medical image segmentation.

Authors:Jeongsoo Choi, Zhikang Niu, Ji-Hoon Kim, Chunhui Wang, Joon Son Chung, Xie Chen
Title: Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
Abstract:
The goal of this paper is to optimize the training process of diffusion-based text-to-speech models. While recent studies have achieved remarkable advancements, their training demands substantial time and computational costs, largely due to the implicit guidance of diffusion models in learning complex intermediate representations. To address this, we propose A-DMA, an effective strategy for Accelerating training with Dual Modality Alignment. Our method introduces a novel alignment pipeline leveraging both text and speech modalities: text-guided alignment, which incorporates contextual representations, and speech-guided alignment, which refines semantic representations. By aligning hidden states with discriminative features, our training scheme reduces the reliance on diffusion models for learning complex representations. Extensive experiments demonstrate that A-DMA doubles the convergence speed while achieving superior performance over baselines. Code and demo samples are available at: https://github.com/ZhikangNiu/A-DMA
中文: 本文提出A-DMA双模态对齐策略,通过文本与语音引导的隐状态对齐加速扩散式文本转语音模型的训练,在实现更优性能的同时使收敛速度提升一倍。
English: This paper introduces A-DMA, a dual modality alignment strategy that accelerates diffusion-based text-to-speech training by aligning hidden states with discriminative features from text and speech, doubling convergence speed while outperforming baselines.

Authors:Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
Title: Learning to Reason without External Rewards
Abstract:
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
中文摘要:提出的Intuitor框架通过将模型自身置信度作为内在奖励信号,使大语言模型能够在无需外部监督的情况下学习复杂推理,在保持与监督方法相当性能的同时,实现了跨领域任务的更优泛化能力。
English Summary: The proposed Intuitor framework enables large language models to learn complex reasoning through self-certainty as an intrinsic reward signal, achieving comparable performance to supervised methods while demonstrating superior generalization across domains without requiring external supervision.
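
The substitution described in the abstract (self-certainty scores in place of external rewards inside GRPO) can be sketched as follows. This is a minimal illustration under two assumptions not stated in the abstract: self-certainty is taken here to be the mean KL divergence from the uniform distribution to the model's next-token distribution, and the group-relative advantage is the score standardized within a group of sampled responses. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def self_certainty(token_dists):
    """Mean KL(U || p) over token positions (assumed definition).

    `token_dists` is a list of next-token probability lists, one per position.
    A uniform distribution scores 0; more peaked distributions score higher.
    """
    total = 0.0
    for p in token_dists:
        v = len(p)
        # KL(U || p) = sum_j (1/v) * log((1/v) / p_j); clamp p_j to avoid log(0).
        total += sum((1.0 / v) * math.log((1.0 / v) / max(pj, 1e-12)) for pj in p)
    return total / len(token_dists)


def group_relative_advantages(scores, eps=1e-8):
    """Standardize scores within one group of sampled responses."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]
```

In this sketch, the scores passed to `group_relative_advantages` would be the `self_certainty` values of the responses sampled for one prompt, so no gold answer or test case enters the computation.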

Authors:Hu Xiaobin, Liang Yujie, Luo Donghao, Peng Xu, Zhang Jiangning, Zhu Junwei, Wang Chengjie, Fu Yanwei
Title: VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models
Abstract:
While virtual try-on has achieved significant progress, evaluating these models towards real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons: (1) current metrics inadequately reflect human perception, particularly in unpaired try-on settings; (2) most existing test sets are limited to indoor scenarios, lacking complexity for real-world evaluation; and (3) an ideal system should guide future advancements in virtual try-on generation. To address these needs, we introduce VTBench, a hierarchical benchmark suite that systematically decomposes virtual image try-on into hierarchical, disentangled dimensions, each equipped with tailored test sets and evaluation criteria. VTBench exhibits three key advantages: (1) Multi-Dimensional Evaluation Framework: The benchmark encompasses five critical dimensions for virtual try-on generation (overall image quality, texture preservation, complex background consistency, cross-category size adaptability, and hand-occlusion handling). Granular evaluation metrics of corresponding test sets pinpoint model capabilities and limitations across diverse, challenging scenarios. (2) Human Alignment: Human preference annotations are provided for each test set, ensuring the benchmark's alignment with perceptual quality across all evaluation dimensions. (3) Valuable Insights: Beyond standard indoor settings, we analyze model performance variations across dimensions and investigate the disparity between indoor and real-world try-on scenarios. To advance the field of virtual try-on towards challenging real-world scenarios, VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.
中文: VTBench作为一个分层基准套件被提出,通过弥补现有指标的不足、测试集的局限性,并在多样化现实场景中实现与人类感知一致的评价,系统性地评估虚拟试穿模型。
English: VTBench is introduced as a hierarchical benchmark suite to systematically evaluate virtual try-on models by addressing gaps in current metrics, test set limitations, and the need for human-aligned assessments across diverse real-world scenarios.

Authors:Ying Xiao, Jie Huang, Ruijuan He, Jing Xiao, Mohammad Reza Mousavi, Yepang Liu, Kezhi Li, Zhenpeng Chen, Jie M. Zhang
Title: AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare
Abstract:
Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their mistakes and the biases behind them pose life-critical risks. Bias linked to race, sex, and socioeconomic status is already well known, but a consistent and automatic testbed for measuring it is missing. To fill this gap, this paper presents AMQA -- an Adversarial Medical Question-Answering dataset -- built for automated, large-scale bias evaluation of LLMs in medical QA. AMQA includes 4,806 medical QA pairs sourced from the United States Medical Licensing Examination (USMLE) dataset, generated using a multi-agent framework to create diverse adversarial descriptions and question pairs. Using AMQA, we benchmark five representative LLMs and find surprisingly substantial disparities: even GPT-4.1, the least biased model tested, answers privileged-group questions over 10 percentage points more accurately than unprivileged ones. Compared with the existing benchmark CPV, AMQA reveals 15% larger accuracy gaps on average between privileged and unprivileged groups. Our dataset and code are publicly available at https://github.com/XY-Showing/AMQA to support reproducible research and advance trustworthy, bias-aware medical AI.
Chinese Summary: 大型语言模型在医学诊断中虽达专家水平却存在危险偏见,为此开发的AMQA对抗性数据集显示:即使表现最佳的GPT-4.1模型,对特权群体的回答准确率仍比非特权群体高出10%以上,暴露出医疗AI的严重公平性问题。
English Summary: Large language models (LLMs) demonstrate expert-level medical diagnostic accuracy but exhibit dangerous biases, leading to the creation of AMQA, an adversarial dataset that reveals significant performance disparities between privileged and unprivileged groups, with even GPT-4.1 showing an accuracy gap of over 10 percentage points.

Authors:Yuan Feng, Yukun Cao, Hairu Wang, Xike Xie, S Kevin Zhou
Title: Lego Sketch: A Scalable Memory-augmented Neural Network for Sketching Data Streams
Abstract:
Sketches, probabilistic structures for estimating item frequencies in infinite data streams with limited space, are widely used across various domains. Recent studies have shifted the focus from handcrafted sketches to neural sketches, leveraging memory-augmented neural networks (MANNs) to enhance the streaming compression capabilities and achieve better space-accuracy trade-offs. However, existing neural sketches struggle to scale across different data domains and space budgets due to inflexible MANN configurations. In this paper, we introduce a scalable MANN architecture that brings to life the Lego sketch, a novel sketch with superior scalability and accuracy. Much like assembling creations with modular Lego bricks, the Lego sketch dynamically coordinates multiple memory bricks to adapt to various space budgets and diverse data domains. Our theoretical analysis guarantees its high scalability and provides the first error bound for neural sketches. Furthermore, extensive experimental evaluations demonstrate that the Lego sketch exhibits superior space-accuracy trade-offs, outperforming existing handcrafted and neural sketches. Our code is available at https://github.com/FFY0/LegoSketch_ICML.
中文: Lego sketch 提出了一种可扩展的记忆增强神经网络架构,通过动态协调模块化记忆块,实现了跨数据域和空间预算的卓越适应性,在空间-精度权衡上优于现有草图方法。
English: The Lego sketch introduces a scalable memory-augmented neural network architecture that dynamically coordinates modular memory bricks to achieve superior adaptability across data domains and space budgets, outperforming existing sketches in space-accuracy trade-offs.
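
For readers unfamiliar with the "handcrafted sketches" the abstract compares against, the following is a minimal sketch of one classic example, the Count-Min sketch, which estimates item frequencies in a stream using a fixed-size table. It is shown only as background; it is not the Lego sketch itself, and the class name, default sizes, and SHA-256-based hashing are illustrative choices.

```python
import hashlib


class CountMinSketch:
    """Fixed-space frequency estimator; estimates never undercount."""

    def __init__(self, width=64, depth=3):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))
```

The space budget is fixed by `width * depth`, which is the kind of space-accuracy trade-off the abstract refers to: a smaller table saves memory at the cost of more collisions and larger overestimates.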

Authors:Jianan Lou, Rong Zhang
Title: LF-GNSS: Towards More Robust Satellite Positioning with a Hard Example Mining Enhanced Learning-Filtering Deep Fusion Framework
Abstract:
Global Navigation Satellite System (GNSS) is essential for autonomous driving systems, unmanned vehicles, and various location-based technologies, as it provides the precise geospatial information necessary for navigation and situational awareness. However, its performance is often degraded by Non-Line-Of-Sight (NLOS) and multipath effects, especially in urban environments. Recently, Artificial Intelligence (AI) has been driving innovation across numerous industries, introducing novel solutions to mitigate the challenges in satellite positioning. This paper presents a learning-filtering deep fusion framework for satellite positioning, termed LF-GNSS. The framework utilizes deep learning networks to intelligently analyze the signal characteristics of satellite observations, enabling the adaptive construction of observation noise covariance matrices and compensated innovation vectors for Kalman filter input. A dynamic hard example mining technique is incorporated to enhance model robustness by prioritizing challenging satellite signals during training. Additionally, we introduce a novel feature representation based on Dilution of Precision (DOP) contributions, which helps to more effectively characterize the signal quality of individual satellites and improve measurement weighting. LF-GNSS has been validated on both public and private datasets, demonstrating superior positioning accuracy compared to traditional methods and other learning-based solutions. To encourage further integration of AI and GNSS research, we will open-source the code at https://github.com/GarlanLou/LF-GNSS, and release a collection of satellite positioning datasets for urban scenarios at https://github.com/GarlanLou/LF-GNSS-Dataset.
中文: 本文提出LF-GNSS学习滤波深度融合框架,利用人工智能分析卫星信号特征并结合动态难例挖掘技术,在城市环境中显著提升了定位精度,优于传统方法。
English: This paper introduces LF-GNSS, a learning-filtering deep fusion framework that leverages AI to enhance satellite positioning accuracy by intelligently analyzing signal characteristics and incorporating dynamic hard example mining, demonstrating superior performance over traditional methods in urban environments.
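
The way a learned observation noise covariance and a compensated innovation enter a Kalman filter can be illustrated with a scalar measurement update. This is a minimal sketch, not the LF-GNSS implementation: the network outputs are stood in for by the plain arguments `R` and `delta`, and the sign convention for the compensation term is an assumption.

```python
def kalman_update(x, P, z, H, R, delta=0.0):
    """Scalar Kalman measurement update.

    x, P  : prior state estimate and its variance
    z, H  : measurement and measurement model (z ~ H * x + noise)
    R     : observation noise variance (here imagined as predicted by a network)
    delta : innovation compensation (hypothetical network output, assumed sign)
    """
    innovation = z - H * x - delta   # compensated innovation
    S = H * P * H + R                # innovation variance
    K = P * H / S                    # Kalman gain
    x_new = x + K * innovation
    P_new = (1.0 - K * H) * P
    return x_new, P_new
```

Raising `R` for a measurement shrinks the gain `K`, so a satellite signal judged unreliable (for example under NLOS or multipath conditions) pulls the state estimate less.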

Authors:Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, wenlin zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, Tong Xu
Title: From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents
Abstract:
Large Language Models (LLMs) have recently been widely adopted in conversational agents. However, the increasingly long interactions between users and agents accumulate extensive dialogue records, making it difficult for LLMs with limited context windows to maintain a coherent long-term dialogue memory and deliver personalized responses. While retrieval-augmented memory systems have emerged to address this issue, existing methods often depend on single-granularity memory segmentation and retrieval. This approach falls short in capturing deep memory connections, leading to partial retrieval of useful information or substantial noise, resulting in suboptimal performance. To tackle these limits, we propose MemGAS, a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. MemGAS is based on multi-granularity memory units and employs Gaussian Mixture Models to cluster and associate new memories with historical ones. An entropy-based router adaptively selects optimal granularity by evaluating query relevance distributions and balancing information completeness and noise. Retrieved memories are further refined via LLM-based filtering. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answering and retrieval tasks, achieving superior performance across different query types and top-K settings. (https://github.com/quqxui/MemGAS)
中文:MemGAS框架通过构建多粒度记忆关联与自适应检索机制,有效解决了大语言模型在长对话记忆中的局限性,在多个基准测试中通过智能记忆整合和降噪实现了优于现有方法的性能表现。
English: The MemGAS framework addresses limitations in long-term dialogue memory for LLMs by implementing multi-granularity memory association and adaptive retrieval, outperforming existing methods across multiple benchmarks through intelligent memory consolidation and noise reduction.
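
The idea of an entropy-based router over query relevance distributions can be sketched as below. This is a simplified illustration under an assumption of our own: the granularity whose relevance distribution has the lowest entropy (the most decisive retrieval) is selected outright. The actual MemGAS router may combine granularities differently, and the granularity names and function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more peaked distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_granularity(relevance):
    """Pick the granularity whose relevance distribution is least uncertain.

    `relevance` maps a granularity name to the query-memory relevance scores
    of the memory units at that granularity.
    """
    return min(relevance, key=lambda g: entropy(softmax(relevance[g])))
```

For example, if turn-level units give one clearly relevant hit while session-level units all look equally relevant, the router in this sketch prefers the turn level.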

Authors:Ali Bahri, Moslem Yazdanpanah, Sahar Dastani, Mehrdad Noori, Gustavo Adolfo Vargas Hakim, David Osowiechi, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers
Title: SMART-PC: Skeletal Model Adaptation for Robust Test-Time Training in Point Clouds
Abstract:
Test-Time Training (TTT) has emerged as a promising solution to address distribution shifts in 3D point cloud classification. However, existing methods often rely on computationally expensive backpropagation during adaptation, limiting their applicability in real-world, time-sensitive scenarios. In this paper, we introduce SMART-PC, a skeleton-based framework that enhances resilience to corruptions by leveraging the geometric structure of 3D point clouds. During pre-training, our method predicts skeletal representations, enabling the model to extract robust and meaningful geometric features that are less sensitive to corruptions, thereby improving adaptability to test-time distribution shifts. Unlike prior approaches, SMART-PC achieves real-time adaptation by eliminating backpropagation and updating only BatchNorm statistics, resulting in a lightweight and efficient framework capable of achieving high frame-per-second rates while maintaining superior classification performance. Extensive experiments on benchmark datasets, including ModelNet40-C, ShapeNet-C, and ScanObjectNN-C, demonstrate that SMART-PC achieves state-of-the-art results, outperforming existing methods such as MATE in terms of both accuracy and computational efficiency. The implementation is available at: https://github.com/AliBahri94/SMART-PC.
Chinese: SMART-PC提出了一种基于骨架的三维点云分类框架,通过仅更新BatchNorm统计量实现无需反向传播的实时自适应,在保持高帧率的同时获得了最优的分类精度。
English: SMART-PC introduces a skeleton-based framework for 3D point cloud classification that eliminates backpropagation during test-time adaptation, achieving real-time performance and superior accuracy by updating only BatchNorm statistics.
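
Adapting by "updating only BatchNorm statistics" means refreshing the running mean and variance from unlabeled test batches while all learned weights stay frozen, so no gradients are needed. The following is a minimal sketch of that general technique in plain Python; the momentum value, the use of the biased batch variance, and the function names are assumptions for illustration rather than details of SMART-PC.

```python
import math


def update_bn_stats(running_mean, running_var, batch, momentum=0.1):
    """Blend per-feature statistics of a test batch into the running statistics.

    `batch` is a list of feature vectors. No backpropagation is involved.
    """
    n = len(batch)
    new_mean, new_var = [], []
    for d in range(len(running_mean)):
        col = [x[d] for x in batch]
        m = sum(col) / n
        v = sum((c - m) ** 2 for c in col) / n  # biased batch variance
        new_mean.append((1 - momentum) * running_mean[d] + momentum * m)
        new_var.append((1 - momentum) * running_var[d] + momentum * v)
    return new_mean, new_var


def bn_forward(x, mean, var, gamma, beta, eps=1e-5):
    """Normalize one feature vector with the (adapted) statistics.

    gamma and beta are the frozen affine parameters learned during training.
    """
    return [g * (xi - m) / math.sqrt(v + eps) + b
            for xi, m, v, g, b in zip(x, mean, var, gamma, beta)]
```

Since each update is a single pass over the batch, the cost per test batch is a forward computation only, which is consistent with the real-time claim in the abstract.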

Authors:Yiyun Zhou, Zheqi Lv, Shengyu Zhang, Jingyuan Chen
Title: Cuff-KT: Tackling Learners' Real-time Learning Pattern Adjustment via Tuning-Free Knowledge State Guided Model Updating
Abstract:
Knowledge Tracing (KT) is a core component of Intelligent Tutoring Systems, modeling learners' knowledge state to predict future performance and provide personalized learning support. Traditional KT models assume that learners' learning abilities remain relatively stable over short periods or change in predictable ways based on prior performance. However, in reality, learners' abilities change irregularly due to factors like cognitive fatigue, motivation, and external stress. We introduce this setting as a task, which we refer to as Real-time Learning Pattern Adjustment (RLPA). Existing KT models, when faced with RLPA, lack sufficient adaptability, because they fail to account in a timely manner for the dynamic nature of different learners' evolving learning patterns. Current strategies for enhancing adaptability rely on retraining, which leads to significant overfitting and high time overhead. To address this, we propose Cuff-KT, comprising a controller and a generator. The controller assigns value scores to learners, while the generator generates personalized parameters for selected learners. Cuff-KT controllably adapts to data changes quickly and flexibly without fine-tuning. Experiments on five datasets from different subjects demonstrate that Cuff-KT significantly improves the performance of five KT models with different structures under intra- and inter-learner shifts, with an average relative increase in AUC of 10% and 4%, respectively, at a negligible time cost, effectively tackling the RLPA task. Our code and datasets are fully available at https://github.com/zyy-2001/Cuff-KT.
Chinese: 知识追踪模型通常假设学习能力稳定,但现实因素导致其不规则变化,提出的Cuff-KT通过动态适应学习者变化模式而无需重新训练,显著提升了多数据集上的性能表现。
English: Knowledge Tracing models traditionally assume stable learning abilities, but real-world factors cause irregular changes, which the proposed Cuff-KT addresses by dynamically adapting to learners' evolving patterns without retraining, significantly improving performance across datasets.

Authors:Jintao Tong, Wenwei Jin, Pengda Qin, Anqi Li, Yixiong Zou, Yuhong Li, Yuhua Li, Ruixuan Li
Title: FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models
Abstract:
Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut
中文: 大型视觉语言模型因冗余视觉令牌导致计算效率低下,现有剪枝方法仅依赖单层注意力评分存在不足,因此提出FlowCut这一信息流感知框架,通过更贴合模型内在行为来提升性能与速度。
English: Large vision-language models face computational inefficiency from redundant visual tokens, which existing pruning methods inadequately address by relying on single-layer attention scores, prompting the development of FlowCut, an information-flow-aware framework that enhances performance and speed by better aligning with the model's inherent behaviors.

Authors:Yejin Lee, Joonghyuk Hahn, Hyeseon Ahn, Yo-Sub Han
Title: AmpleHate: Amplifying the Attention for Versatile Implicit Hate Detection
Abstract:
Implicit hate speech detection is challenging due to its subtlety and reliance on contextual interpretation rather than explicit offensive words. Current approaches rely on contrastive learning, which has been shown to be effective at distinguishing hate and non-hate sentences. Humans, however, detect implicit hate speech by first identifying specific targets within the text and subsequently interpreting how these targets relate to their surrounding context. Motivated by this reasoning process, we propose AmpleHate, a novel approach designed to mirror human inference for implicit hate detection. AmpleHate identifies explicit targets using a pretrained Named Entity Recognition model and captures implicit target information via [CLS] tokens. It computes attention-based relationships between explicit targets, implicit targets, and sentence context, and then directly injects these relational vectors into the final sentence representation. This amplifies the critical signals of target-context relations for determining implicit hate. Experiments demonstrate that AmpleHate achieves state-of-the-art performance, outperforming contrastive learning baselines by an average of 82.14% and achieving faster convergence. Qualitative analyses further reveal that attention patterns produced by AmpleHate closely align with human judgement, underscoring its interpretability and robustness. Our code is publicly available at: https://github.com/leeyejin1231/AmpleHate.
Chinese Summary: AmpleHate提出了一种新颖的隐式仇恨言论检测方法,通过模拟人类推理过程识别目标与上下文关系,在性能和收敛速度上均优于现有技术。
English Summary: AmpleHate introduces a novel method for detecting implicit hate speech by mimicking human reasoning, using target identification and context relationships to achieve superior performance and faster convergence compared to existing approaches.

Authors:Dongil Yang, Minjin Kim, Sunghwan Kim, Beong-woo Kwak, Minjun Park, Jinseok Hong, Woontack Woo, Jinyoung Yeo
Title: LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study
Abstract:
The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs' ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs' ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at https://tsg-bench.netlify.app. Additionally, our code and evaluation data are publicly available at https://github.com/docworlds/tsg-bench.
中文: 大语言模型在场景图理解方面表现出色,但在从复杂叙事生成场景图时存在困难,这凸显了该领域需要改进方法。
English: Large Language Models demonstrate strong scene graph understanding but face challenges in generating scene graphs from complex narratives, highlighting the need for improved methodologies in this area.

Authors:Yifan Jia, Kailin Jiang, Yuyang Liang, Qihan Ren, Yi Xin, Rui Yang, Fenze Feng, Mingcai Chen, Hengyang Lu, Haozhe Wang, Xiaoye Qu, Dongrui Liu, Lizhen Cui, Yuntao Du
Title: Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models
Abstract:
Large Multimodal Models (LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation (RAG) frameworks where the contextual information from external sources may contradict the model's internal parametric knowledge, leading to unreliable outputs. However, existing benchmarks fail to reflect such realistic conflict scenarios. Most focus solely on intra-memory conflicts, while context-memory and inter-context conflicts remain largely under-investigated. Furthermore, commonly used factual knowledge-based evaluations are often overlooked, and existing datasets lack a thorough investigation into conflict detection capabilities. To bridge this gap, we propose MMKC-Bench, a benchmark designed to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench encompasses three types of multimodal knowledge conflicts and includes 1,573 knowledge instances and 3,381 images across 23 broad types, collected through automated pipelines with human verification. We evaluate three representative series of LMMs on both model behavior analysis and conflict detection tasks. Our findings show that while current LMMs are capable of recognizing knowledge conflicts, they tend to favor internal parametric knowledge over external evidence. We hope MMKC-Bench will foster further research in multimodal knowledge conflict and enhance the development of multimodal RAG systems. The source code is available at https://github.com/MLLMKCBENCH/MLLMKC.
中文: 大型多模态模型在检索增强生成中面临知识冲突问题,为此我们提出MMKC-Bench基准进行评估,发现现有模型倾向于依赖内部参数知识而非外部证据。
English: Large Multimodal Models struggle with knowledge conflicts in retrieval-augmented generation, so we introduce MMKC-Bench to evaluate these scenarios and find that models often prioritize internal knowledge over external evidence.

Authors:Pingzhi Li, Zhen Tan, Huaizhi Qu, Huan Liu, Tianlong Chen
Title: DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
Abstract:
Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while preserving or even improving the original performance of the teacher model, student models distilled from the defensively generated teacher outputs demonstrate catastrophically reduced performance, demonstrating our method's effectiveness as a practical safeguard against KD-based model imitation.
中文: 本文提出防御性输出生成(DOGe)策略,通过微调教师大语言模型的最后一层,使其输出在保持对合法用户有效的同时,严重破坏基于知识蒸馏的模型模仿效果。
English: This paper introduces Defensive Output Generation (DOGe), an efficient strategy that fine-tunes the final layer of a teacher LLM to subtly alter its outputs, preserving utility for legitimate users while severely degrading performance in distillation-based imitation attempts.

Authors:Sanghyun Kim, Deunsol Jung, Minsu Cho
Title: Locality-Aware Zero-Shot Human-Object Interaction Detection
Abstract:
Recent methods for zero-shot Human-Object Interaction (HOI) detection typically leverage the generalization ability of a large Vision-Language Model (VLM), i.e., CLIP, on unseen categories, showing impressive results on various zero-shot settings. However, existing methods struggle to adapt CLIP representations for human-object pairs, as CLIP tends to overlook fine-grained information necessary for distinguishing interactions. To address this issue, we devise LAIN, a novel zero-shot HOI detection framework enhancing the locality and interaction awareness of CLIP representations. The locality awareness, which involves capturing fine-grained details and the spatial structure of individual objects, is achieved by aggregating the information and spatial priors of adjacent neighborhood patches. The interaction awareness, which involves identifying whether and how a human is interacting with an object, is achieved by capturing the interaction pattern between the human and the object. By infusing locality and interaction awareness into CLIP representations, LAIN captures detailed information about the human-object pairs. Our extensive experiments on existing benchmarks show that LAIN outperforms previous methods on various zero-shot settings, demonstrating the importance of locality and interaction awareness for effective zero-shot HOI detection.
Chinese: LAIN是一种新颖的零样本人物-物体交互检测框架,通过增强CLIP表征的局部感知能力以捕捉细粒度细节和空间结构,以及交互感知能力以识别人与物体的互动模式,在多种基准测试中表现卓越。
English: LAIN is a novel zero-shot HOI detection framework that enhances CLIP's representations by incorporating locality awareness to capture fine-grained details and spatial structures, and interaction awareness to identify human-object interaction patterns, achieving superior performance on various benchmarks.

Authors:Zhenhao Zhou, Zhuochen Huang, Yike He, Chong Wang, Jiajun Wang, Yijian Wu, Xin Peng, Yiling Lou
Title: Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs
Abstract:
The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, an FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL⁺, an enhancement framework designed to improve FL effectiveness of LLM agents for the Linux kernel. LinuxFL⁺ substantially improves the FL accuracy of all studied agents (e.g., 7.2% - 11.2% accuracy increase) with minimal costs. Data and code are available at https://github.com/FudanSELab/LinuxFLBench.
中文: 本研究提出了LinuxFLBench,一个针对Linux内核故障定位的基准测试,发现现有LLM代理表现不佳,并设计了LinuxFL⁺框架,以最小成本显著提升了其定位准确性。
English: The study introduces LinuxFLBench, a benchmark for fault localization in the Linux kernel, revealing that current LLM agents perform poorly and proposes LinuxFL⁺, a framework that significantly enhances their accuracy with minimal cost.
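As a rough illustration of the file-level top-1 accuracy reported above, the following sketch counts a bug as localized when any ground-truth buggy file appears among the first k ranked candidates. That hit criterion, the function names, and the example file paths are assumptions made for illustration; the abstract does not spell out the exact scoring rule.

```python
def top_k_hit(ranked_files, buggy_files, k=1):
    """Return True if any ground-truth buggy file is in the first k ranked files.

    Assumption: a single match counts as a successful localization.
    """
    return any(path in buggy_files for path in ranked_files[:k])


def top_k_accuracy(predictions, k=1):
    """Fraction of bugs localized within the top k candidates.

    `predictions` is a list of (ranked_files, buggy_files) pairs, one per bug.
    """
    if not predictions:
        return 0.0
    hits = sum(top_k_hit(ranked, buggy, k) for ranked, buggy in predictions)
    return hits / len(predictions)


if __name__ == "__main__":
    # Hypothetical agent outputs for two bugs (paths are illustrative only).
    example = [
        (["fs/ext4/inode.c", "mm/slab.c"], {"fs/ext4/inode.c"}),
        (["net/core/dev.c", "kernel/fork.c"], {"kernel/fork.c"}),
    ]
    print(top_k_accuracy(example, k=1))
    print(top_k_accuracy(example, k=2))
```

Under this reading, a best top-1 accuracy of 41.6% would mean the first-ranked file is a buggy file for roughly two in five benchmark bugs.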

Authors:Xinmiao Hu, Chun Wang, Ruihe An, ChenYu Shao, Xiaojun Ye, Sheng Zhou, Liangcheng Li
Title: Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated strong performance in visual understanding tasks, yet they often suffer from object hallucinations--generating descriptions of objects that are inconsistent with or entirely absent from the input. This issue is closely related to dataset biases, where frequent co-occurrences of objects lead to entangled semantic representations across modalities. As a result, models may erroneously activate object representations that are commonly associated with the input but not actually present. To address this, we propose a causality-driven disentanglement framework that mitigates hallucinations through causal intervention. Our approach includes a Causal-Driven Projector in the visual pathway and a Causal Intervention Module integrated into the final transformer layer of the language model. These components work together to reduce spurious correlations caused by biased training data. Experimental results show that our method significantly reduces hallucinations while maintaining strong performance on multiple multimodal benchmarks. Visualization analyses further confirm improved separability of object representations. The code is available at: https://github.com/IgniSavium/Causal-LLaVA
Chinese: 该因果驱动解耦框架通过视觉路径中的因果投影器和语言模型的干预模块,有效减少了多模态大语言模型因数据偏见导致的物体幻觉问题。
English: The proposed causality-driven disentanglement framework effectively reduces object hallucinations in Multimodal Large Language Models by mitigating spurious correlations from dataset biases through specialized visual and intervention modules.

Authors:Puyuan Peng, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
Title: VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation
Abstract:
We present VoiceStar, the first zero-shot TTS model that achieves both output duration control and extrapolation. VoiceStar is an autoregressive encoder-decoder neural codec language model that leverages a novel Progress-Monitoring Rotary Position Embedding (PM-RoPE) and is trained with Continuation-Prompt Mixed (CPM) training. PM-RoPE enables the model to better align text and speech tokens, indicates the target duration for the generated speech, and also allows the model to generate speech waveforms much longer in duration than those seen during training. CPM training also helps to mitigate the training/inference mismatch, and significantly improves the quality of the generated speech in terms of speaker similarity and intelligibility. VoiceStar outperforms or is on par with current state-of-the-art models on short-form benchmarks such as Librispeech and Seed-TTS, and significantly outperforms these models on long-form/extrapolation benchmarks (20-50s) in terms of intelligibility and naturalness. Code and models: https://github.com/jasonppy/VoiceStar. Audio samples: https://jasonppy.github.io/VoiceStar_web
中文:VoiceStar是首个实现时长控制和超长语音生成的零样本TTS模型,通过PM-RoPE和CPM训练技术,在长语音合成基准测试中显著超越现有最优模型。
English: VoiceStar is the first zero-shot TTS model that achieves duration control and extrapolation through PM-RoPE and CPM training, outperforming state-of-the-art models in long-form speech generation.

Authors:Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu
Title: BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
Abstract:
Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
中文: BizFinBench作为首个金融领域专用基准,通过6,781条标注中文查询和创新的IteraJudge评估方法,系统评估了25个模型在五大金融能力维度的表现,发现现有模型在跨概念推理等复杂场景仍存在明显不足。
English: BizFinBench is a specialized benchmark with 6,781 annotated Chinese queries to rigorously evaluate LLMs in financial applications, revealing significant performance gaps across tasks and introducing IteraJudge to reduce evaluation bias.

Authors:X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, K. Huang
Title: CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features
Abstract:
Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (e.g., depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a compact spatial feature, enabling thorough intra- and inter-modality spatial modeling. Additionally, we design an efficient Temporal Compact Module that compactly represents temporal features by constructing the refined target distribution heatmap. Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks. The code and models will be released at: https://github.com/XiaokunFeng/CSTrack.
中文摘要:CSTrack通过空间紧凑模块和时间紧凑模块整合RGB-X双模态特征流,实现了简化的计算结构和最优的跟踪性能,在主流基准测试中创下新纪录。
English Summary: CSTrack introduces compact spatiotemporal modeling through spatial and temporal modules that unify RGB-X feature streams, achieving state-of-the-art tracking performance with simplified computation.

Authors:Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, Bo Li
Title: Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression
Abstract:
Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks only focus on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring the agentic capabilities - workflow, tool use/function call, long-context understanding and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found in https://github.com/pprp/ACBench.
中文: 训练后压缩可降低大语言模型的计算与内存成本,但现有基准仅关注语言建模而忽略智能体能力,因此ACBench作为首个全面评估压缩对智能体能力影响的基准,涵盖12项任务与多种压缩技术,实验表明4位量化能保持工作流生成与工具使用功能,但会降低实际应用准确性10%-15%。
English: Post-training compression enables efficient deployment of large language models but existing benchmarks overlook agentic capabilities, prompting the introduction of ACBench to comprehensively evaluate compression impacts on tasks like workflow generation and tool use, revealing tradeoffs such as preserved functionality in quantization but degraded real-world accuracy.

Authors:Keane Ong, Rui Mao, Deeksha Varshney, Paul Pu Liang, Erik Cambria, Gianmarco Mengaldo
Title: Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation
Abstract:
Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form, forward counterfactual reasoning, focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. LLMs offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, FIN-FORCE (FINancial FORward Counterfactual Evaluation). By curating financial news headlines and providing structured evaluation, FIN-FORCE supports LLM-based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on FIN-FORCE, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research. We release the benchmark, supplementary data and all experimental codes at the following link: https://github.com/keanepotato/fin_force
中文摘要:前瞻性反事实推理有助于预测金融市场未来动向,而FIN-FORCE基准通过评估大语言模型生成此类预测的能力,为自动化决策支持开辟了新途径。
English Summary: Forward counterfactual reasoning helps anticipate future financial market developments, and the FIN-FORCE benchmark enables automated evaluation of large language models for generating these insights to support decision-making.

Authors:Abhijnan Nath, Carine Graff, Andrei Bachinin, Nikhil Krishnaswamy
Title: Frictional Agent Alignment Framework: Slow Down and Don't Break Things
Abstract:
AI support of collaborative interactions entails mediating potential misalignment between interlocutor beliefs. Common preference alignment methods like DPO excel in static settings, but struggle in dynamic collaborative tasks where the explicit signals of interlocutor beliefs are sparse and skewed. We propose the Frictional Agent Alignment Framework (FAAF) to generate precise, context-aware "friction" that prompts for deliberation and re-examination of existing evidence. FAAF's two-player objective decouples from data skew: a frictive-state policy identifies belief misalignments, while an intervention policy crafts collaborator-preferred responses. We derive an analytical solution to this objective, enabling training a single policy via a simple supervised loss. Experiments on three benchmarks show FAAF outperforms competitors in producing concise, interpretable friction and in OOD generalization. By aligning LLMs to act as adaptive "thought partners" -- not passive responders -- FAAF advances scalable, dynamic human-AI collaboration. Our code and data can be found at https://github.com/csu-signal/FAAF_ACL.
中文摘要:摩擦代理对齐框架(FAAF)通过生成情境感知的“摩擦”来促进动态协作中的信念对齐,其解耦的双策略设计在可解释性和泛化能力上均优于现有方法。
English Summary: The Frictional Agent Alignment Framework (FAAF) addresses belief misalignment in dynamic AI collaboration by generating contextual friction that prompts evidence re-examination, outperforming existing methods in interpretability and generalization.

Authors:Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Tianyi Chen
Title: WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
Abstract:
The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. While recent approaches, such as Mixture-of-Experts (MoE), leverage selective activation but require specialized training, training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise $\ell_2$-norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal approximation error bounds with theoretical guarantees tighter than existing techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g., TEAL) by up to $2.94\%$ in average performance at the same sparsity levels, across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference, advancing training-free sparse activation methods and setting a robust baseline for efficient inference. The source code is available at https://github.com/microsoft/wina.
中文: WINA是一种无需训练的新型稀疏激活框架,通过联合考虑隐藏状态幅度和权重矩阵范数,在相同稀疏度下比现有方法性能提升最高达2.94%,为LLM推理设立了新的性能基准。
English: WINA is a novel training-free sparse activation framework that jointly considers hidden state magnitudes and weight matrix norms to achieve superior approximation accuracy and outperform existing methods by up to 2.94% across various LLMs and datasets.
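The selection criterion described above can be sketched in a few lines. The version below scores each input dimension by the product of its hidden-state magnitude and the ℓ2 norm of the matching weight column, keeps the top-scoring entries, and contrasts that with a magnitude-only baseline. It is a minimal sketch: the per-vector top-k rule, the function names, and the toy numbers are assumptions for illustration, and the authors' implementation in the linked repository may differ.

```python
import math


def column_l2_norms(weight):
    """L2 norm of each column of `weight`, given as a list of rows (out_dim x in_dim)."""
    in_dim = len(weight[0])
    return [math.sqrt(sum(row[j] ** 2 for row in weight)) for j in range(in_dim)]


def top_k_mask(scores, keep):
    """Binary mask that keeps the `keep` highest-scoring positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(order[:keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


def magnitude_only_mask(hidden, keep):
    """Baseline: rank inputs by |x_i| alone."""
    return top_k_mask([abs(x) for x in hidden], keep)


def weight_informed_mask(hidden, weight, keep):
    """Rank inputs by |x_i| * ||W[:, i]||_2 (assumed form of the joint criterion)."""
    norms = column_l2_norms(weight)
    return top_k_mask([abs(x) * n for x, n in zip(hidden, norms)], keep)


def masked_matvec(weight, hidden, mask):
    """Compute W @ (mask * x) with plain Python lists."""
    return [sum(w * x * m for w, x, m in zip(row, hidden, mask)) for row in weight]


if __name__ == "__main__":
    hidden = [3.0, 1.0, 2.0]
    weight = [[0.1, 4.0, 0.0],
              [0.0, 3.0, 1.0]]
    print(magnitude_only_mask(hidden, 2))           # drops the small-|x| input
    print(weight_informed_mask(hidden, weight, 2))  # drops the low-weight input
```

In this toy case the magnitude-only rule discards the second input even though its weight column is by far the largest, while the weight-informed rule keeps it, which is the kind of approximation error the abstract attributes to magnitude-only methods.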

Authors:Ethan TS. Liu, Austin Wang, Spencer Mateega, Carlos Georgescu, Danny Tang
Title: VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation
Abstract:
Ensuring that large language models (LLMs) can effectively assess, detect, explain, and remediate software vulnerabilities is critical for building robust and secure software systems. We introduce VADER, a human-evaluated benchmark designed explicitly to assess LLM performance across four key vulnerability-handling dimensions: assessment, detection, explanation, and remediation. VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub repositories and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER, and human security experts evaluated each response according to a rigorous, standardized scoring rubric emphasizing remediation (quality of the code fix, 50%), explanation (20%), and classification and test plan (30%). Our results show that current state-of-the-art LLMs achieve only moderate success on VADER - OpenAI's o3 attained 54.7% accuracy overall, with others in the 49-54% range, indicating ample room for improvement. Notably, remediation quality is strongly correlated (Pearson r > 0.97) with accurate classification and test plans, suggesting that models that effectively categorize vulnerabilities also tend to fix them well. VADER's comprehensive dataset, detailed evaluation rubrics, scoring tools, and visualized results with confidence intervals are publicly released, providing the community with an interpretable, reproducible benchmark to advance vulnerability-aware LLMs. All code and data are available at: https://github.com/AfterQuery/vader
中文摘要:VADER是一个经人工评估的基准,专门用于测试大语言模型在软件漏洞处理四个关键维度(评估、检测、解释和修复)的表现,结果表明当前最先进模型仅取得中等成功率,仍有巨大改进空间。
English Summary: VADER is a human-evaluated benchmark designed to assess large language models' capabilities in handling software vulnerabilities across four dimensions—assessment, detection, explanation, and remediation—revealing that current state-of-the-art models achieve only moderate success with significant room for improvement.
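To make the rubric weighting concrete, the sketch below combines three component scores using the 50% / 20% / 30% split stated in the abstract. The component names, the assumption that each component is normalized to the range 0 to 1, and the validation logic are illustrative choices; the released scoring tools may organize this differently.

```python
# Weights taken from the abstract; the dictionary keys are illustrative names.
RUBRIC_WEIGHTS = {
    "remediation": 0.5,
    "explanation": 0.2,
    "classification_and_test_plan": 0.3,
}


def weighted_rubric_score(component_scores):
    """Combine per-component scores (each assumed to lie in [0, 1]) into one score."""
    missing = set(RUBRIC_WEIGHTS) - set(component_scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total = 0.0
    for name, weight in RUBRIC_WEIGHTS.items():
        score = component_scores[name]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {score}")
        total += weight * score
    return total


if __name__ == "__main__":
    # Hypothetical grades for one model response.
    print(weighted_rubric_score({
        "remediation": 0.5,
        "explanation": 1.0,
        "classification_and_test_plan": 0.0,
    }))
```

Because remediation carries half the weight, a response with a perfect explanation but a half-correct patch and no usable classification or test plan still lands below 50% under this weighting.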

Authors:Kidist Amde Mekonnen, Yosef Worku Alemneh, Maarten de Rijke
Title: Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval
Abstract:
Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models at https://github.com/kidist-amde/amharic-ir-benchmarks.
Chinese: 本研究针对阿姆哈拉语开发了专门的密集检索模型,相比现有多语言基线实现了高达17.6%的性能提升,同时模型体积大幅缩小,并公开了全部数据集与代码以推动低资源信息检索研究。
English: This research introduces Amharic-specific dense retrieval models that significantly outperform existing multilingual baselines, achieving up to 17.6% improvement in retrieval metrics while being substantially more compact, with all resources made publicly available to advance low-resource information retrieval.
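For readers unfamiliar with the two metrics quoted above, the following sketch computes MRR@10 and Recall@10 from ranked result lists using their standard definitions, plus the relative-improvement arithmetic. The function names and the toy document identifiers are hypothetical; the benchmark's own evaluation scripts are in the linked repository.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Share of the relevant documents that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_metric(metric, runs, k=10):
    """Average `metric` over (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Two toy queries with made-up document ids.
    runs = [(["a", "b", "c"], {"b"}), (["x", "y"], {"z"})]
    print(mean_metric(reciprocal_rank_at_k, runs))  # MRR@10
    print(mean_metric(recall_at_k, runs))           # Recall@10
```

Note that a relative improvement is measured against the baseline's score, so it is not the same as an absolute difference in percentage points.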

Authors:Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan
Title: Communication-Efficient Multi-Device Inference Acceleration for Transformer Models
Abstract:
Transformer models power many AI applications but suffer from high inference latency, limiting their use in real-time settings. Multi-device inference can reduce latency by parallelizing computation. Yet, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We propose ASTRA, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism and a Mixed-Precision Attention mechanism designed to minimize inter-device communication. ASTRA compresses non-local token embeddings via vector quantization and preserves task accuracy through two optimizations, Noise-Augmented Quantization and Distributed Class Tokens. Experiments on ViT and GPT2 across vision and NLP tasks show that ASTRA achieves up to 2.64X speedups over single-device inference and up to 15.25X speedups over state-of-the-art multi-device inferences, while operating under bandwidths as low as 10 Mbps. ASTRA is open-sourced at https://github.com/xl1990/Astra.
中文:ASTRA是一种通信高效的框架,通过整合序列并行和混合精度注意力机制来加速Transformer推理,在低带宽下实现显著加速,同时通过量化优化保持任务准确性。
English: ASTRA is a communication-efficient framework that accelerates Transformer inference by integrating sequence parallelism and a Mixed-Precision Attention mechanism, achieving significant speedups under low bandwidth while maintaining accuracy through quantization optimizations.
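The ASTRA abstract describes compressing non-local token embeddings via vector quantization to reduce inter-device traffic. The sketch below illustrates only generic nearest-codeword vector quantization under assumed toy values; the codebook, the helper names, and the bit-count estimate are hypothetical, and ASTRA's Mixed-Precision Attention, Noise-Augmented Quantization, and Distributed Class Tokens are not reproduced here.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embeddings: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each embedding to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(e, codebook[i]))
        for e in embeddings
    ]


def dequantize(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Reconstruct approximate embeddings on the receiving device."""
    return [list(codebook[i]) for i in indices]


def bits_per_token(codebook_size: int) -> int:
    """Bits needed to transmit one codebook index instead of a full vector."""
    return max(1, math.ceil(math.log2(codebook_size)))


# Toy example (hypothetical values): three 2-D token embeddings, three codewords.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
embeddings = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]

indices = quantize(embeddings, codebook)
approx = dequantize(indices, codebook)
```

Only the integer indices need to cross the network; the receiver looks them up in a shared codebook, which is why the approach suits low-bandwidth links.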

Authors:Libo Wang
Title: Towards Humanoid Robot Autonomy: A Dynamic Architecture Integrating Continuous thought Machines (CTM) and Model Context Protocol (MCP)
Abstract:
To address the gaps between the static pre-set "thinking-planning-action" of humanoid robots in unfamiliar scenarios and the highly programmed "call tool-return result" due to the lack of autonomous coding capabilities, this work designs a dynamic architecture connecting continuous thought machines (CTM) and model context protocol (MCP). It proposes a theoretical parallel solution through tick-slab and uses rank compression to achieve parameter suppression to provide a solution for achieving autonomous actions due to autonomous coding. The researcher used a simulation-based experiment using OpenAI's o4-mini-high as a tool to build the experimental environment, and introduced the extended SayCan dataset to conduct nine epochs of experiments. The experimental results show that the CTM-MCP architecture is feasible and effective through the data results of seven metrics: task success rate (TSR), execution success rate (ESR), average episode length (AEL), ROSCOE, REVEAL, proficiency self-assessment (PSA), task effectiveness (TE). In practice, it provides a reference experience for exploring the autonomous dynamic coding of humanoid robots based on continuous thinking to achieve human-like autonomous actions.
中文摘要:本研究提出了一种动态的CTM-MCP架构,通过理论模型和实验验证,使人形机器人能够实现自主编码和类人自主行动,并提升了多项性能指标。
English Summary: This study introduces a dynamic CTM-MCP architecture intended to let humanoid robots achieve autonomous coding and human-like autonomous actions, and reports simulation-based experiments, evaluated on seven metrics, indicating that the architecture is feasible and effective.

Authors:Qiang Hu, Qimei Wang, Jia Chen, Xuantao Ji, Mei Liu, Qiang Li, Zhiwei Wang
Title: Holistic White-light Polyp Classification via Alignment-free Dense Distillation of Auxiliary Optical Chromoendoscopy
Abstract:
White Light Imaging (WLI) and Narrow Band Imaging (NBI) are the two main colonoscopic modalities for polyp classification. While NBI, as optical chromoendoscopy, offers valuable vascular details, WLI remains the most common and often the only available modality in resource-limited settings. However, WLI-based methods typically underperform, limiting their clinical applicability. Existing approaches transfer knowledge from NBI to WLI through global feature alignment but often rely on cropped lesion regions, which are susceptible to detection errors and neglect contextual and subtle diagnostic cues. To address this, this paper proposes a novel holistic classification framework that leverages full-image diagnosis without requiring polyp localization. The key innovation lies in the Alignment-free Dense Distillation (ADD) module, which enables fine-grained cross-domain knowledge distillation regardless of misalignment between WLI and NBI images. Without resorting to explicit image alignment, ADD learns pixel-wise cross-domain affinities to establish correspondences between feature maps, guiding the distillation along the most relevant pixel connections. To further enhance distillation reliability, ADD incorporates Class Activation Mapping (CAM) to filter cross-domain affinities, ensuring the distillation path connects only those semantically consistent regions with equal contributions to polyp diagnosis. Extensive results on public and in-house datasets show that our method achieves state-of-the-art performance, relatively outperforming the other approaches by at least 2.5% and 16.2% in AUC, respectively. Code is available at: https://github.com/Huster-Hq/ADD.
Chinese: 本文提出了一种新颖的整体分类框架,通过无对齐密集蒸馏模块实现从窄带成像到白光成像的细粒度知识迁移,无需息肉定位即可达到最先进的息肉分类性能。
English: This paper introduces a novel holistic classification framework with an Alignment-free Dense Distillation module that enables fine-grained knowledge transfer from Narrow Band Imaging to White Light Imaging for polyp classification, achieving state-of-the-art performance without requiring polyp localization.

Authors:Zirui Li, Siwei Wu, Xingyu Wang, Yi Zhou, Yizhi Li, Chenghua Lin
Title: DocMMIR: A Framework for Document Multi-modal Information Retrieval
Abstract:
The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal benchmark, comprising 450K samples, which systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our tasks, with only CLIP demonstrating reasonable zero-shot performance. Furthermore, we conduct a systematic investigation of training strategies, including cross-modal fusion methods and loss functions, and develop a tailored approach to train CLIP on our benchmark. This results in a +31% improvement in MRR@10 compared to the zero-shot baseline. All our data and code are released in https://github.com/J1mL1/DocMMIR.
中文摘要:本研究提出了DocMMIR多模态文档检索框架和跨领域大规模基准数据集,通过针对性训练策略解决了现有方法的不足,相比零样本基线实现了31%的性能提升。
English Summary: The study introduces DocMMIR, a novel multi-modal document retrieval framework with a large-scale cross-domain benchmark of 450K samples, and shows that a tailored CLIP training strategy yields a 31% improvement in MRR@10 over the zero-shot baseline.

Authors:Junnan Liu, Linhao Luo, Thuy-Trang Vu, Gholamreza Haffari
Title: SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking
Abstract:
Recent advances in large language models (LLMs) demonstrate their impressive reasoning capabilities. However, the reasoning confined to internal parametric space limits LLMs' access to real-time information and understanding of the physical world. To overcome this constraint, we introduce SituatedThinker, a novel framework that enables LLMs to ground their reasoning in real-world contexts through situated thinking, which adaptively combines both internal knowledge and external information with predefined interfaces. By utilizing reinforcement learning, SituatedThinker incentivizes deliberate reasoning with the real world to acquire information and feedback, allowing LLMs to surpass their knowledge boundaries and enhance reasoning. Experimental results demonstrate significant performance improvements on multi-hop question-answering and mathematical reasoning benchmarks. Furthermore, SituatedThinker demonstrates strong performance on unseen tasks, such as KBQA, TableQA, and text-based games, showcasing the generalizable real-world grounded reasoning capability. Our codes are available at https://github.com/jnanliu/SituatedThinker.
中文摘要:SituatedThinker框架通过强化学习将现实世界情境融入大型语言模型的推理过程,显著提升了多项基准测试和未知任务的表现。
English Summary: The SituatedThinker framework enhances large language models' reasoning by integrating real-world contexts through reinforcement learning, significantly improving performance on various benchmarks and unseen tasks.

Authors:Jimeng Shi, Sizhe Zhou, Bowen Jin, Wei Hu, Runchu Tian, Shaowen Wang, Giri Narasimhan, Jiawei Han
Title: Hypercube-Based Retrieval-Augmented Generation for Scientific Question-Answering
Abstract:
Large language models (LLMs) often need to incorporate external knowledge to solve theme-specific problems. Retrieval-augmented generation (RAG) has shown great promise, empowering LLMs to generate more qualified responses with retrieved external data and knowledge. However, most RAG methods retrieve relevant documents based on either sparse or dense retrieval methods or their combinations, which overlooks the essential, multi-dimensional, and structured semantic information present in documents. This structured information plays a critical role in finding concise yet highly relevant information for domain knowledge-intensive tasks, such as scientific question-answering (QA). In this work, we introduce a multi-dimensional (cube) structure, Hypercube, which can index and allocate documents in a pre-defined multi-dimensional space. Built on the hypercube, we further propose Hypercube-RAG, a novel RAG framework for precise and efficient retrieval. Given a query, Hypercube-RAG first decomposes it based on its entities, phrases, and topics along with pre-defined hypercube dimensions, and then retrieves relevant documents from cubes by aligning these decomposed components with corresponding dimensions. Experiments on three datasets across different domains demonstrate that our method improves response accuracy by 3.7% and retrieval accuracy by 5.3% over the strongest RAG baseline. It also speeds up retrieval by one to two orders of magnitude compared with graph-based RAG. Notably, our Hypercube-RAG inherently offers explainability by revealing those underlying dimensions used for retrieval. The code and data are available at https://github.com/JimengShi/Hypercube-RAG.
中文: Hypercube-RAG通过构建多维索引结构,将查询分解并与预设语义维度对齐进行文档检索,在多个领域数据集上将响应准确率提升3.7%、检索准确率提升5.3%,检索速度比基于图的RAG快一到两个数量级。
English: Hypercube-RAG introduces a multi-dimensional indexing structure for retrieval-augmented generation that decomposes queries and aligns them with pre-defined semantic dimensions, improving response accuracy by 3.7% and retrieval accuracy by 5.3% over the strongest RAG baseline while retrieving one to two orders of magnitude faster than graph-based RAG.

Authors:Vivek Gopalakrishnan, Neel Dey, Polina Golland
Title: PolyPose: Localizing Deformable Anatomy in 3D from Sparse 2D X-ray Images using Polyrigid Transforms
Abstract:
Determining the 3D pose of a patient from a limited set of 2D X-ray images is a critical task in interventional settings. While preoperative volumetric imaging (e.g., CT and MRI) provides precise 3D localization and visualization of anatomical targets, these modalities cannot be acquired during procedures, where fast 2D imaging (X-ray) is used instead. To integrate volumetric guidance into intraoperative procedures, we present PolyPose, a simple and robust method for deformable 2D/3D registration. PolyPose parameterizes complex 3D deformation fields as a composition of rigid transforms, leveraging the biological constraint that individual bones do not bend in typical motion. Unlike existing methods that either assume no inter-joint movement or fail outright in this under-determined setting, our polyrigid formulation enforces anatomically plausible priors that respect the piecewise rigid nature of human movement. This approach eliminates the need for expensive deformation regularizers that require patient- and procedure-specific hyperparameter optimization. Across extensive experiments on diverse datasets from orthopedic surgery and radiotherapy, we show that this strong inductive bias enables PolyPose to successfully align the patient's preoperative volume to as few as two X-ray images, thereby providing crucial 3D guidance in challenging sparse-view and limited-angle settings where current registration methods fail.
中文摘要:PolyPose 提出了一种通过将三维变形建模为刚性变换组合的鲁棒二维/三维配准方法,仅需两张X光片即可实现精确的患者姿态估计,无需复杂正则化。
English Summary: PolyPose introduces a robust method for 2D/3D registration by modeling 3D deformations as compositions of rigid transforms, enabling accurate patient pose estimation from just two X-rays without complex regularization.

Authors:Vivek Gopalakrishnan, Neel Dey, Polina Golland
Title: PolyPose: Deformable 2D/3D Registration via Polyrigid Transformations
Abstract:
Determining the 3D pose of a patient from a limited set of 2D X-ray images is a critical task in interventional settings. While preoperative volumetric imaging (e.g., CT and MRI) provides precise 3D localization and visualization of anatomical targets, these modalities cannot be acquired during procedures, where fast 2D imaging (X-ray) is used instead. To integrate volumetric guidance into intraoperative procedures, we present PolyPose, a simple and robust method for deformable 2D/3D registration. PolyPose parameterizes complex 3D deformation fields as a composition of rigid transforms, leveraging the biological constraint that individual bones do not bend in typical motion. Unlike existing methods that either assume no inter-joint movement or fail outright in this under-determined setting, our polyrigid formulation enforces anatomically plausible priors that respect the piecewise-rigid nature of human movement. This approach eliminates the need for expensive deformation regularizers that require patient- and procedure-specific hyperparameter optimization. Across extensive experiments on diverse datasets from orthopedic surgery and radiotherapy, we show that this strong inductive bias enables PolyPose to successfully align the patient's preoperative volume to as few as two X-rays, thereby providing crucial 3D guidance in challenging sparse-view and limited-angle settings where current registration methods fail. Additional visualizations, tutorials, and code are available at https://polypose.csail.mit.edu.
中文摘要:PolyPose 提出了一种通过将三维变形建模为刚性变换组合的鲁棒二维/三维配准方法,仅需两张X光片即可实现精确的患者姿态估计,无需复杂正则化。
English Summary: PolyPose introduces a robust method for 2D/3D registration by modeling 3D deformations as compositions of rigid transforms, enabling accurate patient pose estimation from just two X-rays without complex regularization.

Authors:Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt
Title: VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
Abstract:
Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chain of thoughts with tools.
Chinese Summary: VTool-R1是首个通过整合文本与基于Python工具的视觉推理步骤来训练视觉语言模型生成多模态思维链的框架,无需过程监督即可显著提升视觉任务中的推理准确性。
English Summary: VTool-R1 is the first framework that trains vision-language models to produce multimodal chains of thought by interleaving text with visual reasoning steps generated by Python-based tools, improving reasoning performance on structured visual question answering over charts and tables without process-based supervision.

Authors:Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen, Steffen Eger
Title: LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models
Abstract:
Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations such as failures in reasoning, hallucinations, and limited multilingual capability. While prior reviews have addressed these issues, they often focus on individual limitations or consider them within the broader context of evaluating overall model performance. This survey addresses the gap by presenting a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to 2025, using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we extract 14,648 relevant limitation papers using keyword filtering and LLM-based classification, validated against expert labels. Using topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM), we identify between 7 and 15 prominent types of limitations discussed in recent LLM research across the ACL and arXiv datasets. We find that LLM-related research increases nearly sixfold in ACL and nearly fifteenfold in arXiv between 2022 and 2025, while LLLMs research grows even faster, by a factor of over 12 in ACL and nearly 28 in arXiv. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics like security risks, alignment, hallucinations, knowledge editing), and multimodality between 2022 and 2025. We offer a quantitative view of trends in LLM limitations research and release a dataset of annotated abstracts and a validated methodology, available at: https://github.com/a-kostikova/LLLMs-Survey.
中文摘要:本综述通过数据驱动方法分析了2022至2025年间大语言模型的局限性,发现推理能力是最受关注的研究短板,同时揭示了相关研究的加速增长态势及不同学术数据集中的研究主题演变。
English Summary: This survey provides a data-driven analysis of research on large language model limitations from 2022 to 2025, identifying reasoning as the most studied limitation while documenting accelerating research growth and shifting topic distributions across the ACL and arXiv datasets.

Authors:Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen
Title: CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models
Abstract:
Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: Do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we found that key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework, which leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieved 5x FLOPs reduction and a 10x overall speedup. Code is released at https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main.
中文: 视觉语言模型存在高推理成本问题,本研究揭示了令牌与神经元稀疏性之间的相互增强关系,提出的CoreMatching框架通过协同优化显著提升了效率,在多项任务和硬件上实现了计算量削减和加速效果。
English: Vision-Language Models face high inference costs, and this study reveals a mutual reinforcement between token and neuron sparsity, leading to the CoreMatching framework that significantly boosts efficiency with demonstrated speedups and computational reductions.

Authors:Chenglong Ma, Yuanfeng Ji, Jin Ye, Zilong Li, Chenhui Wang, Junzhi Ning, Wei Li, Lihao Liu, Qiushan Guo, Tianbin Li, Junjun He, Hongming Shan
Title: MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation
Abstract:
Advanced autoregressive models have reshaped multimodal AI. However, their transformative potential in medical imaging remains largely untapped due to the absence of a unified visual tokenizer -- one capable of capturing fine-grained visual structures for faithful image reconstruction and realistic image synthesis, as well as rich semantics for accurate diagnosis and image interpretation. To this end, we present MedITok, the first unified tokenizer tailored for medical images, encoding both low-level structural details and high-level clinical semantics within a unified latent space. To balance these competing objectives, we introduce a novel two-stage training framework: a visual representation alignment stage that cold-starts the tokenizer reconstruction learning with a visual semantic constraint, followed by a textual semantic representation alignment stage that infuses detailed clinical semantics into the latent space. Trained on the meticulously collected large-scale dataset with over 30 million medical images and 2 million image-caption pairs, MedITok achieves state-of-the-art performance on more than 30 datasets across 9 imaging modalities and 4 different tasks. By providing a unified token space for autoregressive modeling, MedITok supports a wide range of tasks in clinical diagnostics and generative healthcare applications. Model and code will be made publicly available at: https://github.com/Masaaki-75/meditok.
Chinese: MedITok是首个专为医学影像设计的统一标记器,通过两阶段训练框架融合了视觉细节与临床语义,在多种诊断和生成任务中实现了顶尖性能。
English: MedITok introduces the first unified tokenizer for medical imaging that captures both fine-grained visual details and clinical semantics, enabling state-of-the-art performance across diverse diagnostic and generative tasks through a novel two-stage training framework.

Authors:Jingwei Wu, Zhewei Huang, Chang Liu
Title: Advancing Video Self-Supervised Learning via Image Foundation Models
Abstract:
In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose an advancing video self-supervised learning (AdViSe) approach, aimed at significantly reducing the training overhead of video representation models using pre-trained IFMs. Specifically, we first introduce temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning approach, playback rate perception, to train temporal modules while freezing the IFM components. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by 3.4x and GPU memory usage by 8.2x. This study offers fresh insights into low-cost video self-supervised learning based on pre-trained IFMs. Code is available at https://github.com/JingwWu/advise-video-ssl.
Chinese: 提出的AdViSe方法利用预训练图像基础模型并添加时序模块,在UCF101上取得与最先进方法相当的视频自监督学习性能,同时将训练时间减少3.4倍、GPU内存使用降低8.2倍。
English: The proposed AdViSe method adds temporal modules to frozen pre-trained image foundation models and achieves video self-supervised learning performance comparable to state-of-the-art methods on UCF101 while reducing training time by 3.4x and GPU memory usage by 8.2x.

Authors:Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
Title: When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas
Abstract:
Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents, making ethical alignment a key AI safety concern. While prior work has examined both LLMs' moral judgment and strategic behavior in social dilemmas, there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff-maximizing strategies. Our results show substantial variation across models in both their general tendency to act morally and the consistency of their behavior across game types, the specific moral framing, and situational factors such as opponent behavior and survival risks. Crucially, no model exhibits consistently moral behavior in MoralSim, highlighting the need for caution when deploying LLMs in agentic roles where the agent's "self-interest" may conflict with ethical expectations. Our code is available at https://github.com/sbackmann/moralsim.
中文: 该研究通过MoralSim评估大语言模型在道德规范与激励冲突的社会困境中的行为,发现模型存在显著行为差异且无一能保持一贯道德表现,突显了其在代理角色部署中的风险。
English: The study introduces MoralSim to assess how large language models navigate social dilemmas where ethical norms conflict with incentives, revealing substantial variation across models and finding that no model behaves consistently morally, which underscores the risks of agentic deployments.

Authors:Tyler Ward, Aaron Moseley, Abdullah-Al-Zubaer Imran
Title: Domain and Task-Focused Example Selection for Data-Efficient Contrastive Medical Image Segmentation
Abstract:
Segmentation is one of the most important tasks in the medical imaging pipeline as it influences a number of image-based decisions. To be effective, fully supervised segmentation approaches require large amounts of manually annotated training data. However, the pixel-level annotation process is expensive, time-consuming, and error-prone, hindering progress and making it challenging to perform effective segmentations. Therefore, models must learn efficiently from limited labeled data. Self-supervised learning (SSL), particularly contrastive learning via pre-training on unlabeled data and fine-tuning on limited annotations, can facilitate such limited labeled image segmentation. To this end, we propose a novel self-supervised contrastive learning framework for medical image segmentation, leveraging inherent relationships of different images, dubbed PolyCL. Without requiring any pixel-level annotations or unreasonable data augmentations, our PolyCL learns and transfers context-aware discriminant features useful for segmentation from an innovative surrogate, in a task-related manner. Additionally, we integrate the Segment Anything Model (SAM) into our framework in two novel ways: as a post-processing refinement module that improves the accuracy of predicted masks using bounding box prompts derived from coarse outputs, and as a propagation mechanism via SAM 2 that generates volumetric segmentations from a single annotated 2D slice. Experimental evaluations on three public computed tomography (CT) datasets demonstrate that PolyCL outperforms fully-supervised and self-supervised baselines in both low-data and cross-domain scenarios. Our code is available at https://github.com/tbwa233/PolyCL.
中文: 提出的PolyCL框架通过自监督对比学习在有限标注数据下实现有效的医学图像分割,结合SAM进行掩码优化和体积传播,并在CT数据集上展现出优越性能。
English: The proposed PolyCL framework uses self-supervised contrastive learning to achieve effective medical image segmentation with limited labeled data, integrating SAM for mask refinement and volumetric propagation while demonstrating superior performance on CT datasets.

Authors:Yunhai Hu, Tianhua Xia, Zining Liu, Rahul Raman, Xingyu Liu, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang
Title: DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding
Abstract:
Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel speculative decoding framework tailored for VLMs that combines three key innovations: (1) a cross-attention-based mechanism to inject intermediate features from the target model into the draft model for improved alignment, (2) adaptive intermediate feature selection based on attention entropy to guide efficient draft model training, and (3) visual token compression to reduce draft model latency. DREAM enables efficient, accurate, and parallel multimodal decoding with significant throughput improvement. Experiments across a diverse set of recent popular VLMs, including LLaVA, Pixtral, SmolVLM, and Gemma3, show that DREAM achieves up to 3.6x speedup over conventional decoding and significantly outperforms prior SD baselines in both inference throughput and speculative draft acceptance length across a broad range of multimodal benchmarks. The code is publicly available at: https://github.com/SAI-Lab-NYU/DREAM.git
中文: DREAM为视觉语言模型提出了一种创新的推测解码框架,通过交叉注意力对齐、自适应特征选择和视觉标记压缩技术,将推理吞吐量最高提升3.6倍。
English: DREAM introduces a novel speculative decoding framework for vision-language models that enhances inference throughput up to 3.6x through cross-attention alignment, adaptive feature selection, and visual token compression.
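For readers unfamiliar with speculative decoding, the following is a minimal sketch of the generic draft-then-verify loop in its greedy-matching form. It is not DREAM's method: the toy next-token functions are hypothetical stand-ins for a draft and a target model, and DREAM's cross-attention feature injection, entropy-based feature selection, and visual token compression are not shown. Real systems also verify all drafted positions in one batched target forward pass and often use rejection sampling rather than exact matching.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    num_draft: int,
) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round (greedy variant)."""
    # 1. The cheap draft model proposes num_draft tokens autoregressively.
    proposed: List[int] = []
    context = list(prefix)
    for _ in range(num_draft):
        token = draft_next(context)
        proposed.append(token)
        context.append(token)

    # 2. The target model checks each proposal (sequential here for clarity).
    accepted: List[int] = []
    context = list(prefix)
    for token in proposed:
        expected = target_next(context)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)

    # 3. All proposals matched, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


def toy_target(context: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after token 3."""
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


partial_accept = speculative_step([1], toy_draft, toy_target, num_draft=4)
full_accept = speculative_step([5], toy_draft, toy_target, num_draft=2)
```

In this greedy variant the accepted tokens always equal what the target model would have produced on its own, so the speedup comes purely from how many draft tokens are accepted per round (the acceptance length the abstract refers to).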

Authors:Jiayi Xin, Sukwon Yun, Jie Peng, Inyoung Choi, Jenna L. Ballard, Tianlong Chen, Qi Long
Title: I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts
Abstract:
Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, vanilla fusion methods are limited by (1) inability to account for heterogeneous interactions between modalities and (2) lack of interpretability in uncovering the multimodal interactions inherent in the data. To this end, we propose I2MoE (Interpretable Multimodal Interaction-aware Mixture of Experts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation on a local and global level. First, I2MoE utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, I2MoE deploys a reweighting model that assigns importance scores for the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation of medical and general multimodal datasets shows that I2MoE is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at https://github.com/Raina-Xin/I2MoE.
中文:提出的I2MoE框架通过专门设计的交互专家显式建模多样化多模态交互,并在局部和全局层面提供可解释性,从而提升多模态融合效果,在多个数据集上验证了其性能优势。
English: The proposed I2MoE framework enhances multimodal fusion by explicitly modeling diverse interactions through specialized experts and providing interpretability at both local and global levels, demonstrating improved performance across various datasets.

Authors:Yaoyang Liu, Junlin Li, Yinjun Wu, Zhen Chen
Title: POQD: Performance-Oriented Query Decomposer for Multi-vector retrieval
Abstract:
Although Multi-Vector Retrieval (MVR) has achieved the state of the art on many information retrieval (IR) tasks, its performance highly depends on how to decompose queries into smaller pieces, say phrases or tokens. However, optimizing query decomposition for MVR performance is not end-to-end differentiable. Even worse, jointly solving this problem and training the downstream retrieval-based systems, say RAG systems, could be highly inefficient. To overcome these challenges, we propose Performance-Oriented Query Decomposer (POQD), a novel query decomposition framework for MVR. POQD leverages one LLM for query decomposition and searches for the optimal prompt with an LLM-based optimizer. We further propose an end-to-end training algorithm to alternately optimize the prompt for query decomposition and the downstream models. This algorithm can achieve superior MVR performance at a reasonable training cost, as our theoretical analysis suggests. POQD can be integrated seamlessly into arbitrary retrieval-based systems such as Retrieval-Augmented Generation (RAG) systems. Extensive empirical studies on representative RAG-based QA tasks show that POQD outperforms existing query decomposition strategies in both retrieval performance and end-to-end QA accuracy. POQD is available at https://github.com/PKU-SDS-lab/POQD-ICML25.
中文: POQD提出了一种新颖的端到端可训练框架,利用大语言模型生成的提示优化多向量检索中的查询分解,从而高效提升检索和问答准确性。
English: POQD introduces a novel, end-to-end trainable framework that optimizes query decomposition for multi-vector retrieval using LLM-generated prompts, enhancing retrieval and QA accuracy efficiently.

Authors:Pradyumna Shyama Prasad, Minh Nhat Nguyen
Title: When Two LLMs Debate, Both Think They'll Win
Abstract:
Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLMs are now increasingly deployed without careful review in assistant and agentic roles. Code for our experiments is available at https://github.com/pradyuprasad/llms_overconfidence
中文摘要:本研究表明大型语言模型在对抗性辩论中表现出系统性过度自信且无法恰当调整置信度,对其在动态多轮任务中的可靠性提出了重要质疑。
English Summary: This study reveals that large language models exhibit systematic overconfidence and fail to properly adjust their confidence in adversarial debates, raising concerns about their reliability in dynamic, multi-turn tasks.
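
The zero-sum argument in the abstract (both sides cannot each have a 75% or higher chance of winning) can be checked mechanically. The sketch below is an illustration under stated assumptions, not the authors' analysis code: the function names and the example confidence pairs are hypothetical, and confidences are taken to be on the 0-100 scale mentioned in the abstract.

```python
from typing import List, Tuple

# Each debate is a pair of stated win probabilities (0-100), one per side.
Debate = Tuple[float, float]


def mutual_overestimation_rate(debates: List[Debate],
                               threshold: float = 75.0) -> float:
    """Fraction of debates in which BOTH sides claim at least `threshold`."""
    if not debates:
        return 0.0
    flagged = sum(1 for a, b in debates if a >= threshold and b >= threshold)
    return flagged / len(debates)


def mean_excess_confidence(debates: List[Debate]) -> float:
    """Average amount by which the two stated probabilities exceed 100 in total.

    In a zero-sum debate, calibrated win probabilities sum to 100,
    so any positive value indicates joint overconfidence.
    """
    return sum(a + b - 100.0 for a, b in debates) / len(debates)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    example = [(80.0, 75.0), (60.0, 70.0), (90.0, 40.0), (75.0, 85.0)]
    print(mutual_overestimation_rate(example))
    print(mean_excess_confidence(example))
```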

Authors:Kefan Wang, Hao Wang, Wei Guo, Yong Liu, Jianghao Lin, Defu Lian, Enhong Chen
Title: DLF: Enhancing Explicit-Implicit Interaction via Dynamic Low-Order-Aware Fusion for CTR Prediction
Abstract:
Click-through rate (CTR) prediction is a critical task in online advertising and recommender systems, relying on effective modeling of feature interactions. Explicit interactions capture predefined relationships, such as inner products, but often suffer from data sparsity, while implicit interactions excel at learning complex patterns through non-linear transformations but lack inductive biases for efficient low-order modeling. Existing two-stream architectures integrate these paradigms but face challenges such as limited information sharing, gradient imbalance, and difficulty preserving low-order signals in sparse CTR data. We propose a novel framework, Dynamic Low-Order-Aware Fusion (DLF), which addresses these limitations through two key components: a Residual-Aware Low-Order Interaction Network (RLI) and a Network-Aware Attention Fusion Module (NAF). RLI explicitly preserves low-order signals while mitigating redundancy from residual connections, and NAF dynamically integrates explicit and implicit representations at each layer, enhancing information sharing and alleviating gradient imbalance. Together, these innovations balance low-order and high-order interactions, improving model expressiveness. Extensive experiments on public datasets demonstrate that DLF achieves state-of-the-art performance in CTR prediction, addressing key limitations of existing models. The implementation is publicly available at https://github.com/USTC-StarTeam/DLF.
中文摘要:提出的动态低阶感知融合(DLF)框架通过两个创新组件有效平衡了低阶与高阶特征交互,解决了现有双流架构的关键局限,在CTR预测任务中实现了最先进的性能。
English Summary: The proposed Dynamic Low-Order-Aware Fusion (DLF) framework effectively balances low-order and high-order feature interactions through two novel components, achieving state-of-the-art CTR prediction performance by addressing limitations in existing two-stream architectures.

Authors:Zhuo Liu, Moxin Li, Xun Deng, Qifan Wang, Fuli Feng
Title: Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge
Abstract:
LLM-as-a-Judge employs large language models (LLMs), such as GPT-4, to evaluate the quality of LLM-generated responses, gaining popularity for its cost-effectiveness and strong alignment with human evaluations. However, training proxy judge models using evaluation data generated by powerful teacher models introduces a critical yet previously overlooked issue: teacher preference bias, where the proxy judge model learns a biased preference for responses from the teacher model. To tackle this problem, we propose a novel setting that incorporates an additional assistant model, which is not biased toward the teacher model's responses, to complement the training data. Building on this setup, we introduce AGDe-Judge, a three-stage framework designed to debias both the labels and the feedback in the training data. Extensive experiments demonstrate that AGDe-Judge effectively reduces teacher preference bias while maintaining strong performance across six evaluation benchmarks. Code is available at https://github.com/Liuz233/AGDe-Judge.
中文: LLM-as-a-Judge 利用大语言模型评估生成内容时存在教师偏好偏差,AGDe-Judge 通过引入无偏助理模型和三阶段去偏框架,在六个基准测试中有效降低偏差并保持性能。
English: LLM-as-a-Judge uses models like GPT-4 to assess LLM outputs efficiently but faces teacher preference bias, which AGDe-Judge addresses by incorporating an unbiased assistant model and a three-stage debiasing framework to maintain performance across benchmarks.

Authors:Shengdong Han, Shangdong Yang, Xin Zhang, Yuxuan Li, Xiang Li, Jian Yang, Ming-Ming Cheng, Yimian Dai
Title: DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing
Abstract:
Resolving closely-spaced small targets in dense clusters presents a significant challenge in infrared imaging, as the overlapping signals hinder precise determination of their quantity, sub-pixel positions, and radiation intensities. While deep learning has advanced the field of infrared small target detection, its application to closely-spaced infrared small targets has not yet been explored. This gap exists primarily due to the complexity of separating superimposed characteristics and the lack of an open-source infrastructure. In this work, we propose the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which reconceptualizes traditional sparse reconstruction within a dynamic framework. DISTA-Net adaptively generates convolution weights and thresholding parameters to tailor the reconstruction process in real time. To the best of our knowledge, DISTA-Net is the first deep learning model designed specifically for the unmixing of closely-spaced infrared small targets, achieving superior sub-pixel detection accuracy. Moreover, we have established the first open-source ecosystem to foster further research in this field. This ecosystem comprises three key components: (1) CSIST-100K, a publicly available benchmark dataset; (2) CSO-mAP, a custom evaluation metric for sub-pixel detection; and (3) GrokCSO, an open-source toolkit featuring DISTA-Net and other models. Our code and dataset are available at https://github.com/GrokCV/GrokCSO.
中文: 本研究提出了首个专为解混密集红外小目标设计的深度学习模型DISTA-Net,通过动态生成参数实现实时重构,并建立了包含基准数据集、评估指标和工具包的开源生态系统以推动该领域研究。
English: This study introduces DISTA-Net, the first deep learning model tailored for resolving closely-spaced infrared small targets by dynamically generating parameters for real-time reconstruction, and establishes an open-source ecosystem including a benchmark dataset, evaluation metric, and toolkit to advance research in this field.

Authors:Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
Title: Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Abstract:
The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.
中文: AI效率研究的重点正从模型为中心的扩展转向数据为中心的令牌压缩,以解决长序列在大语言模型中引起的计算瓶颈问题。
English: The focus of AI efficiency research is shifting from model-centric scaling to data-centric token compression to overcome computational bottlenecks caused by long sequences in large language models.

Authors:Wenyang Luo, Wayne Xin Zhao, Jing Sha, Shijin Wang, Ji-Rong Wen
Title: MMATH: A Multilingual Benchmark for Mathematical Reasoning
Abstract:
The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data can be found at https://github.com/RUCAIBox/MMATH.
中文总结:MMATH基准测试揭示了先进模型在多语言复杂推理中存在显著性能差距和语言偏离问题,而采用英语推理结合目标语言回答的策略可有效提升表现并保持语言一致性。
English Summary: The MMATH benchmark reveals significant performance gaps and off-target language issues in multilingual complex reasoning by advanced models, with strategies like reasoning in English and answering in target languages proving effective for improvement.

Authors:Xiaoyang Liu, Bolin Qiu, Jiezhang Cao, Zheng Chen, Yulun Zhang, Xiaokang Yang
Title: Freqformer: Image-Demoiréing Transformer via Efficient Frequency Decomposition
Abstract:
Image demoiréing remains a challenging task due to the complex interplay between texture corruption and color distortions caused by moiré patterns. Existing methods, especially those relying on direct image-to-image restoration, often fail to disentangle these intertwined artifacts effectively. While wavelet-based frequency-aware approaches offer a promising direction, their potential remains underexplored. In this paper, we present Freqformer, a Transformer-based framework specifically designed for image demoiréing through targeted frequency separation. Our method performs an effective frequency decomposition that explicitly splits moiré patterns into high-frequency spatially-localized textures and low-frequency scale-robust color distortions, which are then handled by a dual-branch architecture tailored to their distinct characteristics. We further propose a learnable Frequency Composition Transform (FCT) module to adaptively fuse the frequency-specific outputs, enabling consistent and high-fidelity reconstruction. To better aggregate the spatial dependencies and the inter-channel complementary information, we introduce a Spatial-Aware Channel Attention (SA-CA) module that refines moiré-sensitive regions without incurring high computational cost. Extensive experiments on various demoiréing benchmarks demonstrate that Freqformer achieves state-of-the-art performance with a compact model size. The code is publicly available at https://github.com/xyLiu339/Freqformer.
Chinese: Freqformer提出了一种基于Transformer的图像去摩尔纹框架,通过将摩尔纹有效分解为高频纹理和低频颜色失真,并采用双分支结构和专用模块,以紧凑的模型实现了最先进的性能。
English: Freqformer introduces a Transformer-based framework for image demoiréing by effectively separating moiré patterns into high-frequency textures and low-frequency color distortions, utilizing a dual-branch architecture and specialized modules to achieve state-of-the-art performance with a compact model.

Authors:Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry
Title: FP4 All the Way: Fully Quantized Training of LLMs
Abstract:
We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way .
中文摘要:本研究首次实现了使用4位浮点精度对权重、激活值和梯度进行全量化训练的大型语言模型,在保持与标准BF16基准相当性能的同时,证明了该方法在大规模训练中的实用高效性。
English Summary: This study introduces the first fully quantized training of large language models using 4-bit floating-point precision for weights, activations, and gradients, achieving performance comparable to standard BF16 baselines while demonstrating practical efficiency for large-scale training.
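
The block-scaled FP4 scheme and the two rounding modes named in the abstract can be illustrated with a small pure-Python sketch. It rests on several assumptions and is not the authors' implementation: the E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) is used directly, the shared block scale is kept as an ordinary Python float rather than being encoded in E4M3, ties in round-to-nearest go to the smaller magnitude, and all function names are hypothetical.

```python
import random
from bisect import bisect_left
from typing import Callable, List, Tuple

# Representable magnitudes of an E2M1 (FP4) value; the sign is a separate bit.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = E2M1_GRID[-1]


def round_nearest(x: float) -> float:
    """Round to the closest grid point (ties go to the smaller magnitude here)."""
    mag = min(abs(x), E2M1_MAX)
    best = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return best if x >= 0 else -best


def round_stochastic(x: float, rng: random.Random) -> float:
    """Round up or down at random so that the expected result equals x."""
    mag = min(abs(x), E2M1_MAX)
    hi_idx = bisect_left(E2M1_GRID, mag)
    if E2M1_GRID[hi_idx] == mag:
        q = mag  # already representable
    else:
        lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]
        p_up = (mag - lo) / (hi - lo)  # probability of rounding up
        q = hi if rng.random() < p_up else lo
    return q if x >= 0 else -q


def quantize_block(block: List[float],
                   rounder: Callable[[float], float]) -> Tuple[List[float], float]:
    """Quantize one block (16 values in NVFP4) that shares a single scale."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return [0.0] * len(block), 1.0
    scale = max_abs / E2M1_MAX  # simplification: not rounded to E4M3
    return [rounder(v / scale) for v in block], scale


def dequantize_block(codes: List[float], scale: float) -> List[float]:
    """Map FP4 codes back to real values using the shared block scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    rng = random.Random(0)
    block = [rng.uniform(-1.0, 1.0) for _ in range(16)]
    fwd_codes, fwd_scale = quantize_block(block, round_nearest)
    bwd_codes, bwd_scale = quantize_block(block, lambda v: round_stochastic(v, rng))
    print(dequantize_block(fwd_codes, fwd_scale))
    print(dequantize_block(bwd_codes, bwd_scale))
```

Round-to-nearest gives the smallest error per value, while stochastic rounding is unbiased on average, which is consistent with the abstract's choice of the former for the forward pass and the latter for the backward and update passes.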

Authors:Zheng Chu, Huiming Fan, Jingchang Chen, Qianyu Wang, Mingda Yang, Jiafeng Liang, Zhongjie Wang, Hao Li, Guo Tang, Ming Liu, Bing Qin
Title: Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering
Abstract:
Although large language models (LLMs) have demonstrated remarkable reasoning capabilities, they still face challenges in knowledge-intensive multi-hop reasoning. Recent work explores iterative retrieval to address complex problems. However, the lack of intermediate guidance often results in inaccurate retrieval and flawed intermediate reasoning, leading to incorrect answers. To address these issues, we propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process. Specifically, through end-to-end training, we enable the model to iteratively address complex problems via question decomposition. Additionally, the model is able to self-evaluate its intermediate reasoning steps. During iterative reasoning, the model engages in branching exploration and employs self-evaluation to guide the selection of promising reasoning trajectories. Extensive experiments on three multi-hop reasoning datasets demonstrate the effectiveness of our proposed method, surpassing the previous SOTA by 8.6%. Furthermore, our thorough analysis offers insights for future research. Our code, data, and models are available on GitHub: https://github.com/zchuz/SiGIR-MHQA.
Chinese: 提出的SiGIR方法通过自我批判反馈指导迭代问题分解和路径选择,在多跳推理任务中实现了比先前最优方法8.6%的性能提升。
English: The proposed SiGIR method enhances multi-hop reasoning by using self-critique feedback to guide iterative question decomposition and trajectory selection, achieving an 8.6% improvement over previous state-of-the-art approaches.

Authors:Chuming Shen, Wei Wei, Xiaoye Qu, Yu Cheng
Title: SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards
Abstract:
DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). Recently, in the multimodal domain, works have begun to directly apply RL to generate R1-like free-form reasoning for Visual Question Answering (VQA) tasks. However, multimodal tasks differ intrinsically from textual tasks in that they rely heavily on understanding the input image to solve the problem. Therefore, such free-form reasoning faces two critical limitations in the VQA task: (1) Extended reasoning chains diffuse visual focus away from task-critical regions, degrading answer accuracy. (2) Unverifiable intermediate steps amplify policy-gradient variance and computational overhead. To address these issues, in this paper, we introduce SATORI (Spatially Anchored Task Optimization with ReInforcement Learning), which decomposes VQA into three verifiable stages, including global image captioning, region localization, and answer prediction, each supplying explicit reward signals. Furthermore, we also introduce VQA-Verify, a 12k dataset annotated with answer-aligned captions and bounding boxes to facilitate training. Experiments demonstrate consistent performance improvements across seven VQA benchmarks, achieving up to 15.7% improvement in accuracy compared to the R1-like baseline. Our analysis of the attention map confirms enhanced focus on critical regions, which brings improvements in accuracy. Our code is available at https://github.com/justairr/SATORI-R1.
中文: 本文提出的SATORI方法将视觉问答分解为可验证的阶段,通过明确的奖励信号增强关键区域关注,实现了最高15.7%的准确率提升。
English: The paper introduces SATORI, a method that enhances Visual Question Answering by decomposing it into verifiable stages with explicit rewards, improving accuracy by up to 15.7% and focusing on critical regions.

Authors:Benjamin Clavié, Florian Brand
Title: ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models
Abstract:
Recent advancements in Large Vision-Language Models (VLMs) have greatly enhanced their capability to jointly process text and images. However, despite extensive benchmarks evaluating visual comprehension (e.g., diagrams, color schemes, OCR tasks), there is limited assessment of VLMs' ability to read and reason about text-rich images effectively. To fill this gap, we introduce ReadBench, a multimodal benchmark specifically designed to evaluate the reading comprehension capabilities of VLMs. ReadBench transposes contexts from established text-only benchmarks into images of text while keeping textual prompts and questions intact. Evaluating leading VLMs with ReadBench, we find minimal-but-present performance degradation on short text-image inputs, while performance sharply declines for longer, multi-page contexts. Our experiments further reveal that text resolution has negligible effects on multimodal performance. These findings highlight needed improvements in VLMs, particularly their reasoning over visually presented extensive textual content, a capability critical for practical applications. ReadBench is available at https://github.com/answerdotai/ReadBench .
中文摘要:ReadBench是一个专门评估大型视觉语言模型在文本密集图像上阅读理解能力的新基准,发现模型在处理长篇内容时表现显著下降,而文本分辨率影响甚微。
English Summary: ReadBench is a new benchmark designed to assess how well Large Vision-Language Models read and reason about text-rich images, revealing significant performance drops with longer textual content despite minimal impact from text resolution.

Authors:Yifeng Xu, Zhenliang He, Meina Kan, Shiguang Shan, Xilin Chen
Title: Jodi: Unification of Visual Generation and Understanding via Joint Modeling
Abstract:
Visual generation and understanding are two deeply interconnected aspects of human intelligence, yet they have been traditionally treated as separate tasks in machine learning. In this paper, we propose Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Specifically, Jodi is built upon a linear diffusion transformer along with a role switch mechanism, which enables it to perform three particular types of tasks: (1) joint generation, where the model simultaneously generates images and multiple labels; (2) controllable generation, where images are generated conditioned on any combination of labels; and (3) image perception, where multiple labels can be predicted at once from a given image. Furthermore, we present the Joint-1.6M dataset, which contains 200,000 high-quality images collected from public sources, automatic labels for 7 visual domains, and LLM-generated captions. Extensive experiments demonstrate that Jodi excels in both generation and understanding tasks and exhibits strong extensibility to a wider range of visual domains. Code is available at https://github.com/VIPL-GENUN/Jodi.
中文: 本文提出Jodi框架,通过联合建模图像与多标签域的统一扩散方法,在生成与理解任务中表现优异,并展现出对广泛视觉领域的强大扩展能力。
English: This paper introduces Jodi, a diffusion framework that unifies visual generation and understanding through joint modeling of images and multiple labels, demonstrating strong performance across three task types and extensibility to diverse visual domains.

Authors:Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye
Title: Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs
Abstract:
Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise their generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, here we propose Universal Reasoner (UniR), a single, lightweight, composable, and plug-and-play reasoning module that can be used with any frozen LLM to endow it with specialized reasoning capabilities. Specifically, UniR decomposes the reward into a standalone reasoning module that is trained independently using predefined rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR can be combined with any frozen LLM at inference time by simply adding its output logits to those of the LLM backbone. This additive structure naturally enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Experimental results on mathematical reasoning and machine translation tasks show that UniR significantly outperforms existing baseline fine-tuning methods using the Llama3.2 model. Furthermore, UniR demonstrates strong weak-to-strong generalization: reasoning modules trained on smaller models effectively guide much larger LLMs. This makes UniR a cost-efficient, adaptable, and robust solution for enhancing reasoning in LLMs without compromising their core capabilities. Code is open-sourced at https://github.com/hangeol/UniR
中文: Universal Reasoner (UniR) 是一种轻量级即插即用推理模块,可与任何冻结的大型语言模型结合,无需重新训练即可增强其专业推理能力,在多项任务中超越现有方法,实现了高效、自适应的推理增强。
English: The Universal Reasoner (UniR) is a lightweight, plug-and-play reasoning module that can be added to any frozen large language model to enhance its specialized reasoning capabilities without retraining, outperforming existing methods and enabling cost-efficient, adaptable reasoning across tasks.
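
The additive composition described in the abstract (module logits summed with the frozen backbone's logits at inference time) can be sketched in a few lines. This is an illustration under assumptions, not the UniR code: the function names are hypothetical, the logits are hand-written toy vectors over a three-token vocabulary, and the sketch assumes the backbone and every module share the same vocabulary.

```python
import math
from typing import List


def combine_logits(backbone: List[float],
                   module_logits: List[List[float]]) -> List[float]:
    """Add the logits of each reasoning module to the frozen backbone's logits."""
    combined = list(backbone)
    for logits in module_logits:
        if len(logits) != len(combined):
            raise ValueError("all logit vectors must share one vocabulary size")
        combined = [c + l for c, l in zip(combined, logits)]
    return combined


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(logits: List[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    backbone = [2.0, 1.0, 0.0]       # toy backbone prefers token 0
    reasoner = [-1.5, 1.0, 0.0]      # toy module steers toward token 1
    guided = combine_logits(backbone, [reasoner])
    print(greedy_token(backbone), greedy_token(guided))
    print(softmax(guided))
```

Because the combination is a plain sum, passing several module vectors in `module_logits` mirrors the multi-module composition mentioned in the abstract, and the backbone's weights are never touched.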

Authors:Jiashuo Chang, Zhengyi Li, Jianxun Lou, Zhen Qiu, Hanhe Lin
Title: MMP-2K: A Benchmark Multi-Labeled Macro Photography Image Quality Assessment Database
Abstract:
Macro photography (MP) is a specialized field of photography that captures objects at an extremely close range, revealing tiny details. Although an accurate macro photography image quality assessment (MPIQA) metric can benefit macro photograph capturing, which is vital in some domains such as scientific research and medical applications, the lack of MPIQA data limits the development of MPIQA metrics. To address this limitation, we conducted a large-scale MPIQA study. Specifically, to ensure diversity both in content and quality, we sampled 2,000 MP images from 15,700 MP images, collected from three public image websites. For each MP image, 17 (out of 21 after outlier removal) quality ratings and a detailed quality report of distortion magnitudes, types, and positions are gathered by a lab study. The images, quality ratings, and quality reports form our novel multi-labeled MPIQA database, MMP-2k. Experimental results showed that the state-of-the-art generic IQA metrics underperform on MP images. The database and supplementary materials are available at https://github.com/Future-IQA/MMP-2k.
中文: 本研究通过大规模实验建立了MMP-2k多标签宏观摄影图像质量评估数据库,发现现有通用图像质量评估指标在宏观摄影图像上表现不佳。
English: This study introduces the MMP-2k database, a multi-labeled macro photography image quality assessment resource developed through a large-scale lab study, revealing that current generic IQA metrics perform poorly on macro images.

Authors:Ke-Han Lu, Chun-Yi Kuan, Hung-yi Lee
Title: Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models
Abstract:
We introduce Speech-IFEval, an evaluation framework designed to assess instruction-following capabilities and quantify catastrophic forgetting in speech-aware language models (SLMs). Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Existing benchmarks conflate speech perception with instruction-following, hindering evaluation of these distinct skills. To address this gap, we provide a benchmark for diagnosing the instruction-following abilities of SLMs. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs. Additionally, these models are highly sensitive to prompt variations, often yielding inconsistent and unreliable outputs. We highlight core challenges and provide insights to guide future research, emphasizing the need for evaluation beyond task-level metrics.
中文: 我们推出Speech-IFEval评估框架,旨在检测语音增强语言模型的指令遵循能力并量化其灾难性遗忘问题,发现这些模型在执行基本指令时表现远逊于纯文本模型且对提示变化极为敏感。
English: We present Speech-IFEval, a framework that evaluates instruction-following abilities and measures catastrophic forgetting in speech-aware language models, revealing that they struggle with basic instructions far more than text-based models and are highly sensitive to prompt variations.

Authors:Minzhi Lin, Tianchi Xie, Mengchen Liu, Yilin Ye, Changjian Chen, Shixia Liu
Title: InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts
Abstract:
Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at https://github.com/CoolDawnAnt/InfoChartQA.
中文摘要:InfoChartQA是一个评估多模态大语言模型理解信息图表能力的新基准,通过对比分析配对的信息图表和普通图表,揭示了模型在设计元素特别是视觉隐喻问题上的显著性能差距。
English Summary: InfoChartQA is a new benchmark designed to evaluate multimodal large language models' ability to understand infographic charts with design elements, revealing significant performance gaps especially on visual metaphor questions through comparative analysis of paired infographic and plain charts.

Authors:Jingping Liu, Ziyan Liu, Zhedong Cen, Yan Zhou, Yinan Zou, Weiyan Zhang, Haiyun Jiang, Tong Ruan
Title: Can Multimodal Large Language Models Understand Spatial Relations?
Abstract:
Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues like relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model's prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting the future research directions. The benchmark and codes are available at https://github.com/ziyan-xiaoyu/SpatialMQA.git.
Chinese: 该研究提出了SpatialMQA,这是一个基于COCO2017的人工标注空间关系推理基准,旨在解决现有基准依赖边界框和忽视图像理解的问题,实验表明当前最先进的多模态大语言模型准确率仅为48.14%,远低于人类98.40%的水平。
English: The study introduces SpatialMQA, a human-annotated benchmark to improve spatial relation reasoning in multimodal large language models by addressing issues like reliance on bounding boxes and lack of image understanding, revealing that current models perform poorly with only 48.14% accuracy compared to human-level performance.

Authors:Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang
Title: VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization
Abstract:
Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as judges to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) Significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) Our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
Chinese Summary: 强化学习应用于视频大语言模型虽前景广阔,但存在数据瓶颈和不稳定性问题,提出的VerIPO方法通过验证器引导的迭代优化,有效提升推理链质量并实现高效训练。
English Summary: Reinforcement Learning applied to Video Large Language Models shows promise but faces data bottlenecks and instability, which the proposed VerIPO method overcomes by using a verifier-guided iterative optimization to enhance reasoning chain quality efficiently.
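
The verifier step between GRPO and DPO can be illustrated with a minimal sketch. The abstract does not say how the Rollout-Aware Verifier scores rollouts or how pairs are selected, so the scalar `verifier_score`, the `min_margin` threshold, and the best-versus-worst pairing below are all assumptions made for illustration, not the authors' procedure.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    response: str
    # Hypothetical scalar from a small LLM judge; higher = sounder reasoning.
    verifier_score: float


def build_preference_pairs(rollouts, min_margin=0.5):
    """Turn verifier-scored rollouts into (chosen, rejected) pairs for DPO.

    Assumption: for each prompt we pair the best- and worst-scored rollout
    and keep the pair only if the score gap is at least `min_margin`.
    """
    by_prompt = {}
    for r in rollouts:
        by_prompt.setdefault(r.prompt, []).append(r)

    pairs = []
    for prompt, group in by_prompt.items():
        best = max(group, key=lambda r: r.verifier_score)
        worst = min(group, key=lambda r: r.verifier_score)
        if best.verifier_score - worst.verifier_score >= min_margin:
            pairs.append(
                {"prompt": prompt, "chosen": best.response, "rejected": worst.response}
            )
    return pairs
```

Prompts whose rollouts all receive similar scores yield no pair, which is one simple way to keep low-contrast data out of the DPO stage.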

Authors:Hewen Xiao, Xiuping Liu, Hang Zhao, Jian Liu, Kai Xu
Title: Designing Pin-pression Gripper and Learning its Dexterous Grasping with Online In-hand Adjustment
Abstract:
We introduce a novel design of parallel-jaw grippers drawing inspiration from pin-pression toys. The proposed pin-pression gripper features a distinctive mechanism in which each finger integrates a 2D array of pins capable of independent extension and retraction. This unique design allows the gripper to instantaneously customize its finger's shape to conform to the object being grasped by dynamically adjusting the extension/retraction of the pins. In addition, the gripper excels in in-hand re-orientation of objects for enhanced grasping stability again via dynamically adjusting the pins. To learn the dynamic grasping skills of pin-pression grippers, we devise a dedicated reinforcement learning algorithm with careful designs of state representation and reward shaping. To achieve a more efficient grasp-while-lift grasping mode, we propose a curriculum learning scheme. Extensive evaluations demonstrate that our design, together with the learned skills, leads to highly flexible and robust grasping with much stronger generality to unseen objects than alternatives. We also highlight encouraging physical results of sim-to-real transfer on a physically manufactured pin-pression gripper, demonstrating the practical significance of our novel gripper design and grasping skill. Demonstration videos for this paper are available at https://github.com/siggraph-pin-pression-gripper/pin-pression-gripper-video.
中文: 本文介绍了一种采用二维独立伸缩针阵列的压针式夹爪,能动态贴合物体实现灵巧抓取与掌内重定向,并通过专门设计的强化学习算法显著提升了对未见物体的抓取泛化能力与稳健性。
English: This paper presents a pin-pression gripper with a 2D array of independently movable pins that dynamically conform to objects for flexible grasping and in-hand reorientation, enhanced by a tailored reinforcement learning algorithm for robust performance across diverse objects.

Authors:Tianchen Deng, Wenhua Wu, Junjie He, Yue Pan, Xirui Jiang, Shenghai Yuan, Danwei Wang, Hesheng Wang, Weidong Chen
Title: VPGS-SLAM: Voxel-based Progressive 3D Gaussian SLAM in Large-Scale Scenes
Abstract:
3D Gaussian Splatting has recently shown promising results in dense visual SLAM. However, existing 3DGS-based SLAM methods are all constrained to small-room scenarios and struggle with memory explosion in large-scale scenes and long sequences. To this end, we propose VPGS-SLAM, the first 3DGS-based large-scale RGBD SLAM framework for both indoor and outdoor scenarios. We design a novel voxel-based progressive 3D Gaussian mapping method with multiple submaps for compact and accurate scene representation in large-scale and long-sequence scenes. This allows us to scale up to arbitrary scenes and improves robustness (even under pose drifts). In addition, we propose a 2D-3D fusion camera tracking method to achieve robust and accurate camera tracking in both indoor and outdoor large-scale scenes. Furthermore, we design a 2D-3D Gaussian loop closure method to eliminate pose drift. We further propose a submap fusion method with online distillation to achieve global consistency in large-scale scenes when detecting a loop. Experiments on various indoor and outdoor datasets demonstrate the superiority and generalizability of the proposed framework. The code will be open source on https://github.com/dtc111111/vpgs-slam.
中文: VPGS-SLAM是首个基于3D高斯泼溅的大规模RGBD SLAM框架,通过体素渐进式建图和2D-3D融合跟踪技术,有效解决了先前方法在大型场景中的内存爆炸问题,实现了室内外场景的鲁棒定位与建图。
English: VPGS-SLAM is the first 3D Gaussian Splatting-based RGBD SLAM framework that scales to large indoor and outdoor scenes through voxel-progressive mapping and 2D-3D fusion tracking, overcoming prior limitations of memory explosion and small-scale constraints.

Authors:Tianyu Zhang, Xinyu Wang, Lu Li, Zhenghan Tai, Jijun Chi, Jingrui Tian, Hailin He, Suyuchen Wang
Title: STRICT: Stress Test of Rendering Images Containing Text
Abstract:
While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce STRICT, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text; and (3) the rate at which models fail to follow instructions for generating text. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at https://github.com/tianyu-z/STRICT-Bench.
中文:扩散模型在生成逼真图像方面表现出色,但在图像中生成一致且清晰文本方面存在不足,因此我们开发了STRICT基准来系统评估文本长度、正确性和指令遵循能力,揭示了持续存在的局限并推动了未来研究方向。
English: Diffusion models excel in realistic image generation but fail to produce consistent and legible text, leading to the creation of the STRICT benchmark for systematic evaluation across text length, correctness, and instruction adherence, revealing persistent limitations and motivating future research.

Authors:Pingbang Hu, Joseph Melkonian, Weijing Tang, Han Zhao, Jiaqi W. Ma
Title: GraSS: Scalable Influence Function with Sparse Gradient Compression
Abstract:
Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GraSS, a novel gradient compression algorithm, and its variant FactGraSS, designed specifically for linear layers; both explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines. Our code is publicly available at https://github.com/TRAIS-Lab/GraSS.
Chinese: GraSS及其变体FactGraSS通过利用逐样本梯度的稀疏性,在保持数据影响力准确性的同时显著降低了计算和内存成本,实现了大规模模型训练效率的显著提升。
English: GraSS and its variant FactGraSS are gradient compression algorithms that exploit the sparsity of per-sample gradients to reduce computational and memory costs while maintaining data influence accuracy, achieving significant speed improvements in large-scale models.

Authors:Jiong Wu, Yang Xing, Boxiao Yu, Wei Shao, Kuang Gong
Title: CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation
Abstract:
Most publicly available medical segmentation datasets are only partially labeled, with annotations provided for a subset of anatomical structures. When multiple datasets are combined for training, this incomplete annotation poses challenges, as it limits the model's ability to learn shared anatomical representations among datasets. Furthermore, vision-only frameworks often fail to capture complex anatomical relationships and task-specific distinctions, leading to reduced segmentation accuracy and poor generalizability to unseen datasets. In this study, we proposed a novel CLIP-DINO Prompt-Driven Segmentation Network (CDPDNet), which combined a self-supervised vision transformer with CLIP-based text embedding and introduced task-specific text prompts to tackle these challenges. Specifically, the framework was constructed upon a convolutional neural network (CNN) and incorporated DINOv2 to extract both fine-grained and global visual features, which were then fused using a multi-head cross-attention module to overcome the limited long-range modeling capability of CNNs. In addition, CLIP-derived text embeddings were projected into the visual space to help model complex relationships among organs and tumors. To further address the partial label challenge and enhance inter-task discriminative capability, a Text-based Task Prompt Generation (TTPG) module that generated task-specific prompts was designed to guide the segmentation. Extensive experiments on multiple medical imaging datasets demonstrated that CDPDNet consistently outperformed existing state-of-the-art segmentation methods. Code and pretrained model are available at: https://github.com/wujiong-hub/CDPDNet.git.
中文: 本研究提出CDPDNet新型分割网络,通过融合视觉特征与文本嵌入解决医学影像中的部分标注问题,并在多数据集实验中展现出优于现有方法的性能。
English: The study introduces CDPDNet, a novel segmentation network that integrates vision and text embeddings to address incomplete labeling and anatomical relationship modeling in medical imaging, demonstrating superior performance over existing methods.

Authors:Yining Pan, Qiongjie Cui, Xulei Yang, Na Zhao
Title: How Do Images Align and Complement LiDAR? Towards a Harmonized Multi-modal 3D Panoptic Segmentation
Abstract:
LiDAR-based 3D panoptic segmentation often struggles with the inherent sparsity of data from LiDAR sensors, which makes it challenging to accurately recognize distant or small objects. Recently, a few studies have sought to overcome this challenge by integrating LiDAR inputs with camera images, leveraging the rich and dense texture information provided by the latter. While these approaches have shown promising results, they still face challenges, such as misalignment during data augmentation and the reliance on post-processing steps. To address these issues, we propose Image-Assists-LiDAR (IAL), a novel multi-modal 3D panoptic segmentation framework. In IAL, we first introduce a modality-synchronized data augmentation strategy, PieAug, to ensure alignment between LiDAR and image inputs from the start. Next, we adopt a transformer decoder to directly predict panoptic segmentation results. To effectively fuse LiDAR and image features into tokens for the decoder, we design a Geometric-guided Token Fusion (GTF) module. Additionally, we leverage the complementary strengths of each modality as priors for query initialization through a Prior-based Query Generation (PQG) module, enhancing the decoder's ability to generate accurate instance masks. Our IAL framework achieves state-of-the-art performance compared to previous multi-modal 3D panoptic segmentation methods on two widely used benchmarks. Code and models are publicly available at .
中文: 提出的图像辅助激光雷达(IAL)框架通过同步数据增强和创新融合模块,有效结合激光雷达与相机数据,解决了多模态错位问题并优化特征整合,从而在三维全景分割中实现了最先进的性能。
English: The proposed Image-Assists-LiDAR (IAL) framework introduces synchronized data augmentation and novel fusion modules to effectively combine LiDAR and camera data, achieving state-of-the-art 3D panoptic segmentation by addressing misalignment and enhancing feature integration.

Authors:Saman Sarker Joy
Title: BnMMLU: Measuring Massive Multitask Language Understanding in Bengali
Abstract:
The Massive Multitask Language Understanding (MMLU) benchmark has been widely used to evaluate language models across various domains. However, existing MMLU datasets primarily focus on high-resource languages such as English, which leaves low-resource languages like Bengali underrepresented. In this paper, we introduce BnMMLU, a benchmark to evaluate the Bengali multitask language understanding capabilities of language models. The dataset spans 23 domains, including science, humanities, mathematics, and general knowledge, and is structured in a multiple-choice format to assess the factual knowledge, application-based problem-solving, and reasoning abilities of language models. It consists of 138,949 question-option pairs. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set. Additionally, we annotate the test set with three cognitive categories (factual knowledge, procedural application, and reasoning) to gain deeper insights into model strengths and weaknesses across various cognitive tasks. The results reveal significant performance gaps, highlighting the need for improved pre-training and fine-tuning strategies tailored to Bengali data. We release the dataset and benchmark results to facilitate further research in this area.
中文: 本文提出了BnMMLU,这是一个涵盖23个领域、包含138,949个问题选项对的孟加拉语综合基准,揭示了语言模型存在的显著性能差距,并强调需要针对孟加拉语数据改进相关策略。
English: The paper introduces BnMMLU, a comprehensive Bengali benchmark spanning 23 domains with 138,949 question-option pairs, revealing significant performance gaps in language models and highlighting the need for improved strategies tailored to Bengali data.

Authors:Xuanming Zhang, Yuxuan Chen, Min-Hsuan Yeh, Yixuan Li
Title: MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems
Abstract:
Human social interactions depend on the ability to infer others' unspoken intentions, emotions, and beliefs, a cognitive skill grounded in the psychological concept of Theory of Mind (ToM). While large language models (LLMs) excel in semantic understanding tasks, they struggle with the ambiguity and contextual nuance inherent in human communication. To bridge this gap, we introduce MetaMind, a multi-agent framework inspired by psychological theories of metacognition, designed to emulate human-like social reasoning. MetaMind decomposes social understanding into three collaborative stages: (1) a Theory-of-Mind Agent generates hypotheses about user mental states (e.g., intent, emotion), (2) a Domain Agent refines these hypotheses using cultural norms and ethical constraints, and (3) a Response Agent generates contextually appropriate responses while validating alignment with inferred intent. Our framework achieves state-of-the-art performance across three challenging benchmarks, with 35.7% improvement in real-world social scenarios and 6.2% gain in ToM reasoning. Notably, it enables LLMs to match human-level performance on key ToM tasks for the first time. Ablation studies confirm the necessity of all components and showcase the framework's ability to balance contextual plausibility, social appropriateness, and user adaptation. This work advances AI systems toward human-like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions. Code is available at https://github.com/XMZhangAI/MetaMind.
中文摘要:MetaMind是一个多智能体框架,通过模拟人类思维的三阶段协作推理来增强人工智能的社交智能,在心理理论任务中实现显著提升并首次达到人类水平表现。
English Summary: MetaMind is a multi-agent framework that enhances AI social intelligence by simulating human-like reasoning through three collaborative stages, achieving significant improvements in Theory of Mind tasks and matching human performance for the first time.

Authors:Xiping Li, Xiangyu Dong, Xingyi Zhang, Kun Xie, Yuanhao Feng, Bo Wang, Guilin Li, Wuxiong Zeng, Xiujun Shu, Sibo Wang
Title: Chi-Square Wavelet Graph Neural Networks for Heterogeneous Graph Anomaly Detection
Abstract:
Graph Anomaly Detection (GAD) in heterogeneous networks presents unique challenges due to node and edge heterogeneity. Existing Graph Neural Network (GNN) methods primarily focus on homogeneous GAD and thus fail to address three key issues: (C1) Capturing abnormal signal and rich semantics across diverse meta-paths; (C2) Retaining high-frequency content in HIN dimension alignment; and (C3) Learning effectively from difficult anomaly samples with class imbalance. To overcome these, we propose ChiGAD, a spectral GNN framework based on a novel Chi-Square filter, inspired by the wavelet effectiveness in diverse domains. Specifically, ChiGAD consists of: (1) Multi-Graph Chi-Square Filter, which captures anomalous information via applying dedicated Chi-Square filters to each meta-path graph; (2) Interactive Meta-Graph Convolution, which aligns features while preserving high-frequency information and incorporates heterogeneous messages by a unified Chi-Square Filter; and (3) Contribution-Informed Cross-Entropy Loss, which prioritizes difficult anomalies to address class imbalance. Extensive experiments on public and industrial datasets show that ChiGAD outperforms state-of-the-art models on multiple metrics. Additionally, its homogeneous variant, ChiGNN, excels on seven GAD datasets, validating the effectiveness of Chi-Square filters. Our code is available at https://github.com/HsipingLi/ChiGAD.
Chinese: 提出的ChiGAD框架通过采用基于卡方检验的谱图滤波器,能够在异构网络中跨元路径捕捉异常信号并保留高频信息,同时通过贡献感知的交叉熵损失解决类别不平衡问题,实验证明其性能优于现有先进模型。
English: The proposed ChiGAD framework addresses key challenges in heterogeneous graph anomaly detection by employing spectral Chi-Square filters to capture anomalies across meta-paths while preserving high-frequency information and handling class imbalance through specialized loss functions, demonstrating superior performance over existing methods.

Authors:Javier Salazar Cavazos, Jeffrey A Fessler, Laura Balzano
Title: ALPCAHUS: Subspace Clustering for Heteroscedastic Data
Abstract:
Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. Various methods have been proposed to extend PCA to the union of subspace (UoS) setting for clustering data that come from multiple subspaces like K-Subspaces (KSS). However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a heteroscedastic-focused subspace clustering method, named ALPCAHUS, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace bases associated with the low-rank structure of the data. This clustering algorithm builds on K-Subspaces (KSS) principles by extending the recently proposed heteroscedastic PCA method, named LR-ALPCAH, for clusters with heteroscedastic noise in the UoS setting. Simulations and real-data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing clustering algorithms. Code available at https://github.com/javiersc1/ALPCAHUS.
中文: 本文提出了ALPCAHUS异方差子空间聚类方法,通过估计样本级噪声方差来改进子空间基估计,在处理混合质量数据方面展现出优于现有算法的性能。
English: This paper introduces ALPCAHUS, a heteroscedastic subspace clustering method that estimates sample-wise noise variances to enhance subspace basis estimation, demonstrating superior performance over existing algorithms in handling data with mixed quality.

Authors:Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille
Title: Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
Abstract:
Vision-Language Models (VLMs) have shown promise in various 2D visual tasks, yet their readiness for 3D clinical diagnosis remains unclear due to stringent demands for recognition precision, reasoning ability, and domain knowledge. To systematically evaluate these dimensions, we present DeepTumorVQA, a diagnostic visual question answering (VQA) benchmark targeting abdominal tumors in CT scans. It comprises 9,262 CT volumes (3.7M slices) from 17 public datasets, with 395K expert-level questions spanning four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning. DeepTumorVQA introduces unique challenges, including small tumor detection and clinical reasoning across 3D anatomy. Benchmarking four advanced VLMs (RadFM, M3D, Merlin, CT-CHAT), we find current models perform adequately on measurement tasks but struggle with lesion recognition and reasoning, and still do not meet clinical needs. Two key insights emerge: (1) large-scale multimodal pretraining plays a crucial role in DeepTumorVQA testing performance, making RadFM stand out among all VLMs; and (2) our dataset exposes critical differences in VLM components, where proper image preprocessing and design of vision modules significantly affect 3D perception. To facilitate medical multimodal research, we have released DeepTumorVQA as a rigorous benchmark: https://github.com/Schuture/DeepTumorVQA.
中文: DeepTumorVQA作为诊断基准,旨在系统评估视觉语言模型在三维临床诊断中的表现,发现现有模型在测量任务中表现良好,但在病灶识别和推理方面存在不足,且大规模多模态预训练对性能提升至关重要。
English: DeepTumorVQA is introduced as a diagnostic benchmark to evaluate Vision-Language Models' capabilities in 3D clinical diagnosis, revealing that current models excel in measurement tasks but fall short in lesion recognition and reasoning, with large-scale multimodal pretraining being crucial for performance.

Authors:Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica
Title: Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
Abstract:
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively. Our code is open-sourced at https://github.com/svg-project/Sparse-VideoGen.
中文:SVG2提出了一种无需训练的框架,通过语义感知的令牌聚类和重排序,提升关键令牌识别精度并优化计算效率,在保持高质量视频生成的同时实现了显著加速。
English: SVG2 introduces a training-free framework with semantic-aware token clustering and reordering to enhance critical token identification and computational efficiency, achieving significant speedups while maintaining high video generation quality.
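
The idea of clustering tokens with k-means and then reordering them so that each cluster is contiguous can be sketched in a few lines. This is a toy, pure-Python illustration only: the function names, the plain Lloyd's k-means, and the list-of-floats token representation are assumptions, and it does not reflect the custom GPU kernels or the top-p budget control mentioned in the abstract.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_labels(tokens, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per token."""
    rng = random.Random(seed)
    centroids = [list(t) for t in rng.sample(tokens, k)]
    labels = [0] * len(tokens)
    for _ in range(iters):
        # Assignment step: nearest centroid for every token.
        labels = [
            min(range(k), key=lambda c: sq_dist(t, centroids[c])) for t in tokens
        ]
        # Update step: mean of each cluster (empty clusters keep their centroid).
        for c in range(k):
            members = [t for t, lab in zip(tokens, labels) if lab == c]
            if members:
                dim = len(members[0])
                centroids[c] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return labels


def semantic_permutation(tokens, k):
    """Return (perm, inverse, labels) so that tokens[perm] is grouped by cluster.

    `inverse` maps each original position to its position after reordering,
    which lets outputs computed in the permuted layout be restored.
    """
    labels = kmeans_labels(tokens, k)
    perm = sorted(range(len(tokens)), key=lambda i: labels[i])  # stable sort
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return perm, inverse, labels
```

After the permutation, tokens of the same cluster occupy one contiguous block, so a block-sparse attention routine could skip whole blocks; applying `inverse` afterwards restores the original token order.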

Authors:Alexander Shabalin, Viacheslav Meshchaninov, Dmitry Vetrov
Title: Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation
Abstract:
Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. Our code is available at https://github.com/ashaba1in/smoothie.
中文摘要:Smoothie是一种创新的扩散模型,通过基于语义相似度逐步平滑词嵌入,在序列到序列生成任务中超越了现有扩散模型,展现出更优的生成质量。
English Summary: Smoothie is a novel diffusion model that enhances text generation by progressively smoothing token embeddings based on semantic similarity, achieving superior performance in sequence-to-sequence tasks compared to existing diffusion methods.

Authors:Kai Mei, Xi Zhu, Hang Gao, Shuhang Lin, Yongfeng Zhang
Title: LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS
Abstract:
We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform's effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems. The source code of LiteCUA is available at https://github.com/agiresearch/LiteCUA, and it is also integrated into the AIOS main branch as part of AIOS at https://github.com/agiresearch/AIOS.
中文: AIOS 1.0通过构建可被语言模型原生理解的上下文环境,采用模型上下文协议解决语义鸿沟问题,其轻量级代理LiteCUA在OSWorld基准测试中取得14.66%的成功率,展示了环境情境化对提升计算机使用代理能力的有效性。
English: AIOS 1.0 introduces a novel platform that addresses the semantic gap between language models and computer interfaces by creating contextual environments through a Model Context Protocol, enabling more effective agent reasoning and achieving a 14.66% success rate on OSWorld with its LiteCUA agent.

Authors:Nicholas M. Boffi, Michael S. Albergo, Eric Vanden-Eijnden
Title: How to build a consistency model: Learning flow maps via self-distillation
Abstract:
Flow-based generative models achieve state-of-the-art sample quality, but require the expensive solution of a differential equation at inference time. Flow map models, commonly known as consistency models, encompass many recent efforts to improve inference-time efficiency by learning the solution operator of this differential equation. Yet despite their promise, these models lack a unified description that clearly explains how to learn them efficiently in practice. Here, building on the methodology proposed in Boffi et al. (2024), we present a systematic algorithmic framework for directly learning the flow map associated with a flow or diffusion model. By exploiting a relationship between the velocity field underlying a continuous-time flow and the instantaneous rate of change of the flow map, we show how to convert any distillation scheme into a direct training algorithm via self-distillation, eliminating the need for pre-trained teachers. We introduce three algorithmic families based on different mathematical characterizations of the flow map: Eulerian, Lagrangian, and Progressive methods, which we show encompass and extend all known distillation and direct training schemes for consistency models. We find that the novel class of Lagrangian methods, which avoid both spatial derivatives and bootstrapping from small steps by design, achieve significantly more stable training and higher performance than more standard Eulerian and Progressive schemes. Our methodology unifies existing training schemes under a single common framework and reveals new design principles for accelerated generative modeling. Associated code is available at https://github.com/nmboffi/flow-maps.
Chinese: 该研究提出了一个系统性框架,用于直接学习生成模型中的流映射,无需预训练教师模型,并揭示拉格朗日方法相比传统方法具有更优的训练稳定性和性能表现。
English: The study introduces a systematic framework for directly learning flow maps in generative models, eliminating the need for pre-trained teachers and revealing that Lagrangian methods achieve superior stability and performance compared to traditional approaches.
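
The relationship between the velocity field and the flow map mentioned in the abstract can be written out as a short worked equation. The notation below (velocity field b_t, flow map X_{s,t}) is assumed here for illustration and may differ from the paper's own symbols; it states only the standard definition of the flow map of an ODE, not any of the paper's training objectives.

```latex
% Continuous-time flow driven by a velocity field b_t (notation assumed):
\dot{x}_t = b_t(x_t)
% The flow map X_{s,t} transports a point at time s to its position at time t:
X_{s,s}(x) = x, \qquad \partial_t X_{s,t}(x) = b_t\bigl(X_{s,t}(x)\bigr)
% Letting s \to t recovers the velocity from the flow map's rate of change:
\lim_{s \to t} \partial_t X_{s,t}(x) = b_t(x)
```

Under this reading, a single evaluation of X_{s,t} replaces the numerical integration of the ODE from time s to time t.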

Authors:Libin Lan, Yanxin Li, Xiaojuan Liu, Juan Zhou, Jianxun Zhang, Nannan Huang, Yudong Zhang
Title: MSLAU-Net: A Hybrid CNN-Transformer Network for Medical Image Segmentation
Abstract:
Both CNN-based and Transformer-based methods have achieved remarkable success in medical image segmentation tasks. However, CNN-based methods struggle to effectively capture global contextual information due to the inherent limitations of convolution operations. Meanwhile, Transformer-based methods suffer from insufficient local feature modeling and face challenges related to the high computational complexity caused by the self-attention mechanism. To address these limitations, we propose a novel hybrid CNN-Transformer architecture, named MSLAU-Net, which integrates the strengths of both paradigms. The proposed MSLAU-Net incorporates two key ideas. First, it introduces Multi-Scale Linear Attention, designed to efficiently extract multi-scale features from medical images while modeling long-range dependencies with low computational complexity. Second, it adopts a top-down feature aggregation mechanism, which performs multi-level feature aggregation and restores spatial resolution using a lightweight structure. Extensive experiments conducted on benchmark datasets covering three imaging modalities demonstrate that the proposed MSLAU-Net outperforms other state-of-the-art methods on nearly all evaluation metrics, validating the superiority, effectiveness, and robustness of our approach. Our code is available at https://github.com/Monsoon49/MSLAU-Net.
中文: 提出的MSLAU-Net混合架构通过整合多尺度线性注意力机制和轻量级自上而下特征聚合策略,有效克服了CNN与Transformer模型的各自局限,在多种医学影像基准测试中均实现了最优性能。
English: The proposed MSLAU-Net hybrid architecture effectively overcomes the limitations of CNN and Transformer models by integrating multi-scale linear attention for efficient global context modeling and a lightweight top-down aggregation mechanism, achieving superior performance across multiple medical imaging benchmarks.

Authors:Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Shunyu Liu, Dacheng Tao
Title: VORTA: Efficient Video Diffusion via Routing Sparse Attention
Abstract:
Video Diffusion Transformers (VDiTs) have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent attention acceleration methods leverage the sparsity of attention patterns to improve efficiency; however, they often overlook inefficiencies of redundant long-range interactions. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants throughout the sampling process. It achieves a $1.76\times$ end-to-end speedup without quality loss on VBench. Furthermore, VORTA can seamlessly integrate with various other acceleration methods, such as caching and step distillation, reaching up to $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of VDiTs in real-world settings.
Chinese: VORTA通过引入稀疏注意力机制和自适应路由策略,在保持视频生成质量的同时,将视频扩散变压器的处理速度最高提升14.41倍。
English: VORTA accelerates Video Diffusion Transformers with a sparse attention mechanism and an adaptive routing strategy, achieving a 1.76× end-to-end speedup without quality loss on VBench and up to 14.41× when combined with other acceleration methods, with negligible performance degradation.

Authors:Hao Chen, Haoze Li, Zhiqing Xiao, Lirong Gao, Qi Zhang, Xiaomeng Hu, Ningtao Wang, Xing Fu, Junbo Zhao
Title: ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models
Abstract:
Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant training adjustment costs. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the Attention Localization and Pruning Strategy (ALPS), an efficient algorithm that localizes the most task-sensitive attention heads and prunes by restricting attention training updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only 10% of attention parameters during fine-tuning while achieving a 2% performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment. The code is available at https://github.com/VoiceBeer/ALPS.
中文摘要:本研究提出的注意力定位与剪枝策略(ALPS)通过精确定位大语言模型中任务敏感度最高的注意力头并仅对这些头部进行微调,在激活10%参数的情况下实现性能提升2%,同时发现这些任务特定头部具有跨数据集可迁移性。
English Summary: The proposed Attention Localization and Pruning Strategy (ALPS) efficiently identifies and fine-tunes only the most task-sensitive attention heads in large language models, reducing alignment costs by activating just 10% of attention parameters while improving performance by 2% over baselines on three tasks.

Authors:David K. Zhang, Alex Aiken
Title: Automatic Verification of Floating-Point Accumulation Networks
Abstract:
Floating-point accumulation networks (FPANs) are key building blocks used in many floating-point algorithms, including compensated summation and double-double arithmetic. FPANs are notoriously difficult to analyze, and algorithms using FPANs are often published without rigorous correctness proofs. In fact, on at least one occasion, a published error bound for a widely used FPAN was later found to be incorrect. In this paper, we present an automatic procedure that produces computer-verified proofs of several FPAN correctness properties, including error bounds that are tight to the nearest bit. Our approach is underpinned by a novel floating-point abstraction that models the sign, exponent, and number of leading and trailing zeros and ones in the mantissa of each number flowing through an FPAN. We also present a new FPAN for double-double addition that is faster and more accurate than the previous best known algorithm.
Chinese: 本文提出了一种自动生成浮点累加网络(FPAN)正确性计算机验证证明的方法,包括精确的误差界限,并介绍了一种更高效、更精确的新型双精度加法FPAN。
English: This paper introduces an automated method for generating computer-verified proofs of floating-point accumulation network (FPAN) correctness, including precise error bounds, and presents a new, more efficient and accurate double-double addition FPAN.
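
As background for the compensated summation mentioned in the abstract, the sketch below shows the classic TwoSum error-free transformation (due to Knuth), a standard building block of such algorithms. It is an illustration only and is not the FPAN proposed in the paper; the function names `two_sum` and `compensated_sum` are chosen here for the example.

```python
from fractions import Fraction


def two_sum(a: float, b: float) -> tuple[float, float]:
    """Knuth's TwoSum: return (s, e) where s is the rounded sum a + b
    and e is the rounding error, so that s + e equals a + b exactly
    (barring overflow)."""
    s = a + b                  # rounded sum
    a_virtual = s - b          # part of s attributable to a
    b_virtual = s - a_virtual  # part of s attributable to b
    a_err = a - a_virtual
    b_err = b - b_virtual
    return s, a_err + b_err


def compensated_sum(values: list[float]) -> float:
    """Sum values while accumulating the rounding errors from two_sum."""
    total = 0.0
    error = 0.0
    for v in values:
        total, e = two_sum(total, v)
        error += e
    return total + error


# Check exactness with rational arithmetic: s + e reproduces a + b.
s, e = two_sum(0.1, 0.2)
assert Fraction(s) + Fraction(e) == Fraction(0.1) + Fraction(0.2)
```

Checking the identity with `Fraction` avoids relying on floating-point comparisons to validate a floating-point routine, which is the kind of property the paper verifies automatically for whole networks of such operations.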

Authors:Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
Title: Think Twice before Adaptation: Improving Adaptability of DeepFake Detection via Online Test-Time Adaptation
Abstract:
Deepfake (DF) detectors face significant challenges when deployed in real-world environments, particularly when encountering test samples that deviate from the training data through either postprocessing manipulations or distribution shifts. We demonstrate that postprocessing techniques can completely obscure the generation artifacts present in DF samples, leading to performance degradation of DF detectors. To address these challenges, we propose Think Twice before Adaptation (T$^2$A), a novel online test-time adaptation method that enhances the adaptability of detectors during inference without requiring access to source training data or labels. Our key idea is to enable the model to explore alternative options through an Uncertainty-aware Negative Learning objective rather than solely relying on its initial predictions as commonly seen in entropy minimization (EM)-based approaches. We also introduce an Uncertain Sample Prioritization strategy and Gradients Masking technique to improve the adaptation by focusing on important samples and model parameters. Our theoretical analysis demonstrates that the proposed negative learning objective exhibits complementary behavior to EM, facilitating better adaptation capability. Empirically, our method achieves state-of-the-art results compared to existing test-time adaptation (TTA) approaches and significantly enhances the resilience and generalization of DF detectors during inference. Code is available at https://github.com/HongHanh2104/T2A-Think-Twice-Before-Adaptation.
中文摘要:提出的“三思而后适应”(T²A)方法通过不确定性感知的负向学习和优先样本适配策略,有效提升了深度伪造检测器在推理过程中对后处理操作和分布偏移的抵御能力。
English Summary: The proposed Think Twice before Adaptation (T²A) method enhances deepfake detector resilience against postprocessing manipulations and distribution shifts through uncertainty-aware negative learning and prioritized sample adaptation during inference.

Authors:Jiayu Wang, Yang Jiao, Yue Yu, Tianwen Qian, Shaoxiang Chen, Jingjing Chen, Yu-Gang Jiang
Title: OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks
Abstract:
Recent breakthroughs in large multimodal models (LMMs), such as the impressive GPT-4o-Native, have demonstrated remarkable proficiency in following general-purpose instructions for image generation. However, current benchmarks often lack the necessary breadth and depth to fully evaluate the diverse capabilities of these models. To overcome this limitation, we introduce OmniGenBench, a novel and comprehensive benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs across both perception-centric and cognition-centric dimensions. Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand. For rigorous evaluation, we further employ a dual-mode protocol. This protocol utilizes off-the-shelf visual parsing tools for perception-centric tasks and a powerful LLM-based judge for cognition-centric tasks to assess the alignment between generated images and user instructions. Using OmniGenBench, we evaluate mainstream generative models, including prevalent models like GPT-4o, Gemini-2.0-Flash, and Seedream, and provide in-depth comparisons and analyses of their performance. Code and data are available at https://github.com/emilia113/OmniGenBench.
中文: 摘要介绍了OmniGenBench,这是一个全面的基准测试,旨在通过多样化的现实任务和双模式评估协议,评估大型多模态模型在感知和认知维度上的指令遵循能力。
English: The abstract introduces OmniGenBench, a comprehensive benchmark designed to evaluate the instruction-following capabilities of large multimodal models across perception and cognition dimensions, using diverse real-world tasks and a dual-mode assessment protocol.

Authors:Alexander Conzelmann, Robert Bamler
Title: Reducing Storage of Pretrained Neural Networks by Rate-Constrained Quantization and Entropy Coding
Abstract:
The ever-growing size of neural networks poses serious challenges on resource-constrained devices, such as embedded sensors. Compression algorithms that reduce their size can mitigate these problems, provided that model performance stays close to the original. We propose a novel post-training compression framework that combines rate-aware quantization with entropy coding by (1) extending the well-known layer-wise loss by a quadratic rate estimation, and (2) providing locally exact solutions to this modified objective following the Optimal Brain Surgeon (OBS) method. Our method allows for very fast decoding and is compatible with arbitrary quantization grids. We verify our results empirically by testing on various computer-vision networks, achieving a 20-40% decrease in bit rate at the same performance as the popular compression algorithm NNCodec. Our code is available at https://github.com/Conzel/cerwu.
Chinese: 本研究提出了一种新颖的训练后压缩框架,将速率感知量化与熵编码相结合,在多种计算机视觉网络中实现了20-40%的比特率降低,同时保持与NNCodec相当的性能。
English: This study introduces a novel post-training compression framework that integrates rate-aware quantization with entropy coding, achieving a 20-40% reduction in bit rate while maintaining performance comparable to NNCodec across various computer-vision networks.
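
To make the link between quantization and entropy coding concrete, the sketch below quantizes weights on a uniform grid and computes the empirical Shannon entropy of the resulting symbols, which approximates the bits per weight an ideal entropy coder would need. This is a generic illustration, not the paper's OBS-based procedure or its quadratic rate estimate; `quantize`, `entropy_bits`, and the grid spacing `step` are names assumed for the example.

```python
import math
from collections import Counter


def quantize(weights: list[float], step: float) -> list[int]:
    """Map each weight to the index of the nearest point on a uniform grid
    with spacing `step` (ties follow Python's round-half-to-even rule)."""
    return [round(w / step) for w in weights]


def entropy_bits(symbols: list[int]) -> float:
    """Empirical Shannon entropy of the symbols, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


weights = [0.26, -0.24, 0.01, 0.49, -0.51, 0.02]
fine = quantize(weights, step=0.1)    # many distinct symbols
coarse = quantize(weights, step=1.0)  # few distinct symbols
# A coarser grid never needs more bits per weight in this example.
assert entropy_bits(coarse) <= entropy_bits(fine)
```

A rate-aware method trades this bit cost against the accuracy lost to quantization error, instead of fixing the grid first and measuring the rate afterwards as done here.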

Authors:Peijie Yu, Yifan Yang, Jinjian Li, Zelong Zhang, Haorui Wang, Xiao Feng, Feng Zhang
Title: $C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking
Abstract:
Agents based on large language models leverage tools to modify environments, revolutionizing how AI interacts with the physical world. Unlike traditional NLP tasks that rely solely on historical dialogue for responses, these agents must consider more complex factors, such as inter-tool relationships, environmental feedback and previous decisions, when making choices. Current research typically evaluates agents via multi-turn dialogues. However, it overlooks the influence of these critical factors on agent behavior. To bridge this gap, we present an open-source and high-quality benchmark $C^3$-Bench. This benchmark integrates attack concepts and applies univariate analysis to pinpoint key elements affecting agent robustness. Concretely, we design three challenges: navigate complex tool relationships, handle critical hidden information and manage dynamic decision paths. Complementing these challenges, we introduce fine-grained metrics, innovative data collection algorithms and reproducible evaluation methods. Extensive experiments are conducted on 49 mainstream agents, encompassing general fast-thinking, slow-thinking and domain-specific models. We observe that agents have significant shortcomings in handling tool dependencies, long context information dependencies and frequent policy-type switching. In essence, $C^3$-Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance. The benchmark is publicly available at https://github.com/TencentHunyuan/C3-Benchmark.
中文: 本文提出开源基准C³-Bench,通过复杂工具关系、关键隐藏信息和动态决策路径三大挑战评估AI智能体鲁棒性,揭示了现有模型在处理工具依赖和长上下文信息等方面的显著缺陷。
English: This paper introduces C³-Bench, an open-source benchmark designed to evaluate AI agents' robustness by challenging them with complex tool relationships, hidden information, and dynamic decision paths, revealing significant vulnerabilities in current models.

Authors:Tao Liu, Xutao Mao, Hongying Zan, Dixuan Zhang, Yifan Li, Haixin Liu, Lulu Kong, Jiaming Hou, Rui Li, YunLong Li, aoze zheng, Zhiqiang Zhang, Luo Zhewei, Kunli Zhang, Min Peng
Title: LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Complex Reasoning
Abstract:
Text-to-SQL is a critical task in natural language processing that aims to transform natural language questions into accurate and executable SQL queries. In real-world scenarios, these reasoning tasks are often accompanied by complex mathematical computations, domain knowledge, and hypothetical reasoning scenarios. However, existing large-scale Text-to-SQL datasets typically focus on business logic and task logic, neglecting critical factors such as vertical domain knowledge, complex mathematical reasoning, and hypothetical reasoning, which are essential for realistically reflecting the reasoning demands in practical applications and completing data querying and analysis. To bridge this gap, we introduce LogicCat, the first Text-to-SQL benchmark dataset specifically designed for complex reasoning and chain-of-thought parsing, encompassing physics, arithmetic, commonsense, and hypothetical reasoning scenarios. LogicCat comprises 4,038 English questions paired with 12,114 detailed chain-of-thought reasoning steps, spanning 45 databases across diverse domains, significantly surpassing existing datasets in complexity. Experimental results demonstrate that LogicCat substantially increases the task difficulty for current state-of-the-art models to at most 33.20% execution accuracy, indicating that this task remains exceptionally challenging. The advancement of LogicCat represents a crucial step toward developing systems suitable for real-world enterprise data analysis and autonomous query generation. We have released our dataset code at https://github.com/Ffunkytao/LogicCat.
中文: LogicCat是首个专为复杂推理设计的Text-to-SQL基准数据集,涵盖物理、数学、常识和假设推理场景,将当前最先进模型的执行准确率压制至33.20%,显著提升了实际应用中的推理挑战。
English: LogicCat is the first Text-to-SQL benchmark dataset specifically designed for complex reasoning scenarios, including physics, arithmetic, commonsense, and hypothetical reasoning; current state-of-the-art models reach at most 33.20% execution accuracy on it.

Authors:Wenchao Zhang, Jiahe Tian, Runze He, Jizhong Han, Jiao Dai, Miaomiao Feng, Wei Mi, Xiaodan Zhang
Title: Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation
Abstract:
Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting the alignment with real-world knowledge beyond prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a comprehensive benchmark designed to measure the alignment of generated images with real-world knowledge that extends beyond the explicit user prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. We further introduce ABPScore, a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts, which demonstrates strong correlations with human judgments. Through a comprehensive evaluation of 8 popular T2I models using ABP, we find that even state-of-the-art models, such as GPT-4o, face limitations in integrating simple real-world knowledge into generated images. To mitigate this issue, we introduce a training-free strategy within ABP, named Inference-Time Knowledge Injection (ITKI). By applying this strategy to optimize 200 challenging samples, we achieved an improvement of approximately 43% in ABPScore. The dataset and code are available at https://github.com/smile365317/ABP.
中文: 近期文本生成图像模型进展显著,但现有基准主要关注图像与提示的显式对齐,忽略了超越提示的现实世界知识对齐,因此开发了ABP基准和ABPScore指标,揭示了当前模型的局限性,并提出无需训练的策略使性能提升约43%。
English: Recent text-to-image models have advanced significantly, but existing benchmarks overlook alignment with real-world knowledge beyond prompts. The ABP benchmark and ABPScore metric address this gap and reveal limitations in current models, and a training-free strategy (ITKI) improves ABPScore by approximately 43% on 200 challenging samples.

Authors:Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng
Title: Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Abstract:
Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO's effectiveness in improving instruction-following ability across various settings. Code is available at https://github.com/Mimasss2/OTPO.
中文摘要:直接偏好优化(DPO)因对所有令牌同等处理而存在局限,因此提出的OTPO方法通过基于最优传输的令牌加权机制,重点优化语义显著的令牌对,有效提升了偏好优化的性能。
English Summary: Direct Preference Optimization (DPO) faces limitations by treating all tokens equally, so the proposed OTPO method introduces optimal transport-based token weighting to emphasize meaningful tokens and improve preference optimization effectiveness.
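
To illustrate what token-level weighting changes in a DPO-style objective, the sketch below computes the loss from per-token log-probabilities with an explicit weight per token. It is a minimal sketch under stated assumptions, not the OTPO implementation: the weights are plain inputs here, whereas the paper derives them via optimal transport, and the names `weighted_logratio`, `dpo_loss`, and the default `beta` are chosen for the example.

```python
import math


def weighted_logratio(policy_logps, ref_logps, weights):
    """Weighted sum over tokens of log pi(token) - log pi_ref(token)."""
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))


def dpo_loss(chosen, rejected, beta=0.1):
    """DPO-style loss for one preference pair.

    Each argument is a (policy_logps, ref_logps, weights) triple of
    equal-length lists. With all weights equal to 1 this reduces to the
    usual sequence-level DPO loss.
    """
    margin = weighted_logratio(*chosen) - weighted_logratio(*rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


chosen = ([-1.0, -2.0], [-1.5, -2.5], [1.0, 1.0])
rejected = ([-3.0], [-2.0], [1.0])
uniform = dpo_loss(chosen, rejected)

# Hypothetical weights that down-weight the second chosen token.
reweighted = dpo_loss(([-1.0, -2.0], [-1.5, -2.5], [1.0, 0.2]), rejected)
```

In this toy pair, reducing the weight of a token that favors the chosen response shrinks the margin and so raises the loss, which shows how the choice of weights steers which tokens drive the optimization.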

Authors:Guodong Du, Zitao Fang, Jing Li, Junlin Li, Runhua Jiang, Shuyang Yu, Yifei Guo, Yangneng Chen, Sim Kuan Goh, Ho-Kin Tang, Daojing He, Honghai Liu, Min Zhang
Title: Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer
Abstract:
Foundation models and their checkpoints have significantly advanced deep learning, boosting performance across various applications. However, fine-tuned models often struggle outside their specific domains and exhibit considerable redundancy. Recent studies suggest that combining a pruned fine-tuned model with the original pre-trained model can mitigate forgetting, reduce interference when merging model parameters across tasks, and improve compression efficiency. In this context, developing an effective pruning strategy for fine-tuned models is crucial. Leveraging the advantages of the task vector mechanism, we preprocess fine-tuned models by calculating the differences between them and the original model. Recognizing that different task vector subspaces contribute variably to model performance, we introduce a novel method called Neural Parameter Search (NPS-Pruning) for slimming down fine-tuned models. This method enhances pruning efficiency by searching through neural parameters of task vectors within low-rank subspaces. Our method has three key applications: enhancing knowledge transfer through pairwise model interpolation, facilitating effective knowledge fusion via model merging, and enabling the deployment of compressed models that retain near-original performance while significantly reducing storage costs. Extensive experiments across vision, NLP, and multi-modal benchmarks demonstrate the effectiveness and robustness of our approach, resulting in substantial performance gains. The code is publicly available at: https://github.com/duguodong7/NPS-Pruning.
中文摘要:该研究提出的神经参数搜索(NPS-Pruning)方法通过利用任务向量子空间有效压缩微调模型,在保持视觉、自然语言处理和多模态任务性能的同时,显著提升了知识迁移、模型融合和存储效率。
English Summary: The proposed Neural Parameter Search (NPS-Pruning) method effectively compresses fine-tuned models by leveraging task vector subspaces, enhancing knowledge transfer, model merging, and storage efficiency while maintaining performance across vision, NLP, and multi-modal tasks.
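
The task-vector preprocessing described in the abstract (the difference between fine-tuned and pre-trained weights) can be sketched with plain lists. The pruning step below uses simple magnitude pruning as a stand-in; it is not the NPS-Pruning search over low-rank subspaces, and all function names are assumptions made for this example.

```python
def task_vector(finetuned, pretrained):
    """Element-wise difference between fine-tuned and pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def prune_by_magnitude(vector, keep):
    """Keep the `keep` largest-magnitude entries and zero out the rest.
    (Stand-in for a real pruning strategy.)"""
    if keep <= 0:
        return [0.0] * len(vector)
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    kept = set(ranked[:keep])
    return [v if i in kept else 0.0 for i, v in enumerate(vector)]


def apply_task_vector(pretrained, vector, scale=1.0):
    """Rebuild a model as pre-trained weights plus a scaled task vector."""
    return [p + scale * v for p, v in zip(pretrained, vector)]


pretrained = [1.0, 2.0, 3.0]
finetuned = [1.5, 2.0, 1.0]
tv = task_vector(finetuned, pretrained)
slim = prune_by_magnitude(tv, keep=1)
restored = apply_task_vector(pretrained, slim)
```

Only the sparse vector `slim` needs to be stored alongside the shared pre-trained weights, which is where the storage savings described in the abstract come from.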

Authors:Xu Zhang, Kun Zhang, Wenxin Ma, Rongsheng Wang, Chenxu Wu, Yingtai Li, S. Kevin Zhou
Title: A General Knowledge Injection Framework for ICD Coding
Abstract:
ICD Coding aims to assign a wide range of medical codes to a medical text document, which is a popular and challenging task in the healthcare domain. To alleviate the problems of long-tail distribution and the lack of annotations of code-specific evidence, many previous works have proposed incorporating code knowledge to improve coding performance. However, existing methods often focus on a single type of knowledge and design specialized modules that are complex and incompatible with each other, thereby limiting their scalability and effectiveness. To address this issue, we propose GKI-ICD, a novel, general knowledge injection framework that integrates three key types of knowledge, namely ICD Description, ICD Synonym, and ICD Hierarchy, without specialized design of additional modules. The comprehensive utilization of the above knowledge, which exhibits both differences and complementarity, can effectively enhance the ICD coding performance. Extensive experiments on existing popular ICD coding benchmarks demonstrate the effectiveness of GKI-ICD, which achieves the state-of-the-art performance on most evaluation metrics. Code is available at https://github.com/xuzhang0112/GKI-ICD.
中文摘要:GKI-ICD框架无需专门模块即可整合三种ICD知识类型,通过全面实验在多数评估指标上实现了最优性能。
English Summary: The GKI-ICD framework effectively integrates three types of ICD knowledge without specialized modules, achieving state-of-the-art performance on most metrics through comprehensive experiments.

Authors:Jiabin Tang, Lianghao Xia, Zhonghang Li, Chao Huang
Title: AI-Researcher: Autonomous Scientific Innovation
Abstract:
The powerful reasoning capabilities of Large Language Models (LLMs) in mathematics and coding, combined with their ability to automate complex tasks through agentic frameworks, present unprecedented opportunities for accelerating scientific innovation. In this paper, we introduce AI-Researcher, a fully autonomous research system that transforms how AI-driven scientific discovery is conducted and evaluated. Our framework seamlessly orchestrates the complete research pipeline--from literature review and hypothesis generation to algorithm implementation and publication-ready manuscript preparation--with minimal human intervention. To rigorously assess autonomous research capabilities, we develop Scientist-Bench, a comprehensive benchmark comprising state-of-the-art papers across diverse AI research domains, featuring both guided innovation and open-ended exploration tasks. Through extensive experiments, we demonstrate that AI-Researcher achieves remarkable implementation success rates and produces research papers that approach human-level quality. This work establishes new foundations for autonomous scientific innovation that can complement human researchers by systematically exploring solution spaces beyond cognitive limitations.
中文: 大语言模型凭借强大的数学推理与自动化能力,催生了AI-Researcher这一全自主科研系统,它能完整执行从文献调研到论文撰写的科研流程,并通过实验证明可产出接近人类水平的研究成果,为自主科研奠定新基础。
English: Large Language Models' advanced reasoning and autonomous task automation enable AI-Researcher, a fully autonomous system that orchestrates the scientific research pipeline from literature review to manuscript preparation and produces papers approaching human-level quality, setting new foundations for AI-driven discovery.

Authors:Chun Wang, Xiaoran Pan, Zihao Pan, Haofan Wang, Yiren Song
Title: GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains
Abstract:
Recent advances in Visual Language Models (VLMs) have demonstrated exceptional performance in visual reasoning tasks. However, geo-localization presents unique challenges, requiring the extraction of multigranular visual cues from images and their integration with external world knowledge for systematic reasoning. Current approaches to geo-localization tasks often lack robust reasoning mechanisms and explainability, limiting their effectiveness. To address these limitations, we propose the Geo Reason Enhancement (GRE) Suite, a novel framework that augments VLMs with structured reasoning chains for accurate and interpretable location inference. The GRE Suite is systematically developed across three key dimensions: dataset, model, and benchmark. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision. Finally, we construct the Geo Reason Evaluation Benchmark (GREval-Bench), a comprehensive evaluation framework that assesses VLMs across diverse urban, natural, and landmark scenes to measure both coarse-grained (e.g., country, continent) and fine-grained (e.g., city, street) localization performance. Experimental results demonstrate that GRE significantly outperforms existing methods across all granularities of geo-localization tasks, underscoring the efficacy of reasoning-augmented VLMs in complex geographic inference. Code and data will be released at https://github.com/Thorin215/GRE.
中文: GRE套件通过结构化推理链增强视觉语言模型,实现精准可解释的地理定位,其数据集、模型和基准测试在所有粒度上均优于现有方法。
English: The GRE Suite enhances Visual Language Models with structured reasoning chains for precise and interpretable geo-localization, introducing a dataset, model, and benchmark that outperform existing methods across all granularities.

Authors:Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano
Title: MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention
Abstract:
Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N} d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at https://github.com/cjyaras/monarch-attention.
中文: 本文提出MonarchAttention方法,通过使用表现力强的Monarch矩阵实现次二次注意力近似,在降低计算复杂度的同时保持性能与硬件效率,适用于多种任务场景。
English: This paper introduces MonarchAttention, a sub-quadratic attention approximation method using expressive Monarch matrices that reduces computational complexity while maintaining performance and hardware efficiency across various tasks.
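
As a rough check on the complexity claim in this abstract, the ratio of operation counts can be worked out directly. This is only a sketch under two assumptions not stated in the abstract: that exact softmax attention costs $\Theta(N^2 d)$, and that $4K$ and $16K$ mean 4096 and 16384. Constant factors are ignored.

```latex
$$
\frac{N^2 d}{N\sqrt{N}\, d} = \sqrt{N},
\qquad \sqrt{256} = 16,
\quad \sqrt{4096} = 64,
\quad \sqrt{16384} = 128 .
$$
```

The reported wall-time speed-ups ($1.4\times$, $4.5\times$, $8.2\times$) are far smaller than these asymptotic ratios, which is expected since wall-time depends on constants and memory traffic, but they grow with $N$ in the same direction.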

Authors:Ziyang Cheng, Zhixun Li, Yuhan Li, Yixin Song, Kangyi Zhao, Dawei Cheng, Jia Li, Jeffrey Xu Yu
Title: Can LLMs Alleviate Catastrophic Forgetting in Graph Continual Learning? A Systematic Study
Abstract:
Nowadays, real-world data, including graph-structure data, often arrives in a streaming manner, which means that learning systems need to continuously acquire new knowledge without forgetting previously learned information. Although substantial existing works attempt to address catastrophic forgetting in graph machine learning, they are all based on training from scratch with streaming data. With the rise of pretrained models, an increasing number of studies have leveraged their strong generalization ability for continual learning. Therefore, in this work, we attempt to answer whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL). We first point out that current experimental setups for GCL have significant flaws, as the evaluation stage may lead to task ID leakage. Then, we evaluate the performance of LLMs in more realistic scenarios and find that even minor modifications can lead to outstanding results. Finally, based on extensive experiments, we propose a simple-yet-effective method, Simple Graph Continual Learning (SimGCL), that surpasses the previous state-of-the-art GNN-based baseline by around 20% under the rehearsal-free constraint. To facilitate reproducibility, we have developed an easy-to-use benchmark LLM4GCL for training and evaluating existing GCL methods. The code is available at: https://github.com/ZhixunLEE/LLM4GCL.
Chinese: 本研究探讨大型语言模型(LLMs)能否缓解图持续学习中的灾难性遗忘,提出了一种简单而有效的SimGCL方法,显著超越先前基于图神经网络的方法,并推出了便于未来研究使用的基准测试平台。
English: This study explores whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL), proposing a simple-yet-effective method called SimGCL that significantly outperforms previous GNN-based approaches and introducing an easy-to-use benchmark for future research.

Authors:Ziyang Cheng, Zhixun Li, Yuhan Li, Yixin Song, Kangyi Zhao, Dawei Cheng, Jia Li, Hong Cheng, Jeffrey Xu Yu
Title: Can LLMs Alleviate Catastrophic Forgetting in Graph Continual Learning? A Systematic Study
Abstract:
Nowadays, real-world data, including graph-structure data, often arrives in a streaming manner, which means that learning systems need to continuously acquire new knowledge without forgetting previously learned information. Although substantial existing works attempt to address catastrophic forgetting in graph machine learning, they are all based on training from scratch with streaming data. With the rise of pretrained models, an increasing number of studies have leveraged their strong generalization ability for continual learning. Therefore, in this work, we attempt to answer whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL). We first point out that current experimental setups for GCL have significant flaws, as the evaluation stage may lead to task ID leakage. Then, we evaluate the performance of LLMs in more realistic scenarios and find that even minor modifications can lead to outstanding results. Finally, based on extensive experiments, we propose a simple-yet-effective method, Simple Graph Continual Learning (SimGCL), that surpasses the previous state-of-the-art GNN-based baseline by around 20% under the rehearsal-free constraint. To facilitate reproducibility, we have developed an easy-to-use benchmark LLM4GCL for training and evaluating existing GCL methods. The code is available at: https://github.com/ZhixunLEE/LLM4GCL.
Chinese: 本研究探讨大型语言模型(LLMs)能否缓解图持续学习中的灾难性遗忘,提出了一种简单而有效的SimGCL方法,显著超越先前基于图神经网络的方法,并推出了便于未来研究使用的基准测试平台。
English: This study explores whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL), proposing a simple-yet-effective method called SimGCL that significantly outperforms previous GNN-based approaches and introducing an easy-to-use benchmark for future research.

Authors:Rafiu Adekoya Badekale, Adewale Akinfaderin
Title: AI-Driven Climate Policy Scenario Generation for Sub-Saharan Africa
Abstract:
Climate policy scenario generation and evaluation have traditionally relied on integrated assessment models (IAMs) and expert-driven qualitative analysis. These methods enable stakeholders, such as policymakers and researchers, to anticipate impacts, plan governance strategies, and develop mitigation measures. However, traditional methods are often time-intensive, reliant on simple extrapolations of past trends, and limited in capturing the complex and interconnected nature of energy and climate issues. With the advent of artificial intelligence (AI), particularly generative AI models trained on vast datasets, these limitations can be addressed, ensuring robustness even under limited data conditions. In this work, we explore a novel method that employs generative AI, specifically large language models (LLMs), to simulate climate policy scenarios for Sub-Saharan Africa. These scenarios focus on energy transition themes derived from the historical United Nations Climate Change Conference (COP) documents. By leveraging generative models, the project aims to create plausible and diverse policy scenarios that align with regional climate goals and energy challenges. Given limited access to human evaluators, automated techniques were employed for scenario evaluation. We generated policy scenarios using the llama3.2-3B model. Of the 34 generated responses, 30 (88%) passed expert validation, accurately reflecting the intended impacts provided in the corresponding prompts. We compared these validated responses against assessments from a human climate expert and two additional LLMs (gemma2-2B and mistral-7B). Our structured, embedding-based evaluation framework shows that generative AI effectively generates scenarios that are coherent, relevant, plausible, and diverse. This approach offers a transformative tool for climate policy planning in data-constrained regions.
中文: 传统气候政策方法依赖综合评估模型和专家分析,而生成式AI通过大语言模型为撒哈拉以南非洲生成多样且合理的能源转型情景,经专家验证有效,为数据受限地区提供了变革性规划工具。
English: Traditional climate policy methods are being transformed by generative AI, which efficiently creates diverse and plausible scenarios, as demonstrated by a study using LLMs for Sub-Saharan Africa's energy transition with high expert validation rates.

Authors:Yang Liu, Silin Cheng, Xinwei He, Sebastien Ourselin, Lei Tan, Gen Luo
Title: WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation
Abstract:
Weakly supervised referring expression comprehension (WREC) and segmentation (WRES) aim to learn object grounding based on a given expression using weak supervision signals like image-text pairs. While these tasks have traditionally been modeled separately, we argue that they can benefit from joint learning in a multi-task framework. To this end, we propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. Specifically, the WREC branch is formulated as anchor-based contrastive learning, which also acts as a teacher to supervise the WRES branch. In WeakMCN, we propose two innovative designs to facilitate multi-task collaboration, namely Dynamic Visual Feature Enhancement (DVFE) and Collaborative Consistency Module (CCM). DVFE dynamically combines various pre-trained visual knowledge to meet different task requirements, while CCM promotes cross-task consistency from the perspective of optimization. Extensive experimental results on three popular REC and RES benchmarks, i.e., RefCOCO, RefCOCO+, and RefCOCOg, consistently demonstrate performance gains of WeakMCN over state-of-the-art single-task alternatives, e.g., up to 3.91% and 13.11% on RefCOCO for WREC and WRES tasks, respectively. Furthermore, experiments also validate the strong generalization ability of WeakMCN in both semi-supervised REC and RES settings against existing methods, e.g., +8.94% for semi-REC and +7.71% for semi-RES on 1% RefCOCO. The code is publicly available at https://github.com/MRUIL/WeakMCN.
中文: 本文提出WeakMCN多任务协同网络,通过双分支架构结合动态视觉特征增强和协同一致性模块,联合学习弱监督指代表达式理解与分割任务,在多个基准测试中显著超越单任务方法并展现出优异的半监督泛化能力。
English: The paper introduces WeakMCN, a multi-task collaborative network that jointly learns weakly supervised referring expression comprehension and segmentation through a dual-branch architecture with innovative modules, achieving superior performance on benchmarks and strong generalization in semi-supervised settings.

Authors:Zixiang Xu, Yanbo Wang, Yue Huang, Xiuying Chen, Jieyu Zhao, Meng Jiang, Xiangliang Zhang
Title: Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models
Abstract:
Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual performance consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. The extensive experiments demonstrate that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing over 50% accuracy drops in target languages across a wide range of models. Moreover, further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at https://github.com/xzx34/Cross-Lingual-Pitfalls.
中文: 本文提出了一种利用集束搜索和基于大语言模型的模拟的新方法,有效识别大语言模型的跨语言弱点,揭示了目标语言中准确率显著下降的现象,并证明语言亲缘关系越近的语种表现模式越相似。
English: This paper introduces a novel methodology using beam search and LLM-based simulation to efficiently identify cross-lingual weaknesses in Large Language Models, revealing significant accuracy drops in target languages and demonstrating that linguistically related languages share similar performance patterns.

Authors:Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Title: DVD-Quant: Data-free Video Diffusion Transformers Quantization
Abstract:
Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on lengthy, computation-heavy calibration procedures, and (2) considerable performance deterioration after quantization. To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Progressive Bounded Quantization (PBQ) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) $\delta$-Guided Bit Switching ($\delta$-GBS) for adaptive bit-width allocation. Extensive experiments across multiple video generation benchmarks demonstrate that DVD-Quant achieves an approximately 2$\times$ speedup over full-precision baselines on HunyuanVideo while maintaining visual fidelity. Notably, DVD-Quant is the first to enable W4A4 PTQ for Video DiTs without compromising video quality. Code and models will be available at https://github.com/lhxcs/DVD-Quant.
中文: DVD-Quant提出了一种无数据量化框架,通过PBQ、ARQ和δ-GBS三项创新技术,实现了视频DiT模型的W4A4训练后量化,在保持视觉质量的同时提速约2倍,且无需校准数据。
English: DVD-Quant introduces a data-free quantization framework with three innovations (PBQ, ARQ, and δ-GBS) that enable efficient W4A4 post-training quantization for Video DiTs, achieving an approximately 2× speedup while preserving visual quality without calibration data.

Authors:Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Haotong Qin, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Title: DVD-Quant: Data-free Video Diffusion Transformers Quantization
Abstract:
Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on computation-heavy and inflexible calibration procedures, and (2) considerable performance deterioration after quantization. To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Bounded-init Grid Refinement (BGR) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) $\delta$-Guided Bit Switching ($\delta$-GBS) for adaptive bit-width allocation. Extensive experiments across multiple video generation benchmarks demonstrate that DVD-Quant achieves an approximately 2$\times$ speedup over full-precision baselines on advanced DiT models while maintaining visual fidelity. Notably, DVD-Quant is the first to enable W4A4 PTQ for Video DiTs without compromising video quality. Code and models will be available at https://github.com/lhxcs/DVD-Quant.
中文: DVD-Quant提出了一种无数据量化框架,通过BGR、ARQ和δ-GBS三项创新技术,实现了视频DiT模型的W4A4训练后量化,在保持视觉质量的同时提速约2倍,且无需校准数据。
English: DVD-Quant introduces a data-free quantization framework with three innovations—BGR, ARQ, and δ-GBS—that enable efficient W4A4 post-training quantization for Video DiTs, achieving near 2× speedup while preserving visual quality without calibration data.

Authors:Yicheng Lin, Yunlong Jiang, Xujia Jiao, Bin Han
Title: Why Not Replace? Sustaining Long-Term Visual Localization via Handcrafted-Learned Feature Collaboration on CPU
Abstract:
Robust long-term visual localization in complex industrial environments is critical for mobile robotic systems. Existing approaches face limitations: handcrafted features are illumination-sensitive, learned features are computationally intensive, and semantic- or marker-based methods are environmentally constrained. Handcrafted and learned features share similar representations but differ functionally. Handcrafted features are optimized for continuous tracking, while learned features excel in wide-baseline matching. Their complementarity calls for integration rather than replacement. Building on this, we propose a hierarchical localization framework. It leverages real-time handcrafted feature extraction for relative pose estimation. In parallel, it employs selective learned keypoint detection on optimized keyframes for absolute positioning. This design enables CPU-efficient, long-term visual localization. Experiments proceed through three validation phases: first, establishing feature complementarity through comparative analysis; second, profiling computational latency across algorithm stages on CPU platforms; and finally, evaluating under photometric variations (including seasonal transitions and diurnal cycles), which demonstrates a 47% average error reduction with significantly improved localization consistency. The code implementation is publicly available at https://github.com/linyicheng1/ORB_SLAM3_localization.
中文摘要:该研究提出一种分层视觉定位框架,通过实时手工特征进行相对位姿跟踪,结合选择性学习特征实现绝对定位,在光照变化下平均误差降低47%,显著提升定位一致性并保持CPU高效运行。
English Summary: The proposed hierarchical visual localization framework integrates real-time handcrafted features for relative pose tracking with selective learned features for absolute positioning, achieving 47% error reduction and enhanced consistency under photometric variations while maintaining CPU efficiency.

Authors:Jian Liang, Wenke Huang, Xianda Guo, Guancheng Wan, Bo Du, Mang Ye
Title: ThanoRA: Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation
Abstract:
Low-Rank Adaptation (LoRA) is widely adopted for downstream fine-tuning of foundation models due to its efficiency and zero additional inference cost. Many real-world applications require foundation models to specialize in multiple tasks simultaneously, motivating the need for efficient multi-task adaptation. While recent approaches integrate LoRA with mixture-of-experts (MoE) to address this, the use of routers prevents parameter mergeability, which increases inference overhead and hinders unified multi-task adaptation, thereby limiting deployment practicality. In this work, we propose ThanoRA, a Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation framework that enables multi-task adaptation while preserving the inference efficiency of LoRA. ThanoRA jointly models task heterogeneity and mitigates subspace interference throughout training. Specifically, motivated by inherent differences in complexity and heterogeneity across tasks, ThanoRA constructs task-specific LoRA subspaces at initialization, enabling fine-grained knowledge injection aligned with task heterogeneity. Furthermore, to prevent task interference and subspace collapse during multi-task training, ThanoRA introduces a subspace-preserving regularization that maintains the independence of task-specific representations. With the synergy of both components, ThanoRA enables efficient and unified multi-task adaptation. Extensive experiments across multimodal and text-only benchmarks under varying multi-task mixtures demonstrate that ThanoRA consistently achieves robust and superior performance over strong baselines without introducing additional inference overhead. Our code is publicly available at: https://github.com/LiangJian24/ThanoRA.
中文摘要:ThanoRA是一种无需引入额外结构或推理开销的多任务低秩适配框架,通过异构感知的子空间分配和多样性保持训练,在多任务学习中表现卓越,其性能超越现有方法及独立任务微调。
English Summary: ThanoRA is a multi-task low-rank adaptation framework that effectively enhances foundation models' performance across multiple tasks without adding extra structures or inference costs, outperforming existing methods and even separate task-specific fine-tuning.
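
The mergeability point in this abstract (LoRA adds no inference cost, whereas a router prevents folding the adapters into the base weights) can be illustrated with a small sketch. This is generic LoRA arithmetic, not ThanoRA itself; the matrix sizes, the random values, and the omission of the usual scaling factor are assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_out, d_in, rank = 4, 6, 2          # hypothetical sizes
W = rand_matrix(d_out, d_in, rng)    # frozen base weight
B = rand_matrix(d_out, rank, rng)    # LoRA up-projection
A = rand_matrix(rank, d_in, rng)     # LoRA down-projection
x = rand_matrix(d_in, 1, rng)        # one input column vector

# Unmerged: base path plus a separate low-rank path (extra matmuls per call).
y_unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged: fold the update into the base weight once, then a single matmul.
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)

max_diff = max(abs(u[0] - m[0]) for u, m in zip(y_unmerged, y_merged))
print(f"max difference between merged and unmerged outputs: {max_diff:.2e}")
```

Because the merged weight has the same shape as the base weight, inference costs the same as the original layer. An input-dependent router that picks among several adapters cannot be folded in this way, since the effective weight would change per input.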

Authors:Jian Liang, Wenke Huang, Xianda Guo, Guancheng Wan, Bo Du, Mang Ye
Title: ThanoRA: Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation
Abstract:
Low-Rank Adaptation (LoRA) is widely adopted for downstream fine-tuning of foundation models due to its efficiency and zero additional inference cost. Many real-world applications require foundation models to specialize in several specific tasks simultaneously, motivating the need for efficient multi-task downstream adaptation. To address this need, existing studies have primarily explored two directions: Model Merging with LoRA, which shows advantages in training-free scenarios but still lags behind multi-task training in overall performance; and MoE-based LoRA approaches, which improve multi-task learning performance but introduce routers that hinder the mergeability of LoRA parameters and incur considerable inference overhead, thereby limiting real-world deployment practicality. To this end, we propose ThanoRA, a Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation framework that enables effective, efficient and unified multi-task downstream adaptation without introducing additional structure. ThanoRA performs multi-task learning by tailoring subspace allocation at initialization and enforcing diversity preservation throughout training: it allocates varying dimensions to construct task-specific low-rank subspaces driven by inter-task heterogeneity, enabling fine-grained knowledge injection, while diversity-preserving regularization mitigates task interference and subspace collapse, thereby fully exploiting the low-rank capacity. Extensive experiments across multimodal and text-only benchmarks under varying multi-task mixtures demonstrate that ThanoRA consistently outperforms strong baselines, surpassing even separate task-specific fine-tuning, while introducing no additional structures or inference overhead. Our code will be publicly available at: https://github.com/LiangJian24/ThanoRA.
中文摘要:ThanoRA是一种无需引入额外结构或推理开销的多任务低秩适配框架,通过异构感知的子空间分配和多样性保持训练,在多任务学习中表现卓越,其性能超越现有方法及独立任务微调。
English Summary: ThanoRA is a multi-task low-rank adaptation framework that effectively enhances foundation models' performance across multiple tasks without adding extra structures or inference costs, outperforming existing methods and even separate task-specific fine-tuning.

Authors:Md. Tanzib Hosain, Rajan Das Gupta, Md. Kishor Morol
Title: Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models
Abstract:
In this work, we provide DZEN, a dataset of parallel Dzongkha and English test questions for Bhutanese middle and high school students. The over 5K questions in our collection span a variety of scientific topics and include factual, application, and reasoning-based questions. We use our parallel dataset to test a number of Large Language Models (LLMs) and find a significant performance difference between the models in English and Dzongkha. We also look at different prompting strategies and discover that Chain-of-Thought (CoT) prompting works well for reasoning questions but less well for factual ones. We also find that adding English translations enhances the precision of Dzongkha question responses. Our results point to exciting avenues for further study to improve LLM performance in Dzongkha and, more generally, in low-resource languages. We release the dataset at: https://github.com/kraritt/llm_dzongkha_evaluation.
中文: 本研究发布了DZEN数据集,包含5000多道不丹中学生使用的宗喀语与英语平行试题,发现大型语言模型在两种语言间存在显著性能差异,并证明思维链提示和英语翻译能有效提升宗喀语问题的回答准确率。
English: This study introduces DZEN, a parallel dataset of over 5,000 Dzongkha and English questions for Bhutanese students, revealing significant performance gaps in LLMs between the two languages and showing that Chain-of-Thought prompting and English translations improve Dzongkha question accuracy.

Authors:Li Wang, Guangqi Yang, Lei Yang, Ziying Song, Xinyu Zhang, Ying Chen, Lin Liu, Junjie Gao, Zhiwei Li, Qingshan Yang, Jun Li, Liangliang Wang, Wenhao Yu, Bin Xu, Weida Wang, Huaping Liu
Title: S2R-Bench: A Sim-to-Real Evaluation Benchmark for Autonomous Driving
Abstract:
Safety is a long-standing and ultimate pursuit in the development of autonomous driving systems, with a significant portion of safety challenges arising from perception. How to effectively evaluate the safety and reliability of perception algorithms is becoming an emerging issue. Despite its critical importance, existing perception methods exhibit limited robustness, primarily because they rely on benchmarks that are entirely simulated, which fail to align predicted results with actual outcomes, particularly under the extreme weather conditions and sensor anomalies that are prevalent in real-world scenarios. To fill this gap, in this study, we propose a Sim-to-Real Evaluation Benchmark for Autonomous Driving (S2R-Bench). We collect diverse sensor anomaly data under various road conditions to evaluate the robustness of autonomous driving perception methods in a comprehensive and realistic manner. This is the first corruption robustness benchmark based on real-world scenarios, encompassing various road conditions, weather conditions, lighting intensities, and time periods. By comparing real-world data with simulated data, we demonstrate the reliability and practical significance of the collected data for real-world applications. We hope that this dataset will advance future research and contribute to the development of more robust perception models for autonomous driving. This dataset is released on https://github.com/adept-thu/S2R-Bench.
中文: 本研究提出了首个基于真实场景的自动驾驶感知鲁棒性基准S2R-Bench,通过收集多路况下的传感器异常数据并与仿真数据对比,全面评估算法在极端天气等实际场景中的可靠性。
English: This study introduces S2R-Bench, the first real-world corruption robustness benchmark for autonomous driving perception, designed to evaluate algorithm reliability under diverse conditions like weather anomalies and sensor failures by comparing real and simulated data.

Authors:Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu
Title: MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation
Abstract:
Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.
中文: MAVL基准和SylAVL-CoT模型通过整合多模态线索与音节约束,显著提升了可唱歌词翻译的自然度与准确性,全面优于纯文本翻译方法。
English: The MAVL benchmark and SylAVL-CoT model enhance singable lyrics translation by integrating multimodal cues and syllabic constraints, significantly outperforming text-only methods in both naturalness and accuracy.

Authors:Faithful Chiagoziem Onwuegbuche, Adelodun Olaoluwa, Anca Delia Jurcut, Liliana Pasquale
Title: MLRan: A Behavioural Dataset for Ransomware Analysis and Detection
Abstract:
Ransomware remains a critical threat to cybersecurity, yet publicly available datasets for training machine learning-based ransomware detection models are scarce and often have limited sample size, diversity, and reproducibility. In this paper, we introduce MLRan, a behavioural ransomware dataset, comprising over 4,800 samples across 64 ransomware families and a balanced set of goodware samples. The samples span from 2006 to 2024 and encompass the four major types of ransomware: locker, crypto, ransomware-as-a-service, and modern variants. We also propose guidelines (GUIDE-MLRan), inspired by previous work, for constructing high-quality behavioural ransomware datasets, which informed the curation of our dataset. We evaluated the ransomware detection performance of several machine learning (ML) models using MLRan. For this purpose, we performed feature selection by conducting mutual information filtering to reduce the initial 6.4 million features to 24,162, followed by recursive feature elimination, yielding 483 highly informative features. The ML models achieved an accuracy, precision and recall of up to 98.7%, 98.9%, 98.5%, respectively. Using SHAP and LIME, we identified critical indicators of malicious behaviour, including registry tampering, strings, and API misuse. The dataset and source code for feature extraction, selection, ML training, and evaluation are available publicly to support replicability and encourage future research, which can be found at https://github.com/faithfulco/mlran.
中文: 本文提出了MLRan行为勒索软件数据集,包含64个家族的4800多个样本,并证明基于该数据集的机器学习模型检测准确率高达98.7%,同时通过可解释性分析揭示了注册表篡改等关键恶意行为特征。
English: This paper introduces MLRan, a comprehensive behavioral ransomware dataset with over 4,800 samples across 64 families, and demonstrates that machine learning models trained on it achieve up to 98.7% accuracy in detection by identifying key malicious behaviors like registry tampering.
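
The first feature-selection stage described in this abstract, mutual information filtering, can be sketched as follows. This is a minimal illustration on synthetic binary flags, not the authors' pipeline: the data, the 10% noise level, and the choice of keeping two features are all hypothetical, and the recursive feature elimination stage that follows in the paper is not shown.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

rng = random.Random(0)
n_samples = 400
labels = [rng.randint(0, 1) for _ in range(n_samples)]  # 1 = ransomware, 0 = goodware

def noisy_copy(y):
    """Return the label, flipped with probability 0.1."""
    return y if rng.random() > 0.1 else 1 - y

# Hypothetical behaviour flags: features 0 and 1 track the label with noise,
# features 2..9 are independent coin flips.
features = []
for j in range(10):
    if j < 2:
        features.append([noisy_copy(y) for y in labels])
    else:
        features.append([rng.randint(0, 1) for _ in labels])

# Score every feature against the label and keep the top-k.
scores = [mutual_information(col, labels) for col in features]
top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:2]
print("selected feature indices:", sorted(top_k))
```

A filter of this kind scores each feature independently, which is why it scales to millions of candidate features; a wrapper method such as recursive feature elimination is more expensive and is therefore typically applied only to the reduced set.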

Authors:Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Title: PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs
Abstract:
Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead due to the large Key-Value (KV) Cache. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios. However, directly applying existing methods to long-CoT LLMs causes significant performance degradation due to the following two reasons: (1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error. (2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration fails to account for the distribution of less frequent channels in the Key Cache, resulting in performance loss. We propose Progressive Mixed-Precision KV Cache Quantization (PM-KVQ) for long-CoT LLMs to address the above issues in two ways: (1) To reduce cumulative error, we design a progressive quantization strategy to gradually lower the bit-width of KV Cache in each block. Then, we propose block-wise memory allocation to assign a higher bit-width to more sensitive transformer blocks. (2) To increase the calibration length without additional overhead, we propose a new calibration strategy that applies positional interpolation to short calibration data to approximate the data distribution of long-context data. Extensive experiments on 7B-70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget. Our code is available at https://github.com/thu-nics/PM-KVQ.
中文: 针对长思维链大语言模型中KV缓存导致的高内存开销问题,现有量化方法因累积误差和短上下文校准而性能下降;提出的PM-KVQ通过渐进混合精度量化和位置插值校准策略,在相同内存预算下将推理基准性能提升最高达8%。
English: Long Chain-of-Thought reasoning in Large Language Models incurs significant KV Cache memory overhead, and existing quantization methods degrade performance because of cumulative errors and short-context calibration; the proposed PM-KVQ addresses both with progressive mixed-precision quantization and positional interpolation, improving reasoning benchmark performance by up to 8% under the same memory budget.
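
The motivation for lowering the bit-width gradually, rather than quantizing aggressively from the first decoding step, can be seen in how quantization error grows as bits are removed. The sketch below uses plain uniform quantization on random values as a stand-in for KV Cache entries; it is not the PM-KVQ algorithm, and the data and the bit-widths tried are assumptions for illustration.

```python
import random

def quantize(values, bits):
    """Uniform asymmetric quantization to `bits` bits, followed by dequantization."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
cache = [rng.gauss(0.0, 1.0) for _ in range(1024)]  # stand-in for KV Cache entries

# Reconstruction error at each candidate bit-width.
errors = {bits: mean_abs_error(cache, quantize(cache, bits)) for bits in (16, 8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits:2d}-bit mean absolute error: {err:.6f}")
```

Each halving of the bit-width increases the error substantially, so keeping entries at higher precision while memory allows and dropping precision only as the cache grows limits how much error accumulates over a long generation.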

Authors:Yuetong Fang, Deming Zhou, Ziqing Wang, Hongwei Ren, ZeCui Zeng, Lusong Li, Shibo Zhou, Renjing Xu
Title: Spiking Transformers Need High Frequency Information
Abstract:
Spiking Transformers offer an energy-efficient alternative to conventional deep learning by transmitting information solely through binary (0/1) spikes. However, there remains a substantial performance gap compared to artificial neural networks. A common belief is that their binary and sparse activation transmission leads to information loss, thus degrading feature representation and accuracy. In this work, however, we reveal for the first time that spiking neurons preferentially propagate low-frequency information. We hypothesize that the rapid dissipation of high-frequency components is the primary cause of performance degradation. For example, on CIFAR-100, adopting Avg-Pooling (low-pass) for token mixing lowers performance to 76.73%; interestingly, replacing it with Max-Pooling (high-pass) pushes the top-1 accuracy to 79.12%, surpassing the well-tuned Spikformer baseline by 0.97%. Accordingly, we introduce Max-Former that restores high-frequency signals through two frequency-enhancing operators: extra Max-Pooling in patch embedding and Depth-Wise Convolution in place of self-attention. Notably, our Max-Former (63.99 M) hits the top-1 accuracy of 82.39% on ImageNet, showing a +7.58% improvement over Spikformer with comparable model size (74.81%, 66.34 M). We hope this simple yet effective solution inspires future research to explore the distinctive nature of spiking neural networks, beyond the established practice in standard deep learning. Code is available at https://github.com/bic-L/Spiking-Transformers-Need-High-Frequency-Information.
中文: 脉冲变换器因偏好传播低频信息而导致高频信号损失和性能下降;提出的Max-Former模型通过引入最大池化和深度卷积增强高频成分,在ImageNet等基准测试中显著提升了准确率。
English: Spiking Transformers are energy-efficient but underperform due to their tendency to propagate low-frequency information, which causes high-frequency signal loss and reduced accuracy; the proposed Max-Former model enhances performance by incorporating high-frequency components through Max-Pooling and Depth-Wise Convolution, achieving significant accuracy improvements on benchmarks like ImageNet.
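
The pooling contrast in the abstract above can be illustrated on a toy signal. The following is a minimal sketch, not the authors' code: the helper names (`pool_1d`, `total_variation`) and the eight-sample spike train are assumptions made for illustration, and total variation is used only as a rough proxy for how much sharp, high-frequency structure survives pooling.

```python
def pool_1d(signal, window, reduce_fn):
    """Non-overlapping 1D pooling: apply reduce_fn to each window of the signal."""
    return [reduce_fn(signal[i:i + window])
            for i in range(0, len(signal) - window + 1, window)]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def total_variation(signal):
    """Sum of absolute differences between neighbours (rough sharpness proxy)."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))


# Hypothetical binary spike train with a single isolated spike.
spikes = [0, 0, 0, 1, 0, 0, 0, 0]

avg_out = pool_1d(spikes, 2, mean)  # the spike is averaged with its neighbour
max_out = pool_1d(spikes, 2, max)   # the spike keeps its full amplitude

print(avg_out, total_variation(avg_out))
print(max_out, total_variation(max_out))
```

In this toy case average pooling halves the spike's amplitude while max pooling preserves it, which is consistent with the low-pass versus high-pass intuition in the abstract; it does not reproduce the paper's frequency analysis.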

Authors:Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun
Title: Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators
Abstract:
Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge presents broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.
中文摘要:Flex-Judge是一种基于推理的多模态评估模型,通过少量文本推理数据即可泛化至多种模态和评估形式,以更少训练资源实现了与先进模型相媲美的性能。
English Summary: Flex-Judge is a reasoning-guided multimodal judge model that uses minimal textual reasoning data to generalize effectively across multiple modalities and evaluation formats, achieving competitive performance with fewer training resources.

Authors:Dongyang Jin, Chao Fan, Jingzhe Ma, Jingkai Zhou, Weihua Chen, Shiqi Yu
Title: On Denoising Walking Videos for Gait Recognition
Abstract:
To capture individual gait patterns, excluding identity-irrelevant cues in walking videos, such as clothing texture and color, remains a persistent challenge for vision-based gait recognition. Traditional silhouette- and pose-based methods, though theoretically effective at removing such distractions, often fall short of high accuracy due to their sparse and less informative inputs. Emerging end-to-end methods address this by directly denoising RGB videos using human priors. Building on this trend, we propose DenoisingGait, a novel gait denoising method. Inspired by the philosophy that "what I cannot create, I do not understand", we turn to generative diffusion models, uncovering how they partially filter out irrelevant factors for gait understanding. Additionally, we introduce a geometry-driven Feature Matching module, which, combined with background removal via human silhouettes, condenses the multi-channel diffusion features at each foreground pixel into a two-channel direction vector. Specifically, the proposed within- and cross-frame matching respectively capture the local vectorized structures of gait appearance and motion, producing a novel flow-like gait representation termed Gait Feature Field, which further reduces residual noise in diffusion features. Experiments on the CCPG, CASIA-B*, and SUSTech1K datasets demonstrate that DenoisingGait achieves a new SoTA performance in most cases for both within- and cross-domain evaluations. Code is available at https://github.com/ShiqiYu/OpenGait.
中文摘要:DenoisingGait提出了一种新颖的步态识别方法,利用生成扩散模型和几何驱动的特征匹配模块来过滤身份无关特征并创建流式步态特征场,在多个数据集上实现了最先进的性能。
English Summary: DenoisingGait introduces a novel gait recognition method using generative diffusion models and a geometry-driven Feature Matching module to filter out identity-irrelevant cues and create a flow-like Gait Feature Field, achieving state-of-the-art performance on multiple datasets.

Authors:Wentao Hu, Wengyu Zhang, Yiyang Jiang, Chen Jason Zhang, Xiaoyong Wei, Qing Li
Title: Removal of Hallucination on Hallucination: Debate-Augmented RAG
Abstract:
Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external knowledge, yet it introduces a critical issue: erroneous or biased retrieval can mislead generation, compounding hallucinations, a phenomenon we term Hallucination on Hallucination. To address this, we propose Debate-Augmented RAG (DRAG), a training-free framework that integrates Multi-Agent Debate (MAD) mechanisms into both retrieval and generation stages. In retrieval, DRAG employs structured debates among proponents, opponents, and judges to refine retrieval quality and ensure factual reliability. In generation, DRAG introduces asymmetric information roles and adversarial debates, enhancing reasoning robustness and mitigating factual inconsistencies. Evaluations across multiple tasks demonstrate that DRAG improves retrieval reliability, reduces RAG-induced hallucinations, and significantly enhances overall factual accuracy. Our code is available at https://github.com/Huenao/Debate-Augmented-RAG.
Chinese: 摘要提出了一种无需训练的辩论增强检索生成框架(DRAG),通过在检索阶段引入结构化辩论和在生成阶段采用对抗性辩论,有效提升检索可靠性、减少幻觉现象,从而显著提高事实准确性。
English: The abstract introduces Debate-Augmented RAG (DRAG), a training-free framework that integrates multi-agent debate mechanisms to improve retrieval reliability and reduce hallucinations by employing structured debates during retrieval and adversarial debates in generation, thereby enhancing factual accuracy.

Authors:Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, Huaiyu Wan
Title: Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs
Abstract:
Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model's exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data are available at https://github.com/LiaoMengqi/E3-RL4LLMs.
中文: 本研究提出动态分配训练资源和自适应调整温度的策略,以优化大型语言模型的强化学习过程,实现高效训练并保持探索能力。
English: This study introduces a dynamic rollout allocation mechanism and adaptive temperature adjustment strategy to enhance reinforcement learning for large language models, enabling more efficient training and sustained exploratory capacity.
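
The two ideas in this entry, difficulty-aware rollout budgets and entropy-stabilizing temperature control, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the abstract does not give the allocation rule or the temperature update, so the proportional split by failed probe rollouts, the fixed step size, the bounds, and the function names `allocate_rollouts` and `adjust_temperature` are all hypothetical.

```python
def allocate_rollouts(num_failed, total_budget, min_rollouts=2):
    """Split a rollout budget across questions in proportion to difficulty.

    num_failed[i] is the number of failed probe rollouts for question i
    (an assumed difficulty signal). Every question gets at least
    min_rollouts; the rest is shared proportionally using integer
    arithmetic, with leftovers going to the largest remainders.
    """
    n = len(num_failed)
    spare = total_budget - n * min_rollouts
    if spare < 0:
        raise ValueError("budget too small for the minimum allocation")
    weights = list(num_failed) if sum(num_failed) > 0 else [1] * n
    total = sum(weights)
    base = [spare * w // total for w in weights]
    remainders = [spare * w % total for w in weights]
    alloc = [min_rollouts + b for b in base]
    leftover = spare - sum(base)
    by_remainder = sorted(range(n), key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


def adjust_temperature(temperature, entropy, target_entropy,
                       step=0.05, bounds=(0.5, 1.5)):
    """Nudge the sampling temperature toward a target policy entropy."""
    if entropy < target_entropy:
        temperature += step   # too little exploration: sample more diversely
    elif entropy > target_entropy:
        temperature -= step   # too much randomness: sample more sharply
    return min(max(temperature, bounds[0]), bounds[1])


# Three questions that failed 1, 5 and 9 of 10 probe rollouts.
budget = allocate_rollouts([1, 5, 9], total_budget=24)
print(budget)
```

With these hypothetical inputs the hardest question receives the largest share of the 24 rollouts, while the easy one keeps only slightly more than the minimum.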

Authors:Min Cheng, Fatemeh Doudi, Dileep Kalathil, Mohammad Ghavamzadeh, Panganamala R. Kumar
Title: Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models
Abstract:
Reinforcement learning (RL) algorithms have been used recently to align diffusion models with downstream objectives such as aesthetic quality and text-image consistency by fine-tuning them to maximize a single reward function under a fixed KL regularization. However, this approach is inherently restrictive in practice, where alignment must balance multiple, often conflicting objectives. Moreover, user preferences vary across prompts, individuals, and deployment contexts, with varying tolerances for deviation from a pre-trained base model. We address the problem of inference-time multi-preference alignment: given a set of basis reward functions and a reference KL regularization strength, can we design a fine-tuning procedure so that, at inference time, it can generate images aligned with any user-specified linear combination of rewards and regularization, without requiring additional fine-tuning? We propose Diffusion Blend, a novel approach to solve inference-time multi-preference alignment by blending backward diffusion processes associated with fine-tuned models, and we instantiate this approach with two algorithms: DB-MPA for multi-reward alignment and DB-KLA for KL regularization control. Extensive experiments show that Diffusion Blend algorithms consistently outperform relevant baselines and closely match or exceed the performance of individually fine-tuned models, enabling efficient, user-driven alignment at inference-time. The code is available at https://github.com/bluewoods127/DB-2025.
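中文: 强化学习虽能调整扩散模型以符合单一目标,但在处理用户和场景各异的多重冲突偏好时存在局限;本文提出的Diffusion Blend通过融合各微调模型对应的反向扩散过程,无需额外微调即可在推理时按用户指定的奖励组合与正则化强度进行对齐。
English: Reinforcement learning has been used to align diffusion models with single objectives, but this approach is limited when balancing multiple conflicting preferences, which vary by user and context; the proposed Diffusion Blend blends the backward diffusion processes of fine-tuned models so that, at inference time, images can be aligned with any user-specified combination of rewards and KL regularization without additional fine-tuning.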

Authors:Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, Xueqian Wang
Title: Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models
Abstract:
Standing in 2025, at a critical juncture in the pursuit of Artificial General Intelligence (AGI), reinforcement fine-tuning (RFT) has demonstrated significant potential in enhancing the reasoning capability of large language models (LLMs) and has led to the development of cutting-edge AI models such as OpenAI-o1 and DeepSeek-R1. Moreover, the efficient application of RFT to enhance the reasoning capability of multimodal large language models (MLLMs) has attracted widespread attention from the community. In this position paper, we argue that reinforcement fine-tuning powers the reasoning capability of multimodal large language models. To begin with, we provide a detailed introduction to the fundamental background knowledge that researchers interested in this field should be familiar with. Furthermore, we meticulously summarize the improvements of RFT in powering reasoning capability of MLLMs into five key points: diverse modalities, diverse tasks and domains, better training algorithms, abundant benchmarks and thriving engineering frameworks. Finally, we propose five promising directions for future research that the community might consider. We hope that this position paper will provide valuable insights to the community at this pivotal stage in the advancement toward AGI. Summary of works done on RFT for MLLMs is available at https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs.
中文摘要:强化微调技术显著提升了多模态大语言模型的推理能力,本文通过五大改进方向与未来研究路径,为通用人工智能发展的关键阶段提供了重要见解。
English Summary: Reinforcement fine-tuning significantly enhances the reasoning capabilities of multimodal large language models, as detailed through five key improvements and future research directions in this pivotal AGI development stage.

Authors:Mingyang Wu, Li Lin, Wenbin Zhang, Xin Wang, Zhenhuan Yang, Shu Hu
Title: Preserving AUC Fairness in Learning with Noisy Protected Groups
Abstract:
The Area Under the ROC Curve (AUC) is a key metric for classification, especially under class imbalance, with growing research focus on optimizing AUC over accuracy in applications like medical image analysis and deepfake detection. This makes fairness in AUC optimization crucial, as biases can impact protected groups. While various fairness mitigation techniques exist, fairness considerations in AUC optimization remain in their early stages, with most research focusing on improving AUC fairness under the assumption of clean protected groups. However, these studies often overlook the impact of noisy protected groups, leading to fairness violations in practice. To address this, we propose the first robust AUC fairness approach under noisy protected groups with theoretical fairness guarantees using distributionally robust optimization. Extensive experiments on tabular and image datasets show that our method outperforms state-of-the-art approaches in preserving AUC fairness. The code is available at https://github.com/Purdue-M2/AUC_Fairness_with_Noisy_Groups.
Chinese: 本文首次提出在受保护群体标签存在噪声的情况下,采用分布鲁棒优化方法确保AUC公平性的稳健方案,并通过在表格和图像数据集上的大量实验验证了该方法优于现有技术。
English: This paper introduces the first robust approach to ensuring AUC fairness under noisy protected groups, using distributionally robust optimization with theoretical guarantees, and demonstrates its superiority over existing methods through extensive experiments on tabular and image datasets.

Authors:Yiqing Zhang, Xiaozhong Liu, Fabricio Murai
Title: CLaDMoP: Learning Transferrable Models from Successful Clinical Trials via LLMs
Abstract:
Many existing models for clinical trial outcome prediction are optimized using task-specific loss functions on trial phase-specific data. While this scheme may boost prediction for common diseases and drugs, it can hinder learning of generalizable representations, leading to more false positives/negatives. To address this limitation, we introduce CLaDMoP, a new pre-training approach for clinical trial outcome prediction, alongside the Successful Clinical Trials (SCT) dataset, specifically designed for this task. CLaDMoP leverages a Large Language Model to encode trials' eligibility criteria, linked to a lightweight Drug-Molecule branch through a novel multi-level fusion technique. To efficiently fuse long embeddings across levels, we incorporate a grouping block, drastically reducing computational overhead. CLaDMoP avoids reliance on task-specific objectives by pre-training on a "pair matching" proxy task. Compared to established zero-shot and few-shot baselines, our method significantly improves both PR-AUC and ROC-AUC, especially for phase I and phase II trials. We further evaluate and perform ablation on CLaDMoP after Parameter-Efficient Fine-Tuning, comparing it to state-of-the-art supervised baselines, including MEXA-CTP, on the Trial Outcome Prediction (TOP) benchmark. CLaDMoP achieves up to 10.5% improvement in PR-AUC and 3.6% in ROC-AUC, while attaining comparable F1 score to MEXA-CTP, highlighting its potential for clinical trial outcome prediction. Code and SCT dataset can be downloaded from https://github.com/murai-lab/CLaDMoP.
Chinese: 我们提出了CLaDMoP,一种用于临床试验结果预测的新预训练方法,通过配对匹配任务提升泛化能力,在PR-AUC和ROC-AUC指标上较现有基线取得显著提升。
English: We introduce CLaDMoP, a novel pre-training method for clinical trial outcome prediction that uses a pair matching task to enhance generalizability and achieves significant improvements in PR-AUC and ROC-AUC over existing baselines.

Authors:Haoyu Yang, Yuxiang Cai, Jintao Chen, Xuhong Zhang, Wenhui Lei, Xiaoming Shi, Jianwei Yin, Yankai Jiang
Title: TK-Mamba: Marrying KAN with Mamba for Text-Driven 3D Medical Image Segmentation
Abstract:
3D medical image segmentation is vital for clinical diagnosis and treatment but is challenged by high-dimensional data and complex spatial dependencies. Traditional single-modality networks, such as CNNs and Transformers, are often limited by computational inefficiency and constrained contextual modeling in 3D settings. We introduce a novel multimodal framework that leverages Mamba and Kolmogorov-Arnold Networks (KAN) as an efficient backbone for long-sequence modeling. Our approach features three key innovations: First, an EGSC (Enhanced Gated Spatial Convolution) module captures spatial information when unfolding 3D images into 1D sequences. Second, we extend Group-Rational KAN (GR-KAN), a Kolmogorov-Arnold Networks variant with rational basis functions, into 3D-Group-Rational KAN (3D-GR-KAN) for 3D medical imaging - its first application in this domain - enabling superior feature representation tailored to volumetric data. Third, a dual-branch text-driven strategy leverages CLIP's text embeddings: one branch swaps one-hot labels for semantic vectors to preserve inter-organ semantic relationships, while the other aligns images with detailed organ descriptions to enhance semantic alignment. Experiments on the Medical Segmentation Decathlon (MSD) and KiTS23 datasets show our method achieving state-of-the-art performance, surpassing existing approaches in accuracy and efficiency. This work highlights the power of combining advanced sequence modeling, extended network architectures, and vision-language synergy to push forward 3D medical image segmentation, delivering a scalable solution for clinical use. The source code is openly available at https://github.com/yhy-whu/TK-Mamba.
中文: 本研究提出了一种结合Mamba和3D群有理KAN的新型多模态框架,通过增强空间建模和文本驱动语义对齐技术,有效解决了三维医学图像分割中的计算效率与空间依赖难题,实现了最先进的性能表现。
English: This study introduces a novel multimodal framework integrating Mamba and 3D-Group-Rational KAN to address computational inefficiency and spatial dependency challenges in 3D medical image segmentation, achieving state-of-the-art performance through enhanced spatial modeling and text-driven semantic alignment.

Authors:Yiheng Li, Feng Liang, Dan Kondratyuk, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu
Title: Improved Immiscible Diffusion: Accelerate Diffusion Training by Reducing Its Miscibility
Abstract:
The substantial training cost of diffusion models hinders their deployment. Immiscible Diffusion recently showed that reducing diffusion trajectory mixing in the noise space via linear assignment accelerates training by simplifying denoising. To extend immiscible diffusion beyond the inefficient linear assignment under high batch sizes and high dimensions, we refine this concept to a broader miscibility reduction at any layer and by any implementation. Specifically, we empirically demonstrate the bijective nature of the denoising process with respect to immiscible diffusion, ensuring its preservation of generative diversity. Moreover, we provide thorough analysis and show step-by-step how immiscibility eases denoising and improves efficiency. Extending beyond linear assignment, we propose a family of implementations including K-nearest neighbor (KNN) noise selection and image scaling to reduce miscibility, achieving up to >4x faster training across diverse models and tasks including unconditional/conditional generation, image editing, and robotics planning. Furthermore, our analysis of immiscibility offers a novel perspective on how optimal transport (OT) enhances diffusion training. By identifying trajectory miscibility as a fundamental bottleneck, we believe this work establishes a potentially new direction for future research into high-efficiency diffusion training. The code is available at https://github.com/yhli123/Immiscible-Diffusion.
中文摘要:本研究提出了一种广义的互溶性降低方法,通过K近邻噪声选择和图像缩放等实现方式,将扩散模型训练速度提升高达4倍以上,同时保持生成多样性,并为最优传输在扩散训练中的作用提供了新视角。
English Summary: This work introduces a generalized miscibility reduction approach that extends beyond linear assignment to accelerate diffusion model training by more than 4x through implementations like KNN noise selection and image scaling, while maintaining generative diversity and offering new insights into optimal transport's role in diffusion training.
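
One way to read "KNN noise selection" is that each training image is paired with the nearest of a few randomly drawn noise vectors instead of an arbitrary one. The sketch below encodes only that reading and is an assumption rather than the authors' implementation: the flattened toy image, the choice of squared Euclidean distance, `k=8`, and the function names are all hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sample_noise_candidates(dim, k, rng):
    """Draw k candidate Gaussian noise vectors of length dim."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def select_nearest_noise(image, candidates):
    """Pick the candidate noise closest to the image (1-nearest neighbour)."""
    return min(candidates, key=lambda noise: squared_distance(image, noise))


rng = random.Random(0)            # fixed seed so the sketch is repeatable
image = [0.8, -0.3, 0.1, 0.5]     # hypothetical flattened image
candidates = sample_noise_candidates(len(image), k=8, rng=rng)
chosen = select_nearest_noise(image, candidates)
print(squared_distance(image, chosen))
```

Because each image is matched independently, the cost grows linearly with `k` and avoids solving a batch-wide linear assignment, which is the inefficiency the abstract attributes to the original approach.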

Authors:Taeckyung Lee, Sorn Chottananurak, Junsu Kim, Jinwoo Shin, Taesik Gong, Sung-Ju Lee
Title: Test-Time Adaptation with Binary Feedback
Abstract:
Deep learning models perform poorly when domain shifts exist between training and test data. Test-time adaptation (TTA) is a paradigm to mitigate this issue by adapting pre-trained models using only unlabeled test samples. However, existing TTA methods can fail under severe domain shifts, while recent active TTA approaches requiring full-class labels are impractical due to high labeling costs. To address this issue, we introduce a new setting of TTA with binary feedback. This setting uses a few binary feedback inputs from annotators to indicate whether model predictions are correct, thereby significantly reducing the labeling burden of annotators. Under the setting, we propose BiTTA, a novel dual-path optimization framework that leverages reinforcement learning to balance binary feedback-guided adaptation on uncertain samples with agreement-based self-adaptation on confident predictions. Experiments show BiTTA achieves 13.3%p accuracy improvements over state-of-the-art baselines, demonstrating its effectiveness in handling severe distribution shifts with minimal labeling effort. The source code is available at https://github.com/taeckyung/BiTTA.
中文:BiTTA提出了一种基于二元反馈的测试时自适应框架,通过强化学习平衡不确定样本的反馈引导优化与置信预测的自主适应,在极端域偏移下以最少标注成本实现了显著性能提升。
English: BiTTA introduces a test-time adaptation framework using binary feedback to guide model updates on uncertain samples while self-adapting confident predictions, achieving significant accuracy gains with minimal labeling effort under severe domain shifts.

Authors:Weiwei Sun, Haokun Liu, Nikhil Kandpal, Colin Raffel, Yiming Yang
Title: Enhancing Training Data Attribution with Representational Optimization
Abstract:
Training data attribution (TDA) methods aim to measure how training data impacts a model's predictions. While gradient-based attribution methods, such as influence functions, offer theoretical grounding, their computational costs make them impractical for large-scale applications. Representation-based approaches are far more scalable, but typically rely on heuristic embeddings that are not optimized for attribution, limiting their fidelity. To address these challenges, we propose AirRep, a scalable, representation-based approach that closes this gap by learning task-specific and model-aligned representations optimized explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence. We train AirRep using a ranking objective over automatically constructed training subsets labeled by their empirical effect on target predictions. Experiments on instruction-tuned LLMs demonstrate that AirRep achieves performance on par with state-of-the-art gradient-based approaches while being nearly two orders of magnitude more efficient at inference time. Further analysis highlights its robustness and generalization across tasks and models. Our code is available at https://github.com/sunnweiwei/AirRep.
中文: AirRep是一种基于表示的可扩展方法,通过优化学习任务特定和模型对齐的表征来进行训练数据归因,在显著提高效率的同时实现了与基于梯度方法相媲美的性能。
English: AirRep is a scalable, representation-based method that learns task-specific and model-aligned representations for training data attribution, achieving performance comparable to gradient-based approaches with significantly higher efficiency.

Authors:Guodong Du, Xuanning Zhou, Junlin Li, Zhuo Li, Zesheng Shi, Wanyu Lin, Ho-Kin Tang, Xiucheng Li, Fangming Liu, Wenya Wang, Min Zhang, Jing Li
Title: Knowledge Grafting of Large Language Models
Abstract:
Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more efficient cross-capability transfer methods. However, existing approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model's intrinsic capacity and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel method that stores source model capabilities in a target model with SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy to compress parameter updates, ensuring efficient storage while maintaining task-specific knowledge. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for heterogeneous model fusion and continual learning. Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer. The code is publicly available at: https://github.com/duguodong7/GraftLLM.
Chinese: GraftLLM提出了一种新颖的跨能力迁移方法,通过SkillPack格式在异构模型间高效存储和传递知识,既能防止灾难性遗忘,又能实现可扩展的持续学习。
English: GraftLLM introduces a novel cross-capability transfer method using SkillPack format to efficiently store and transfer knowledge between heterogeneous models while preventing catastrophic forgetting and enabling scalable continual learning.

Authors:Xiaojun Guo, Ang Li, Yifei Wang, Stefanie Jegelka, Yisen Wang
Title: G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning
Abstract:
Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs' graph reasoning abilities. To enable RL training, we curate Erdős, the largest graph reasoning dataset to date comprising 50 diverse graph-theoretic tasks of varying difficulty levels, 100k training data and 5k test data, all derived from real-world graphs. With RL on Erdős, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully. Our implementation is open-sourced at https://github.com/PKU-ML/G1, with models and datasets hosted on Hugging Face collections https://huggingface.co/collections/PKU-ML/g1-683d659e992794fc99618cf2 for broader accessibility.
中文: G1方法通过在合成的Erdős数据集上进行强化学习,显著提升了大语言模型的图推理能力,使仅30亿参数的模型性能超越庞大模型,并能良好泛化且不损害通用推理能力。
English: The G1 approach uses reinforcement learning on the synthetic Erdős dataset to significantly enhance LLMs' graph reasoning, enabling a compact 3B model to outperform much larger models and generalize well without compromising general abilities.

Authors:Jialiang Sun, Yuzhi Tang, Ao Li, Chris J. Maddison, Kuldeep S. Meel
Title: Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions
Abstract:
Mathematical reasoning lies at the heart of artificial intelligence, underpinning applications in education, program verification, and research-level mathematical discovery. Mathematical competitions, in particular, present two challenging problem types: theorem proving, which requires rigorous proofs of stated conclusions, and answer construction, which involves hypothesizing and formally verifying mathematical objects. Large Language Models (LLMs) effectively generate creative candidate answers but struggle with formal verification, while symbolic provers ensure rigor but cannot efficiently handle creative conjecture generation. We introduce the Enumerate-Conjecture-Prove (ECP) framework, a modular neuro-symbolic method integrating LLM-based enumeration and pattern-driven conjecturing with formal theorem proving. We present ConstructiveBench, a dataset of 3,431 answer-construction problems in various math competitions with verified Lean formalizations. On the ConstructiveBench dataset, ECP improves the accuracy of answer construction from a Chain-of-Thought (CoT) baseline of 14.54% to 45.06% with the gpt-4.1-mini model. Moreover, combined with ECP's constructed answers, the state-of-the-art DeepSeek-Prover-V2-7B model generates correct proofs for 858 of the 3,431 constructive problems in Lean, achieving 25.01% accuracy compared to 9.86% for symbolic-only baselines. Our code and dataset are publicly available at https://github.com/JackSun200312/ECP.
中文摘要:枚举-猜想-证明(ECP)框架将大语言模型的创造性生成与符号证明器的严谨性相结合,在ConstructiveBench数据集上把数学构造类问题的准确率从14.54%显著提升至45.06%。
English Summary: The Enumerate-Conjecture-Prove (ECP) framework synergizes LLMs' creative generation with symbolic provers' rigor, significantly boosting accuracy on mathematical answer-construction problems from 14.54% to 45.06% in ConstructiveBench.
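
The accuracy figures quoted in this entry can be cross-checked with a few lines of arithmetic. This is only a consistency check on numbers taken from the abstract; the helper name `percent` is an assumption for illustration.

```python
def percent(part, whole):
    """Percentage of part in whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# 858 correct Lean proofs out of 3,431 ConstructiveBench problems.
proof_accuracy = percent(858, 3431)

# Gains in percentage points over the quoted baselines.
answer_gain = round(45.06 - 14.54, 2)           # ECP vs. Chain-of-Thought
proof_gain = round(proof_accuracy - 9.86, 2)    # ECP + prover vs. symbolic-only

print(proof_accuracy, answer_gain, proof_gain)
```

The first value matches the 25.01% reported in the abstract; the other two express the improvements as absolute percentage-point differences rather than relative gains.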

Authors:Junlin Wang, Zhiyun Lin
Title: Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning
Abstract:
Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present Inter-token Contrast (ICon), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. The project website: https://github.com/HenryWJL/icon
中文: 本文提出ICon方法,通过对视觉变换器的令牌级表征进行对比学习,分离智能体与环境相关令牌,形成具身化视觉表征,从而提升机器人操作策略的学习效果与跨机器人迁移能力。
English: The paper introduces ICon, a contrastive learning method for Vision Transformers that separates agent-specific and environment-specific tokens to create body-relevant visual representations, enhancing robotic manipulation policy learning and transferability.

Authors:Mengran Li, Pengyu Zhang, Wenbin Xing, Yijia Zheng, Klim Zaporojets, Junzhou Chen, Ronghui Zhang, Yong Zhang, Siyuan Gong, Jia Hu, Xiaolei Ma, Zhiyuan Liu, Paul Groth, Marcel Worring
Title: A Survey of Large Language Models for Data Challenges in Graphs
Abstract:
Graphs are a widely used paradigm for representing non-Euclidean data, with applications ranging from social network analysis to biomolecular prediction. While graph learning has achieved remarkable progress, real-world graph data presents a number of challenges that significantly hinder the learning process. In this survey, we focus on four fundamental data-centric challenges: (1) Incompleteness, real-world graphs have missing nodes, edges, or attributes; (2) Imbalance, the distribution of the labels of nodes or edges and their structures for real-world graphs are highly skewed; (3) Cross-domain Heterogeneity, graphs from different domains exhibit incompatible feature spaces or structural patterns; and (4) Dynamic Instability, graphs evolve over time in unpredictable ways. Recently, Large Language Models (LLMs) offer the potential to tackle these challenges by leveraging rich semantic reasoning and external knowledge. This survey focuses on how LLMs can address four fundamental data-centric challenges in graph-structured data, thereby improving the effectiveness of graph learning. For each challenge, we review both traditional solutions and modern LLM-driven approaches, highlighting how LLMs contribute unique advantages. Finally, we discuss open research questions and promising future directions in this emerging interdisciplinary field. To support further exploration, we have curated a repository of recent advances on graph learning challenges: https://github.com/limengran98/Awesome-Literature-Graph-Learning-Challenges.
Chinese: 本综述探讨大型语言模型如何利用语义推理和外部知识应对图学习中的四个核心数据挑战——不完整性、不平衡性、跨域异质性和动态不稳定性,从而提升图学习效能。
English: This survey explores how Large Language Models (LLMs) can address four key data-centric challenges in graph learning—incompleteness, imbalance, cross-domain heterogeneity, and dynamic instability—by leveraging semantic reasoning and external knowledge to enhance learning effectiveness.

Authors:Jingkai Wang, Wu Miao, Jue Gong, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, Yulun Zhang
Title: HonestFace: Towards Honest Face Restoration with One-Step Diffusion Model
Abstract:
Face restoration has achieved remarkable advancements through the years of development. However, ensuring that restored facial images exhibit high fidelity, preserve authentic features, and avoid introducing artifacts or biases remains a significant challenge. This highlights the need for models that are more "honest" in their reconstruction from low-quality inputs, accurately reflecting original characteristics. In this work, we propose HonestFace, a novel approach designed to restore faces with a strong emphasis on such honesty, particularly concerning identity consistency and texture realism. To achieve this, HonestFace incorporates several key components. First, we propose an identity embedder to effectively capture and preserve crucial identity features from both the low-quality input and multiple reference faces. Second, a masked face alignment method is presented to enhance fine-grained details and textural authenticity, thereby preventing the generation of patterned or overly synthetic textures and improving overall clarity. Furthermore, we present a new landmark-based evaluation metric. Based on affine transformation principles, this metric improves the accuracy compared to conventional L2 distance calculations for facial feature alignment. Leveraging these contributions within a one-step diffusion model framework, HonestFace delivers exceptional restoration results in terms of facial fidelity and realism. Extensive experiments demonstrate that our approach surpasses existing state-of-the-art methods, achieving superior performance in both visual quality and quantitative assessments. The code and pre-trained models will be made publicly available at https://github.com/jkwang28/HonestFace .
中文摘要:HonestFace是一种新颖的人脸复原方法,通过身份嵌入器和掩码对齐技术强调身份一致性与纹理真实性,实现了卓越的保真度并超越了现有最优方法。
English Summary: HonestFace is a novel face restoration method that emphasizes identity consistency and texture realism through an identity embedder and masked alignment technique, achieving superior fidelity and outperforming state-of-the-art methods.

Authors:Xuanhe Zhou, Junxuan He, Wei Zhou, Haodong Chen, Zirui Tang, Haoyu Zhao, Xin Tong, Guoliang Li, Youmin Chen, Jun Zhou, Zhaojun Sun, Binyuan Hui, Shuo Wang, Conghui He, Zhiyuan Liu, Jingren Zhou, Fan Wu
Title: A Survey of LLM $\times$ DATA
Abstract:
The integration of large language model (LLM) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large-scale data processing, storage, and serving, feeds LLMs with high quality, diversity, and timeliness of data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic workflows: (i) Data processing for LLMs includes scalable acquisition, deduplication, filtering, selection, domain mixing, and synthetic augmentation; (ii) Data Storage for LLMs focuses on efficient data and model formats, distributed and heterogeneous storage hierarchies, KV-cache management, and fault-tolerant checkpointing; (iii) Data serving for LLMs tackles challenges in RAG (e.g., knowledge post-processing), LLM inference (e.g., prompt compression, data provenance), and training strategies (e.g., data packing and shuffling). On the other hand, in LLM4DATA, LLMs are emerging as general-purpose engines for data management. We review recent advances in (i) data manipulation, including automatic data cleaning, integration, discovery; (ii) data analysis, covering reasoning over structured, semi-structured, and unstructured data, and (iii) system optimization (e.g., configuration tuning, query rewriting, anomaly diagnosis), powered by LLM techniques like retrieval-augmented prompting, task-specialized fine-tuning, and multi-agent collaboration.
中文: 本综述探讨了大型语言模型与数据管理的双向融合,既涵盖数据系统通过处理、存储和服务支持模型开发,也涉及模型如何优化数据操作、分析和系统管理。
English: This survey explores the bidirectional integration between large language models (LLMs) and data management, covering how data systems support LLM development through processing, storage, and serving, while LLMs enhance data tasks like manipulation, analysis, and system optimization.

Authors:Zhining Liu, Ze Yang, Xiao Lin, Ruizhong Qiu, Tianxin Wei, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong
Title: Breaking Silos: Adaptive Model Fusion Unlocks Better Time Series Forecasting
Abstract:
Time-series forecasting plays a critical role in many real-world applications. Although increasingly powerful models have been developed and achieved superior results on benchmark datasets, through a fine-grained sample-level inspection, we find that (i) no single model consistently outperforms others across different test samples, but instead (ii) each model excels in specific cases. These findings prompt us to explore how to adaptively leverage the distinct strengths of various forecasting models for different samples. We introduce TimeFuse, a framework for collective time-series forecasting with sample-level adaptive fusion of heterogeneous models. TimeFuse utilizes meta-features to characterize input time series and trains a learnable fusor to predict optimal model fusion weights for any given input. The fusor can leverage samples from diverse datasets for joint training, allowing it to adapt to a wide variety of temporal patterns and thus generalize to new inputs, even from unseen datasets. Extensive experiments demonstrate the effectiveness of TimeFuse in various long-/short-term forecasting tasks, achieving near-universal improvement over the state-of-the-art individual models. Code is available at https://github.com/ZhiningLiu1998/TimeFuse.
中文: TimeFuse是一种新颖的时序预测框架,通过元特征和可学习的融合器对不同样本自适应地融合多种预测模型,在各类预测任务中均实现了优于现有最优模型的性能提升。
English: TimeFuse is a novel framework that adaptively fuses multiple time-series forecasting models at the sample level using meta-features and a learnable fusor, achieving near-universal improvements over state-of-the-art individual models across various forecasting tasks.
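Illustrative sketch: the sample-level fusion described in the abstract can be pictured as a small function that maps one input's meta-features to one weight per base model and then mixes the base forecasts with those weights. The code below is a minimal sketch under stated assumptions, not the TimeFuse implementation: the linear fusor with a softmax, the names `fusion_weights` and `fuse`, and all numbers are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fusion_weights(meta_features, fusor_matrix):
    """Map one sample's meta-features to one weight per base model.

    fusor_matrix[k] holds hypothetical linear coefficients for model k;
    the real fusor is learned and may have a different form.
    """
    scores = [sum(c * f for c, f in zip(row, meta_features)) for row in fusor_matrix]
    return softmax(scores)

def fuse(forecasts, weights):
    """Weighted sum of per-model forecasts at each step of the horizon."""
    horizon = len(forecasts[0])
    return [sum(w * fc[t] for w, fc in zip(weights, forecasts)) for t in range(horizon)]

# Toy example: two base models, three meta-features, horizon of two steps.
meta = [1.0, 0.0, 2.0]
fusor = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.25]]  # both rows score 0.5 for this sample
forecasts = [[10.0, 12.0], [14.0, 16.0]]

weights = fusion_weights(meta, fusor)
fused = fuse(forecasts, weights)
```

Because the weights depend on the meta-features of each input, two samples from the same dataset can end up favoring different base models, which is the behavior the abstract motivates.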

Authors:Romeo Valentin, Sydney M. Katz, Vincent Vanhoucke, Mykel J. Kochenderfer
Title: DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces
Abstract:
Dictionary learning has recently emerged as a promising approach for mechanistic interpretability of large transformer models. Disentangling high-dimensional transformer embeddings, however, requires algorithms that scale to high-dimensional data with large sample sizes. Recent work has explored sparse autoencoders (SAEs) for this problem. However, SAEs use a simple linear encoder to solve the sparse encoding subproblem, which is known to be NP-hard. It is therefore interesting to understand whether this structure is sufficient to find good solutions to the dictionary learning problem or if a more sophisticated algorithm could find better solutions. In this work, we propose Double-Batch KSVD (DB-KSVD), a scalable dictionary learning algorithm that adapts the classic KSVD algorithm. DB-KSVD is informed by the rich theoretical foundations of KSVD but scales to datasets with millions of samples and thousands of dimensions. We demonstrate the efficacy of DB-KSVD by disentangling embeddings of the Gemma-2-2B model and evaluating on six metrics from the SAEBench benchmark, where we achieve competitive results when compared to established approaches based on SAEs. By matching SAE performance with an entirely different optimization approach, our results suggest that (i) SAEs do find strong solutions to the dictionary learning problem and (ii) that traditional optimization approaches can be scaled to the required problem sizes, offering a promising avenue for further research. We provide an implementation of DB-KSVD at https://github.com/RomeoV/KSVD.jl.
Chinese: 本研究提出可扩展的字典学习算法Double-Batch KSVD,在解构Transformer嵌入表示的任务中与稀疏自编码器取得相当性能,既验证了现有方法的有效性,也展现了传统优化方法在大规模机制可解释性研究中的应用潜力。
English: This study introduces Double-Batch KSVD, a scalable dictionary learning algorithm that achieves competitive performance with sparse autoencoders in disentangling transformer embeddings, demonstrating both the effectiveness of existing methods and the potential of traditional optimization approaches for large-scale mechanistic interpretability.

Authors:Afshin Bozorgpour, Sina Ghorbani Kolahi, Reza Azad, Ilker Hacihaliloglu, Dorit Merhof
Title: CENet: Context Enhancement Network for Medical Image Segmentation
Abstract:
Medical image segmentation, particularly in multi-domain scenarios, requires precise preservation of anatomical structures across diverse representations. While deep learning has advanced this field, existing models often struggle with accurate boundary representation, variability in organ morphology, and information loss during downsampling, limiting their accuracy and robustness. To address these challenges, we propose the Context Enhancement Network (CENet), a novel segmentation framework featuring two key innovations. First, the Dual Selective Enhancement Block (DSEB) integrated into skip connections enhances boundary details and improves the detection of smaller organs in a context-aware manner. Second, the Context Feature Attention Module (CFAM) in the decoder employs a multi-scale design to maintain spatial integrity, reduce feature redundancy, and mitigate overly enhanced representations. Extensive evaluations on both radiology and dermoscopic datasets demonstrate that CENet outperforms state-of-the-art (SOTA) methods in multi-organ segmentation and boundary detail preservation, offering a robust and accurate solution for complex medical image analysis tasks. The code is publicly available at https://github.com/xmindflow/cenet.
中文:提出的上下文增强网络(CENet)通过创新模块提升医学图像分割中的边界精度和空间完整性,在多个数据集上展现出优于现有方法的性能。
English: The proposed Context Enhancement Network (CENet) introduces innovative modules to improve boundary precision and spatial integrity in medical image segmentation, demonstrating superior performance over existing methods across diverse datasets.

Authors:Yue Jiang, Jichu Li, Yang Liu, Dingkang Yang, Feng Zhou, Quyu Kong
Title: DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding
Abstract:
We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. The code and dataset have been released at https://github.com/FRENKIE-CHIANG/DanmakuTPPBench
中文: DanmakuTPPBench推出了一个多模态基准,整合了来自B站弹幕的时间事件数据和问答数据集,以推动时序点过程建模的发展,揭示了现有方法在处理时序-文本-视觉推理方面的不足。
English: DanmakuTPPBench introduces a multi-modal benchmark combining temporal event data from Bilibili's bullet comments with a QA dataset to advance Temporal Point Process modeling, revealing current methods' limitations in handling temporal-textual-visual reasoning.

Authors:Lin Zhao, Yushu Wu, Xinru Jiang, Jianyang Gu, Yanzhi Wang, Xiaolin Xu, Pu Zhao, Xue Lin
Title: Taming Diffusion for Dataset Distillation with High Representativeness
Abstract:
Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Due to the powerful image generation capability of diffusion, it has been introduced to this field for generating distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset distillation methods, including inaccurate distribution matching, distribution deviation with random noise, and separate sampling. Building on this, we propose D^3HR, a novel diffusion-based framework to generate distilled datasets with high representativeness. Specifically, we adopt DDIM inversion to map the latents of the full dataset from a low-normality latent domain to a high-normality Gaussian domain, preserving information and ensuring structural consistency to generate representative latents for the distilled dataset. Furthermore, we propose an efficient sampling scheme to better align the representative latents with the high-normality Gaussian distribution. Our comprehensive experiments demonstrate that D^3HR can achieve higher accuracy across different model architectures compared with state-of-the-art baselines in dataset distillation. Source code: https://github.com/lin-zhao-resoLve/D3HR.
中文: 本文提出D^3HR框架,通过DDIM反演和高效采样方案解决当前基于扩散的数据集蒸馏方法中的关键问题,生成高代表性数据集,在不同模型架构下均比现有方法获得更高准确率。
English: This paper introduces D^3HR, a diffusion-based framework that addresses key issues in dataset distillation by using DDIM inversion and an efficient sampling scheme to generate highly representative datasets, achieving superior accuracy across various model architectures compared to existing methods.

Authors:Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed
Title: NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities
Abstract:
Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B parameter Egyptian and Moroccan Arabic LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. This work addresses Arabic dialect in LLMs with a focus on cultural and values alignment via controlled synthetic data generation and retrieval-augmented pre-training for Moroccan Darija and Egyptian Arabic, including Arabizi variants, advancing Arabic NLP for low-resource communities. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in cultural LLM development: https://github.com/UBC-NLP/nilechat .
Chinese: 本研究提出了一种方法,通过生成文化和语言上定制化的数据来增强大语言模型对低资源语言的支持,并以尼罗河聊天模型为例,展示了其在埃及和摩洛哥方言的理解、翻译及文化对齐方面的卓越表现。
English: This research introduces a methodology to enhance LLMs for low-resource languages by generating culturally and linguistically tailored data, demonstrated through NileChat, a model that excels in understanding, translation, and cultural alignment for Egyptian and Moroccan dialects.

Authors:Pingchuan Ma, Ziang Yin, Qi Jing, Zhengqi Gao, Nicholas Gangi, Boyang Zhang, Tsung-Wei Huang, Zhaoran Huang, Duane S. Boning, Yu Yao, Jiaqi Gu
Title: SP2RINT: Spatially-Decoupled Physics-Inspired Progressive Inverse Optimization for Scalable, PDE-Constrained Meta-Optical Neural Network Training
Abstract:
Diffractive optical neural networks (DONNs) leverage light propagation for efficient analog AI and signal processing. Advances in nanophotonic fabrication and metasurface-based wavefront engineering have opened new pathways to realize high-capacity DONNs across various spectral regimes. Training such DONN systems to determine the metasurface structures remains challenging. Heuristic methods are fast but oversimplify metasurface modulation, often resulting in physically unrealizable designs and significant performance degradation. Simulation-in-the-loop optimizes implementable metasurfaces via adjoint methods, but is computationally prohibitive and unscalable. To address these limitations, we propose SP2RINT, a spatially decoupled, progressive training framework that formulates DONN training as a PDE-constrained learning problem. Metasurface responses are first relaxed into freely trainable transfer matrices with a banded structure. We then progressively enforce physical constraints by alternating between transfer matrix training and adjoint-based inverse design, avoiding per-iteration PDE solves while ensuring final physical realizability. To further reduce runtime, we introduce a physics-inspired, spatially decoupled inverse design strategy based on the natural locality of field interactions. This approach partitions the metasurface into independently solvable patches, enabling scalable and parallel inverse design with system-level calibration. Evaluated across diverse DONN training tasks, SP2RINT achieves digital-comparable accuracy while being 1825 times faster than simulation-in-the-loop approaches. By bridging the gap between abstract DONN models and implementable photonic hardware, SP2RINT enables scalable, high-performance training of physically realizable meta-optical neural systems. Our code is available at https://github.com/ScopeX-ASU/SP2RINT
中文: SP2RINT是一种创新的训练框架,通过空间解耦和渐进式物理约束实施,实现了可扩展且高效的衍射光学神经网络设计,在保持与数字系统相当精度的同时,比传统方法提速1825倍。
English: SP2RINT is a novel training framework that enables scalable and efficient design of physically realizable diffractive optical neural networks by spatially decoupling inverse design and progressively enforcing physical constraints, achieving digital-comparable accuracy with an 1825× speedup over simulation-in-the-loop approaches.

Authors:Minwoo Jung, Lanke Frank Tarimo Fu, Maurice Fallon, Ayoung Kim
Title: ImLPR: Image-based LiDAR Place Recognition using Vision Foundation Models
Abstract:
LiDAR Place Recognition (LPR) is a key component in robotic localization, enabling robots to align current scans with prior maps of their environment. While Visual Place Recognition (VPR) has embraced Vision Foundation Models (VFMs) to enhance descriptor robustness, LPR has relied on task-specific models with limited use of pre-trained foundation-level knowledge. This is due to the lack of 3D foundation models and the challenges of using VFM with LiDAR point clouds. To tackle this, we introduce ImLPR, a novel pipeline that employs a pre-trained DINOv2 VFM to generate rich descriptors for LPR. To the best of our knowledge, ImLPR is the first method to utilize a VFM for LPR while retaining the majority of pre-trained knowledge. ImLPR converts raw point clouds into novel three-channel Range Image Views (RIV) to leverage VFM in the LiDAR domain. It employs MultiConv adapters and Patch-InfoNCE loss for effective feature learning. We validate ImLPR on public datasets and outperform state-of-the-art (SOTA) methods across multiple evaluation metrics in both intra- and inter-session LPR. Comprehensive ablations on key design choices such as channel composition, RIV, adapters, and the patch-level loss quantify each component's impact. We release ImLPR as open source for the robotics community: https://github.com/minwoo0611/ImLPR.
Chinese: ImLPR是一种新颖的激光雷达地点识别方法,它通过将点云转换为距离图像视图并利用预训练的DINOv2视觉基础模型生成鲁棒描述符,在多个数据集上实现了最先进的性能。
English: ImLPR is a novel LiDAR place recognition pipeline that leverages a pre-trained DINOv2 vision foundation model to generate robust descriptors by converting point clouds into range image views, achieving state-of-the-art performance across multiple datasets.

Authors:Jianyang Gu, Haonan Wang, Ruoxi Jia, Saeed Vahidian, Vyacheslav Kungurtsev, Wei Jiang, Yiran Chen
Title: CONCORD: Concept-Informed Diffusion for Dataset Distillation
Abstract:
Dataset distillation (DD) has witnessed significant progress in creating small datasets that encapsulate rich information from large original ones. Particularly, methods based on generative priors show promising performance, while maintaining computational efficiency and cross-architecture generalization. However, the generation process lacks explicit controllability for each sample. Previous distillation methods primarily match the real distribution from the perspective of the entire dataset, whereas overlooking concept completeness at the instance level. The missing or incorrectly represented object details cannot be efficiently compensated due to the constrained sample amount typical in DD settings. To this end, we propose incorporating the concept understanding of large language models (LLMs) to perform Concept-Informed Diffusion (CONCORD) for dataset distillation. Specifically, distinguishable and fine-grained concepts are retrieved based on category labels to inform the denoising process and refine essential object details. By integrating these concepts, the proposed method significantly enhances both the controllability and interpretability of the distilled image generation, without relying on pre-trained classifiers. We demonstrate the efficacy of CONCORD by achieving state-of-the-art performance on ImageNet-1K and its subsets. The code implementation is released in https://github.com/vimar-gu/CONCORD.
中文: 提出的CONCORD方法通过整合大型语言模型的概念理解,增强了数据集蒸馏的可控性和图像细节准确性,在ImageNet基准测试中取得了领先性能。
English: The proposed CONCORD method enhances dataset distillation by integrating large language models' concept understanding to improve controllability and detail accuracy in generated images, achieving state-of-the-art results on ImageNet benchmarks.

Authors:Míriam Máximo, Antonio Santo, Arturo Gil, Mónica Ballesta, David Valiente
Title: A Coarse to Fine 3D LiDAR Localization with Deep Local Features for Long Term Robot Navigation in Large Environments
Abstract:
The location of a robot is a key aspect in the field of mobile robotics. This problem is particularly complex when the initial pose of the robot is unknown. In order to find a solution, it is necessary to perform a global localization. In this paper, we propose a method that addresses this problem using a coarse-to-fine solution. The coarse localization relies on the probabilistic Monte Carlo Localization (MCL) method, with a deep learning model, the MinkUNeXt neural network, producing a robust description of 3D LiDAR point clouds within the observation model. For fine localization, global point cloud registration has been implemented. MinkUNeXt aids this by exploiting the outputs of its intermediate layers to produce deep local features for each point in a scan. These features facilitate precise alignment between the current sensor observation and one of the point clouds on the map. The proposed MCL method incorporating Deep Local Features for fine localization is termed MCL-DLF. For comparison, a classical ICP method has also been implemented for this fine localization step. This method is termed MCL-ICP. To validate the performance of the MCL-DLF method, it has been tested on publicly available datasets such as the NCLT dataset, which provides seasonal large-scale environments. Additionally, tests have also been performed with our own data (UMH), which also includes seasonal variations in large indoor/outdoor scenarios. The results, which were compared with established state-of-the-art methodologies, demonstrate that the MCL-DLF method obtains an accurate estimate of the robot localization in dynamic environments despite changes in environmental conditions. For reproducibility purposes, the code is publicly available at https://github.com/miriammaximo/MCL-DLF.git
中文: 本文提出MCL-DLF方法,通过蒙特卡洛定位与深度学习特征相结合的由粗到精策略,实现了在动态季节性环境中基于3D激光雷达的机器人精准定位。
English: This paper introduces MCL-DLF, a coarse-to-fine robot localization method combining Monte Carlo Localization with deep learning features for precise 3D LiDAR point cloud alignment, demonstrating robust performance in dynamic seasonal environments.
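Illustrative sketch: the coarse stage described in the abstract places a learned point cloud descriptor inside the observation model of Monte Carlo Localization. The code below is a generic particle reweighting and resampling step under stated assumptions, not the MCL-DLF implementation: the 2D poses, the Gaussian likelihood in descriptor distance, the `sigma` value, and the toy two-dimensional descriptors are all hypothetical stand-ins for MinkUNeXt outputs.

```python
import math
import random

def reweight(particles, weights, map_descriptors, scan_descriptor, sigma=0.5):
    """Observation update: scale each particle weight by descriptor similarity."""
    updated = []
    for (x, y), w in zip(particles, weights):
        # Nearest mapped place to this particle (brute force, for clarity only).
        place = min(map_descriptors, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)
        # Hypothetical likelihood: Gaussian in the distance between descriptors.
        d = math.dist(map_descriptors[place], scan_descriptor)
        updated.append(w * math.exp(-0.5 * (d / sigma) ** 2))
    total = sum(updated)
    return [w / total for w in updated]

def resample(particles, weights, rng):
    """Draw a new particle set with probability proportional to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))

# Toy map: two places, each with a stored descriptor.
map_descriptors = {(0.0, 0.0): [1.0, 0.0], (10.0, 10.0): [0.0, 1.0]}
particles = [(0.0, 0.0), (10.0, 10.0)]
weights = [0.5, 0.5]
scan_descriptor = [1.0, 0.0]  # the current scan resembles the first place

weights = reweight(particles, weights, map_descriptors, scan_descriptor)
survivors = resample(particles, weights, random.Random(0))
```

After the update, almost all of the weight sits on the particle near the place whose descriptor matches the scan, so resampling concentrates particles there before any fine registration is attempted.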

Authors:Yuqi Jia, Zedian Shao, Yupei Liu, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong
Title: A Critical Evaluation of Defenses against Prompt Injection Attacks
Abstract:
Large Language Models (LLMs) are vulnerable to prompt injection attacks, and several defenses have recently been proposed, often claiming to mitigate these attacks successfully. However, we argue that existing studies lack a principled approach to evaluating these defenses. In this paper, we argue the need to assess defenses across two critical dimensions: (1) effectiveness, measured against both existing and adaptive prompt injection attacks involving diverse target and injected prompts, and (2) general-purpose utility, ensuring that the defense does not compromise the foundational capabilities of the LLM. Our critical evaluation reveals that prior studies have not followed such a comprehensive evaluation methodology. When assessed using this principled approach, we show that existing defenses are not as successful as previously reported. This work provides a foundation for evaluating future defenses and guiding their development. Our code and data are available at: https://github.com/PIEval123/PIEval.
Chinese: 本文批评现有针对大型语言模型提示注入攻击的防御措施缺乏系统性评估,并证明当从对抗适应性攻击的有效性和保持模型通用能力两个维度检验时,这些防御措施的实际效果远不如宣称的那么成功。
English: This paper critiques existing defenses against prompt injection attacks on LLMs for lacking comprehensive evaluation and demonstrates that when assessed across effectiveness against adaptive attacks and preservation of model utility, these defenses prove less successful than claimed.

Authors:Licheng Pan, Yongqi Tong, Xin Zhang, Xiaolu Zhang, Jun Zhou, Zhixuan Chu
Title: Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they often refuse to answer legitimate queries--a phenomenon known as overrefusal. Overrefusal typically stems from over-conservative safety alignment, causing models to treat many reasonable prompts as potentially risky. To systematically understand this issue, we probe and leverage the models' safety decision boundaries to analyze and mitigate overrefusal. Our findings reveal that overrefusal is closely tied to misalignment at these boundary regions, where models struggle to distinguish subtle differences between benign and harmful content. Building on these insights, we present RASS, an automated framework for prompt generation and selection that strategically targets overrefusal prompts near the safety boundary. By harnessing steering vectors in the representation space, RASS efficiently identifies and curates boundary-aligned prompts, enabling more effective and targeted mitigation of overrefusal. This approach not only provides a more precise and interpretable view of model safety decisions but also seamlessly extends to multilingual scenarios. We have explored the safety decision boundaries of various LLMs and constructed the MORBench evaluation set to facilitate robust assessment of model safety and helpfulness across multiple languages. Code and datasets are available at https://github.com/Master-PLC/RASS.
中文摘要:大语言模型常因保守的安全对齐而过度拒绝合理查询,但提出的RASS框架通过策略性地识别边界提示来缓解此问题,同时保持跨语言场景下的安全性。
English Summary: Large language models often refuse legitimate queries because of over-conservative safety alignment; the proposed RASS framework uses steering vectors to identify and curate prompts near the safety decision boundary, enabling targeted mitigation of overrefusal that also extends to multilingual scenarios.

Authors:Jianghao Wu, Feilong Tang, Yulong Li, Ming Hu, Haochen Xue, Shoaib Jameel, Yutong Xie, Imran Razzak
Title: TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification
Abstract:
Recent advances such as Chain-of-Thought prompting have significantly improved large language models (LLMs) in zero-shot medical reasoning. However, prompting-based methods often remain shallow and unstable, while fine-tuned medical LLMs suffer from poor generalization under distribution shifts and limited adaptability to unseen clinical scenarios. To address these limitations, we present TAGS, a test-time framework that combines a broadly capable generalist with a domain-specific specialist to offer complementary perspectives without any model fine-tuning or parameter updates. To support this generalist-specialist reasoning process, we introduce two auxiliary modules: a hierarchical retrieval mechanism that provides multi-scale exemplars by selecting examples based on both semantic and rationale-level similarity, and a reliability scorer that evaluates reasoning consistency to guide final answer aggregation. TAGS achieves strong performance across nine MedQA benchmarks, boosting GPT-4o accuracy by 13.8%, DeepSeek-R1 by 16.8%, and improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several fine-tuned medical LLMs, without any parameter updates. The code will be available at https://github.com/JianghaoWu/TAGS.
中文: TAGS框架通过结合通用与专业模型及辅助模块,无需微调即可显著提升多个医学问答基准的准确率,实现卓越的医疗推理性能。
English: TAGS is a test-time framework that combines generalist and specialist models with auxiliary modules to enhance medical reasoning, achieving significant accuracy improvements across multiple benchmarks without fine-tuning.

Authors:Roy Elkayam
Title: Decomposition of Water Demand Patterns Using Skewed Gaussian Distributions for Behavioral Insights and Operational Planning
Abstract:
This study presents a novel approach for decomposing urban water demand patterns using Skewed Gaussian Distributions (SGD) to derive behavioral insights and support operational planning. Hourly demand profiles contain critical information for both long-term infrastructure design and daily operations, influencing network pressures, water quality, energy consumption, and overall reliability. By breaking down each daily demand curve into a baseline component and distinct peak components, the proposed SGD method characterizes each peak with interpretable parameters, including peak amplitude, timing (mean), spread (duration), and skewness (asymmetry), thereby reconstructing the observed pattern and uncovering latent usage dynamics. This detailed peak-level decomposition enables both operational applications, e.g. anomaly and leakage detection, real-time demand management, and strategic analyses, e.g. identifying behavioral shifts, seasonal influences, or policy impacts on consumption patterns. Unlike traditional symmetric Gaussian or purely statistical time-series models, SGDs explicitly capture asymmetric peak shapes such as sharp morning surges followed by gradual declines, improving the fidelity of synthetic pattern generation and enhancing the detection of irregular consumption behavior. The method is demonstrated on several real-world datasets, showing that SGD outperforms symmetric Gaussian models in reconstruction accuracy, reducing root-mean-square error by over 50% on average, while maintaining physical interpretability. The SGD framework can also be used to construct synthetic demand scenarios by designing daily peak profiles with chosen characteristics. All implementation code is publicly available at: https://github.com/Relkayam/water-demand-decomposition-sgd
中文: 本研究提出了一种利用偏态高斯分布分解城市用水需求模式的新方法,通过精确捕捉非对称峰值特征,显著提高了重构精度和可解释性,从而获取行为洞察并支持运营规划。
English: This study introduces a novel method using Skewed Gaussian Distributions to decompose urban water demand patterns, enabling behavioral insights and operational planning by accurately capturing asymmetric peak characteristics with improved reconstruction accuracy and interpretability.
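
The decomposition described in the abstract (a baseline plus peaks with amplitude, timing, spread, and skewness) can be illustrated with a minimal sketch. This is not the repository's code: the function names, the skew-normal-shaped peak form, and all parameter values below are assumptions chosen only for illustration, and `mean` here is the location parameter of the peak rather than its statistical mean.

```python
import math

def skew_gaussian(t, amplitude, mean, spread, skew):
    """Skew-normal-shaped peak at hour t (scaled by amplitude, not unit area)."""
    z = (t - mean) / spread
    bell = math.exp(-0.5 * z * z)                      # symmetric Gaussian shape
    tilt = 1.0 + math.erf(skew * z / math.sqrt(2.0))   # equals 2 * Phi(skew * z)
    return amplitude * bell * tilt

def demand_profile(hours, baseline, peaks):
    """Rebuild a daily curve as a constant baseline plus a sum of skewed peaks."""
    return [baseline + sum(skew_gaussian(t, *p) for p in peaks) for t in hours]

# Hypothetical parameters: (amplitude, mean, spread, skew) for two daily peaks.
hours = list(range(24))
peaks = [(1.0, 7.0, 1.5, 3.0), (0.8, 19.0, 2.0, 1.0)]
profile = demand_profile(hours, 0.3, peaks)
```

With `skew = 0` the peak reduces to a symmetric Gaussian; a positive `skew` suppresses the left side and leaves a longer right tail, which is the kind of sharp rise followed by a gradual decline that the abstract mentions. Fitting these parameters to observed data would require an optimizer, which this sketch omits.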

Authors:Mingning Guo, Mengwei Wu, Jiarun He, Shaoxian Li, Haifeng Li, Chao Tao
Title: BEDI: A Comprehensive Benchmark for Evaluating Embodied Agents on UAVs
Abstract:
With the rapid advancement of low-altitude remote sensing and Vision-Language Models (VLMs), Embodied Agents based on Unmanned Aerial Vehicles (UAVs) have shown significant potential in autonomous tasks. However, current evaluation methods for UAV-Embodied Agents (UAV-EAs) remain constrained by the lack of standardized benchmarks, diverse testing scenarios and open system interfaces. To address these challenges, we propose BEDI (Benchmark for Embodied Drone Intelligence), a systematic and standardized benchmark designed for evaluating UAV-EAs. Specifically, we introduce a novel Dynamic Chain-of-Embodied-Task paradigm based on the perception-decision-action loop, which decomposes complex UAV tasks into standardized, measurable subtasks. Building on this paradigm, we design a unified evaluation framework encompassing five core sub-skills: semantic perception, spatial perception, motion control, tool utilization, and task planning. Furthermore, we construct a hybrid testing platform that integrates static real-world environments with dynamic virtual scenarios, enabling comprehensive performance assessment of UAV-EAs across varied contexts. The platform also offers open and standardized interfaces, allowing researchers to customize tasks and extend scenarios, thereby enhancing flexibility and scalability in the evaluation process. Finally, through empirical evaluations of several state-of-the-art (SOTA) VLMs, we reveal their limitations in embodied UAV tasks, underscoring the critical role of the BEDI benchmark in advancing embodied intelligence research and model optimization. By filling the gap in systematic and standardized evaluation within this field, BEDI facilitates objective model comparison and lays a robust foundation for future development in this field. Our benchmark will be released at https://github.com/lostwolves/BEDI .
中文摘要:BEDI基准通过基于感知-决策-行动循环的任务分解框架和混合测试平台,为无人机具身智能体提供了标准化评估体系,填补了该领域系统化评估的空白。
English Summary: The BEDI benchmark is introduced to standardize the evaluation of UAV-Embodied Agents by decomposing complex tasks into measurable subtasks and providing a hybrid testing platform with open interfaces, addressing current limitations in assessment methods.

Authors:Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
Title: Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality
Abstract:
In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.
中文: 本文主张将标记约简从传统的效率策略提升为生成式建模的核心原则,强调其在促进多模态整合、减少幻觉、保持长输入连贯性及增强训练稳定性等方面的关键作用,并展望了其在算法设计和跨领域应用中的潜力。
English: This paper repositions token reduction from a mere efficiency strategy to a fundamental principle in generative modeling, arguing it enhances multimodal integration, reduces hallucinations, maintains coherence in long inputs, and improves training stability across vision, language, and multimodal systems.

Authors:Beck LaBash, Shahriar Khushrushahi, Fabian Ruehle
Title: Improving Generative Inverse Design of Rectangular Patch Antennas with Test Time Optimization
Abstract:
We propose a two-stage deep learning framework for the inverse design of rectangular patch antennas. Our approach leverages generative modeling to learn a latent representation of antenna frequency response curves and conditions a subsequent generative model on these responses to produce feasible antenna geometries. We further demonstrate that leveraging search and optimization techniques at test-time improves the accuracy of the generated designs and enables consideration of auxiliary objectives such as manufacturability. Our approach generalizes naturally to different design criteria, and can be easily adapted to more complex geometric design spaces.
中文: 该两阶段深度学习框架通过生成式建模学习天线频率响应的潜在表征并据此生成可行几何结构,结合测试阶段的搜索优化提升设计精度并兼顾可制造性等辅助目标,能自然适应不同设计标准并扩展至复杂几何空间。
English: The proposed two-stage deep learning framework utilizes generative modeling to design rectangular patch antennas by learning frequency response representations and generating corresponding geometries, with test-time optimization enhancing design accuracy and incorporating manufacturability considerations.

Authors:Min Namgung, Yijun Lin, JangHyeon Lee, Yao-Yi Chiang
Title: Less is More: Multimodal Region Representation via Pairwise Inter-view Learning
Abstract:
With the increasing availability of geospatial datasets, researchers have explored region representation learning (RRL) to analyze complex region characteristics. Recent RRL methods use contrastive learning (CL) to capture shared information between two modalities but often overlook task-relevant unique information specific to each modality. Such modality-specific details can explain region characteristics that shared information alone cannot capture. Bringing information factorization to RRL can address this by factorizing multimodal data into shared and unique information. However, existing factorization approaches focus on two modalities, whereas RRL can benefit from various geospatial data. Extending factorization beyond two modalities is non-trivial because modeling high-order relationships introduces a combinatorial number of learning objectives, increasing model complexity. We introduce Cross modal Knowledge Injected Embedding (CooKIE), an information factorization approach for RRL that captures both shared and unique representations. CooKIE uses a pairwise inter-view learning approach that captures high-order information without modeling high-order dependency, avoiding exhaustive combinations. We evaluate CooKIE on three regression tasks and a land use classification task in New York City and Delhi, India. Results show that CooKIE outperforms existing RRL methods and a factorized RRL model, capturing multimodal information with fewer training parameters and floating-point operations (FLOPs). We release the code: https://github.com/MinNamgung/CooKIE.
中文摘要:CooKIE提出了一种高效的区域表征学习信息分解方法,无需建模复杂高阶依赖即可捕获多模态共享与独特信息,在降低计算成本的同时展现出优越性能。
English Summary: CooKIE introduces an efficient information factorization method for region representation learning that captures both shared and unique multimodal information without modeling complex high-order dependencies, demonstrating superior performance with reduced computational costs.

Authors:Natia Kukhilava, Tatia Tsmindashvili, Rapael Kalandadze, Anchit Gupta, Sofio Katamadze, François Brémond, Laura M. Ferrari, Philipp Müller, Benedikt Emanuel Wirth
Title: Evaluation in EEG Emotion Recognition: State-of-the-Art Review and Unified Framework
Abstract:
Electroencephalography-based Emotion Recognition (EEG-ER) has become a growing research area in recent years. Analyzing 216 papers published between 2018 and 2023, we uncover that the field lacks a unified evaluation protocol, which is essential to fairly define the state of the art, compare new approaches and to track the field's progress. We report the main inconsistencies between the used evaluation protocols, which are related to ground truth definition, evaluation metric selection, data splitting types (e.g., subject-dependent or subject-independent) and the use of different datasets. Capitalizing on this state-of-the-art research, we propose a unified evaluation protocol, EEGain (https://github.com/EmotionLab/EEGain), which enables an easy and efficient evaluation of new methods and datasets. EEGain is a novel open source software framework, offering the capability to compare - and thus define - state-of-the-art results. EEGain includes standardized methods for data pre-processing, data splitting, evaluation metrics, and the ability to load the six most relevant datasets (i.e., AMIGOS, DEAP, DREAMER, MAHNOB-HCI, SEED, SEED-IV) in EEG-ER with only a single line of code. In addition, we have assessed and validated EEGain using these six datasets on the four most common publicly available methods (EEGNet, DeepConvNet, ShallowConvNet, TSception). This is a significant step to make research on EEG-ER more reproducible and comparable, thereby accelerating the overall progress of the field.
中文: EEG-ER研究领域缺乏统一的评估标准,为此开发了EEGain开源框架,通过标准化数据处理和评估方法,确保研究结果的可复现性与可比性。
English: EEG-ER research lacks standardized evaluation protocols, prompting the development of EEGain, an open-source framework that ensures reproducible and comparable results by unifying data processing and assessment methods.
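
The distinction between subject-dependent and subject-independent data splitting mentioned in the abstract can be made concrete with a minimal leave-one-subject-out sketch. This is not EEGain's API: the function name and the toy subject labels are hypothetical and serve only to show what a subject-independent split means.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    Each fold tests on all trials of one subject and trains on the rest,
    so no subject contributes trials to both sides of a split.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical trial-to-subject assignment: five trials from three subjects.
subject_ids = ["s1", "s1", "s2", "s2", "s3"]
folds = list(leave_one_subject_out(subject_ids))
```

A subject-dependent protocol would instead split the trials within each subject, so results obtained under the two protocols are not directly comparable.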

Authors:Austin Howard
Title: InjectLab: A Tactical Framework for Adversarial Threat Modeling Against Large Language Models
Abstract:
Large Language Models (LLMs) are changing the way people interact with technology. Tools like ChatGPT and Claude AI are now common in business, research, and everyday life. But with that growth comes new risks, especially prompt-based attacks that exploit how these models process language. InjectLab is a security framework designed to address that problem. This paper introduces InjectLab as a structured, open-source matrix that maps real-world techniques used to manipulate LLMs. The framework is inspired by MITRE ATT&CK and focuses specifically on adversarial behavior at the prompt layer. It includes over 25 techniques organized under six core tactics, covering threats like instruction override, identity swapping, and multi-agent exploitation. Each technique in InjectLab includes detection guidance, mitigation strategies, and YAML-based simulation tests. A Python tool supports easy execution of prompt-based test cases. This paper outlines the framework's structure, compares it to other AI threat taxonomies, and discusses its future direction as a practical, community-driven foundation for securing language models.
中文摘要:InjectLab是受MITRE ATT&CK启发的开源安全框架,通过包含六大核心战术、25种以上技术的结构化矩阵,系统性地识别针对大语言模型的提示词攻击,并提供检测指南与防护策略。
English Summary: InjectLab is an open-source security framework inspired by MITRE ATT&CK that systematically maps prompt-based attack techniques against Large Language Models, offering detection guidance and mitigation strategies through a structured matrix of 25+ techniques across six core tactics.

Authors:Savya Khosla, Sethuraman TV, Barnett Lee, Alexander Schwing, Derek Hoiem
Title: REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders
Abstract:
We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders (DINO, DINOv2, and OpenCLIP) and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks' single-needle challenge. Code and models are available at: https://github.com/savya08/REN.
中文: 区域编码器网络(REN)是一种快速高效的模型,通过点提示直接生成基于区域的图像表示,实现了60倍的速度提升和35倍的内存节省,同时提高了标记质量,并在多项基准测试中超越了现有方法。
English: The Region Encoder Network (REN) is a fast and efficient model that generates region-based image representations directly from point prompts, achieving up to 60x speed and 35x memory reduction while improving token quality and outperforming existing methods in various benchmarks.

Authors:Wafa Alghallabi, Ritesh Thawkar, Sara Ghaboura, Ketan More, Omkar Thawakar, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Title: Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
Abstract:
Arabic poetry is one of the richest and most culturally rooted forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce Fann or Flop, the first benchmark designed to assess the comprehension of Arabic poetry by LLMs in 12 historical eras, covering 14 core poetic genres and a variety of metrical forms, from classical structures to contemporary free verse. The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. We argue that poetic comprehension offers a strong indicator of how well an LLM understands classical Arabic. Unlike surface-level tasks, this domain demands deeper interpretive reasoning and cultural sensitivity. Our evaluation of state-of-the-art LLMs shows that most models struggle with poetic understanding despite strong results on standard Arabic benchmarks. We release "Fann or Flop" along with the evaluation suite as an open-source resource to enable rigorous evaluation and advancement for Arabic language models. Code is available at: https://github.com/mbzuai-oryx/FannOrFlop.
中文摘要:本研究推出了首个评估大语言模型对阿拉伯诗歌理解的基准“Fann or Flop”,涵盖多个历史时期和诗歌体裁,发现尽管模型在标准阿拉伯语任务中表现优异,但在深层诠释和文化理解方面仍存在困难。
English Summary: The study introduces "Fann or Flop," the first benchmark to evaluate large language models' comprehension of Arabic poetry across historical eras and genres, revealing their struggles with deeper interpretive and cultural understanding despite strong performance on standard tasks.

Authors:Kazem Faghih, Wenxiao Wang, Yize Cheng, Siddhant Bharti, Gaurang Sriramanan, Sriram Balasubramanian, Parsa Hosseini, Soheil Feizi
Title: Tool Preferences in Agentic LLMs are Unreliable
Abstract:
Large language models (LLMs) can now access a wide range of external tools, thanks to the Model Context Protocol (MCP). This greatly expands their abilities as various agents. However, LLMs rely entirely on the text descriptions of tools to decide which ones to use--a process that is surprisingly fragile. In this work, we expose a vulnerability in prevalent tool/function-calling protocols by investigating a series of edits to tool descriptions, some of which can drastically increase a tool's usage from LLMs when competing with alternatives. Through controlled experiments, we show that tools with properly edited descriptions receive over 10 times more usage from GPT-4.1 and Qwen2.5-7B than tools with original descriptions. We further evaluate how various edits to tool descriptions perform when competing directly with one another and how these trends generalize or differ across a broader set of 17 different models. These phenomena, while giving developers a powerful way to promote their tools, underscore the need for a more reliable foundation for agentic LLMs to select and utilize tools and resources. Our code is publicly available at https://github.com/kazemf78/llm-unreliable-tool-preferences.
中文: 大型语言模型选择工具时易受描述篡改的影响,某些编辑后的描述可使工具使用率激增十倍以上,这凸显了建立更可靠协议的必要性。
English: Large language models' tool selection is vulnerable to manipulated descriptions, with edited versions increasing usage over tenfold in some cases, highlighting the need for more reliable protocols.

Authors:Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan
Title: One RL to See Them All: Visual Triple Unified Reinforcement Learning
Abstract:
Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
中文: V-Triune提出了一种统一强化学习系统,通过样本级数据格式化、验证器级奖励计算和源级指标监控三大组件,使视觉语言模型能够同时掌握推理与感知任务,并在各类基准测试中实现显著性能提升。
English: V-Triune introduces a unified reinforcement learning system that enables vision-language models to jointly master both reasoning and perception tasks, achieving significant performance gains across diverse benchmarks through its triple-component architecture and novel Dynamic IoU reward mechanism.
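
One way to read the "adaptive, progressive, and definite" Dynamic IoU reward is as a pass/fail reward whose IoU threshold tightens as training advances. The sketch below illustrates that reading only: the function names, the schedule format, and the threshold values are assumptions and are not taken from the V-Triune code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical schedule: (training progress at which it starts, IoU threshold).
SCHEDULE = ((0.0, 0.5), (0.3, 0.75), (0.7, 0.9))

def dynamic_iou_reward(pred, target, progress, schedule=SCHEDULE):
    """Return 1.0 if the IoU clears the threshold active at `progress`, else 0.0."""
    threshold = schedule[0][1]
    for start, value in schedule:
        if progress >= start:
            threshold = value   # later stages overwrite earlier, looser ones
    return 1.0 if iou(pred, target) >= threshold else 0.0

pred, target = (0, 0, 10, 8), (0, 0, 10, 10)   # IoU = 80 / 100 = 0.8
early = dynamic_iou_reward(pred, target, progress=0.1)
late = dynamic_iou_reward(pred, target, progress=0.9)
```

Under this assumed schedule, the same prediction is rewarded early in training and rejected late, which is the progressive tightening the sketch is meant to show.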

Authors:Jacob Hansen, Wei Lin, Junmo Kang, Muhammad Jehanzeb Mirza, Hongyin Luo, Rogerio Feris, Alan Ritter, James Glass, Leonid Karlinsky
Title: Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
Abstract:
Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many VisIT datasets are available, most are constructed using ad-hoc techniques developed independently by different groups. They are often poorly documented, lack reproducible code, and rely on paid, closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) into VisIT instructions. This leads to high costs and makes it challenging to scale, enhance quality, or generate VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach, Instructify, for converting available metadata to VisIT instructions using open LLMs. Our multi-stage Instructify features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or enhance the data quality of available VisIT datasets when applied to the same image data and metadata sources, improving GPT-4 generated VisIT instructions by ~3% on average and up to 12% on individual benchmarks using open models, such as Gemma 2 27B and LLaMa 3.1 70B. Additionally, our approach enables effective performance scaling, both in quantity and quality, by enhancing the resulting LMM performance across a wide range of benchmarks. We also analyze the impact of various factors, including conversation format, base model selection, and resampling strategies. Our code, which supports the reproduction of equal or higher-quality VisIT datasets and facilitates future metadata-to-VisIT data conversion for niche domains, is released at https://github.com/jacob-hansen/Instructify.
Chinese: 本研究提出了Instructify这一开源框架,利用开放大语言模型将图像元数据高效转化为高质量视觉指令调优数据,在超越GPT-4生成数据性能的同时,实现了可扩展、可复现的VisIT数据集构建。
English: The study introduces Instructify, an open-source framework that uses open LLMs to efficiently convert image metadata into high-quality Visual Instruction Tuning data, achieving performance improvements over GPT-4 generated data while enabling scalable and reproducible VisIT dataset creation.

Authors:Lisheng Huang, Yichen Liu, Jinhao Jiang, Rongxiang Zhang, Jiahao Yan, Junyi Li, Wayne Xin Zhao
Title: ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework
Abstract:
Recent advances in web-augmented large language models (LLMs) have exhibited strong performance in complex reasoning tasks, yet these capabilities are mostly locked in proprietary systems with opaque architectures. In this work, we propose ManuSearch, a transparent and modular multi-agent framework designed to democratize deep search for LLMs. ManuSearch decomposes the search and reasoning process into three collaborative agents: (1) a solution planning agent that iteratively formulates sub-queries, (2) an Internet search agent that retrieves relevant documents via real-time web search, and (3) a structured webpage reading agent that extracts key evidence from raw web content. To rigorously evaluate deep reasoning abilities, we introduce ORION, a challenging benchmark focused on open-web reasoning over long-tail entities, covering both English and Chinese. Experimental results show that ManuSearch substantially outperforms prior open-source baselines and even surpasses leading closed-source systems. Our work paves the way for reproducible, extensible research in open deep search systems. We release the data and code at https://github.com/RUCAIBox/ManuSearch
中文:ManuSearch是一个透明的多智能体框架,通过将深度搜索分解为规划、网络搜索和内容提取三个协作代理,使大型语言模型的深度搜索能力民主化,并在ORION基准测试中显著超越现有系统。
English: ManuSearch is a transparent, multi-agent framework that democratizes deep search for large language models by decomposing the process into planning, web search, and content extraction agents, significantly outperforming existing systems on the new ORION benchmark.

Authors:Yuxin Liu, M. Amin Rahimian, Kiran Garimella
Title: Structural Dynamics of Harmful Content Dissemination on WhatsApp
Abstract:
WhatsApp, a platform with more than two billion global users, plays a crucial role in digital communication, but also serves as a vector for harmful content such as misinformation, hate speech, and political propaganda. This study examines the dynamics of harmful message dissemination in WhatsApp groups, with a focus on their structural characteristics. Using a comprehensive data set of more than 5.1 million messages, including text, images, and videos, collected from approximately 6,000 groups in India, we reconstruct message propagation cascades to analyze dissemination patterns. Our findings reveal that harmful messages consistently achieve greater depth and breadth of dissemination compared to messages without harmful annotations, with videos and images emerging as the primary modes of dissemination. These results suggest a distinctive pattern of dissemination of harmful content. However, our analysis indicates that modality alone cannot fully account for the structural differences in propagation. The findings highlight the critical role of structural characteristics in the spread of these harmful messages, suggesting that strategies targeting structural characteristics of re-sharing could be crucial in managing the dissemination of such content on private messaging platforms.
中文: 该研究发现,WhatsApp上的有害内容比无害信息传播更广更深,视频和图像是主要传播载体,但平台的结构特性对传播模式起着关键作用。
English: This study reveals that harmful content on WhatsApp spreads more widely and deeply than non-harmful messages, with videos and images being primary carriers, yet the platform's structural characteristics play a critical role in dissemination patterns.
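
The depth and breadth of a propagation cascade can be illustrated with a minimal sketch. It assumes a cascade is a tree of (parent, child) re-share edges, that depth is the number of hops from the root to the farthest node, and that breadth is the largest number of nodes at any single level; these definitions, the function name, and the toy group labels are assumptions, not the authors' implementation.

```python
from collections import defaultdict, deque

def cascade_depth_and_breadth(edges, root):
    """Return (depth, breadth) of a cascade tree via breadth-first traversal."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    level_sizes = defaultdict(int)      # number of nodes at each hop distance
    queue = deque([(root, 0)])
    while queue:
        node, level = queue.popleft()
        level_sizes[level] += 1
        for child in children[node]:
            queue.append((child, level + 1))

    depth = max(level_sizes)            # farthest level reached from the root
    breadth = max(level_sizes.values()) # widest single level
    return depth, breadth

# Hypothetical cascade: a message first seen in group g0 and re-shared onward.
edges = [("g0", "g1"), ("g0", "g2"), ("g1", "g3"),
         ("g1", "g4"), ("g1", "g5"), ("g3", "g6")]
depth, breadth = cascade_depth_and_breadth(edges, "g0")
```

Comparing these two numbers between cascades with and without harmful annotations is the kind of structural comparison the abstract describes.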

Authors:Hyungyung Lee, Geon Choi, Jung-Oh Lee, Hangyul Yoon, Hyuk Gi Hong, Edward Choi
Title: CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
Abstract:
Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of 10 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench
中文: 该研究提出了CheXStruct和CXReasonBench,这是一个用于评估大型视觉语言模型在胸部X光分析中临床推理能力的结构化流程和基准,揭示了这些模型在抽象知识与视觉解释结合方面的不足。
English: The study introduces CheXStruct and CXReasonBench, a structured pipeline and benchmark for evaluating the clinical reasoning of Large Vision-Language Models in chest X-ray analysis, revealing their limitations in linking abstract knowledge with visual interpretation despite extensive testing.

Authors:Georgios Kementzidis, Erin Wong, John Nicholson, Ruichen Xu, Yuefan Deng
Title: An Iterative Framework for Generative Backmapping of Coarse Grained Proteins
Abstract:
The techniques of data-driven backmapping from coarse-grained (CG) to fine-grained (FG) representation often struggle with accuracy, unstable training, and physical realism, especially when applied to complex systems such as proteins. In this work, we introduce a novel iterative framework by using conditional Variational Autoencoders and graph-based neural networks, specifically designed to tackle the challenges associated with such large-scale biomolecules. Our method enables stepwise refinement from CG beads to full atomistic details. We outline the theory of iterative generative backmapping and demonstrate via numerical experiments the advantages of multistep schemes by applying them to proteins of vastly different structures with very coarse representations. This multistep approach not only improves the accuracy of reconstructions but also makes the training process more computationally efficient for proteins with ultra-CG representations.
中文: 本研究提出了一种创新的迭代框架,利用条件变分自编码器和图神经网络,有效提升了复杂蛋白质系统中从粗粒度到细粒度表示的数据驱动反向映射的准确性、训练稳定性和物理真实性。
English: This study introduces an innovative iterative framework using conditional Variational Autoencoders and graph neural networks to enhance the accuracy, training stability, and physical realism of data-driven backmapping from coarse-grained to fine-grained representations in complex protein systems.

Authors:Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
Title: Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Abstract:
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent, which leverages an agentic search strategy over segmented video clips. Unlike previous video agents that rely on manually designed, rigid workflows, our approach emphasizes the autonomous nature of agents. Given a set of search-centric tools over a multi-granular video database, our DVD agent leverages the advanced reasoning capability of the LLM to plan based on its current observation state, strategically select tools, formulate appropriate parameters for actions, and iteratively refine its internal reasoning in light of the gathered information. We perform a comprehensive evaluation on multiple long video understanding benchmarks, which demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.
Chinese: Deep Video Discovery 代理通过采用基于工具的自主搜索策略处理分段视频,克服了大型语言模型在处理长视频时的限制,在 LVBench 等基准测试中取得了领先性能。
English: The Deep Video Discovery agent overcomes LLM limitations in processing long videos by employing an autonomous, tool-based search strategy across segmented clips, achieving state-of-the-art results on benchmarks like LVBench.

Authors:Kaiyan Zhang, Xinghui Li, Jingyi Lu, Kai Han
Title: Semantic Correspondence: Unified Benchmarking and a Strong Baseline
Abstract:
Establishing semantic correspondence is a challenging task in computer vision, aiming to match keypoints with the same semantic information across different images. Benefiting from the rapid development of deep learning, remarkable progress has been made over the past decade. However, a comprehensive review and analysis of this task remains absent. In this paper, we present the first extensive survey of semantic correspondence methods. We first propose a taxonomy to classify existing methods based on the type of their method designs. These methods are then categorized accordingly, and we provide a detailed analysis of each approach. Furthermore, we aggregate and summarize the results of methods in the literature across various benchmarks into a unified comparative table, with detailed configurations to highlight performance variations. Additionally, to provide a detailed understanding of existing methods for semantic matching, we conduct thorough controlled experiments to analyse the effectiveness of the components of different methods. Finally, we propose a simple yet effective baseline that achieves state-of-the-art performance on multiple benchmarks, providing a solid foundation for future research in this field. We hope this survey serves as a comprehensive reference and consolidated baseline for future development. Code is publicly available at: https://github.com/Visual-AI/Semantic-Correspondence.
Chinese: 本文首次对计算机视觉中的语义对应方法进行了全面综述,提出了分类体系、对比分析,并开发了一个在多个基准测试中达到最优性能的新基线模型。
English: This paper presents the first comprehensive survey of semantic correspondence methods in computer vision, offering a taxonomy, comparative analysis, and a new baseline model that achieves state-of-the-art performance across multiple benchmarks.

Authors:Zizhao Chen, Yoav Artzi
Title: Knot So Simple: A Minimalistic Environment for Spatial Reasoning
Abstract:
We propose KnotGym, an interactive environment for complex, spatial reasoning and manipulation. KnotGym includes goal-oriented rope manipulation tasks with varying levels of complexity, all requiring acting from pure image observations. Tasks are defined along a clear and quantifiable axis of complexity based on the number of knot crossings, creating a natural generalization test. KnotGym has a simple observation space, allowing for scalable development, yet it highlights core challenges in integrating acute perception, spatial reasoning, and grounded manipulation. We evaluate methods of different classes, including model-based RL, model-predictive control, and chain-of-thought reasoning, and illustrate the challenges KnotGym presents. KnotGym is available at https://github.com/lil-lab/knotgym.
中文: KnotGym是一个用于空间推理和操作的交互式环境,通过基于绳结交叉数量的可量化复杂度任务,为评估不同人工智能方法提供了测试平台。
English: KnotGym is an interactive environment for spatial reasoning and manipulation that features goal-oriented rope tasks with scalable complexity based on knot crossings, enabling evaluation of various AI methods.

Authors:Xiaobao Wei, Jiawei Liu, Dongbo Yang, Junda Cheng, Changyong Shu, Wei Wang
Title: A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency
Abstract:
We find that the EPE evaluation metrics of RAFT-stereo converge inconsistently in the low and high frequency regions, resulting in high-frequency degradation (e.g., at edges and thin objects) during the iterative process. The underlying reason for the limited performance of current iterative methods is that they optimize all frequency components together without distinguishing between high and low frequencies. We propose a wavelet-based stereo matching framework (Wavelet-Stereo) for solving frequency convergence inconsistency. Specifically, we first explicitly decompose an image into high and low frequency components using discrete wavelet transform. Then, the high-frequency and low-frequency components are fed into two different multi-scale frequency feature extractors. Finally, we propose a novel LSTM-based high-frequency preservation update operator containing an iterative frequency adapter to provide adaptive refined high-frequency features at different iteration steps by fine-tuning the initial high-frequency features. By processing high and low frequency components separately, our framework can simultaneously refine high-frequency information in edges and low-frequency information in smooth regions, which is especially suitable for challenging scenes with fine details and textures in the distance. Extensive experiments demonstrate that our Wavelet-Stereo outperforms the state-of-the-art methods and ranks 1st on both the KITTI 2015 and KITTI 2012 leaderboards for almost all metrics. We will provide code and pre-trained models to encourage further exploration, application, and development of our innovative framework (https://github.com/SIA-IDE/Wavelet-Stereo).
Chinese: 该研究提出了Wavelet-Stereo立体匹配框架,通过小波变换分别处理高低频分量并采用专用特征提取器,解决了频率收敛不一致问题,在KITTI基准测试中取得最优性能。
English: The study introduces Wavelet-Stereo, a stereo matching framework that addresses frequency convergence inconsistency by separately processing high and low-frequency components using wavelet transforms and specialized feature extractors, achieving top performance on KITTI benchmarks.
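The decomposition step described in the abstract can be illustrated with a small sketch. The abstract does not say which wavelet is used, so the single-level Haar transform below is an assumption, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical rather than taken from the released code. It splits an image into one low-frequency band and three high-frequency detail bands, and shows that the split loses no information.

```python
import numpy as np

def haar_dwt2(image):
    """Single-level 2D Haar transform of an array with even height and width."""
    a = image[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = image[0::2, 1::2]  # top-right
    c = image[1::2, 0::2]  # bottom-left
    d = image[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0          # low-frequency approximation
    detail_cols = (a - b + c - d) / 2.0  # differences between adjacent columns
    detail_rows = (a + b - c - d) / 2.0  # differences between adjacent rows
    detail_diag = (a - b - c + d) / 2.0  # diagonal differences
    return low, (detail_cols, detail_rows, detail_diag)

def haar_idwt2(low, details):
    """Invert haar_dwt2 and recover the original image."""
    dc, dr, dd = details
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=float)
    out[0::2, 0::2] = (low + dc + dr + dd) / 2.0
    out[0::2, 1::2] = (low - dc + dr - dd) / 2.0
    out[1::2, 0::2] = (low + dc - dr - dd) / 2.0
    out[1::2, 1::2] = (low - dc - dr + dd) / 2.0
    return out

if __name__ == "__main__":
    # A smooth (flat) image has no high-frequency content at all.
    flat = np.full((4, 4), 3.0)
    low, details = haar_dwt2(flat)
    print(all(np.allclose(band, 0.0) for band in details))

    # A thin vertical edge shows up in the column-difference band.
    edge = np.zeros((4, 4))
    edge[:, 1:] = 1.0
    _, (cols, _, _) = haar_dwt2(edge)
    print(np.abs(cols).max() > 0)
```

In a pipeline like the one the abstract outlines, the low band and the detail bands would then be passed to separate feature extractors; that part is not sketched here.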

Authors:Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, Yuhui Zheng
Title: RemoteSAM: Towards Segment Anything for Earth Observation
Abstract:
We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets while providing compatibility with various input-output interfaces required across different task scenarios. Current systems cannot meet these requirements, as they typically utilize task-specific architectures trained on narrow data domains with limited semantic coverage. Our study addresses these limitations from two aspects: data and modeling. We first introduce an automatic data engine that enjoys significantly better scalability compared to previous human annotation or rule-based approaches. It has enabled us to create the largest dataset of its kind to date, comprising 270K image-text-mask triplets covering an unprecedented range of diverse semantic categories and attribute specifications. Based on this data foundation, we further propose a task unification paradigm that centers around referring expression segmentation. It effectively handles a wide range of vision-centric perception tasks, including classification, detection, segmentation, grounding, etc., using a single model without any task-specific heads. Combining these innovations on data and modeling, we present RemoteSAM, a foundation model that establishes new SoTA on several Earth observation perception benchmarks, outperforming other foundation models such as Falcon, GeoChat, and LHRS-Bot with significantly higher efficiency. Models and data are publicly available at https://github.com/1e12Leon/RemoteSAM.
Chinese: 本研究提出了RemoteSAM,一个用于地球观测的鲁棒视觉基础模型,通过自动数据引擎构建了同类最大数据集,并采用任务统一范式处理多种视觉任务,以高效性实现了最先进的性能。
English: This study introduces RemoteSAM, a robust visual foundation model for Earth observation that uses an automatic data engine to create the largest dataset of its kind and a task unification paradigm for handling multiple vision tasks, achieving state-of-the-art performance with high efficiency.

Authors:Yao Sun, Sining Chen, Yifan Tian, Xiao Xiang Zhu
Title: Building Floor Number Estimation from Crowdsourced Street-Level Images: Munich Dataset and Baseline Method
Abstract:
Accurate information on the number of building floors, or above-ground storeys, is essential for household estimation, utility provision, risk assessment, evacuation planning, and energy modeling. Yet large-scale floor-count data are rarely available in cadastral and 3D city databases. This study proposes an end-to-end deep learning framework that infers floor numbers directly from unrestricted, crowdsourced street-level imagery, avoiding hand-crafted features and generalizing across diverse facade styles. To enable benchmarking, we release the Munich Building Floor Dataset, a public set of over 6800 geo-tagged images collected from Mapillary and targeted field photography, each paired with a verified storey label. On this dataset, the proposed classification-regression network attains 81.2% exact accuracy and predicts 97.9% of buildings within +/-1 floor. The method and dataset together offer a scalable route to enrich 3D city models with vertical information and lay a foundation for future work in urban informatics, remote sensing, and geographic information science. Source code and data will be released under an open license at https://github.com/ya0-sun/Munich-SVI-Floor-Benchmark.
中文: 本研究提出一种端到端深度学习框架,可直接从街景图像推断建筑楼层数,在新建公开数据集上达到81.2%的精确准确率,为丰富三维城市模型提供了可扩展的解决方案。
English: This study introduces an end-to-end deep learning framework that accurately infers building floor counts from street-level images, achieving 81.2% exact accuracy on a new public dataset and offering a scalable solution to enrich 3D city models.
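The two figures reported in the abstract (exact accuracy and the share of buildings predicted within +/-1 floor) are simple to compute. The sketch below shows one plausible way to do so. The function name `floor_metrics` and the sample labels are made up for illustration and are not taken from the dataset or the authors' code.

```python
def floor_metrics(predicted, actual):
    """Return (exact accuracy, share of predictions within one floor)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    n = len(actual)
    exact = sum(p == t for p, t in zip(predicted, actual)) / n
    within_one = sum(abs(p - t) <= 1 for p, t in zip(predicted, actual)) / n
    return exact, within_one

if __name__ == "__main__":
    # Hypothetical storey counts for four buildings.
    predicted = [3, 4, 5, 2]
    actual = [3, 5, 5, 5]
    exact, within_one = floor_metrics(predicted, actual)
    print(exact, within_one)  # prints 0.5 0.75
```

The within-one measure is always at least as high as exact accuracy, because every exact match also counts as being within one floor.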

Authors:Shashank Agnihotri, David Schader, Jonas Jakubassa, Nico Sharei, Simon Kral, Mehmet Ege Kaçar, Ruben Weber, Margret Keuper
Title: SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification
Abstract:
Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends based on architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open-sourced in our GitHub repository (https://github.com/shashankskagnihotri/benchmarking_reliability_generalization) along with our complete set of 6139 evaluations. We anticipate the collected data to foster and encourage future research towards improved model reliability beyond classification.
中文摘要:本研究提出SEMSEGBENCH和DETECBENCH基准测试工具,通过大规模评估揭示了当前最优语义分割和物体检测模型在分布偏移和对抗攻击下存在的系统性缺陷,旨在推动超越分类任务的模型可靠性研究。
English Summary: This research introduces SEMSEGBENCH and DETECBENCH benchmarking tools to evaluate the robustness of semantic segmentation and object detection models against distribution shifts and adversarial attacks, revealing systematic weaknesses in state-of-the-art models through extensive testing.

Authors:Honghao Li, Yiwen Zhang, Yi Zhang, Lei Sang, Jieming Zhu
Title: Revisiting Feature Interactions from the Perspective of Quadratic Neural Networks for Click-through Rate Prediction
Abstract:
Hadamard Product (HP) has long been a cornerstone in click-through rate (CTR) prediction tasks due to its simplicity, effectiveness, and ability to capture feature interactions without additional parameters. However, the underlying reasons for its effectiveness remain unclear. In this paper, we revisit HP from the perspective of Quadratic Neural Networks (QNN), which leverage quadratic interaction terms to model complex feature relationships. We further reveal QNN's ability to expand the feature space and provide smooth nonlinear approximations without relying on activation functions. Meanwhile, we find that traditional post-activation does not further improve the performance of the QNN. Instead, mid-activation is a more suitable alternative. Through theoretical analysis and empirical evaluation of 25 QNN neuron formats, we identify a good-performing variant and make further enhancements on it. Specifically, we propose the Multi-Head Khatri-Rao Product as a superior alternative to HP and a Self-Ensemble Loss with dynamic ensemble capability within the same network to enhance computational efficiency and performance. Ultimately, we propose a novel neuron format, QNN-alpha, which is tailored for CTR prediction tasks. Experimental results show that QNN-alpha achieves new state-of-the-art performance on six public datasets while maintaining low inference latency, good scalability, and excellent compatibility. The code, running logs, and detailed hyperparameter configurations are available at: https://github.com/salmon1802/QNN.
Chinese: 本文通过二次神经网络重新审视哈达玛积在点击率预测中的有效性,提出增强的QNN-alpha模型,该模型在保持高效和兼容性的同时实现了最先进的性能。
English: This paper re-examines the Hadamard Product's effectiveness in CTR prediction through Quadratic Neural Networks, proposing the enhanced QNN-alpha model that achieves state-of-the-art performance with improved efficiency and compatibility.

Authors:Yutong Chen, Jiandong Gao, Ji Wu
Title: Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning
Abstract:
R1-style Reinforcement Learning (RL) significantly enhances Large Language Models' reasoning capabilities, yet the mechanism behind rule-based RL remains unclear. We found that small-scale SFT has substantial influence on RL but shows poor efficiency. To explain our observations, we propose an analytical framework and compare the efficiency of SFT and RL by measuring sample effect. Our hypothetical analysis shows the potential to improve SFT efficiency. Guided by our analysis, we propose Re-distillation, a technique that aims to boost the effectiveness of small-scale distillation by sampling from the RL-trained policy. Re-distillation shows consistent surprising efficiency on three datasets and both Qwen and Llama models: Re-distilled models matched RL performance with far fewer samples and less computation. As a result, on the K&K dataset, our re-distilled Qwen-2.5-1.5B model surpasses DeepSeek-V3-0324 with only 1K SFT samples. We demonstrate that re-distillation can be used to efficiently balance multiple goals in RL. Our work explains several interesting phenomena in R1-style RL, shedding light on the mechanisms behind its empirical success. Code is available at: https://github.com/on1262/deep-reasoning.
中文:R1式强化学习提升了大型语言模型的推理能力,而提出的再蒸馏技术通过利用强化学习训练的策略提高了小规模蒸馏的效率,以更少的样本和计算量实现了更优的性能。
English: R1-style Reinforcement Learning boosts reasoning in Large Language Models, and the proposed Re-distillation technique enhances small-scale distillation efficiency by leveraging RL-trained policies, achieving superior performance with fewer samples and computation.

Authors:Bryan Wong, Jong Woo Kim, Huazhu Fu, Mun Yong Yi
Title: Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling
Abstract:
Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at https://github.com/bryanwong17/HiVE-MIL
Chinese: HiVE-MIL提出了一种分层视觉语言框架,通过增强多尺度交互和跨模态对齐来改进全切片图像分类,在16样本设置下宏观F1分数最高提升4.1%,显著优于现有方法。
English: HiVE-MIL introduces a hierarchical vision-language framework that enhances multi-scale interactions and cross-modal alignment in whole slide image classification, achieving superior performance over existing methods with up to 4.1% improvement in macro F1 scores.

Authors:Simone Gaisbauer, Prabin Gyawali, Qilin Zhang, Olaf Wysocki, Boris Jutzi
Title: To Glue or Not to Glue? Classical vs Learned Image Matching for Mobile Mapping Cameras to Textured Semantic 3D Building Models
Abstract:
Feature matching is a necessary step for many computer vision and photogrammetry applications such as image registration, structure-from-motion, and visual localization. Classical handcrafted methods such as SIFT feature detection and description combined with nearest neighbour matching and RANSAC outlier removal have been state-of-the-art for mobile mapping cameras. With recent advances in deep learning, learnable methods have been introduced and proven to have better robustness and performance under complex conditions. Despite their growing adoption, a comprehensive comparison between classical and learnable feature matching methods for the specific task of semantic 3D building camera-to-model matching is still missing. This submission systematically evaluates the effectiveness of different feature-matching techniques in visual localization using textured CityGML LoD2 models. We use standard benchmark datasets (HPatches, MegaDepth-1500) and custom datasets consisting of facade textures and corresponding camera images (terrestrial and drone). For the latter, we evaluate the achievable accuracy of the absolute pose estimated using a Perspective-n-Point (PnP) algorithm, with geometric ground truth derived from geo-referenced trajectory data. The results indicate that the learnable feature matching methods vastly outperform traditional approaches regarding accuracy and robustness on our challenging custom datasets with zero to 12 RANSAC-inliers and zero to 0.16 area under the curve. We believe that this work will foster the development of model-based visual localization methods. Link to the code: https://github.com/simBauer/To_Glue_or_not_to_Glue
中文: 本研究系统比较了基于三维建筑模型的视觉定位中传统与可学习特征匹配方法,发现在具有挑战性的数据集上,可学习方法在精度和鲁棒性方面显著优于传统技术。
English: This study systematically compares classical and learnable feature matching methods for visual localization using 3D building models, finding that learnable approaches significantly outperform traditional techniques in accuracy and robustness on challenging datasets.

Authors:Ionut-Vlad Modoranu, Mher Safaryan, Erik Schultheis, Max Ryabinin, Artem Chumachenko, Dan Alistarh
Title: FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models
Abstract:
Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple, two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple matmul with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, DCT can be computed via Makhoul's $N$-point algorithm based on Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to 25% across different model sizes. Our code is available at https://github.com/IST-DASLab/ISTA-DASLab-Optimizers.
Chinese: 本研究提出了一种利用离散余弦变换(DCT)矩阵的计算高效方法,通过近似低秩梯度投影来训练大语言模型,在实现与SVD/QR方法相当性能的同时,将运行时间和内存使用量降低了最高达25%。
English: This work introduces a computationally efficient method using Discrete Cosine Transform (DCT) matrices to approximate low-rank gradient projections for training large language models, achieving performance comparable to SVD/QR methods while reducing runtime and memory usage by up to 25%.
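The two-step procedure in the abstract (one matmul with a fixed DCT matrix, then a sort to pick the best-aligned basis vectors) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The helper names `dct_basis` and `select_projection` are hypothetical, the DCT-II variant is assumed, and scoring each basis vector by the Euclidean norm of its coefficient column is an assumed choice of alignment measure.

```python
import numpy as np

def dct_basis(n):
    """Return an n x n orthonormal DCT-II matrix whose columns are basis vectors."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    rows = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    rows[0, :] = np.sqrt(1.0 / n)  # the constant vector needs a different scale
    return rows.T

def select_projection(grad, basis, rank):
    """Pick the `rank` basis columns that best align with the gradient.

    grad:  (m, n) gradient of a linear layer
    basis: (n, n) fixed orthogonal matrix, computed once before training
    """
    coeffs = grad @ basis                    # step 1: a single matmul
    scores = np.linalg.norm(coeffs, axis=0)  # assumed alignment score per column
    top = np.argsort(scores)[::-1][:rank]    # step 2: sort and keep the best
    return basis[:, top], top

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    basis = dct_basis(8)

    # Build a gradient that lies exactly in the span of DCT columns 1 and 5.
    grad = rng.normal(size=(4, 2)) @ basis[:, [1, 5]].T

    proj, chosen = select_projection(grad, basis, rank=2)
    low_rank = grad @ proj                   # (4, 2) compressed gradient
    print(sorted(chosen.tolist()))
    print(np.allclose(low_rank @ proj.T, grad))
```

Because the basis is fixed, only the selected column indices have to be kept per layer rather than a full projection matrix, which is consistent with the memory saving the abstract describes. The dense matmul here is the simple route. The FFT-based computation mentioned for large layers is not shown.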

Authors:Yujin Jeong, Arnas Uselis, Seong Joon Oh, Anna Rohrbach
Title: Diffusion Classifiers Understand Compositionality, but Conditions Apply
Abstract:
Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark Self-Bench comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
中文摘要:本研究通过涵盖多个模型和任务的系统性评估,揭示了扩散分类器在组合性理解方面的条件有效性,并深入分析了数据集领域和时间步敏感性等关键影响因素。
English Summary: This study comprehensively evaluates the discriminative capabilities of diffusion classifiers across diverse compositional tasks, revealing their conditional effectiveness while analyzing factors like dataset domains and timestep sensitivity.

Authors:Haihong Xiao, Jianan Zou, Yuxin Zhou, Ying He, Wenxiong Kang
Title: SplatCo: Structure-View Collaborative Gaussian Splatting for Detail-Preserving Rendering of Large-Scale Unbounded Scenes
Abstract:
We present SplatCo, a structure-view collaborative Gaussian splatting framework for high-fidelity rendering of complex outdoor environments. SplatCo builds upon two novel components: (1) a cross-structure collaboration module that combines global tri-plane representations, which capture coarse scene layouts, with local context grid features that represent fine surface details. This fusion is achieved through a novel hierarchical compensation strategy, ensuring both global consistency and local detail preservation; and (2) a cross-view assisted training strategy that enhances multi-view consistency by synchronizing gradient updates across viewpoints, applying visibility-aware densification, and pruning overfitted or inaccurate Gaussians based on structural consistency. Through joint optimization of structural representation and multi-view coherence, SplatCo effectively reconstructs fine-grained geometric structures and complex textures in large-scale scenes. Comprehensive evaluations on 13 diverse large-scale scenes, including Mill19, MatrixCity, Tanks & Temples, WHU, and custom aerial captures, demonstrate that SplatCo consistently achieves higher reconstruction quality than state-of-the-art methods, with PSNR improvements of 1-2 dB and SSIM gains of 0.1 to 0.2. These results establish a new benchmark for high-fidelity rendering of large-scale unbounded scenes. Code and additional information are available at https://github.com/SCUT-BIP-Lab/SplatCo.
中文: SplatCo是一种结合全局三平面与局部网格特征的结构-视图协作高斯泼溅框架,通过结构和多视角优化,实现了复杂户外场景的高保真渲染和卓越重建质量。
English: SplatCo is a collaborative Gaussian splatting framework that integrates global tri-plane and local grid features for high-fidelity rendering of complex outdoor scenes, achieving superior reconstruction quality through structural and multi-view optimization.

Authors:Zigeng Chen, Xinyin Ma, Gongfan Fang, Ruonan Yu, Xinchao Wang
Title: VeriThinker: Learning to Verify Makes Reasoning Model Efficient
Abstract:
Large Reasoning Models (LRMs) excel at complex tasks using Chain-of-Thought (CoT) reasoning. However, their tendency to overthink leads to unnecessarily lengthy reasoning chains, dramatically increasing inference costs. To mitigate this issue, we introduce VeriThinker, a novel approach for CoT compression. Unlike conventional methods that fine-tune LRMs directly on the original reasoning task using synthetic concise CoT data, we innovatively fine-tune the model solely through an auxiliary verification task. By training LRMs to accurately verify the correctness of CoT solutions, the LRMs inherently become more discerning about the necessity of subsequent self-reflection steps, thereby effectively suppressing overthinking. Extensive experiments validate that VeriThinker substantially reduces reasoning chain lengths while maintaining or even slightly improving accuracy. When applied to DeepSeek-R1-Distill-Qwen-7B, our approach reduces reasoning tokens on MATH500 from 3790 to 2125 while improving accuracy by 0.8% (94.0% to 94.8%), and on AIME25, tokens decrease from 14321 to 10287 with a 2.1% accuracy gain (38.7% to 40.8%). Additionally, our experiments demonstrate that VeriThinker can also be zero-shot generalized to speculative reasoning. Code is available at https://github.com/czg1225/VeriThinker
中文: VeriThinker是一种创新的思维链压缩方法,通过仅基于辅助验证任务微调大型推理模型,有效抑制过度思考,在保持或提升准确率的同时显著缩短推理链长度。
English: VeriThinker is a novel CoT compression method that reduces overthinking in Large Reasoning Models by fine-tuning them solely through an auxiliary verification task, effectively shortening reasoning chains while maintaining or improving accuracy.
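The token counts quoted in the abstract are easier to compare as relative reductions. The short calculation below derives those percentages from the reported numbers. The percentages are not stated in the abstract itself, and the helper name `relative_reduction` is made up for illustration.

```python
def relative_reduction(before, after):
    """Fraction of tokens saved when going from `before` to `after`."""
    return (before - after) / before

if __name__ == "__main__":
    # Token counts as reported for DeepSeek-R1-Distill-Qwen-7B.
    reported = {"MATH500": (3790, 2125), "AIME25": (14321, 10287)}
    for name, (before, after) in reported.items():
        print(f"{name}: {relative_reduction(before, after):.1%} fewer reasoning tokens")
    # prints:
    # MATH500: 43.9% fewer reasoning tokens
    # AIME25: 28.2% fewer reasoning tokens
```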

Authors:Nayoung Kim, Seongsu Kim, Sungsoo Ahn
Title: Flexible MOF Generation with Torsion-Aware Flow Matching
Abstract:
Designing metal-organic frameworks (MOFs) with novel chemistries is a longstanding challenge due to their large combinatorial space and complex 3D arrangements of the building blocks. While recent deep generative models have enabled scalable MOF generation, they assume (1) a fixed set of building blocks and (2) known local 3D coordinates of building blocks. However, this limits their ability to (1) design novel MOFs and (2) generate the structure using novel building blocks. We propose a two-stage MOF generation framework that overcomes these limitations by modeling both chemical and geometric degrees of freedom. First, we train a SMILES-based autoregressive model to generate metal and organic building blocks, paired with a cheminformatics toolkit for 3D structure initialization. Second, we introduce a flow matching model that predicts translations, rotations, and torsional angles to assemble the blocks into valid 3D frameworks. Our experiments demonstrate improved reconstruction accuracy, the generation of valid, novel, and unique MOFs, and the ability to create novel building blocks. Our code is available at https://github.com/nayoung10/MOFFlow-2.
Chinese: 该研究提出的两阶段框架通过化学构建块生成和几何结构组装,克服了现有深度生成模型的局限,实现了新型金属有机框架材料的创新设计,并提高了生成精度和独特性。
English: The proposed two-stage framework overcomes the limitations of existing deep generative models by enabling the generation of novel metal-organic frameworks through chemical building block creation and geometric assembly, resulting in improved accuracy and novel MOF designs.

Authors:Zheyang Huang, Jagannath Aryal, Saeid Nahavandi, Xuequan Lu, Chee Peng Lim, Lei Wei, Hailing Zhou
Title: Object-level Cross-view Geo-localization with Location Enhancement and Multi-Head Cross Attention
Abstract:
Cross-view geo-localization determines the location of a query image, captured by a drone or ground-based camera, by matching it to a geo-referenced satellite image. While traditional approaches focus on image-level localization, many applications, such as search-and-rescue, infrastructure inspection, and precision delivery, demand object-level accuracy. This enables users to prompt a specific object with a single click on a drone image to retrieve precise geo-tagged information of the object. However, variations in viewpoints, timing, and imaging conditions pose significant challenges, especially when identifying visually similar objects in extensive satellite imagery. To address these challenges, we propose an Object-level Cross-view Geo-localization Network (OCGNet). It integrates user-specified click locations using Gaussian Kernel Transfer (GKT) to preserve location information throughout the network. This cue is dually embedded into the feature encoder and feature matching blocks, ensuring robust object-specific localization. Additionally, OCGNet incorporates a Location Enhancement (LE) module and a Multi-Head Cross Attention (MHCA) module to adaptively emphasize object-specific features or expand focus to relevant contextual regions when necessary. OCGNet achieves state-of-the-art performance on a public dataset, CVOGL. It also demonstrates few-shot learning capabilities, effectively generalizing from limited examples, making it suitable for diverse applications (https://github.com/ZheyangH/OCGNet).
中文: 跨视角地理定位方法OCGNet通过整合用户指定位置和自适应模块,提升了对象级定位精度,在公开数据集上实现最优性能并具备小样本学习能力,适用于多种应用场景。
English: Cross-view geo-localization using OCGNet enhances object-level accuracy by integrating user-specified locations and adaptive modules, achieving state-of-the-art performance and few-shot learning capabilities for diverse applications.
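
The abstract above says a user click is carried through the network as a Gaussian-shaped location cue. The sketch below shows only that general idea: turning one click into a 2D Gaussian heatmap. It is not the authors' Gaussian Kernel Transfer; the function name `gaussian_click_map`, the grid size, and the `sigma` value are assumptions made for illustration.

```python
import math


def gaussian_click_map(height, width, click_row, click_col, sigma=2.0):
    """Return a height x width grid that peaks at 1.0 on the clicked pixel.

    Hypothetical helper: the abstract does not specify how the click is encoded.
    """
    heatmap = []
    for r in range(height):
        row = []
        for c in range(width):
            # Squared Euclidean distance from this pixel to the click.
            dist_sq = (r - click_row) ** 2 + (c - click_col) ** 2
            row.append(math.exp(-dist_sq / (2.0 * sigma ** 2)))
        heatmap.append(row)
    return heatmap


# Usage: a click at row 2, column 3 of a 5 x 5 image.
click_map = gaussian_click_map(height=5, width=5, click_row=2, click_col=3, sigma=1.0)
```

A map like this could be concatenated with image features as an extra channel, so that nearby pixels receive higher weight than distant ones. How OCGNet actually injects the cue is described in the paper, not here.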

Authors:Bin Wu, Wei Wang, Yahui Liu, Zixiang Li, Yao Zhao
Title: DiffusionReward: Enhancing Blind Face Restoration through Reward Feedback Learning
Abstract:
Reward Feedback Learning (ReFL) has recently shown great potential in aligning model outputs with human preferences across various generative tasks. In this work, we introduce a ReFL framework, named DiffusionReward, to the Blind Face Restoration task for the first time. DiffusionReward effectively overcomes the limitations of diffusion-based methods, which often fail to generate realistic facial details and exhibit poor identity consistency. The core of our framework is the Face Reward Model (FRM), which is trained using carefully annotated data. It provides feedback signals that play a pivotal role in steering the optimization process of the restoration network. In particular, our ReFL framework incorporates a gradient flow into the denoising process of off-the-shelf face restoration methods to guide the update of model parameters. The guiding gradient is collaboratively determined by three aspects: (i) the FRM to ensure the perceptual quality of the restored faces; (ii) a regularization term that functions as a safeguard to preserve generative diversity; and (iii) a structural consistency constraint to maintain facial fidelity. Furthermore, the FRM undergoes dynamic optimization throughout the process. It not only ensures that the restoration network stays precisely aligned with the real face manifold, but also effectively prevents reward hacking. Experiments on synthetic and wild datasets demonstrate that our method outperforms state-of-the-art methods, significantly improving identity consistency and facial details. The source codes, data, and models are available at: https://github.com/01NeuralNinja/DiffusionReward.
中文摘要:本文提出DiffusionReward,一种用于盲人脸复原的奖励反馈学习框架,通过面部奖励模型和引导梯度优化显著提升人脸细节与身份一致性,实验证明其性能优于现有最先进方法。
English Summary: This paper introduces DiffusionReward, a novel Reward Feedback Learning framework for Blind Face Restoration that enhances facial details and identity consistency through a Face Reward Model and guided gradient optimization, outperforming existing methods.

Authors:Bram Grooten, Farid Hasanov, Chenxiang Zhang, Qiao Xiao, Boqian Wu, Zahra Atashgahi, Ghada Sokar, Shiwei Liu, Lu Yin, Elena Mocanu, Mykola Pechenizkiy, Decebal Constantin Mocanu
Title: NeuroTrails: Training with Dynamic Sparse Heads as the Key to Effective Ensembling
Abstract:
Model ensembles have long been a cornerstone for improving generalization and robustness in deep learning. However, their effectiveness often comes at the cost of substantial computational overhead. To address this issue, state-of-the-art methods aim to replicate ensemble-class performance without requiring multiple independently trained networks. Unfortunately, these algorithms often still demand considerable compute at inference. In response to these limitations, we introduce NeuroTrails, a sparse multi-head architecture with dynamically evolving topology. This unexplored model-agnostic training paradigm improves ensemble performance while reducing the required resources. We analyze the underlying reason for its effectiveness and observe that the various neural trails induced by dynamic sparsity attain a "Goldilocks zone" of prediction diversity. NeuroTrails displays efficacy with convolutional and transformer-based architectures on computer vision and language tasks. Experiments on ResNet-50/ImageNet, LLaMA-350M/C4, among many others, demonstrate increased accuracy and stronger robustness in zero-shot generalization, while requiring significantly fewer parameters.
中文摘要:NeuroTrails采用稀疏多头架构和动态拓扑结构,在提升模型集成性能的同时显著减少资源需求,在多种架构和任务中实现了更高的准确性和更强的鲁棒性。
English Summary: NeuroTrails introduces a sparse multi-head architecture with dynamic topology to enhance ensemble performance efficiently, achieving greater accuracy and robustness with fewer parameters across various models and tasks.

Authors:Litao Guo, Xinli Xu, Luozhou Wang, Jiantao Lin, Jinsong Zhou, Zixin Zhang, Bolan Su, Ying-Cong Chen
Title: ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback
Abstract:
With the rapid advancement of generative models, general-purpose generation has gained increasing attention as a promising approach to unify diverse tasks across modalities within a single system. Despite this progress, existing open-source frameworks often remain fragile and struggle to support complex real-world applications due to the lack of structured workflow planning and execution-level feedback. To address these limitations, we present ComfyMind, a collaborative AI system designed to enable robust and scalable general-purpose generation, built on the ComfyUI platform. ComfyMind introduces two core innovations: Semantic Workflow Interface (SWI) that abstracts low-level node graphs into callable functional modules described in natural language, enabling high-level composition and reducing structural errors; Search Tree Planning mechanism with localized feedback execution, which models generation as a hierarchical decision process and allows adaptive correction at each stage. Together, these components improve the stability and flexibility of complex generative workflows. We evaluate ComfyMind on three public benchmarks: ComfyBench, GenEval, and Reason-Edit, which span generation, editing, and reasoning tasks. Results show that ComfyMind consistently outperforms existing open-source baselines and achieves performance comparable to GPT-Image-1. ComfyMind paves a promising path for the development of open-source general-purpose generative AI systems. Project page: https://github.com/LitaoGuo/ComfyMind
Chinese: ComfyMind 是一个协作式人工智能系统,通过语义工作流接口和搜索树规划机制提升通用生成能力,在多项基准测试中表现优异,性能媲美 GPT-Image-1。
English: ComfyMind is a collaborative AI system that enhances general-purpose generation through its Semantic Workflow Interface and Search Tree Planning mechanism, achieving superior performance on benchmarks and rivaling GPT-Image-1.

Authors:Nikita Ivanov, Mark Klimov, Dmitry Glukhikh, Tatiana Chernysheva, Igor Glukhikh
Title: Track Anything Annotate: Video annotation and dataset generation of computer vision models
Abstract:
Modern machine learning methods require significant amounts of labelled data, making the preparation process time-consuming and resource-intensive. In this paper, we describe the prototyping of a tool for annotating and generating training datasets based on video tracking and segmentation. We examine different approaches to solving this problem, from technology selection through to final implementation. The developed prototype significantly accelerates dataset generation compared to manual annotation. All resources are available at https://github.com/lnikioffic/track-anything-annotate
中文摘要:本文提出了一种基于视频跟踪与分割的原型工具,用于自动标注和生成训练数据集,相比人工标注显著加快了处理速度。
English Summary: The paper introduces a prototype tool that uses video tracking and segmentation to automate the annotation and generation of training datasets, significantly speeding up the process compared to manual methods.

Authors:Hongshu Guo, Zeyuan Ma, Yining Ma, Xinglin Zhang, Wei-Neng Chen, Yue-Jiao Gong
Title: DesignX: Human-Competitive Algorithm Designer for Black-Box Optimization
Abstract:
Designing effective black-box optimizers is hampered by limited problem-specific knowledge and manual control that spans months for almost every detail. In this paper, we present DesignX, the first automated algorithm design framework that generates an effective optimizer specific to a given black-box optimization problem within seconds. Rooted in the first principles, we identify two key sub-tasks: 1) algorithm structure generation and 2) hyperparameter control. To enable systematic construction, a comprehensive modular algorithmic space is first built, embracing hundreds of algorithm components collected from decades of research. We then introduce a dual-agent reinforcement learning system that collaborates on structural and parametric design through a novel cooperative training objective, enabling large-scale meta-training across 10k diverse instances. Remarkably, through days of autonomous learning, the DesignX-generated optimizers continuously surpass human-crafted optimizers by orders of magnitude, either on synthetic testbed or on realistic optimization scenarios such as Protein-docking, AutoML and UAV path planning. Further in-depth analysis reveals DesignX's capability to discover non-trivial algorithm patterns beyond expert intuition, which, conversely, provides valuable design insights for the optimization community. We provide DesignX's inference code at https://github.com/MetaEvo/DesignX.
中文摘要:DesignX是一个自动化算法设计框架,通过双智能体强化学习在数秒内生成针对特定黑盒优化问题的有效优化器,其性能在多种实际应用场景中显著超越人工设计的算法。
English Summary: DesignX is an automated framework that rapidly generates specialized black-box optimizers through dual-agent reinforcement learning, consistently outperforming human-designed algorithms across various complex scenarios.

Authors:Ziwei Zhou, Rui Wang, Zuxuan Wu
Title: Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
Abstract:
Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. In this paper, we introduce: 1) Daily-Omni, an Audio-Visual Questioning and Answering benchmark comprising 684 videos of daily life scenarios from diverse sources, rich in both audio and visual information, and featuring 1197 multiple-choice QA pairs across 6 major tasks; 2) Daily-Omni QA Generation Pipeline, which includes automatic annotation, QA generation and QA optimization, and significantly improves the efficiency of human evaluation and the scalability of the benchmark; 3) Daily-Omni-Agent, a training-free agent utilizing an open-source Visual Language Model (VLM), Audio Language Model (ALM) and Automatic Speech Recognition (ASR) model to establish a baseline for this benchmark. The results show that current MLLMs still struggle significantly with tasks requiring audio-visual integration, but combining VLMs and ALMs with simple temporal alignment techniques can achieve substantially better performance. Code and benchmark are available at https://github.com/Lliar-liar/Daily-Omni.
中文: 近期多模态大语言模型在视觉和音频任务上表现良好,但跨模态同步处理能力不足,为此研究提出了Daily-Omni基准和无需训练的代理,结合视觉与音频模型显著提升了视听整合任务的性能。
English: Recent Multimodal Large Language Models show promising results in visual and audio tasks separately but struggle with cross-modal integration, prompting the development of the Daily-Omni benchmark and a training-free agent that combines visual and audio models to improve performance.

Authors:Hao Wang, Licheng Pan, Zhichao Chen, Xu Chen, Qingyang Dai, Lei Wang, Haoxuan Li, Zhouchen Lin
Title: Time-o1: Time-Series Forecasting Needs Transformed Label Alignment
Abstract:
Training time-series forecast models presents unique challenges in designing effective learning objectives. Existing methods predominantly utilize the temporal mean squared error, which faces two critical challenges: (1) label autocorrelation, which leads to bias from the label sequence likelihood; (2) excessive amount of tasks, which increases with the forecast horizon and complicates optimization. To address these challenges, we propose Time-o1, a transformation-augmented learning objective tailored for time-series forecasting. The central idea is to transform the label sequence into decorrelated components with discriminated significance. Models are then trained to align the most significant components, thereby effectively mitigating label autocorrelation and reducing task amount. Extensive experiments demonstrate that Time-o1 achieves state-of-the-art performance and is compatible with various forecast models. Code is available at https://github.com/Master-PLC/Time-o1.
中文:提出的Time-o1方法通过将标签序列转换为去相关的分量来缓解标签自相关并降低任务复杂度,在时间序列预测中实现了最先进的性能。
English: The proposed Time-o1 method transforms label sequences into decorrelated components to mitigate label autocorrelation and reduce task complexity, achieving state-of-the-art performance in time-series forecasting.

Authors:Xuchen Pan, Yanxi Chen, Yushuo Chen, Yuchang Sun, Daoyuan Chen, Wenhao Zhang, Yuexiang Xie, Yilun Huang, Yilei Zhang, Dawei Gao, Weijie Shi, Yaliang Li, Bolin Ding, Jingren Zhou
Title: Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models
Abstract:
Trinity-RFT is a general-purpose, unified and easy-to-use framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a modular and decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT; (2) seamless integration for agent-environment interaction with high efficiency and robustness; and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for development and research of advanced reinforcement learning paradigms at both macroscopic and microscopic levels. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples, applications and experiments that demonstrate its functionalities and user-friendliness.
中文: Trinity-RFT 是一个通用、统一且易用的强化微调框架,采用模块化设计,整合了多种RFT模式,高效处理智能体与环境交互,并提供优化的数据管道,适用于广泛的应用场景和研究开发。
English: Trinity-RFT is a versatile and user-friendly framework for reinforcement fine-tuning of large language models, featuring a modular design that unifies various RFT modes, integrates agent-environment interactions efficiently, and provides optimized data pipelines for diverse applications and research.

Authors:Boxu Chen, Ziwei Zheng, Le Yang, Zeyu Geng, Zhengyu Zhao, Chenhao Lin, Chao Shen
Title: Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations
Abstract:
Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH), generating outputs inconsistent with visual inputs. While previous work has proposed methods to reduce OH, the visual decision-making mechanisms that lead to hallucinations remain poorly understood. In this paper, we propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs. By tackling dual challenges of modeling complex vision-language interactions and eliminating spurious activation artifacts, VaLSe can generate visual contribution maps that trace how specific visual inputs influence individual output tokens. These maps reveal the model's vision-aware focus regions, which are then used to perform latent space steering, realigning internal representations toward semantically relevant content and reducing hallucinated outputs. Extensive experiments demonstrate that VaLSe is a powerful interpretability tool and an effective method for enhancing model robustness against OH across multiple benchmarks. Furthermore, our analysis uncovers limitations in existing OH evaluation metrics, underscoring the need for more nuanced, interpretable, and visually grounded OH benchmarks in future work. Code is available at: https://github.com/Ziwei-Zheng/VaLSe.
中文: 大型视觉语言模型存在物体幻觉问题,VaLSe框架通过解释视觉贡献并引导潜在表征来减少幻觉输出,从而提升模型的鲁棒性和可解释性。
English: Large Vision-Language Models suffer from object hallucination, and the proposed VaLSe framework addresses this by interpreting visual contributions and steering latent representations to enhance robustness and interpretability.

Authors:Ping Li, Jianan Ni, Bo Pang
Title: Temporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition
Abstract:
Action recognition models using deep learning are vulnerable to adversarial examples, which are transferable across other models trained on the same data modality. Existing transferable attack methods face two major challenges: 1) they heavily rely on the assumption that the decision boundaries of the surrogate (a.k.a., source) model and the target model are similar, which limits the adversarial transferability; and 2) their decision boundary difference makes the attack direction uncertain, which may result in the gradient oscillation, weakening the adversarial attack. This motivates us to propose a Background Mixup-induced Temporal Consistency (BMTC) attack method for action recognition. From the input transformation perspective, we design a model-agnostic background adversarial mixup module to reduce the surrogate-target model dependency. In particular, we randomly sample one video from each category and make its background frame, while selecting the background frame with the top attack ability for mixup with the clean frame by reinforcement learning. Moreover, to ensure an explicit attack direction, we leverage the background category as guidance for updating the gradient of adversarial example, and design a temporal gradient consistency loss, which strengthens the stability of the attack direction on subsequent frames. Empirical studies on two video datasets, i.e., UCF101 and Kinetics-400, and one image dataset, i.e., ImageNet, demonstrate that our method significantly boosts the transferability of adversarial examples across several action/image recognition models. Our code is available at https://github.com/mlvccn/BMTC_TransferAttackVid.
中文: 提出的背景混合诱导时序一致性(BMTC)攻击方法通过背景混合和时序梯度一致性,减少了对替代模型的依赖并稳定了攻击方向,从而显著提升了动作识别中对抗样本的迁移性。
English: The proposed Background Mixup-induced Temporal Consistency (BMTC) attack method enhances adversarial transferability for action recognition by reducing dependency on surrogate models and stabilizing attack directions through background mixup and temporal gradient consistency.

Authors:Tazeek Bin Abdur Rakib, Ambuj Mehrish, Lay-Ki Soon, Wern Han Lim, Soujanya Poria
Title: DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors
Abstract:
Large-language-model (LLM) agents excel at reactive dialogue but struggle with proactive, goal-driven interactions due to myopic decoding and costly planning. We introduce DialogXpert, which leverages a frozen LLM to propose a small, high-quality set of candidate actions per turn and employs a compact Q-network over fixed BERT embeddings trained via temporal-difference learning to select optimal moves within this reduced space. By tracking the user's emotions, DialogXpert tailors each decision to advance the task while nurturing a genuine, empathetic connection. Across negotiation, emotional support, and tutoring benchmarks, DialogXpert drives conversations to under 3 turns with success rates exceeding 94% and, with a larger LLM prior, pushes success above 97% while markedly improving negotiation outcomes. This framework delivers real-time, strategic, and emotionally intelligent dialogue planning at scale. Code available at https://github.com/declare-lab/dialogxpert/
Chinese: DialogXpert通过利用冻结大语言模型生成候选行动,并采用紧凑Q网络选择最优决策,显著提升了LLM代理在目标驱动对话中的表现,在不到三轮对话中实现超过94%的成功率,同时融入情感智能以建立共情连接。
English: DialogXpert enhances LLM agents for proactive, goal-driven dialogues by using a frozen LLM to generate candidate actions and a compact Q-network to select optimal moves, achieving over 94% success rates in under three turns while incorporating emotional intelligence for empathetic interactions.

Authors:Peilin Chen, Xiaoxuan Yang
Title: Titanus: Enabling KV Cache Pruning and Quantization On-the-Fly for LLM Acceleration
Abstract:
Large language models (LLMs) have gained great success in various domains. Existing systems cache Key and Value within the attention block to avoid redundant computations. However, the size of key-value cache (KV cache) is unpredictable and can even be tens of times larger than the weights in the long context length scenario. In this work, we propose Titanus, a software-hardware co-design to efficiently compress the KV cache on-the-fly. We first propose the cascade pruning-quantization (CPQ) method to reduce the KV cache movement. The hierarchical quantization extension strategy is introduced to tackle the non-independent per-channel quantization issue. To further reduce KV cache movement, we transfer only the non-zero KV cache between the accelerator and off-chip memory. Moreover, we customize a two-stage design space exploration framework for the CPQ method. A novel pipeline and parallelism dataflow is designed to reduce the first token generation time. Experiments show that Titanus achieves 159.9x (49.6x) and 34.8x (29.2x) energy efficiency (throughput) compared to Nvidia A100 GPU and FlightLLM respectively. The code for Titanus is available at https://github.com/peilin-chen/Titanus-for-LLM-acceleration.
中文: Titanus是一种软硬件协同设计系统,通过级联剪枝-量化和分层量化策略高效压缩大语言模型中的键值缓存,相比现有解决方案显著提升了能效和吞吐量。
English: Titanus is a software-hardware co-design system that efficiently compresses the key-value cache in large language models through cascade pruning-quantization and hierarchical quantization strategies, achieving significant energy efficiency and throughput improvements over existing solutions.
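
The abstract above combines three ideas: pruning small KV-cache entries, quantizing the rest, and moving only the non-zero values. The sketch below illustrates them in a generic form, assuming simple magnitude pruning and one symmetric per-tensor scale. It is not the paper's CPQ method or its hierarchical quantization extension, and all function names and thresholds are hypothetical.

```python
def prune_and_quantize(values, prune_threshold=0.1, n_bits=8):
    """Zero out small entries, then quantize the survivors to signed integers.

    Assumption: one symmetric scale for the whole list (not per-channel).
    """
    kept = [v if abs(v) >= prune_threshold else 0.0 for v in values]
    max_abs = max((abs(v) for v in kept), default=0.0)
    q_max = 2 ** (n_bits - 1) - 1  # 127 for 8 bits
    scale = max_abs / q_max if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in kept]
    return quantized, scale


def to_sparse(quantized):
    """Keep only (index, value) pairs for the non-zero entries."""
    return [(i, q) for i, q in enumerate(quantized) if q != 0]


def from_sparse(pairs, length, scale):
    """Rebuild a dense, dequantized list from the sparse pairs."""
    dense = [0.0] * length
    for i, q in pairs:
        dense[i] = q * scale
    return dense


# Usage: five cache values, two of which survive pruning.
values = [0.02, -1.27, 0.5, -0.05, 0.0]
quantized, scale = prune_and_quantize(values)
sparse = to_sparse(quantized)
restored = from_sparse(sparse, len(values), scale)
```

Only the `sparse` pairs and the single `scale` would need to cross the memory boundary in this toy setup, which is the motivation for transferring non-zero entries alone.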

Authors:Ildi Alla, Valeria Loscri
Title: Sec5GLoc: Securing 5G Indoor Localization via Adversary-Resilient Deep Learning Architecture
Abstract:
Emerging 5G millimeter-wave and sub-6 GHz networks enable high-accuracy indoor localization, but security and privacy vulnerabilities pose serious challenges. In this paper, we identify and address threats including location spoofing and adversarial signal manipulation against 5G-based indoor localization. We formalize a threat model encompassing attackers who inject forged radio signals or perturb channel measurements to mislead the localization system. To defend against these threats, we propose an adversary-resilient localization architecture that combines deep learning fingerprinting with physical domain knowledge. Our approach integrates multi-anchor Channel Impulse Response (CIR) fingerprints with Time Difference of Arrival (TDoA) features and known anchor positions in a hybrid Convolutional Neural Network (CNN) and multi-head attention network. This design inherently checks geometric consistency and dynamically down-weights anomalous signals, making localization robust to tampering. We formulate the secure localization problem and demonstrate, through extensive experiments on a public 5G indoor dataset, that the proposed system achieves a mean error of approximately 0.58 m under mixed Line-of-Sight (LOS) and Non-Line-of-Sight (NLOS) trajectories in benign conditions and gracefully degrades to around 0.81 m under attack scenarios. We also show via ablation studies that each architecture component (attention mechanism, TDoA, etc.) is critical for both accuracy and resilience, reducing errors by 4-5 times compared to baselines. In addition, our system runs in real time, localizing the user in just 1 ms on a simple CPU. The code has been released to ensure reproducibility (https://github.com/sec5gloc/Sec5GLoc).
中文: 本文提出了一种安全的5G室内定位系统,通过融合深度学习与物理信号特征来防御位置欺骗和信号篡改攻击,在受攻击场景下仍能保持高精度和实时定位能力。
English: This paper proposes a secure 5G indoor localization system that combines deep learning with physical signal features to defend against location spoofing and signal manipulation, achieving high accuracy and real-time performance even under attacks.

Authors:Yanping Fu, Xinyuan Liu, Tianyu Li, Yike Ma, Yucheng Zhang, Feng Dai
Title: TopoPoint: Enhance Topology Reasoning via Endpoint Detection in Autonomous Driving
Abstract:
Topology reasoning, which unifies perception and structured reasoning, plays a vital role in understanding intersections for autonomous driving. However, its performance heavily relies on the accuracy of lane detection, particularly at connected lane endpoints. Existing methods often suffer from lane endpoint deviation, leading to incorrect topology construction. To address this issue, we propose TopoPoint, a novel framework that explicitly detects lane endpoints and jointly reasons over endpoints and lanes for robust topology reasoning. During training, we independently initialize point and lane queries, and propose Point-Lane Merge Self-Attention to enhance global context sharing by incorporating geometric distances between points and lanes as an attention mask. We further design a Point-Lane Graph Convolutional Network to enable mutual feature aggregation between point and lane queries. During inference, we introduce a Point-Lane Geometry Matching algorithm that computes distances between detected points and lanes to refine lane endpoints, effectively mitigating endpoint deviation. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoPoint achieves state-of-the-art performance in topology reasoning (48.8 on OLS). Additionally, we propose DET$_p$ to evaluate endpoint detection, under which our method significantly outperforms existing approaches (52.6 vs. 45.2 on DET$_p$). The code is released at https://github.com/Franpin/TopoPoint.
中文: TopoPoint是一种新颖的框架,通过显式检测车道端点并与车道联合推理来增强自动驾驶中的拓扑推理能力,在基准测试中实现了最先进的性能。
English: TopoPoint is a novel framework that explicitly detects lane endpoints and jointly reasons over them with lanes to enhance topology reasoning in autonomous driving, achieving state-of-the-art performance on benchmarks.

Authors:Patrick Leask, Neel Nanda, Noura Al Moubayed
Title: Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
Abstract:
Sparse autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable latents. However, due to their substantial training cost, most academic research uses open-source SAEs which are only available for a restricted set of models of up to 27B parameters. SAE latents are also learned from a dataset of activations, which means they do not transfer between models. Motivated by relative representation similarity measures, we introduce Inference-Time Decomposition of Activations (ITDA) models, an alternative method for decomposing language model activations. To train an ITDA, we greedily construct a dictionary of language model activations on a dataset of prompts, selecting those activations which were worst approximated by matching pursuit on the existing dictionary. ITDAs can be trained in just 1% of the time required for SAEs, using 1% of the data. This allowed us to train ITDAs on Llama-3.1 70B and 405B on a single consumer GPU. ITDAs can achieve similar reconstruction performance to SAEs on some target LLMs, but generally incur a performance penalty. However, ITDA dictionaries enable cross-model comparisons, and a simple Jaccard similarity index on ITDA dictionaries outperforms existing methods like CKA, SVCCA, and relative representation similarity metrics. ITDAs provide a cheap alternative to SAEs where computational resources are limited, or when cross-model comparisons are necessary. Code available at https://github.com/pleask/itda.
中文: ITDA模型作为稀疏自编码器的高效替代方案,能以极低的计算成本分解大语言模型激活,并支持跨模型比较。
English: ITDA models offer a computationally efficient alternative to sparse autoencoders for decomposing LLM activations, enabling cross-model comparisons with minimal training time and data.
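
The abstract above describes greedily building a dictionary from activations that matching pursuit approximates poorly, then comparing dictionaries with a Jaccard index. The sketch below is a toy version of that loop under stated assumptions: a fixed relative-error threshold decides what enters the dictionary, each activation carries a hypothetical prompt key, and all vectors are non-zero. It is not the authors' implementation; see the linked repository for that.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def matching_pursuit_residual(x, atoms, n_steps):
    """Greedily subtract the best-matching unit-norm atom n_steps times."""
    residual = list(x)
    for _ in range(n_steps):
        if not atoms:
            break
        best = max(atoms, key=lambda a: abs(dot(residual, a)))
        coeff = dot(residual, best)
        residual = [r - coeff * b for r, b in zip(residual, best)]
    return residual


def build_dictionary(activations, n_steps=2, max_rel_error=0.1):
    """Add an activation when the current dictionary reconstructs it poorly.

    `activations` is a list of (key, vector) pairs; vectors must be non-zero.
    The threshold rule is an assumption made for this sketch.
    """
    keys, atoms = [], []
    for key, x in activations:
        residual = matching_pursuit_residual(x, atoms, n_steps)
        rel_error = math.sqrt(dot(residual, residual) / dot(x, x))
        if rel_error > max_rel_error:
            keys.append(key)
            atoms.append(normalize(x))
    return keys, atoms


def jaccard(keys_a, keys_b):
    """Jaccard index of two dictionaries identified by their prompt keys."""
    a, b = set(keys_a), set(keys_b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Usage: four toy 2-D activations; only two are needed to span the rest.
activations = [("p0", [1.0, 0.0]), ("p1", [2.0, 0.0]),
               ("p2", [0.0, 3.0]), ("p3", [1.0, 1.0])]
keys, atoms = build_dictionary(activations)
similarity = jaccard(keys, ["p0", "p3"])
```

Because a dictionary entry here is just a reference to the prompt that produced it, two models run on the same prompts yield key sets that can be compared directly, which is what the Jaccard call illustrates.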

Authors:Ziyu Ge, Yuhao Wu, Daniel Wai Kit Chin, Roy Ka-Wei Lee, Rui Cao
Title: Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLMs
Abstract:
Large Language Models (LLMs) augmented with retrieval mechanisms have demonstrated significant potential in fact-checking tasks by integrating external knowledge. However, their reliability decreases when confronted with conflicting evidence from sources of varying credibility. This paper presents the first systematic evaluation of Retrieval-Augmented Generation (RAG) models for fact-checking in the presence of conflicting evidence. To support this study, we introduce CONFACT (Conflicting Evidence for Fact-Checking) (Dataset available at https://github.com/zoeyyes/CONFACT), a novel dataset comprising questions paired with conflicting information from various sources. Extensive experiments reveal critical vulnerabilities in state-of-the-art RAG methods, particularly in resolving conflicts stemming from differences in media source credibility. To address these challenges, we investigate strategies to integrate media background information into both the retrieval and generation stages. Our results show that effectively incorporating source credibility significantly enhances the ability of RAG models to resolve conflicting evidence and improve fact-checking performance.
中文摘要:检索增强语言模型在事实核查任务中潜力显著但处理冲突证据时可靠性下降,为此我们开发了CONFACT数据集并提出通过整合信息来源可信度来有效提升模型解决证据冲突能力的方法。
English Summary: Retrieval-augmented language models show promise for fact-checking but struggle with conflicting evidence, leading to the creation of the CONFACT dataset and methods that improve performance by incorporating source credibility information.

Authors:M. Emre Sahin, Edoardo Altamura, Oscar Wallis, Stephen P. Wood, Anton Dekusar, Declan A. Millar, Takashi Imamichi, Atsushi Matsuo, Stefano Mensa
Title: Qiskit Machine Learning: an open-source library for quantum machine learning tasks at scale on quantum hardware and classical simulators
Abstract:
We present Qiskit Machine Learning (ML), a high-level Python library that combines elements of quantum computing with traditional machine learning. The API abstracts Qiskit's primitives to facilitate interactions with classical simulators and quantum hardware. Qiskit ML started as a proof-of-concept code in 2019 and has since been developed to be a modular, intuitive tool for non-specialist users while allowing extensibility and fine-tuning controls for quantum computational scientists and developers. The library is available as a public, open-source tool and is distributed under the Apache version 2.0 license.
中文: Qiskit机器学习是一个Python库,它将量子计算与传统机器学习相结合,为普通用户和专家提供了直观的API,以便与模拟器和量子硬件进行交互。
English: Qiskit Machine Learning is a Python library that integrates quantum computing with classical machine learning, offering an intuitive API for both general users and experts to work with simulators and quantum hardware.

Authors:Wei Huang, Yizhe Xiong, Xin Ye, Zhijie Deng, Hui Chen, Zijia Lin, Guiguang Ding
Title: Fast Quiet-STaR: Thinking Without Thought Tokens
Abstract:
Large Language Models (LLMs) have achieved impressive performance across a range of natural language processing tasks. However, recent advances demonstrate that further gains, particularly in complex reasoning tasks, require more than merely scaling up model sizes or training data. One promising direction is to enable models to think during the reasoning process. Recently, Quiet-STaR significantly improved reasoning by generating token-level thought traces, but it incurs substantial inference overhead. In this work, we propose Fast Quiet-STaR, a more efficient reasoning framework that preserves the benefits of token-level reasoning while reducing computational cost. Our method introduces a curriculum-learning-based training strategy that gradually reduces the number of thought tokens, enabling the model to internalize more abstract and concise reasoning processes. We further extend this approach to the standard Next Token Prediction (NTP) setting through reinforcement learning-based fine-tuning, resulting in Fast Quiet-STaR NTP, which eliminates the need for explicit thought token generation during inference. Experiments on four benchmark datasets with Mistral 7B and Qwen2.5 7B demonstrate that Fast Quiet-STaR consistently outperforms Quiet-STaR in terms of average accuracy under the same inference time budget. Notably, Fast Quiet-STaR NTP achieves an average accuracy improvement of 9% on Mistral 7B and 5.7% on Qwen2.5 7B, while maintaining the same inference latency. Our code will be available at https://github.com/huangwei200012/Fast-Quiet-STaR.
中文:Fast Quiet-STaR通过课程学习和强化学习优化推理过程,在保持相同推理速度的同时显著提升了多个模型的准确率。
English: Fast Quiet-STaR enhances reasoning efficiency by reducing computational overhead through curriculum learning and reinforcement learning, achieving higher accuracy without increasing inference time.

Authors:Zeyuan Ma, Yue-Jiao Gong, Hongshu Guo, Wenjie Qiu, Sijie Ma, Hongqiao Lian, Jiajun Zhan, Kaixu Chen, Chen Wang, Zhiyang Huang, Zechuan Huang, Guojun Peng, Ran Cheng, Yining Ma
Title: MetaBox-v2: A Unified Benchmark Platform for Meta-Black-Box Optimization
Abstract:
Meta-Black-Box Optimization (MetaBBO) streamlines the automation of optimization algorithm design through meta-learning. It typically employs a bi-level structure: the meta-level policy undergoes meta-training to reduce the manual effort required in developing algorithms for low-level optimization tasks. The original MetaBox (2023) provided the first open-source framework for reinforcement learning-based single-objective MetaBBO. However, its relatively narrow scope no longer keeps pace with the swift advancement in this field. In this paper, we introduce MetaBox-v2 (https://github.com/MetaEvo/MetaBox) as a milestone upgrade with four novel features: 1) a unified architecture supporting RL, evolutionary, and gradient-based approaches, with which we reproduce 23 up-to-date baselines; 2) efficient parallelization schemes, which reduce the training/testing time by 10-40x; 3) a comprehensive benchmark suite of 18 synthetic/realistic tasks (1900+ instances) spanning single-objective, multi-objective, multi-model, and multi-task optimization scenarios; 4) plentiful and extensible interfaces for custom analysis/visualization and for integration with external optimization tools/benchmarks. To show the utility of MetaBox-v2, we carry out a systematic case study that evaluates the built-in baselines in terms of optimization performance, generalization ability, and learning efficiency. From a thorough and detailed analysis, we draw valuable insights for practitioners and those new to the field.
中文:MetaBox-v2作为升级版开源框架,通过统一架构支持多种优化方法,显著提升并行效率,并扩展了涵盖多类优化场景的基准测试,为实践提供全面分析工具。
English: MetaBox-v2 is an upgraded open-source framework that introduces a unified architecture supporting multiple optimization approaches, significantly improves efficiency with faster parallelization, and expands benchmarking to diverse optimization scenarios for comprehensive analysis.

Authors:Dong-Hee Kim, Hyunjee Song, Donghyun Kim
Title: SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data
Abstract:
Despite the advances in Referring Expression Segmentation (RES) benchmarks, their evaluation protocols remain constrained, primarily focusing on either single targets with short queries (containing minimal attributes) or multiple targets from distinctly different queries on a single domain. This limitation significantly hinders the assessment of more complex reasoning capabilities in RES models. We introduce WildRES, a novel benchmark that incorporates long queries with diverse attributes and non-distinctive queries for multiple targets. This benchmark spans diverse application domains, including autonomous driving environments and robotic manipulation scenarios, thus enabling more rigorous evaluation of complex reasoning capabilities in real-world settings. Our analysis reveals that current RES models demonstrate substantial performance deterioration when evaluated on WildRES. To address this challenge, we introduce SynRES, an automated pipeline generating densely paired compositional synthetic training data through three innovations: (1) a dense caption-driven synthesis for attribute-rich image-mask-expression triplets, (2) reliable semantic alignment mechanisms rectifying caption-pseudo mask inconsistencies via Image-Text Aligned Grouping, and (3) domain-aware augmentations incorporating mosaic composition and superclass replacement to emphasize generalization ability and distinguishing attributes over object categories. Experimental results demonstrate that models trained with SynRES achieve state-of-the-art performance, improving gIoU by 2.0% on WildRES-ID and 3.8% on WildRES-DS. Code and datasets are available at https://github.com/UTLLab/SynRES.
Chinese: WildRES作为一个新基准,通过引入复杂查询和多领域场景严格评估指代表达分割模型,揭示了现有模型的性能局限,并推动了SynRES自动化合成训练数据流程的开发,从而显著提升了模型的精确度。
English: WildRES is a new benchmark that introduces complex queries and diverse domains to rigorously evaluate referring expression segmentation models, revealing their performance limitations and prompting the development of SynRES, an automated pipeline that generates synthetic training data to significantly enhance model accuracy.

Authors:Yunyao Lu, Yihang Wu, Reem Kateb, Ahmad Chaddad
Title: Semi-Supervised Medical Image Segmentation via Dual Networks
Abstract:
Traditional supervised medical image segmentation models require large amounts of labeled data for training; however, obtaining such large-scale labeled datasets in the real world is extremely challenging. Recent semi-supervised segmentation models also suffer from noisy pseudo-labels and limited supervision in feature space. To solve these challenges, we propose an innovative semi-supervised 3D medical image segmentation method to reduce the dependency on large, expert-labeled datasets. Furthermore, we introduce a dual-network architecture to address the limitations of existing methods in using contextual information and generating reliable pseudo-labels. In addition, a self-supervised contrastive learning strategy is used to enhance the representation of the network and reduce prediction uncertainty by distinguishing between reliable and unreliable predictions. Experiments on clinical magnetic resonance imaging demonstrate that our approach outperforms state-of-the-art techniques. Our code is available at https://github.com/AIPMLab/Semi-supervised-Segmentation.
Chinese: 本研究提出了一种创新的半监督三维医学图像分割方法,采用双网络架构和对比学习策略,减少对标注数据的依赖并提高分割精度,在临床磁共振成像测试中优于现有技术。
English: This study introduces a novel semi-supervised 3D medical image segmentation method with a dual-network architecture and contrastive learning to reduce reliance on labeled data and improve segmentation accuracy, outperforming existing techniques in clinical MRI tests.

Authors:Dan Yuan, Yi Feng, Ziyun Tang
Title: Dual Attention Residual U-Net for Accurate Brain Ultrasound Segmentation in IVH Detection
Abstract:
Intraventricular hemorrhage (IVH) is a severe neurological complication among premature infants, necessitating early and accurate detection from brain ultrasound (US) images to improve clinical outcomes. While recent deep learning methods offer promise for computer-aided diagnosis, challenges remain in capturing both local spatial details and global contextual dependencies critical for segmenting brain anatomies. In this work, we propose an enhanced Residual U-Net architecture incorporating two complementary attention mechanisms: the Convolutional Block Attention Module (CBAM) and a Sparse Attention Layer (SAL). The CBAM improves the model's ability to refine spatial and channel-wise features, while the SAL introduces a dual-branch design: sparse attention filters out low-confidence query-key pairs to suppress noise, and dense attention ensures comprehensive information propagation. Extensive experiments on the Brain US dataset demonstrate that our method achieves state-of-the-art segmentation performance, with a Dice score of 89.04% and IoU of 81.84% for ventricle region segmentation. These results highlight the effectiveness of integrating spatial refinement and attention sparsity for robust brain anatomy detection. Code is available at: https://github.com/DanYuan001/BrainImgSegment.
中文摘要:本研究提出一种结合CBAM和SAL双重注意力机制的增强型残差U-Net,在早产儿脑部超声图像心室分割任务中以89.04%的Dice分数达到最优性能。
English Summary: The study introduces an enhanced Residual U-Net with dual attention mechanisms, CBAM and SAL, achieving state-of-the-art ventricle segmentation in premature infant brain ultrasound images with 89.04% Dice score.
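For reference, the two metrics reported above can be computed from binary masks as in the following sketch. It uses the standard definitions of Dice and IoU and is not the authors' evaluation code; the function name `dice_and_iou` and the toy masks are illustrative assumptions.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_area = sum(pred)
    target_area = sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treat as perfect agreement (a common convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy masks (hypothetical): 2 overlapping pixels, 3 foreground pixels each.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
```

For a single pair of masks the two metrics are tied by IoU = Dice / (2 - Dice); scores averaged over many images, as benchmark numbers usually are, need not satisfy that identity exactly.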

Authors:Xuerui Qiu, Peixi Wu, Yaozhi Wen, Shaowei Gu, Yuqi Pan, Xinhao Luo, Bo XU, Guoqi Li
Title: SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding
Abstract:
Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing SNNs still exhibit a significant performance gap compared to Artificial Neural Networks (ANNs) due to inadequate pre-training strategies. These limitations manifest as restricted generalization ability, task specificity, and a lack of multimodal understanding, particularly in challenging tasks such as multimodal question answering and zero-shot 3D classification. To overcome these challenges, we propose a Spike-based Vision-Language (SVL) pretraining framework that empowers SNNs with open-world 3D understanding while maintaining spike-driven efficiency. SVL introduces two key components: (i) Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities, and (ii) Re-parameterizable Vision-Language Integration (Rep-VLI) to enable lightweight inference without relying on large text encoders. Extensive experiments show that SVL achieves a top-1 accuracy of 85.4% in zero-shot 3D classification, surpassing advanced ANN models, and consistently outperforms prior SNNs on downstream tasks, including 3D classification (+6.1%), DVS action recognition (+2.1%), 3D detection (+1.1%), and 3D segmentation (+2.1%) with remarkable efficiency. Moreover, SVL enables SNNs to perform open-world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks. Code is available at https://github.com/bollossom/SVL.
中文摘要:提出的脉冲视觉语言(SVL)框架通过多模态预训练解决了脉冲神经网络性能不足的问题,在保持能效优势的同时实现了卓越的零样本三维分类和开放世界理解能力。
English Summary: The proposed Spike-based Vision-Language (SVL) framework overcomes SNNs' performance limitations through multimodal pretraining, achieving superior zero-shot 3D classification and open-world understanding while maintaining energy efficiency.

Authors:Jiawei Du, Jinlong Wu, Yuzheng Chen, Yucheng Hu, Bing Li, Joey Tianyi Zhou
Title: Rethinking Agent Design: From Top-Down Workflows to Bottom-Up Skill Evolution
Abstract:
Most LLM-based agent frameworks adopt a top-down philosophy: humans decompose tasks, define workflows, and assign agents to execute each step. While effective on benchmark-style tasks, such systems rely on designer updates and overlook agents' potential to learn from experience. Recently, Silver and Sutton (2025) envisioned a shift into a new era, where agents could progress from a stream of experiences. In this paper, we instantiate this vision of experience-driven learning by introducing a bottom-up agent paradigm that mirrors the human learning process. Agents acquire competence through a trial-and-reasoning mechanism: exploring, reflecting on outcomes, and abstracting skills over time. Once acquired, skills can be rapidly shared and extended, enabling continual evolution rather than static replication. As more agents are deployed, their diverse experiences accelerate this collective process, making bottom-up design especially suited for open-ended environments. We evaluate this paradigm in Slay the Spire and Civilization V, where agents perceive through raw visual inputs and act via mouse outputs, the same as human players. Using a unified, game-agnostic codebase without any game-specific prompts or privileged APIs, our bottom-up agents acquire skills entirely through autonomous interaction, demonstrating the potential of the bottom-up paradigm in complex, real-world environments. Our code is available at https://github.com/AngusDujw/Bottom-Up-Agent.
中文摘要:本文提出了一种自下而上的智能体范式,通过试错推理机制实现自主技能获取,使智能体能够从经验中学习并共享技能,在无需游戏特定编程的复杂游戏环境中验证了其有效性。
English Summary: This paper introduces a bottom-up agent paradigm that enables autonomous skill acquisition through trial-and-reasoning, allowing agents to learn from experiences and share skills collectively, demonstrating effectiveness in complex gaming environments without game-specific programming.

Authors:Tianheng Ling, Chao Qian, Lukas Johannes Haßler, Gregor Schiele
Title: Automating Versatile Time-Series Analysis with Tiny Transformers on Embedded FPGAs
Abstract:
Transformer-based models have shown strong performance across diverse time-series tasks, but their deployment on resource-constrained devices remains challenging due to high memory and computational demand. While prior work targeting Microcontroller Units (MCUs) has explored hardware-specific optimizations, such approaches are often task-specific and limited to 8-bit fixed-point precision. Field-Programmable Gate Arrays (FPGAs) offer greater flexibility, enabling fine-grained control over data precision and architecture. However, existing FPGA-based deployments of Transformers for time-series analysis typically focus on high-density platforms with manual configuration. This paper presents a unified and fully automated deployment framework for Tiny Transformers on embedded FPGAs. Our framework supports a compact encoder-only Transformer architecture across three representative time-series tasks (forecasting, classification, and anomaly detection). It combines quantization-aware training (down to 4 bits), hardware-aware hyperparameter search using Optuna, and automatic VHDL generation for seamless deployment. We evaluate our framework on six public datasets across two embedded FPGA platforms. Results show that our framework produces integer-only, task-specific Transformer accelerators achieving as low as 0.033 mJ per inference with millisecond latency on AMD Spartan-7, while also providing insights into deployment feasibility on Lattice iCE40. All source code will be released in the GitHub repository (https://github.com/Edwina1030/TinyTransformer4TS).
中文摘要:本文提出了一种自动化框架,可在嵌入式FPGA上部署精简量化Transformer,通过硬件感知优化在多种时序任务中实现低功耗推理与高效性能。
English Summary: This paper introduces an automated framework for deploying compact, quantized Transformers on embedded FPGAs, achieving low-energy inference across multiple time-series tasks with optimized hardware performance.
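To make the "down to 4 bits" figure concrete, the sketch below shows symmetric uniform quantization of a small weight list to signed 4-bit integer codes, the kind of rounding step that quantization-aware training simulates. The abstract does not specify the quantization scheme, so the symmetric per-tensor scaling, the function name `fake_quantize`, and the example values are all assumptions made for illustration.

```python
def fake_quantize(values, bits=4):
    """Symmetric per-tensor quantization: return (integer codes, scale, dequantized values)."""
    qmax = 2 ** (bits - 1) - 1      # 7 for signed 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for signed 4-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 1.0, [0.0] * len(values)
    scale = max_abs / qmax          # real-valued size of one integer step
    codes = [max(qmin, min(qmax, round(v / scale))) for v in values]
    dequantized = [c * scale for c in codes]
    return codes, scale, dequantized


# Hypothetical weights chosen so that every value lands exactly on a grid point.
weights = [1.75, -0.5, 0.25, 0.0]
codes, scale, dequantized = fake_quantize(weights, bits=4)
```

Only the integer codes and a scale need to be stored on the device, which is what makes integer-only accelerators possible; values that fall between grid points incur a rounding error of at most half a step.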

Authors:Xiaolong Tang, Meina Kan, Shiguang Shan, Xilin Chen
Title: Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling
Abstract:
Safe and feasible trajectory planning is critical for real-world autonomous driving systems. However, existing learning-based planners rely heavily on expert demonstrations, which not only lack explicit safety awareness but also risk inheriting undesirable behaviors such as speeding from suboptimal human driving data. Inspired by the success of large language models, we propose Plan-R1, a two-stage trajectory planning framework that decouples principle alignment from behavior learning. In the first stage, a general trajectory predictor is pre-trained on expert data to capture diverse, human-like driving behaviors. In the second stage, the model is fine-tuned with rule-based rewards using Group Relative Policy Optimization (GRPO), explicitly aligning ego planning with principles such as safety, comfort, and traffic rule compliance. This two-stage paradigm retains human-like behaviors while enhancing safety awareness and discarding undesirable patterns from demonstrations. Furthermore, we identify a key limitation of directly applying GRPO to planning: group-wise normalization erases cross-group scale differences, causing rare, high-variance safety-violation groups to have similar advantages as abundant low-variance safe groups, thereby suppressing optimization for safety-critical objectives. To address this, we propose Variance-Decoupled GRPO (VD-GRPO), which replaces normalization with centering and fixed scaling to preserve absolute reward magnitudes, ensuring that safety-critical objectives remain dominant throughout training. Experiments on the nuPlan benchmark demonstrate that Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance, particularly in realistic reactive settings. Our code is available at https://github.com/XiaolongTang23/Plan-R1.
中文: 针对现有基于学习的规划器依赖专家数据且存在安全风险的问题,Plan-R1提出两阶段框架:先通过人类数据预训练轨迹预测器,再采用VD-GRPO进行对齐安全与交规的微调,在nuPlan基准测试中实现了最优性能。
English: To address the limitations of existing learning-based planners that rely on expert data with potential safety risks, Plan-R1 introduces a two-stage framework that first pre-trains a trajectory predictor on human data and then fine-tunes it using VD-GRPO to align with safety and traffic rules, achieving superior performance on the nuPlan benchmark.
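The normalization issue described in the abstract can be illustrated with a small numerical sketch. The first function is the usual group-relative advantage (center, then divide by the group's standard deviation); the second replaces the division with a fixed constant, following the abstract's description of "centering and fixed scaling". The exact VD-GRPO formulation is not given in the abstract, so the function names, the `scale` value, and the toy rewards are assumptions.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Standard group-relative advantage: center, then divide by the group's std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def centered_fixed_scale_advantages(rewards, scale=1.0):
    """Center only, then divide by a fixed constant so reward magnitudes survive."""
    mu = mean(rewards)
    return [(r - mu) / scale for r in rewards]


# Hypothetical rewards: a low-variance safe group and a group with large safety penalties.
safe_group = [1.0, 0.9, 1.0, 0.9]
violation_group = [1.0, -9.0, 1.0, -9.0]

norm_safe = group_normalized_advantages(safe_group)
norm_violation = group_normalized_advantages(violation_group)
fixed_safe = centered_fixed_scale_advantages(safe_group)
fixed_violation = centered_fixed_scale_advantages(violation_group)
```

With per-group normalization both groups end up with advantages of about ±1, even though the penalties in the second group are 100 times larger. With centering and a fixed scale the advantages are about ±0.05 and ±5, so the safety-violation group dominates the gradient signal.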

Authors:Shixian Luo, Zezhou Zhu, Yu Yuan, Yuncheng Yang, Lianlei Shan, Yong Wu
Title: GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs
Abstract:
Geometric spatial reasoning forms the foundation of many applications in artificial intelligence, yet the ability of large language models (LLMs) to operate over geometric spatial information expressed in procedural code remains underexplored. In this paper, we address this gap by formalizing the Program-to-Geometry task, which challenges models to translate programmatic drawing code into accurate and abstract geometric reasoning. To evaluate this capability, we present GeoGramBench, a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies: even the most advanced models achieve less than 50% accuracy at the highest abstraction level. These results highlight the unique challenges posed by program-driven spatial reasoning and establish GeoGramBench as a valuable resource for advancing research in symbolic-to-spatial geometric reasoning. Project page: https://github.com/LiAuto-DSR/GeoGramBench.
中文: 本文提出GeoGramBench基准,评估大语言模型将程序化绘图代码转化为几何推理的能力,发现即使最先进模型在最高抽象级别准确率也不足50%。
English: This paper introduces GeoGramBench, a benchmark evaluating large language models' ability to translate procedural drawing code into geometric reasoning, revealing that even top models struggle with less than 50% accuracy at high abstraction levels.

Authors:Zehua Pei, Ying Zhang, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
Title: PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval
Abstract:
Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs. However, the significant memory demands of large MoE models hinder their deployment across various computational environments, from cloud servers to consumer devices. This study first demonstrates pronounced task-specific specialization in expert activation patterns within MoE layers. Building on this, we introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments. PreMoe features two main components: probabilistic expert pruning (PEP) and task-adaptive expert retrieval (TAER). PEP employs a new metric, the task-conditioned expected selection score (TCESS), derived from router logits to quantify expert importance for specific tasks, thereby identifying a minimal set of critical experts. TAER leverages these task-specific expert importance profiles for efficient inference. It pre-computes and stores compact expert patterns for diverse tasks. When a user query is received, TAER rapidly identifies the most relevant stored task pattern and reconstructs the model by loading only the small subset of experts crucial for that task. This approach dramatically reduces the memory footprint across all deployment scenarios. DeepSeek-R1 671B maintains 97.2% accuracy on MATH500 when pruned to an 8/128 configuration (50% expert reduction), and still achieves 72.0% with aggressive 8/32 pruning (87.5% expert reduction). Pangu-Ultra-MoE 718B achieves 97.15% on MATH500 and 81.3% on AIME24 with 8/128 pruning, while even more aggressive pruning to 4/64 (390GB memory) preserves 96.95% accuracy on MATH500. We make our code publicly available at https://github.com/JarvisPei/PreMoe.
Chinese: PreMoe框架通过概率性专家剪枝和任务自适应专家检索技术,仅动态加载任务关键专家,在显著降低内存占用的同时保持模型精度,实现了大规模混合专家模型在内存受限环境中的高效部署。
English: PreMoe introduces probabilistic expert pruning and task-adaptive expert retrieval to enable efficient deployment of large MoE models in memory-constrained environments by dynamically loading only task-critical experts, achieving near-original accuracy with dramatically reduced memory usage.
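The pruning idea can be sketched as ranking experts by how strongly the router selects them on a task's tokens and keeping only the top few. The abstract does not define TCESS, so the score used below (mean softmax router probability per expert) is a simplified stand-in and not the paper's metric; `select_experts`, `expert_reduction`, and the toy logits are hypothetical. The helper also reproduces the reduction percentages quoted in the abstract, which imply a base of 256 experts per layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_experts(router_logits_per_token, keep):
    """Keep the `keep` experts with the highest mean router probability on this task."""
    num_experts = len(router_logits_per_token[0])
    scores = [0.0] * num_experts
    for logits in router_logits_per_token:
        for i, p in enumerate(softmax(logits)):
            scores[i] += p
    num_tokens = len(router_logits_per_token)
    scores = [s / num_tokens for s in scores]
    ranked = sorted(range(num_experts), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def expert_reduction(kept, total):
    """Fraction of experts removed when `kept` out of `total` remain."""
    return 1 - kept / total


# Toy router logits (hypothetical): 2 tokens, 4 experts.
task_logits = [
    [2.0, 0.1, 1.5, -1.0],
    [1.8, 0.0, 2.2, -0.5],
]
kept = select_experts(task_logits, keep=2)
```

Here experts 0 and 2 are retained. With 256 experts as the base, keeping 128 gives a 50% reduction and keeping 32 gives 87.5%, matching the figures in the abstract.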

Authors:Joakim Edin, Róbert Csordás, Tuukka Ruotsalo, Zhengxuan Wu, Maria Maistro, Casper L. Christensen, Jing Huang, Lars Maaløe
Title: GIM: Improved Interpretability for Large Language Models
Abstract:
Ensuring faithful interpretability in large language models is imperative for trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where networks compensate for reduced signal in one component by amplifying others, masking the true importance of the ablated component. While prior work attributes self-repair to layer normalization and back-up components that compensate for ablated components, we identify a novel form occurring within the attention mechanism, where softmax redistribution conceals the influence of important attention scores. This leads traditional ablation and gradient-based methods to underestimate the significance of all components contributing to these attention scores. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Extensive experiments across multiple large language models (Gemma 2B/9B, LLAMA 1B/3B/8B, Qwen 1.5B/3B) and diverse tasks demonstrate that GIM significantly improves faithfulness over existing circuit identification and feature attribution methods. Our work is a significant step toward better understanding the inner mechanisms of LLMs, which is crucial for improving them and ensuring their safety. Our code is available at https://github.com/JoakimEdin/gim.
中文: 该摘要介绍了梯度交互修正(GIM)这一新技术,它通过解决注意力机制中的自我修复现象,显著提升了大型语言模型可解释方法的可靠性,并在多种模型和任务中验证了其有效性。
English: The abstract introduces Gradient Interaction Modifications (GIM), a novel technique that addresses self-repair within attention mechanisms to enhance the faithfulness of interpretability methods in large language models, demonstrating significant improvements across various models and tasks.
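The softmax redistribution effect described in the abstract can be reproduced in a toy attention head. The sketch below only illustrates the self-repair phenomenon, not the GIM technique itself; the scores, values, and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_output(scores, values):
    """Scalar attention output: softmax-weighted sum of scalar values."""
    return sum(w * v for w, v in zip(softmax(scores), values))


# Two keys with equally high scores carry the same value; a third key is a distractor.
scores = [5.0, 5.0, 0.0]
values = [1.0, 1.0, 0.0]

full_output = attention_output(scores, values)
contribution_of_key0 = softmax(scores)[0] * values[0]

# Ablate key 0 by pushing its score to a very large negative number.
ablated_output = attention_output([-1e9, 5.0, 0.0], values)
ablation_effect = full_output - ablated_output
```

Key 0 supplies almost half of the output in the intact model, yet removing it changes the output by less than 0.01 because softmax shifts its weight onto key 1. An ablation-based importance estimate would therefore rate key 0 as nearly irrelevant.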

Authors:Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang
Title: Distilling LLM Agent into Small Models with Retrieval and Code Tools
Abstract:
Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.
中文: 本文提出智能体蒸馏框架,通过改进的思维轨迹生成和自洽行动方法,将大型语言模型的完整推理与工具使用能力迁移至小型模型,使0.5B-3B参数的小模型能达到更大模型的性能水平。
English: This paper introduces Agent Distillation, a framework that transfers comprehensive reasoning and tool-using capabilities from large language models to smaller ones through enhanced trajectory generation and self-consistent action methods, enabling compact 0.5B-3B models to match larger counterparts' performance.
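One plausible reading of self-consistent action generation is: sample several candidate code actions, discard those that fail to run, and keep the result most candidates agree on. The abstract does not give the procedure, so this is an assumed interpretation; the candidate strings stand in for model samples, and `self_consistent_result` is a hypothetical helper.

```python
from collections import Counter


def self_consistent_result(candidate_expressions):
    """Evaluate each candidate arithmetic expression, drop failures, return the majority result."""
    results = []
    for expr in candidate_expressions:
        try:
            # Builtins are disabled so only plain arithmetic can be evaluated in this sketch.
            results.append(eval(expr, {"__builtins__": {}}, {}))
        except Exception:
            continue  # a candidate that does not parse or run is simply discarded
    if not results:
        return None
    value, _count = Counter(results).most_common(1)[0]
    return value


# Hypothetical samples from a small agent for "2 plus 3 times 4".
candidates = ["2 + 3 * 4", "(2 + 3) * 4", "2 + 3 * 4", "2 + * 4"]
answer = self_consistent_result(candidates)
```

The malformed candidate is dropped, and the two agreeing candidates outvote the one with misplaced parentheses, so the answer is 14. A real agent would run generated code in a proper sandbox rather than through `eval`.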

Authors:Zixian Guo, Ming Liu, Qilong Wang, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo
Title: Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving
Abstract:
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified process. Effective alignment needs high-quality pre-training data and a carefully designed training process. Current LVLMs face challenges when addressing complex vision-language reasoning tasks, with their reasoning capabilities notably lagging behind those of LLMs. This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework based on existing visual interpretation specialists and text-based reasoning LLMs. Our approach leverages (1) a dedicated vision-language model to transform the visual content of images into textual descriptions and (2) an LLM to perform reasoning according to the visual-derived text and the original question. This method presents a cost-efficient solution for multi-modal model development by optimizing existing models to work collaboratively, avoiding end-to-end development of vision-language models from scratch. By transforming images into language model-compatible text representations, it facilitates future low-cost and flexible upgrades to upcoming powerful LLMs. We introduce an outcome-rewarded joint-tuning strategy to optimize the cooperation between the visual interpretation and linguistic reasoning models. Evaluation results on vision-language benchmarks demonstrate that the decoupled reasoning framework outperforms recent LVLMs. Our approach yields particularly significant performance gains on visually intensive geometric mathematics problems. The code is available at https://github.com/guozix/DVLR.
Chinese: 本文提出了一种解耦推理框架,通过视觉语言模型将图像转化为文本描述,再利用语言模型进行推理,相比端到端训练提供了更经济高效的解决方案,并在视觉语言任务中表现更优。
English: This paper proposes a decoupled reasoning framework that uses a vision-language model to convert images into text descriptions and an LLM for reasoning, offering a cost-efficient alternative to end-to-end training and achieving superior performance on vision-language tasks.

Authors:Linbao Li, Yannan Liu, Daojing He, Yu Li
Title: One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs
Abstract:
Safety alignment in large language models (LLMs) is increasingly compromised by jailbreak attacks, which can manipulate these models to generate harmful or unintended content. Investigating these attacks is crucial for uncovering model vulnerabilities. However, many existing jailbreak strategies fail to keep pace with the rapid development of defense mechanisms, such as defensive suffixes, rendering them ineffective against defended models. To tackle this issue, we introduce a novel attack method called ArrAttack, specifically designed to target defended LLMs. ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. This capability is supported by a universal robustness judgment model that, once trained, can perform robustness evaluation for any target model with a wide variety of defenses. By leveraging this model, we can rapidly develop a robust jailbreak prompt generator that efficiently converts malicious input prompts into effective attacks. Extensive evaluations reveal that ArrAttack significantly outperforms existing attack strategies, demonstrating strong transferability across both white-box and black-box models, including GPT-4 and Claude-3. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts. We make the codebase available at https://github.com/LLBao/ArrAttack.
Chinese: ArrAttack是一种新颖的越狱攻击方法,能自动生成绕过多种防御机制的鲁棒提示,在各类大语言模型上显著优于现有攻击策略并展现出强大的迁移能力。
English: ArrAttack is a novel jailbreak method that automatically generates robust prompts to bypass various defenses in large language models, significantly outperforming existing attacks and demonstrating strong transferability across models.

Authors:Hainuo Wang, Qiming Hu, Xiaojie Guo
Title: MODEM: A Morton-Order Degradation Estimation Mechanism for Adverse Weather Image Recovery
Abstract:
Restoring images degraded by adverse weather remains a significant challenge due to the highly non-uniform and spatially heterogeneous nature of weather-induced artifacts, e.g., fine-grained rain streaks versus widespread haze. Accurately estimating the underlying degradation can intuitively provide restoration models with more targeted and effective guidance, enabling adaptive processing strategies. To this end, we propose a Morton-Order Degradation Estimation Mechanism (MODEM) for adverse weather image restoration. Central to MODEM is the Morton-Order 2D-Selective-Scan Module (MOS2D), which integrates Morton-coded spatial ordering with selective state-space models to capture long-range dependencies while preserving local structural coherence. Complementing MOS2D, we introduce a Dual Degradation Estimation Module (DDEM) that disentangles and estimates both global and local degradation priors. These priors dynamically condition the MOS2D modules, facilitating adaptive and context-aware restoration. Extensive experiments and ablation studies demonstrate that MODEM achieves state-of-the-art results across multiple benchmarks and weather types, highlighting its effectiveness in modeling complex degradation dynamics. Our code will be released at https://github.com/hainuo-wang/MODEM.git.
中文: 本研究提出的MODEM框架通过莫顿序降解估计机制,结合空间扫描与双路径估计模块,自适应地建模全局和局部天气退化特征,在多种恶劣天气图像恢复任务中取得了最先进的性能表现。
English: The proposed MODEM framework introduces a Morton-Order Degradation Estimation Mechanism that adaptively captures both global and local weather-induced degradations through spatial ordering and dual estimation modules, achieving state-of-the-art restoration performance across diverse adverse weather conditions.
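
As a rough illustration of the Morton-coded spatial ordering mentioned above, the sketch below interleaves the bits of a pixel's row and column indices to obtain a Z-order index, then sorts the positions of a small grid by that index. The function names `morton_index` and `morton_scan_order` are hypothetical and are not taken from the MODEM code; the actual MOS2D module may order or scan tokens differently.

```python
def morton_index(row: int, col: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of row and col into one Z-order index."""
    index = 0
    for i in range(bits):
        # Column bits go to even positions, row bits to odd positions.
        index |= ((col >> i) & 1) << (2 * i)
        index |= ((row >> i) & 1) << (2 * i + 1)
    return index


def morton_scan_order(height: int, width: int) -> list[tuple[int, int]]:
    """Return all (row, col) positions of a grid sorted by Morton index."""
    positions = [(r, c) for r in range(height) for c in range(width)]
    return sorted(positions, key=lambda rc: morton_index(rc[0], rc[1]))


if __name__ == "__main__":
    # On a 4x4 grid the scan finishes each 2x2 block before moving on,
    # so spatially close pixels stay close in the 1D sequence.
    print(morton_scan_order(4, 4))
```

Compared with a plain row-major scan, this ordering tends to keep neighbouring pixels near each other in the flattened sequence, which is presumably the motivation for pairing it with a sequential state-space model.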

Authors:Xueji Fang, Liyuan Ma, Zhiyang Chen, Mingyuan Zhou, Guo-jun Qi
Title: InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO
Abstract:
Recent advances in text-to-video generation, particularly with autoregressive models, have enabled the synthesis of high-quality videos depicting individual scenes. However, extending these models to generate long, cross-scene videos remains a significant challenge. As the context length grows during autoregressive decoding, computational costs rise sharply, and the model's ability to maintain consistency and adhere to evolving textual prompts deteriorates. We introduce InfLVG, an inference-time framework that enables coherent long video generation without requiring additional long-form video data. InfLVG leverages a learnable context selection policy, optimized via Group Relative Policy Optimization (GRPO), to dynamically identify and retain the most semantically relevant context throughout the generation process. Instead of accumulating the entire generation history, the policy ranks and selects the top-$K$ most contextually relevant tokens, allowing the model to maintain a fixed computational budget while preserving content consistency and prompt alignment. To optimize the policy, we design a hybrid reward function that jointly captures semantic alignment, cross-scene consistency, and artifact reduction. To benchmark performance, we introduce the Cross-scene Video Benchmark (CsVBench) along with an Event Prompt Set (EPS) that simulates complex multi-scene transitions involving shared subjects and varied actions/backgrounds. Experimental results show that InfLVG can extend video length by up to 9$\times$, achieving strong consistency and semantic fidelity across scenes. Our code is available at https://github.com/MAPLE-AIGC/InfLVG.
中文:InfLVG是一种推理时框架,通过动态选择相关上下文标记实现连贯的长视频生成,可将视频长度扩展至9倍,同时保持跨场景的一致性和语义保真度。
English: InfLVG is an inference-time framework that enables coherent long video generation by dynamically selecting relevant context tokens, extending video length up to 9 times while maintaining consistency and semantic fidelity across scenes.
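
The fixed-budget context selection described above can be sketched as a top-K filter over per-token relevance scores. This is only a minimal sketch: in InfLVG the scores come from a learned policy optimized with GRPO, whereas here `scores` is simply an input array, and the function name `select_top_k_context` is hypothetical.

```python
import numpy as np


def select_top_k_context(scores: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest-scoring context tokens, in original order."""
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.empty(0, dtype=np.int64)
    # argpartition places the k largest scores in the last k slots (unordered).
    top = np.argpartition(scores, -k)[-k:]
    # Sorting the indices restores the temporal order of the kept tokens.
    return np.sort(top)


if __name__ == "__main__":
    scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2])
    print(select_top_k_context(scores, k=2))
```

Because at most `k` tokens are kept no matter how long the history grows, the cost of attending to the retained context stays bounded.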

Authors:Qiyu Chen, Huiyuan Luo, Haiming Yao, Wei Luo, Zhen Qu, Chengkan Lv, Zhengtao Zhang
Title: Center-aware Residual Anomaly Synthesis for Multi-class Industrial Anomaly Detection
Abstract:
Anomaly detection plays a vital role in the inspection of industrial images. Most existing methods require separate models for each category, resulting in multiplied deployment costs. This highlights the challenge of developing a unified model for multi-class anomaly detection. However, the significant increase in inter-class interference leads to severe missed detections. Furthermore, the intra-class overlap between normal and abnormal samples, particularly in synthesis-based methods, cannot be ignored and may lead to over-detection. To tackle these issues, we propose a novel Center-aware Residual Anomaly Synthesis (CRAS) method for multi-class anomaly detection. CRAS leverages center-aware residual learning to couple samples from different categories into a unified center, mitigating the effects of inter-class interference. To further reduce intra-class overlap, CRAS introduces distance-guided anomaly synthesis that adaptively adjusts noise variance based on normal data distribution. Experimental results on diverse datasets and real-world industrial applications demonstrate the superior detection accuracy and competitive inference speed of CRAS. The source code and the newly constructed dataset are publicly available at https://github.com/cqylunlun/CRAS.
中文摘要:提出的中心感知残差异常合成(CRAS)方法通过统一中心耦合缓解类间干扰,并采用自适应异常合成减少类内重叠,有效解决了多类别异常检测中的关键问题,在多个数据集上展现出卓越的检测精度与高效性能。
English Summary: The proposed Center-aware Residual Anomaly Synthesis (CRAS) method addresses multi-class anomaly detection challenges by mitigating inter-class interference through unified center coupling and reducing intra-class overlap via adaptive anomaly synthesis, demonstrating superior accuracy and efficiency across various datasets.

Authors:Xiaoyu Ye, Songjie Cheng, Yongtao Wang, Yajiao Xiong, Yishen Li
Title: T2VUnlearning: A Concept Erasing Method for Text-to-Video Diffusion Models
Abstract:
Recent advances in text-to-video (T2V) diffusion models have significantly enhanced the quality of generated videos. However, their ability to produce explicit or harmful content raises concerns about misuse and potential rights violations. Inspired by the success of unlearning techniques in erasing undesirable concepts from text-to-image (T2I) models, we extend unlearning to T2V models and propose a robust and precise unlearning method. Specifically, we adopt negatively-guided velocity prediction fine-tuning and enhance it with prompt augmentation to ensure robustness against LLM-refined prompts. To achieve precise unlearning, we incorporate a localization and a preservation regularization to preserve the model's ability to generate non-target concepts. Extensive experiments demonstrate that our method effectively erases a specific concept while preserving the model's generation capability for all other concepts, outperforming existing methods. We provide the unlearned models in https://github.com/VDIGPKU/T2VUnlearning.git.
中文: 本研究提出了一种基于遗忘学习的方法,通过增强的微调和正则化技术从文本到视频扩散模型中消除有害概念,在保持非目标内容生成质量的同时,其效果优于现有方法。
English: This study introduces an unlearning-based method to erase harmful concepts from text-to-video diffusion models, using enhanced fine-tuning and regularization techniques to maintain generation quality for non-target content while outperforming existing approaches.

Authors:Xiaoyu Ye, Songjie Cheng, Yongtao Wang, Yajiao Xiong, Yishen Li
Title: T2VUnlearning: A Concept Erasing Method for Text-to-Video Diffusion Models
Abstract:
Recent advances in text-to-video (T2V) diffusion models have significantly enhanced the quality of generated videos. However, their capability to produce explicit or harmful content introduces new challenges related to misuse and potential rights violations. To address this newly emerging threat, we propose unlearning-based concept erasing as a solution. First, we adopt negatively-guided velocity prediction fine-tuning and enhance it with prompt augmentation to ensure robustness against prompts refined by large language models (LLMs). Second, to achieve precise unlearning, we incorporate mask-based localization regularization and concept preservation regularization to preserve the model's ability to generate non-target concepts. Extensive experiments demonstrate that our method effectively erases a specific concept while preserving the model's generation capability for all other concepts, outperforming existing methods. We provide the unlearned models in https://github.com/VDIGPKU/T2VUnlearning.git.
中文: 本研究提出了一种基于遗忘学习的方法,通过增强的微调和正则化技术从文本到视频扩散模型中消除有害概念,在保持非目标内容生成质量的同时,其效果优于现有方法。
English: This study introduces an unlearning-based method to erase harmful concepts from text-to-video diffusion models, using enhanced fine-tuning and regularization techniques to maintain generation quality for non-target content while outperforming existing approaches.

Authors:Mingrui Wu, Lu Wang, Pu Zhao, Fangkai Yang, Jianjin Zhang, Jianfeng Liu, Yuefeng Zhan, Weihao Han, Hao Sun, Jiayi Ji, Xiaoshuai Sun, Qingwei Lin, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang, Rongrong Ji
Title: RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning
Abstract:
Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language models, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. The tailored reward models assess the generated images in terms of human preference, semantic alignment, and visual composition, providing indirect supervision to refine prompt generation. Our approach enables end-to-end training without human-annotated data. Experiments on GenEval and T2I-Compbench show that RePrompt significantly boosts spatial layout fidelity and compositional generalization across diverse T2I backbones, establishing new state-of-the-art results.
中文: RePrompt通过强化学习框架,在提示增强过程中引入结构化推理和图像级优化,无需人工标注即可显著提升文本到图像生成的空间布局忠实度和组合泛化能力。
English: RePrompt introduces a reinforcement learning framework that enhances text-to-image prompts through structured reasoning and image-level optimization, achieving superior spatial and compositional accuracy without human annotations.

Authors:Jingjing Jiang, Chongjie Si, Jun Luo, Hanwang Zhang, Chao Ma
Title: Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
Abstract:
This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.
中文摘要:本文提出CoRL协同强化学习框架,通过统一策略优化同时增强多模态大语言模型的生成与理解能力,在多项基准测试中取得显著性能提升。
English Summary: This paper introduces CoRL, a co-reinforcement learning framework that enhances both generation and understanding in multimodal large language models, achieving significant performance gains across multiple benchmarks.

Authors:Vendi Ardianto Nugroho, Byung Moo Lee
Title: GPS-Aided Deep Learning for Beam Prediction and Tracking in UAV mmWave Communication
Abstract:
Millimeter-wave (mmWave) communication enables high data rates for cellular-connected Unmanned Aerial Vehicles (UAVs). However, robust beam management remains challenging due to significant path loss and the dynamic mobility of UAVs, which can destabilize the UAV-base station (BS) link. This research presents a GPS-aided deep learning (DL) model that simultaneously predicts current and future optimal beams for UAV mmWave communications, maintaining a Top-1 prediction accuracy exceeding 70% and an average power loss below 0.6 dB across all prediction steps. These outcomes stem from a proposed data set splitting method ensuring balanced label distribution, paired with a GPS preprocessing technique that extracts key positional features, and a DL architecture that maps sequential position data to beam index predictions. The model reduces overhead by approximately 93% (requiring the training of 2 to 3 beams instead of 32 beams) with 95% beam prediction accuracy guarantees, and ensures that 94% to 96% of predictions exhibit mean power loss not exceeding 1 dB.
Chinese: 本研究提出一种GPS辅助的深度学习模型,用于预测无人机毫米波通信的最佳波束,实现了超过70%的准确率,在将开销降低93%的同时保持了极低的功率损耗。
English: This study introduces a GPS-assisted deep learning model that predicts optimal beams for UAV mmWave communications, achieving over 70% accuracy and reducing overhead by 93% while maintaining minimal power loss.

Authors:Ye Du, Chen Yang, Nanxi Yu, Wanyu Lin, Qian Zhao, Shujun Wang
Title: Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing
Abstract:
De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models usually encode the observed mass spectra into latent representations from which peptides are predicted autoregressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called Latent Imputation before Prediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at https://github.com/usr922/LIPNovo.
中文: LIPNovo提出了一种在肽段预测前对质谱中缺失碎片信息进行潜在空间补偿的新方法,在多个基准数据集上大幅超越了现有最优技术。
English: LIPNovo introduces a latent imputation method that compensates for missing fragmentation data in mass spectra before peptide prediction, significantly outperforming existing techniques on benchmark datasets.

Authors:Rafał Karczewski, Markus Heinonen, Alison Pouplin, Søren Hauberg, Vikas Garg
Title: Spacetime Geometry of Denoising in Diffusion Models
Abstract:
We present a novel perspective on diffusion models using the framework of information geometry. We show that the set of noisy samples, taken across all noise levels simultaneously, forms a statistical manifold -- a family of denoising probability distributions. Interpreting the noise level as a temporal parameter, we refer to this manifold as spacetime. This manifold naturally carries a Fisher-Rao metric, which defines geodesics -- shortest paths between noisy points. Notably, this family of distributions is exponential, enabling efficient geodesic computation even in high-dimensional settings without retraining or fine-tuning. We demonstrate the practical value of this geometric viewpoint in transition path sampling, where spacetime geodesics define smooth sequences of Boltzmann distributions, enabling the generation of continuous trajectories between low-energy metastable states. Code is available at: https://github.com/Aalto-QuML/diffusion-spacetime-geometry.
中文摘要:本研究提出了一种基于信息几何框架的扩散模型新视角,揭示了噪声样本构成具有Fisher-Rao度量的统计流形,可实现高效测地线计算,并在过渡路径采样等应用中展现实用价值。
English Summary: This study introduces a novel information geometry framework for diffusion models, revealing that noisy samples form a statistical manifold with a Fisher-Rao metric that enables efficient geodesic computation for practical applications like transition path sampling.

Authors:Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew C Yao
Title: On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Abstract:
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). Although Kullback-Leibler (KL) regularization is widely used in policy gradient algorithms to stabilize training, how different KL divergence formulations can be estimated and integrated into surrogate loss functions for online reinforcement learning (RL) presents a nuanced design space that can be explored systematically. In this paper, we propose regularized policy gradient (RPG), a systematic framework for deriving and analyzing KL-regularized policy gradient methods in the online RL setting. We derive policy gradients and corresponding surrogate loss functions for objectives regularized by both forward and reverse KL divergences, considering both normalized and unnormalized policy distributions. Furthermore, we present derivations for fully differentiable loss functions as well as REINFORCE-style gradient estimators, accommodating diverse algorithmic needs. We conduct extensive experiments on RL for LLM reasoning using these methods, showing improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at https://github.com/complex-reasoning/RPG.
中文: 策略梯度算法提升了大型语言模型的推理能力,本研究提出的正则化策略梯度(RPG)框架统一了KL正则化变体,修正了离策略加权问题,并在数学推理基准测试中显著提高了训练稳定性和准确性。
English: Policy gradient algorithms enhance large language models' reasoning, and this study introduces the Regularized Policy Gradient (RPG) framework to unify KL regularization variants, correct off-policy weighting issues, and improve training stability and accuracy on benchmarks.

Authors:Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
Title: On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Abstract:
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface, choice of KL direction (forward vs. reverse), normalization (normalized vs. unnormalized), and estimator ($k_1/k_2/k_3$), is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO's KL term; and (iv) introduces RPG-Style Clip, a truncated-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) truncated importance sampling, and (c) an iterative reference-policy update scheme.
中文: 策略梯度算法提升了大型语言模型的推理能力,本研究提出的正则化策略梯度(RPG)框架统一了KL正则化变体,修正了离策略加权问题,并在数学推理基准测试中显著提高了训练稳定性和准确性。
English: Policy gradient algorithms enhance large language models' reasoning, and this study introduces the Regularized Policy Gradient (RPG) framework to unify KL regularization variants, correct off-policy weighting issues, and improve training stability and accuracy on benchmarks.
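
The claim that the $k_3$ penalty equals the unnormalized KL can be checked numerically on a small discrete example. The sketch below assumes the common convention $r = p(x)/q(x)$ with weights taken from $q$, and $k_3 = (r - 1) - \log r$; the direction of the divergence and all function names are assumptions made for illustration, and the off-policy weighting analyzed in the abstract is not covered here.

```python
import numpy as np


def k3(r: np.ndarray) -> np.ndarray:
    """k3 penalty evaluated at the ratio r = p / q."""
    return (r - 1.0) - np.log(r)


def q_weighted_k3(q: np.ndarray, p: np.ndarray) -> float:
    """Sum of k3(p/q) weighted by the (possibly unnormalized) mass q."""
    return float(np.sum(q * k3(p / q)))


def unnormalized_kl(q: np.ndarray, p: np.ndarray) -> float:
    """Unnormalized KL: sum_x [ q log(q/p) - q + p ]."""
    return float(np.sum(q * np.log(q / p) - q + p))


if __name__ == "__main__":
    q = np.array([0.2, 0.5, 0.6])  # total mass 1.3, deliberately unnormalized
    p = np.array([0.3, 0.3, 0.3])  # total mass 0.9
    print(q_weighted_k3(q, p), unnormalized_kl(q, p))
```

The two quantities agree term by term, since $q\,[(p/q - 1) - \log(p/q)] = p - q + q \log(q/p)$; when both `q` and `p` sum to one, the extra mass terms cancel and the value reduces to the ordinary KL divergence.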

Authors:Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, Tat-Seng Chua
Title: L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models
Abstract:
Large language models (LLMs) have achieved notable progress. Despite their success, next-token prediction (NTP), the dominant method for LLM training and inference, is constrained in both contextual coverage and inference efficiency due to its inherently sequential process. To overcome these challenges, we propose leap multi-token prediction (L-MTP), an innovative token prediction method that extends the capabilities of multi-token prediction (MTP) by introducing a leap-based mechanism. Unlike conventional MTP, which generates multiple tokens at adjacent positions, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. This structured leap not only enhances the model's ability to capture long-range dependencies but also enables a decoding strategy specially optimized for non-sequential leap token generation, effectively accelerating inference. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency. Experiments across diverse benchmarks validate its merit in boosting both LLM performance and inference speed. The source code is available at https://github.com/Xiaohao-Liu/L-MTP.
中文摘要:提出的跳跃多标记预测(L-MTP)方法通过单次前向传播预测非连续标记,有效增强了语言模型的远程依赖捕捉能力并加速了推理过程。
English Summary: The proposed leap multi-token prediction (L-MTP) method enhances language models by predicting non-sequential tokens in a single pass, improving both contextual understanding and inference speed.

Authors:Minsoo Khang, Sangjun Park, Teakgyu Hong, Dawoon Jung
Title: CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents
Abstract:
Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer appropriately, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, maintaining reliability, and structural understanding. While some of the prior works address these aspects individually, there is a need for a unified framework that evaluates them collectively in practical RAG scenarios. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises 2,245 human-annotated examples in English and Korean, designed to capture practical RAG scenarios that require complex reasoning over structured documents. It also introduces a tailored evaluation methodology to comprehensively assess model performance in these critical areas. Our evaluation shows that even advanced LLMs struggle to perform consistently across these dimensions, underscoring key areas for improvement. We release CReSt to support further research and the development of more robust RAG systems. The dataset and code are available at: https://github.com/UpstageAI/CReSt.
中文: 该摘要介绍了CReSt这一综合性基准,旨在整体评估大语言模型在检索增强生成场景中的复杂推理、拒答能力、引用准确性和文档布局理解等关键维度,研究表明即使先进模型在这些方面仍存在明显不足。
English: This abstract introduces CReSt, a comprehensive benchmark designed to holistically evaluate Large Language Models in Retrieval-Augmented Generation scenarios, focusing on complex reasoning, refusal capabilities, citation accuracy, and document layout understanding, revealing that even advanced models struggle across these dimensions.

Authors:Uyoung Jeong, Jonathan Freer, Seungryul Baek, Hyung Jin Chang, Kwang In Kim
Title: PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation
Abstract:
We study multi-dataset training (MDT) for pose estimation, where skeletal heterogeneity presents a unique challenge that existing methods have yet to address. In traditional domains, e.g., regression and classification, MDT typically relies on dataset merging or multi-head supervision. However, the diversity of skeleton types and limited cross-dataset supervision complicate integration in pose estimation. To address these challenges, we introduce PoseBH, a new MDT framework that tackles keypoint heterogeneity and limited supervision through two key techniques. First, we propose nonparametric keypoint prototypes that learn within a unified embedding space, enabling seamless integration across skeleton types. Second, we develop a cross-type self-supervision mechanism that aligns keypoint predictions with keypoint embedding prototypes, providing supervision without relying on teacher-student models or additional augmentations. PoseBH substantially improves generalization across whole-body and animal pose datasets, including COCO-WholeBody, AP-10K, and APT-36K, while preserving performance on standard human pose benchmarks (COCO, MPII, and AIC). Furthermore, our learned keypoint embeddings transfer effectively to hand shape estimation (InterHand2.6M) and human body shape estimation (3DPW). The code for PoseBH is available at: https://github.com/uyoung-jeong/PoseBH.
中文: 我们提出PoseBH这一新型多数据集训练框架,通过统一关键点嵌入和跨类型自监督机制解决姿态估计中的骨骼异构问题,在人类和动物数据集上显著提升泛化能力,同时保持标准基准的性能表现。
English: We introduce PoseBH, a novel multi-dataset training framework that addresses skeletal heterogeneity in pose estimation through unified keypoint embeddings and cross-type self-supervision, significantly improving generalization across human and animal datasets while maintaining performance on standard benchmarks.

Authors:Zhining Liu, Zihao Li, Ze Yang, Tianxin Wei, Jian Kang, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong
Title: CLIMB: Class-imbalanced Learning Benchmark on Tabular Data
Abstract:
Class-imbalanced learning (CIL) on tabular data is important in many real-world applications where the minority class holds the critical but rare outcomes. In this paper, we present CLIMB, a comprehensive benchmark for class-imbalanced learning on tabular data. CLIMB includes 73 real-world datasets across diverse domains and imbalance levels, along with unified implementations of 29 representative CIL algorithms. Built on a high-quality open-source Python package with unified API designs, detailed documentation, and rigorous code quality controls, CLIMB supports easy implementation and comparison between different CIL algorithms. Through extensive experiments, we provide practical insights on method accuracy and efficiency, highlighting the limitations of naive rebalancing, the effectiveness of ensembles, and the importance of data quality. Our code, documentation, and examples are available at https://github.com/ZhiningLiu1998/imbalanced-ensemble.
中文: 本文提出了CLIMB,一个针对表格数据类别不平衡学习的全面基准,包含73个数据集和29种算法,提供了方法有效性和效率的实用见解,同时揭示了简单重平衡的局限性和数据质量的重要性。
English: This paper introduces CLIMB, a comprehensive benchmark for class-imbalanced learning on tabular data that includes 73 datasets and 29 algorithms, offering practical insights into method effectiveness and efficiency while highlighting the limitations of naive rebalancing and the importance of data quality.

Authors:Hefei Mei, Zirui Wang, Shen You, Minjing Dong, Chang Xu
Title: VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation, yet their vulnerability to adversarial attacks raises significant robustness concerns. While existing effective attacks always focus on task-specific white-box settings, these approaches are limited in the context of LVLMs, which are designed for diverse downstream tasks and require expensive full-model gradient computations. Motivated by the pivotal role and wide adoption of the vision encoder in LVLMs, we propose a simple yet effective Vision Encoder Attack (VEAttack), which targets the vision encoder of LVLMs only. Specifically, we propose to generate adversarial examples by minimizing the cosine similarity between the clean and perturbed visual features, without accessing the following large language models, task information, and labels. It significantly reduces the computational overhead while eliminating the task and label dependence of traditional white-box attacks in LVLMs. To make this simple attack effective, we propose to perturb images by optimizing image tokens instead of the classification token. We provide both empirical and theoretical evidence that VEAttack can easily generalize to various tasks. VEAttack has achieved a performance degradation of 94.5% on image caption task and 75.7% on visual question answering task. We also reveal some key observations to provide insights into LVLM attack/defense: 1) hidden layer variations of LLM, 2) token attention differential, 3) Möbius band in transfer attack, 4) low sensitivity to attack steps. The code is available at https://github.com/hfmei/VEAttack-LVLM
中文摘要:本文提出的视觉编码器攻击(VEAttack)通过仅针对大型视觉语言模型的视觉编码器进行攻击,在显著降低多模态任务性能的同时有效减少了计算开销。
English Summary: The proposed Vision Encoder Attack (VEAttack) effectively compromises Large Vision-Language Models by targeting only the vision encoder, achieving significant performance degradation in multimodal tasks while reducing computational costs.
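
The core objective described above, minimizing the cosine similarity between clean and perturbed visual features under a bounded perturbation, can be sketched on a toy problem. Everything here is an assumption made for illustration: the "encoder" is a single linear map `W`, the update is a signed-gradient step with an L-infinity bound `eps`, and the function names are hypothetical. The actual VEAttack operates on the image tokens of a real vision encoder, which this sketch does not model.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosine_grad_wrt_b(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Gradient of cosine(a, b) with respect to b."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return a / (na * nb) - (a @ b) * b / (na * nb**3)


def encoder_attack(W, x, eps=0.1, step=0.02, n_steps=50, seed=0):
    """Perturb x within an L-infinity ball to lower cosine(W x, W x_adv)."""
    rng = np.random.default_rng(seed)
    clean = W @ x
    # Random start: at delta = 0 the similarity is maximal and its gradient is zero.
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(n_steps):
        grad = W.T @ cosine_grad_wrt_b(clean, W @ (x + delta))
        # Descend on the similarity, then project back into the eps-ball.
        delta = np.clip(delta - step * np.sign(grad), -eps, eps)
    return x + delta


if __name__ == "__main__":
    W = np.random.default_rng(1).normal(size=(8, 16))
    x = np.random.default_rng(2).normal(size=16)
    x_adv = encoder_attack(W, x)
    print(cosine(W @ x, W @ x_adv))
```

Note that the loop needs only the encoder and the clean input: no labels, task heads, or language-model gradients appear anywhere, which is the property the abstract emphasizes.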

Authors:Rui Wang, Qianguo Sun, Tianrong Chen, Zhiyun Zeng, Junlong Wu, Jiaxing Zhang
Title: UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information
Abstract:
The emergence of multi-codebook neural audio codecs such as Residual Vector Quantization (RVQ) and Group Vector Quantization (GVQ) has significantly advanced Large-Language-Model (LLM) based Text-to-Speech (TTS) systems. These codecs are crucial in separating semantic and acoustic information while efficiently harnessing semantic priors. However, since semantic and acoustic information cannot be fully aligned, a significant drawback of these methods when applied to LLM-based TTS is that large language models may have limited access to comprehensive audio information. To address this limitation, we propose DistilCodec and UniTTS, which collectively offer the following advantages: 1) This method can distill a multi-codebook audio codec into a single-codebook audio codec with 32,768 codes while achieving a near 100% utilization. 2) As DistilCodec does not employ a semantic alignment scheme, a large amount of high-quality unlabeled audio (such as audiobooks with sound effects, songs, etc.) can be incorporated during training, further expanding data diversity and broadening its applicability. 3) Leveraging the comprehensive audio information modeling of DistilCodec, we integrated three key tasks into UniTTS's pre-training framework: audio modality autoregression, text modality autoregression, and speech-text cross-modal autoregression. This allows UniTTS to accept interleaved text and speech/audio prompts while substantially preserving LLM's text capabilities. 4) UniTTS employs a three-stage training process: Pre-Training, Supervised Fine-Tuning (SFT), and Alignment. Source code and model checkpoints are publicly available at https://github.com/IDEA-Emdoor-Lab/UniTTS and https://github.com/IDEA-Emdoor-Lab/DistilCodec.
中文: 提出的DistilCodec和UniTTS通过将多码本音频编解码器蒸馏为高利用率单码本格式,并结合全面音频建模,解决了现有方法在语义与声学信息不对齐时的局限性,实现了增强的文本转语音系统,同时保持了语言模型能力并扩展了训练数据多样性。
English: The proposed DistilCodec and UniTTS address limitations in multi-codebook audio codecs by distilling them into a single-codebook format with high code utilization and integrating comprehensive audio modeling, enabling enhanced text-to-speech systems with preserved language model capabilities and expanded training data diversity.

Authors:Wei Jie Yeo, Rui Mao, Moloud Abdar, Erik Cambria, Ranjan Satapathy
Title: Debiasing CLIP: Interpreting and Correcting Bias in Attention Heads
Abstract:
Multimodal models like CLIP have gained significant attention due to their remarkable zero-shot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce Locate-Then-Correct (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving a gain of over $50\%$ in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representation of selected heads and find that the presented interpretation corroborates our contrastive mechanism for identifying both spurious and salient attention heads. Code available at https://github.com/wj210/CLIP_LTC.
Chinese: 提出的“定位后修正”框架通过针对特定注意力头识别并缓解CLIP模型中的伪关联,在存在偏见的基准测试中显著提升了最差组准确率。
English: The proposed Locate-Then-Correct framework identifies and mitigates spurious associations in CLIP models by targeting specific attention heads, significantly improving worst-group accuracy on biased benchmarks.

Authors:Haoyu Sun, Huichen Will Wang, Jiawei Gu, Linjie Li, Yu Cheng
Title: FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow
Abstract:
Front-end engineering involves a complex workflow where engineers conceptualize designs, translate them into code, and iteratively refine the implementation. While recent benchmarks primarily focus on converting visual designs to code, we present FullFront, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) across the full front-end development pipeline. FullFront assesses three fundamental tasks that map directly to the front-end engineering pipeline: Webpage Design (conceptualization phase), Webpage Perception QA (comprehension of visual organization and elements), and Webpage Code Generation (implementation phase). Unlike existing benchmarks that use either scraped websites with bloated code or oversimplified LLM-generated HTML, FullFront employs a novel, two-stage process to transform real-world webpages into clean, standardized HTML while maintaining diverse visual designs and avoiding copyright issues. Extensive testing of state-of-the-art MLLMs reveals significant limitations in page perception, code generation (particularly for image handling and layout), and interaction implementation. Our results quantitatively demonstrate performance disparities across models and tasks, and highlight a substantial gap between current MLLM capabilities and human expert performance in front-end engineering. The FullFront benchmark and code are available at https://github.com/Mikivishy/FullFront.
中文摘要:FullFront是一个评估多模态大语言模型在前端开发全流程中性能的综合基准,通过三阶段任务测试发现现有模型在页面感知、代码生成和交互实现方面与人类专家存在显著差距。
English Summary: FullFront is a comprehensive benchmark designed to evaluate Multimodal Large Language Models across the entire front-end development pipeline, revealing significant performance gaps compared to human experts in webpage perception, code generation, and interaction implementation.

Authors:Xianzhong Ding, Yunkai Zhang, Binbin Chen, Donghao Ying, Tieying Zhang, Jianjun Chen, Lei Zhang, Alberto Cerpa, Wan Du
Title: Towards VM Rescheduling Optimization Through Deep Reinforcement Learning
Abstract:
Modern industry-scale data centers need to manage a large number of virtual machines (VMs). Due to the continual creation and release of VMs, many small resource fragments are scattered across physical machines (PMs). To handle these fragments, data centers periodically reschedule some VMs to alternative PMs, a practice commonly referred to as VM rescheduling. Despite the increasing importance of VM rescheduling as data centers grow in size, the problem remains understudied. We first show that, unlike most combinatorial optimization tasks, the inference time of VM rescheduling algorithms significantly influences their performance, due to dynamic VM state changes during this period. This causes existing methods to scale poorly. Therefore, we develop a reinforcement learning system for VM rescheduling, VMR2L, which incorporates a set of customized techniques, such as a two-stage framework that accommodates diverse constraints and workload conditions, a feature extraction module that captures relational information specific to rescheduling, as well as a risk-seeking evaluation enabling users to optimize the trade-off between latency and accuracy. We conduct extensive experiments with data from an industry-scale data center. Our results show that VMR2L can achieve performance comparable to the optimal solution with a running time of only seconds. Code and datasets are open-sourced: https://github.com/zhykoties/VMR2L_eurosys, https://drive.google.com/drive/folders/1PfRo1cVwuhH30XhsE2Np3xqJn2GpX5qy.
中文: 现代数据中心因虚拟机管理导致资源碎片化,为此开发了VMR2L强化学习系统,通过高效的重调度优化性能并显著缩短运行时间。
English: Modern data centers face challenges with resource fragmentation from virtual machine (VM) management, prompting the development of VMR2L, a reinforcement learning system that efficiently reschedules VMs to optimize performance and reduce runtime.

Authors:Yuheng Wu, Jianwen Xie, Denghui Zhang, Zhaozhuo Xu
Title: DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic
Abstract:
Theory-of-Mind (ToM) tasks pose a unique challenge for large language models (LLMs), which often lack the capability for dynamic logical reasoning. In this work, we propose DEL-ToM, a framework that improves verifiable ToM reasoning through inference-time scaling rather than architectural changes. Our approach decomposes ToM tasks into a sequence of belief updates grounded in Dynamic Epistemic Logic (DEL), enabling structured and verifiable dynamic logical reasoning. We use data generated automatically via a DEL simulator to train a verifier, which we call the Process Belief Model (PBM), to score each belief update step. During inference, the PBM evaluates candidate belief traces from the LLM and selects the highest-scoring one. This allows LLMs to allocate extra inference-time compute to yield more transparent reasoning. Experiments across model scales and benchmarks show that DEL-ToM consistently improves performance, demonstrating that verifiable belief supervision significantly enhances LLMs' ToM capabilities without retraining. Code is available at https://github.com/joel-wu/DEL-ToM.
中文: DEL-ToM通过将心理理论任务分解为基于动态认知逻辑的可验证信念更新,并利用过程信念模型进行推理时评分,无需重新训练即可显著提升大型语言模型的推理能力。
English: DEL-ToM enhances large language models' Theory-of-Mind reasoning by decomposing tasks into verifiable belief updates using Dynamic Epistemic Logic and a Process Belief Model for inference-time scoring, improving performance without retraining.
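
The inference-time selection step described in this abstract can be illustrated with a minimal sketch. It assumes each candidate belief trace is a list of steps and that a step-level scorer is available; the function name `select_best_trace`, the toy scorer, and the choice of averaging step scores are hypothetical and are not taken from the paper or its repository.

```python
from typing import Callable, List, Sequence


def select_best_trace(
    traces: Sequence[List[str]],
    score_step: Callable[[str], float],
) -> List[str]:
    """Return the candidate trace with the highest mean step score.

    Averaging per-step scores is an assumption made for this sketch;
    the abstract does not say how step scores are aggregated.
    """
    if not traces:
        raise ValueError("need at least one candidate trace")

    def trace_score(trace: List[str]) -> float:
        if not trace:
            return float("-inf")  # an empty trace can never win
        return sum(score_step(step) for step in trace) / len(trace)

    return max(traces, key=trace_score)


def toy_scorer(step: str) -> float:
    """Hypothetical stand-in for a trained verifier: rewards steps that mention 'box'."""
    return 1.0 if "box" in step else 0.0


candidates = [
    ["Anna thinks the ball is in the basket", "Anna leaves the room"],
    ["Anna thinks the ball is in the box", "Bob moves the ball out of the box"],
]
best = select_best_trace(candidates, toy_scorer)
```

In a real setting the toy scorer would be replaced by a learned verifier, and the candidates would be sampled from the LLM.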

Authors:Hitesh Laxmichand Patel, Amit Agarwal, Arion Das, Bhargava Kumar, Srikant Panda, Priyaranjan Pattnayak, Taki Hasan Rafi, Tejaswini Kumar, Dong-Kyu Chae
Title: SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use
Abstract:
Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.
中文: 企业采用大语言模型处理通信任务时需确保其理解文化背景并安全回应,为此推出SweEval基准测试模型对不当指令的遵循情况,以评估伦理对齐并降低风险。
English: Enterprises are adopting LLMs for communication tasks but need them to handle cultural contexts safely, so SweEval benchmark tests model compliance with inappropriate instructions to ensure ethical alignment and reduce risks.

Authors:Amit Agarwal, Srikant Panda, Kulbhushan Pachauri
Title: FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding
Abstract:
In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision-specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with fewer than 90M parameters, making it well-suited for complex real-world Information Extraction (IE) applications where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments on information extraction tasks, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code: https://github.com/oracle-samples/fs-dag
中文: FS-DAG是一种针对少样本场景下视觉丰富文档理解的可扩展高效模型,通过模块化领域专用架构以不足9000万参数处理OCR错误和领域偏移,在信息抽取任务中展现出卓越性能和更快收敛速度。
English: FS-DAG is a scalable and efficient model for visually rich document understanding in few-shot settings, leveraging modular domain-specific backbones to handle OCR errors and domain shifts with under 90M parameters, demonstrating superior performance and faster convergence in information extraction tasks.

Authors:Harim Kim, Yuhan Wang, Minkyu Ahn, Heeyoul Choi, Yuyin Zhou, Charmgil Hong
Title: Harnessing EHRs for Diffusion-based Anomaly Detection on Chest X-rays
Abstract:
Unsupervised anomaly detection (UAD) in medical imaging is crucial for identifying pathological abnormalities without requiring extensive labeled data. However, existing diffusion-based UAD models rely solely on imaging features, limiting their ability to distinguish between normal anatomical variations and pathological anomalies. To address this, we propose Diff3M, a multi-modal diffusion-based framework that integrates chest X-rays and structured Electronic Health Records (EHRs) for enhanced anomaly detection. Specifically, we introduce a novel image-EHR cross-attention module to incorporate structured clinical context into the image generation process, improving the model's ability to differentiate normal from abnormal features. Additionally, we develop a static masking strategy to enhance the reconstruction of normal-like images from anomalies. Extensive evaluations on CheXpert and MIMIC-CXR/IV demonstrate that Diff3M achieves state-of-the-art performance, outperforming existing UAD methods in medical imaging. Our code is available at https://github.com/nth221/Diff3M
中文摘要:Diff3M是一种创新的多模态框架,通过将胸部X光与电子健康记录相结合,并利用交叉注意力模块增强医学影像中的无监督异常检测,在基准数据集上实现了最先进的性能。
English Summary: Diff3M is a novel multi-modal framework that enhances unsupervised anomaly detection in medical imaging by integrating chest X-rays with Electronic Health Records through a cross-attention module, achieving state-of-the-art performance on benchmark datasets.

Authors:Phat Thanh Dang, Saahil Thoppay, Wang Yang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
Title: SELF: Self-Extend the Context Length With Logistic Growth Function
Abstract:
Large language models suffer from issues when operating on contexts longer than their training context length, due to the standard position encoding for tokens in the attention layer. Tokens a long distance apart rarely affect each other, and long prompts yield unexpected results. To solve this problem, we propose SELF (Self-Extend the Context Length With Logistic Growth Function): a solution that groups consecutive tokens at varying group sizes using a logistic capacity equation, combined with a constant group size at smaller relative distances. Our model improved performance by up to 12% compared to the LongLM extension method on LEval (specifically on the Qwen model). On summarization-related tasks in LongBench, our model performed up to 6.4% better than LongLM (specifically on the Llama-2-7b model). On reading comprehension tasks from LEval, our model performed up to 5.4% better than LongLM. Our code is available at https://github.com/alexeipc/SELF-LLM.
中文:SELF方法通过逻辑函数对连续标记进行分组来扩展语言模型的有效上下文长度,在多项长文本任务中相比现有方法最高实现了12%的性能提升。
English: The SELF method extends language models' effective context length by grouping tokens with a logistic function, achieving performance improvements of up to 12% over existing methods on various long-context tasks.
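
To make the grouping idea concrete, the sketch below shows one way a group size could stay constant at small relative distances and then follow a logistic growth curve. The function `group_size`, its parameters (`window`, `max_group`, `rate`, `midpoint`), and their default values are all assumptions chosen for illustration; the abstract does not give the exact equation, so this is not the paper's formula.

```python
import math


def group_size(
    distance: int,
    window: int = 512,       # assumed: distances below this keep group size 1
    max_group: int = 8,      # assumed: upper bound (capacity) of the logistic curve
    rate: float = 0.005,     # assumed: growth rate of the logistic curve
    midpoint: int = 2048,    # assumed: distance at which growth is steepest
) -> int:
    """Illustrative group size for a given relative token distance."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance < window:
        return 1  # constant group size close to the query token
    # Logistic growth from 1 towards max_group as the distance increases.
    logistic = (max_group - 1) / (1.0 + math.exp(-rate * (distance - midpoint)))
    return int(math.floor(1 + logistic))


sizes = [group_size(d) for d in (0, 256, 1024, 2048, 4096, 100_000)]
```

Nearby tokens keep their exact relative positions, while distant tokens share positions in progressively larger groups, up to the assumed capacity.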

Authors:Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, Benjamin Wild
Title: JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model
Abstract:
Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types, including genetic sequences. However, adapting LLMs to genomics presents significant challenges. Capturing complex genomic interactions requires modeling long-range dependencies within DNA sequences, where interactions often span over 10,000 base pairs, even within a single gene, posing substantial computational burdens under conventional model architectures and training paradigms. Moreover, standard LLM training approaches are suboptimal for DNA: autoregressive training, while efficient, supports only unidirectional understanding. However, DNA is inherently bidirectional, e.g., bidirectional promoters regulate transcription in both directions and account for nearly 11% of human gene expression. Masked language models (MLMs) allow bidirectional understanding but are inefficient, as only masked tokens contribute to the loss per step. To address these limitations, we introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm that combines the optimization efficiency of autoregressive modeling with the bidirectional comprehension of masked modeling. JanusDNA adopts a hybrid Mamba, Attention and Mixture of Experts (MoE) architecture, combining long-range modeling of Attention with efficient sequential learning of Mamba. MoE layers further scale model capacity via sparse activation while keeping computational cost low. Notably, JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU. Extensive experiments and ablations show JanusDNA achieves new SOTA results on three genomic representation benchmarks, outperforming models with 250x more activated parameters. Code: https://github.com/Qihao-Duan/JanusDNA
中文摘要:JanusDNA是一种创新的双向DNA基础模型,通过结合Mamba-注意力-专家混合的混合架构,将高效的自回归训练与双向理解能力相融合,能够处理长达100万个碱基对,并在基因组表征基准测试中取得了最先进的性能表现。
English Summary: JanusDNA is a novel bidirectional DNA foundation model that combines efficient autoregressive training with bidirectional comprehension using a hybrid Mamba-Attention-MoE architecture, enabling processing of up to 1 million base pairs while achieving state-of-the-art results on genomic benchmarks.

Authors:Razvan-Gabriel Dumitru, Darius Peteleaza, Vikas Yadav, Liangming Pan
Title: ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models
Abstract:
Large language models excel at complex tasks by breaking down problems into structured reasoning steps. However, reasoning traces often extend beyond reaching a correct answer, causing wasted computation, reduced readability, and hallucinations. To address this, we introduce a novel hyperparameter-free conciseness score used as a reward signal within a reinforcement learning framework to guide models toward generating correct and concise reasoning traces. This score is evaluated by a large language model acting as a judge, enabling dynamic, context-aware feedback beyond simple token length. Our method achieves state-of-the-art efficiency-accuracy trade-offs on the MATH dataset, reducing token usage by up to 31x on simple problems while improving accuracy by 7%, and on the hardest problems, it outperforms full reasoning by +7.5% accuracy with up to 3.6x fewer tokens. On TheoremQA, our method improves accuracy by +2.2% using 12.5x fewer tokens. We also conduct ablation studies on the judge model, reward composition, and problem difficulty, showing that our method dynamically adapts reasoning length based on problem difficulty and benefits significantly from stronger judges. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/ConciseRL.
中文: 本文提出一种无超参数简洁度评分方法,通过强化学习框架优化推理路径的准确性与简洁性,在多个数据集上实现最优效率-准确率平衡,并能根据问题难度动态调整推理长度。
English: This paper introduces a hyperparameter-free conciseness score integrated into a reinforcement learning framework to optimize reasoning traces for both correctness and brevity, achieving superior efficiency-accuracy trade-offs across multiple datasets while dynamically adapting to problem difficulty.

Authors:Niklas Holzner, Sebastian Maier, Stefan Feuerriegel
Title: Generative AI and Creativity: A Systematic Literature Review and Meta-Analysis
Abstract:
Generative artificial intelligence (GenAI) is increasingly used to support a wide range of human tasks, yet empirical evidence on its effect on creativity remains scattered. Can GenAI generate ideas that are creative? To what extent can it support humans in generating ideas that are both creative and diverse? In this study, we conduct a meta-analysis to evaluate the effect of GenAI on performance in creative tasks. For this, we first perform a systematic literature search, based on which we identify n = 28 relevant studies (m = 8214 participants) for inclusion in our meta-analysis. We then compute standardized effect sizes based on Hedges' g. We compare different outcomes: (i) how creative GenAI is; (ii) how creative humans augmented by GenAI are; and (iii) the diversity of ideas by humans augmented by GenAI. Our results show no significant difference in creative performance between GenAI and humans (g = -0.05), while humans collaborating with GenAI significantly outperform those working without assistance (g = 0.27). However, GenAI has a significant negative effect on the diversity of ideas for such collaborations between humans and GenAI (g = -0.86). We further analyze heterogeneity across different GenAI models (e.g., GPT-3.5, GPT-4), different tasks (e.g., creative writing, ideation, divergent thinking), and different participant populations (e.g., laypeople, business, academia). Overall, our results position GenAI as an augmentative tool that can support, rather than replace, human creativity, particularly in tasks benefiting from ideation support.
中文: 荟萃分析表明,生成式人工智能在创造力方面未超越人类,但人机协作能显著提升创意表现,不过会降低想法多样性。
English: A meta-analysis reveals that generative AI does not outperform humans in creativity but significantly enhances human creative performance when used collaboratively, although it reduces idea diversity.
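
For readers unfamiliar with the effect size used here, the following sketch computes Hedges' g for two independent groups using the standard pooled standard deviation and the common small-sample correction J = 1 - 3 / (4 * df - 1). The function name `hedges_g` and the input numbers are illustrative assumptions, not values from the study.

```python
import math


def hedges_g(
    mean_a: float, mean_b: float,
    sd_a: float, sd_b: float,
    n_a: int, n_b: int,
) -> float:
    """Standardized mean difference (group A minus group B) with small-sample correction."""
    df = n_a + n_b - 2
    if df <= 0:
        raise ValueError("need at least three observations in total")
    # Pooled standard deviation across the two groups.
    pooled_sd = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df)
    if pooled_sd == 0:
        raise ValueError("pooled standard deviation is zero")
    cohens_d = (mean_a - mean_b) / pooled_sd
    # Approximate correction factor that shrinks d slightly for small samples.
    correction = 1 - 3 / (4 * df - 1)
    return correction * cohens_d


# Made-up example: two groups of 20 with unit standard deviation, means 0.5 apart.
g = hedges_g(5.5, 5.0, 1.0, 1.0, 20, 20)
```

A positive g means group A scored higher than group B; swapping the groups flips the sign, which is how negative values such as those reported for idea diversity arise.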

Authors:Hongjian Zhou, Haoyu Yang, Ziang Ying, Nicholas Gangi, Zhaoran Huang, Haoxing Ren, Joaquin Matres, Jiaqi Gu
Title: LiDAR 2.0: Hierarchical Curvy Waveguide Detailed Routing for Large-Scale Photonic Integrated Circuits
Abstract:
Driven by innovations in photonic computing and interconnects, photonic integrated circuit (PIC) designs advance and grow in complexity. Traditional manual physical design processes have become increasingly cumbersome. Available PIC layout tools are mostly schematic-driven, which has not alleviated the burden of manual waveguide planning and layout drawing. Previous research in PIC automated routing is largely adapted from electronic design, focusing on high-level planning and overlooking photonic-specific constraints such as curvy waveguides, bending, and port alignment. As a result, they fail to scale and cannot generate DRV-free layouts, highlighting the need for dedicated electronic-photonic design automation tools to streamline PIC physical design. In this work, we present LiDAR, the first automated PIC detailed router for large-scale designs. It features a grid-based, curvy-aware A* engine with adaptive crossing insertion, congestion-aware net ordering, and insertion-loss optimization. To enable routing in more compact and complex designs, we further extend our router to hierarchical routing as LiDAR 2.0. It introduces redundant-bend elimination, crossing space preservation, and routing order refinement for improved conflict resilience. We also develop and open-source a YAML-based PIC intermediate representation and diverse benchmarks, including TeMPO, GWOR, and Bennes, which feature hierarchical structures and high crossing densities. Evaluations across various benchmarks show that LiDAR 2.0 consistently produces DRV-free layouts, achieving up to 16% lower insertion loss and 7.69x speedup over prior methods on spacious cases, and 9% lower insertion loss with 6.95x speedup over LiDAR 1.0 on compact cases. Our codes are open-sourced at https://github.com/ScopeX-ASU/LiDAR.
中文:LiDAR 2.0是一款先进的光子集成电路自动布线工具,能高效生成无缺陷布局,显著降低插入损耗并提升运算速度,有效解决了以往方法在处理复杂光子特定约束方面的不足。
English: LiDAR 2.0 is an advanced automated photonic integrated circuit router that efficiently generates defect-free layouts with reduced insertion loss and faster performance, addressing the limitations of previous methods in handling complex photonic-specific constraints.

Authors:Siyang Song, Micol Spitale, Xiangyu Kong, Hengde Zhu, Cheng Luo, Cristina Palmero, German Barquero, Sergio Escalera, Michel Valstar, Mohamed Daoudi, Tobias Baur, Fabien Ringeval, Andrew Howes, Elisabeth Andre, Hatice Gunes
Title: REACT 2025: the Third Multiple Appropriate Facial Reaction Generation Challenge
Abstract:
In dyadic interactions, a broad spectrum of human facial reactions might be appropriate for responding to each human speaker behaviour. Following the successful organisation of the REACT 2023 and REACT 2024 challenges, we propose the REACT 2025 challenge, encouraging the development and benchmarking of Machine Learning (ML) models that can be used to generate multiple appropriate, diverse, realistic and synchronised human-style facial reactions expressed by human listeners in response to an input stimulus (i.e., audio-visual behaviours expressed by their corresponding speakers). As a key part of the challenge, we provide challenge participants with the first natural and large-scale multi-modal MAFRG dataset (called MARS), recording 137 human-human dyadic interactions containing a total of 2856 interaction sessions covering five different topics. In addition, this paper presents the challenge guidelines and the performance of our baselines on the two proposed sub-challenges: Offline MAFRG and Online MAFRG. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline_react2025
中文摘要:REACT 2025挑战赛旨在开发能生成多样化同步面部反应的机器学习模型,并提供了大规模人类对话数据集MARS作为关键支持。
English Summary: The REACT 2025 challenge promotes developing ML models to generate diverse and synchronized facial reactions in dyadic interactions, supported by the large-scale MARS dataset of human-human conversations.

Authors:Georgios Chochlakis, Peter Wu, Arjun Bedi, Marcus Ma, Kristina Lerman, Shrikanth Narayanan
Title: Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts
Abstract:
Modeling complex subjective tasks in Natural Language Processing, such as recognizing emotion and morality, is considerably challenging due to significant variation in human annotations. This variation often reflects reasonable differences in semantic interpretations rather than mere noise, necessitating methods to distinguish between legitimate subjectivity and error. We address this challenge by exploring label verification in these contexts using Large Language Models (LLMs). First, we propose a simple In-Context Learning binary filtering baseline that estimates the reasonableness of a document-label pair. We then introduce the Label-in-a-Haystack setting: the query and its label(s) are included in the demonstrations shown to LLMs, which are prompted to predict the label(s) again, while receiving task-specific instructions (e.g., emotion recognition) rather than label copying. We show that failures to copy the label(s) to the output of the LLM are task-relevant and informative. Building on this, we propose the Label-in-a-Haystack Rectification (LiaHR) framework for subjective label correction: when the model outputs diverge from the reference gold labels, we assign the generated labels to the example instead of discarding it. This approach can be integrated into annotation pipelines to enhance signal-to-noise ratios. Comprehensive analyses, human evaluations, and ecological validity studies verify the utility of LiaHR for label correction. Code is available at https://github.com/gchochla/liahr.
中文摘要:该研究针对主观性自然语言处理任务中的标注差异问题,提出了"大海捞针式标签修正"框架,利用大语言模型识别并修正标签偏差,通过验证后的标签替换而非丢弃不一致数据来提升标注质量。
English Summary: The study addresses annotation variability in subjective NLP tasks by introducing the Label-in-a-Haystack Rectification framework, which uses LLMs to identify and correct label discrepancies, thereby improving annotation quality through validated label replacement instead of discarding inconsistent data.

Authors:Kangda Wei, Hasnat Md Abdullah, Ruihong Huang
Title: Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs
Abstract:
Large Language Models (LLMs) often exhibit gender bias, resulting in unequal treatment of male and female subjects across different contexts. To address this issue, we propose a novel data generation framework that fosters exploratory thinking in LLMs. Our approach prompts models to generate story pairs featuring male and female protagonists in structurally identical, morally ambiguous scenarios, then elicits and compares their moral judgments. When inconsistencies arise, the model is guided to produce balanced, gender-neutral judgments. These story-judgment pairs are used to fine-tune or optimize the models via Direct Preference Optimization (DPO). Experimental results show that our method significantly reduces gender bias while preserving or even enhancing general model capabilities. We release the code and generated data at: https://github.com/WeiKangda/LLMs-Exploratory-Bias-Mitigation/tree/main.
Chinese: 本研究提出了一种数据生成框架,通过创建平衡的故事对并采用直接偏好优化方法,有效减少大型语言模型中的性别偏见,同时保持模型性能。
English: This study introduces a data generation framework to mitigate gender bias in Large Language Models by creating balanced story pairs and using Direct Preference Optimization, which effectively reduces bias while maintaining model performance.

Authors:Huaiyuan Yao, Pengfei Li, Bu Jin, Yupeng Zheng, An Liu, Lisen Mu, Qing Su, Qian Zhang, Yilun Chen, Peng Li
Title: LiloDriver: A Lifelong Learning Framework for Closed-loop Motion Planning in Long-tail Autonomous Driving Scenarios
Abstract:
Recent autonomous driving research has advanced towards motion planners that are robust, safe, and adaptive. However, existing rule-based and data-driven planners lack adaptability to long-tail scenarios, while knowledge-driven methods offer strong reasoning but face challenges in representation, control, and real-world evaluation. To address these challenges, we present LiloDriver, a lifelong learning framework for closed-loop motion planning in long-tail autonomous driving scenarios. By integrating large language models (LLMs) with a memory-augmented planner generation system, LiloDriver continuously adapts to new scenarios without retraining. It features a four-stage architecture including perception, scene encoding, memory-based strategy refinement, and LLM-guided reasoning. Evaluated on the nuPlan benchmark, LiloDriver achieves superior performance in both common and rare driving scenarios, outperforming static rule-based and learning-based planners. Our results highlight the effectiveness of combining structured memory and LLM reasoning to enable scalable, human-like motion planning in real-world autonomous driving. Our code is available at https://github.com/Hyan-Yao/LiloDriver.
中文摘要:LiloDriver是一种终身学习框架,通过结合大语言模型与记忆增强规划器,实现了自动驾驶中自适应运动规划,在nuPlan基准测试中于常见和罕见场景均表现优异。
English Summary: LiloDriver is a lifelong learning framework that integrates large language models with a memory-augmented planner to achieve adaptive motion planning in autonomous driving, demonstrating superior performance across both common and rare scenarios on the nuPlan benchmark.

Authors:Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin
Title: OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
Abstract:
Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50\% across OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.
中文: OCR-Reasoning基准测试旨在系统评估多模态大语言模型在文本丰富图像推理任务中的表现,结果显示即使最先进的模型也面临巨大挑战,准确率均未超过50%,凸显了解决这一问题的紧迫性。
English: The OCR-Reasoning benchmark is introduced to systematically evaluate Multimodal Large Language Models on text-rich image reasoning tasks, revealing that even state-of-the-art models struggle significantly with accuracy below 50%, highlighting an urgent need for improvement in this area.

Authors:Qin Chen, Yuanyi Ren, Xiaojun Ma, Yuyang Shi
Title: Large Language Models for Predictive Analysis: How Far Are They?
Abstract:
Predictive analysis is a cornerstone of modern decision-making, with applications in various domains. Large Language Models (LLMs) have emerged as powerful tools in enabling nuanced, knowledge-intensive conversations, thus aiding in complex decision-making tasks. With the burgeoning expectation to harness LLMs for predictive analysis, there is an urgent need to systematically assess their capability in this domain. However, there is a lack of relevant evaluations in existing studies. To bridge this gap, we introduce the PredictiQ benchmark, which integrates 1130 sophisticated predictive analysis queries originating from 44 real-world datasets of 8 diverse fields. We design an evaluation protocol considering text analysis, code generation, and their alignment. Twelve renowned LLMs are evaluated, offering insights into their practical use in predictive analysis. Generally, we believe that existing LLMs still face considerable challenges in conducting predictive analysis. See the GitHub repository at https://github.com/Cqkkkkkk/PredictiQ.
Chinese: 本文提出了PredictiQ基准,用于系统评估十二种大型语言模型的预测分析能力,发现尽管它们在此领域具有潜力,但仍面临重大挑战。
English: The PredictiQ benchmark is introduced to systematically evaluate the predictive analysis capabilities of twelve large language models, revealing that they still face significant challenges despite their potential in this domain.

Authors:Bohan Jin, Shuhan Qi, Kehai Chen, Xinyi Guo, Xuan Wang
Title: MDIT-Bench: Evaluating the Dual-Implicit Toxicity in Large Multimodal Models
Abstract:
The widespread use of Large Multimodal Models (LMMs) has raised concerns about model toxicity. However, current research mainly focuses on explicit toxicity, with less attention to more implicit toxicity involving prejudice and discrimination. To address this limitation, we introduce a subtler type of toxicity named dual-implicit toxicity and a novel toxicity benchmark termed MDIT-Bench: Multimodal Dual-Implicit Toxicity Benchmark. Specifically, we first create the MDIT-Dataset with dual-implicit toxicity using the proposed Multi-stage Human-in-loop In-context Generation method. Based on this dataset, we construct the MDIT-Bench, a benchmark for evaluating the sensitivity of models to dual-implicit toxicity, with 317,638 questions covering 12 categories, 23 subcategories, and 780 topics. MDIT-Bench includes three difficulty levels, and we propose a metric to measure the toxicity gap exhibited by the model across them. In the experiment, we evaluated 13 prominent LMMs on MDIT-Bench, and the results show that these LMMs cannot handle dual-implicit toxicity effectively. The models' performance drops significantly at the hard level, revealing that these LMMs still contain a significant amount of hidden but activatable toxicity. Data are available at https://github.com/nuo1nuo/MDIT-Bench.
中文: 本研究提出了双重隐性毒性和MDIT-Bench基准,发现当前大型多模态模型难以有效识别微妙的偏见与歧视问题,且在更高难度级别表现显著下降。
English: This study introduces dual-implicit toxicity and MDIT-Bench, a benchmark revealing that current large multimodal models struggle with detecting subtle prejudice and discrimination, especially at higher difficulty levels.

Authors:Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
Title: RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
Abstract:
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio--Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks -- including egocentric and exocentric tasks -- show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.
中文: RAVEN通过其核心组件QuART实现了跨模态查询条件门控机制,能动态评估各模态标记的相关性以增强有效信号并抑制干扰,在多模态问答基准测试中显著提升准确率,并在模态受损情况下保持优异鲁棒性。
English: RAVEN introduces QuART, a query-conditioned cross-modal gating module that dynamically weights tokens across modalities to enhance relevant signals and suppress distractors, achieving significant accuracy improvements on multimodal QA benchmarks while maintaining robustness against modality corruption.
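
The query-conditioned gating described in the abstract can be illustrated with a small sketch. This is not the authors' QuART implementation: the dot-product scoring, the sigmoid gate, and the name `gate_tokens` are assumptions made for this example, and a real model would use learned projections.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gate_tokens(query, tokens):
    """Scale every token by a scalar relevance score in (0, 1).

    Assumption: relevance = sigmoid(query . token). A trained model
    would replace this with a learned scoring function.
    """
    scores = [sigmoid(dot(query, t)) for t in tokens]
    gated = [[s * x for x in t] for s, t in zip(scores, tokens)]
    return scores, gated


# Toy example: one "video" token aligned with the query and one
# "audio" token pointing away from it (a distractor).
query = [1.0, 0.0]
video_token = [2.0, 0.1]
audio_token = [-2.0, 0.3]
scores, gated = gate_tokens(query, [video_token, audio_token])
```

In this toy case the aligned token keeps most of its magnitude while the distractor is scaled down before any fusion step sees it.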

Authors:Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Charan Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique
Title: CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution
Abstract:
Large Language Model (LLM) agents can automate cybersecurity tasks and can adapt to the evolving cybersecurity landscape without re-engineering. While LLM agents have demonstrated cybersecurity capabilities on Capture-The-Flag (CTF) competitions, they have two key limitations: accessing the latest cybersecurity expertise beyond training data, and integrating new knowledge into complex task planning. Knowledge-based approaches that incorporate technical understanding into the task-solving automation can tackle these limitations. We present CRAKEN, a knowledge-based LLM agent framework that improves cybersecurity capability through three core mechanisms: contextual decomposition of task-critical information, iterative self-reflected knowledge retrieval, and knowledge-hint injection that transforms insights into adaptive attack strategies. Comprehensive evaluations with different configurations show CRAKEN's effectiveness in multi-stage vulnerability detection and exploitation compared to previous approaches. Our extensible architecture establishes new methodologies for embedding new security knowledge into LLM-driven cybersecurity agentic systems. With a knowledge database of CTF writeups, CRAKEN obtained an accuracy of 22% on NYU CTF Bench, outperforming prior works by 3% and achieving state-of-the-art results. On evaluation of MITRE ATT&CK techniques, CRAKEN solves 25-30% more techniques than prior work, demonstrating improved cybersecurity capabilities via knowledge-based execution. We make our framework open source at https://github.com/NYU-LLM-CTF/nyuctf_agents_craken.
中文: CRAKEN作为一种基于知识的大型语言模型代理框架,通过整合情境分解、迭代知识检索和自适应策略注入,提升了网络安全自动化能力,在漏洞检测和攻击技术执行方面实现了顶尖性能。
English: CRAKEN, a knowledge-based LLM agent framework, enhances cybersecurity automation by integrating contextual decomposition, iterative knowledge retrieval, and adaptive strategy injection, achieving state-of-the-art performance in vulnerability detection and attack technique execution.

Authors:Xiaozhao Liu, Dinggang Shen, Xihui Liu
Title: Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation
Abstract:
Pretrained generative models have opened new frontiers in brain decoding by enabling the synthesis of realistic texts and images from non-invasive brain recordings. However, the reliability of such outputs remains questionable--whether they truly reflect semantic activation in the brain, or are merely hallucinated by the powerful generative models. In this paper, we focus on EEG-to-text decoding and address its hallucination issue through the lens of posterior collapse. Acknowledging the underlying mismatch in information capacity between EEG and text, we reframe the decoding task as semantic summarization of core meanings rather than the verbatim reconstruction of stimulus texts pursued in previous work. To this end, we propose the Generative Language Inspection Model (GLIM), which emphasizes learning informative and interpretable EEG representations to improve semantic grounding under heterogeneous and small-scale data conditions. Experiments on the public ZuCo dataset demonstrate that GLIM consistently generates fluent, EEG-grounded sentences without teacher forcing. Moreover, it supports more robust evaluation beyond text similarity, through EEG-text retrieval and zero-shot semantic classification across sentiment categories, relation types, and corpus topics. Together, our architecture and evaluation protocols lay the foundation for reliable and scalable benchmarking in generative brain decoding.
中文: 预训练生成模型通过从脑电图信号合成文本推进了脑解码,但存在可靠性问题,本研究通过语义概括方法和提出的GLIM模型来解决,以增强语义基础和评估。
English: Pretrained generative models advance brain decoding by synthesizing text from EEG signals, but face reliability issues, which this study addresses through a semantic summarization approach and the proposed GLIM model to enhance grounding and evaluation.

Authors:Kaibo Huang, Zipei Zhang, Yukun Wei, TianXin Zhang, Zhongliang Yang, Linna Zhou
Title: GSDFuse: Capturing Cognitive Inconsistencies from Multi-Dimensional Weak Signals in Social Media Steganalysis
Abstract:
The ubiquity of social media platforms facilitates malicious linguistic steganography, posing significant security risks. Steganalysis is profoundly hindered by the challenge of identifying subtle cognitive inconsistencies arising from textual fragmentation and complex dialogue structures, and the difficulty in achieving robust aggregation of multi-dimensional weak signals, especially given extreme steganographic sparsity and sophisticated steganography. These core detection difficulties are compounded by significant data imbalance. This paper introduces GSDFuse, a novel method designed to systematically overcome these obstacles. GSDFuse employs a holistic approach, synergistically integrating hierarchical multi-modal feature engineering to capture diverse signals, strategic data augmentation to address sparsity, adaptive evidence fusion to intelligently aggregate weak signals, and discriminative embedding learning to enhance sensitivity to subtle inconsistencies. Experiments on social media datasets demonstrate GSDFuse's state-of-the-art (SOTA) performance in identifying sophisticated steganography within complex dialogue environments. The source code for GSDFuse is available at https://github.com/NebulaEmmaZh/GSDFuse.
中文摘要:本文提出GSDFuse新方法,通过整合分层特征工程、数据增强、自适应融合和判别学习,系统解决了社交媒体恶意语言隐写检测中的核心难题,在复杂对话环境中实现了最先进的检测性能。
English Summary: This paper introduces GSDFuse, a novel method that overcomes key challenges in detecting malicious linguistic steganography on social media by integrating hierarchical feature engineering, data augmentation, adaptive fusion, and discriminative learning, achieving state-of-the-art performance.

Authors:Yiduo Guo, Zhen Guo, Chuanwei Huang, Zi-Ang Wang, Zekai Zhang, Haofei Yu, Huishuai Zhang, Yikang Shen
Title: Synthetic Data RL: Task Definition Is All You Need
Abstract:
Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts the difficulty of the questions based on model solvability, and selects questions using the average pass rate of the model across samples for RL training. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves performance on GSM8K by only 0.4 pp, showing limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
中文: 合成数据强化学习提出了一种仅通过任务定义生成合成数据来微调模型的框架,在多个基准测试中取得显著性能提升,同时减少了对人工标注数据的依赖。
English: Synthetic Data RL introduces a framework that fine-tunes models using only synthetic data generated from task definitions, achieving significant performance improvements across various benchmarks while reducing reliance on human-labeled data.
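
The pass-rate-based selection step mentioned in the abstract might look roughly like the sketch below. The abstract only states that questions are selected using the model's average pass rate across samples; the `[low, high]` band, its default values, and the function names are hypothetical choices for illustration.

```python
def pass_rate(outcomes):
    """Fraction of sampled attempts that were judged correct."""
    return sum(outcomes) / len(outcomes)


def select_questions(attempts, low=0.1, high=0.9):
    """Keep questions whose pass rate lies inside [low, high].

    attempts maps a question id to a list of booleans, one per sampled
    answer. The thresholds are assumptions: the idea is to drop
    questions the model always or never solves.
    """
    rates = {q: pass_rate(o) for q, o in attempts.items()}
    kept = [q for q, r in rates.items() if low <= r <= high]
    return rates, kept


attempts = {
    "q_easy": [True, True, True, True],
    "q_mid": [True, False, True, False],
    "q_hard": [False, False, False, False],
}
rates, kept = select_questions(attempts)
```

With these toy outcomes only `q_mid` survives, since the other two questions carry no learning signal under the assumed band.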

Authors:Xinlong Chen, Yuanxing Zhang, Qiang Liu, Junfei Wu, Fuzheng Zhang, Tieniu Tan
Title: Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model's attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model's attended image tokens, in order to determine whether that attention is correct. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. The code is available at https://github.com/xlchen0205/MoD.
中文: 提出的混合解码方法通过评估原始图像标记与模型关注标记输出的一致性,动态调整解码策略,有效缓解大型视觉语言模型的幻觉问题,并在多个基准测试中显著优于现有方法。
English: The proposed Mixture of Decoding (MoD) approach dynamically adjusts decoding strategies by assessing the consistency between outputs from original and attended image tokens, effectively mitigating hallucinations in Large Vision-Language Models and outperforming existing methods across multiple benchmarks.
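
The switch between the complementary and contrastive strategies can be sketched as follows. The consistency test used here (agreement of the top-scoring token) and the way the two logit vectors are combined are placeholders chosen for illustration; the paper's actual definitions may differ, and `mixture_of_decoding` and `alpha` are hypothetical names.

```python
def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def mixture_of_decoding(logits_full, logits_attended, alpha=0.5):
    """Pick a decoding rule based on output consistency.

    logits_full:     next-token logits given all image tokens
    logits_attended: next-token logits given only the attended tokens
    Assumption: "consistent" means both vectors share the same argmax.
    """
    consistent = argmax(logits_full) == argmax(logits_attended)
    if consistent:
        # Complementary: reinforce the original logits with the attended ones.
        mixed = [f + alpha * a for f, a in zip(logits_full, logits_attended)]
    else:
        # Contrastive: push away from what the attended tokens suggest.
        mixed = [(1 + alpha) * f - alpha * a
                 for f, a in zip(logits_full, logits_attended)]
    return consistent, mixed


agree = mixture_of_decoding([2.0, 1.0, 0.0], [1.5, 0.5, 0.0])
disagree = mixture_of_decoding([2.0, 1.0, 0.0], [0.0, 3.0, 0.0])
```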

Authors:Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang
Title: SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation
Abstract:
In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the following repository https://github.com/bytedance/SALMONN.
Chinese: SALMONN-omni首次提出了无需音频编解码器的独立全双工语音大模型,其动态思维机制能自主切换听说状态,在语音问答基准测试中性能提升超30%,并在复杂对话场景中表现卓越。
English: SALMONN-omni introduces the first standalone full-duplex speech LLM that eliminates audio codecs and incorporates a dynamic thinking mechanism to seamlessly switch between speaking and listening, achieving over 30% performance improvement in benchmarks while excelling in complex conversational scenarios.

Authors:Jingzhi Hu, Geoffrey Ye Li
Title: Distillation-Enabled Knowledge Alignment Protocol for Semantic Communication in AI Agent Networks
Abstract:
Future networks are envisioned to connect massive artificial intelligence (AI) agents, enabling their extensive collaboration on diverse tasks. Compared to traditional entities, these agents are naturally suited to semantic communication (SC), which can significantly enhance bandwidth efficiency. Nevertheless, SC requires the knowledge among agents to be aligned, while in practice agents have distinct expert knowledge for their individual tasks. In this paper, we propose a distillation-enabled knowledge alignment protocol (DeKAP), which distills the expert knowledge of each agent into parameter-efficient low-rank matrices, allocates them across the network, and allows agents to simultaneously maintain aligned knowledge for multiple tasks. We formulate the joint minimization of alignment loss, communication overhead, and storage cost as a large-scale integer linear programming problem and develop a highly efficient greedy algorithm. In computer simulations, DeKAP establishes knowledge alignment with the lowest communication and computation resource usage compared to conventional approaches.
中文: 未来网络将连接大量AI代理以协作完成任务,得益于语义通信提升带宽效率,但需解决知识对齐问题,本文提出的DeKAP协议通过知识蒸馏实现资源最小化的高效对齐。
English: Future networks will connect numerous AI agents for collaborative tasks, benefiting from semantic communication that enhances bandwidth efficiency, but require knowledge alignment which is addressed by the proposed DeKAP protocol using distillation to minimize resources while maintaining performance.

Authors:Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu
Title: GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
Abstract:
Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.
中文摘要:GoT-R1是一个强化学习框架,通过让模型自主开发复杂文本提示的推理策略,并采用统一奖励机制评估语义对齐与空间精度,显著提升了多对象空间关系和属性绑定的图像生成能力。
English Summary: GoT-R1 is a reinforcement learning framework that enhances visual generation by enabling models to autonomously develop reasoning strategies for complex text prompts, achieving superior performance in spatial relationships and attribute binding through a unified reward system.

Authors:Sara Ghaboura, Ketan More, Wafa Alghallabi, Omkar Thawakar, Jorma Laaksonen, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Title: ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark
Abstract:
As Large Multimodal Models (LMMs) become more capable, there is growing interest in evaluating their reasoning processes alongside their final outputs. However, most benchmarks remain focused on English, overlooking languages with rich linguistic and cultural contexts, such as Arabic. To address this gap, we introduce the Comprehensive Arabic Multimodal Reasoning Benchmark (ARB), the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. ARB spans 11 diverse domains, including visual reasoning, document understanding, OCR, scientific analysis, and cultural interpretation. It comprises 1,356 multimodal samples paired with 5,119 human-curated reasoning steps and corresponding actions. We evaluated 12 state-of-the-art open- and closed-source LMMs and found persistent challenges in coherence, faithfulness, and cultural grounding. ARB offers a structured framework for diagnosing multimodal reasoning in underrepresented languages and marks a critical step toward inclusive, transparent, and culturally aware AI systems. We release the benchmark, rubric, and evaluation suite to support future research and reproducibility. Code available at: https://github.com/mbzuai-oryx/ARB
中文摘要:综合阿拉伯语多模态推理基准(ARB)作为首个评估阿拉伯语多模态逐步推理的框架,揭示了现有模型在连贯性和文化认知方面的持续不足,同时推动了包容性人工智能的发展。
English Summary: The Comprehensive Arabic Multimodal Reasoning Benchmark (ARB) is introduced as the first evaluation framework for step-by-step multimodal reasoning in Arabic, revealing persistent challenges in existing models' coherence and cultural awareness while promoting inclusive AI development.

Authors:Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, Ray Zhang
Title: CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Abstract:
The advent of Large Multimodal Models (LMMs) has significantly enhanced Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratically growing computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, decoupling long video sequences from LMMs via a dual cross-attention mechanism, which substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.
中文:CrossLMM通过双交叉注意力机制,在减少视觉令牌的同时保持性能完整性,有效解决了大型多模态模型处理长视频序列时的计算效率问题。
English: CrossLMM addresses the computational inefficiency of processing long video sequences in Large Multimodal Models by introducing a dual cross-attention mechanism that significantly reduces visual tokens while maintaining performance through enhanced text-visual interactions.
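
The visual-to-visual step, where pooled tokens act as queries against the original token set, can be sketched in plain Python. This is an illustration under simplifying assumptions: mean pooling over fixed windows, single-head scaled dot-product attention, and no learned projections. The function names are hypothetical and not taken from the paper.

```python
import math


def mean_pool(tokens, window):
    """Average consecutive groups of `window` tokens into one token each."""
    pooled = []
    for i in range(0, len(tokens), window):
        group = tokens[i:i + window]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group)
                       for d in range(dim)])
    return pooled


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention, keys == values."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                        for d in range(dim)])
    return outputs


# Four original visual tokens are pooled down to two query tokens,
# which then attend back to all four originals.
visual_tokens = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
pooled = mean_pool(visual_tokens, window=2)
refined = cross_attention(pooled, visual_tokens)
```

Only the two refined tokens would be passed on, which is where the token savings come from, while each one still draws on the full-resolution set.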

Authors:Chenhao Zhang, Yazhe Niu
Title: Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework
Abstract:
Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel in basic Visual Question Answer (VQA) tasks, they struggle with a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich and multi-level textual representations, (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity, and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Our framework with the lightweight GPT-4o-mini model achieves SOTA performance compared to 15+ MLLMs on the English image implication benchmark and a huge improvement on the Chinese benchmark, performing comparably with the GPT-4o model on Multiple-Choice Question (MCQ) tasks and outperforming it by 36.7% on Open-Style Question (OSQ) tasks. Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at https://github.com/MING-ZCH/Let-Androids-Dream-of-Electric-Sheep.
中文: LAD框架通过感知、知识搜索和推理三阶段方法,解决了AI系统理解图像隐喻含义的难题,在图像隐含意义基准测试中取得了最先进的性能表现。
English: The LAD framework addresses AI's challenge in understanding metaphorical image implications by employing a three-stage process of perception, knowledge search, and reasoning, achieving state-of-the-art performance on image implication benchmarks.

Authors:Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, Xiangyu Yue
Title: SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
Abstract:
Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on the thinking reward comparison of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1.
Chinese: 最新研究提出SophiaVL-R1方法,通过在多模态大语言模型中引入思维过程奖励与结果奖励相结合的方式,有效提升了模型的推理策略和泛化能力,在多个基准测试中实现了优于更大规模模型的性能表现。
English: Recent research introduces SophiaVL-R1, a method that enhances multimodal large language models by incorporating process-based rewards alongside outcome rewards to improve reasoning strategies and generalization, achieving superior performance on benchmarks despite smaller model sizes.
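
A rough sketch of how a trust weight and an annealing schedule could combine the two reward signals is given below. The abstract states only that the weight comes from comparing thinking rewards of correct versus incorrect responses and that the thinking reward is reduced over time. The specific rule used here (full trust when correct responses score at least as high on average, exponentially reduced trust otherwise) and the exponential decay schedule are assumptions made for illustration.

```python
import math


def trust_weight(thinking_rewards, is_correct):
    """Trust the thinking reward less when it favors wrong answers."""
    correct = [r for r, c in zip(thinking_rewards, is_correct) if c]
    wrong = [r for r, c in zip(thinking_rewards, is_correct) if not c]
    if not correct or not wrong:
        return 1.0  # nothing to compare within this group (assumption)
    gap = sum(correct) / len(correct) - sum(wrong) / len(wrong)
    return 1.0 if gap >= 0 else math.exp(gap)


def combined_rewards(outcome_rewards, thinking_rewards, step, total_steps):
    """Outcome reward plus a trust-weighted, annealed thinking reward."""
    is_correct = [o > 0 for o in outcome_rewards]
    gamma = trust_weight(thinking_rewards, is_correct)
    anneal = math.exp(-step / total_steps)  # decays as training proceeds
    return [o + gamma * anneal * t
            for o, t in zip(outcome_rewards, thinking_rewards)]


early = combined_rewards([1.0, 0.0], [0.9, 0.2], step=0, total_steps=100)
late = combined_rewards([1.0, 0.0], [0.9, 0.2], step=100, total_steps=100)
```

Under this sketch the thinking reward contributes fully at step 0 and is scaled by about 0.37 at the final step, so later updates lean mostly on the rule-based outcome reward.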

Authors:Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, Pheng-Ann Heng
Title: Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO
Abstract:
Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at https://github.com/ZiyuGuo99/Image-Generation-CoT
中文: 本研究首次对自回归图像生成中的GRPO和DPO强化学习算法进行全面分析,揭示了它们各自的优势,并证明具有更强泛化能力的奖励模型能够同时提升域内性能和跨域泛化能力。
English: This study provides the first comprehensive analysis of GRPO and DPO reinforcement learning algorithms in autoregressive image generation, revealing their distinct advantages and demonstrating how reward models with stronger generalization capabilities can enhance both in-domain performance and out-of-domain generalization.
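
For readers less familiar with the two algorithms being compared, the sketch below shows their core quantities in the commonly cited form: the group-normalized advantage used by GRPO and the pairwise preference loss used by DPO. It is a generic illustration, not code from this study, and the reward values and log-probabilities are made-up inputs.

```python
import math


def grpo_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group's mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-sigmoid of the scaled log-ratio margin."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# GRPO: four sampled images for one prompt, scored by a reward model.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])

# DPO: one (chosen, rejected) pair with policy and reference log-probs.
loss = dpo_loss(-1.0, -2.0, -1.5, -1.5)
```

GRPO needs a scalar reward for every on-policy sample, whereas DPO needs only preference pairs, which is one reason the choice of reward model affects the two methods differently.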

Authors:Yan Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, Xue Yang
Title: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
Abstract:
We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
中文: AdapTok是一种自适应视频分词器,通过内容感知策略动态分配每帧的令牌,在固定预算下无需额外数据即可提升视频重建与生成效率。
English: AdapTok is an adaptive video tokenizer that dynamically allocates tokens per frame using a content-aware strategy, enhancing reconstruction and generation efficiency under controlled budgets without extra data.
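
The budgeted allocation step described in the abstract can be illustrated with a small sketch. It assumes each block has a predicted reconstruction score for a few candidate token counts and picks one count per block so that the total stays within the budget. The scores, the candidate counts, and the function name `allocate_tokens` are hypothetical, and the sketch solves the selection with dynamic programming rather than the integer linear programming solver the paper mentions; it is not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def allocate_tokens(
    scores: Sequence[Sequence[float]],
    token_options: Sequence[int],
    budget: int,
) -> Tuple[float, List[int]]:
    """Choose one token count per block to maximize the summed predicted score.

    scores[b][k] is the (hypothetical) predicted quality of block b when it
    is given token_options[k] tokens. The total must not exceed `budget`.
    """
    neg_inf = float("-inf")
    n_blocks = len(scores)
    # dp[b][c]: best total score for the first b blocks using exactly c tokens.
    dp = [[neg_inf] * (budget + 1) for _ in range(n_blocks + 1)]
    # pick[b][c]: index of the option chosen for block b-1 to reach dp[b][c].
    pick = [[-1] * (budget + 1) for _ in range(n_blocks + 1)]
    dp[0][0] = 0.0

    for b in range(n_blocks):
        for used in range(budget + 1):
            if dp[b][used] == neg_inf:
                continue
            for k, tokens in enumerate(token_options):
                total = used + tokens
                if total > budget:
                    continue
                candidate = dp[b][used] + scores[b][k]
                if candidate > dp[b + 1][total]:
                    dp[b + 1][total] = candidate
                    pick[b + 1][total] = k

    end = max(range(budget + 1), key=lambda c: dp[n_blocks][c])
    if dp[n_blocks][end] == neg_inf:
        raise ValueError("budget is too small for any allocation")

    # Walk backwards through the table to recover the per-block choices.
    allocation: List[int] = []
    used = end
    for b in range(n_blocks, 0, -1):
        k = pick[b][used]
        allocation.append(token_options[k])
        used -= token_options[k]
    allocation.reverse()
    return dp[n_blocks][end], allocation


if __name__ == "__main__":
    # Block 0 is nearly static, so extra tokens barely help it; block 1 benefits most.
    example_scores = [[0.90, 0.92, 0.93], [0.50, 0.80, 0.95], [0.70, 0.85, 0.90]]
    print(allocate_tokens(example_scores, token_options=[1, 2, 4], budget=7))
```

In this toy setting the static block receives the fewest tokens, which is the content-aware behaviour the abstract describes.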

Authors:Huatong Song, Jinhao Jiang, Wenqing Tian, Zhipeng Chen, Yuhuan Wu, Jiahao Zhao, Yingqian Min, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Title: R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning
Abstract:
Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods often are costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT Cold-start phase for preliminary format learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses outcome-supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at https://github.com/RUCAIBox/R1-Searcher-plus.
中文: 本文提出R1-Searcher++框架,通过两阶段训练策略使大语言模型能够自适应地利用内部和外部知识,相比现有方法实现了更优的性能和高效的检索能力。
English: This paper introduces R1-Searcher++, a framework that trains LLMs to adaptively use both internal and external knowledge through a two-stage training strategy, achieving superior performance and efficient retrieval compared to previous methods.

Authors:Jiachen Yao, Abbas Mammadov, Julius Berner, Gavin Kerrigan, Jong Chul Ye, Kamyar Azizzadenesheli, Anima Anandkumar
Title: Guided Diffusion Sampling on Function Spaces with Applications to PDEs
Abstract:
We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie's formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% observation, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at https://github.com/neuraloperator/FunDPS
中文摘要:本文提出FunDPS框架,通过函数空间扩散模型和即插即用引导机制,在仅有3%观测数据的极端条件下实现PDE反问题的高精度求解,其无网格特性为前后问题提供了首个独立于离散化的实用解决方案。
English Summary: This paper introduces FunDPS, a discretization-agnostic diffusion framework for solving PDE inverse problems that accurately recovers full solutions from minimal data using neural operators and gradient guidance, achieving significant accuracy improvements with fewer sampling steps.

Authors:Jin Jiang, Jianing Wang, Yuchen Yan, Yang Liu, Jianhua Zhu, Mengdi Zhang, Xunliang Cai, Liangcai Gao
Title: Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?
Abstract:
Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. We also curate formal-language-related training data to further enhance small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at https://github.com/jiangjin1999/FormalEval.
Chinese: 本研究利用形式语言全面评估大语言模型的逻辑推理能力,发现思维模型优于指令模型,所有模型在归纳推理方面存在不足,而PoT格式数据泛化性能最佳,且拒绝式微调可进一步提升模型表现。
English: This study comprehensively evaluates large language models' logical reasoning capabilities using formal languages, finding that thinking models outperform instruct models, all models struggle with inductive reasoning, and PoT-formatted data yields the best generalization, with rejected fine-tuning further enhancing performance.

Authors:Runyang You, Yongqi Li, Xinyu Lin, Xin Zhang, Wenjie Wang, Wenjie Li, Liqiang Nie
Title: $\text{R}^2\text{ec}$: Towards Large Recommender Models with Reasoning
Abstract:
Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. Current studies usually position LLMs as external reasoning modules to yield auxiliary thought for augmenting conventional recommendation pipelines. However, such decoupled designs suffer from significant resource cost and suboptimal joint optimization. To address these issues, we propose $\text{R}^2\text{ec}$, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of $\text{R}^2\text{ec}$ simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to simulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of $\text{R}^2\text{ec}$, showing relative improvements of 68.67% in Hit@5 and 45.21% in NDCG@20. Code available at https://github.com/YRYangang/RRec.
Chinese: 作者提出了一种具有内在推理能力的统一大型推荐模型,通过自回归架构和强化学习框架将推理与推荐相结合,无需专门推理标注即可同时优化两者,实现了显著的性能提升。
English: The authors propose a unified large recommender model with intrinsic reasoning capabilities, integrating reasoning and recommendation through an autoregressive architecture and a reinforcement learning framework that optimizes both without requiring specialized reasoning annotations, achieving significant performance improvements.

Authors:Aleksandra Franz, Hao Wei, Luca Guastoni, Nils Thuerey
Title: PICT -- A Differentiable, GPU-Accelerated Multi-Block PISO Solver for Simulation-Coupled Learning Tasks in Fluid Dynamics
Abstract:
Despite decades of advancements, the simulation of fluids remains one of the most challenging areas in scientific computing. Supported by the necessity of gradient information in deep learning, differentiable simulators have emerged as an effective tool for optimization and learning in physics simulations. In this work, we present our fluid simulator PICT, a differentiable pressure-implicit solver coded in PyTorch with graphics processing unit (GPU) support. We first verify the accuracy of both the forward simulation and our derived gradients in various established benchmarks like lid-driven cavities and turbulent channel flows before we show that the gradients provided by our solver can be used to learn complicated turbulence models in 2D and 3D. We apply both supervised and unsupervised training regimes using physical priors to match flow statistics. In particular, we learn a stable sub-grid scale (SGS) model for a 3D turbulent channel flow purely based on reference statistics. The low-resolution corrector trained with our solver runs substantially faster than the highly resolved references, while keeping or even surpassing their accuracy. Finally, we give additional insights into the physical interpretation of different solver gradients, and motivate a physically informed regularization technique. To ensure that the full potential of PICT can be leveraged, it is published as open source: https://github.com/tum-pbs/PICT.
中文: PICT是一种基于PyTorch的GPU加速可微分流体模拟器,通过监督和无监督训练有效学习湍流模型,在保持甚至超越高分辨率基准精度的同时大幅提升计算速度,并已开源发布。
English: PICT is a GPU-accelerated differentiable fluid simulator written in PyTorch that supports learning turbulence models through supervised and unsupervised training; the resulting low-resolution corrector runs substantially faster than high-resolution references while matching or even surpassing their accuracy, and the solver is released as open source.

Authors:Runpeng Yu, Xinyin Ma, Xinchao Wang
Title: Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
Abstract:
In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple is only $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at https://github.com/yu-rp/Dimple.
中文摘要:本文提出了首个离散扩散多模态大语言模型Dimple,通过结合自回归与扩散训练的新范式解决了训练不稳定和长度偏差问题,不仅性能超越同类模型,还通过置信解码显著提升推理效率,并利用结构先验实现精细化的响应控制。
English Summary: This paper introduces Dimple, the first discrete diffusion multimodal large language model, which overcomes training instability and length bias through a hybrid autoregressive-diffusion approach, achieving superior performance and enhanced inference efficiency via confident decoding while enabling precise response control through structure priors.

Authors:Moru Liu, Hao Dong, Jessica Kelly, Olga Fink, Mario Trapp
Title: Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation
Abstract:
Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at https://github.com/mona4399/FeatureMixing.
Chinese: 本文提出特征混合法,一种快速通用的多模态异常合成方法,有效提升多数据集上的OOD检测性能,在实现最优表现的同时大幅提升运算速度。
English: This paper introduces Feature Mixing, a fast and versatile method for multimodal outlier synthesis that enhances OOD detection across various datasets, achieving top performance with significant speed improvements.

Authors:Junlong Tong, Jinlan Fu, Zixuan Lin, Yingqi Fan, Anhao Zhao, Hui Su, Xiaoyu Shen
Title: LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding
Abstract:
Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at https://github.com/EIT-NLP/StreamingLLM.
Chinese: 本研究通过识别关键不匹配问题并引入无需修改架构的分组位置编码方法,有效提升了批处理大语言模型在流式应用中的性能,并在多任务实验中验证了其优越性。
English: This study addresses the inefficiencies in adapting batch-oriented Large Language Models for streaming by identifying key mismatches and introducing a group position encoding method that enhances performance without architectural changes, validated across diverse tasks.

Authors:Weizhi Tang, Yixuan Li, Chris Sypherd, Elizabeth Polgreen, Vaishak Belle
Title: HyGenar: An LLM-Driven Hybrid Genetic Algorithm for Few-Shot Grammar Generation
Abstract:
Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and guiding structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we aim to study and improve the ability of LLMs for few-shot grammar generation, where grammars are inferred from a small set of positive and negative examples and generated in Backus-Naur Form. To explore this, we introduce a novel dataset comprising 540 structured grammar generation challenges, devise 6 metrics, and evaluate 8 LLMs against it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. To address this, we propose an LLM-driven hybrid genetic algorithm, namely HyGenar, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of generated grammars across LLMs.
中文: 本文通过引入新数据集和评估指标,研究并提升了大语言模型在少样本语法生成方面的能力,发现现有模型表现欠佳,进而提出HyGenar混合遗传算法,显著提高了生成语法的句法和语义正确性。
English: This paper investigates and enhances the few-shot grammar generation capabilities of large language models by introducing a new dataset and metrics, revealing their suboptimal performance and proposing HyGenar, a hybrid genetic algorithm that significantly improves the syntactic and semantic correctness of generated grammars.

Authors:Siqi Wan, Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei
Title: Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On
Abstract:
Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the recipe for the VTON task. Nevertheless, preserving the shape and every detail of the given garment remains challenging due to the intrinsic stochasticity of diffusion models. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as the prior to tame the diffusion process instead of simply feeding the whole garment into the UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to the ones over the target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with the depth/normal map of the target person. The correspondence mimics the way of putting clothing on the human body, and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to take full advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performance on both the VITON-HD and DressCode datasets. Code is publicly available at: https://github.com/HiDream-ai/SPM-Diff.
中文摘要:本文提出了一种新颖的基于扩散模型的虚拟试穿方法,通过利用显式视觉对应关系和三维感知语义点匹配来引导扩散过程,显著提升了服装细节保持能力,在多个基准数据集上实现了最先进的性能表现。
English Summary: This paper introduces a novel diffusion-based virtual try-on method that enhances garment detail preservation by using explicit visual correspondence and 3D-aware semantic point matching to guide the diffusion process, achieving state-of-the-art performance on benchmark datasets.

Authors:Yaxin Du, Yuzhu Cai, Yifan Zhou, Cheng Wang, Yu Qian, Xianghe Pang, Qian Liu, Yue Hu, Siheng Chen
Title: SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development
Abstract:
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g. code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enabled a 7B model to perform comparably to GPT-4o on the hard split, underscoring the value of its high-quality training data. Code is available at https://github.com/DorothyDUUU/SWE-Dev.
中文: SWE-Dev数据集作为首个针对现实世界功能开发任务的大规模基准,不仅揭示了当前AI模型在此领域的重大挑战,还通过高质量训练数据显著提升了模型性能,使7B参数模型在困难任务上达到与GPT-4o相当的水平。
English: The SWE-Dev dataset is introduced as the first large-scale benchmark for evaluating and training autonomous coding systems on real-world feature development tasks, demonstrating both the challenge of this domain for current AI models and the dataset's effectiveness in enabling significant model improvements through fine-tuning.

Authors:Zongyan Han, Jiale Cao, Shuo Chen, Tong Wang, Jorma Laaksonen, Rao Muhammad Anwer
Title: OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning
Abstract:
Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS models to distinguish similar categories in open-world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual reason for objects in a coarse-to-fine manner. Based on these reasoning steps, we compose detailed description prompts and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at https://github.com/Hanzy1996/OpenSeg-R.
Chinese: OpenSeg-R通过引入基于大型多模态模型的逐步视觉推理框架,生成分层推理提示以增强开放词汇分割,在多个基准测试中显著提升了准确性和可解释性。
English: OpenSeg-R introduces a step-by-step visual reasoning framework using Large Multimodal Models to enhance open-vocabulary segmentation by generating hierarchical reasoning prompts, significantly improving accuracy and interpretability across multiple benchmarks.

Authors:Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin
Title: Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
Abstract:
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35$\times$ and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.
中文: 通过剔除低质量数据集并采用级联大语言模型提示重标注假阴性样本,显著提升了检索和重排序模型在BEIR与AIR-Bench评估中的性能表现。
English: Pruning low-quality datasets and using cascading LLM prompts to relabel false negatives significantly improves retrieval and reranker model performance on BEIR and AIR-Bench evaluations.
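
For readers unfamiliar with the nDCG@10 figures quoted above, the following minimal sketch computes the metric for a single query using the common linear-gain form. It assumes the relevance list covers every judged passage for the query, so the ideal ranking is obtained by sorting that same list; the function names are illustrative and are not taken from the paper's code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking.

    Assumption: `relevances` lists the labels of all judged passages in
    ranked order, so sorting it in descending order yields the ideal ranking.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # A ranking that places an irrelevant passage above two relevant ones.
    print(ndcg_at_k([0, 1, 1]))
```

Because the score depends on the relevance labels as well as the ranking, a relevant passage wrongly labeled as irrelevant changes the value that a given ranking receives.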

Authors:Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev
Title: MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
Abstract:
Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
中文: 尽管大型语言模型的安全性有所进步,对抗性攻击仍能有效引发有害输出,而提出的MixAT方法在训练中结合离散和连续攻击,以最小计算开销显著提升模型鲁棒性。
English: Despite advancements in Large Language Model (LLM) safety, adversarial attacks still effectively induce harmful outputs, and the proposed MixAT method combines discrete and continuous attacks during training to significantly enhance robustness with minimal computational overhead.
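
The abstract names the ALO-ASR metric without giving a formula. One plausible reading, sketched below as an assumption rather than the authors' definition, counts a prompt as broken if at least one attack in the evaluated set succeeds on it and reports the fraction of broken prompts. The attack names and outcomes are made up for illustration.

```python
from typing import Dict, List


def alo_asr(results: Dict[str, List[bool]]) -> float:
    """Fraction of prompts on which at least one attack succeeded.

    results maps an attack name to per-prompt success flags; every list must
    cover the same prompts in the same order. This aggregation is an assumed
    reading of "At Least One Attack Success Rate".
    """
    per_attack = list(results.values())
    if not per_attack:
        raise ValueError("no attack results given")
    n_prompts = len(per_attack[0])
    if n_prompts == 0 or any(len(flags) != n_prompts for flags in per_attack):
        raise ValueError("all attacks must cover the same non-empty prompt set")
    # zip(*per_attack) groups the flags of all attacks for each prompt.
    broken = sum(any(flags) for flags in zip(*per_attack))
    return broken / n_prompts


if __name__ == "__main__":
    # Hypothetical outcomes of three attacks on four prompts.
    outcomes = {
        "attack_a": [True, False, False, False],
        "attack_b": [False, True, False, False],
        "attack_c": [False, False, False, False],
    }
    print(alo_asr(outcomes))
```

Under this reading the aggregate can exceed every per-attack success rate, which is why it serves as a worst-case measure: in the example each attack alone breaks at most one prompt in four, while together they break two.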

Authors:InternAgent Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai
Title: InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
Abstract:
Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce InternAgent, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. InternAgent offers three key advantages: 1) Scalability: InternAgent has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: InternAgent provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: InternAgent has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, performance increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.65 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
中文: InternAgent作为一种统一的多智能体框架,通过其可扩展性、交互性和高效性,在多个科学领域推动自主研究,实现快速创新并与领域专家无缝协作。
English: InternAgent is a unified multi-agent framework that accelerates autonomous scientific research across various fields, offering scalability, interactivity, and efficiency by enabling rapid innovation and seamless human-expert collaboration.

Authors:Daniel F. Perez-Ramirez, Dejan Kostic, Magnus Boman
Title: CASTILLO: Characterizing Response Length Distributions of Large Language Models
Abstract:
Efficiently managing compute resources for Large Language Model (LLM) inference remains challenging due to the inherently stochastic and variable lengths of autoregressive text generation. Accurately estimating response lengths in advance enables proactive resource allocation, yet existing approaches either bias text generation towards certain lengths or rely on assumptions that ignore model- and prompt-specific variability. We introduce CASTILLO, a dataset characterizing response length distributions across 13 widely-used open-source LLMs evaluated on seven distinct instruction-following corpora. For each $\langle$prompt, model$\rangle$ sample pair, we generate 10 independent completions using fixed decoding hyper-parameters, record the token length of each response, and publish summary statistics (mean, std-dev, percentiles), along with the shortest and longest completions, and the exact generation settings. Our analysis reveals significant inter- and intra-model variability in response lengths (even under identical generation settings), as well as model-specific behaviors and occurrences of partial text degeneration in only subsets of responses. CASTILLO enables the development of predictive models for proactive scheduling and provides a systematic framework for analyzing model-specific generation behaviors. We publicly release the dataset and code to foster research at the intersection of generative language modeling and systems.
中文摘要:CASTILLO数据集系统刻画了13种开源大语言模型的响应长度分布,揭示了显著的长度变异性,为开发预测模型以实现推理资源的前瞻性调度提供了基础。
English Summary: CASTILLO is a dataset that characterizes response length distributions across 13 open-source LLMs, revealing significant variability and enabling predictive models for proactive resource allocation in LLM inference.
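The per-pair statistics described in the abstract (mean, standard deviation, percentiles, shortest and longest completion) can be sketched as follows. This is a minimal illustration and not the dataset's released code: the function name `summarize_lengths`, the choice of percentiles, and the example token lengths are all assumptions made for the example.

```python
import statistics

def summarize_lengths(token_lengths):
    """Summary statistics for the token lengths of repeated completions."""
    ordered = sorted(token_lengths)
    # With n=100, statistics.quantiles returns the 99 percentile cut points.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "p25": cuts[24],
        "p50": cuts[49],
        "p75": cuts[74],
        "p99": cuts[98],
        "max": ordered[-1],
    }

# Hypothetical token lengths of 10 completions for one <prompt, model> pair.
# One long outlier stands in for a partially degenerate response.
lengths = [212, 198, 240, 1024, 205, 187, 230, 219, 201, 226]
stats = summarize_lengths(lengths)
print(stats)
```

A single long completion pulls the mean far above the median, which is one reason percentile summaries are useful for scheduling decisions.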

Authors:Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia
Title: Training-Free Efficient Video Generation via Dynamic Token Carving
Abstract:
Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
中文:Jenga是一种创新的推理流程,通过动态优化注意力机制和逐步提升分辨率,显著加速视频扩散模型的生成速度,同时保持生成质量基本不变。
English: Jenga is an innovative inference pipeline that accelerates video diffusion models by dynamically optimizing attention mechanisms and progressively increasing resolution, achieving significant speed improvements without compromising generation quality.

Authors:Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou
Title: Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Abstract:
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO while maintaining or even improving performance. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties under both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at https://github.com/kokolerk/TON.
中文: TON方法采用两阶段训练策略,使视觉语言模型能够选择性跳过不必要的推理步骤,在多种任务中保持或提升性能的同时,将完成长度减少高达90%。
English: The TON method introduces a two-stage training strategy that enables vision-language models to selectively skip unnecessary reasoning steps, reducing completion length by up to 90% while maintaining or improving performance across various tasks.
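The 'thought dropout' operation in the SFT stage can be illustrated with a short sketch. It is not taken from the TON repository: the field names, the `<think>`/`<answer>` tags, and the dropout probability are hypothetical choices made for the example.

```python
import random

def thought_dropout(examples, drop_prob, seed=0):
    """Randomly replace reasoning traces with empty thoughts (answers are kept)."""
    rng = random.Random(seed)
    dropped = []
    for ex in examples:
        thought = "" if rng.random() < drop_prob else ex["thought"]
        dropped.append({"question": ex["question"],
                        "thought": thought,
                        "answer": ex["answer"]})
    return dropped

def format_target(ex):
    """Serialize one example in a think-or-not format (tag names are assumed)."""
    return f"<think>{ex['thought']}</think><answer>{ex['answer']}</answer>"

# Hypothetical SFT examples.
data = [
    {"question": "What is 2 + 2?", "thought": "2 plus 2 equals 4.", "answer": "4"},
    {"question": "Is the cat on the mat?", "thought": "The cat sits on the mat.", "answer": "yes"},
]
all_dropped = thought_dropout(data, drop_prob=1.0)
none_dropped = thought_dropout(data, drop_prob=0.0)
print(format_target(all_dropped[0]))
```

Because the empty-thought targets keep the same outer format, the model sees both "think" and "skip" examples under one template before the RL stage.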

Authors:Yibo Wang, Li Shen, Huanjin Yao, Tiansheng Huang, Rui Liu, Naiqiang Tan, Jiaxing Huang, Kai Zhang, Dacheng Tao
Title: R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search
Abstract:
Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches -- instance-level and token-level -- either sacrifice essential local reasoning signals like reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select the short and coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, with only a 0.6% drop compared to the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at https://github.com/w-yibo/R1-Compress
Chinese: R1-Compress是一种两阶段分块压缩框架,通过在MATH500等测试中实现92.4%的准确率并减少20%的token使用量,有效解决了长链思维推理中的计算效率问题。
English: R1-Compress is a two-stage chunk-level compression framework that significantly reduces token usage in Long-CoT reasoning while maintaining high reasoning accuracy, as demonstrated by achieving 92.4% accuracy with 20% fewer tokens on MATH500.

Authors:Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen
Title: SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis
Abstract:
Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they lack high-quality training trajectories, suffer from distributional mismatches in simulated environments, or incur prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.
中文: SimpleDeepSearcher 是一个轻量级框架,通过模拟真实用户交互和多标准筛选策略合成高质量训练数据,仅需少量样本即可超越基于强化学习的基线方法,有效解决了检索增强生成系统的数据稀缺瓶颈。
English: SimpleDeepSearcher is a lightweight framework that overcomes limitations in retrieval-augmented generation systems by synthesizing high-quality training data through simulated user interactions and multi-criteria curation, achieving superior performance with minimal samples compared to reinforcement learning approaches.

Authors:Haonian Ji, Shi Qiu, Siyang Xin, Siwei Han, Zhaorun Chen, Dake Zhang, Hongyi Wang, Huaxiu Yao
Title: From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization
Abstract:
While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. EduVisBench and EduVisAgent are available at https://github.com/aiming-lab/EduVisBench and https://github.com/aiming-lab/EduVisAgent.
Chinese: 本文提出了用于评估教育领域基础模型视觉推理能力的基准EduVisBench,并开发了多智能体框架EduVisAgent,该框架通过协同合作显著提升了符合教学需求的可视化内容生成效果。
English: This paper introduces EduVisBench, a benchmark for evaluating the visual reasoning capabilities of foundation models in education, and proposes EduVisAgent, a multi-agent framework that significantly enhances the generation of pedagogically effective visualizations.

Authors:Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du
Title: Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Abstract:
Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git.
中文: 当前大语言模型遗忘效果评估依赖的词汇级指标存在误导性,因为模型可能只是表面遗忘而信息仍可恢复,这凸显了需要建立表征分析框架来区分可逆与不可逆遗忘的必要性。
English: Current token-level metrics for evaluating unlearning in LLMs can be misleading, as models may only superficially forget information that remains recoverable, prompting the need for a new representational analysis framework to distinguish between reversible and irreversible forgetting.
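One of the representation-level diagnostics named in the abstract, centered kernel alignment (CKA), can be sketched in its common linear form. Using the linear variant is an assumption here, and the toy activation matrices are hypothetical; the authors' toolkit may compute CKA differently.

```python
import math

def center_columns(m):
    """Subtract the per-column mean from a row-major matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_gram(a, b):
    """Return a^T b for two row-major matrices with the same number of rows."""
    return [[sum(x * y for x, y in zip(col_a, col_b)) for col_b in zip(*b)]
            for col_a in zip(*a)]

def frobenius(m):
    return math.sqrt(sum(v * v for row in m for v in row))

def linear_cka(x, y):
    """Linear CKA between activations x (n x p) and y (n x q) on the same n inputs."""
    x, y = center_columns(x), center_columns(y)
    numerator = frobenius(cross_gram(x, y)) ** 2
    return numerator / (frobenius(cross_gram(x, x)) * frobenius(cross_gram(y, y)))

# Hypothetical activations of one layer on four inputs, before and after unlearning.
before = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rescaled = [[2.0 * v for v in row] for row in before]   # same features, rescaled
permuted = [[row[1], row[0]] for row in before]          # same features, reordered
print(linear_cka(before, rescaled), linear_cka(before, permuted))
```

A score near 1 means the two sets of activations share the same structure up to rotation and scaling, which is the kind of signal that would indicate latent features survived unlearning. The sketch does not guard against constant (zero-variance) activations, which would make the denominator zero.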

Authors:Bin Xie, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Jie Liu, Min Zhang, Liqiang Nie
Title: GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent
Abstract:
GUI automation faces critical challenges in dynamic environments. MLLMs suffer from two key issues: misinterpreting UI components and outdated knowledge. Traditional fine-tuning methods are costly for app-specific knowledge updates. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms: (1) Autonomous Exploration of Function-aware Trajectory. To comprehensively cover all application functionalities, we design a Function-aware Task Goal Generator that automatically constructs exploration goals by analyzing GUI structural information (e.g., screenshots and activity hierarchies). This enables systematic exploration to collect diverse trajectories. (2) Unsupervised Mining of Transition-aware Knowledge. To establish precise screen-operation logic, we develop a Transition-aware Knowledge Extractor that extracts effective screen-operation logic through unsupervised analysis of the state transitions in structured interaction triples (observation, action, outcome). This eliminates the need for human involvement in knowledge extraction. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents. It requires no parameter updates for new apps. GUI-explorer is open-sourced and publicly available at https://github.com/JiuTian-VL/GUI-explorer.
中文:GUI-explorer是一种无需训练的自动化代理,通过自主探索应用功能和提取屏幕操作知识,在移动自动化基准测试中实现了最先进的性能表现。
English: GUI-explorer is a training-free agent that autonomously explores application functionalities and extracts screen-operation knowledge without human involvement, achieving state-of-the-art performance on mobile automation benchmarks.

Authors:Chunyi Li, Jiaohao Xiao, Jianbo Zhang, Farong Wen, Zicheng Zhang, Yuan Tian, Xiangyang Zhu, Xiaohong Liu, Zhengxue Cheng, Weisi Lin, Guangtao Zhai
Title: Perceptual Quality Assessment for Embodied AI
Abstract:
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various distortions in the real world limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) constructed a perception-cognition-decision-execution pipeline, based on the Mertonian system and meta-cognitive theory, and defined a comprehensive subjective score collection process; (2) established the Embodied-IQA database, containing over 36k reference/distorted image pairs, with more than 5M fine-grained annotations provided by Vision-Language Models, Vision-Language-Action models, and real-world robots; (3) trained and validated the performance of mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that through evaluation, we can promote the application of Embodied AI under complex distortions in the real world. Project page: https://github.com/lcysyzxdxc/EmbodiedIQA
中文: 本文提出了面向具身AI的图像质量评估框架Embodied-IQA,通过构建包含数万图像对与精细标注的数据库,建立了感知-认知-决策-执行的完整评估体系,旨在解决现有方法无法衡量机器人任务中图像可用性的关键问题。
English: This paper introduces Embodied-IQA, a novel framework for assessing image quality in embodied AI tasks by establishing a comprehensive database and evaluation pipeline to address the gap in current methods that fail to measure usability for robotic perception.

Authors:Junze Wang, Lei Fan, Weipeng Jing, Donglin Di, Yang Song, Sidong Liu, Cong Cong
Title: Hypergraph Tversky-Aware Domain Incremental Learning for Brain Tumor Segmentation with Missing Modalities
Abstract:
Existing methods for multimodal MRI segmentation with missing modalities typically assume that all MRI modalities are available during training. However, in clinical practice, some modalities may be missing due to the sequential nature of MRI acquisition, leading to performance degradation. Furthermore, retraining models to accommodate newly available modalities can be inefficient and may cause overfitting, potentially compromising previously learned knowledge. To address these challenges, we propose Replay-based Hypergraph Domain Incremental Learning (ReHyDIL) for brain tumor segmentation with missing modalities. ReHyDIL leverages Domain Incremental Learning (DIL) to enable the segmentation model to learn from newly acquired MRI modalities without forgetting previously learned information. To enhance segmentation performance across diverse patient scenarios, we introduce the Cross-Patient Hypergraph Segmentation Network (CHSNet), which utilizes hypergraphs to capture high-order associations between patients. Additionally, we incorporate Tversky-Aware Contrastive (TAC) loss to effectively mitigate information imbalance both across and within different modalities. Extensive experiments on the BraTS2019 dataset demonstrate that ReHyDIL outperforms state-of-the-art methods, achieving an improvement of over 2% in the Dice Similarity Coefficient across various tumor regions. Our code is available at https://github.com/reeive/ReHyDIL.
Chinese: 本文提出ReHyDIL方法,通过领域增量学习和超图网络处理多模态MRI中缺失模态的脑肿瘤分割问题,在BraTS2019数据集上各项肿瘤区域的Dice系数提升超过2%。
English: This paper introduces ReHyDIL, a method that uses domain incremental learning and hypergraph networks to improve brain tumor segmentation in multimodal MRI when modalities are missing, showing over 2% Dice improvement on BraTS2019.
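For readers unfamiliar with the Tversky index that the TAC loss is named after, the sketch below computes the standard index on soft predictions. It shows only the underlying overlap measure under assumed default weights; it is not the paper's Tversky-Aware Contrastive loss, whose exact form is not given in the abstract.

```python
def tversky_index(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky index TP / (TP + alpha * FP + beta * FN) on flattened masks.

    pred holds probabilities in [0, 1]; target holds binary labels.
    alpha weights false positives and beta weights false negatives.
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# Hypothetical flattened masks for one tumor region.
pred = [1.0, 1.0, 0.0, 0.0]
target = [1, 0, 1, 0]
dice_like = tversky_index(pred, target)                       # alpha = beta = 0.5
fn_heavy = tversky_index(pred, target, alpha=0.3, beta=0.7)   # penalize misses more
print(dice_like, fn_heavy)
```

With alpha = beta = 0.5 the index reduces to the Dice coefficient, the metric reported in the abstract; unequal weights let a loss trade off false positives against false negatives.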

Authors:KiHyun Nam, Jungwoo Heo, Jee-weon Jung, Gangin Park, Chaeyoung Jung, Ha-Jin Yu, Joon Son Chung
Title: SEED: Speaker Embedding Enhancement Diffusion Model
Abstract:
A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. For training, our approach progressively adds Gaussian noise to both clean and noisy speaker embeddings, extracted from clean and noisy speech respectively, via the forward process of a diffusion model, and then reconstructs them as clean embeddings in the reverse process. During inference, all embeddings are regenerated via the diffusion process. Our method requires neither speaker labels nor any modification to the existing speaker recognition pipeline. Experiments on evaluation sets simulating environment mismatch scenarios show that our method can improve recognition accuracy by up to 19.6% over baseline models while retaining performance on conventional scenarios. We publish our code at https://github.com/kaistmm/seed-pytorch
Chinese: 本文提出一种基于扩散模型的方法,通过优化说话人嵌入来缓解环境不匹配导致的性能下降,无需说话人标签或修改现有系统即可将识别准确率最高提升19.6%。
English: This paper introduces a diffusion-based method that refines speaker embeddings to mitigate performance degradation from environmental mismatch, improving recognition accuracy by up to 19.6% without requiring speaker labels or changes to existing systems.
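The forward process mentioned in the abstract, which progressively adds Gaussian noise to an embedding, can be sketched with the standard DDPM closed form x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise, where a_t is the cumulative product of (1 - beta). The linear beta schedule, the number of steps, and the toy embedding below are assumptions for illustration and are not taken from the SEED code.

```python
import math
import random

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (num_steps >= 2)."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def forward_noise(embedding, alpha_bar, rng):
    """Sample x_t from x_0 in one shot using the closed-form forward process."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in embedding]

rng = random.Random(0)
alpha_bars = linear_alpha_bars(num_steps=10)
embedding = [0.3, -0.1, 0.8]            # hypothetical speaker embedding
noisy = forward_noise(embedding, alpha_bars[-1], rng)
print(noisy)
```

In training, a denoising network would then be asked to recover the clean embedding from `noisy`; that reverse step is omitted here.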

Authors:Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, Kai Wang, Yang You
Title: REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training
Abstract:
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination that deactivates the alignment loss once a simple trigger, such as a fixed iteration, is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256×256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28× reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, proving to be a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at https://github.com/NUS-HPC-AI-Lab/HASTE .
中文:HASTE提出了一种两阶段训练方案,先通过师生模型对齐实现快速收敛,随后终止对齐以释放生成潜力,无需改变架构即可大幅加速扩散模型的训练。
English: HASTE introduces a two-phase training schedule that initially aligns a diffusion transformer with a teacher model for rapid convergence and then terminates alignment to unleash its full generative capacity, achieving significant speedups without architectural changes.
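The two-phase schedule can be sketched as a loss weight that is switched off once a fixed iteration is reached, the trigger the abstract gives as an example. The function names, the weight value, and the stop step below are hypothetical; the actual alignment loss (attention maps plus feature projections) is represented only by a placeholder number.

```python
def alignment_weight(step, stop_step, weight=0.5):
    """Phase I (step < stop_step): alignment on. Phase II: alignment off for good."""
    return weight if step < stop_step else 0.0

def total_loss(denoise_loss, align_loss, step, stop_step, weight=0.5):
    """Combine the denoising objective with the (possibly deactivated) alignment term."""
    return denoise_loss + alignment_weight(step, stop_step, weight) * align_loss

# Hypothetical loss values at two points in training, with termination at step 1000.
early = total_loss(denoise_loss=1.0, align_loss=2.0, step=10, stop_step=1000)
late = total_loss(denoise_loss=1.0, align_loss=2.0, step=1000, stop_step=1000)
print(early, late)
```

After the stop step the teacher no longer contributes to the gradient, so in practice its forward pass could also be skipped to save compute.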

Authors:Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin
Title: Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Abstract:
As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can inadvertently introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Vulnerability, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity across multiple experimental datasets. We then evaluate the adversarial robustness of these fine-tuned models, analyzing persona shifts and interpretability traits to understand how dataset factors contribute to attack success rates. Lastly, we explore causal relationships that offer new insights into adversarial defense strategies, highlighting the crucial role of dataset design in preserving model alignment. Our code is available at https://github.com/psyonp/accidental_vulnerability.
中文摘要:针对特定领域数据微调大型语言模型可能意外引入脆弱性,这些脆弱性受数据集的语言特征和毒性等因素影响,从而削弱模型鲁棒性,并凸显了策略性数据集设计对防御的重要性。
English Summary: Fine-tuning large language models on domain-specific data can inadvertently introduce accidental vulnerabilities, which are influenced by factors like linguistic features and toxicity in the datasets, ultimately affecting model robustness and highlighting the importance of strategic dataset design for defense.

Authors:Jun Xie, Xiongjun Guan, Yingjian Zhu, Zhaoran Zhao, Xinming Wang, Hongzhu Yi, Feng Chen, Zhepeng Wang
Title: Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles
Abstract:
In this paper, we present the runner-up solution for the Ego4D EgoSchema Challenge at CVPR 2025 (Confirmed on May 20, 2025). Inspired by the success of large models, we evaluate and leverage leading accessible multimodal large models and adapt them to video understanding tasks via few-shot learning and model ensemble strategies. Specifically, diversified prompt styles and process paradigms are systematically explored and evaluated to effectively guide the attention of large models, fully unleashing their powerful generalization and adaptability abilities. Experimental results demonstrate that, with our carefully designed approach, directly utilizing an individual multimodal model already outperforms the previous state-of-the-art (SOTA) method which includes several additional processes. Besides, an additional stage is further introduced that facilitates the cooperation and ensemble of periodic results, which achieves impressive performance improvements. We hope this work serves as a valuable reference for the practical application of large models and inspires future research in the field. Our Code is available at https://github.com/XiongjunGuan/EgoSchema-CVPR25.
中文摘要:本文提出了CVPR 2025 Ego4D EgoSchema挑战赛的亚军解决方案,通过小样本学习和模型集成策略将多模态大模型应用于视频理解任务,实现了超越现有最优方法的性能表现。
English Summary: This paper introduces the runner-up solution for the Ego4D EgoSchema Challenge at CVPR 2025, which adapts multimodal large models to video understanding through few-shot learning and ensemble strategies, achieving state-of-the-art performance.

Authors:Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, Xiaoyu Shen
Title: Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning
Abstract:
Large Language Models (LLMs) have achieved impressive performance on complex reasoning tasks with Chain-of-Thought (CoT) prompting. However, conventional CoT relies on reasoning steps explicitly verbalized in natural language, introducing inefficiencies and limiting its applicability to abstract reasoning. To address this, there has been growing research interest in latent CoT reasoning, where inference occurs within latent spaces. By decoupling reasoning from language, latent reasoning promises richer cognitive representations and more flexible, faster inference. Researchers have explored various directions in this promising field, including training methodologies, structural innovations, and internal reasoning mechanisms. This paper presents a comprehensive overview and analysis of this reasoning paradigm. We begin by proposing a unified taxonomy from four perspectives: token-wise strategies, internal mechanisms, analysis, and applications. We then provide in-depth discussions and comparative analyses of representative methods, highlighting their design patterns, strengths, and open challenges. We aim to provide a structured foundation for advancing this emerging direction in LLM reasoning. The relevant papers will be regularly updated at https://github.com/EIT-NLP/Awesome-Latent-CoT.
中文: 本文对大型语言模型中的潜在思维链推理进行了全面综述,提出了统一分类法并分析各类方法,旨在推动这种解耦式推理范式的发展,以实现更高效灵活的推理能力。
English: This paper provides a comprehensive overview of latent Chain-of-Thought reasoning in Large Language Models, proposing a unified taxonomy and analyzing methods to advance this decoupled reasoning approach for more efficient and flexible inference.

Authors:Yiming Gao, Bin Wang, Chengwei Wei, Shuo Sun, AiTi Aw
Title: IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models
Abstract:
Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess the ability to follow instructions in an audio LLM. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.
中文: 大语言模型在多模态环境下的指令遵循能力常会减弱,为此我们开发了IFEval-Audio数据集,包含280组音频-指令-答案三元组,用于从六个维度评估音频大模型的指令执行能力。
English: Large language models' instruction-following ability often weakens in multimodal settings, prompting the creation of IFEval-Audio, a dataset with 280 audio-instruction-answer triples to evaluate audio-based LLMs across six dimensions.

Authors:Florentin Beck, William Rudman, Carsten Eickhoff
Title: TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning
Abstract:
Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM
中文:TRIM提出了一种针对性的迭代剪枝方法,通过为各层内部维度分配差异化稀疏度,在多类大语言模型压缩中实现了最优性能与稳定性。
English: TRIM introduces a targeted, iterative pruning method that applies varying sparsity to individual dimensions within layers, achieving state-of-the-art performance and stability in LLM compression across multiple models.
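
To make the idea of row-wise (per-output-dimension) sparsity concrete, here is a minimal sketch that prunes each row of a weight matrix to its own sparsity ratio. It uses plain magnitude pruning as a stand-in criterion and takes the per-row ratios as given; TRIM's iterative, metric-driven procedure for choosing those ratios is not reproduced here, and the function name `prune_rows` is hypothetical.

```python
# Minimal sketch of row-wise pruning with a different sparsity ratio per row.
# Assumption: magnitude is used as the pruning criterion (not TRIM's metric).

def prune_rows(weights, row_sparsity):
    """Zero the smallest-magnitude entries of each row.

    weights: list of rows (lists of floats).
    row_sparsity: one ratio in [0, 1] per row.
    """
    pruned = []
    for row, ratio in zip(weights, row_sparsity):
        k = int(len(row) * ratio)  # number of entries to drop in this row
        # indices of the k smallest-magnitude entries
        drop = set(sorted(range(len(row)), key=lambda j: abs(row[j]))[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

W = [[0.9, -0.1, 0.4, 0.05],
     [0.2, -0.3, 0.25, 0.35]]

# Row 0 is pruned to 75% sparsity and row 1 to 25%: 50% overall, unevenly spread.
pruned = prune_rows(W, [0.75, 0.25])
```

Both rows together reach 50% sparsity, the same budget a uniform scheme would give, but the budget is spent where the rows tolerate it; how to pick the per-row ratios is the part the paper addresses.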

Authors:Chengcan Wu, Zhixin Zhang, Zeming Wei, Yihao Zhang, Meng Sun
Title: Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization
Abstract:
The significant progress of large language models (LLMs) has led to remarkable achievements across numerous applications. However, their ability to generate harmful content has sparked substantial safety concerns. Despite the implementation of safety alignment techniques during the pre-training phase, recent research indicates that fine-tuning LLMs on adversarial or even benign data can inadvertently compromise their safety. In this paper, we re-examine the fundamental issue of why fine-tuning on non-harmful data still results in safety degradation. We introduce a safety-aware probing (SAP) optimization framework designed to mitigate the safety risks of fine-tuning LLMs. Specifically, SAP incorporates a safety-aware probe into the gradient propagation process, mitigating the model's risk of safety degradation by identifying potential pitfalls in gradient directions, thereby enhancing task-specific performance while successfully preserving model safety. Our extensive experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model and achieves comparable test loss to standard fine-tuning methods. Our code is available at https://github.com/ChengcanWu/SAP.
Chinese: 本文提出了一种安全感知探测(SAP)框架,通过在梯度传播过程中引入安全感知探针,有效减轻大型语言模型在微调过程中的安全性退化,在保持性能的同时显著降低有害性。
English: The paper introduces a safety-aware probing (SAP) framework that mitigates safety degradation in large language models during fine-tuning by incorporating safety-aware probes into gradient propagation, effectively reducing harmfulness while maintaining performance.

Authors:Ziwei Luo, Fredrik K. Gustafsson, Jens Sjölund, Thomas B. Schön
Title: Forward-only Diffusion Probabilistic Models
Abstract:
This work presents a forward-only diffusion (FoD) approach for generative modelling. In contrast to traditional diffusion models that rely on a coupled forward-backward diffusion scheme, FoD directly learns data generation through a single forward diffusion process, yielding a simple yet efficient generative framework. The core of FoD is a state-dependent linear stochastic differential equation that involves a mean-reverting term in both the drift and diffusion functions. This mean-reversion property guarantees the convergence to clean data, naturally simulating a stochastic interpolation between source and target distributions. More importantly, FoD is analytically tractable and is trained using a simple stochastic flow matching objective, enabling a few-step non-Markov chain sampling during inference. The proposed FoD model, despite its simplicity, achieves competitive performance on various image-conditioned (e.g., image restoration) and unconditional generation tasks, demonstrating its effectiveness in generative modelling. Our code is available at https://github.com/Algolzw/FoD.
Chinese: 本文提出了一种仅前向扩散(FoD)模型,通过采用具有均值回复特性的随机微分方程的单向前向扩散过程,简化了生成建模,并在图像条件生成(如图像修复)和无条件生成任务中取得了有竞争力的性能。
English: This paper introduces a forward-only diffusion (FoD) model that simplifies generative modeling by using a single forward diffusion process with a mean-reverting stochastic differential equation, achieving competitive performance on image-conditioned (e.g., image restoration) and unconditional generation tasks.

Authors:Ziwei Luo, Fredrik K. Gustafsson, Jens Sjölund, Thomas B. Schön
Title: Forward-only Diffusion Probabilistic Models
Abstract:
This work presents a forward-only diffusion (FoD) approach for generative modelling. In contrast to traditional diffusion models that rely on a coupled forward-backward diffusion scheme, FoD directly learns data generation through a single forward diffusion process, yielding a simple yet efficient generative framework. The core of FoD is a state-dependent stochastic differential equation that involves a mean-reverting term in both the drift and diffusion functions. This mean-reversion property guarantees the convergence to clean data, naturally simulating a stochastic interpolation between source and target distributions. More importantly, FoD is analytically tractable and is trained using a simple stochastic flow matching objective, enabling a few-step non-Markov chain sampling during inference. The proposed FoD model, despite its simplicity, achieves state-of-the-art performance on various image restoration tasks. Its general applicability on image-conditioned generation is also demonstrated via qualitative results on image-to-image translation. Our code is available at https://github.com/Algolzw/FoD.
Chinese: 本文提出了一种仅前向扩散(FoD)模型,通过采用具有均值回复特性的随机微分方程的单向前向扩散过程,简化了生成建模,并在图像修复和图像转换任务中实现了最先进的性能。
English: This paper introduces a forward-only diffusion (FoD) model that simplifies generative modeling by using a single forward diffusion process with a mean-reverting stochastic differential equation, achieving state-of-the-art performance in image restoration and image-to-image translation tasks.

Authors:Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen
Title: Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
Abstract:
As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 392 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.
中文摘要:本研究提出了跨语言去毒方法,通过大量实验验证了其在不同语言间降低大型语言模型毒性的有效性,同时揭示了安全性与知识保留之间的权衡关系。
English Summary: This study introduces cross-lingual detoxification to reduce toxicity in large language models across different languages, demonstrating its effectiveness through extensive testing while highlighting the trade-off between safety and knowledge preservation.

Authors:Wenhao Li, Yuxin Zhang, Gen Luo, Daohai Yu, Rongrong Ji
Title: Training Long-Context LLMs Efficiently via Chunk-wise Optimization
Abstract:
While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk's forward activations are stored in memory. Building on SeCO, we further introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer. Implemented as lightweight training wrappers, both SeCO and SpaCO offer substantial practical benefits. For example, when fine-tuning an 8B model with LoRA on a single RTX 3090 GPU, SeCO expands the maximum sequence length from 1K to 16K tokens, while SpaCO trains up to 3x faster than SeCO under the same experimental setup. These innovations provide new insights into optimizing long-context models, making them more accessible for practical applications. We have open-sourced the code at https://github.com/wenhaoli-xmu/seco.
Chinese: 本文提出SeCO和SpaCO两种内存高效的训练方法,通过分块处理长输入来优化大语言模型的长上下文能力,在降低计算成本的同时实现了更长序列处理和更快训练速度。
English: This paper introduces SeCO and SpaCO, two memory-efficient training methods that optimize long-context LLMs by processing input in chunks, enabling longer sequence handling and faster training speeds while reducing computational costs.

Authors:Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, Jiaxing Huang
Title: R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
Abstract:
In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to explore diverse reasoning trajectories over the expanded question space, and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO also shares reward information during advantage computation, which estimates solution advantages hierarchically across and within question variants, allowing more accurate estimation of relative advantages and improving the stability of policy training. Extensive evaluations over six widely-used reasoning benchmarks showcase the superior performance of our method. Code will be available at https://github.com/HJYao00/R1-ShareVL.
中文摘要:本研究提出Share-GRPO方法,通过扩展问题空间并共享多样化推理路径和奖励信息,有效解决强化学习中的稀疏奖励和优势消失问题,从而提升多模态大语言模型的推理能力。
English Summary: This study introduces Share-GRPO, a reinforcement learning approach that enhances multimodal large language models' reasoning by expanding question spaces and sharing diverse reasoning trajectories and reward information to overcome sparse rewards and advantage vanishing issues.
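
The advantage vanishing issue can be seen in a few lines. The sketch below uses the standard group-relative advantage, (r - mean) / std, and made-up reward values. Pooling rewards across question variants is shown only as one simple way to share reward information; it is an assumption for illustration and not the hierarchical estimator described in the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Standard group-relative advantage: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Made-up rewards for responses to the original question and to one
# transformed variant of it.
original = [0.0, 0.0, 0.0, 0.0]
variant = [1.0, 0.0, 1.0, 0.0]

# Within the original group every reward is identical, so every advantage is
# zero and the group contributes no learning signal.
within = group_advantages(original)

# Assumption: pool both groups into one shared baseline. The failed responses
# now get a negative advantage and the correct ones a positive advantage.
pooled = group_advantages(original + variant)
```

With these values `within` is all zeros, while `pooled` separates correct from incorrect responses, which is the effect that sharing rewards across variants is meant to achieve.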

Authors:Xinwei Yang, Zhaofeng Liu, Chen Huang, Jiashuai Zhang, Tong Zhang, Yifan Zhang, Wenqiang Lei
Title: ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming
Abstract:
While recent research increasingly emphasizes the value of human-LLM collaboration in competitive programming and proposes numerous empirical methods, a comprehensive understanding remains elusive due to the fragmented nature of existing studies and their use of diverse, application-specific human feedback. Thus, our work serves a three-fold purpose: First, we present the first taxonomy of human feedback consolidating the entire programming process, which promotes fine-grained evaluation. Second, we introduce ELABORATIONSET, a novel programming dataset specifically designed for human-LLM collaboration, meticulously annotated to enable large-scale simulated human feedback and facilitate cost-effective real human interaction studies. Third, we introduce ELABORATION, a novel benchmark to facilitate a thorough assessment of human-LLM competitive programming. With ELABORATION, we pinpoint strengths and weaknesses of existing methods, thereby setting the foundation for future improvement. Our code and dataset are available at https://github.com/SCUNLP/ELABORATION
中文: 本研究提出了一个全面的人类反馈分类法、ELABORATIONSET数据集和ELABORATION基准,用于系统评估竞争性编程中的人机协作,并识别现有方法的优势与不足。
English: This study introduces a comprehensive taxonomy of human feedback, the ELABORATIONSET dataset, and the ELABORATION benchmark to systematically evaluate human-LLM collaboration in competitive programming, identifying strengths and weaknesses in existing methods.

Authors:Shinnosuke Ono, Issey Sukeda, Takuro Fujii, Kosei Buma, Shunsuke Sasaki
Title: A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP
Abstract:
We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, codes, and datasets are released at https://github.com/EQUES-Inc/pharma-LLM-eval.
中文摘要:本研究开发了针对日语医药领域的专业语言模型,通过双语医学语料持续预训练,在超越现有开源模型的同时与商业模型性能相当,并建立了三个专业基准测试体系进行全面评估。
English Summary: This study introduces a Japanese pharmaceutical domain-specific language model, developed through continual pretraining on bilingual medical corpora, which outperforms existing open models and shows competitive performance with commercial ones while establishing three specialized benchmarks for comprehensive evaluation.

Authors:Giuseppe Guarino, Matteo Ciotola, Gemine Vivone, Giovanni Poggi, Giuseppe Scarpa
Title: Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control
Abstract:
Hyperspectral pansharpening has received much attention in recent years due to technological and methodological advances that open the door to new application scenarios. However, research on this topic is only now gaining momentum. The most popular methods are still borrowed from the more mature field of multispectral pansharpening and often overlook the unique challenges posed by hyperspectral data fusion, such as i) the very large number of bands, ii) the overwhelming noise in selected spectral ranges, iii) the significant spectral mismatch between panchromatic and hyperspectral components, iv) a typically high resolution ratio. Imprecise data modeling especially affects spectral fidelity. Even state-of-the-art methods perform well in certain spectral ranges and much worse in others, failing to ensure consistent quality across all bands, with the risk of generating unreliable results. Here, we propose a hyperspectral pansharpening method that explicitly addresses this problem and ensures uniform spectral quality. To this end, a single lightweight neural network is used, with weights that adapt on the fly to each band. During fine-tuning, the spatial loss is turned on and off to ensure a fast convergence of the spectral loss to the desired level, according to a hysteresis-like dynamic. Furthermore, the spatial loss itself is appropriately redefined to account for nonlinear dependencies between panchromatic and spectral bands. Overall, the proposed method is fully unsupervised, with no prior training on external data, flexible, and low-complexity. Experiments on a recently published benchmarking toolbox show that it ensures excellent sharpening quality, competitive with the state-of-the-art, consistently across all bands. The software code and the full set of results are shared online on https://github.com/giu-guarino/rho-PNN.
中文: 本文提出一种轻量级无监督高光谱全色锐化方法,通过逐波段自适应网络权重并重新定义空间损失函数,确保所有波段光谱质量一致,无需外部训练数据即可获得与先进方法相媲美的效果。
English: This paper introduces a lightweight, unsupervised hyperspectral pansharpening method that adapts network weights per band and redefines spatial loss to ensure uniform spectral quality across all bands, achieving competitive results without external training data.

Authors:Michael Neri, Sara Baldoni
Title: Unsupervised Network Anomaly Detection with Autoencoders and Traffic Images
Abstract:
Due to the recent increase in the number of connected devices, the need to promptly detect security issues is growing. Moreover, the high number of communication flows requires processing huge amounts of data. Furthermore, the connected devices are heterogeneous in nature, with different computational capacities. For this reason, in this work we propose an image-based representation of network traffic that provides a compact summary of the current network conditions over 1-second time windows. The proposed representation highlights the presence of anomalies, thus reducing the need for complex processing architectures. Finally, we present an unsupervised learning approach which effectively detects the presence of anomalies. The code and the dataset are available at https://github.com/michaelneri/image-based-network-traffic-anomaly-detection.
中文: 本研究提出了一种基于图像的网络流量表示方法和无监督学习技术,能有效检测异构连接设备中的安全异常,同时降低对复杂处理架构的需求。
English: This work proposes an image-based network traffic representation and unsupervised learning approach to efficiently detect security anomalies in heterogeneous connected devices, reducing the need for complex processing architectures.

Authors:Zhichao Zhu, Yang Qi, Hengyuan Ma, Wenlian Lu, Jianfeng Feng
Title: Stochastic Forward-Forward Learning through Representational Dimensionality Compression
Abstract:
The Forward-Forward (FF) algorithm provides a bottom-up alternative to backpropagation (BP) for training neural networks, relying on a layer-wise "goodness" function to guide learning. Existing goodness functions, inspired by energy-based learning (EBL), are typically defined as the sum of squared post-synaptic activations, neglecting the correlations between neurons. In this work, we propose a novel goodness function termed dimensionality compression that uses the effective dimensionality (ED) of fluctuating neural responses to incorporate second-order statistical structure. Our objective minimizes ED for clamped inputs when noise is considered while maximizing it across the sample distribution, promoting structured representations without the need to prepare negative samples. We demonstrate that this formulation achieves competitive performance compared to other non-BP methods. Moreover, we show that noise plays a constructive role that can enhance generalization and improve inference when predictions are derived from the mean of squared outputs, which is equivalent to making predictions based on the energy term. Our findings contribute to the development of more biologically plausible learning algorithms and suggest a natural fit for neuromorphic computing, where stochasticity is a computational resource rather than a nuisance. The code is available at https://github.com/ZhichaoZhu/StochasticForwardForward
中文: 本研究为前向-前向算法提出了一种新颖的“维度压缩”优度函数,通过有效维度捕捉神经元相关性,无需负样本即可实现有竞争力的性能,同时揭示了噪声在增强泛化能力和推理中的建设性作用。
English: The study introduces a novel "dimensionality compression" goodness function for the Forward-Forward algorithm, using effective dimensionality to capture neuron correlations and achieve competitive performance without negative samples, while highlighting noise's constructive role in enhancing generalization and inference.
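
For readers unfamiliar with effective dimensionality, the sketch below computes one common definition, the participation ratio tr(C)^2 / tr(C^2) of the covariance matrix C of the responses. Whether the paper uses exactly this form is an assumption; the abstract only states that ED captures second-order statistical structure. The helper names are hypothetical.

```python
# Participation-ratio form of effective dimensionality (assumed definition).

def covariance(samples):
    """Covariance matrix (population normalisation) of row-vector samples."""
    n, d = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(d)]
    return [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / n
             for j in range(d)] for i in range(d)]

def effective_dimensionality(samples):
    """Return tr(C)^2 / tr(C^2) for the covariance C of the samples."""
    c = covariance(samples)
    trace = sum(c[i][i] for i in range(len(c)))
    trace_sq = sum(v * v for row in c for v in row)  # equals tr(C^2) for symmetric C
    return trace * trace / trace_sq

# Two perfectly correlated neurons: the responses lie on a single line.
correlated = [[1, 1], [-1, -1], [2, 2], [-2, -2]]
# Two uncorrelated neurons with equal variance: the responses fill both axes.
uncorrelated = [[1, 0], [-1, 0], [0, 1], [0, -1]]

ed_low = effective_dimensionality(correlated)
ed_high = effective_dimensionality(uncorrelated)
```

Under this definition the correlated responses have an ED of 1 and the uncorrelated ones an ED of 2, which shows why a sum-of-squares goodness, blind to correlations, cannot distinguish the two cases while an ED-based one can.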

Authors:Sushant Gautam, Michael A. Riegler, Pål Halvorsen
Title: Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models
Abstract:
We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge, such as more zero-case point predictions, indicating reduced reliability in edge cases despite overall performance gains. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning. This approach mirrors clinical workflows, where radiologists simultaneously localize, count, and describe findings - demonstrating how VLMs can learn composite diagnostic reasoning patterns. The model produces interpretable, structured outputs, offering a promising step toward explainable and versatile medical AI. Code, model weights, and scripts will be released for reproducibility at https://github.com/simula/PointDetectCount.
中文摘要:本研究通过指令微调视觉语言模型实现医学图像的多任务分析,在提升检测与计数精度的同时揭示了边缘案例可靠性的权衡,展示了适配临床诊断流程的复合推理能力。
English Summary: This study fine-tunes vision-language models for multi-task medical image analysis, demonstrating improved accuracy in detection and counting tasks while noting trade-offs in edge-case reliability through prompt-based adaptation of clinical workflows.

Authors:Sushant Gautam, Cise Midoglu, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen, Mubarak Shah
Title: SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding
Abstract:
The integration of artificial intelligence in sports analytics has transformed soccer video understanding, enabling real-time, automated insights into complex game dynamics. Traditional approaches rely on isolated data streams, limiting their effectiveness in capturing the full context of a match. To address this, we introduce SoccerChat, a multimodal conversational AI framework that integrates visual and textual data for enhanced soccer video comprehension. Leveraging the extensive SoccerNet dataset, enriched with jersey color annotations and automatic speech recognition (ASR) transcripts, SoccerChat is fine-tuned on a structured video instruction dataset to facilitate accurate game understanding, event classification, and referee decision making. We benchmark SoccerChat on action classification and referee decision-making tasks, demonstrating its performance in general soccer event comprehension while maintaining competitive accuracy in referee decision making. Our findings highlight the importance of multimodal integration in advancing soccer analytics, paving the way for more interactive and explainable AI-driven sports analysis. https://github.com/simula/SoccerChat
中文: SoccerChat提出了一种多模态AI框架,通过整合视觉与文本数据提升足球视频理解能力,基于SoccerNet数据集在事件分类和裁判决策任务中展现出优越性能。
English: SoccerChat introduces a multimodal AI framework that integrates visual and textual data to enhance soccer video understanding, demonstrating improved performance in event classification and referee decision-making through the SoccerNet dataset.

Authors:Luyang Cao, Jianwei Li, Yinghuan Shi
Title: Background Matters: A Cross-view Bidirectional Modeling Framework for Semi-supervised Medical Image Segmentation
Abstract:
Semi-supervised medical image segmentation (SSMIS) leverages unlabeled data to reduce reliance on manually annotated images. However, current SOTA approaches predominantly focus on foreground-oriented modeling (i.e., segmenting only the foreground region) and have largely overlooked the potential benefits of explicitly modeling the background region. Our study theoretically and empirically demonstrates that highly certain predictions in background modeling enhance the confidence of corresponding foreground modeling. Building on this insight, we propose the Cross-view Bidirectional Modeling (CVBM) framework, which introduces a novel perspective by incorporating background modeling to improve foreground modeling performance. Within CVBM, background modeling serves as an auxiliary perspective, providing complementary supervisory signals to enhance the confidence of the foreground model. Additionally, CVBM introduces an innovative bidirectional consistency mechanism, which ensures mutual alignment between foreground predictions and background-guided predictions. Extensive experiments demonstrate that our approach achieves SOTA performance on the LA, Pancreas, ACDC, and HRF datasets. Notably, on the Pancreas dataset, CVBM outperforms fully supervised methods (i.e., DSC: 84.57% vs. 83.89%) while utilizing only 20% of the labeled data. Our code is publicly available at https://github.com/caoluyang0830/CVBM.git.
中文: 提出的跨视角双向建模框架通过结合背景建模提升前景分割置信度,并引入双向一致性机制,仅用少量标注数据即在多个数据集上实现最优性能。
English: The proposed Cross-view Bidirectional Modeling (CVBM) framework enhances semi-supervised medical image segmentation by incorporating background modeling to boost foreground confidence and introducing bidirectional consistency, achieving state-of-the-art performance with minimal labeled data.

Authors:Bowen Jiang, Runchuan Zhu, Jiang Wu, Zinco Jiang, Yifan He, Junyuan Gao, Jia Yu, Rui Min, Yinfan Wang, Haote Yang, Songyang Zhang, Dahua Lin, Lijun Wu, Conghui He
Title: Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering
Abstract:
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. These questions enable efficient evaluation using the LLM-as-judge paradigm, testing both the LLMs' factual memory and self-awareness ("know what they don't know"). KoLasSimpleQA expands existing research in two key dimensions: (1) Breadth (Multilingual Coverage): It includes 9 languages, supporting global applicability evaluation. (2) Depth (Dual Domain Design): It covers both the general domain (global facts) and the language-specific domain (such as history, culture, and regional traditions) for a comprehensive assessment of multilingual capabilities. We evaluated mainstream LLMs, including traditional LLMs and emerging Large Reasoning Models. Results show significant performance differences between the two domains, particularly in performance metrics, ranking, calibration, and robustness. This highlights the need for targeted evaluation and optimization in multilingual contexts. We hope KoLasSimpleQA will help the research community better identify LLM capability boundaries in multilingual contexts and provide guidance for model optimization. We will release KoLasSimpleQA at https://github.com/opendatalab/KoLasSimpleQA .
中文:KoLasSimpleQA是首个评估大语言模型多语言事实知识能力的基准,涵盖九种语言和双领域设计,全面评估模型能力并揭示通用领域与语言特定领域之间的性能差异。
English: KoLasSimpleQA is the first multilingual factual knowledge benchmark for Large Language Models, featuring nine languages and dual-domain design to comprehensively assess capabilities and reveal performance gaps between general and language-specific domains.

Authors:Yongqi Zhao, Ji Zhou, Dong Bi, Tomislav Mihalj, Jia Hu, Arno Eichberger
Title: A Survey on the Application of Large Language Models in Scenario-Based Testing of Automated Driving Systems
Abstract:
The safety and reliability of Automated Driving Systems (ADSs) must be validated prior to large-scale deployment. Among existing validation approaches, scenario-based testing has been regarded as a promising method to improve testing efficiency and reduce associated costs. Recently, the emergence of Large Language Models (LLMs) has introduced new opportunities to reinforce this approach. While an increasing number of studies have explored the use of LLMs in the field of automated driving, a dedicated review focusing on their application within scenario-based testing remains absent. This survey addresses this gap by systematically categorizing the roles played by LLMs across various phases of scenario-based testing, drawing from both academic research and industrial practice. In addition, key characteristics of LLMs and corresponding usage strategies are comprehensively summarized. The paper concludes by outlining five open challenges and potential research directions. To support ongoing research efforts, a continuously updated repository of recent advancements and relevant open-source tools is made available at: https://github.com/ftgTUGraz/LLM4ADSTest.
中文摘要:本综述通过系统梳理大语言模型在自动驾驶场景测试各阶段的应用角色,总结了其关键特性与使用策略,填补了该领域专题研究的空白,并提出了五大开放挑战与研究方向,同时提供了持续更新的资源库以支持后续研究。
English Summary: This survey addresses the lack of dedicated reviews on applying Large Language Models (LLMs) to scenario-based testing for automated driving systems by systematically categorizing their roles across testing phases and summarizing key characteristics with usage strategies, while also outlining open challenges and maintaining an updated resource repository.

Authors:Siqu Ou, Hongcheng Liu, Pingjie Wang, Yusheng Liao, Chuan Xuan, Yanfeng Wang, Yu Wang
Title: Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning
Abstract:
While chains-of-thought (CoT) have advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains, often faltering in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. Project is open at https://github.com/Cratileo/D2R.
中文摘要:GRASSLAND基准测试和D2R框架通过将视觉草稿与文本推理链相结合,显著提升了多模态模型在动态空间推理任务中的表现,且无需模型微调。
English Summary: The GRASSLAND benchmark and D2R framework enhance dynamic spatial reasoning in multimodal models by integrating visual drafts with textual reasoning chains, achieving superior performance without fine-tuning.

Authors:Jannis Becktepe, Leona Hennig, Steffen Oeltze-Jafra, Marius Lindauer
Title: Auto-nnU-Net: Towards Automated Medical Image Segmentation
Abstract:
Medical Image Segmentation (MIS) includes diverse tasks, from bone to organ segmentation, each with its own challenges in finding the best segmentation model. The state-of-the-art AutoML-related MIS-framework nnU-Net automates many aspects of model configuration but remains constrained by fixed hyperparameters and heuristic design choices. As a full-AutoML framework for MIS, we propose Auto-nnU-Net, a novel nnU-Net variant enabling hyperparameter optimization (HPO), neural architecture search (NAS), and hierarchical NAS (HNAS). Additionally, we propose Regularized PriorBand to balance model accuracy with the computational resources required for training, addressing the resource constraints often faced in real-world medical settings that limit the feasibility of extensive training procedures. We evaluate our approach across diverse MIS datasets from the well-established Medical Segmentation Decathlon, analyzing the impact of AutoML techniques on segmentation performance, computational efficiency, and model design choices. The results demonstrate that our AutoML approach substantially improves the segmentation performance of nnU-Net on 6 out of 10 datasets and is on par on the other datasets while maintaining practical resource requirements. Our code is available at https://github.com/automl/AutoNNUnet.
Chinese: Auto-nnU-Net 通过引入超参数优化和神经架构搜索,显著提升了 nnU-Net 在多个医学影像数据集上的分割性能,同时保持了实际应用中的计算资源可行性。
English: Auto-nnU-Net enhances the nnU-Net framework by incorporating hyperparameter optimization and neural architecture search, significantly improving segmentation performance on multiple medical imaging datasets while maintaining practical computational resource usage.

Authors:Ercong Nie, Helmut Schmid, Hinrich Schütze
Title: Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models
Abstract:
Language confusion -- where large language models (LLMs) generate unintended languages against the user's need -- remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) -- specific positions where language switches occur -- are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion while largely preserving general competence and fluency. Our approach matches multilingual alignment in confusion reduction for many languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling. Code and data are available at: https://github.com/ercong21/lang_confusion.
中文摘要:本研究通过机制可解释性分析发现,大语言模型的语言混淆现象源于最终层的转换故障,并证明针对性神经元编辑能在保持模型性能的同时有效缓解该问题。
English Summary: This study uses mechanistic interpretability to identify that language confusion in LLMs stems from transition failures in final layers, demonstrating targeted neuron editing effectively mitigates the issue while maintaining model performance.

Authors:Yuliang Yan, Haochun Tang, Shuo Yan, Enyan Dai
Title: DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection
Abstract:
Large language models (LLMs) are considered valuable intellectual property (IP) for legitimate owners due to the enormous computational cost of training. It is crucial to protect the IP of LLMs from malicious stealing or unauthorized deployment. Despite existing efforts in watermarking and fingerprinting LLMs, these methods either impact the text generation process or are limited to white-box access to the suspect model, making them impractical. Hence, we propose DuFFin, a novel Dual-Level Fingerprinting Framework for black-box setting ownership verification. DuFFin extracts the trigger pattern and the knowledge-level fingerprints to identify the source of a suspect model. We conduct experiments on a variety of models collected from the open-source website, including four popular base models as protected LLMs and their fine-tuning, quantization, and safety alignment versions, which are released by large companies, start-ups, and individual users. Results show that our method can accurately verify the copyright of the base protected LLM on their model variants, achieving an IP-ROC metric greater than 0.95. Our code is available at https://github.com/yuliangyan0807/llm-fingerprint.
中文摘要:提出的DuFFin框架通过双重指纹识别技术,能在黑盒设置下精确验证大语言模型的所有权,在各类模型变体上实现了超过0.95的IP-ROC指标。
English Summary: The proposed DuFFin framework uses dual-level fingerprints to accurately verify the ownership of large language models in black-box settings, achieving high IP-ROC scores above 0.95 across various model variants.

Authors:Lingfeng Wang, Hualing Lin, Senda Chen, Tao Wang, Changxu Cheng, Yangyang Zhong, Dong Zheng, Wuyue Zhao
Title: ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation
Abstract:
While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive length tokenizer for autoregressive mask generation. To achieve this, a novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences on the trade-offs between mask quality and efficiency are implemented by group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at https://github.com/yayafengzi/ALToLLM.
中文: ALTo模型通过创新的长度预测器和优化策略,实现了自适应分词,在保持高效的同时,在主流分割基准上达到了最优性能。
English: The proposed ALTo model introduces an adaptive tokenizer with a novel length predictor and optimization strategy, enabling state-of-the-art segmentation performance while dynamically adjusting token usage for efficiency.

Authors:Jisu Han, Jaemin Na, Wonjun Hwang
Title: Ranked Entropy Minimization for Continual Test-Time Adaptation
Abstract:
Test-time adaptation aims to adapt to realistic environments in an online manner by learning during test time. Entropy minimization has emerged as a principal strategy for test-time adaptation due to its efficiency and adaptability. Nevertheless, it remains underexplored in continual test-time adaptation, where stability is more important. We observe that the entropy minimization method often suffers from model collapse, where the model converges to predicting a single class for all images due to a trivial solution. We propose ranked entropy minimization to mitigate the stability problem of the entropy minimization method and extend its applicability to continuous scenarios. Our approach explicitly structures the prediction difficulty through a progressive masking strategy. Specifically, it gradually aligns the model's probability distributions across different levels of prediction difficulty while preserving the rank order of entropy. The proposed method is extensively evaluated across various benchmarks, demonstrating its effectiveness through empirical results. Our code is available at https://github.com/pilsHan/rem
中文: 本研究提出排序熵最小化方法,通过渐进式结构化预测难度并保持熵的排序,有效防止持续测试时适应中的模型崩溃,在多个基准测试中验证了其有效性。
English: The study introduces ranked entropy minimization to prevent model collapse in continual test-time adaptation by progressively structuring prediction difficulty while maintaining entropy rank order, demonstrating effectiveness across benchmarks.
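The collapse problem described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: it only computes the mean prediction entropy that plain entropy minimization drives down, and the helper names (`softmax`, `entropy`, `mean_entropy`) and the toy logits are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(batch_logits):
    """Average prediction entropy over a batch of logit vectors."""
    return sum(entropy(softmax(l)) for l in batch_logits) / len(batch_logits)

# Hypothetical collapsed model: class 0 for every input, with near-certainty
collapsed = [[50.0, 0.0, 0.0], [50.0, 0.0, 0.0]]
# Hypothetical healthy model: confident, but the predicted class varies
diverse = [[50.0, 0.0, 0.0], [0.0, 50.0, 0.0]]

print(mean_entropy(collapsed), mean_entropy(diverse))
```

Both batches reach almost the same near-zero mean entropy, so the objective alone cannot tell a collapsed model from a confident, well-behaved one. This is the trivial solution the abstract refers to.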

Authors:Song Jin, Juntian Zhang, Yuhan Liu, Xun Zhang, Yufei Zhang, Guojun Yin, Fei Jiang, Wei Lin, Rui Yan
Title: Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems
Abstract:
Evaluating and iterating upon recommender systems is crucial, yet traditional A/B testing is resource-intensive, and offline methods struggle with dynamic user-platform interactions. While agent-based simulation is promising, existing platforms often lack a mechanism for user actions to dynamically reshape the environment. To bridge this gap, we introduce RecInter, a novel agent-based simulation platform for recommender systems featuring a robust interaction mechanism. In the RecInter platform, simulated user actions (e.g., likes, reviews, purchases) dynamically update item attributes in real time, and newly introduced Merchant Agents can reply, fostering a more realistic and evolving ecosystem. High-fidelity simulation is ensured through a Multidimensional User Profiling module, an Advanced Agent Architecture, and an LLM fine-tuned on Chain-of-Thought (CoT) enriched interaction data. Our platform achieves significantly improved simulation credibility and successfully replicates emergent phenomena like Brand Loyalty and the Matthew Effect. Experiments demonstrate that this interaction mechanism is pivotal for simulating realistic system evolution, establishing our platform as a credible testbed for recommender systems research. Our code is available at https://github.com/jinsong8/RecInter.
中文: 本文提出了RecInter这一创新的基于代理的模拟平台,通过实时动态用户交互重塑推荐系统环境,显著提升了模拟可信度并成功复现了品牌忠诚度等关键涌现现象。
English: This paper introduces RecInter, an innovative agent-based simulation platform that enhances recommender system testing by enabling dynamic user interactions to reshape the environment in real-time, significantly improving simulation credibility and replicating key emergent phenomena.

Authors:Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, Lihong Li
Title: WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
Abstract:
While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and Llama-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.
中文: WebAgent-R1是一种高效的端到端多轮强化学习框架,通过基于任务成功的二元奖励从在线交互中学习,显著提升了网络代理的任务成功率,在基准测试中优于现有最先进方法。
English: WebAgent-R1 is an effective end-to-end multi-turn reinforcement learning framework that significantly boosts web agents' task success rates by learning from online interactions with binary rewards, outperforming state-of-the-art methods on benchmarks.

Authors:Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han
Title: Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
Abstract:
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. This issue arises because image tokens representing the same content but located at different spatial positions are assigned distinct positional biases, leading to inconsistent cross-modal associations. To address this, we propose Per-Token Distance (PTD) - a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure. This configuration ensures that each text token maintains an equal distance to all image tokens, reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered layer strategy that applies different RoPE variants across layers. This design leverages the complementary strengths of each RoPE variant, thereby enhancing the model's overall performance. Our experimental results demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for LVLMs. The code is available at https://github.com/lose4578/CircleRoPE.
中文: 该摘要提出Circle-RoPE方法,通过将图像令牌投影到与文本令牌线性轴正交的环形结构上,形成锥形编码空间来消除视觉语言模型中的跨模态位置偏差,同时采用交错策略在不同层应用不同RoPE变体以保持图像空间信息。
English: The abstract introduces Circle-RoPE, a novel positional encoding method that projects image tokens onto a ring orthogonal to text tokens' linear axis to eliminate cross-modal biases in vision-language models, while preserving spatial information through a cone-like structure and staggered RoPE application across layers.

Authors:Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han
Title: Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
Abstract:
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to vision-language models (VLMs), RoPE and its variants enforce relative positional dependencies separately within text and image tokens, introducing unintended cross-modal positional biases. For example, image tokens depicting semantically consistent content are assigned distinct positional encodings solely due to spatial location variations. As a result, such tokens exhibit entirely different relative positional relationships with their corresponding text tokens, ultimately leading to misaligned cross-modal representations. To address this, we propose Per-Token Distance, a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme designed to eliminate spurious cross-modal biases. Our key idea is to project image token indices onto a ring that is orthogonal to the linear axis of text token indices, thereby forming a cone-like structure in the positional encoding space. In this configuration, each text token (point on the linear text axis) becomes the apex of a cone and maintains an equal distance to all image tokens (points on the circular image ring), reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered strategy that applies different RoPE variants across layers. Extensive experiments demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for VLMs. The code is available at https://github.com/lose4578/CircleRoPE.
中文: 该摘要提出Circle-RoPE方法,通过将图像令牌投影到与文本令牌线性轴正交的环形结构上,形成锥形编码空间来消除视觉语言模型中的跨模态位置偏差,同时采用交错策略在不同层应用不同RoPE变体以保持图像空间信息。
English: The abstract introduces Circle-RoPE, a novel positional encoding method that projects image tokens onto a ring orthogonal to text tokens' linear axis to eliminate cross-modal biases in vision-language models, while preserving spatial information through a cone-like structure and staggered RoPE application across layers.
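The equidistance property claimed for the cone-like layout can be checked with a small geometric sketch. It is only a 3D illustration under stated assumptions: the text axis is taken to be the z-axis, the image ring lies in a plane perpendicular to it, and the function names, radius, and plane height are hypothetical. How these coordinates feed into the actual rotary embedding is not specified in the abstract and is not modeled here.

```python
import math

def text_position(t):
    """Place text token index t on the linear axis (assumed here to be the z-axis)."""
    return (0.0, 0.0, float(t))

def image_ring_positions(num_tokens, radius=1.0, plane_z=0.0):
    """Spread image tokens evenly on a circle in a plane orthogonal to the text axis."""
    positions = []
    for k in range(num_tokens):
        angle = 2.0 * math.pi * k / num_tokens
        positions.append((radius * math.cos(angle), radius * math.sin(angle), plane_z))
    return positions

def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Eight image tokens on a ring of radius 2; one text token at index 5
ring = image_ring_positions(8, radius=2.0)
dists = [distance(text_position(5), p) for p in ring]
print(dists)
```

Every entry of `dists` equals sqrt(radius^2 + (t - plane_z)^2), so the text token is the apex of a cone whose base is the image ring, while distances between image tokens on the ring still differ and so retain intra-image layout.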

Authors:Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz
Title: Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models. Our code is available at https://github.com/ruizheliUOA/ARC_JSD
English Summary: This paper introduces ARC-JSD, a novel method that uses Jensen-Shannon Divergence to efficiently attribute generated responses to specific context segments in Retrieval-Augmented Generation systems, eliminating the need for fine-tuning while demonstrating superior accuracy and computational efficiency across multiple benchmarks.

Authors:Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz
Title: Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning, gradient-calculation or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models and how they affect RAG behaviours. Our code is available at https://github.com/ruizheliUOA/ARC_JSD.
English Summary: This paper introduces ARC-JSD, a novel method that uses Jensen-Shannon Divergence to efficiently attribute generated responses to specific context segments in Retrieval-Augmented Generation systems, eliminating the need for fine-tuning while demonstrating superior accuracy and computational efficiency across multiple benchmarks.
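For readers unfamiliar with the divergence named in this entry, the following is a minimal sketch of Jensen-Shannon divergence and one way it could rank context sentences. The ranking step assumes a leave-one-out comparison (output distribution with the full context versus with one sentence removed); the abstract does not spell out this procedure, and `rank_sentences` and the toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2) in nats."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def rank_sentences(full_dist, ablated_dists):
    """Order sentence indices by how much removing each one shifts the output distribution."""
    scores = [js_divergence(full_dist, d) for d in ablated_dists]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical next-token distributions over a 3-token vocabulary
full = [0.7, 0.2, 0.1]
ablated = [
    [0.68, 0.22, 0.10],  # removing sentence 0 barely changes the output
    [0.10, 0.30, 0.60],  # removing sentence 1 changes it a lot
]
print(rank_sentences(full, ablated))
```

Only forward passes are needed to obtain such distributions, which is consistent with the abstract's claim that no fine-tuning or surrogate model is required.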

Authors:Sreetama Sarkar, Yue Che, Alex Gavin, Peter A. Beerel, Souvik Kundu
Title: Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression
Abstract:
Despite their remarkable progress in multimodal understanding tasks, large vision language models (LVLMs) often suffer from "hallucinations", generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference, without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at https://github.com/YUECHE77/SPIN.
Chinese: SPIN是一种新颖的注意力引导头抑制方法,通过在推理过程中选择性抑制低注意力头来减少大型视觉语言模型的幻觉现象,在实现幻觉分数降低2.7倍的同时保持F1分数,并将吞吐量提升1.8倍且不增加延迟。
English: SPIN is a novel attention-guided head suppression method that reduces hallucinations in large vision-language models by selectively suppressing low-attention heads during inference, achieving up to 2.7x lower hallucination scores with 1.8x higher throughput and no latency overhead.
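The head-selection step described in this abstract can be sketched as follows. This is a simplified illustration for a single query token, not the released implementation: the function name `suppress_heads`, the input layout, and the suppression `factor` are assumptions.

```python
def suppress_heads(head_attn, image_idx, top_k, factor=0.0):
    """Return one scale per head for a single query token.

    head_attn: per-head attention rows, each summing to 1 over key tokens.
    image_idx: key positions that correspond to image tokens.
    top_k: number of heads to keep intact.
    factor: scale for suppressed heads (hypothetical; 0.0 masks them fully).
    """
    # Attention mass each head places on image tokens
    image_mass = [sum(row[i] for i in image_idx) for row in head_attn]
    # Heads ordered from most to least image-focused
    order = sorted(range(len(head_attn)), key=lambda h: image_mass[h], reverse=True)
    keep = set(order[:top_k])
    return [1.0 if h in keep else factor for h in range(len(head_attn))]

# Three heads, four key tokens; keys 0 and 1 are image tokens
attn = [
    [0.4, 0.4, 0.1, 0.1],    # head 0: mostly attends to the image
    [0.05, 0.05, 0.5, 0.4],  # head 1: mostly attends to text
    [0.3, 0.2, 0.3, 0.2],    # head 2: mixed
]
scales = suppress_heads(attn, image_idx=[0, 1], top_k=2)
print(scales)
```

Because the selection is recomputed per query token, the set of suppressed heads is dynamic, matching the abstract's observation that hallucination is tied to a changing subset of heads in each layer.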

Authors:Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, Ji-Rong Wen
Title: Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
Abstract:
Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging the RL algorithm to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at https://github.com/dongguanting/Tool-Star.
中文:Tool-Star是一个基于强化学习的框架,通过两阶段训练和分层奖励设计,使大语言模型能够在推理过程中自主调用多种外部工具,在多项基准测试中展现出卓越性能。
English: Tool-Star is a reinforcement learning framework that enables large language models to autonomously use multiple external tools during reasoning through a two-stage training process and hierarchical reward design, demonstrating superior performance across various benchmarks.

Authors:Muhammad Farid Adilazuarda, Chen Cecilia Liu, Iryna Gurevych, Alham Fikri Aji
Title: From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs
Abstract:
Adapting cultural values in Large Language Models (LLMs) presents significant challenges, particularly due to biases and limited training data. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. However, it remains unclear whether this approach effectively captures cultural nuances or produces distinct cultural representations for various downstream tasks. In this paper, we systematically investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge. To investigate these issues, we augment WVS with encyclopedic and scenario-based cultural narratives from Wikipedia and NormAd. While these narratives may have variable effects on downstream tasks, they consistently improve cultural distinctiveness more than survey data alone. Our work highlights the inherent complexity of aligning cultural values with the goal of guiding task-specific behavior. We release our code at https://github.com/faridlazuarda/from-surveys-to-narratives.
中文摘要:本研究发现仅依赖世界价值观调查数据进行大语言模型的文化适应会简化文化规范并损害事实准确性,但通过维基百科和NormAd的文化叙事进行补充后,尽管对下游任务影响不一,却显著提升了文化独特性。
English Summary: This study reveals that relying solely on World Values Survey data for cultural adaptation in LLMs can oversimplify cultural norms and impair factual accuracy, but augmenting with cultural narratives from Wikipedia and NormAd enhances cultural distinctiveness despite variable task impacts.

Authors:Huazi Pan, Yanjun Zhang, Leo Yu Zhang, Scott Adams, Abbas Kouzani, Suiyang Khoo
Title: Performance Guaranteed Poisoning Attacks in Federated Learning: A Sliding Mode Approach
Abstract:
Manipulation of local training data and local updates, i.e., the poisoning attack, is the main threat arising from the collaborative nature of the federated learning (FL) paradigm. Most existing poisoning attacks aim to manipulate local data/models in a way that causes denial-of-service (DoS) issues. In this paper, we introduce a novel attack method, named the Federated Learning Sliding Attack (FedSA) scheme, which aims to introduce a precisely controlled extent of poisoning in a subtle manner. It operates with a predefined objective, such as reducing the global model's prediction accuracy by 10%. FedSA integrates robust nonlinear control theory, specifically Sliding Mode Control (SMC), with model poisoning attacks. It can manipulate the updates from malicious clients to drive the global model towards a compromised state, achieving this at a controlled and inconspicuous rate. Additionally, leveraging the robust control properties of FedSA allows precise control over the convergence bounds, enabling the attacker to set the global accuracy of the poisoned model to any desired level. Experimental results demonstrate that FedSA can accurately achieve a predefined global accuracy with fewer malicious clients while maintaining a high level of stealth and adjustable learning rates.
Chinese: 联邦学习滑动攻击(FedSA)是一种新型投毒方法,它利用滑模控制理论精细操控恶意客户端更新,以隐蔽方式将全局模型准确率精准降低至预设水平,且只需较少恶意客户端即可实现。
English: The Federated Learning Sliding Attack (FedSA) is a novel poisoning method that subtly manipulates malicious client updates using Sliding Mode Control to precisely degrade the global model's accuracy to a predefined level while remaining stealthy and requiring fewer malicious clients.

Authors:Yuanhao Huang, Yilong Ren, Jinlei Wang, Lujia Huo, Xuesong Bai, Jinchuan Zhang, Haiyan Yu
Title: AdvReal: Physical Adversarial Patch Generation Framework for Security Evaluation of Object Detection Systems
Abstract:
Autonomous vehicles are typical complex intelligent systems with artificial intelligence at their core. However, perception methods based on deep learning are extremely vulnerable to adversarial samples, resulting in security accidents. How to generate effective adversarial examples in the physical world and evaluate object detection systems is a huge challenge. In this study, we propose a unified joint adversarial training framework for both 2D and 3D domains, which simultaneously optimizes texture maps in 2D image and 3D mesh spaces to better address intra-class diversity and real-world environmental variations. The framework includes a novel realistic enhanced adversarial module, with a time-space and relighting mapping pipeline that adjusts illumination consistency between adversarial patches and target garments under varied viewpoints. Building upon this, we develop a realism enhancement mechanism that incorporates non-rigid deformation modeling and texture remapping to ensure alignment with the human body's non-rigid surfaces in 3D scenes. Extensive experimental results in digital and physical environments demonstrate that the adversarial textures generated by our method can effectively mislead the target detection model. Specifically, our method achieves an average attack success rate (ASR) of 70.13% on YOLOv12 in physical scenarios, significantly outperforming existing methods such as T-SEA (21.65%) and AdvTexture (19.70%). Moreover, the proposed method maintains stable ASR across multiple viewpoints and distances, with an average attack success rate exceeding 90% under both frontal and oblique views at a distance of 4 meters. This confirms the method's strong robustness and transferability under multi-angle attacks, varying lighting conditions, and real-world distances. The demo video and code can be obtained at https://github.com/Huangyh98/AdvReal.git.
中文: 本研究提出了一种统一的联合对抗训练框架,通过优化2D和3D空间中的纹理映射,结合光照一致性调整和非刚性变形建模,在物理环境中有效误导目标检测模型,并在多视角和距离下保持稳定的高攻击成功率。
English: This study introduces a unified adversarial training framework that generates robust 2D and 3D adversarial textures, achieving high attack success rates against object detection models in real-world scenarios through illumination-consistent patches and non-rigid deformation modeling.

Authors:Qian Deng, Le Hui, Jin Xie, Jian Yang
Title: Sketchy Bounding-box Supervision for 3D Instance Segmentation
Abstract:
Bounding box supervision has gained considerable attention in weakly supervised 3D instance segmentation. While this approach alleviates the need for extensive point-level annotations, obtaining accurate bounding boxes in practical applications remains challenging. To this end, we explore inaccurate bounding boxes, termed sketchy bounding boxes, which are simulated by perturbing ground-truth bounding boxes with scaling, translation, and rotation. In this paper, we propose Sketchy-3DIS, a novel weakly supervised 3D instance segmentation framework that jointly learns a pseudo labeler and a segmentator to improve performance under sketchy bounding-box supervision. Specifically, we first propose an adaptive box-to-point pseudo labeler that adaptively learns to assign points located in the overlapped parts between two sketchy bounding boxes to the correct instance, resulting in compact and pure pseudo instance labels. Then, we present a coarse-to-fine instance segmentator that first predicts coarse instances from the entire point cloud and then learns fine instances based on the region of coarse instances. Finally, by using the pseudo instance labels to supervise the instance segmentator, we can gradually generate high-quality instances through joint training. Extensive experiments show that our method achieves state-of-the-art performance on both the ScanNetV2 and S3DIS benchmarks, and even outperforms several fully supervised methods while using only sketchy bounding boxes. Code is available at https://github.com/dengq7/Sketchy-3DIS.
Chinese: 提出的Sketchy-3DIS框架通过联合训练伪标签生成器和分割器来处理3D实例分割中不精确的边界框监督,在主要基准测试中实现了最先进的性能。
English: The proposed Sketchy-3DIS framework addresses inaccurate bounding box supervision in 3D instance segmentation by jointly training a pseudo labeler and segmentator to generate high-quality instances, achieving state-of-the-art performance on major benchmarks.
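The way a sketchy bounding box is derived from a ground-truth box (scaling, translation, rotation) can be sketched as below. The box parameterization `(cx, cy, cz, dx, dy, dz, yaw)`, the uniform noise, and the perturbation ranges are all assumptions for illustration; the abstract does not give the actual noise model or magnitudes.

```python
import random

def make_sketchy_box(box, rng, max_scale=0.1, max_shift=0.1, max_rot=0.1):
    """Perturb a box given as (cx, cy, cz, dx, dy, dz, yaw).

    max_scale: maximum relative change of each side length (hypothetical).
    max_shift: maximum center shift as a fraction of each side length (hypothetical).
    max_rot: maximum yaw change in radians (hypothetical).
    """
    cx, cy, cz, dx, dy, dz, yaw = box
    # Scaling: each side length changes by a random relative amount
    dx, dy, dz = (d * (1.0 + rng.uniform(-max_scale, max_scale)) for d in (dx, dy, dz))
    # Translation: the center moves by a fraction of the scaled side lengths
    cx += dx * rng.uniform(-max_shift, max_shift)
    cy += dy * rng.uniform(-max_shift, max_shift)
    cz += dz * rng.uniform(-max_shift, max_shift)
    # Rotation: a small random change of the heading angle
    yaw += rng.uniform(-max_rot, max_rot)
    return (cx, cy, cz, dx, dy, dz, yaw)

rng = random.Random(0)  # seeded so that runs are repeatable
gt_box = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0)
sketchy = make_sketchy_box(gt_box, rng)
print(sketchy)
```

Passing an explicit `random.Random` instance keeps the perturbation reproducible, which is convenient when comparing methods under the same noisy annotations.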

Authors:Zhixun Li, Bin Cao, Rui Jiao, Liang Wang, Ding Wang, Yang Liu, Dingshuo Chen, Jia Li, Qiang Liu, Yu Rong, Liang Wang, Tong-yi Zhang, Jeffrey Xu Yu
Title: Materials Generation in the Era of Artificial Intelligence: A Comprehensive Survey
Abstract:
Materials are the foundation of modern society, underpinning advancements in energy, electronics, healthcare, transportation, and infrastructure. The ability to discover and design new materials with tailored properties is critical to solving some of the most pressing global challenges. In recent years, the growing availability of high-quality materials data combined with rapid advances in Artificial Intelligence (AI) has opened new opportunities for accelerating materials discovery. Data-driven generative models provide a powerful tool for materials design by directly creating novel materials that satisfy predefined property requirements. Despite the proliferation of related work, there remains a notable lack of up-to-date and systematic surveys in this area. To fill this gap, this paper provides a comprehensive overview of recent progress in AI-driven materials generation. We first organize various types of materials and illustrate multiple representations of crystalline materials. We then provide a detailed summary and taxonomy of current AI-driven materials generation approaches. Furthermore, we discuss the common evaluation metrics and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future directions and challenges in this fast-growing field. The related sources can be found at https://github.com/ZhixunLEE/Awesome-AI-for-Materials-Generation.
Chinese: 人工智能与材料科学的结合通过数据驱动的生成模型,加速了新材料的发现与设计,为解决各领域的关键全球挑战提供了强大工具。
English: The integration of artificial intelligence with materials science enables accelerated discovery and design of novel materials through data-driven generative models, addressing critical global challenges across various sectors.

Authors:Zijia Lu, A S M Iftekhar, Gaurav Mittal, Tianjian Meng, Xiawei Wang, Cheng Zhao, Rohith Kukkala, Ehsan Elhamifar, Mei Chen
Title: DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos
Abstract:
Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. The approach taken by existing methods of dividing video into clips and processing each clip via a full-scale expert encoder is challenging to scale due to the prohibitive computational cost of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computational efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from sidekick and expert encoders that exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state-of-the-art for LVTG in terms of both efficiency and performance. Our code is available at https://github.com/ZijiaLewisLu/CVPR2025-DeCafNet.
中文: DeCafNet采用“委托与攻克”策略,通过辅助编码器高效处理视频并筛选关键片段,再由专家编码器精炼,在长视频时序定位中实现高达47%的计算量削减,同时创下性能新纪录。
English: DeCafNet introduces a "delegate-and-conquer" strategy using a sidekick encoder for efficient video processing and an expert encoder for refining key clips, achieving up to 47% computation reduction while setting new state-of-the-art performance in long video temporal grounding.

Authors:Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Thomas Oberlin
Title: Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation
Abstract:
Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet recent foundation models can generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models trained on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: https://github.com/echigot/cactif.
中文: 本文提出CACTI和CACTIF两种基于扩散模型的语义一致性风格迁移方法,通过类别自适应归一化和注意力过滤机制,在保持语义边界和结构连贯性的同时有效缩小合成数据与真实数据的领域差距,仅需少量真实数据即可显著提升视觉模型性能。
English: This paper introduces CACTI and CACTIF, two diffusion-based techniques that perform semantically consistent style transfer to bridge the synthetic-to-real domain gap, improving vision model performance with minimal real data by preserving semantic boundaries and structural coherence.
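
The class-wise normalization described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes one scalar feature per position (real AdaIN works per channel on feature maps), and the function name `classwise_adain` and its arguments are hypothetical. For each semantic class, the content features of that class are re-scaled to the mean and standard deviation of the style features of the same class.

```python
from statistics import mean, pstdev
from typing import List


def classwise_adain(content: List[float], content_labels: List[int],
                    style: List[float], style_labels: List[int],
                    eps: float = 1e-5) -> List[float]:
    """Match per-class mean/std of content features to the style features.

    Simplified assumption: one scalar feature per position.
    """
    out = list(content)
    for cls in set(content_labels):
        c_idx = [i for i, lab in enumerate(content_labels) if lab == cls]
        s_vals = [v for v, lab in zip(style, style_labels) if lab == cls]
        if not s_vals:
            continue  # class absent from the style image: leave content as is
        c_vals = [content[i] for i in c_idx]
        c_mu, c_sd = mean(c_vals), pstdev(c_vals)
        s_mu, s_sd = mean(s_vals), pstdev(s_vals)
        for i in c_idx:
            # Standardize within the class, then apply the style statistics.
            out[i] = (content[i] - c_mu) / (c_sd + eps) * s_sd + s_mu
    return out


content = [0.0, 2.0, 10.0, 14.0]
labels = [0, 0, 1, 1]
style = [5.0, 7.0, 100.0, 100.0]
stylized = classwise_adain(content, labels, style, labels)
print(stylized)
```

Restricting the statistics to one class at a time is what keeps, say, road statistics from leaking into sky regions, which a single global normalization would not prevent.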

Authors:Pierre Achkar, Tim Gollub, Martin Potthast
Title: Ask, Retrieve, Summarize: A Modular Pipeline for Scientific Literature Summarization
Abstract:
The exponential growth of scientific publications has made it increasingly difficult for researchers to stay updated and synthesize knowledge effectively. This paper presents XSum, a modular pipeline for multi-document summarization (MDS) in the scientific domain using Retrieval-Augmented Generation (RAG). The pipeline includes two core components: a question-generation module and an editor module. The question-generation module dynamically generates questions adapted to the input papers, ensuring the retrieval of relevant and accurate information. The editor module synthesizes the retrieved content into coherent and well-structured summaries that adhere to academic standards for proper citation. Evaluated on the SurveySum dataset, XSum demonstrates strong performance, achieving considerable improvements in metrics such as CheckEval, G-Eval and Ref-F1 compared to existing approaches. This work provides a transparent, adaptable framework for scientific summarization with potential applications in a wide range of domains. Code available at https://github.com/webis-de/scolia25-xsum
中文:本文提出XSum,一个基于检索增强生成的科学文献多文档摘要模块化流程,通过动态生成问题并整合检索信息形成连贯摘要,在基准评估中表现出优越性能。
English: This paper introduces XSum, a modular pipeline for scientific multi-document summarization using Retrieval-Augmented Generation, which dynamically generates questions and synthesizes retrieved information into coherent summaries, showing strong performance on benchmark evaluations.

Authors:Wenqing Wu, Chengzhi Zhang, Tong Bao, Yi Zhao
Title: SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers
Abstract:
Novelty is a core component of academic papers, and there are multiple perspectives on the assessment of novelty. Existing methods often focus on word or entity combinations, which provide limited insights. The content related to a paper's novelty is typically distributed across different core sections, e.g., Introduction, Methodology and Results. Therefore, exploring the optimal combination of sections for evaluating the novelty of a paper is important for advancing automated novelty assessment. In this paper, we utilize different combinations of sections from academic papers as inputs to drive language models to predict novelty scores. We then analyze the results to determine the optimal section combinations for novelty score prediction. We first employ natural language processing techniques to identify the sectional structure of academic papers, categorizing them into introduction, methods, results, and discussion (IMRaD). Subsequently, we use different combinations of these sections (e.g., introduction and methods) as inputs for pretrained language models (PLMs) and large language models (LLMs), employing novelty scores provided by human expert reviewers as ground truth labels to obtain prediction results. The results indicate that using introduction, results and discussion is most appropriate for assessing the novelty of a paper, while the use of the entire text does not yield significant results. Furthermore, based on the results of the PLMs and LLMs, the introduction and results appear to be the most important sections for the task of novelty score prediction. The code and dataset for this paper can be accessed at https://github.com/njust-winchy/SC4ANM.
中文摘要:本研究通过语言模型探索学术论文中预测新颖性评分的最佳章节组合,发现引言、结果和讨论部分评估效果最佳,而全文分析效果不显著。
English Summary: This study explores optimal section combinations in academic papers for predicting novelty scores using language models, finding that introduction, results, and discussion sections yield the most effective assessment while full-text analysis proves less impactful.

Authors:Qian Tan, Dongzhan Zhou, Peng Xia, Wanhao Liu, Wanli Ouyang, Lei Bai, Yuqiang Li, Tianfan Fu
Title: ChemMLLM: Chemical Multimodal Large Language Model
Abstract:
Multimodal large language models (MLLMs) have made impressive progress in many applications in recent years. However, chemical MLLMs that can handle cross-modal understanding and generation remain underexplored. To fill this gap, we propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation. We also design five multimodal tasks across text, molecular SMILES strings, and images, and curate the datasets. We benchmark ChemMLLM against a range of leading general MLLMs and chemical LLMs on these tasks. Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks. For example, in the molecule image optimization task, ChemMLLM outperforms the best baseline (GPT-4o) by 116.75% (4.27 vs 1.97 property improvement). The code is publicly available at https://github.com/bbsbz/ChemMLLM.git.
Chinese: ChemMLLM是一种创新的化学多模态大语言模型,专为分子理解和生成而设计,在多项任务中表现卓越,如在分子图像优化任务上比GPT-4o提升了116.75%,代码已公开。
English: ChemMLLM is a novel multimodal large language model designed for chemical applications that excels in molecule understanding and generation, significantly outperforming existing models across multiple tasks, such as achieving a 116.75% improvement in molecule image optimization over GPT-4o.
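
The 116.75% figure follows from the two property-improvement values quoted in the abstract (4.27 vs 1.97) when read as a relative gain over the baseline. A short check, where the helper name `relative_improvement` is ours and not from the paper:

```python
def relative_improvement(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# Values quoted in the abstract: ChemMLLM 4.27 vs GPT-4o 1.97.
gain = relative_improvement(4.27, 1.97)
print(f"{gain:.2f}%")  # prints "116.75%"
```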

Authors:Jie Zhao, Xin Chen, Yongsheng Yuan, Michael Felsberg, Dong Wang, Huchuan Lu
Title: Efficient Motion Prompt Learning for Robust Visual Tracking
Abstract:
Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Code is available at https://github.com/zj5559/Motion-Prompt-Tracking.
中文: 本文提出了一种轻量级即插即用的运动提示跟踪方法,通过高效的提示学习将运动线索融入现有视觉跟踪器,以极低的训练成本和速度代价显著提升了多个挑战性基准上的跟踪鲁棒性。
English: This paper introduces a lightweight, plug-and-play motion prompt tracking method that enhances vision-based trackers by integrating motion cues through efficient prompt learning, significantly improving robustness across multiple benchmarks with minimal training cost and speed impact.

Authors:Yangyang Wang, Jiawei Gu, Li Long, Xin Li, Li Shen, Zhouyu Fu, Xiangjun Zhou, Xu Jiang
Title: FreshRetailNet-50K: A Stockout-Annotated Censored Demand Dataset for Latent Demand Recovery and Forecasting in Fresh Retail
Abstract:
Accurate demand estimation is critical for the retail business in guiding the inventory and pricing policies of perishable products. However, it faces fundamental challenges from censored sales data during stockouts, where unobserved demand creates systemic policy biases. Existing datasets lack the temporal resolution and annotations needed to address this censoring effect. To fill this gap, we present FreshRetailNet-50K, the first large-scale benchmark for censored demand estimation. It comprises 50,000 store-product time series of detailed hourly sales data from 898 stores in 18 major cities, encompassing 863 perishable SKUs meticulously annotated for stockout events. The hourly stock status records unique to this dataset, combined with rich contextual covariates, including promotional discounts, precipitation, and temporal features, enable innovative research beyond existing solutions. We demonstrate one such use case of two-stage demand modeling: first, we reconstruct the latent demand during stockouts using precise hourly annotations. We then leverage the recovered demand to train robust demand forecasting models in the second stage. Experimental results show that this approach achieves a 2.73% improvement in prediction accuracy while reducing the systematic demand underestimation from 7.37% to near-zero bias. With unprecedented temporal granularity and comprehensive real-world information, FreshRetailNet-50K opens new research directions in demand imputation, perishable inventory optimization, and causal retail analytics. The unique annotation quality and scale of the dataset address long-standing limitations in retail AI, providing immediate solutions and a platform for future methodological innovation. The data (https://huggingface.co/datasets/Dingdong-Inc/FreshRetailNet-50K) and code (https://github.com/Dingdong-Inc/frn-50k-baseline) are openly released.
中文摘要:FreshRetailNet-50K作为首个包含5万条时序数据的大规模基准数据集,通过精确标注缺货事件实现两阶段需求建模,将系统性需求低估从7.37%降至接近零偏差,同时提升预测精度2.73%。
English Summary: FreshRetailNet-50K introduces the first large-scale benchmark with 50,000 hourly sales series and precise stockout annotations, enabling two-stage demand modeling that reduces prediction bias from 7.37% to near-zero while improving accuracy by 2.73%.

Authors:Arjhun Swaminathan, Mete Akgün
Title: Accelerating Targeted Hard-Label Adversarial Attacks in Low-Query Black-Box Settings
Abstract:
Deep neural networks for image classification remain vulnerable to adversarial examples -- small, imperceptible perturbations that induce misclassifications. In black-box settings, where only the final prediction is accessible, crafting targeted attacks that aim to misclassify into a specific target class is particularly challenging due to narrow decision regions. Current state-of-the-art methods often exploit the geometric properties of the decision boundary separating a source image and a target image rather than incorporating information from the images themselves. In contrast, we propose Targeted Edge-informed Attack (TEA), a novel attack that utilizes edge information from the target image to carefully perturb it, thereby producing an adversarial image that is closer to the source image while still achieving the desired target classification. Our approach consistently outperforms current state-of-the-art methods across different models in low query settings (nearly 70% fewer queries are used), a scenario especially relevant in real-world applications with limited queries and black-box access. Furthermore, by efficiently generating a suitable adversarial example, TEA provides an improved target initialization for established geometry-based attacks.
Chinese: 针对图像分类的深度神经网络易受对抗样本攻击,本文提出的目标边缘信息攻击(TEA)通过利用目标图像边缘信息,在黑盒设置中高效生成对抗样本,显著减少查询次数并提升误分类效果。
English: Deep neural networks for image classification are vulnerable to adversarial examples, and the proposed Targeted Edge-informed Attack (TEA) effectively crafts these in black-box settings by using target image edges to reduce queries and enhance misclassification.

Authors:Jiawei Liu, Qisi Chen, Jianshu Zhang, Quan Liu, Defu Lian
Title: EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning
Abstract:
Large Language Models (LLMs) excel at complex reasoning through search algorithms, yet current strategies often suffer from massive token consumption due to redundant exploration of semantically equivalent steps. Existing semantic similarity methods struggle to accurately identify such equivalence in domain-specific contexts like mathematical reasoning. To address this, we propose EquivPruner, a simple yet effective approach that identifies and prunes semantically equivalent actions during LLM reasoning search. We also introduce MathEquiv, the first dataset we created for mathematical statement equivalence, which enables the training of a lightweight equivalence detector. Extensive experiments across various models and tasks demonstrate that EquivPruner significantly reduces token consumption, improving search efficiency and often bolstering reasoning accuracy. For instance, when applied to Qwen2.5-Math-7B-Instruct on GSM8K, EquivPruner reduced token consumption by 48.1% while also improving accuracy. Our code is available at https://github.com/Lolo1222/EquivPruner.
Chinese Summary: EquivPruner方法通过剪枝语义等价步骤,显著降低了大型语言模型推理中的令牌消耗,在GSM8K任务上实现了48.1%的令牌削减并提升了准确性。
English Summary: The EquivPruner method effectively reduces token consumption in LLM reasoning by pruning semantically equivalent steps, enhancing both efficiency and accuracy, as demonstrated by a 48.1% token reduction on GSM8K.
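
The pruning step can be sketched as a greedy de-duplication of candidate actions under an equivalence predicate. This is a minimal illustration rather than the released code: the paper trains a lightweight equivalence detector, whereas `toy_equivalent` below is a hypothetical stand-in that only ignores case and whitespace, and both function names are assumptions.

```python
from typing import Callable, List


def prune_equivalent(candidates: List[str],
                     is_equivalent: Callable[[str, str], bool]) -> List[str]:
    """Keep the first representative of each group of equivalent candidates."""
    kept: List[str] = []
    for cand in candidates:
        # Drop the candidate if it matches any step we already decided to expand.
        if not any(is_equivalent(cand, k) for k in kept):
            kept.append(cand)
    return kept


def toy_equivalent(a: str, b: str) -> bool:
    """Hypothetical stand-in for a learned detector: ignores case and spaces."""
    def norm(s: str) -> str:
        return "".join(s.lower().split())
    return norm(a) == norm(b)


steps = ["x = 2 + 3", "x=2+3", "X = 2 + 3", "x = 6 - 1"]
pruned = prune_equivalent(steps, toy_equivalent)
print(pruned)  # prints ['x = 2 + 3', 'x = 6 - 1']
```

Only the surviving candidates would be expanded further by the search, which is where the token savings come from. A learned detector could also recognize that the last step is equivalent in value to the first, which the toy predicate cannot.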

Authors:Feng Liu, Lixin Zou, Xiangyu Zhao, Min Tang, Liming Dong, Dan Luo, Xiangyang Luo, Chenliang Li
Title: Flow Matching based Sequential Recommender Model
Abstract:
Generative models, particularly diffusion models, have emerged as powerful tools for sequential recommendation. However, accurately modeling user preferences remains challenging due to the noise perturbations inherent in the forward and reverse processes of diffusion-based methods. To this end, this study introduces FMRec, a Flow Matching based model that employs a straight flow trajectory and a modified loss tailored for the recommendation task. Additionally, from the diffusion-model perspective, we integrate a reconstruction loss to improve robustness against noise perturbations, thereby retaining user preferences during the forward process. In the reverse process, we employ a deterministic reverse sampler, specifically an ODE-based updating function, to eliminate unnecessary randomness, thereby ensuring that the generated recommendations closely align with user needs. Extensive evaluations on four benchmark datasets reveal that FMRec achieves an average improvement of 6.53% over state-of-the-art methods. The replication code is available at https://github.com/FengLiu-1/FMRec.
Chinese: 本研究提出了FMRec,一种基于流匹配的模型,通过采用直线流轨迹和定制损失函数来增强对噪声的鲁棒性,在四个基准数据集上的评估显示其性能比现有最优方法平均提升了6.53%。
English: This study introduces FMRec, a flow matching-based model that enhances sequential recommendation by using a straight flow trajectory and a modified loss to improve robustness against noise, achieving a 6.53% average improvement over state-of-the-art methods.
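
The "straight flow trajectory" and "deterministic ODE-based sampler" can be illustrated with the generic flow matching recipe. This sketch does not reproduce FMRec's modified loss or model: it interpolates linearly between a start point `x0` and a target `x1`, uses `x1 - x0` as the regression target for the velocity, and integrates the ODE with fixed-step Euler updates. The oracle velocity function stands in for a trained network, and all names are assumptions.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Velocity of the straight path; constant in t, equal to x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0: Vector, velocity: Callable[[Vector, float], Vector],
                 steps: int = 10) -> Vector:
    """Deterministic sampler: Euler integration of dx/dt = velocity(x, t)."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


start = [0.0, 0.0]
target = [1.0, -2.0]


def oracle(x: Vector, t: float) -> Vector:
    # Stand-in for a learned velocity network (hypothetical).
    return target_velocity(start, target)


sample = euler_sample(start, oracle, steps=10)
print(sample)
```

Because the sampler draws no random numbers, repeated runs from the same start point give the same output, which is the property the abstract attributes to the ODE-based reverse process.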

Authors:Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, Jiaya Jia
Title: ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay
Abstract:
Training large language models (LLMs) as interactive agents for controlling graphical user interfaces (GUIs) poses the unique challenge of optimizing long-horizon action sequences with multimodal feedback from complex environments. While recent works have advanced multi-turn reinforcement learning (RL) for reasoning and tool-using capabilities in LLMs, their application to GUI-based agents remains relatively underexplored due to the difficulty of sparse rewards, delayed feedback, and high rollout costs. In this paper, we investigate end-to-end policy optimization for vision-language-based GUI agents with the aim of improving performance on complex, long-horizon computer tasks. We propose Agentic Replay Policy Optimization (ARPO), an end-to-end RL approach that augments Group Relative Policy Optimization (GRPO) with a replay buffer to reuse successful experience across training iterations. To further stabilize the training process, we propose a task selection strategy that filters tasks based on baseline agent performance, allowing the agent to focus on learning from informative interactions. Additionally, we compare ARPO with offline preference optimization approaches, highlighting the advantages of policy-based methods in GUI environments. Experiments on the OSWorld benchmark demonstrate that ARPO achieves competitive results, establishing a new performance baseline for LLM-based GUI agents trained via reinforcement learning. Our findings underscore the effectiveness of reinforcement learning for training multi-turn, vision-language GUI agents capable of managing complex real-world UI interactions. Code and models: https://github.com/dvlab-research/ARPO.git.
中文摘要:本文提出Agentic Replay Policy Optimization (ARPO)方法,通过复用成功经验和任务筛选策略增强GUI智能体训练,在OSWorld基准测试中取得了领先性能,为基于强化学习的图形界面智能体建立了新基准。
English Summary: This paper introduces Agentic Replay Policy Optimization (ARPO), an end-to-end reinforcement learning method that enhances GUI agent training by reusing successful experiences and implementing task selection strategies, achieving competitive results on the OSWorld benchmark.
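
One way to see why a replay buffer helps a group-relative method is sketched below. It rests on assumptions rather than the authors' code: advantages are taken to be rewards standardized within a rollout group (the usual GRPO-style normalization), and the buffer policy in `fill_from_replay` (swap one failed rollout for a stored success when the whole group failed) is our illustrative reading of "reuse successful experience". All names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Rollout = Tuple[str, float]  # (trajectory id, scalar reward)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fill_from_replay(task: str, group: List[Rollout],
                     buffer: Dict[str, Rollout]) -> List[Rollout]:
    """Store successes; if every rollout failed, reuse a stored success."""
    if max(r for _, r in group) > 0:
        buffer[task] = max(group, key=lambda g: g[1])
        return group
    if task in buffer:
        # Replace the last failed rollout with the remembered success.
        return group[:-1] + [buffer[task]]
    return group


buffer: Dict[str, Rollout] = {}
fill_from_replay("open-settings", [("a", 0.0), ("b", 1.0)], buffer)

failed = [("c", 0.0), ("d", 0.0)]
no_signal = group_advantages([r for _, r in failed])
mixed = fill_from_replay("open-settings", failed, buffer)
with_signal = group_advantages([r for _, r in mixed])
print(no_signal, with_signal)
```

With sparse rewards, an all-failure group yields all-zero advantages and therefore no gradient signal; mixing in one remembered success restores a non-zero contrast within the group.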

Authors:Shijie Zhang, Renhao Li, Songsheng Wang, Philipp Koehn, Min Yang, Derek F. Wong
Title: HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation
Abstract:
The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments. However, current LLM-based evaluation methods still face challenges in accurately identifying error spans and assessing their severity. In this paper, we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We argue that existing approaches inadequately exploit the fine-grained structural and semantic information within the MQM hierarchy. To address this, we develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Two key strategies are incorporated to further mitigate systemic hallucinations within the framework: the utilization of the model's self-reflection capability and the facilitation of agent discussion involving asymmetric information. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations. Further analyses underscore its significant advantage in error span detection and severity assessment, achieving an average F1-score improvement of 89% over the best-performing baseline. We make our code and data publicly available at https://github.com/nlp2ct-shijie/HiMATE.
中文摘要:HiMATE框架基于MQM错误类型构建分层多智能体系统,通过自我反思和智能体间非对称信息讨论,显著提升了机器翻译评估中错误定位与严重性判定的准确性。
English Summary: The HiMATE framework leverages a hierarchical multi-agent system based on MQM error typology to enhance machine translation evaluation, significantly improving error span detection and severity assessment through self-reflection and agent discussions.

Authors:Sampanna Yashwant Kahu, Naman Ahuja
Title: All You Need is "Leet": Evading Hate-speech Detection AI
Abstract:
Social media and online forums are becoming increasingly popular. Unfortunately, these platforms are being used for spreading hate speech. In this paper, we design black-box techniques to protect users from hate speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate speech detection models, thereby decreasing their efficiency. We also ensure a minimal change in the original meaning of the hate speech. Our best perturbation attack successfully evades hate speech detection for 86.8% of hateful text.
Chinese: 本文设计了黑盒技术,通过生成扰动来规避最先进的仇恨言论检测模型,在保持原意基本不变的同时,成功使86.8%的仇恨文本逃过检测。
English: This paper develops black-box techniques that generate perturbations to evade state-of-the-art hate speech detection models; the best attack evades detection for 86.8% of hateful text while minimally altering the original meaning.

Authors:Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Title: IRONIC: Coherence-Aware Reasoning Chains for Multi-Modal Sarcasm Detection
Abstract:
Interpreting figurative language such as sarcasm across multi-modal inputs presents unique challenges, often requiring task-specific fine-tuning and extensive reasoning steps. However, current Chain-of-Thought approaches do not efficiently leverage the same cognitive processes that enable humans to identify sarcasm. We present IRONIC, an in-context learning framework that leverages Multi-modal Coherence Relations to analyze referential, analogical and pragmatic image-text linkages. Our experiments show that IRONIC achieves state-of-the-art performance on zero-shot Multi-modal Sarcasm Detection across different baselines. This demonstrates the need for incorporating linguistic and cognitive insights into the design of multi-modal reasoning strategies. Our code is available at: https://github.com/aashish2000/IRONIC
中文摘要:IRONIC是一种新颖的情境学习框架,通过利用多模态连贯关系实现了零样本多模态讽刺检测的最优性能,证明了将语言学和认知原理融入多模态推理设计的重要性。
English Summary: IRONIC is a novel in-context learning framework that uses multi-modal coherence relations to achieve state-of-the-art zero-shot sarcasm detection, demonstrating the importance of integrating linguistic and cognitive principles into multi-modal reasoning.

Authors:Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, Yulun Zhang
Title: DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution
Abstract:
Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step sampling, provide a potential solution. Nonetheless, achieving one-step VSR remains challenging due to the high training overhead on video data and stringent fidelity demands. To tackle these issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (i.e., CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a 28× speed-up over existing methods such as MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE.
Chinese: 作者提出DOVE,一种高效的单步扩散模型用于视频超分辨率,在实现与多步方法相当或更优性能的同时,推理速度提升高达28倍。
English: The authors propose DOVE, an efficient one-step diffusion model for video super-resolution that achieves comparable or better performance than multi-step methods while offering up to 28× faster inference speed.

Authors:Henry X. Liu, Xintao Yan, Haowei Sun, Tinghan Wang, Zhijie Qiao, Haojie Zhu, Shengyin Shen, Shuo Feng, Greg Stevens, Greg McGuire
Title: Behavioral Safety Assessment towards Large-scale Deployment of Autonomous Vehicles
Abstract:
Autonomous vehicles (AVs) have significantly advanced in real-world deployment in recent years, yet safety continues to be a critical barrier to widespread adoption. Traditional functional safety approaches, which primarily verify the reliability, robustness, and adequacy of AV hardware and software systems from a vehicle-centric perspective, do not sufficiently address the AV's broader interactions and behavioral impact on the surrounding traffic environment. To overcome this limitation, we propose a paradigm shift toward behavioral safety, a comprehensive approach focused on evaluating AV responses and interactions within the traffic environment. To systematically assess behavioral safety, we introduce a third-party AV safety assessment framework comprising two complementary evaluation components: the Driver Licensing Test and the Driving Intelligence Test. The Driver Licensing Test evaluates the AV's reactive behaviors under controlled scenarios, ensuring basic behavioral competency. In contrast, the Driving Intelligence Test assesses the AV's interactive behaviors within naturalistic traffic conditions, quantifying the frequency of safety-critical events to deliver statistically meaningful safety metrics before large-scale deployment. We validated our proposed framework using Autoware.Universe, an open-source Level 4 AV, tested both in simulated environments and on the physical test track at the University of Michigan's Mcity Testing Facility. The results indicate that Autoware.Universe passed 6 out of 14 scenarios and exhibited a crash rate of 3.01e-3 crashes per mile, approximately 1,000 times higher than the average human driver crash rate. During the tests, we also uncovered several unknown unsafe scenarios for Autoware.Universe. These findings underscore the necessity of behavioral safety evaluations for improving AV safety performance prior to widespread public deployment.
中文: 本文提出行为安全新范式以解决自动驾驶车辆的交通交互问题,通过包含驾驶执照测试和驾驶智能测试的双重评估框架,实证研究发现其事故率远超人类驾驶员水平。
English: This paper proposes a behavioral safety paradigm to address autonomous vehicles' traffic interactions, introducing a dual-component assessment framework that reveals significant safety gaps compared to human drivers through real-world testing.

Authors:Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, XiaoFeng Wang, Wenyuan Xu, Wei Dong, Xinfeng Li
Title: AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Abstract:
The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic evaluation of these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.
Chinese: 本文提出AudioTrust框架,通过涵盖六个关键维度的系统性评估方法,针对音频大语言模型的信任度进行测试,发现14种先进模型在面对4,420多个真实场景音频样本时存在显著缺陷。
English: This paper introduces AudioTrust, a comprehensive framework designed to systematically evaluate the trustworthiness of Audio Large Language Models (ALLMs) by addressing audio-specific risks across six key dimensions, revealing significant vulnerabilities in 14 state-of-the-art models when tested with over 4,420 real-world audio samples.

Authors:Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Shun Zhang, Xingjian Du, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Gelei Deng, Haoyang Li, Yiming Li, Xiaobin Zhuang, Tianlong Chen, Qingsong Wen, Tianwei Zhang, Yang Liu, Haibo Hu, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, Wenyuan Xu, XiaoFeng Wang, Wei Dong, Xinfeng Li
Title: AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Abstract:
Audio Large Language Models (ALLMs) have gained widespread adoption, yet their trustworthiness remains underexplored. Existing evaluation frameworks, designed primarily for text, fail to address unique vulnerabilities introduced by audio's acoustic properties. We identify significant trustworthiness risks in ALLMs arising from non-semantic acoustic cues, including timbre, accent, and background noise, which can manipulate model behavior. We propose AudioTrust, a comprehensive framework for systematic evaluation of ALLM trustworthiness across audio-specific risks. AudioTrust encompasses six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. The framework implements 26 distinct sub-tasks using a curated dataset of over 4,420 audio samples from real-world scenarios, including daily conversations, emergency calls, and voice assistant interactions. We conduct comprehensive evaluations across 18 experimental configurations using human-validated automated pipelines. Our evaluation of 14 state-of-the-art open-source and closed-source ALLMs reveals significant limitations when confronted with diverse high-risk audio scenarios, providing insights for secure deployment of audio models. Code and data are available at https://github.com/JusperLee/AudioTrust.
Chinese: 本文提出AudioTrust框架,通过涵盖六个关键维度的系统性评估方法,针对音频大语言模型的信任度进行测试,发现14种先进模型在面对4,420多个真实场景音频样本时存在显著缺陷。
English: This paper introduces AudioTrust, a comprehensive framework designed to systematically evaluate the trustworthiness of Audio Large Language Models (ALLMs) by addressing audio-specific risks across six key dimensions, revealing significant vulnerabilities in 14 state-of-the-art models when tested with over 4,420 real-world audio samples.

Authors:Yuqing Yang, Robin Jia
Title: When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
Abstract:
Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as "retraction" and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledge. While LLMs are capable of retraction, they do so only infrequently. We demonstrate that retraction is closely tied to previously identified indicators of models' internal belief: models fail to retract wrong answers that they "believe" to be factually correct. Steering experiments further demonstrate that internal belief causally influences model retraction. In particular, when the model does not believe its answer, this not only encourages the model to attempt to verify the answer, but also alters attention behavior during self-verification. Finally, we demonstrate that simple supervised fine-tuning significantly improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available on https://github.com/ayyyq/llm-retraction.
中文: 本研究探讨大型语言模型何时及为何撤回错误答案,发现撤回行为罕见且与模型对事实正确性的内部信念存在因果关联,监督微调可通过优化这些信念显著提升撤回表现。
English: This study investigates when and why large language models (LLMs) retract incorrect answers, finding that retraction is rare and causally linked to the model's internal belief about factual correctness, with supervised fine-tuning shown to improve performance by refining these beliefs.

Authors:Liyan Wang, Weixiang Zhou, Cong Wang, Kin-Man Lam, Zhixun Su, Jinshan Pan
Title: Deep Learning-Driven Ultra-High-Definition Image Restoration: A Survey
Abstract:
Ultra-high-definition (UHD) image restoration aims to specifically solve the problem of quality degradation in ultra-high-resolution images. Recent advancements in this field are predominantly driven by deep learning-based innovations, including enhancements in dataset construction, network architecture, sampling strategies, prior knowledge integration, and loss functions. In this paper, we systematically review recent progress in UHD image restoration, covering various aspects ranging from dataset construction to algorithm design. This serves as a valuable resource for understanding state-of-the-art developments in the field. We begin by summarizing degradation models for various image restoration subproblems, such as super-resolution, low-light enhancement, deblurring, dehazing, deraining, and desnowing, and emphasizing the unique challenges of their application to UHD image restoration. We then highlight existing UHD benchmark datasets and organize the literature according to degradation types and dataset construction methods. Following this, we showcase major milestones in deep learning-driven UHD image restoration, reviewing the progression of restoration tasks, technological developments, and evaluations of existing methods. We further propose a classification framework based on network architectures and sampling strategies, helping to clearly organize existing methods. Finally, we share insights into the current research landscape and propose directions for further advancements. A related repository is available at https://github.com/wlydlut/UHD-Image-Restoration-Survey.
中文: 本文系统综述了超高清图像复原领域的最新进展,涵盖退化模型、基准数据集、深度学习技术及未来研究方向。
English: This paper provides a systematic review of recent advances in ultra-high-definition image restoration, covering degradation models, benchmark datasets, deep learning techniques, and future research directions.

Authors:Bin Xu, Yu Bai, Huashan Sun, Yiguan Lin, Siming Liu, Xinyue Liang, Yaolin Li, Yang Gao, Heyan Huang
Title: EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
Abstract:
As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data containing 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at https://github.com/ybai-nlp/EduBench.
中文: 本文推出首个教育场景多样化基准EduBench,涵盖9大场景的合成数据和多维评估指标,通过人工标注验证有效性,并成功训练出性能媲美顶尖大模型的小型教育语言模型。
English: This paper introduces EduBench, the first diverse benchmark for educational language models, featuring synthetic data across 9 scenarios and multi-dimensional metrics validated through human annotation, with a trained small model matching top large models' performance.

Authors:Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang
Title: Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation
Abstract:
Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with faithfulness or hallucination, extracting more precise and disentangled hallucination-related representations. Our analysis demonstrates that interventions along the identified faithful direction can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a plug-and-play method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead. The code is available at https://github.com/huazhenglin2003/SSL.
中文摘要:本文提出SSL方法,通过稀疏自编码器识别并调控大视觉语言模型中的潜在语义方向,在可忽略的额外时间成本下有效缓解幻觉现象,同时保持语义完整性和跨模型架构的迁移能力。
English Summary: This paper introduces SSL, a plug-and-play method using sparse autoencoders to identify and steer latent directions in large vision-language models, effectively reducing hallucinations while maintaining semantic integrity and transferability across architectures with minimal computational overhead.

Authors:Yuke Zhang
Title: Interpretable Machine Learning for Macro Alpha: A News Sentiment Case Study
Abstract:
This study introduces an interpretable machine learning (ML) framework to extract macroeconomic alpha from global news sentiment. We process the Global Database of Events, Language, and Tone (GDELT) Project's worldwide news feed using FinBERT -- a Bidirectional Encoder Representations from Transformers (BERT) based model pretrained on finance-specific language -- to construct daily sentiment indices incorporating mean tone, dispersion, and event impact. These indices drive an XGBoost classifier, benchmarked against logistic regression, to predict next-day returns for EUR/USD, USD/JPY, and 10-year U.S. Treasury futures (ZN). Rigorous out-of-sample (OOS) backtesting (5-fold expanding-window cross-validation, OOS period: c. 2017-April 2025) demonstrates exceptional, cost-adjusted performance for the XGBoost strategy: Sharpe ratios achieve 5.87 (EUR/USD), 4.65 (USD/JPY), and 4.65 (Treasuries), with respective compound annual growth rates (CAGRs) exceeding 50% in Foreign Exchange (FX) and 22% in bonds. Shapley Additive Explanations (SHAP) affirm that sentiment dispersion and article impact are key predictive features. Our findings establish that integrating domain-specific Natural Language Processing (NLP) with interpretable ML offers a potent and explainable source of macro alpha.
Chinese: 本研究通过结合领域特定的自然语言处理与可解释机器学习,构建了一个基于新闻情感的宏观经济预测框架,在汇率和债券市场中实现了卓越且可解释的投资回报。
English: This study develops an interpretable machine learning framework using news sentiment analysis to predict financial market returns, demonstrating exceptional performance in foreign exchange and bond markets through rigorous testing.
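The two quantities the abstract leans on, a daily sentiment index and an annualized Sharpe ratio, can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's pipeline: the function names are hypothetical, dispersion is taken to be the population standard deviation of article tones, the risk-free rate is assumed to be zero, and 252 trading days per year is an assumed convention.

```python
import math
import statistics


def daily_sentiment_index(tones):
    """Summarize one day's article tone scores as (mean tone, dispersion).

    Assumption: dispersion is the population standard deviation of the tones.
    """
    mean_tone = statistics.fmean(tones)
    dispersion = statistics.pstdev(tones) if len(tones) > 1 else 0.0
    return mean_tone, dispersion


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio of daily returns, assuming a zero risk-free rate."""
    mean = statistics.fmean(daily_returns)
    std = statistics.stdev(daily_returns)  # sample standard deviation
    if std == 0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    return mean / std * math.sqrt(periods_per_year)
```

Because the ratio scales with the square root of the number of periods, the assumed `periods_per_year` convention directly affects any reported Sharpe figure.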

Authors:Haohan Wang, Xu Shi, Hengyu Zhang, Yashuai Cao, Jintao Wang
Title: Beamforming-Codebook-Aware Channel Knowledge Map Construction for Multi-Antenna Systems
Abstract:
Channel knowledge map (CKM) has emerged as a crucial technology for next-generation communication, enabling the construction of high-fidelity mappings between spatial environments and channel parameters via electromagnetic information analysis. Traditional CKM construction methods like ray tracing are computationally intensive. Recent studies utilizing neural networks (NNs) have achieved efficient CKM generation with reduced computational complexity and real-time processing capabilities. Nevertheless, existing research predominantly focuses on single-antenna systems, failing to address the beamforming requirements inherent to MIMO configurations. Given that appropriate precoding vector selection in MIMO systems can substantially enhance user communication rates, this paper presents a TransUNet-based framework for constructing CKM, which effectively incorporates discrete Fourier transform (DFT) precoding vectors. The proposed architecture combines a UNet backbone for multiscale feature extraction with a Transformer module to capture global dependencies among encoded linear vectors. Experimental results demonstrate that the proposed method outperforms state-of-the-art (SOTA) deep learning (DL) approaches, yielding a 17% improvement in RMSE compared to RadioWNet. The code is publicly accessible at https://github.com/github-whh/TransUNet.
Chinese: 本文提出了一种基于TransUNet的信道知识图谱构建框架,通过集成DFT预编码向量满足MIMO波束成形需求,相比现有方法将RMSE指标提升了17%。
English: This paper introduces a TransUNet-based framework for constructing channel knowledge maps (CKM) that incorporates DFT precoding vectors to address MIMO beamforming needs, achieving a 17% RMSE improvement over existing methods.
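For readers unfamiliar with DFT precoding, the sketch below builds a textbook DFT codebook for a uniform linear array and picks the codeword with the highest beamforming gain for a given channel vector. It is an illustration only: the function names are hypothetical, the sign convention of the exponent is an assumption, and the codebook used in the paper may be sized or oversampled differently.

```python
import cmath
import math


def dft_codebook(num_antennas):
    """Return num_antennas unit-norm DFT codewords (one per beam direction)."""
    n = num_antennas
    scale = 1.0 / math.sqrt(n)
    return [
        [scale * cmath.exp(2j * math.pi * k * m / n) for m in range(n)]
        for k in range(n)
    ]


def beamforming_gain(codeword, channel):
    """Received power |w^H h|^2 for precoder w and channel vector h."""
    inner = sum(w.conjugate() * h for w, h in zip(codeword, channel))
    return abs(inner) ** 2


def best_codeword_index(codebook, channel):
    """Index of the codeword that maximizes the beamforming gain."""
    return max(range(len(codebook)),
               key=lambda k: beamforming_gain(codebook[k], channel))
```

A codebook-aware map in this sense would store a channel quantity per location and per codeword index, rather than a single value per location.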

Authors:Nathan Brady, David Tennyson, Thomas Vandermeulen
Title: Machine Learning the 6d Supergravity Landscape
Abstract:
In this paper, we apply both supervised and unsupervised machine learning algorithms to the study of the string landscape and swampland in 6 dimensions. Our data are the (almost) anomaly-free 6-dimensional $\mathcal{N} = (1,0)$ supergravity models, characterised by the Gram matrix of anomaly coefficients. Our work demonstrates the ability of machine learning algorithms to efficiently learn highly complex features of the landscape and swampland. Employing an autoencoder for unsupervised learning, we provide an auto-classification of these models by compressing the Gram matrix data to 2 dimensions. Through compression, similar models cluster together, and we identify prominent features of these clusters. The autoencoder also identifies outlier models which are difficult to reconstruct. One of these outliers proves to be incredibly difficult to combine with other models such that the $\text{tr}R^{4}$ anomaly vanishes, making its presence in the landscape extremely rare. Further, we utilise supervised learning to build two classifiers predicting (1) model consistency under probe string insertion (precision: 0.78, predicting consistency for 214,837 models with reasonable certainty) and (2) inconsistency under anomaly inflow (precision: 0.91, predicting inconsistency for 1,909,359 models). Notably, projecting these predictions onto the autoencoder's 2-dimensional latent layer shows consistent models clustering together, further indicating that the autoencoder has learnt interesting and complex features of the set of models and potentially offers a novel approach to mapping the landscape and swampland of 6-dimensional supergravity theories.
中文: 本研究运用监督和无监督机器学习分析六维超引力模型,通过降维和预测聚类展示了算法如何有效分类景观特征并识别罕见异常值。
English: This study employs supervised and unsupervised machine learning to analyze 6D supergravity models, demonstrating how algorithms can classify landscape features and identify rare outliers through dimensionality reduction and predictive clustering.

Authors:Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye
Title: Scalable Graph Generative Modeling via Substructure Sequences
Abstract:
Graph neural networks (GNNs) have been predominantly driven by message-passing, where node representations are iteratively updated via local neighborhood aggregation. Despite their success, message-passing suffers from fundamental limitations -- including constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies. These issues hinder scalability: increasing data size or model size often fails to yield improved performance, limiting the viability of GNNs as backbones for graph foundation models. In this work, we explore pathways beyond message-passing and introduce Generative Graph Pattern Machine (G$^2$PM), a generative Transformer pre-training framework for graphs. G$^2$PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures, and employs generative pre-training over the sequences to learn generalizable, transferable representations. Empirically, G$^2$PM demonstrates strong scalability: on the ogbn-arxiv benchmark, it continues to improve with model sizes up to 60M parameters, outperforming prior generative approaches that plateau at significantly smaller scales (e.g., 3M). In addition, we systematically analyze the model design space, highlighting key architectural choices that contribute to its scalability and generalization. Across diverse tasks -- including node classification, graph classification, and transfer learning -- G$^2$PM consistently outperforms strong baselines, establishing a compelling foundation for scalable graph learning. The code and dataset are available at https://github.com/Zehong-Wang/G2PM.
中文: 研究者提出G$^2$PM生成式Transformer框架,通过将图表示为子结构序列来克服图神经网络中消息传递的局限性,在多项任务中展现出卓越的可扩展性和性能优势。
English: The authors propose G$^2$PM, a generative Transformer framework that overcomes message-passing limitations in graph neural networks by representing graphs as sequences of substructures, demonstrating superior scalability and performance across diverse tasks.

Authors:Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye
Title: Scalable Graph Generative Modeling via Substructure Sequences
Abstract:
Graph neural networks (GNNs) have been predominantly driven by message-passing, where node representations are iteratively updated via local neighborhood aggregation. Despite their success, message-passing suffers from fundamental limitations -- including constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies. These issues hinder scalability: increasing data size or model size often fails to yield improved performance. To this end, we explore pathways beyond message-passing and introduce Generative Graph Pattern Machine (G$^2$PM), a generative Transformer pre-training framework for graphs. G$^2$PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures, and employs generative pre-training over the sequences to learn generalizable and transferable representations. Empirically, G$^2$PM demonstrates strong scalability: on the ogbn-arxiv benchmark, it continues to improve with model sizes up to 60M parameters, outperforming prior generative approaches that plateau at significantly smaller scales (e.g., 3M). In addition, we systematically analyze the model design space, highlighting key architectural choices that contribute to its scalability and generalization. Across diverse tasks -- including node/link/graph classification, transfer learning, and cross-graph pretraining -- G$^2$PM consistently outperforms strong baselines, establishing a compelling foundation for scalable graph learning. The code and dataset are available at https://github.com/Zehong-Wang/G2PM.
中文: 研究者提出G$^2$PM生成式Transformer框架,通过将图表示为子结构序列来克服图神经网络中消息传递的局限性,在多项任务中展现出卓越的可扩展性和性能优势。
English: The authors propose G$^2$PM, a generative Transformer framework that overcomes message-passing limitations in graph neural networks by representing graphs as sequences of substructures, demonstrating superior scalability and performance across diverse tasks.

Authors:Hyang Cui
Title: LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods
Abstract:
Recent studies have applied large language models (LLMs) to machine translation quality estimation (MTQE) by prompting models to assign numeric scores. Nonetheless, these direct scoring methods tend to show low segment-level correlation with human judgments. In this paper, we propose a generation-based evaluation paradigm that leverages decoder-only LLMs to produce high-quality references, followed by semantic similarity scoring using sentence embeddings. We conduct the most extensive evaluation to date in MTQE, covering 8 LLMs and 8 language pairs. Empirical results show that our method outperforms both intra-LLM direct scoring baselines and external non-LLM reference-free metrics from MTME. These findings demonstrate the strength of generation-based evaluation and support a shift toward hybrid approaches that combine fluent generation with accurate semantic assessment.
Chinese: 本文提出了一种基于生成的机器翻译质量评估方法,利用仅解码器大语言模型生成高质量参考译文,并通过句子嵌入进行语义相似度评分,其表现优于现有评分基准和外部无参考指标。
English: This paper introduces a generation-based evaluation method for machine translation quality estimation that uses decoder-only large language models to create references and assesses semantic similarity with sentence embeddings, outperforming existing scoring baselines and metrics.
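The scoring step described in the abstract, comparing a machine translation against an LLM-generated reference through sentence embeddings, reduces to a similarity between two vectors. The sketch below shows that step with cosine similarity; it is an assumption that cosine is the similarity used, the function names are hypothetical, and the embedding vectors are expected to come from whichever sentence-embedding model is chosen (none is loaded here).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def score_segments(embedding_pairs):
    """Score each (translation_embedding, reference_embedding) pair."""
    return [cosine_similarity(hyp, ref) for hyp, ref in embedding_pairs]
```

Segment-level scores produced this way can then be correlated with human judgments, which is the comparison the abstract reports against direct numeric scoring.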

Authors:Junhong Lin, Xinyue Zeng, Jie Zhu, Song Wang, Julian Shun, Jun Wu, Dawei Zhou
Title: Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning
Abstract:
Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent works have tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BBAM (Bayesian Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the $E^3$ metric to capture the trade-off between correctness and computation efficiency. Building on theoretical results from BBAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to +70% accuracy gains, -39% token reduction, and +187.5% improvement in $E^3$. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at https://github.com/junhongmit/P-and-B.
中文: 大语言模型常因过度思考导致推理效率低下,而提出的Plan-and-Budget框架通过分解查询并自适应分配计算资源,显著提升了任务处理的效率和准确性。
English: Large Language Models often suffer from overthinking, leading to inefficient reasoning, but the proposed Plan-and-Budget framework addresses this by decomposing queries and adaptively allocating token budgets, significantly improving efficiency and accuracy across tasks.
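The idea of allocating a token budget across sub-questions according to estimated complexity can be illustrated with a simple proportional split. This is a stand-in rule, not the paper's adaptive scheduling or the BBAM model, whose details the abstract does not give; the function name and the complexity scores are hypothetical.

```python
import math
from fractions import Fraction


def allocate_budget(complexities, total_tokens):
    """Split total_tokens across sub-questions in proportion to complexity.

    Uses exact rational arithmetic and largest-remainder rounding so the
    integer budgets always sum to total_tokens.
    """
    if not complexities:
        return []
    if any(c < 0 for c in complexities):
        raise ValueError("complexity estimates must be non-negative")
    weights = [Fraction(c) for c in complexities]
    if sum(weights) == 0:
        weights = [Fraction(1)] * len(weights)  # fall back to an even split
    total = sum(weights)
    raw = [Fraction(total_tokens) * w / total for w in weights]
    budgets = [math.floor(r) for r in raw]
    leftover = total_tokens - sum(budgets)
    # Hand the remaining tokens to the largest fractional parts first.
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets
```

For example, complexity estimates of 1, 2, and 1 with a 400-token total give the middle sub-question half of the budget.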

Authors:Naiqi Li, Peiyuan Liu, Zheng Liu, Tao Dai, Yong Jiang, Shu-Tao Xia
Title: Logic-of-Thought: Empowering Large Language Models with Logic Programs for Solving Puzzles in Natural Language
Abstract:
Solving puzzles in natural language poses a long-standing challenge in AI. While large language models (LLMs) have recently shown impressive capabilities in a variety of tasks, they continue to struggle with complex puzzles that demand precise reasoning and exhaustive search. In this paper, we propose Logic-of-Thought (Logot), a novel framework that bridges LLMs with logic programming to address this problem. Our method leverages LLMs to translate puzzle rules and states into answer set programs (ASPs), the solutions of which are then accurately and efficiently inferred by an ASP interpreter. This hybrid approach combines the natural language understanding of LLMs with the precise reasoning capabilities of logic programs. We evaluate our method on various grid puzzles and dynamic puzzles involving actions, demonstrating near-perfect accuracy across all tasks. Our code and data are available at: https://github.com/naiqili/Logic-of-Thought.
Chinese: Logic-of-Thought (Logot) 框架通过将大型语言模型与逻辑编程相结合,将谜题转化为答案集程序进行求解,在评估中实现了接近完美的准确性。
English: The Logic-of-Thought (Logot) framework integrates large language models with logic programming to solve complex puzzles by translating them into answer set programs, achieving near-perfect accuracy in evaluations.

Authors:Yash Kumar Atri, Thomas H Shin, Thomas Hartvigsen
Title: Continually Self-Improving Language Models for Bariatric Surgery Question--Answering
Abstract:
While bariatric and metabolic surgery (MBS) is considered the gold standard treatment for severe and morbid obesity, its therapeutic efficacy hinges upon active and longitudinal engagement with multidisciplinary providers, including surgeons, dietitians/nutritionists, psychologists, and endocrinologists. This engagement spans the entire patient journey, from preoperative preparation to long-term postoperative management. However, this process is often hindered by numerous healthcare disparities, such as logistical and access barriers, which impair easy patient access to timely, evidence-based, clinician-endorsed information. To address these gaps, we introduce bRAGgen, a novel adaptive retrieval-augmented generation (RAG)-based model that autonomously integrates real-time medical evidence when response confidence dips below dynamic thresholds. This self-updating architecture ensures that responses remain current and accurate, reducing the risk of misinformation. Additionally, we present bRAGq, a curated dataset of 1,302 bariatric surgery--related questions, validated by an expert bariatric surgeon. bRAGq constitutes the first large-scale, domain-specific benchmark for comprehensive MBS care. In a two-phase evaluation, bRAGgen is benchmarked against state-of-the-art models using both large language model (LLM)--based metrics and expert surgeon review. Across all evaluation dimensions, bRAGgen demonstrates substantially superior performance in generating clinically accurate and relevant responses.
中文: 减重与代谢手术需多学科全程协作,但医疗差异常阻碍患者获取可靠信息,为此开发的bRAGgen自适应AI模型能动态整合实时医学证据,确保回答准确及时,经专家验证表现显著优于现有模型。
English: Bariatric and metabolic surgery requires continuous multidisciplinary care, but healthcare disparities often limit access to reliable information, prompting the development of bRAGgen, an adaptive AI model that integrates real-time medical evidence to provide accurate, up-to-date responses, validated as superior through expert evaluation.

Authors:Ziqing Wang, Kexin Zhang, Zihan Zhao, Yibo Wen, Abhishek Pandey, Han Liu, Kaize Ding
Title: A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization
Abstract:
Large language models (LLMs) are introducing a paradigm shift in molecular discovery by enabling text-guided interaction with chemical spaces through natural language and symbolic notations, with emerging extensions to incorporate multi-modal inputs. To advance the new field of LLM for molecular discovery, this survey provides an up-to-date and forward-looking review of the emerging use of LLMs for two central tasks: molecule generation and molecule optimization. Based on our proposed taxonomy for both problems, we analyze representative techniques in each category, highlighting how LLM capabilities are leveraged across different learning settings. In addition, we include the commonly used datasets and evaluation protocols. We conclude by discussing key challenges and future directions, positioning this survey as a resource for researchers working at the intersection of LLMs and molecular science. A continuously updated reading list is available at https://github.com/REAL-Lab-NU/Awesome-LLM-Centric-Molecular-Discovery.
Chinese: 大语言模型通过文本引导探索化学空间,并支持分子生成与优化等核心任务,正在彻底改变分子发现领域,本综述对此进行了系统梳理与前瞻展望。
English: Large language models are revolutionizing molecular discovery by enabling text-guided exploration of chemical spaces and supporting key tasks like molecule generation and optimization, as detailed in this comprehensive survey.

Authors:Jinpei Guo, Yifei Ji, Zheng Chen, Kai Liu, Min Liu, Wang Rao, Wenbo Li, Yong Guo, Yulun Zhang
Title: OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates
Abstract:
Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates, termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models will be released at https://github.com/jp-guo/OSCAR.
中文: OSCAR是一种新颖的一步扩散编解码器,通过将压缩潜在表示建模为扩散轨迹上的噪声变体,实现了跨多比特率的高效图像压缩,在显著降低计算开销的同时获得了优越的性能。
English: OSCAR is a novel one-step diffusion codec that enables efficient image compression across multiple bit-rates by modeling compressed latents as noisy variants along a diffusion trajectory, achieving superior performance with significantly reduced computational overhead.

Authors:Gagan Bhatia, Maxime Peyrard, Wei Zhao
Title: Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
Abstract:
Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 $\rightarrow$ 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future time periods; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the faster the emergent date abstraction that heals date fragments is accomplished. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year $\rightarrow$ month $\rightarrow$ day). Our datasets and code are made publicly available at https://github.com/gagan3012/date-fragments.
中文: 现代BPE分词器将日期分割成无意义片段,影响时间推理,本研究提出了日期碎片化度量标准、测试基准,并揭示了大语言模型通过涌现的抽象机制重组这些片段的过程。
English: Modern BPE tokenizers fragment calendar dates into meaningless pieces, impairing temporal reasoning, but this work introduces a date fragmentation metric, a benchmark for testing, and reveals how large language models reassemble these fragments through an emergent abstraction mechanism.
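
To make the fragmentation example from the abstract concrete, here is a minimal sketch of one possible fragmentation measure. The abstract does not give the paper's exact definition of the date fragmentation ratio, so the formula below (the share of date components that do not survive as a single token) and the function names are assumptions.

```python
from itertools import accumulate

def spans(pieces):
    """Return (start, end) character offsets for consecutive pieces of a string."""
    ends = list(accumulate(len(p) for p in pieces))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))

def fragmentation_ratio(tokens, components):
    """Fraction of date components NOT kept intact as one token (illustrative only)."""
    if "".join(tokens) != "".join(components):
        raise ValueError("tokens and components must spell the same string")
    token_spans = set(spans(tokens))
    # A component counts as preserved only if some token covers exactly its span.
    broken = [c for c, s in zip(components, spans(components)) if s not in token_spans]
    return len(broken) / len(components)

# The abstract's example: 20250312 is split into 202, 503, 12.
ratio = fragmentation_ratio(["202", "503", "12"], ["2025", "03", "12"])
print(round(ratio, 3))
```

Here the year and month are both split across token boundaries and only the day survives, so two of the three components are fragmented.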

Authors:Duy-Nam Bui, Manh Duong Phung, Hung Pham Duy
Title: Event-based Reconfiguration Control for Time-varying Formation of Robot Swarms in Narrow Spaces
Abstract:
This study proposes an event-based reconfiguration control to navigate a robot swarm through challenging environments with narrow passages such as valleys, tunnels, and corridors. The robot swarm is modeled as an undirected graph, where each node represents a robot capable of collecting real-time data on the environment and the states of other robots in the formation. This data serves as the input for the controller to provide dynamic adjustments between the desired and straight-line configurations. The controller incorporates a set of behaviors, designed using artificial potential fields, to meet the requirements of goal-oriented motion, formation maintenance, tailgating, and collision avoidance. The stability of the formation control is guaranteed via the Lyapunov theorem. Simulation and comparison results show that the proposed controller not only successfully navigates the robot swarm through narrow spaces but also outperforms other established methods in key metrics including the success rate, heading order, speed, travel time, and energy efficiency. Software-in-the-loop tests have also been conducted to validate the controller's applicability in practical scenarios. The source code of the controller is available at https://github.com/duynamrcv/erc.
中文摘要:本研究提出了一种基于事件的重构控制方法,通过实时数据和人工势场动态调整机器人集群编队,使其能够穿越狭窄通道,仿真和测试验证了该方法在成功率、效率和实际应用性方面均优于现有技术。
English Summary: This research introduces an event-based reconfiguration control method that enables a robot swarm to navigate through narrow passages by dynamically adjusting formations using real-time data and artificial potential fields, with simulations and tests confirming its superior performance in success rate, efficiency, and practical applicability.
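
The abstract says the behaviors are designed with artificial potential fields. The sketch below shows the textbook form of such a field (a quadratic attractive potential toward the goal plus a repulsive potential that is active only within a cutoff distance). It is not the authors' controller; the gains `k_att` and `k_rep` and the influence radius `d0` are hypothetical values.

```python
import math

def apf_force(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=2.0):
    """Net 2D force on a robot at pos from a textbook artificial potential field."""
    # Attractive term: pulls the robot linearly toward the goal.
    fx = -k_att * (pos[0] - goal[0])
    fy = -k_att * (pos[1] - goal[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        # Repulsive term: only active inside the influence radius d0.
        if 0.0 < d < d0:
            mag = k_rep * (1.0 / d - 1.0 / d0) / d**2
            fx += mag * dx / d
            fy += mag * dy / d
    return fx, fy

print(apf_force((0.0, 0.0), (1.0, 0.0), []))
print(apf_force((0.0, 0.0), (1.0, 0.0), [(0.5, 0.0)]))
```

With no obstacles the force points straight at the goal; placing an obstacle between the robot and the goal pushes the net force back along the x-axis.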

Authors:Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei
Title: Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
Abstract:
Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce expert offloading that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this local routing consistency varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) Segment Routing Best Performance (SRP), which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) Segment Cache Best Hit Rate (SCH), which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance between cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at https://github.com/ljcleo/moe-lrc.
中文: 本文提出了两种衡量混合专家模型局部路由一致性的指标,发现每层都采用MoE且无共享专家的模型具有最高一致性,通过缓存大小约为活跃专家两倍即可实现内存高效部署。
English: This paper introduces two metrics to measure local routing consistency in Mixture-of-Experts models, revealing that models with MoE on every layer and no shared experts show the highest consistency, enabling memory-efficient deployment with cache sizes about twice the active experts.
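
As a rough illustration of a segment-level cache hit rate, the sketch below computes the best hit rate a single fixed cache could reach over one segment of tokens. The abstract does not give the exact SCH definition, so treating it as "cache the most frequently activated experts in the segment" is an assumption, as is the function name.

```python
from collections import Counter

def best_segment_hit_rate(segment_routing, cache_size):
    """Best hit rate achievable by one fixed cache over a segment of tokens.

    segment_routing: list of sets, the expert ids activated by each token.
    cache_size: number of experts that fit in fast memory.
    """
    counts = Counter(e for experts in segment_routing for e in experts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Caching the most frequently activated experts maximizes the hits.
    hits = sum(n for _, n in counts.most_common(cache_size))
    return hits / total

segment = [{0, 1}, {1, 2}, {1, 3}, {0, 1}]
print(best_segment_hit_rate(segment, cache_size=2))  # 0.75
```

In this toy segment expert 1 is activated by every token and expert 0 by two of them, so a cache holding those two experts serves 6 of the 8 activations.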

Authors:Shujun Liu, Siyuan Wang, Zejun Li, Jianxiang Wang, Cheng Zeng, Zhongyu Wei
Title: OViP: Online Vision-Language Preference Learning for VLM Hallucination
Abstract:
Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. Although recent training-based approaches aim to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that do not reflect actual model errors, thus limiting training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model's own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the trade-off between hallucination suppression and expressiveness. Experiments on hallucination and general benchmarks demonstrate that OViP not only reduces hallucinations while preserving core multi-modal capabilities, but also substantially improves training efficiency. Code is available at https://github.com/lsjlsj35/Online-Vision-Language-Preference-Learning-for-VLM-Hallucination.
中文: 本研究提出的在线视觉语言偏好学习(OViP)框架通过动态构建模型自身幻觉输出的对比训练数据,并利用扩散模型合成负样本图像实现实时监督,在保持核心多模态能力的同时显著缓解了视觉语言模型的幻觉问题。
English: The proposed Online Vision-language Preference Learning (OViP) framework dynamically generates contrastive training data from the model's own hallucinations and synthesizes negative images to provide real-time supervision, effectively reducing visual-text misalignment while maintaining multimodal performance.

Authors:Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun
Title: Pre-training Large Memory Language Models with Internal and External Knowledge
Abstract:
Neural language models are black-boxes -- both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We propose a new class of language models, Large Memory Language Models (LMLM) with a pre-training recipe that stores factual knowledge in both internal weights and an external database. Our approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger, knowledge-dense LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases. This work represents a fundamental shift in how language models interact with and manage factual knowledge.
Chinese: 大记忆语言模型(LMLM)在预训练期间将事实知识同时存储于内部权重和外部数据库,既实现了与更大模型相媲美的性能,又提供了可编辑和可验证的知识库。
English: Large Memory Language Models (LMLM) store factual knowledge in both internal weights and an external database during pre-training, achieving competitive performance with significantly larger models while providing explicit, editable, and verifiable knowledge bases.

Authors:Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Ryan Thomas Noonan, Dongyoung Go, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun
Title: Pre-training Limited Memory Language Models with Internal and External Knowledge
Abstract:
Neural language models are black boxes: both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We introduce Limited Memory Language Models (LMLM), a new class of language models that externalizes factual knowledge to an external database during pre-training rather than memorizing it. Our pre-training approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases.
Chinese: 有限记忆语言模型(LMLM)在预训练期间将事实知识外部化存储到数据库,既实现了与大型模型相媲美的性能,又提供了可编辑和可验证的知识库。
English: Limited Memory Language Models (LMLM) externalize factual knowledge to databases during pre-training, enabling competitive performance with larger models while providing editable and verifiable knowledge bases.
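
The loss-masking idea in the abstract (excluding externally retrieved factual values from the training loss) can be sketched as a masked average of per-token losses. This is an illustrative simplification, not the authors' training code: the per-token negative log-likelihoods and the boolean mask are assumed to be given, and `masked_nll` is a hypothetical helper.

```python
def masked_nll(token_nlls, is_retrieved_value):
    """Mean negative log-likelihood over tokens that are NOT retrieved values.

    token_nlls: per-token losses from the language model.
    is_retrieved_value: True where the token was filled in by a database lookup.
    """
    kept = [nll for nll, masked in zip(token_nlls, is_retrieved_value) if not masked]
    return sum(kept) / len(kept) if kept else 0.0

# Hypothetical sequence in which the third token is a retrieved fact.
losses = [0.5, 1.0, 4.0, 0.25]
mask = [False, False, True, False]
print(masked_nll(losses, mask))
```

Because the retrieved token contributes nothing to the loss, the model receives no gradient signal pushing it to memorize that value in its weights.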

Authors:Ryo Kamoi, Yusen Zhang, Nan Zhang, Sarkar Snigdha Sarathi Das, Rui Zhang
Title: Training Step-Level Reasoning Verifiers with Formal Verification Tools
Abstract:
Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems. In response to these gaps, this paper aims to address the challenges of automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. To achieve this goal, we propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proof, which provide automatic and accurate verification for symbolic tasks. Using this approach, we synthesize a training dataset with error labels on LLM responses for formal logic and theorem proof tasks without human annotation. Although this data synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks. Specifically, PRMs trained with FoVer significantly outperform baseline PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs trained on labels annotated by humans or stronger models, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at https://github.com/psunlpgroup/FoVer.
中文摘要:本文提出FoVer方法,利用形式化验证工具自动生成过程奖励模型的步骤级训练数据,无需人工标注即可实现跨任务泛化,并在多个推理基准测试中达到领先性能。
English Summary: This paper introduces FoVer, an automated method using formal verification tools to generate step-level training data for Process Reward Models (PRMs), enabling cross-task generalization and achieving state-of-the-art performance across diverse reasoning benchmarks without human annotation.

Authors:Ryo Kamoi, Yusen Zhang, Nan Zhang, Sarkar Snigdha Sarathi Das, Rui Zhang
Title: Generalizable Process Reward Models via Formally Verified Training Data
Abstract:
Process Reward Models (PRMs), which provide step-level feedback on reasoning traces generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: creating PRM training data requires costly human annotation to label accurate step-level errors, and existing PRMs are limited to math reasoning domains. In response to these gaps, this paper aims to enable automatic synthesis of accurate PRM training data and the generalization of PRMs to diverse reasoning tasks beyond math reasoning. We propose FoVer, an approach to synthesize PRM training data with accurate step-level error labels automatically annotated by formal verification tools, such as Z3 and Isabelle. To show the practical effectiveness of FoVer, we synthesize a training dataset by annotating step-level error labels on LLM responses to formal logic and theorem proving tasks, without relying on human annotation. While FoVer creates training data with symbolic tasks compatible with formal verification, our experiments show that PRMs trained on our dataset exhibit cross-task generalization, enabling a single PRM to effectively perform verification across diverse reasoning tasks. Specifically, LLM-based PRMs trained with FoVer significantly outperform PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at https://github.com/psunlpgroup/FoVer.
中文摘要:本文提出FoVer方法,利用形式化验证工具自动生成过程奖励模型的步骤级训练数据,无需人工标注即可实现跨任务泛化,并在多个推理基准测试中达到领先性能。
English Summary: This paper introduces FoVer, an automated method using formal verification tools to generate step-level training data for Process Reward Models (PRMs), enabling cross-task generalization and achieving state-of-the-art performance across diverse reasoning benchmarks without human annotation.

Authors:Chih-Kai Yang, Neo S. Ho, Hung-yi Lee
Title: Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey
Abstract:
With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.
Chinese: 本研究针对大型音频语言模型评估标准零散的问题,首次提出系统化分类框架,涵盖听觉处理、知识推理、对话能力和伦理安全四大维度,为领域发展提供首个全面评估指南与资源库。
English: This survey introduces a systematic taxonomy for evaluating large audio-language models across four dimensions—auditory processing, knowledge reasoning, dialogue ability, and ethical safety—addressing fragmented benchmarks and providing the first comprehensive evaluation framework for the field.

Authors:Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun
Title: MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding
Abstract:
Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: https://github.com/yuxiangwei0808/MoRE-Brain.
中文: MoRE-Brain提出了一种基于脑网络原理的混合专家框架,通过分层处理和双阶段路由机制,实现了从fMRI信号到图像的高保真、可适应且可解释的视觉重建,显著提升了重建质量与神经机制的可解释性。
English: MoRE-Brain introduces a neuro-inspired Mixture-of-Experts framework that achieves high-fidelity, adaptable, and interpretable visual reconstruction from fMRI through hierarchical processing and dual-stage routing, advancing both reconstruction quality and mechanistic insight.

Authors:Tony Montes, Fernando Lozano
Title: ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation
Abstract:
Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in tracking objects for grounding over time and in reasoning-based decision-making that better aligns object references with language model outputs, even as newer models get better at both tasks. This work presents an LLM-brained agent for zero-shot VideoQA that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at https://github.com/t-montes/viqagent.
中文摘要:本研究提出了一种基于大语言模型的零样本视频问答智能体,通过结合思维链推理与YOLO-World目标追踪技术,在多个基准测试中实现最优性能,同时支持时间定位交叉验证以提升输出可靠性。
English Summary: This work introduces an LLM-brained agent for zero-shot VideoQA that integrates Chain-of-Thought reasoning with YOLO-World object tracking, achieving state-of-the-art performance across multiple benchmarks while enabling cross-verification of grounding timeframes for enhanced reliability.

Authors:Can Rong, Xin Zhang, Yanxin Xi, Hongjie Sui, Jingtao Ding, Yong Li
Title: Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities
Abstract:
Commuting Origin-destination (OD) flows, capturing daily population mobility of citizens, are vital for sustainable development across cities around the world. However, it is challenging to obtain the data due to the high cost of travel surveys and privacy concerns. Surprisingly, we find that satellite imagery, publicly available across the globe, contains rich urban semantic signals to support high-quality OD flow generation, with over 98% expressiveness of traditional multisource hard-to-collect urban sociodemographic, economics, land use, and point of interest data. This inspires us to design a novel data generator, GlODGen, which can generate OD flow data for any city of interest around the world. Specifically, GlODGen first leverages Vision-Language Geo-Foundation Models to extract urban semantic signals related to human mobility from satellite imagery. These features are then combined with population data to form region-level representations, which are used to generate OD flows via graph diffusion models. Extensive experiments on 4 continents and 6 representative cities show that GlODGen has great generalizability across diverse urban environments on different continents and can generate OD flow data for global cities highly consistent with real-world mobility data. We implement GlODGen as an automated tool, seamlessly integrating data acquisition and curation, urban semantic feature extraction, and OD flow generation together. It has been released at https://github.com/tsinghua-fib-lab/generate-od-pubtools.
中文: GlODGen是一种创新工具,通过从卫星图像中提取城市语义特征并结合人口数据,利用图扩散模型为全球城市生成通勤起讫点流量数据,在不同大陆的多样化城市环境中展现出卓越的准确性和泛化能力。
English: GlODGen is a novel tool that generates commuting origin-destination flow data for global cities by extracting urban semantic features from satellite imagery and combining them with population data through graph diffusion models, demonstrating high accuracy and generalizability across diverse urban environments.

Authors:Tianyi Ma, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Xiaoye Qian, Feifan Bai, Yifan Ding, Xuwei Luo, Shinan Zhang, Keerthiram Murugesan, Chuxu Zhang, Yanfang Ye
Title: AutoData: A Multi-Agent System for Open Web Data Collection
Abstract:
The exponential growth of data-driven systems and AI technologies has intensified the demand for high-quality web-sourced datasets. While existing datasets have proven valuable, conventional web data collection approaches face significant limitations in terms of human effort and scalability. Current data-collecting solutions fall into two categories: wrapper-based methods that struggle with adaptability and reproducibility, and large language model (LLM)-based approaches that incur substantial computational and financial costs. To address these challenges, we propose AutoData, a novel multi-agent system for Automated web Data collection, that requires minimal human intervention, i.e., only necessitating a natural language instruction specifying the desired dataset. In addition, AutoData is designed with a robust multi-agent architecture, featuring a novel oriented message hypergraph coordinated by a central task manager, to efficiently organize agents across research and development squads. Besides, we introduce a novel hypergraph cache system to advance the multi-agent collaboration process that enables efficient automated data collection and mitigates the token cost issues prevalent in existing LLM-based systems. Moreover, we introduce Instruct2DS, a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports. Comprehensive evaluations over Instruct2DS and three existing benchmark datasets demonstrate AutoData's superior performance compared to baseline methods. Case studies on challenging tasks such as picture book collection and paper extraction from surveys further validate its applicability. Our source code and dataset are available at https://github.com/GraphResearcher/AutoData.
中文摘要:AutoData是一种创新的多智能体系统,通过自然语言指令实现自动化网络数据采集,其基于超图的架构显著提升了效率并降低了现有方法的成本。
English Summary: AutoData is a novel multi-agent system that automates web data collection using natural language instructions, featuring a hypergraph-based architecture to enhance efficiency and reduce costs compared to existing methods.

Authors:Kai Yin, Xiangjue Dong, Chengkai Liu, Lipai Huang, Yiming Xiao, Zhewei Liu, Ali Mostafavi, James Caverlee
Title: DisastIR: A Comprehensive Information Retrieval Benchmark for Disaster Management
Abstract:
Effective disaster management requires timely access to accurate and contextually relevant information. Existing Information Retrieval (IR) benchmarks, however, focus primarily on general or specialized domains, such as medicine or finance, neglecting the unique linguistic complexity and diverse information needs encountered in disaster management scenarios. To bridge this gap, we introduce DisastIR, the first comprehensive IR evaluation benchmark specifically tailored for disaster management. DisastIR comprises 9,600 diverse user queries and more than 1.3 million labeled query-passage pairs, covering 48 distinct retrieval tasks derived from six search intents and eight general disaster categories that include 301 specific event types. Our evaluations of 30 state-of-the-art retrieval models demonstrate significant performance variances across tasks, with no single model excelling universally. Furthermore, comparative analyses reveal significant performance gaps between general-domain and disaster management-specific tasks, highlighting the necessity of disaster management-specific benchmarks for guiding IR model selection to support effective decision-making in disaster management scenarios. All source code and the DisastIR benchmark are available at https://github.com/KaiYin97/Disaster_IR.
中文: DisastIR是首个专为灾害管理设计的全面信息检索基准,包含9600个查询和130万标注数据对,旨在解决该领域独特复杂性;评估显示模型性能差异显著,凸显了专用基准的必要性。
English: DisastIR is the first comprehensive information retrieval benchmark designed for disaster management, featuring 9,600 queries and 1.3 million labeled pairs to address the domain's unique complexities, with evaluations showing significant performance gaps among models and the necessity for specialized benchmarks.

Authors:Penghao Wu, Lewei Lu, Ziwei Liu
Title: Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM
Abstract:
Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive computation on visual tokens. Unlike token reduction methods that focus on token-level redundancy, we identify and study the computation-level redundancy on vision tokens to ensure no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further. The code will be made public at https://github.com/penghao-wu/ProxyV.
中文: 大型多模态模型在处理视觉标记时存在计算冗余,ProxyV通过引入代理视觉标记来减轻计算负担,不仅保持性能,还能在与标记削减方法结合时进一步提升效率。
English: Large multimodal models face computational inefficiency with visual tokens, so ProxyV introduces proxy tokens to reduce processing load without losing performance and can even enhance it when combined with token reduction methods.

Authors:Satoshi Kosugi
Title: Leveraging the Powerful Attention of a Pre-trained Diffusion Model for Exemplar-based Image Colorization
Abstract:
Exemplar-based image colorization aims to colorize a grayscale image using a reference color image, ensuring that reference colors are applied to corresponding input regions based on their semantic similarity. To achieve accurate semantic matching between regions, we leverage the self-attention module of a pre-trained diffusion model, which is trained on a large dataset and exhibits powerful attention capabilities. To harness this power, we propose a novel, fine-tuning-free approach based on a pre-trained diffusion model, making two key contributions. First, we introduce dual attention-guided color transfer. We utilize the self-attention module to compute an attention map between the input and reference images, effectively capturing semantic correspondences. The color features from the reference image are then transferred to the semantically matching regions of the input image, guided by this attention map, and finally, the grayscale features are replaced with the corresponding color features. Notably, we utilize dual attention to calculate attention maps separately for the grayscale and color images, achieving more precise semantic alignment. Second, we propose classifier-free colorization guidance, which enhances the transferred colors by combining color-transferred and non-color-transferred outputs. This process improves the quality of colorization. Our experimental results demonstrate that our method outperforms existing techniques in terms of image quality and fidelity to the reference. Specifically, we use 335 input-reference pairs from previous research, achieving an FID of 95.27 (image quality) and an SI-FID of 5.51 (fidelity to the reference). Our source code is available at https://github.com/satoshi-kosugi/powerful-attention.
Chinese: 本文提出了一种无需微调的样例图像着色方法,利用预训练扩散模型的自注意力模块实现精确语义匹配和双重注意力引导的颜色迁移,在图像质量和参考保真度上均优于现有技术。
English: This paper introduces a fine-tuning-free method for exemplar-based image colorization that utilizes the self-attention module of a pre-trained diffusion model to achieve accurate semantic matching and dual attention-guided color transfer, resulting in superior image quality and fidelity compared to existing techniques.
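
The abstract above describes classifier-free colorization guidance as a combination of color-transferred and non-color-transferred outputs. A minimal sketch of one common way to combine two such outputs follows, assuming the standard classifier-free guidance extrapolation; the function name `guided_output`, the `scale` parameter, and the toy feature values are illustrative assumptions and are not taken from the paper.

```python
def guided_output(color_transferred, non_transferred, scale):
    """Combine two outputs by extrapolating from the non-transferred one
    toward the color-transferred one (assumed classifier-free guidance form).

    scale = 1.0 returns the color-transferred output unchanged;
    scale > 1.0 pushes further in the direction the color transfer suggests.
    """
    return [
        n + scale * (c - n)
        for c, n in zip(color_transferred, non_transferred)
    ]


# Toy per-channel values, made up for illustration only.
with_color = [0.75, 0.25]
without_color = [0.5, 0.5]

plain = guided_output(with_color, without_color, scale=1.0)
boosted = guided_output(with_color, without_color, scale=2.0)
```

With `scale=1.0` the result equals the color-transferred output, so any enhancement comes entirely from choosing a scale above 1.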

Authors:Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, Jun Xu
Title: GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
Abstract:
Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding and thereby achieving substantial performance gains. In this paper, we first conduct extensive analysis experiments of three key components of that training pipeline: input design, output evaluation, and policy update, each revealing distinct challenges arising from blindly applying general-purpose RL without adapting to GUI grounding tasks. Input design: Current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: Reward functions based on hit signals or box area allow models to exploit box size, leading to reward hacking and poor localization quality. Policy update: Online RL tends to overfit easy examples due to biases in length and sample difficulty, leading to under-optimization on harder cases. To address these issues, we propose three targeted solutions. First, we adopt a Fast Thinking Template that encourages direct answer generation, reducing excessive reasoning during training. Second, we incorporate a box size constraint into the reward function to mitigate reward hacking. Third, we revise the RL objective by adjusting length normalization and adding a difficulty-aware scaling factor, enabling better optimization on hard samples. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro. This surpasses all prior models of similar size and even outperforms the larger UI-TARS-7B, establishing a new state-of-the-art in GUI agent grounding. The project repository is available at https://github.com/Yuqi-Zhou/GUI-G1.
中文摘要:本文分析了图形用户界面智能体训练中的三大挑战,并提出快速思考模板、边界框约束和优化强化学习目标三项针对性解决方案,使GUI-G1-3B模型在界面定位任务中达到最新最优性能。
English Summary: This paper analyzes challenges in GUI agent training and proposes three targeted solutions (a Fast Thinking Template, box size constraints, and a revised RL objective), enabling their GUI-G1-3B model to achieve state-of-the-art grounding performance.

Authors:Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang
Title: MMaDA: Multimodal Large Diffusion Language Models
Abstract:
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
中文摘要:MMaDA是一种统一的多模态扩散基础模型,通过模态无关架构、混合思维链微调和统一强化学习算法,在文本推理、多模态理解和文生图任务中均实现了领先性能。
English Summary: MMaDA is a unified multimodal diffusion foundation model that integrates a modality-agnostic architecture, mixed chain-of-thought fine-tuning, and a unified RL algorithm to achieve state-of-the-art performance across textual reasoning, multimodal understanding, and text-to-image generation tasks.

Authors:Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, Wenbing Huang
Title: STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task requiring identification of object transformations across images under varying viewpoints. While traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects for improving spatial reasoning. Our work provides critical insights in advancing the research of MLLMs and reasoning models. The codes, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.
Chinese: 本研究提出STAR-R1创新强化学习框架,通过奖励部分正确性和惩罚低效探索,解决了多模态模型在空间推理中的不足,在11项指标上实现最优性能,跨视角场景性能提升23%。
English: The study introduces STAR-R1, a novel reinforcement learning framework that overcomes limitations in multimodal models' spatial reasoning by rewarding partial correctness and penalizing inefficiencies, achieving state-of-the-art performance across 11 metrics with a 23% improvement in cross-view scenarios.
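
The reward design described above (credit for partial correctness, penalties for excessive enumeration and for passive inaction) can be illustrated with a small sketch. This is a hypothetical reward function written only to show the shape of the idea; the name `transformation_reward`, the penalty values, and the tuple encoding of transformations are assumptions, not the reward actually used by STAR-R1.

```python
def transformation_reward(predicted, truth, wrong_penalty=0.5, empty_penalty=1.0):
    """Hypothetical fine-grained reward over predicted transformations.

    Each transformation is an (object, attribute, new_value) tuple.
    - Every correct item earns +1 (partial correctness is rewarded).
    - Every spurious item costs `wrong_penalty`, so listing every
      possible transformation does not pay off.
    - Predicting nothing at all costs `empty_penalty` (passive inaction).
    """
    predicted, truth = set(predicted), set(truth)
    if not predicted:
        return -empty_penalty
    correct = len(predicted & truth)
    wrong = len(predicted - truth)
    return correct - wrong_penalty * wrong


truth = [("cube", "color", "red"), ("ball", "size", "large")]

partial = transformation_reward([("cube", "color", "red")], truth)
enumerated = transformation_reward(
    truth + [("cube", "size", "small"), ("ball", "color", "blue"),
             ("cone", "color", "red"), ("cone", "size", "large")],
    truth,
)
inactive = transformation_reward([], truth)
```

In this toy setting a single correct guess scores higher than enumerating six candidates to cover both true transformations, which is the behaviour a sparse all-or-nothing reward cannot express.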

Authors:Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu, Xin Xu, Mengdi Zhang, Jian Shao, Yongliang Shen, Jun Xiao, Yueting Zhuang
Title: VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Abstract:
Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.
Chinese: 本文提出了VerifyBench和VerifyBench-Hard两个基准测试,专门用于评估推理模型中基于参考的奖励系统,填补了现有验证器评估的空白,并揭示了特别是较小模型在验证器准确性方面仍需显著提升的空间。
English: This paper introduces VerifyBench and VerifyBench-Hard, two benchmarks designed to evaluate reference-based reward systems for reinforcement learning in reasoning models, addressing current gaps in verifier assessment and revealing significant improvement opportunities especially for smaller models.

Authors:Danna Zheng, Mirella Lapata, Jeff Z. Pan
Title: Long-Form Information Alignment Evaluation Beyond Atomic Facts
Abstract:
Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We demonstrate that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at https://github.com/dannalily/DoveScore.
中文: 本文提出MontageLie基准测试,通过组合真实陈述构建欺骗性叙述来揭示现有事实核查方法的脆弱性,并推出DoveScore框架,通过联合验证事实准确性和事件顺序一致性,显著提升了评估鲁棒性。
English: This paper introduces MontageLie, a benchmark revealing vulnerabilities in current fact-checking methods by creating deceptive narratives from truthful statements, and proposes DoveScore, a robust framework that improves evaluation accuracy by verifying both factual accuracy and event-order consistency.
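
To make the weakness described above concrete, the following toy sketch shows how a checker that verifies facts one at a time can accept a reordered narrative, while a check that also looks at event order rejects it. The event strings and both helper functions are invented for illustration; they are not the MontageLie data or the DoveScore implementation.

```python
def facts_all_supported(source_events, claim_events):
    """Per-fact check: is every claimed event present in the source?"""
    known = set(source_events)
    return all(event in known for event in claim_events)


def order_consistent(source_events, claim_events):
    """Order check: do the claimed events appear in the same relative
    order as in the source? Events missing from the source are ignored."""
    position = {event: i for i, event in enumerate(source_events)}
    indices = [position[event] for event in claim_events if event in position]
    return all(a < b for a, b in zip(indices, indices[1:]))


# Source narrative, in true chronological order (made-up example).
source = ["company reports losses", "CEO resigns", "stock recovers"]

# A "montage": only true statements, but rearranged to imply a different story.
montage = ["stock recovers", "CEO resigns"]

per_fact_ok = facts_all_supported(source, montage)  # True: each fact is real
order_ok = order_consistent(source, montage)        # False: order is reversed
```

Verifying both conditions jointly is the simplest form of the idea that factual accuracy alone does not guarantee alignment with the source.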

Authors:Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang
Title: dKV-Cache: The Cache for Diffusion Language Models
Abstract:
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants to cache key and value step-by-step: (1) dKV-Cache-Decode, which provides almost lossless acceleration, and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference. (2) dKV-Cache-Greedy, which has aggressive caching with reduced lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. Overall, dKV-Cache achieves a 2-10x speedup in inference, largely narrowing the gap between ARs and DLMs. We evaluate our dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical, and code-generation benchmarks. Experiments demonstrate that a cache can also be used in DLMs, even in a training-free manner with current DLMs.
中文摘要:扩散语言模型通过提出延迟KV缓存机制,实现了2-10倍的推理加速,在保持甚至提升多项语言任务性能的同时,显著缩小了与自回归模型的速度差距。
English Summary: Diffusion Language Models (DLMs) have overcome their slow inference limitation through a novel delayed KV-Cache mechanism that achieves 2-10x speedup while maintaining or even enhancing performance on various language tasks.

Authors:Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, Xin Eric Wang
Title: Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
Abstract:
Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning within the boundaries of human language, processing discrete token embeddings that represent fixed points in the semantic space. This discrete constraint restricts the expressive power and upper potential of such reasoning models, often causing incomplete exploration of reasoning paths, as standard Chain-of-Thought (CoT) methods rely on sampling one token per step. In this work, we introduce Soft Thinking, a training-free method that emulates human-like "soft" reasoning by generating soft, abstract concept tokens in a continuous concept space. These concept tokens are created by the probability-weighted mixture of token embeddings, which form the continuous concept space, enabling smooth transitions and richer representations that transcend traditional discrete boundaries. In essence, each generated concept token encapsulates multiple meanings from related discrete tokens, implicitly exploring various reasoning paths to converge effectively toward the correct answer. Empirical evaluations on diverse mathematical and coding benchmarks consistently demonstrate the effectiveness and efficiency of Soft Thinking, improving pass@1 accuracy by up to 2.48 points while simultaneously reducing token usage by up to 22.4% compared to standard CoT. Qualitative analysis further reveals that Soft Thinking outputs remain highly interpretable and readable, highlighting the potential of Soft Thinking to break the inherent bottleneck of discrete language-based reasoning. Code is available at https://github.com/eric-ai-lab/Soft-Thinking.
中文: Soft Thinking是一种无需训练的方法,通过在连续概念空间中生成抽象概念标记来模拟人类推理,提高准确性和效率,同时保持可解释性。
English: Soft Thinking is a training-free method that enhances reasoning by generating abstract concept tokens in a continuous space, improving accuracy and efficiency while maintaining interpretability.
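
The abstract above states that each concept token is a probability-weighted mixture of token embeddings. A minimal pure-Python sketch of that single operation follows; the three-token vocabulary, the two-dimensional embeddings, and the function names are made up for illustration, and the sketch leaves out everything else a full decoding loop would need.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def concept_token(logits, embeddings):
    """Probability-weighted mixture of token embeddings.

    Instead of sampling one token and looking up its embedding, every
    token's embedding contributes in proportion to its probability.
    """
    probs = softmax(logits)
    dim = len(embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(probs, embeddings))
        for d in range(dim)
    ]


# Toy vocabulary of 3 tokens with 2-dimensional embeddings (assumed values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Uniform next-token distribution: the concept token is the plain average.
soft = concept_token([0.0, 0.0, 0.0], embeddings)

# Sharply peaked distribution: the concept token approaches token 0's embedding.
peaked = concept_token([50.0, 0.0, 0.0], embeddings)
```

When the distribution is sharply peaked, the mixture collapses to an ordinary discrete token embedding, so the soft token generalizes the usual one-token-per-step input rather than replacing it.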

Authors:Weihao Xia, Cengiz Oztireli
Title: Exploring The Visual Feature Space for Multimodal Neural Decoding
Abstract:
The intricacy of brain signals drives research that leverages multimodal AI to align brain modalities with visual and textual data for explainable descriptions. However, most existing studies are limited to coarse interpretations, lacking essential details on object descriptions, locations, attributes, and their relationships. This leads to imprecise and ambiguous reconstructions when using such cues for visual decoding. To address this, we analyze different choices of vision feature spaces from pre-trained visual components within Multimodal Large Language Models (MLLMs) and introduce a zero-shot multimodal brain decoding method that interacts with these models to decode across multiple levels of granularity. To assess a model's ability to decode fine details from brain signals, we propose the Multi-Granularity Brain Detail Understanding Benchmark (MG-BrainDub). This benchmark includes two key tasks: detailed descriptions and salient question-answering, with metrics highlighting key visual elements like objects, attributes, and relationships. Our approach enhances neural decoding precision and supports more accurate neuro-decoding applications. Code will be available at https://github.com/weihaox/VINDEX.
中文: 本研究提出了一种零样本多模态脑解码方法,通过整合多模态大语言模型中的视觉特征空间,解决了现有方法在细节描述上的局限性,从而提升了神经解码在不同粒度上的精确性。
English: This study introduces a zero-shot multimodal brain decoding method that integrates vision feature spaces from Multimodal Large Language Models to enhance neural decoding precision across multiple granularities, addressing the limitations of coarse interpretations in existing approaches.

Authors:Zhuodong Jiang, Haoran Wang, Guoxi Huang, Brett Seymour, Nantheera Anantrasirichai
Title: RUSplatting: Robust 3D Gaussian Splatting for Sparse-View Underwater Scene Reconstruction
Abstract:
Reconstructing high-fidelity underwater scenes remains a challenging task due to light absorption, scattering, and limited visibility inherent in aquatic environments. This paper presents an enhanced Gaussian Splatting-based framework that improves both the visual quality and geometric accuracy of deep underwater rendering. We propose decoupled learning for RGB channels, guided by the physics of underwater attenuation, to enable more accurate colour restoration. To address sparse-view limitations and improve view consistency, we introduce a frame interpolation strategy with a novel adaptive weighting scheme. Additionally, we introduce a new loss function aimed at reducing noise while preserving edges, which is essential for deep-sea content. We also release a newly collected dataset, Submerged3D, captured specifically in deep-sea environments. Experimental results demonstrate that our framework consistently outperforms state-of-the-art methods with PSNR gains up to 1.90dB, delivering superior perceptual quality and robustness, and offering promising directions for marine robotics and underwater visual analytics. The code of RUSplatting is available at https://github.com/theflash987/RUSplatting and the dataset Submerged3D can be downloaded at https://zenodo.org/records/15482420.
中文: 本文提出一种增强型高斯泼溅框架,通过解耦RGB学习、帧插值和新型损失函数解决水下场景重建中的颜色失真、稀疏视角和噪声问题,在新深海数据集上实现了最优性能。
English: This paper introduces an enhanced Gaussian Splatting framework that improves underwater scene reconstruction by addressing color distortion, sparse views, and noise through decoupled RGB learning, frame interpolation, and a novel loss function, achieving state-of-the-art results on a new deep-sea dataset.

Authors:Gaurav Srivastava, Zhenyu Bi, Meng Lu, Xuan Wang
Title: DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning
Abstract:
Large language models (LLMs) have improved significantly in their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground truth-free training framework that uses multi-agent debate traces to evolve a single language model. We also introduce a new prompting strategy, Reflect-Critique-Refine, to improve debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on seven reasoning benchmarks with six open-weight models show that our DTE framework achieves substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe strong cross-domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities. Our framework code and trained models are publicly available at https://github.com/ctrl-gaurav/Debate-Train-Evolve
中文:DTE框架通过多智能体辩论和新型提示策略自主提升大语言模型的推理能力,在多项基准测试中无需外部监督即实现显著准确率提升。
English: The DTE framework enhances large language models' reasoning autonomously through multi-agent debates and a novel prompting strategy, achieving significant accuracy gains across multiple benchmarks without external supervision.

Authors:Peng Wang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
Title: LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing
Abstract:
Large Language Models often contain factually incorrect or outdated knowledge, giving rise to model editing methods for precise knowledge updates. However, current mainstream locate-then-edit approaches exhibit a progressive performance decline during sequential editing, due to inadequate mechanisms for long-term knowledge preservation. To tackle this, we model sequential editing as a constrained stochastic programming problem. Given the challenges posed by the cumulative preservation error constraint and the gradually revealed editing tasks, LyapLock is proposed. It integrates queuing theory and Lyapunov optimization to decompose the long-term constrained problem into tractable stepwise subproblems for efficient solving. This is the first model editing framework with rigorous theoretical guarantees, achieving asymptotically optimal editing performance while meeting the constraints of long-term knowledge preservation. Experimental results show that our framework scales sequential editing capacity to over 10,000 edits while stabilizing general capabilities and boosting average editing efficacy by 11.89% over SOTA baselines. Furthermore, it can be leveraged to enhance the performance of baseline methods. Our code is released at https://github.com/caskcsg/LyapLock.
中文: 大语言模型常包含错误知识,为此提出的LyapLock框架结合排队论和李雅普诺夫优化,将顺序编辑分解为可处理的子问题,在确保长期知识保留的同时,将顺序编辑能力扩展至超过10,000次,并将编辑效果提升11.89%。
English: Large Language Models often contain incorrect knowledge, prompting the development of LyapLock, a novel model editing framework that uses queuing theory and Lyapunov optimization to efficiently handle over 10,000 sequential edits while ensuring long-term knowledge preservation and boosting editing efficacy by 11.89%.

Authors:Pingqing Zheng, Jiayin Qin, Fuqi Zhang, Shang Wu, Yu Cao, Caiwen Ding, Yang Zhao
Title: HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases
Abstract:
Large Language Models (LLMs) have demonstrated their potential in hardware design tasks, such as Hardware Description Language (HDL) generation and debugging. Yet, their performance in real-world, repository-level HDL projects with thousands or even tens of thousands of code lines is hindered. To this end, we propose HDLxGraph, a novel framework that integrates Graph Retrieval Augmented Generation (Graph RAG) with LLMs, introducing HDL-specific graph representations by incorporating Abstract Syntax Trees (ASTs) and Data Flow Graphs (DFGs) to capture both code graph view and hardware graph view. HDLxGraph utilizes a dual-retrieval mechanism that not only mitigates the limited recall issues inherent in similarity-based semantic retrieval by incorporating structural information, but also enhances its extensibility to various real-world tasks by a task-specific retrieval finetuning. Additionally, to address the lack of comprehensive HDL search benchmarks, we introduce HDLSearch, a multi-granularity evaluation dataset derived from real-world repository-level projects. Experimental results demonstrate that HDLxGraph significantly improves average search accuracy, debugging efficiency and completion quality by 12.04%, 12.22% and 5.04% compared to similarity-based RAG, respectively. The code of HDLxGraph and collected HDLSearch benchmark are available at https://github.com/Nick-Zheng-Q/HDLxGraph.
中文摘要:HDLxGraph是一个创新框架,将图检索增强生成与大语言模型相结合,通过硬件专用图表示显著提升了现实硬件设计任务中的代码搜索、调试和完成性能。
English Summary: HDLxGraph is a novel framework that integrates Graph Retrieval Augmented Generation with Large Language Models, using hardware-specific graph representations to significantly enhance performance in real-world hardware design tasks such as code search, debugging, and completion.

Authors:Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Hongning Wang, Minlie Huang
Title: Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
Abstract:
Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific LLMs. Surprisingly, we reveal a new and concerning risk that accompanies this practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, only requiring black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popularly used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% of the downstream fine-tuning data (queries) out of a total of 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find it can be bypassed with an improved attack. Overall, we highlight the urgency of this newly identified data breaching risk in fine-tuning, and we hope that more follow-up research could push the progress of addressing this concerning risk. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.
Chinese: 使用专有数据对开源大语言模型进行微调存在严重风险,模型创建者可通过简单的后门训练提取下游私有数据,在理想条件下提取率高达94.9%。
English: Fine-tuning open-source large language models with proprietary data poses a significant risk, as creators can extract private downstream data through simple backdoor training, achieving extraction rates as high as 94.9% in ideal settings.

Authors:Tianjiao Cao, Jiahao Lyu, Weichao Zeng, Weimin Mu, Yu Zhou
Title: The Devil is in Fine-tuning and Long-tailed Problems: A New Benchmark for Scene Text Detection
Abstract:
Scene text detection has seen the emergence of high-performing methods that excel on academic benchmarks. However, these detectors often fail to replicate such success in real-world scenarios. We uncover two key factors contributing to this discrepancy through extensive experiments. First, a Fine-tuning Gap, where models leverage the Dataset-Specific Optimization (DSO) paradigm for one domain at the cost of reduced effectiveness in others, leads to inflated performance on academic benchmarks. Second, the suboptimal performance in practical settings is primarily attributed to the long-tailed distribution of texts, where detectors struggle with rare and complex categories such as artistic or overlapped text. Given that the DSO paradigm might undermine the generalization ability of models, we advocate for a Joint-Dataset Learning (JDL) protocol to alleviate the Fine-tuning Gap. Additionally, an error analysis is conducted to identify three major categories and 13 subcategories of challenges in long-tailed scene text, upon which we propose a Long-Tailed Benchmark (LTB). LTB facilitates a comprehensive evaluation of the ability to handle a diverse range of long-tailed challenges. We further introduce MAEDet, a self-supervised learning-based method, as a strong baseline for LTB. The code is available at https://github.com/pd162/LTB.
中文摘要:场景文本检测器在现实应用中表现不佳,主要因为针对学术基准的微调差距和处理长尾文本分布的困难,为此提出了联合数据集学习协议和新基准,并引入了自监督基线方法。
English Summary: Scene text detectors often underperform in real-world applications for two reasons: a fine-tuning gap that inflates results on academic benchmarks, and difficulty with long-tailed text distributions. The paper proposes a joint-dataset learning protocol and a new long-tailed benchmark with a self-supervised baseline method.

Authors:Pujun Xue, Junyi Ge, Xiaotong Jiang, Siyang Song, Zijian Wu, Yupeng Huo, Weicheng Xie, Linlin Shen, Xiaoqin Zhou, Xiaofeng Liu, Min Gu
Title: Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking
Abstract:
Malocclusion is a major challenge in orthodontics, and its complex presentation and diverse clinical manifestations make accurate localization and diagnosis particularly important. Currently, one of the major shortcomings facing the field of dental image analysis is the lack of large-scale, accurately labeled datasets dedicated to malocclusion issues, which limits the development of automated diagnostics in the field of dentistry and leads to a lack of diagnostic accuracy and efficiency in clinical practice. Therefore, in this study, we propose the Oral and Maxillofacial Natural Images (OMNI) dataset, a novel and comprehensive dental image dataset aimed at advancing the study of analyzing dental images for issues of malocclusion. Specifically, the dataset contains 4,166 multi-view images collected from 384 participants and annotated by professional dentists. In addition, we performed a comprehensive validation of the created OMNI dataset, including three CNN-based methods, two Transformer-based methods, and one GNN-based method, and conducted automated diagnostic experiments for malocclusion issues. The experimental results show that the OMNI dataset can facilitate the automated diagnosis research of malocclusion issues and provide a new benchmark for the research in this field. Our OMNI dataset and baseline code are publicly available at https://github.com/RoundFaceJ/OMNI.
中文摘要:本研究提出了OMNI数据集,这是一个包含专业标注的全面牙科图像集合,旨在解决错颌畸形诊断中缺乏大规模数据的问题,并通过多种验证方法证明了其在推动自动化诊断研究方面的有效性。
English Summary: The study introduces the OMNI dataset, a comprehensive collection of dental images with professional annotations, to address the lack of large-scale data in malocclusion diagnosis and demonstrates its effectiveness in advancing automated diagnostic research through various validation methods.

Authors:Iuliia Kotseruba, John K. Tsotsos
Title: SNAP: A Benchmark for Testing the Effects of Capture Conditions on Fundamental Vision Tasks
Abstract:
Generalization of deep-learning-based (DL) computer vision algorithms to various image perturbations is hard to establish and remains an active area of research. The majority of past analyses focused on the images already captured, whereas effects of the image formation pipeline and environment are less studied. In this paper, we address this issue by analyzing the impact of capture conditions, such as camera parameters and lighting, on DL model performance on 3 vision tasks -- image classification, object detection, and visual question answering (VQA). To this end, we assess capture bias in common vision datasets and create a new benchmark, SNAP (for Shutter speed, ISO seNsitivity, and APerture), consisting of images of objects taken under controlled lighting conditions and with densely sampled camera settings. We then evaluate a large number of DL vision models and show the effects of capture conditions on each selected vision task. Lastly, we conduct an experiment to establish a human baseline for the VQA task. Our results show that computer vision datasets are significantly biased, the models trained on this data do not reach human accuracy even on the well-exposed images, and are susceptible to both major exposure changes and minute variations of camera settings. Code and data can be found at https://github.com/ykotseruba/SNAP
中文: 深度学习计算机视觉模型在图像扰动下的泛化能力不足,主要源于相机参数和光照等采集条件的偏差;SNAP基准测试表明,即使在曝光良好的图像上,模型性能也显著低于人类水平且对相机设置的微小变化敏感。
English: Deep learning computer vision models are sensitive to capture conditions such as camera settings and lighting; the SNAP benchmark shows that common vision datasets are biased, that models fall short of human accuracy even on well-exposed images, and that they are susceptible to both major exposure changes and minute variations of camera settings.

Authors:Ruilin Yao, Bo Zhang, Jirui Huang, Xinwei Long, Yifang Zhang, Tianyu Zou, Yufei Wu, Shichao Su, Yifan Xu, Wenxi Zeng, Zhaoyu Yang, Guoyou Li, Shilan Zhang, Zichan Li, Yaxiong Chen, Shengwu Xiong, Peng Xu, Jiajun Zhang, Bowen Zhou, David Clifton, Luc Van Gool
Title: LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. Existing benchmarks are usually constructed in a task-oriented manner without guaranteeing that different task samples come from the same data distribution, so they often fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To lift this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, forming three progressive task tiers, i.e., perception, understanding, and reasoning. One feature is that each image is equipped with rich annotations for all tasks. Thus, this dataset intrinsically supports evaluating how MLLMs handle image-invariable prompts, from basic perception to compositional reasoning. In addition, our images are manually collected from social media, and 53% of them were published later than Jan. 2025. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models QVQ-72B-preview and Kimi-VL. These models are released later than Dec. 2024, and none of them achieves an accuracy greater than 60% in the reasoning tasks. Project page: https://github.com/Lens4MLLMs/lens. ICCV 2025 workshop page: https://lens4mllms.github.io/mars2-workshop-iccv2025/
Chinese: Lens基准通过3.4K张当代图像和6万多个问题构建了包含感知、理解与推理的三层评估体系,旨在检验多模态大模型的协同认知能力,当前顶尖模型在推理任务中的准确率均未超过60%。
English: The Lens benchmark introduces a multi-level evaluation framework with 3.4K contemporary images and 60K+ questions to assess MLLMs' synergistic capabilities across perception, understanding, and reasoning tasks, where current state-of-the-art models achieve below 60% accuracy in reasoning.

Authors:Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, Junxian He
Title: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
Abstract:
Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with less redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.
中文: 本文提出LASER及其增强版LASER-D方法,通过基于长度的奖励塑造机制减少大推理模型的输出冗余,在显著压缩推理过程的同时实现了更优的性能与效率平衡。
English: This paper introduces LASER and its enhanced version LASER-D, RL-based methods that use length-based reward shaping to reduce redundancy in Large Reasoning Models, achieving superior efficiency and performance with significantly shorter reasoning traces.
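The step-function reward mentioned in the abstract can be illustrated with a minimal sketch. The function name, the base correctness reward of 1.0, and the bonus size are assumptions made for illustration; the paper's actual formulation may differ.

```python
def step_length_reward(is_correct: bool, num_tokens: int,
                       target_length: int, bonus: float = 0.5) -> float:
    """Hypothetical length-based step reward: correctness plus a step bonus."""
    # Base reward for a correct final answer (assumed value).
    base = 1.0 if is_correct else 0.0
    # Step function: the bonus is granted only to correct responses
    # whose length is at or below the target length.
    length_bonus = bonus if (is_correct and num_tokens <= target_length) else 0.0
    return base + length_bonus
```

Under these assumptions, a correct response within the target length scores higher than a correct but longer one, and incorrect responses receive no length bonus, so shortening an answer never compensates for a wrong one.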

Authors:David Dinucu-Jianu, Jakub Macina, Nico Daheim, Ido Hakimi, Iryna Gurevych, Mrinmaya Sachan
Title: From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning
Abstract:
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy which requires strategically withholding answers. To mitigate this, we propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors using simulated student-tutor interactions by emphasizing pedagogical quality and guided problem-solving over simply giving away answers. We use our method to train a 7B parameter tutor model without human annotations which reaches similar performance to larger proprietary models like LearnLM. We introduce a controllable reward weighting to balance pedagogical support and student solving accuracy, allowing us to trace the Pareto frontier between these two objectives. Our models better preserve reasoning capabilities than single-turn SFT baselines and can optionally enhance interpretability through thinking tags that expose the model's instructional planning.
中文摘要:本研究提出一种基于强化学习的对齐框架,通过强调引导式解题而非直接给答案,将大语言模型快速适配为高效辅导教师,在无需人工标注的情况下使70亿参数模型达到与专业模型相当的教学效果,同时保持推理能力并可通过思维标签增强教学策略的可解释性。
English Summary: This study introduces a reinforcement learning framework that aligns large language models with effective tutoring principles by prioritizing guided problem-solving over direct answers, achieving comparable performance to proprietary models while preserving reasoning capabilities and offering interpretability through instructional planning tags.

Authors:Xinyi Lu, Aditya Mahesh, Zejia Shen, Mitchell Dudley, Larissa Sano, Xu Wang
Title: Exploring LLM-Generated Feedback for Economics Essays: How Teaching Assistants Evaluate and Envision Its Use
Abstract:
This project examines the prospect of using AI-generated feedback as suggestions to expedite and enhance human instructors' feedback provision. In particular, we focus on understanding the teaching assistants' perspectives on the quality of AI-generated feedback and how they may or may not utilize AI feedback in their own workflows. We situate our work in a foundational college Economics class, which has frequent short essay assignments. We developed an LLM-powered feedback engine that generates feedback on students' essays based on grading rubrics used by the teaching assistants (TAs). To ensure that TAs can meaningfully critique and engage with the AI feedback, we had them complete their regular grading jobs. For a randomly selected set of essays that they had graded, we used our feedback engine to generate feedback and displayed the feedback as in-text comments in a Word document. We then performed think-aloud studies with 5 TAs over 20 1-hour sessions to have them evaluate the AI feedback, contrast the AI feedback with their handwritten feedback, and share how they envision using the AI feedback if it were offered as suggestions. The study highlights the importance of providing detailed rubrics for AI to generate high-quality feedback for knowledge-intensive essays. TAs considered that using AI feedback as suggestions during their grading could expedite grading, enhance consistency, and improve overall feedback quality. We discuss the importance of decomposing the feedback generation task into steps and presenting intermediate results, in order for TAs to use the AI feedback.
中文摘要:该项目探讨了利用AI生成反馈辅助助教评分,通过评估其质量与工作流程整合,发现基于详细评分标准时能加快评分速度并提高一致性。
English Summary: This project explores using AI-generated feedback to assist teaching assistants in grading by evaluating its quality and integration into their workflows, finding it can speed up grading and improve consistency when guided by detailed rubrics.

Authors:Haocheng Ju, Bin Dong
Title: MIRB: Mathematical Information Retrieval Benchmark
Abstract:
Mathematical Information Retrieval (MIR) is the task of retrieving information from mathematical documents and plays a key role in various applications, including theorem search in mathematical libraries, answer retrieval on math forums, and premise selection in automated theorem proving. However, a unified benchmark for evaluating these diverse retrieval tasks has been lacking. In this paper, we introduce MIRB (Mathematical Information Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB includes four tasks: semantic statement retrieval, question-answer retrieval, premise retrieval, and formula retrieval, spanning a total of 12 datasets. We evaluate 13 retrieval models on this benchmark and analyze the challenges inherent to MIR. We hope that MIRB provides a comprehensive framework for evaluating MIR systems and helps advance the development of more effective retrieval models tailored to the mathematical domain.
中文: 本文提出MIRB基准,通过四项任务和12个数据集统一评估数学信息检索系统,填补了标准化测评空白,并对13种模型进行分析以推动数学领域检索技术的发展。
English: This paper introduces MIRB, a unified benchmark for evaluating Mathematical Information Retrieval across four tasks and 12 datasets, addressing the lack of standardized assessment and analyzing 13 models to advance domain-specific retrieval systems.

Authors:Hua Li, Shijie Lian, Zhiyuan Li, Runmin Cong, Chongyi Li, Laurence T. Yang, Weidong Zhang, Sam Kwong
Title: Advancing Marine Research: UWSAM Framework and UIIS10K Dataset for Precise Underwater Instance Segmentation
Abstract:
With recent breakthroughs in large-scale modeling, the Segment Anything Model (SAM) has demonstrated significant potential in a variety of visual applications. However, due to the lack of underwater domain expertise, SAM and its variants face performance limitations in end-to-end underwater instance segmentation tasks, while their higher computational requirements further hinder their application in underwater scenarios. To address this challenge, we propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. Then, we introduce UWSAM, an efficient model designed for automatic and accurate segmentation of underwater instances. UWSAM efficiently distills knowledge from the SAM ViT-Huge image encoder into the smaller ViT-Small image encoder via the Mask GAT-based Underwater Knowledge Distillation (MG-UKD) method for effective visual representation learning. Furthermore, we design an End-to-end Underwater Prompt Generator (EUPG) for UWSAM, which automatically generates underwater prompts instead of explicitly providing foreground points or boxes as prompts, thus enabling the network to locate underwater instances accurately for efficient segmentation. Comprehensive experimental results show that our model is effective, achieving significant performance improvements over state-of-the-art methods on multiple underwater instance datasets. Datasets and codes are available at https://github.com/LiamLian0727/UIIS10K.
中文摘要:针对分割一切模型(SAM)在水下实例分割任务中因领域知识缺乏和计算需求高而存在的性能局限,本研究提出高效模型UWSAM,通过知识蒸馏和自动提示生成技术,在多个水下数据集上实现了显著性能突破。
English Summary: The Segment Anything Model (SAM) struggles with underwater instance segmentation due to domain expertise gaps and high computational demands, leading to the development of UWSAM—an efficient model using knowledge distillation and automatic prompt generation that achieves superior performance on underwater datasets.

Authors:Xin Huang, Ruibin Li, Tong Jia, Wei Zheng, Ya Wang
Title: Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model's discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs' performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.
中文: 提出的自适应难负样本扰动学习(AHNPL)方法通过生成基于图像的难负样本并采用自适应对比学习,有效提升了视觉语言模型在组合推理任务上的性能。
English: The proposed Adaptive Hard Negative Perturbation Learning (AHNPL) method enhances Vision-Language Models by generating image-based hard negatives and employing adaptive contrastive learning to improve performance on compositional reasoning tasks.

Authors:Andrew Caunes, Thierry Chateau, Vincent Fremont
Title: seg_3D_by_PC2D: Multi-View Projection for Domain Generalization and Adaptation in 3D Semantic Segmentation
Abstract:
3D semantic segmentation plays a pivotal role in autonomous driving and road infrastructure analysis, yet state-of-the-art 3D models are prone to severe domain shift when deployed across different datasets. We propose a novel multi-view projection framework that excels in both domain generalization (DG) and unsupervised domain adaptation (UDA). Our approach first aligns Lidar scans into coherent 3D scenes and renders them from multiple virtual camera poses to create a large-scale synthetic 2D dataset (PC2D). We then use it to train a 2D segmentation model in-domain. During inference, the model processes hundreds of views per scene; the resulting logits are back-projected to 3D with an occlusion-aware voting scheme to generate final point-wise labels. Our framework is modular and enables extensive exploration of key design parameters, such as view generation optimization (VGO), visualization modality optimization (MODO), and 2D model choice. We evaluate on the nuScenes and SemanticKITTI datasets under both the DG and UDA settings. We achieve state-of-the-art results in UDA and close to state-of-the-art in DG, with particularly large gains on large, static classes. Our code and dataset generation tools will be publicly available at https://github.com/andrewcaunes/ia4markings
中文: 本研究提出一种多视角投影框架,通过将激光雷达扫描转换为合成二维数据进行训练,有效解决三维语义分割中的领域偏移问题,在自动驾驶数据集的无监督领域自适应任务中取得最优结果,并在领域泛化方面接近最佳水平。
English: This study introduces a multi-view projection framework that addresses domain shift in 3D semantic segmentation by converting Lidar scans into synthetic 2D data for training, achieving top results in unsupervised domain adaptation and near state-of-the-art in domain generalization on autonomous driving datasets.
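The back-projection step described in the abstract, where per-view 2D logits are aggregated into point-wise 3D labels, can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes occluded points were already filtered out of each view, and all function and parameter names are hypothetical.

```python
def vote_point_labels(view_logits, view_point_ids, num_points, num_classes):
    """Sum per-pixel class logits onto the 3D points they come from, then argmax.

    view_logits: one list per view of per-pixel logit lists (length num_classes).
    view_point_ids: one list per view giving the index of the visible 3D point
        behind each pixel (occluded points are assumed to be excluded already).
    Returns one label per point, or -1 for points never seen in any view.
    """
    votes = [[0.0] * num_classes for _ in range(num_points)]
    seen = [False] * num_points
    for logits, point_ids in zip(view_logits, view_point_ids):
        for pixel_logits, pid in zip(logits, point_ids):
            seen[pid] = True
            for c in range(num_classes):
                votes[pid][c] += pixel_logits[c]
    return [
        max(range(num_classes), key=lambda c: votes[p][c]) if seen[p] else -1
        for p in range(num_points)
    ]
```

Summing logits rather than counting hard votes lets confident views outweigh uncertain ones; whether the paper weights views this way is not stated in the abstract.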

Authors:Yiyun Zhou, Chang Yao, Jingyuan Chen
Title: CoLA: Collaborative Low-Rank Adaptation
Abstract:
The scaling law of Large Language Models (LLMs) reveals a power-law relationship, showing diminishing returns on performance as model scale increases. While training LLMs from scratch is resource-intensive, fine-tuning a pre-trained model for specific tasks has become a practical alternative. Full fine-tuning (FFT) achieves strong performance; however, it is computationally expensive and inefficient. Parameter-efficient fine-tuning (PEFT) methods, like LoRA, have been proposed to address these challenges by freezing the pre-trained model and adding lightweight task-specific modules. LoRA, in particular, has proven effective, but its application to multi-task scenarios is limited by interference between tasks. Recent approaches, such as Mixture-of-Experts (MoE) and asymmetric LoRA, have aimed to mitigate these issues but still struggle with sample scarcity and noise interference due to their fixed structure. In response, we propose CoLA, a more flexible LoRA architecture with an efficient initialization scheme, and introduce three collaborative strategies to enhance performance by better utilizing the quantitative relationships between matrices $A$ and $B$. Our experiments demonstrate the effectiveness and robustness of CoLA, outperforming existing PEFT methods, especially in low-sample scenarios. Our data and code are fully publicly available at https://github.com/zyy-2001/CoLA.
中文摘要:提出的CoLA架构通过优化初始化和协作策略增强了LoRA的灵活性与效率,在低样本场景下相比现有参数高效微调方法展现出更优性能。
English Summary: The proposed CoLA architecture enhances LoRA's flexibility and efficiency through optimized initialization and collaborative strategies, demonstrating superior performance in low-sample scenarios compared to existing parameter-efficient fine-tuning methods.
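For context, the matrices $A$ and $B$ in the abstract are the low-rank factors of standard LoRA. The block below states that standard formulation only; how CoLA arranges or combines multiple $A$ and $B$ matrices is not specified in the abstract and is not shown here. The symbols $W_0$, $d$, $k$, and $r$ are notation introduced for this illustration.

```latex
W = W_0 + \Delta W = W_0 + BA, \qquad
B \in \mathbb{R}^{d \times r}, \quad
A \in \mathbb{R}^{r \times k}, \quad
r \ll \min(d, k)
```

The pre-trained weight $W_0$ stays frozen and only $A$ and $B$ are trained, so the number of trainable parameters drops from $dk$ to $r(d + k)$.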

Authors:Zeqing Wang, Shiyuan Zhang, Chengpei Tang, Keze Wang
Title: TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models
Abstract:
Reasoning about temporal causality, particularly irreversible transformations of objects governed by real-world knowledge (e.g., fruit decay and human aging), is a fundamental aspect of human visual understanding. Unlike temporal perception based on simple event sequences, this form of reasoning requires a deeper comprehension of how object states change over time. Although the current powerful Vision-Language Models (VLMs) have demonstrated impressive performance on a wide range of downstream tasks, their capacity to reason about temporal causality remains underexplored. To address this gap, we introduce TimeCausality, a novel benchmark specifically designed to evaluate the causal reasoning ability of VLMs in the temporal dimension. Based on our TimeCausality, we find that while the current SOTA open-source VLMs have achieved performance levels comparable to closed-source models like GPT-4o on various standard visual question answering tasks, they fall significantly behind on our benchmark compared with their closed-source competitors. Furthermore, even GPT-4o exhibits a marked drop in performance on TimeCausality compared to its results on other tasks. These findings underscore the critical need to incorporate temporal causality into the evaluation and development of VLMs, and they highlight an important challenge for the open-source VLM community moving forward. Code and Data are available at https://github.com/Zeqing-Wang/TimeCausality.
中文: TimeCausality基准测试表明,当前包括顶尖开源模型和GPT-4o等闭源模型在内的视觉语言模型,在时间因果推理方面表现显著不足,这揭示了其开发与评估体系中的重要缺陷。
English: The TimeCausality benchmark reveals that current Vision-Language Models, including top open-source and closed-source ones like GPT-4o, significantly underperform in reasoning about temporal causality, highlighting a critical gap in their development and evaluation.

Authors:Raza Imam, Rufael Marew, Mohammad Yaqub
Title: On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable?
Abstract:
Medical Vision-Language Models (MVLMs) have achieved excellent generalization in medical image analysis, yet their performance under noisy, corrupted conditions remains largely untested. Clinical imaging is inherently susceptible to acquisition artifacts and noise; however, existing evaluations predominantly assess generally clean datasets, overlooking robustness -- i.e., the model's ability to perform under real-world distortions. To address this gap, we first introduce MediMeta-C, a corruption benchmark that systematically applies several perturbations across multiple medical imaging datasets. Combined with MedMNIST-C, this establishes a comprehensive robustness evaluation framework for MVLMs. We further propose RobustMedCLIP, a visual encoder adaptation of a pretrained MVLM that incorporates few-shot tuning to enhance resilience against corruptions. Through extensive experiments, we benchmark 5 major MVLMs across 5 medical imaging modalities, revealing that existing models exhibit severe degradation under corruption and struggle with domain-modality tradeoffs. Our findings highlight the necessity of diverse training and robust adaptation strategies, demonstrating that efficient low-rank adaptation, when paired with few-shot tuning, improves robustness while preserving generalization across modalities.
中文: 医学视觉语言模型在噪声条件下表现不佳,为此我们建立了MediMeta-C基准并开发了RobustMedCLIP适配方法,通过少量样本调优显著提升了模型的抗干扰能力。
English: Medical Vision-Language Models show excellent generalization but perform poorly under noisy conditions, prompting the creation of MediMeta-C benchmark and RobustMedCLIP adaptation to enhance robustness through few-shot tuning.

Authors:Naiqi Li, Yuqiu Xie, Peiyuan Liu, Tao Dai, Yong Jiang, Shu-Tao Xia
Title: Efficient Differentiable Approximation of Generalized Low-rank Regularization
Abstract:
Low-rank regularization (LRR) has been widely applied in various machine learning tasks, but the associated optimization is challenging. Directly optimizing the rank function under constraints is NP-hard in general. To overcome this difficulty, various relaxations of the rank function were studied. However, optimization of these relaxed LRRs typically depends on singular value decomposition, which is a time-consuming and nondifferentiable operator that cannot be optimized with gradient-based techniques. To address these challenges, in this paper we propose an efficient differentiable approximation of the generalized LRR. The considered LRR form subsumes many popular choices like the nuclear norm, the Schatten-$p$ norm, and various nonconvex relaxations. Our method enables LRR terms to be appended to loss functions in a plug-and-play fashion, and the GPU-friendly operations enable efficient and convenient implementation. Furthermore, convergence analysis is presented, which rigorously shows that both the bias and the variance of our rank estimator rapidly reduce with increased sample size and iteration steps. In the experimental study, the proposed method is applied to various tasks, which demonstrates its versatility and efficiency. Code is available at https://github.com/naiqili/EDLRR.
中文: 本文提出了一种高效可微分的广义低秩正则化近似方法,能够以即插即用方式整合到损失函数中,具备GPU友好操作和严格收敛性证明,实验结果表明该方法兼具多功能性和高效性。
English: This paper introduces an efficient differentiable approximation for generalized low-rank regularization, enabling plug-and-play integration into loss functions with GPU-friendly operations and proven convergence, while experimental results validate its versatility and efficiency.
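For reference, the regularizers named in the abstract are all functions of the singular values $\sigma_i(X)$ of a matrix $X$. The first two definitions below are standard; the generic form $R_f$ is notation assumed here to illustrate how one family can subsume the others, and is not necessarily the paper's exact definition.

```latex
\|X\|_* = \sum_{i} \sigma_i(X), \qquad
\|X\|_{S_p} = \Big(\sum_{i} \sigma_i(X)^p\Big)^{1/p}, \qquad
R_f(X) = \sum_{i} f\big(\sigma_i(X)\big)
```

Choosing $f(\sigma) = \sigma$ recovers the nuclear norm, and $f(\sigma) = \sigma^p$ gives $\|X\|_{S_p}^p$; nonconvex relaxations correspond to other choices of $f$. Evaluating any of these directly requires a singular value decomposition, which is the cost the abstract's approximation aims to avoid.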

Authors:Zirui Song, Qian Jiang, Mingxuan Cui, Mingzhe Li, Lang Gao, Zeyu Zhang, Zixiang Xu, Yanbo Wang, Chenxi Wang, Guangxian Ouyang, Zhenhao Chen, Xiuying Chen
Title: Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
Abstract:
The rise of Large Audio Language Models (LAMs) brings both potential and risks, as their audio outputs may contain harmful or unethical content. However, current research lacks a systematic, quantitative evaluation of LAM safety especially against jailbreak attacks, which are challenging due to the temporal and semantic nature of speech. To bridge this gap, we introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs. We begin by constructing AJailBench-Base, a dataset of 1,495 adversarial audio prompts spanning 10 policy-violating categories, converted from textual jailbreak attacks using realistic text to speech synthesis. Using this dataset, we evaluate several state-of-the-art LAMs and reveal that none exhibit consistent robustness across attacks. To further strengthen jailbreak testing and simulate more realistic attack conditions, we propose a method to generate dynamic adversarial variants. Our Audio Perturbation Toolkit (APT) applies targeted distortions across time, frequency, and amplitude domains. To preserve the original jailbreak intent, we enforce a semantic consistency constraint and employ Bayesian optimization to efficiently search for perturbations that are both subtle and highly effective. This results in AJailBench-APT, an extended dataset of optimized adversarial audio samples. Our findings demonstrate that even small, semantically preserved perturbations can significantly reduce the safety performance of leading LAMs, underscoring the need for more robust and semantically aware defense mechanisms.
中文: 大型音频语言模型在对抗越狱攻击时存在严重安全漏洞,AJailBench基准测试表明,即使是细微且语义保持不变的音频扰动,也会显著削弱领先模型的防护性能。
English: Large Audio Language Models (LAMs) face significant safety vulnerabilities to jailbreak attacks, as demonstrated by the AJailBench benchmark, which reveals that even subtle, semantically consistent audio perturbations can drastically compromise their defenses.

Authors:Zhexin Zhang, Xian Qi Loye, Victor Shea-Jay Huang, Junxiao Yang, Qi Zhu, Shiyao Cui, Fei Mi, Lifeng Shang, Yingkang Wang, Hongning Wang, Minlie Huang
Title: How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
Abstract:
Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance, and in some cases may even degrade it. This raises an important research question: how can we enhance the safety of LRMs? In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). Our investigation begins with an unexpected observation: directly distilling safe responses from DeepSeek-R1 fails to significantly enhance safety. We analyze this phenomenon and identify three key failure patterns that contribute to it. We then demonstrate that explicitly addressing these issues during the data distillation process can lead to substantial safety improvements. Next, we explore whether a long and complex reasoning process is necessary for achieving safety. Interestingly, we find that simply using short or template-based reasoning processes can attain comparable safety performance, and these are significantly easier for models to learn than more intricate reasoning chains. These findings prompt a deeper reflection on the role of reasoning in ensuring safety. Finally, we find that mixing math reasoning data during safety fine-tuning is helpful to balance safety and over-refusal. Overall, we hope our empirical study could provide a more holistic picture on enhancing the safety of LRMs. The code and data used in our experiments are released at https://github.com/thu-coai/LRM-Safety-Study.
中文: 本研究通过监督微调发现,解决关键失效模式并使用简化推理过程可显著提升大型推理模型的安全性,同时混合数学推理数据有助于在安全性和过度拒绝之间取得平衡。
English: This study shows that supervised fine-tuning can significantly improve the safety of Large Reasoning Models when key failure patterns in distilled data are addressed, that short or template-based reasoning achieves comparable safety, and that mixing in math reasoning data helps balance safety against over-refusal.

Authors:DongGeon Lee, Joonwon Jang, Jihae Jeong, Hwanjo Yu
Title: Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
Abstract:
Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms. MemeSafetyBench is publicly available at https://github.com/oneonlee/Meme-Safety-Bench.
Chinese: 研究表明,视觉语言模型在面对真实表情包图像的有害指令时比面对人工图像时更易受影响,凸显了进行生态效度安全评估和加强防护机制的迫切需求。
English: This study reveals that vision-language models are significantly more vulnerable to harmful prompts from real meme images than artificial ones, highlighting the urgent need for ecologically valid safety evaluations and enhanced protective measures.

Authors:Raphael Sulzer, Liuyun Duan, Nicolas Girard, Florent Lafarge
Title: The P$^3$ dataset: Pixels, Points and Polygons for Multimodal Building Vectorization
Abstract:
We present the P$^3$ dataset, a large-scale multimodal benchmark for building vectorization, constructed from aerial LiDAR point clouds, high-resolution aerial imagery, and vectorized 2D building outlines, collected across three continents. The dataset contains over 10 billion LiDAR points with decimeter-level accuracy and RGB images at a ground sampling distance of 25 centimeters. While many existing datasets primarily focus on the image modality, P$^3$ offers a complementary perspective by also incorporating dense 3D information. We demonstrate that LiDAR point clouds serve as a robust modality for predicting building polygons, both in hybrid and end-to-end learning frameworks. Moreover, fusing aerial LiDAR and imagery further improves accuracy and geometric quality of predicted polygons. The P$^3$ dataset is publicly available, along with code and pretrained weights of three state-of-the-art models for building polygon prediction at https://github.com/raphaelsulzer/PixelsPointsPolygons.
中文摘要:P$^3$数据集是一个融合航空激光雷达、高分辨率影像和矢量化建筑轮廓的大规模多模态基准数据集,通过多模态融合显著提升了建筑多边形预测的精度与几何质量。
English Summary: The P$^3$ dataset is a large-scale multimodal benchmark combining aerial LiDAR, high-resolution imagery, and vectorized building outlines across three continents, demonstrating LiDAR's effectiveness in predicting building polygons and achieving improved accuracy through multimodal fusion.

Authors:Lu Li, Cunhang Fan, Hongyu Zhang, Jingjing Zhang, Xiaoke Yang, Jian Zhou, Zhao Lv
Title: MHANet: Multi-scale Hybrid Attention Network for Auditory Attention Detection
Abstract:
Auditory attention detection (AAD), which aims to detect the target speaker in a multi-talker environment from brain signals such as electroencephalography (EEG), has made great progress. However, most AAD methods solely utilize attention mechanisms sequentially and overlook valuable multi-scale contextual information within EEG signals, limiting their ability to capture long-short range spatiotemporal dependencies simultaneously. To address these issues, this paper proposes a multi-scale hybrid attention network (MHANet) for AAD, which consists of the multi-scale hybrid attention (MHA) module and the spatiotemporal convolution (STC) module. Specifically, MHA combines channel attention and multi-scale temporal and global attention mechanisms. This effectively extracts multi-scale temporal patterns within EEG signals and captures long-short range spatiotemporal dependencies simultaneously. To further improve the performance of AAD, STC utilizes temporal and spatial convolutions to aggregate expressive spatiotemporal representations. Experimental results show that the proposed MHANet achieves state-of-the-art performance across three datasets with a trainable parameter count 3 times lower than that of the most advanced model. Code is available at: https://github.com/fchest/MHANet.
中文摘要:本文提出MHANet多尺度混合注意力网络,通过结合通道注意力和多尺度时空注意机制,有效提取脑电信号中的多尺度时间模式并同时捕获长短程时空依赖关系,以更少的可训练参数在三个数据集上实现了最优性能。
English Summary: This paper introduces MHANet, a multi-scale hybrid attention network that enhances auditory attention detection by effectively capturing long-short range spatiotemporal dependencies in EEG signals through combined attention mechanisms and spatiotemporal convolutions, achieving state-of-the-art performance with significantly fewer parameters.

Authors:Jacob E. Kooi, Zhao Yang, Vincent François-Lavet
Title: Hadamax Encoding: Elevating Performance in Model-Free Atari
Abstract:
Neural network architectures have a large impact in machine learning. In reinforcement learning, network architectures have remained notably simple, as changes often lead to small gains in performance. This work introduces a novel encoder architecture for pixel-based model-free reinforcement learning. The Hadamax (Hadamard max-pooling) encoder achieves state-of-the-art performance by max-pooling Hadamard products between GELU-activated parallel hidden layers. Based on the recent PQN algorithm, the Hadamax encoder achieves state-of-the-art model-free performance in the Atari-57 benchmark. Specifically, without applying any algorithmic hyperparameter modifications, Hadamax-PQN achieves an 80% performance gain over vanilla PQN and significantly surpasses Rainbow-DQN. For reproducibility, the full code is available on GitHub at https://github.com/Jacobkooi/Hadamax.
中文: 本研究提出Hadamax编码器,这是一种用于像素强化学习的新型神经网络架构,通过对并行隐藏层间的Hadamard乘积进行最大池化,在Atari-57基准测试中实现了最先进的性能。
English: This work introduces the Hadamax encoder, a novel neural network architecture for pixel-based reinforcement learning that achieves state-of-the-art performance on the Atari-57 benchmark by max-pooling Hadamard products between parallel hidden layers.
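The operation named in the abstract (max-pooling the Hadamard product of two GELU-activated parallel layers) can be illustrated with a small sketch. This is a toy under stated assumptions, not the authors' encoder: it uses dense layers on a 1D vector rather than a pixel-based network, and the names `hadamax_block`, `w_a`, `w_b`, and `pool` are hypothetical.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def hadamax_block(x, w_a, b_a, w_b, b_b, pool=2):
    """Two parallel GELU layers -> elementwise product -> 1D max-pooling.

    Assumption: dense layers stand in for whatever hidden layers the
    real encoder uses; only the combination pattern is illustrated.
    """
    branch_a = [gelu(v) for v in linear(w_a, b_a, x)]
    branch_b = [gelu(v) for v in linear(w_b, b_b, x)]
    product = [a * b for a, b in zip(branch_a, branch_b)]  # Hadamard product
    # Non-overlapping max-pooling; a trailing partial window is dropped.
    return [max(product[i:i + pool]) for i in range(0, len(product) - pool + 1, pool)]
```

With identical branches, each pooled value is simply the largest squared GELU activation in its window, which makes the sketch easy to check by hand.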

Authors:Yuxuan Shu, Vasileios Lampos
Title: Sonnet: Spectral Operator Neural Network for Multivariable Time Series Forecasting
Abstract:
Multivariable time series forecasting methods can integrate information from exogenous variables, leading to significant prediction accuracy gains. The transformer architecture has been widely applied in various time series forecasting models due to its ability to capture long-range sequential dependencies. However, a naïve application of transformers often struggles to effectively model complex relationships among variables over time. To mitigate this, we propose a novel architecture, namely the Spectral Operator Neural Network (Sonnet). Sonnet applies learnable wavelet transformations to the input and incorporates spectral analysis using the Koopman operator. Its predictive skill relies on the Multivariable Coherence Attention (MVCA), an operation that leverages spectral coherence to model variable dependencies. Our empirical analysis shows that Sonnet yields the best performance on 34 out of 47 forecasting tasks with an average mean absolute error (MAE) reduction of 1.1% against the most competitive baseline (different per task). We further show that MVCA, when put in place of the naïve attention used in various deep learning models, can remedy its deficiencies, reducing MAE by 10.7% on average in the most challenging forecasting tasks.
Chinese: 提出的谱算子神经网络(Sonnet)通过引入可学习小波变换和谱相干注意力机制,有效提升了多变量时间序列预测的准确性,在多项任务中显著降低了平均绝对误差。
English: The proposed Spectral Operator Neural Network (Sonnet) enhances multivariable time series forecasting by integrating learnable wavelet transformations and spectral coherence attention, achieving superior accuracy with reduced mean absolute error compared to existing methods.

Authors:Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada
Title: Towards Pre-training an Effective Respiratory Audio Foundation Model
Abstract:
Recent advancements in foundation models have sparked interest in respiratory audio foundation models. However, the effectiveness of applying conventional pre-training schemes to datasets that are small-sized and lack diversity has not been sufficiently verified. This study aims to explore better pre-training practices for respiratory sounds by comparing numerous pre-trained audio models. Our investigation reveals that models pre-trained on AudioSet, a general audio dataset, are more effective than the models specifically pre-trained on respiratory sounds. Moreover, combining AudioSet and respiratory sound datasets for further pre-training enhances performance, and preserving the frequency-wise information when aggregating features is vital. Along with more insights found in the experiments, we establish a new state-of-the-art for the OPERA benchmark, contributing to advancing respiratory audio foundation models. Our code is available online at https://github.com/nttcslab/eval-audio-repr/tree/main/plugin/OPERA.
中文摘要:本研究表明,在通用音频数据集AudioSet上的预训练优于专门的呼吸音模型,且将两者结合训练并保留频率特征时,能在OPERA基准测试中达到最优性能。
English Summary: This study demonstrates that pre-training on the general AudioSet dataset outperforms specialized respiratory sound models, and combining both datasets with frequency-preserving feature aggregation achieves state-of-the-art results on the OPERA benchmark.

Authors:Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang
Title: AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
Abstract:
Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: (i) Structured Data Generation, which establishes an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate that AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models. Code is available at https://github.com/curryqka/AgentThink.
中文:AgentThink是一个创新框架,通过将思维链推理与智能体式工具调用相结合,显著提升了自动驾驶模型的性能,使推理得分提高53.91%、答案准确率提升33.54%,并展现出强大的泛化能力。
English: AgentThink is a novel framework that integrates Chain-of-Thought reasoning with agent-style tool invocation to enhance autonomous driving models, significantly improving reasoning scores by 53.91% and answer accuracy by 33.54% while demonstrating robust generalization capabilities.

Authors:Yanzhi Tian, Zeming Liu, Zhengyang Liu, Yuhang Guo
Title: Exploring In-Image Machine Translation with Real-World Background
Abstract:
In-Image Machine Translation (IIMT) aims to translate texts within images from one language to another. Previous research on IIMT was primarily conducted on simplified scenarios such as images of one-line text with black font on white backgrounds, which is far from reality and impractical for applications in the real world. To make IIMT research practically valuable, it is essential to consider a complex scenario where the text backgrounds are derived from real-world images. To facilitate research on complex-scenario IIMT, we design an IIMT dataset that includes subtitle text with real-world backgrounds. However, previous IIMT models perform inadequately in complex scenarios. To address the issue, we propose the DebackX model, which separates the background and text-image from the source image, performs translation on the text-image directly, and fuses the translated text-image with the background to generate the target image. Experimental results show that our model achieves improvements in both translation quality and visual effect.
Chinese: DebackX模型通过从复杂现实背景中分离并直接翻译文本再融合,提升了图像内机器翻译的翻译质量和视觉效果。
English: The DebackX model enhances In-Image Machine Translation by separating and translating text from complex real-world backgrounds before fusing it back, achieving superior translation quality and visual results.

Authors:Zihao Jiang, Ben Liu, Miao Peng, Wenjie Xu, Yao Xiao, Zhenyan Shan, Min Peng
Title: Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework
Abstract:
While large language models (LLMs) show great potential in temporal reasoning, most existing work focuses heavily on enhancing performance, often neglecting the explainable reasoning processes underlying the results. To address this gap, we introduce a comprehensive benchmark covering a wide range of temporal granularities, designed to systematically evaluate LLMs' capabilities in explainable temporal reasoning. Furthermore, our findings reveal that LLMs struggle to deliver convincing explanations when relying solely on textual information. To address this challenge, we propose GETER, a novel structure-aware generative framework that integrates Graph structures with text for Explainable TEmporal Reasoning. Specifically, we first leverage temporal knowledge graphs to develop a temporal encoder that captures structural information for the query. Subsequently, we introduce a structure-text prefix adapter to map graph structure features into the text embedding space. Finally, LLMs generate explanation text by seamlessly integrating the soft graph token with instruction-tuning prompt tokens. Experimental results indicate that GETER achieves state-of-the-art performance while also demonstrating strong generalization capabilities. Our dataset and code are available at https://github.com/carryTatum/GETER.
中文摘要:本文提出了GETER框架,通过将图结构与文本相结合来增强大语言模型的可解释时序推理能力,解决了其仅依赖文本时解释力不足的问题,并实现了最先进的性能表现。
English Summary: This paper introduces GETER, a novel framework that integrates graph structures with text to enhance explainable temporal reasoning in large language models, addressing their limitations in generating convincing explanations and achieving state-of-the-art performance.

Authors:Yifan Liu, Wuyang Li, Weihao Yu, Chenxin Li, Alexandre Alahi, Max Meng, Yixuan Yuan
Title: X-GRM: Large Gaussian Reconstruction Model for Sparse-view X-rays to Computed Tomography
Abstract:
Computed Tomography serves as an indispensable tool in clinical workflows, providing non-invasive visualization of internal anatomical structures. Existing CT reconstruction works are limited by small-capacity model architectures and inflexible volume representations. In this work, we present X-GRM (X-ray Gaussian Reconstruction Model), a large feedforward model for reconstructing 3D CT volumes from sparse-view 2D X-ray projections. X-GRM employs a scalable transformer-based architecture to encode sparse-view X-ray inputs, where tokens from different views are integrated efficiently. Then, these tokens are decoded into a novel volume representation, named Voxel-based Gaussian Splatting (VoxGS), which enables efficient CT volume extraction and differentiable X-ray rendering. This combination of a high-capacity model and a flexible volume representation empowers our model to produce high-quality reconstructions from various testing inputs, including in-domain and out-of-domain X-ray projections. Our code is available at: https://github.com/CUHK-AIM-Group/X-GRM.
中文:X-GRM提出了一种基于Transformer的大规模模型,通过创新的体素高斯溅射表示法,能够从稀疏X射线投影中重建高质量三维CT图像,实现了灵活高效的临床成像应用。
English: X-GRM introduces a large transformer-based model that reconstructs high-quality 3D CT volumes from sparse X-ray projections using a novel Voxel-based Gaussian Splatting representation, enabling flexible and efficient clinical imaging.

Authors:Ting Huang, Zeyu Zhang, Ruicheng Zhang, Yang Zhao
Title: DC-Scene: Data-Centric Learning for 3D Scene Understanding
Abstract:
3D scene understanding plays a fundamental role in vision applications such as robotics, autonomous driving, and augmented reality. However, advancing learning-based 3D scene understanding remains challenging due to two key limitations: (1) the large scale and complexity of 3D scenes lead to higher computational costs and slower training compared to 2D counterparts; and (2) high-quality annotated 3D datasets are significantly scarcer than those available for 2D vision. These challenges underscore the need for more efficient learning paradigms. In this work, we propose DC-Scene, a data-centric framework tailored for 3D scene understanding, which emphasizes enhancing data quality and training efficiency. Specifically, we introduce a CLIP-driven dual-indicator quality (DIQ) filter, combining vision-language alignment scores with caption-loss perplexity, along with a curriculum scheduler that progressively expands the training pool from the top 25% to 75% of scene-caption pairs. This strategy filters out noisy samples and significantly reduces dependence on large-scale labeled 3D data. Extensive experiments on ScanRefer and Nr3D demonstrate that DC-Scene achieves state-of-the-art performance (86.1 CIDEr with the top-75% subset vs. 85.4 with the full dataset) while reducing training cost by approximately two-thirds, confirming that a compact set of high-quality samples can outperform exhaustive training. Code will be available at https://github.com/AIGeeksGroup/DC-Scene.
中文摘要:提出的DC-Scene框架通过数据质量筛选和渐进式课程学习策略,有效解决了3D场景理解中的计算成本高和数据稀缺问题,在降低训练成本的同时实现了更优性能。
English Summary: The proposed DC-Scene framework addresses 3D scene understanding challenges by implementing a data-centric approach with quality filtering and curriculum learning, achieving superior performance with reduced training costs.
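The curriculum described in the abstract (a two-indicator quality ranking with a training pool that grows from the top 25% to the top 75% of scene-caption pairs) can be sketched as follows. The abstract does not say how the two indicators are combined or how the pool grows between the endpoints, so the rank-sum combination, the linear schedule, the assumption that lower perplexity marks a cleaner sample, and all function names here are hypothetical.

```python
def rank_positions(values, higher_is_better):
    """Map each sample index to its rank (0 = best) under the given ordering."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks


def quality_order(alignment, perplexity):
    """Sort sample indices best-first by the sum of both ranks (assumed combination)."""
    align_rank = rank_positions(alignment, higher_is_better=True)
    ppl_rank = rank_positions(perplexity, higher_is_better=False)  # assumption: lower is cleaner
    combined = [a + p for a, p in zip(align_rank, ppl_rank)]
    return sorted(range(len(combined)), key=lambda i: (combined[i], i))


def pool_fraction(epoch, total_epochs, start=0.25, end=0.75):
    """Grow the kept fraction linearly from `start` to `end` (assumed schedule)."""
    if total_epochs <= 1:
        return end
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + (end - start) * progress


def training_pool(alignment, perplexity, epoch, total_epochs):
    """Return the indices of the samples used at the given epoch."""
    order = quality_order(alignment, perplexity)
    keep = max(1, round(pool_fraction(epoch, total_epochs) * len(order)))
    return order[:keep]
```

With eight samples and three epochs, this schedule trains on the best two pairs first, then four, then six, so the lowest-ranked quarter is never used.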

Authors:Yisi Luo, Xile Zhao, Deyu Meng
Title: Continuous Representation Methods, Theories, and Applications: An Overview and Perspectives
Abstract:
Recently, continuous representation methods have emerged as novel paradigms that characterize the intrinsic structures of real-world data through function representations that map positional coordinates to their corresponding values in the continuous space. Compared with the traditional discrete framework, the continuous framework demonstrates inherent superiority for data representation and reconstruction (e.g., image restoration, novel view synthesis, and waveform inversion), offering advantages including resolution flexibility, cross-modal adaptability, inherent smoothness, and parameter efficiency. In this review, we systematically examine recent advancements in continuous representation frameworks, focusing on three aspects: (i) Continuous representation method designs such as basis function representation, statistical modeling, tensor function decomposition, and implicit neural representation; (ii) Theoretical foundations of continuous representations such as approximation error analysis, convergence property, and implicit regularization; (iii) Real-world applications of continuous representations derived from computer vision, graphics, bioinformatics, and remote sensing. Furthermore, we outline future directions and perspectives to inspire exploration and deepen insights to facilitate continuous representation methods, theories, and applications. All referenced works are summarized in our open-source repository: https://github.com/YisiLuo/Continuous-Representation-Zoo.
中文摘要:连续表示方法通过灵活高效的功能映射提供卓越的数据处理能力,推动了计算机视觉和生物信息学等领域的发展,并持续进行研究和开源资源分享。
English Summary: This review surveys continuous representation methods, which model data as functions mapping coordinates to values, covering method designs, theoretical foundations, and applications in computer vision, graphics, bioinformatics, and remote sensing, and collects the referenced works in an open-source repository.

Authors:Haotian Qin, Dongliang Chang, Yueying Gao, Bingyao Yu, Lei Chen, Zhanyu Ma
Title: Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection
Abstract:
Although existing CLIP-based methods for detecting AI-generated images have achieved promising results, they are still limited by severe feature redundancy, which hinders their generalization ability. To address this issue, incorporating an information bottleneck network into the task presents a straightforward solution. However, relying solely on image-corresponding prompts results in suboptimal performance due to the inherent diversity of prompts. In this paper, we propose a multimodal conditional bottleneck network to reduce feature redundancy while enhancing the discriminative power of features extracted by CLIP, thereby improving the model's generalization ability. We begin with a semantic analysis experiment, where we observe that arbitrary text features exhibit lower cosine similarity with real image features than with fake image features in the CLIP feature space, a phenomenon we refer to as "bias". Therefore, we introduce InfoFD, a text-guided AI-generated image detection framework. InfoFD consists of two key components: the Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO). TGCIB improves the generalizability of learned representations by conditioning on both text and class modalities. DTO dynamically updates weighted text features, preserving semantic information while leveraging the global "bias". Our model achieves exceptional generalization performance on the GenImage dataset and latest generative models. Our code is available at https://github.com/Ant0ny44/InfoFD.
中文摘要:本文提出InfoFD框架,通过多模态条件瓶颈网络和动态文本正交化技术,在CLIP特征空间中减少特征冗余并利用文本偏差,显著提升了AI生成图像检测的泛化能力。
English Summary: The paper introduces InfoFD, a text-guided framework that enhances AI-generated image detection by reducing feature redundancy in CLIP through a multimodal conditional bottleneck network and dynamic text orthogonalization, significantly improving generalization performance.
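The "bias" observation in the abstract (arbitrary text features showing lower cosine similarity with real image features than with fake ones) amounts to comparing two mean similarities. The sketch below illustrates only that comparison on plain Python lists; it does not call CLIP, and the function names and toy feature vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_similarity(text_features, image_features):
    """Average cosine similarity over every text-image pair."""
    sims = [cosine(t, img) for t in text_features for img in image_features]
    return sum(sims) / len(sims)


def similarity_gap(text_features, real_features, fake_features):
    """Positive gap: text features sit closer to fake images than to real ones."""
    return (mean_similarity(text_features, fake_features)
            - mean_similarity(text_features, real_features))
```

A positive `similarity_gap` on actual features would match the direction of the bias reported in the abstract; the toy inputs in any quick check say nothing about real CLIP embeddings.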

Authors:Sampanna Yashwant Kahu
Title: KernelOracle: Predicting the Linux Scheduler's Next Move with Deep Learning
Abstract:
Efficient task scheduling is paramount in the Linux kernel, where the Completely Fair Scheduler (CFS) meticulously manages CPU resources to balance high utilization with interactive responsiveness. This research pioneers the use of deep learning techniques to predict the sequence of tasks selected by CFS, aiming to evaluate the feasibility of a more generalized and potentially more adaptive task scheduler for diverse workloads. Our core contributions are twofold: first, the systematic generation and curation of a novel scheduling dataset from a running Linux kernel, capturing real-world CFS behavior; and second, the development, training, and evaluation of a Long Short-Term Memory (LSTM) network designed to accurately forecast the next task to be scheduled. This paper further discusses the practical pathways and implications of integrating such a predictive model into the kernel's scheduling framework. The findings and methodologies presented herein open avenues for data-driven advancements in kernel scheduling, with the full source code provided for reproducibility and further exploration.
中文: 本研究开创性地利用深度学习技术,通过LSTM网络预测Linux内核中完全公平调度器的任务选择序列,构建了新型调度数据集并探讨了实现自适应任务调度的可行路径。
English: This research introduces a deep learning approach using an LSTM network to predict the Completely Fair Scheduler's task selection in the Linux kernel, creating a novel dataset and exploring integration possibilities for adaptive scheduling.

Authors:Jie Ma, Ning Qu, Zhitao Gao, Rui Xing, Jun Liu, Hongbin Pei, Jiang Xie, Linyun Song, Pinghui Wang, Jing Tao, Zhou Su
Title: Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs
Abstract:
Knowledge graph-based retrieval-augmented generation seeks to mitigate hallucinations in Large Language Models (LLMs) caused by insufficient or outdated knowledge. However, existing methods often fail to fully exploit the prior knowledge embedded in knowledge graphs (KGs), particularly their structural information and explicit or implicit constraints. The former can enhance the faithfulness of LLMs' reasoning, while the latter can improve the reliability of response generation. Motivated by these, we propose a trustworthy reasoning framework, termed Deliberation over Priors (DP), which sufficiently utilizes the priors contained in KGs. Specifically, DP adopts a progressive knowledge distillation strategy that integrates structural priors into LLMs through a combination of supervised fine-tuning and Kahneman-Tversky optimization, thereby improving the faithfulness of relation path generation. Furthermore, our framework employs a reasoning-introspection strategy, which guides LLMs to perform refined reasoning verification based on extracted constraint priors, ensuring the reliability of response generation. Extensive experiments on three benchmark datasets demonstrate that DP achieves new state-of-the-art performance, especially a Hit@1 improvement of 13% on the ComplexWebQuestions dataset, and generates highly trustworthy responses. We also conduct various analyses to verify its flexibility and practicality. The code is available at https://github.com/reml-group/Deliberation-on-Priors.
Chinese: 提出的"先验深思"框架通过渐进式知识蒸馏和推理自省策略,充分整合知识图谱中的结构先验与约束先验,在ComplexWebQuestions数据集上实现13%的Hit@1提升,显著增强了大型语言模型生成结果的可信度。
English: The proposed Deliberation over Priors framework enhances LLM trustworthiness by integrating structural and constraint knowledge from knowledge graphs through progressive distillation and reasoning-introspection, achieving state-of-the-art performance with a 13% Hit@1 improvement on ComplexWebQuestions.

Authors:Yifan Liu, Keyu Fan, Weihao Yu, Chenxin Li, Hao Lu, Yixuan Yuan
Title: MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models
Abstract:
Recent advances in generalizable 3D Gaussian Splatting have demonstrated promising results in real-time high-fidelity rendering without per-scene optimization, yet existing approaches still struggle to handle unfamiliar visual content during inference on novel scenes due to limited generalizability. To address this challenge, we introduce MonoSplat, a novel framework that leverages rich visual priors from pre-trained monocular depth foundation models for robust Gaussian reconstruction. Our approach consists of two key components: a Mono-Multi Feature Adapter that transforms monocular features into multi-view representations, coupled with an Integrated Gaussian Prediction module that effectively fuses both feature types for precise Gaussian generation. Through the Adapter's lightweight attention mechanism, features are seamlessly aligned and aggregated across views while preserving valuable monocular priors, enabling the Prediction module to generate Gaussian primitives with accurate geometry and appearance. Through extensive experiments on diverse real-world datasets, we demonstrate that MonoSplat achieves superior reconstruction quality and generalization capability compared to existing methods while maintaining computational efficiency with minimal trainable parameters. Code is available at https://github.com/CUHK-AIM-Group/MonoSplat.
中文:MonoSplat是一种创新框架,利用预训练的单目深度模型增强3D高斯重建,在保持计算效率的同时,实现了跨场景的卓越泛化能力和渲染质量。
English: MonoSplat is a novel framework that leverages pre-trained monocular depth models to enhance 3D Gaussian reconstruction, achieving superior generalization and rendering quality across diverse scenes while maintaining computational efficiency.

Authors:Yangting Shi, Renjie He, Le Hui, Xiang Li, Jian Yang, Ming-Ming Cheng, Yimian Dai
Title: AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection
Abstract:
Omni-domain infrared small target detection (IRSTD) poses formidable challenges, as a single model must seamlessly adapt to diverse imaging systems, varying resolutions, and multiple spectral bands simultaneously. Current approaches predominantly rely on visual-only modeling paradigms that not only struggle with complex background interference and inherently scarce target features, but also exhibit limited generalization capabilities across complex omni-scene environments where significant domain shifts and appearance variations occur. In this work, we reveal a critical oversight in existing paradigms: the neglect of readily available auxiliary metadata describing imaging parameters and acquisition conditions, such as spectral bands, sensor platforms, resolution, and observation perspectives. To address this limitation, we propose the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet), a novel multi-modal framework that fundamentally reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization. Through a high-dimensional fusion module based on multi-layer perceptrons (MLPs), AuxDet dynamically integrates metadata semantics with visual features, guiding adaptive representation learning for each individual sample. Additionally, we design a lightweight prior-initialized enhancement module using 1D convolutional blocks to further refine fused features and recover fine-grained target cues. Extensive experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy in omni-domain IRSTD tasks. Code is available at https://github.com/GrokCV/AuxDet.
中文摘要:本文提出的AuxDet框架通过引入辅助元数据构建多模态检测范式,有效解决了全域红外小目标检测中视觉模型的泛化局限,在复杂场景基准测试中展现出最优性能。
English Summary: The proposed AuxDet framework addresses the limitations of visual-only models in omni-domain infrared small target detection by incorporating auxiliary metadata through a multimodal approach, achieving superior performance on complex benchmarks.

Authors:Yangting Shi, Yinfei Zhu, Renjie He, Le Hui, Meng Cai, Ming-Ming Cheng, Yimian Dai
Title: AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection
Abstract:
Omni-domain infrared small target detection (Omni-IRSTD) poses formidable challenges, as a single model must seamlessly adapt to diverse imaging systems, varying resolutions, and multiple spectral bands simultaneously. Current approaches predominantly rely on visual-only modeling paradigms that not only struggle with complex background interference and inherently scarce target features, but also exhibit limited generalization capabilities across complex omni-scene environments where significant domain shifts and appearance variations occur. In this work, we reveal a critical oversight in existing paradigms: the neglect of readily available auxiliary metadata describing imaging parameters and acquisition conditions, such as spectral bands, sensor platforms, resolution, and observation perspectives. To address this limitation, we propose the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet), a novel multimodal framework that is the first to incorporate metadata into the IRSTD paradigm for scene-aware optimization. Through a high-dimensional fusion module based on multi-layer perceptrons (MLPs), AuxDet dynamically integrates metadata semantics with visual features, guiding adaptive representation learning for each individual sample. Additionally, we design a lightweight prior-initialized enhancement module using 1D convolutional blocks to further refine fused features and recover fine-grained target cues. Extensive experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy in omni-domain IRSTD tasks. Code is available at https://github.com/GrokCV/AuxDet.
中文摘要:本文提出的AuxDet框架通过引入辅助元数据构建多模态检测范式,有效解决了全域红外小目标检测中视觉模型的泛化局限,在复杂场景基准测试中展现出最优性能。
English Summary: The proposed AuxDet framework addresses the limitations of visual-only models in omni-domain infrared small target detection by incorporating auxiliary metadata through a multimodal approach, achieving superior performance on complex benchmarks.

Authors:Qian Zhou, Xianda Guo, Jilong Wang, Chuanfu Shen, Zhongyuan Wang, Hua Zou, Qin Zou, Chao Liang, Long Chen, Gang Wu
Title: Exploring Generalized Gait Recognition: Reducing Redundancy and Noise within Indoor and Outdoor Datasets
Abstract:
Generalized gait recognition, which aims to achieve robust performance across diverse domains, remains a challenging problem due to severe domain shifts in viewpoints, appearances, and environments. While mixed-dataset training is widely used to enhance generalization, it introduces new obstacles including inter-dataset optimization conflicts and redundant or noisy samples, both of which hinder effective representation learning. To address these challenges, we propose a unified framework that systematically improves cross-domain gait recognition. First, we design a disentangled triplet loss that isolates supervision signals across datasets, mitigating gradient conflicts during optimization. Second, we introduce a targeted dataset distillation strategy that filters out the least informative 20% of training samples based on feature redundancy and prediction uncertainty, enhancing data efficiency. Extensive experiments on CASIA-B, OU-MVLP, Gait3D, and GREW demonstrate that our method significantly improves cross-dataset recognition for both GaitBase and DeepGaitV2 backbones, without sacrificing source-domain accuracy. Code will be released at https://github.com/li1er3/Generalized_Gait.
中文: 该统一框架通过解耦三元组损失解决跨数据集优化冲突,并采用针对性数据集蒸馏策略过滤冗余样本,显著提升了跨域步态识别性能,在多个基准测试中表现优异且不影响源域精度。
English: The proposed unified framework enhances cross-domain gait recognition by employing a disentangled triplet loss to resolve inter-dataset optimization conflicts and a targeted dataset distillation strategy to eliminate redundant samples, achieving superior performance across multiple benchmarks without compromising source-domain accuracy.
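
The idea of isolating supervision signals per dataset can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the batch-all triplet mining, the margin of 0.2, and the simple averaging of per-dataset losses are all assumptions. The only property taken from the abstract is that triplets are never formed across datasets.

```python
import math
from collections import defaultdict


def triplet_loss(samples, margin=0.2):
    """Batch-all triplet loss over (embedding, identity) pairs."""
    losses = []
    for i, (anchor, anchor_id) in enumerate(samples):
        for j, (positive, positive_id) in enumerate(samples):
            if i == j or positive_id != anchor_id:
                continue  # a positive must be a different sample of the same identity
            for negative, negative_id in samples:
                if negative_id == anchor_id:
                    continue  # a negative must have a different identity
                gap = math.dist(anchor, positive) - math.dist(anchor, negative)
                losses.append(max(0.0, gap + margin))
    return sum(losses) / len(losses) if losses else 0.0


def disentangled_triplet_loss(batch, margin=0.2):
    """Average of per-dataset triplet losses (hypothetical aggregation rule).

    `batch` holds (embedding, identity, dataset) tuples; samples are grouped
    by dataset first, so no triplet mixes samples from two datasets.
    """
    by_dataset = defaultdict(list)
    for embedding, identity, dataset in batch:
        by_dataset[dataset].append((embedding, identity))
    per_dataset = [triplet_loss(s, margin) for s in by_dataset.values()]
    return sum(per_dataset) / len(per_dataset) if per_dataset else 0.0
```

In a mixed batch, a sample from another dataset can sit close to an anchor and produce a large penalty even though each dataset is well separated on its own. Grouping by dataset removes those cross-dataset triplets from the loss.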

Authors:Bo-Han Lai, Pin-Han Huang, Bo-Han Kung, Shang-Tse Chen
Title: Enhancing Certified Robustness via Block Reflector Orthogonal Layers and Logit Annealing Loss
Abstract:
Lipschitz neural networks are well-known for providing certified robustness in deep learning. In this paper, we present a novel, efficient Block Reflector Orthogonal (BRO) layer that enhances the capability of orthogonal layers for constructing more expressive Lipschitz neural architectures. In addition, by theoretically analyzing the nature of Lipschitz neural networks, we introduce a new loss function that employs an annealing mechanism to increase the margin for most data points. This enables Lipschitz models to provide better certified robustness. By employing our BRO layer and loss function, we design BRONet, a simple yet effective Lipschitz neural network that achieves state-of-the-art certified robustness. Extensive experiments and empirical analysis on CIFAR-10/100, Tiny-ImageNet, and ImageNet validate that our method outperforms existing baselines. The implementation is available at https://github.com/ntuaislab/BRONet.
Chinese: 本文提出了一种新颖的块反射正交(BRO)层和基于退火机制的新损失函数,增强了Lipschitz神经网络的表达能力,由此设计的BRONet在多个数据集上实现了最先进的认证鲁棒性。
English: This paper introduces a novel Block Reflector Orthogonal (BRO) layer and a new annealing-based loss function to enhance Lipschitz neural networks, resulting in BRONet, which achieves state-of-the-art certified robustness across multiple datasets.
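
For readers unfamiliar with block reflectors, the sketch below builds one from a tall matrix V using the classical form W = I - 2 V (V^T V)^{-1} V^T and can be used to check that the result is orthogonal. Whether the BRO layer uses exactly this parameterization is an assumption here, since the abstract does not give the formula. The helper names are hypothetical, and V is restricted to two columns so that a closed-form 2x2 inverse suffices.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("V must have linearly independent columns")
    return [[d / det, -b / det], [-c / det, a / det]]


def block_reflector(v):
    """Return W = I - 2 V (V^T V)^{-1} V^T for an n x 2 matrix V.

    P = V (V^T V)^{-1} V^T is an orthogonal projector (P = P^T, P^2 = P),
    so W^T W = I - 4P + 4P^2 = I, i.e. W is orthogonal.
    """
    n = len(v)
    vt = transpose(v)
    p = matmul(matmul(v, inverse_2x2(matmul(vt, v))), vt)
    return [[(1.0 if i == j else 0.0) - 2.0 * p[i][j] for j in range(n)]
            for i in range(n)]
```

Because an orthogonal matrix preserves Euclidean norms, a linear layer with such a weight matrix is 1-Lipschitz, which is the property Lipschitz networks rely on for certification.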

Authors:Yuante Li, Xu Yang, Xiao Yang, Minrui Xu, Xisen Wang, Weiqing Liu, Jiang Bian
Title: R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization
Abstract:
Financial markets pose fundamental challenges for asset return prediction due to their high dimensionality, non-stationarity, and persistent volatility. Despite advances in large language models and multi-agent systems, current quantitative research pipelines suffer from limited automation, weak interpretability, and fragmented coordination across key components such as factor mining and model innovation. In this paper, we propose R&D-Agent for Quantitative Finance, in short RD-Agent(Q), the first data-centric multi-agent framework designed to automate the full-stack research and development of quantitative strategies via coordinated factor-model co-optimization. RD-Agent(Q) decomposes the quant process into two iterative stages: a Research stage that dynamically sets goal-aligned prompts, formulates hypotheses based on domain priors, and maps them to concrete tasks, and a Development stage that employs a code-generation agent, Co-STEER, to implement task-specific code, which is then executed in real-market backtests. The two stages are connected through a feedback stage that thoroughly evaluates experimental outcomes and informs subsequent iterations, with a multi-armed bandit scheduler for adaptive direction selection. Empirically, RD-Agent(Q) achieves up to 2X higher annualized returns than classical factor libraries using 70% fewer factors, and outperforms state-of-the-art deep time-series models on real markets. Its joint factor-model optimization delivers a strong balance between predictive accuracy and strategy robustness. Our code is available at: https://github.com/microsoft/RD-Agent.
中文: 本文提出RD-Agent(Q)这一首创的多智能体框架,通过迭代式的因子-模型协同优化实现量化策略全栈自动化研发,在真实市场中以更少因子获得更高收益,并超越现有先进模型。
English: This paper introduces RD-Agent(Q), a pioneering multi-agent framework that automates full-stack quantitative strategy development through iterative factor-model co-optimization, achieving superior returns with fewer factors and outperforming existing models in real-market tests.
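
The abstract mentions a multi-armed bandit scheduler for choosing the next optimization direction but does not say which bandit algorithm is used. The sketch below uses UCB1 purely as a stand-in. The class name, the two arm labels, and the reward values are hypothetical.

```python
import math


class UCB1Scheduler:
    """Pick the next direction to work on with the UCB1 rule (illustrative only)."""

    def __init__(self, arms):
        self.counts = {arm: 0 for arm in arms}
        self.totals = {arm: 0.0 for arm in arms}

    def select(self):
        # Try every arm once before comparing confidence bounds.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        t = sum(self.counts.values())

        def upper_bound(arm):
            mean = self.totals[arm] / self.counts[arm]
            return mean + math.sqrt(2.0 * math.log(t) / self.counts[arm])

        return max(self.counts, key=upper_bound)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward
```

A loop would call `select()`, run one iteration in the chosen direction (for example "factor" or "model"), and pass the observed backtest improvement to `update()`.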

Authors:Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, Hao Zhang
Title: lmgame-Bench: How Good are LLMs at Playing Games?
Abstract:
Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons -- brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, and is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
中文: 玩电子游戏能有效评估大型语言模型的能力,但直接应用存在感知脆弱和数据污染等挑战,因此开发了lmgame-Bench,通过稳定提示和去污染设计,在多样化游戏中实现可靠评估。
English: Playing video games effectively assesses large language models' capabilities, but direct integration faces challenges like brittle perception and data contamination, leading to the development of lmgame-Bench for reliable evaluation across diverse games with stable prompts and contamination removal.
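
A "Gym-style" API usually means an environment object with `reset` and `step` methods. The toy environment below follows the Gymnasium convention, in which `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`. The game itself and all names are hypothetical and are not taken from lmgame-Bench.

```python
class CountdownEnv:
    """Hypothetical toy game: subtract 1 or 2 per move and land exactly on zero."""

    def __init__(self, start=5, max_steps=10):
        self.start = start
        self.max_steps = max_steps
        self.value = start
        self.steps = 0

    def _observation(self):
        # A text observation, as an LLM agent would receive it.
        return {"text": f"counter={self.value}"}

    def reset(self):
        self.value = self.start
        self.steps = 0
        return self._observation(), {}

    def step(self, action):
        if action not in (1, 2):
            raise ValueError("action must be 1 or 2")
        self.steps += 1
        self.value -= action
        terminated = self.value <= 0
        reward = 1.0 if self.value == 0 else 0.0
        truncated = not terminated and self.steps >= self.max_steps
        return self._observation(), reward, terminated, truncated, {}


def run_episode(env, policy):
    """Play one episode and return the total reward."""
    obs, _ = env.reset()
    total = 0.0
    while True:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            return total
```

With one interface shared by every game, the same agent loop (`run_episode` here) can be reused across titles, and only the `policy` that maps observations to actions changes.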

Authors:Xinran Wang, Songyu Xu, Xiangxuan Shan, Yuxuan Zhang, Muxi Diao, Xueyan Duan, Yanhua Huang, Kongming Liang, Zhanyu Ma
Title: CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation
Abstract:
Cinematography is a cornerstone of film production and appreciation, shaping mood, emotion, and narrative through visual elements such as camera movement, shot composition, and lighting. Despite recent progress in multimodal large language models (MLLMs) and video generation models, the capacity of current models to grasp and reproduce cinematographic techniques remains largely uncharted, hindered by the scarcity of expert-annotated data. To bridge this gap, we present CineTechBench, a pioneering benchmark founded on precise, manual annotation by seasoned cinematography experts across key cinematography dimensions. Our benchmark covers seven essential aspects (shot scale, shot angle, composition, camera movement, lighting, color, and focal length) and includes over 600 annotated movie images and 120 movie clips with clear cinematographic techniques. For the understanding task, we design question-answer pairs and annotated descriptions to assess MLLMs' ability to interpret and explain cinematographic techniques. For the generation task, we assess advanced video generation models on their capacity to reconstruct cinema-quality camera movements given conditions such as textual prompts or keyframes. We conduct a large-scale evaluation on 15+ MLLMs and 5+ video generation models. Our results offer insights into the limitations of current models and future directions for cinematography understanding and generation in automatic film production and appreciation. The code and benchmark can be accessed at https://github.com/PRIS-CV/CineTechBench.
中文: CineTechBench基准通过提供涵盖七个关键维度的600多张图像和120个剪辑的专业标注,填补了电影摄影专家标注数据的空白,全面评估多模态模型对电影技术的理解与生成能力。
English: The CineTechBench benchmark addresses the gap in expert-annotated cinematography data by providing over 600 images and 120 clips annotated across seven key dimensions, enabling comprehensive evaluation of multimodal models' understanding and generation of cinematic techniques.

Authors:Tong Cheng, Jie Fu, Xinpeng Ling, Huifa Li, Zhili Chen, Haifeng Qian, Junqing Gong
Title: EC-LDA : Label Distribution Inference Attack against Federated Graph Learning with Embedding Compression
Abstract:
Graph Neural Networks (GNNs) have been widely used for graph analysis. Federated Graph Learning (FGL) is an emerging learning framework for collaboratively training on graph data from various clients. Although FGL allows client data to remain localized, a malicious server can still steal private client information through uploaded gradients. In this paper, we propose, for the first time, label distribution attacks (LDAs) on FGL that aim to infer the label distributions of the client-side data. First, we observe that the effectiveness of LDA is closely related to the variance of node embeddings in GNNs. Next, we analyze the relation between them and propose a new attack named EC-LDA, which significantly improves the attack effectiveness by compressing node embeddings. Then, extensive experiments on node classification and link prediction tasks across six widely used graph datasets show that EC-LDA outperforms the SOTA LDAs. Specifically, EC-LDA achieves a cosine similarity (Cos-sim) as high as 1.0 in almost all cases. Finally, we explore the robustness of EC-LDA under differential privacy protection and discuss potentially effective defenses against EC-LDA. Our code is available at https://github.com/cheng-t/EC-LDA.
中文: 本文提出EC-LDA,一种通过压缩节点嵌入来提升联邦图学习中标签分布推断效果的新型攻击方法,在多个数据集和任务中显著优于现有技术。
English: This paper introduces EC-LDA, a novel label distribution attack that enhances privacy inference in Federated Graph Learning by compressing node embeddings, achieving superior performance over existing methods across multiple datasets and tasks.
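
The Cos-sim figure quoted in the abstract is a cosine similarity. A plausible reading, assumed here, is that it compares the inferred label distribution with the true one. The function name and the example vectors below are illustrative.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Hypothetical label distributions over three classes.
true_distribution = [0.5, 0.3, 0.2]
inferred_distribution = [0.5, 0.3, 0.2]
score = cosine_similarity(true_distribution, inferred_distribution)
```

A score of 1.0 means the inferred distribution points in exactly the same direction as the true one, so the attack recovers the class proportions exactly.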

Authors:Seongmin Hwang, Daeyoung Han, Moongu Jeon
Title: Multispectral Detection Transformer with Infrared-Centric Feature Fusion
Abstract:
Multispectral object detection aims to leverage complementary information from visible (RGB) and infrared (IR) modalities to enable robust performance under diverse environmental conditions. Our key insight, derived from wavelet analysis and empirical observations, is that IR images contain structurally rich high-frequency information critical for object detection, making an infrared-centric approach highly effective. To capitalize on this finding, we propose Infrared-Centric Fusion (IC-Fusion), a lightweight and modality-aware sensor fusion method that prioritizes infrared features while effectively integrating complementary RGB semantic context. IC-Fusion adopts a compact RGB backbone and designs a novel fusion module comprising a Multi-Scale Feature Distillation (MSFD) block to enhance RGB features and a three-stage fusion block with a Cross-Modal Channel Shuffle Gate (CCSG), a Cross-Modal Large Kernel Gate (CLKG), and a Channel Shuffle Projection (CSP) to facilitate effective cross-modal interaction. Experiments on the FLIR and LLVIP benchmarks demonstrate the superior effectiveness and efficiency of our IR-centric fusion strategy, further validating its benefits. Our code is available at https://github.com/smin-hwang/IC-Fusion.
中文: 本研究提出红外中心融合(IC-Fusion)方法,通过轻量级传感器融合优先处理红外特征并整合RGB语义信息,在基准数据集上验证了其卓越的性能与效率。
English: The study introduces Infrared-Centric Fusion (IC-Fusion), a lightweight sensor fusion method that prioritizes infrared features while integrating RGB semantic context, demonstrating superior performance and efficiency on benchmark datasets.

Authors:Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
Title: DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer
Abstract:
Recent advances in knowledge distillation have emphasized the importance of decoupling different knowledge components. While existing methods utilize momentum mechanisms to separate task-oriented and distillation gradients, they overlook the inherent conflict between target-class and non-target-class knowledge flows. Furthermore, low-confidence dark knowledge in non-target classes introduces noisy signals that hinder effective knowledge transfer. To address these limitations, we propose DeepKD, a novel training framework that integrates dual-level decoupling with adaptive denoising. First, through theoretical analysis of gradient signal-to-noise ratio (GSNR) characteristics in task-oriented and non-task-oriented knowledge distillation, we design independent momentum updaters for each component to prevent mutual interference. We observe that the optimal momentum coefficients for task-oriented gradient (TOG), target-class gradient (TCG), and non-target-class gradient (NCG) should be positively related to their GSNR. Second, we introduce a dynamic top-k mask (DTM) mechanism that gradually increases K from a small initial value to incorporate more non-target classes as training progresses, following curriculum learning principles. The DTM jointly filters low-confidence logits from both teacher and student models, effectively purifying dark knowledge during early training. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO demonstrate DeepKD's effectiveness. Our code is available at https://github.com/haiduo/DeepKD.
Chinese: DeepKD提出了一种双级解耦与自适应去噪的训练框架,通过独立动量更新器和动态掩码机制解决知识流冲突并过滤噪声信号,在多个数据集上验证了其有效性。
English: DeepKD introduces a dual-level decoupling framework with adaptive denoising to address conflicts between knowledge components and filter noisy signals in knowledge distillation, achieving superior performance across multiple datasets.
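
The dynamic top-k mask can be illustrated with a small sketch. The abstract states only that K grows from a small initial value as training progresses and that low-confidence logits are filtered for both teacher and student. The linear schedule, the function names, and ranking non-target classes by teacher logits alone are assumptions made for illustration.

```python
def scheduled_k(epoch, total_epochs, k_start, k_end):
    """Linearly grow k from k_start to k_end over training (assumed schedule)."""
    fraction = epoch / max(total_epochs - 1, 1)
    return round(k_start + fraction * (k_end - k_start))


def top_k_non_target_mask(teacher_logits, target, k):
    """Keep the k highest-scoring non-target classes and mask out the rest.

    Returns one boolean per class. The target class is always False because
    it is handled by a separate target-class term in decoupled distillation.
    """
    candidates = [i for i in range(len(teacher_logits)) if i != target]
    candidates.sort(key=lambda i: teacher_logits[i], reverse=True)
    keep = set(candidates[:k])
    return [i in keep for i in range(len(teacher_logits))]
```

The same boolean mask would then be applied to both teacher and student logits before computing the non-target distillation term, so early epochs only see the most confident non-target classes.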

Authors:Muniba Noreen, Furqan Shaukat
Title: Lung Nodule-SSM: Self-Supervised Lung Nodule Detection and Classification in Thoracic CT Images
Abstract:
Lung cancer remains among the deadliest types of cancer in recent decades, and early lung nodule detection is crucial for improving patient outcomes. The limited availability of annotated medical imaging data remains a bottleneck in developing accurate computer-aided diagnosis (CAD) systems. Self-supervised learning can help leverage large amounts of unlabeled data to develop more robust CAD systems. With the recent advent of transformer-based architectures and their ability to generalize to unseen tasks, there has been an effort within the healthcare community to adapt them to various medical downstream tasks. Thus, we propose a novel "LungNodule-SSM" method, which utilizes self-supervised learning with DINOv2 as a backbone to enhance lung nodule detection and classification without annotated data. Our methodology has two stages: first, the DINOv2 model is pre-trained on unlabeled CT scans to learn robust feature representations; second, these features are fine-tuned using transformer-based architectures for lesion-level detection and accurate lung nodule diagnosis. The proposed method has been evaluated on the challenging LUNA 16 dataset, consisting of 888 CT scans, and compared with SOTA methods. Our experimental results show the superiority of our proposed method with an accuracy of 98.37%, demonstrating its effectiveness in lung nodule detection. The source code, datasets, and pre-processed data can be accessed using the link: https://github.com/EMeRALDsNRPU/Lung-Nodule-SSM-Self-Supervised-Lung-Nodule-Detection-and-Classification/tree/main
中文: 提出的"LungNodule-SSM"方法采用基于DINOv2的自监督学习,在LUNA 16数据集上实现了98.37%的肺结节检测准确率,性能优于现有最先进方法。
English: The proposed "LungNodule-SSM" method leverages self-supervised learning with DINOv2 to achieve 98.37% accuracy in lung nodule detection on the LUNA 16 dataset, outperforming existing state-of-the-art approaches.

Authors:Bowen Jin, Jinsung Yoon, Priyanka Kargupta, Sercan O. Arik, Jiawei Han
Title: An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents
Abstract:
Reinforcement learning (RL) has demonstrated strong potential in training large language models (LLMs) capable of complex reasoning for real-world problem solving. More recently, RL has been leveraged to create sophisticated LLM-based search agents that adeptly combine reasoning with search engine use. While the use of RL for training search agents is promising, the optimal design of such agents remains not fully understood. In particular, key factors -- such as (1) reward formulation, (2) the choice and characteristics of the underlying LLM, and (3) the role of the search engine in the RL process -- require further investigation. In this work, we conduct comprehensive empirical studies to systematically investigate these and offer actionable insights. We highlight several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. These establish important guidelines for successfully building and deploying LLM-based search agents in real-world applications. Code is available at https://github.com/PeterGriffinJin/Search-R1.
中文摘要:强化学习能有效训练大语言模型开发结合推理与搜索引擎的智能搜索代理,其中奖励设计、模型选择和搜索引擎等关键因素显著影响性能与鲁棒性,为实际应用提供了重要指导。
English Summary: Reinforcement learning effectively trains large language models to create search agents that integrate reasoning with search engines, with key factors like reward design, model choice, and search engine selection critically impacting performance and robustness.

Authors:Zehong Wang, Zheyuan Liu, Tianyi Ma, Jiazheng Li, Zheyuan Zhang, Xingbo Fu, Yiyang Li, Zhengqing Yuan, Wei Song, Yijun Ma, Qingkai Zeng, Xiusi Chen, Jianan Zhao, Jundong Li, Meng Jiang, Pietro Lio, Nitesh Chawla, Chuxu Zhang, Yanfang Ye
Title: Graph Foundation Models: A Comprehensive Survey
Abstract:
Graph-structured data pervades domains such as social networks, biological systems, knowledge graphs, and recommender systems. While foundation models have transformed natural language processing, vision, and multimodal learning through large-scale pretraining and generalization, extending these capabilities to graphs -- characterized by non-Euclidean structures and complex relational semantics -- poses unique challenges and opens new opportunities. To this end, Graph Foundation Models (GFMs) aim to bring scalable, general-purpose intelligence to structured data, enabling broad transfer across graph-centric tasks and domains. This survey provides a comprehensive overview of GFMs, unifying diverse efforts under a modular framework comprising three key components: backbone architectures, pretraining strategies, and adaptation mechanisms. We categorize GFMs by their generalization scope -- universal, task-specific, and domain-specific -- and review representative methods, key innovations, and theoretical insights within each category. Beyond methodology, we examine theoretical foundations including transferability and emergent capabilities, and highlight key challenges such as structural alignment, heterogeneity, scalability, and evaluation. Positioned at the intersection of graph learning and general-purpose AI, GFMs are poised to become foundational infrastructure for open-ended reasoning over structured data. This survey consolidates current progress and outlines future directions to guide research in this rapidly evolving field. Resources are available at https://github.com/Zehong-Wang/Awesome-Foundation-Models-on-Graphs.
中文摘要:图基础模型(GFMs)通过骨干架构、预训练策略和适应机制,旨在为图结构数据提供可扩展的通用智能,在解决独特挑战的同时实现跨图中心任务和领域的广泛迁移。
English Summary: Graph Foundation Models (GFMs) aim to bring scalable, general-purpose intelligence to graph-structured data through backbone architectures, pretraining strategies, and adaptation mechanisms, addressing unique challenges while enabling broad transfer across graph-centric tasks and domains.

Authors:Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, Yichao Wu
Title: StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization
Abstract:
Efficient multi-hop reasoning requires agents based on Large Language Models (LLMs) to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but these methods underperform on complex, multi-hop QA because their rewards are sparse and come only from a global signal. To address this gap, we introduce StepSearch, a framework for search LLMs that is trained with a step-wise proximal policy optimization method. It uses richer, more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties to better guide each search step. We constructed a fine-grained question-answering dataset containing sub-question-level search trajectories, built from open-source datasets through a data pipeline. On standard multi-hop QA benchmarks, StepSearch significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various search-with-RL baselines using only 19k training data, demonstrating the effectiveness of fine-grained, step-wise supervision in optimizing deep search LLMs. Our code will be released on https://github.com/Zillwang/StepSearch.
中文: StepSearch框架通过引入基于信息增益和冗余惩罚的逐步奖励机制与过程监督,有效优化了大型语言模型在复杂多跳问答中的搜索能力,仅用少量训练数据即在标准测试中显著超越现有强化学习方法。
English: StepSearch is a novel framework that enhances multi-hop reasoning in LLMs by employing step-wise proximal policy optimization with detailed intermediate rewards and token-level supervision, significantly outperforming existing methods on complex QA benchmarks with minimal training data.
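
A step-level reward built from information gain and a redundancy penalty might look like the sketch below. It is an illustration under stated assumptions, not the paper's formula: gain is measured as the fraction of gold documents newly retrieved at a step, redundancy as the fraction of a step's documents already seen earlier, and the penalty weight of 0.5 is arbitrary. All names are hypothetical.

```python
def step_rewards(retrieved_per_step, gold_docs, redundancy_penalty=0.5):
    """Return one reward per search step.

    retrieved_per_step: list of lists of document ids, one list per search call.
    gold_docs: non-empty set of document ids that support the answer.
    """
    seen = set()
    rewards = []
    for docs in retrieved_per_step:
        docs = set(docs)
        new_gold = (docs & gold_docs) - seen   # useful evidence not found before
        repeated = docs & seen                 # documents already retrieved earlier
        gain = len(new_gold) / len(gold_docs)
        redundancy = len(repeated) / len(docs) if docs else 0.0
        rewards.append(gain - redundancy_penalty * redundancy)
        seen |= docs
    return rewards


# Three search steps against two gold documents.
rewards = step_rewards([["d1", "d2"], ["d2", "d3"], ["d1", "d2"]], {"d1", "d3"})
# rewards == [0.5, 0.25, -0.5]
```

The third step retrieves nothing new and is penalized, which gives the policy a signal at every search call instead of only a single reward for the final answer.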

Authors:Qihang Yu, Kairui Fu, Shengyu Zhang, Zheqi Lv, Fan Wu, Fei Wu
Title: ThinkRec: Thinking-based recommendation via LLM
Abstract:
Recent advances in large language models (LLMs) have enabled more semantic-aware recommendations through natural language generation. Existing LLM for recommendation (LLM4Rec) methods mostly operate in a System 1-like manner, relying on superficial features to match similar items based on click history, rather than reasoning through deeper behavioral logic. This often leads to superficial and erroneous recommendations. Motivated by this, we propose ThinkRec, a thinking-based framework that shifts LLM4Rec from System 1 to System 2 (rational system). Technically, ThinkRec introduces a thinking activation mechanism that augments item metadata with keyword summarization and injects synthetic reasoning traces, guiding the model to form interpretable reasoning chains that consist of analyzing interaction histories, identifying user preferences, and making decisions based on target items. On top of this, we propose an instance-wise expert fusion mechanism to reduce the reasoning difficulty. By dynamically assigning weights to expert models based on users' latent features, ThinkRec adapts its reasoning path to individual users, thereby enhancing precision and personalization. Extensive experiments on real-world datasets demonstrate that ThinkRec significantly improves the accuracy and interpretability of recommendations. Our implementations are available in anonymous Github: https://github.com/Yu-Qi-hang/ThinkRec.
English Summary: ThinkRec is a novel thinking-based framework that shifts LLM recommendations from superficial pattern matching to rational reasoning by incorporating keyword summarization, synthetic reasoning traces, and adaptive expert fusion, significantly improving recommendation accuracy and interpretability.

Authors:Zhiyu Shen, Jiyuan Liu, Yunhe Pang, Yanghui Rao
Title: HopWeaver: Synthesizing Authentic Multi-Hop Questions Across Text Corpora
Abstract:
Multi-Hop Question Answering (MHQA) is crucial for evaluating the model's capability to integrate information from diverse sources. However, creating extensive and high-quality MHQA datasets is challenging: (i) manual annotation is expensive, and (ii) current synthesis methods often produce simplistic questions or require extensive manual guidance. This paper introduces HopWeaver, the first automatic framework synthesizing authentic multi-hop questions from unstructured text corpora without human intervention. HopWeaver synthesizes two types of multi-hop questions (bridge and comparison) using an innovative approach that identifies complementary documents across corpora. Its coherent pipeline constructs authentic reasoning paths that integrate information across multiple documents, ensuring synthesized questions necessitate authentic multi-hop reasoning. We further present a comprehensive system for evaluating synthesized multi-hop questions. Empirical evaluations demonstrate that the synthesized questions achieve comparable or superior quality to human-annotated datasets at a lower cost. Our approach is valuable for developing MHQA datasets in specialized domains with scarce annotated resources. The code for HopWeaver is publicly available.
Chinese: HopWeaver是一种创新的跨文档框架,无需人工干预即可自动生成真实的多跳问题,以更低成本产出与人工标注数据集质量相当的高质量基准。
English: HopWeaver is an innovative cross-document framework that automatically generates authentic multi-hop questions without human intervention, producing high-quality benchmarks comparable to human-annotated datasets at lower cost.

Authors:Zhiyu Shen, Jiyuan Liu, Yunhe Pang, Yanghui Rao
Title: HopWeaver: Cross-Document Synthesis of High-Quality and Authentic Multi-Hop Questions
Abstract:
Multi-Hop Question Answering (MHQA) is crucial for evaluating the model's capability to integrate information from diverse sources. However, creating extensive and high-quality MHQA datasets is challenging: (i) manual annotation is expensive, and (ii) current synthesis methods often produce simplistic questions or require extensive manual guidance. This paper introduces HopWeaver, the first cross-document framework synthesizing authentic multi-hop questions without human intervention. HopWeaver synthesizes bridge and comparison questions through an innovative pipeline that identifies complementary documents and constructs authentic reasoning paths to ensure true multi-hop reasoning. We further present a comprehensive system for evaluating the synthesized multi-hop questions. Empirical evaluations demonstrate that the synthesized questions achieve comparable or superior quality to human-annotated datasets at a lower cost. Our framework provides a valuable tool for the research community: it can automatically generate challenging benchmarks from any raw corpus, which opens new avenues for both evaluation and targeted training to improve the reasoning capabilities of advanced QA models, especially in domains with scarce resources.
Chinese: HopWeaver是一种创新的跨文档框架,无需人工干预即可自动生成真实的多跳问题,以更低成本产出与人工标注数据集质量相当的高质量基准。
English: HopWeaver is an innovative cross-document framework that automatically generates authentic multi-hop questions without human intervention, producing high-quality benchmarks comparable to human-annotated datasets at lower cost.

Authors:Jeremy Qin
Title: Robust Multi-Modal Forecasting: Integrating Static and Dynamic Features
Abstract:
Time series forecasting plays a crucial role in various applications, particularly in healthcare, where accurate predictions of future health trajectories can significantly impact clinical decision-making. Ensuring transparency and explainability of the models responsible for these tasks is essential for their adoption in critical settings. Recent work has explored a top-down approach to bi-level transparency, focusing on understanding trends and properties of predicted time series using static features. In this work, we extend this framework by incorporating exogenous time series features alongside static features in a structured manner, while maintaining cohesive interpretation. Our approach leverages the insights of trajectory comprehension to introduce an encoding mechanism for exogenous time series, in which they are decomposed into meaningful trends and properties, enabling the extraction of interpretable patterns. Through experiments on several synthetic datasets, we demonstrate that our approach remains predictive while preserving interpretability and robustness. This work represents a step towards developing robust and generalized time series forecasting models. The code is available at https://github.com/jeremy-qin/TIMEVIEW
Chinese: 本研究通过将外生时间序列特征与静态特征结构化结合,在保持可解释性和鲁棒性的同时改进了时间序列预测,并在合成数据集上得到验证。
English: This study enhances time series forecasting by integrating exogenous time series features with static features in a structured way to maintain interpretability and robustness, as validated on synthetic datasets.

Authors:Yuhang Zhou, Jing Zhu, Shengyi Qian, Zhuokai Zhao, Xiyao Wang, Xiaoyu Liu, Ming Li, Paiheng Xu, Wei Ai, Furong Huang
Title: DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data
Abstract:
Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups, assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks. Our code and data are available at https://github.com/Tonyzhou98/disco_grpo.
中文: 提出的DISCO方法通过引入领域感知和难度感知的奖励缩放机制,有效解决GRPO在多领域数据中的群体不平衡问题,显著提升模型的泛化能力与公平性,并在多领域对齐基准测试中创下最新最优表现。
English: The proposed DISCO method enhances GRPO by incorporating domain-aware and difficulty-aware reward scaling to address inter-group imbalance, improving generalization and fairness across multi-domain datasets while achieving state-of-the-art performance.
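
The two scaling ideas in the DISCO abstract above can be illustrated with a minimal Python sketch that reweights GRPO-style group-normalized advantages. The weighting formulas used here (inverse relative domain frequency, and one plus the disagreement rate among sampled answers) and all function names are assumptions made for illustration; the abstract does not state the paper's exact formulas.

```python
from collections import Counter
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style advantages: normalize rewards within one prompt's sample group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # All samples scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def domain_weight(domain, domain_counts):
    """Assumed form: inverse relative frequency, so rare domains weigh more."""
    total = sum(domain_counts.values())
    return total / (len(domain_counts) * domain_counts[domain])


def difficulty_weight(answers):
    """Assumed form: 1 + (1 - self-consistency), where self-consistency is
    the share of sampled answers that agree with the majority answer."""
    majority_count = Counter(answers).most_common(1)[0][1]
    return 1.0 + (1.0 - majority_count / len(answers))


def scaled_advantages(rewards, answers, domain, domain_counts):
    """Scale the group advantages by both weights."""
    w = domain_weight(domain, domain_counts) * difficulty_weight(answers)
    return [w * a for a in group_advantages(rewards)]
```

With hypothetical counts such as `{"math": 90, "law": 10}`, a prompt from the rare `law` domain gets a larger weight than one from `math`, and a prompt whose sampled answers disagree gets a larger weight than one whose answers are unanimous.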

Authors:Chen Huang, Junkai Luo, Xinzuo Wang, Wenqiang Lei, Jiancheng Lv
Title: Can Large Language Models Understand Internet Buzzwords Through User-Generated Content
Abstract:
The massive user-generated content (UGC) available in Chinese social media is giving rise to the possibility of studying internet buzzwords. In this paper, we study whether large language models (LLMs) can generate accurate definitions for these buzzwords based on UGC as examples. Our work makes a threefold contribution. First, we introduce CHEER, the first dataset of Chinese internet buzzwords, each annotated with a definition and relevant UGC. Second, we propose a novel method, called RESS, to effectively steer the comprehending process of LLMs to produce more accurate buzzword definitions, mirroring the skills of human language learning. Third, with CHEER, we benchmark the strengths and weaknesses of various off-the-shelf definition generation methods and our RESS. Our benchmark demonstrates the effectiveness of RESS while revealing crucial shared challenges: over-reliance on prior exposure, underdeveloped inferential abilities, and difficulty identifying high-quality UGC to facilitate comprehension. We believe our work lays the groundwork for future advancements in LLM-based definition generation. Our dataset and code are available at https://github.com/SCUNLP/Buzzword.
中文: 本文提出了首个中文网络流行语数据集CHEER和RESS方法,通过引导大语言模型理解用户生成内容来提升流行语定义生成准确性,同时评估了现有方法的优劣并揭示了关键挑战。
English: This paper introduces CHEER, the first Chinese internet buzzword dataset, and proposes RESS, a method to improve large language models' accuracy in generating buzzword definitions from user-generated content, while benchmarking existing approaches and identifying key challenges.

Authors:Sarfraz Ahmad, Hasan Iqbal, Momina Ahsan, Numaan Naeem, Muhammad Ahsan Riaz Khan, Arham Riaz, Muhammad Arslan Manzoor, Yuxia Wang, Preslav Nakov
Title: UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking
Abstract:
The rapid adoption of large language models (LLMs) has raised critical concerns regarding the factual reliability of their outputs, especially in low-resource languages such as Urdu. Existing automated fact-checking solutions overwhelmingly focus on English, leaving a significant gap for the 200+ million Urdu speakers worldwide. In this work, we introduce UrduFactCheck, the first comprehensive, modular fact-checking framework specifically tailored for Urdu. Our system features a dynamic, multi-strategy evidence retrieval pipeline that combines monolingual and translation-based approaches to address the scarcity of high-quality Urdu evidence. We curate and release two new hand-annotated benchmarks: UrduFactBench for claim verification and UrduFactQA for evaluating LLM factuality. Extensive experiments demonstrate that UrduFactCheck, particularly its translation-augmented variants, consistently outperforms baselines and open-source alternatives on multiple metrics. We further benchmark twelve state-of-the-art (SOTA) LLMs on factual question answering in Urdu, highlighting persistent gaps between proprietary and open-source models. UrduFactCheck's code and datasets are open-sourced and publicly available at https://github.com/mbzuai-nlp/UrduFactCheck.
Chinese: 本研究推出了首个针对乌尔都语的全面事实核查框架UrduFactCheck,通过采用多策略证据检索系统解决可靠信息稀缺问题,在新开发的基准测试中优于现有方法,同时评估了多种大语言模型在乌尔都语中的事实准确性。
English: This study introduces UrduFactCheck, the first comprehensive fact-checking framework for Urdu that addresses the scarcity of reliable information by employing a multi-strategy evidence retrieval system and outperforms existing methods on newly developed benchmarks, while also evaluating the factual accuracy of various LLMs in Urdu.

Authors:Wen-Chin Huang, Erica Cooper, Tomoki Toda
Title: SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation Toolkit
Abstract:
We introduce SHEET, a multi-purpose open-source toolkit designed to accelerate subjective speech quality assessment (SSQA) research. SHEET stands for the Speech Human Evaluation Estimation Toolkit, which focuses on data-driven deep neural network-based models trained to predict human-labeled quality scores of speech samples. SHEET provides comprehensive training and evaluation scripts, multi-dataset and multi-model support, as well as pre-trained models accessible via Torch Hub and HuggingFace Spaces. To demonstrate its capabilities, we re-evaluated SSL-MOS, a speech self-supervised learning (SSL)-based SSQA model widely used in recent scientific papers, on an extensive list of speech SSL models. Experiments were conducted on two representative SSQA datasets named BVCC and NISQA, and we identified the optimal speech SSL model, whose performance surpassed the original SSL-MOS implementation and was comparable to state-of-the-art methods.
Chinese: SHEET 是一款开源工具包,旨在通过提供数据驱动模型、训练脚本和预训练模型来加速主观语音质量评估研究,实验证明其优化的 SSL 模型性能超越原版并媲美最先进方法。
English: SHEET is an open-source toolkit designed to accelerate subjective speech quality assessment research by providing data-driven models, training scripts, and pre-trained models, with experiments showing its optimized SSL model outperforms the original and matches state-of-the-art methods.

Authors:Yingming Pu, Tao Lin, Hongyu Chen
Title: PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration
Abstract:
Large Language Model (LLM)-based multi-agent systems (MAS) demonstrate remarkable potential for scientific discovery. Existing approaches, however, often automate scientific discovery using predefined workflows that lack rationality constraints. This often leads to aimless hypothesizing and a failure to consistently link hypotheses with evidence, thereby hindering the systematic reduction of uncertainty. Overcoming these limitations fundamentally requires a principled approach to exploration. We introduce PiFlow, an information-theoretical framework, treating automated scientific discovery as a structured uncertainty reduction problem guided by principles (e.g., scientific laws). In evaluations across three distinct scientific domains -- discovering nanomaterial structures, bio-molecules, and superconductor candidates with targeted properties -- our method significantly improves discovery efficiency, reflected by a 73.55% increase in the Area Under the Curve (AUC) of property values versus exploration steps, and enhances solution quality by 94.06% compared to a vanilla agent system. Overall, PiFlow serves as a Plug-and-Play method, establishing a novel paradigm shift in highly efficient automated scientific discovery, paving the way for more robust and accelerated AI-driven research. Code is publicly available at https://github.com/amair-lab/PiFlow.
Chinese: PiFlow提出了一种信息理论框架,将自动化科学发现视为结构化的不确定性减少过程,在多个科学领域中相比现有方法显著提高了发现效率和解决方案质量。
English: PiFlow introduces an information-theoretic framework that treats automated scientific discovery as structured uncertainty reduction, significantly improving efficiency and solution quality across multiple scientific domains compared to existing methods.

Authors:Ze Wang, Jingang Qu, Zhenyu Gao, Pascal Morin
Title: Learning-based Airflow Inertial Odometry for MAVs using Thermal Anemometers in a GPS and vision denied environment
Abstract:
This work demonstrates an airflow-inertial odometry system based on multi-sensor data fusion, combining a thermal anemometer, IMU, ESC, and barometer. The task is challenging because low-cost IMUs and barometers have significant bias, and anemometer measurements are very susceptible to interference from spinning propellers and ground effects. We employ a GRU-based deep neural network to estimate relative air speed from noisy and disturbed anemometer measurements, and an observer with a bias model to fuse the sensor data and thus estimate the state of the aerial vehicle. A complete flight dataset, including takeoff and landing on the ground, shows that the approach is able to decouple the downwash-induced wind speed caused by the propellers and the ground effect, and to accurately estimate the flight speed in a wind-free indoor environment. IMU and barometer biases are effectively estimated, which significantly reduces the position integration drift to only 5.7 m over a 203 s manual random flight. The open-source code is available at https://github.com/SyRoCo-ISIR/Flight-Speed-Estimation-Airflow.
中文: 本研究提出了一种基于多传感器融合的气流惯性里程系统,采用GRU神经网络解耦螺旋桨引起的干扰,在无风环境下将203秒飞行的位置漂移降至5.7米,实现了精确的速度估计。
English: This study presents an airflow inertial odometry system using multi-sensor fusion and a GRU-based network to accurately estimate flight speed by decoupling propeller-induced disturbances and reducing position drift to 5.7m over 203 seconds in wind-free conditions.

Authors:Ivan Smirnov, Shangding Gu
Title: RLBenchNet: The Right Network for the Right Reinforcement Learning Task
Abstract:
Reinforcement learning (RL) has seen significant advancements through the application of various neural network architectures. In this study, we systematically investigate the performance of several neural networks in RL tasks, including Long Short-Term Memory (LSTM), Multi-Layer Perceptron (MLP), Mamba/Mamba-2, Transformer-XL, Gated Transformer-XL, and Gated Recurrent Unit (GRU). Through comprehensive evaluation across continuous control, discrete decision-making, and memory-based environments, we identify architecture-specific strengths and limitations. Our results reveal that: (1) MLPs excel in fully observable continuous control tasks, providing an optimal balance of performance and efficiency; (2) recurrent architectures like LSTM and GRU offer robust performance in partially observable environments with moderate memory requirements; (3) Mamba models achieve a 4.5x higher throughput compared to LSTM and a 3.9x increase over GRU, all while maintaining comparable performance; and (4) only Transformer-XL, Gated Transformer-XL, and Mamba-2 successfully solve the most challenging memory-intensive tasks, with Mamba-2 requiring 8x less memory than Transformer-XL. These findings provide insights for researchers and practitioners, enabling more informed architecture selection based on specific task characteristics and computational constraints. Code is available at: https://github.com/SafeRL-Lab/RLBenchNet
中文: 本研究系统评估了多种神经网络在强化学习任务中的表现,发现MLP在完全可观测任务中表现卓越,循环网络在部分可观测环境中稳健高效,Mamba模型吞吐量显著提升,而仅Transformer-XL变体和Mamba-2能解决最高内存需求任务且效率优势突出。
English: This study systematically evaluates various neural network architectures in reinforcement learning tasks, revealing that MLPs excel in continuous control, recurrent networks handle partial observability efficiently, Mamba models offer superior throughput, and only Transformer-XL variants and Mamba-2 solve the most memory-intensive challenges with significant efficiency gains.

Authors:Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi
Title: RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning
Abstract:
Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
中文摘要:Tango提出了一种新颖的强化学习框架,通过协同训练LLM生成器和生成式验证器,在无需过程级标注的情况下实现两者相互增强,并在复杂推理任务上取得了最先进的性能。
English Summary: Tango introduces a novel reinforcement learning framework that co-trains an LLM generator and a generative verifier, achieving state-of-the-art performance on complex reasoning tasks through their mutual reinforcement without requiring process-level annotations.

Authors:Qingyu Song, Peiyu Liao, Wenqian Zhao, Yiwen Wang, Shoubo Hu, Hui-Ling Zhen, Ning Jiang, Mingxuan Yuan
Title: Harnessing On-Device Large Language Model: Empirical Results and Implications for AI PC
Abstract:
The increasing deployment of Large Language Models (LLMs) on edge devices, driven by model advancements and hardware improvements, offers significant privacy benefits. However, these on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. To address this, we introduce a systematic methodology -- encompassing model capability, development efficiency, and system resources -- for evaluating on-device LLMs. Our comprehensive evaluation, encompassing models from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity laptops, yields several critical insights: 1) System-level metrics exhibit near-linear scaling with effective bits-per-weight (BPW). 2) A practical threshold exists around ~3.5 effective BPW: larger models subjected to low-bit quantization consistently outperform smaller models utilizing higher bit-precision. 3) Quantization with low BPW incurs marginal accuracy loss but significant memory savings. 4) Power consumption on CPU is determined by low-level implementation specifics, with computation-intensive operations consuming more power than memory-intensive ones. These findings offer crucial insights and practical guidelines for the efficient deployment and optimized configuration of LLMs on resource-constrained edge devices. Our codebase is available at https://github.com/simmonssong/LLMOnDevice.
中文: 本研究提出了针对设备端大语言模型的系统评估方法,发现采用低比特量化的大型模型性能优于高精度小型模型,并为资源受限的边缘设备部署提供了实用指导。
English: The study introduces a systematic evaluation methodology for on-device LLMs, revealing that larger models with low-bit quantization outperform smaller high-precision ones and offering practical deployment guidelines for edge devices.
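
As a back-of-the-envelope illustration of why bits-per-weight matters for the trade-off described above, the sketch below estimates weight storage from parameter count and effective BPW. The formula (parameters times BPW, divided by 8 bits per byte) is a standard approximation that ignores activations, the KV cache, and runtime overhead. The 7B model at 8 BPW is a hypothetical comparison point, not a configuration reported in the abstract.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Ignores activations, KV cache, and runtime overhead.
    """
    return num_params * bits_per_weight / 8 / 1e9


# A 14B-parameter model quantized to roughly 3.5 effective BPW.
large_low_bit = weight_memory_gb(14e9, 3.5)

# Hypothetical smaller model at higher precision, for comparison only.
small_high_bit = weight_memory_gb(7e9, 8.0)

print(f"14B @ 3.5 BPW: {large_low_bit:.3f} GB")
print(f" 7B @ 8.0 BPW: {small_high_bit:.3f} GB")
```

Under this approximation the larger low-bit model needs slightly less weight memory than the smaller 8-bit one, which is the kind of comparison the abstract's second finding concerns.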

Authors:Alvin Heng, Harold Soh
Title: Know When to Abstain: Optimal Selective Classification with Likelihood Ratios
Abstract:
Selective classification enhances the reliability of predictive models by allowing them to abstain from making uncertain predictions. In this work, we revisit the design of optimal selection functions through the lens of the Neyman-Pearson lemma, a classical result in statistics that characterizes the optimal rejection rule as a likelihood ratio test. We show that this perspective not only unifies the behavior of several post-hoc selection baselines, but also motivates new approaches to selective classification which we propose here. A central focus of our work is the setting of covariate shift, where the input distribution at test time differs from that at training. This realistic and challenging scenario remains relatively underexplored in the context of selective classification. We evaluate our proposed methods across a range of vision and language tasks, including both supervised learning and vision-language models. Our experiments demonstrate that our Neyman-Pearson-informed methods consistently outperform existing baselines, indicating that likelihood ratio-based selection offers a robust mechanism for improving selective classification under covariate shifts. Our code is publicly available at https://github.com/clear-nus/sc-likelihood-ratios.
中文: 本研究通过奈曼-皮尔逊引理重新审视选择性分类,提出的似然比方法在协变量偏移场景下的视觉和语言任务中优于现有基线。
English: This study revisits selective classification through the Neyman-Pearson lemma, proposing likelihood ratio-based methods that outperform existing baselines under covariate shift scenarios in vision and language tasks.
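
The likelihood-ratio view of selection described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: a single scalar confidence score per example, and Gaussian densities fitted separately to the scores of correctly and incorrectly predicted validation examples. The function names are hypothetical and this is not the selection function proposed in the paper.

```python
from math import log, pi
from statistics import NormalDist


def fit_gaussian(scores):
    """Fit a 1-D Gaussian to a list of scalar scores (needs >= 2 samples)."""
    return NormalDist.from_samples(scores)


def log_pdf(dist, x):
    """Log-density of a Gaussian, computed directly to avoid log(0) underflow."""
    z = (x - dist.mean) / dist.stdev
    return -0.5 * z * z - log(dist.stdev) - 0.5 * log(2 * pi)


def log_likelihood_ratio(score, correct_dist, wrong_dist):
    """log p(score | prediction correct) - log p(score | prediction wrong)."""
    return log_pdf(correct_dist, score) - log_pdf(wrong_dist, score)


def accept(score, correct_dist, wrong_dist, threshold=0.0):
    """Predict if the log-likelihood ratio clears the threshold, else abstain."""
    return log_likelihood_ratio(score, correct_dist, wrong_dist) >= threshold


# Toy validation scores (made up for illustration).
correct_dist = fit_gaussian([0.90, 0.80, 0.95, 0.85])
wrong_dist = fit_gaussian([0.40, 0.50, 0.30, 0.60])

print(accept(0.90, correct_dist, wrong_dist))  # high score: keep the prediction
print(accept(0.40, correct_dist, wrong_dist))  # low score: abstain
```

Raising `threshold` trades coverage for reliability: fewer inputs are accepted, but those that are accepted look more like the correctly predicted validation examples.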

Authors:Jose Sosa, Danila Rukhovich, Anis Kacem, Djamila Aouada
Title: MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks
Abstract:
Multi-modal data in Earth Observation (EO) presents a huge opportunity for improving transfer learning capabilities when pre-training deep learning models. Unlike prior work that often overlooks multi-modal EO data, recent methods have started to include it, resulting in more effective pre-training strategies. However, existing approaches commonly face challenges in effectively transferring learning to downstream tasks where the structure of available data differs from that used during pre-training. This paper addresses this limitation by exploring a more flexible multi-modal, multi-task pre-training strategy for EO data. Specifically, we adopt a Multi-modal Multi-task Masked Autoencoder (MultiMAE) that we pre-train by reconstructing diverse input modalities, including spectral, elevation, and segmentation data. The pre-trained model demonstrates robust transfer learning capabilities, outperforming state-of-the-art methods on various EO datasets for classification and segmentation tasks. Our approach exhibits significant flexibility, handling diverse input configurations without requiring modality-specific pre-trained models. Code will be available at: https://github.com/josesosajs/multimae-meets-eo.
Chinese: 本文提出了一种灵活的多模态多任务预训练策略,采用多模态多任务掩码自编码器(MultiMAE)处理地球观测数据,通过重建多种输入模态提升了迁移学习能力,在分类和分割任务上超越了现有最优方法。
English: This paper introduces a flexible multi-modal, multi-task pre-training strategy using a Multi-modal Multi-task Masked Autoencoder (MultiMAE) for Earth Observation data, which enhances transfer learning by reconstructing diverse input modalities and outperforms existing methods in classification and segmentation tasks.

Authors:Zhiwei Liu, Paul Thompson, Jiaqi Rong, Sophia Ananiadou
Title: ConspEmoLLM-v2: A robust and stable model to detect sentiment-transformed conspiracy theories
Abstract:
Despite the many benefits of large language models (LLMs), they can also cause harm, e.g., through automatic generation of misinformation, including conspiracy theories. Moreover, LLMs can also "disguise" conspiracy theories by altering characteristic textual features, e.g., by transforming their typically strong negative emotions into a more positive tone. Although several studies have proposed automated conspiracy theory detection methods, they are usually trained using human-authored text, whose features can vary from LLM-generated text. Furthermore, several conspiracy detection models, including the previously proposed ConspEmoLLM, rely heavily on the typical emotional features of human-authored conspiracy content. As such, intentionally disguised content may evade detection. To combat such issues, we first developed an augmented version of the ConDID conspiracy detection dataset, ConDID-v2, which supplements human-authored conspiracy tweets with versions rewritten by an LLM to reduce the negativity of their original sentiment. The quality of the rewritten tweets was verified by combining human and LLM-based assessment. We subsequently used ConDID-v2 to train ConspEmoLLM-v2, an enhanced version of ConspEmoLLM. Experimental results demonstrate that ConspEmoLLM-v2 retains or exceeds the performance of ConspEmoLLM on the original human-authored content in ConDID, and considerably outperforms both ConspEmoLLM and several other baselines when applied to sentiment-transformed tweets in ConDID-v2. The project will be available at https://github.com/lzw108/ConspEmoLLM.
Chinese Summary: 本研究通过构建增强数据集ConDID-v2和改进检测模型ConspEmoLLM-v2,有效解决了大语言模型生成的情绪伪装型阴谋论检测难题,显著提升了针对情感修饰内容的识别性能。
English Summary: This study addresses the challenge of detecting LLM-generated conspiracy theories that disguise negative emotional cues by introducing an augmented dataset, ConDID-v2, and an enhanced detection model, ConspEmoLLM-v2, which significantly improves detection accuracy on sentiment-manipulated content.

Authors:Xiaoyan Bai, Ike Peng, Aditya Singh, Chenhao Tan
Title: Concept Incongruence: An Exploration of Time and Death in Role Playing
Abstract:
Consider this prompt "Draw a unicorn with two horns". Should large language models (LLMs) recognize that a unicorn has only one horn by definition and ask users for clarifications, or proceed to generate something anyway? We introduce concept incongruence to capture such phenomena where concept boundaries clash with each other, either in user prompts or in model representations, often leading to under-specified or mis-specified behaviors. In this work, we take the first step towards defining and analyzing model behavior under concept incongruence. Focusing on temporal boundaries in the Role-Play setting, we propose three behavioral metrics--abstention rate, conditional accuracy, and answer rate--to quantify model behavior under incongruence due to the role's death. We show that models fail to abstain after death and suffer from an accuracy drop compared to the Non-Role-Play setting. Through probing experiments, we identify two main causes: (i) unreliable encoding of the "death" state across different years, leading to unsatisfactory abstention behavior, and (ii) role playing causes shifts in the model's temporal representations, resulting in accuracy drops. We leverage these insights to improve consistency in the model's abstention and answer behaviors. Our findings suggest that concept incongruence leads to unexpected model behaviors and point to future directions on improving model behavior under concept incongruence.
中文摘要:本研究提出“概念冲突”来分析语言模型处理概念边界冲突时的表现,发现模型在角色扮演中因死亡状态编码不可靠和时间表征偏移,常无法停止回答且准确性下降。
English Summary: This study introduces "concept incongruence" to examine how language models handle conflicting concept boundaries, revealing that models often fail to abstain from answering when roles die in role-play scenarios due to unreliable temporal encoding and representation shifts.
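
To make the three behavioral metrics named in the abstract concrete, here is a minimal Python sketch. The abstract does not give formal definitions, so the ones used here are assumptions: abstention rate is the fraction of responses that abstain, answer rate is the fraction that give an answer, and conditional accuracy is accuracy among the responses that answer. The record format is hypothetical.

```python
def behavior_metrics(records):
    """Compute assumed versions of the three metrics.

    Each record is a dict with boolean keys:
      'abstained' - the response declined (e.g. noted the role is dead)
      'answered'  - the response nonetheless gave an answer
      'correct'   - the given answer was correct
    A response may both abstain and answer, so the two rates need not sum to 1.
    """
    if not records:
        raise ValueError("records must be non-empty")
    n = len(records)
    answered = [r for r in records if r["answered"]]
    return {
        "abstention_rate": sum(r["abstained"] for r in records) / n,
        "answer_rate": len(answered) / n,
        # Accuracy is only defined over responses that actually answered.
        "conditional_accuracy": (
            sum(r["correct"] for r in answered) / len(answered)
            if answered else None
        ),
    }


example = [
    {"abstained": True, "answered": False, "correct": False},
    {"abstained": False, "answered": True, "correct": True},
    {"abstained": False, "answered": True, "correct": False},
    {"abstained": True, "answered": True, "correct": True},
]
print(behavior_metrics(example))
```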

Authors:Tuan-Nghia Bui, Huy-Son Nguyen, Cam-Van Thi Nguyen, Hoang-Quynh Le, Duc-Trong Le
Title: Personalized Diffusion Model Reshapes Cold-Start Bundle Recommendation
Abstract:
Bundle recommendation aims to recommend a set of items to each user. However, the sparse interactions between users and bundles pose a major challenge, especially in cold-start scenarios. Traditional collaborative filtering methods do not work well for this kind of problem because these models rely on interactions to update the latent embedding, which is difficult in a cold-start setting. We propose a new approach (DisCo), which relies on a personalized Diffusion backbone, enhanced by disentangled aspects for the user's interest, to generate a bundle in distribution space for each user to tackle the cold-start challenge. During the training phase, DisCo adjusts an additional objective loss term to avoid bias, a prevalent issue when using generative models for top-K recommendation purposes. Our empirical experiments show that DisCo outperforms five comparative baselines by a large margin on three real-world datasets. This study thus offers a promising framework and useful perspectives for cold-start recommendation. Our materials for reproducibility are available at: https://github.com/bt-nghia/DisCo.
中文摘要:提出的DisCo模型通过个性化扩散框架和解耦用户兴趣,有效解决了捆绑推荐中的冷启动问题,在多个数据集上显著优于现有基线方法。
English Summary: The proposed DisCo model uses a personalized diffusion backbone and disentangled user interests to effectively address the cold-start challenge in bundle recommendation, outperforming existing methods across multiple datasets.

Authors:Susav Shrestha, Brad Settlemyer, Nikoli Dryden, Narasimha Reddy
Title: Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
Abstract:
Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to Attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly expensive at scale, while its head sparsity remains stable and batch-invariant. We develop hardware-efficient, sparsity-aware GPU kernels for selective MLP and Attention computations, delivering up to 2.2× end-to-end speedups for models like OPT, LLaMA-2 & 3, across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes, making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems. Our code is available at: https://github.com/susavlsh10/Polar-Sparsity.
中文: Polar Sparsity通过将稀疏计算重点转向注意力层,实现了大规模语言模型推理加速,在不同批处理规模下达到2.2倍提速且保持精度无损。
English: Polar Sparsity accelerates large language model inference by shifting focus to attention layer sparsity, achieving up to 2.2× speedup across various models and batch sizes without accuracy loss.
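
The claim that the union of active neurons approaches dense computation as batch size grows can be illustrated with a simple probabilistic sketch. It assumes, purely for illustration, that each token activates each MLP neuron independently with the same probability; real activation patterns are correlated, so the numbers are indicative only and are not taken from the paper.

```python
def expected_union_fraction(p_active, batch_size):
    """Expected fraction of neurons active for at least one token in the batch.

    Assumes each token activates each neuron independently with
    probability p_active (an idealization, not a measured property).
    """
    return 1.0 - (1.0 - p_active) ** batch_size


# With 10% per-token sparsity, the batch-level union fills up quickly.
for batch_size in (1, 8, 64):
    frac = expected_union_fraction(0.10, batch_size)
    print(f"batch={batch_size:3d}  union of active neurons = {frac:.3f}")
```

Under this assumption a layer that is 90% sparse for one token is almost fully dense for a batch of 64, which matches the abstract's observation that MLP sparsity vanishes under batching.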

Authors:So Won Jeong, Claire Donnat
Title: LOBSTUR: A Local Bootstrap Framework for Tuning Unsupervised Representations in Graph Neural Networks
Abstract:
Graph Neural Networks (GNNs) are increasingly used in conjunction with unsupervised learning techniques to learn powerful node representations, but their deployment is hindered by their high sensitivity to hyperparameter tuning and the absence of established methodologies for selecting the optimal models. To address these challenges, we propose LOBSTUR-GNN (Local Bootstrap for Tuning Unsupervised Representations in GNNs), a novel framework designed to adapt bootstrapping techniques for unsupervised graph representation learning. LOBSTUR-GNN tackles two main challenges: (a) adapting the bootstrap edge and feature resampling process to account for local graph dependencies in creating alternative versions of the same graph, and (b) establishing robust metrics for evaluating learned representations without ground-truth labels. Using locally bootstrapped resampling and leveraging Canonical Correlation Analysis (CCA) to assess embedding consistency, LOBSTUR provides a principled approach for hyperparameter tuning in unsupervised GNNs. We validate the effectiveness and efficiency of our proposed method through extensive experiments on established academic datasets, showing a 65.9% improvement in the classification accuracy compared to an uninformed selection of hyperparameters. Finally, we deploy our framework on a real-world application, thereby demonstrating its validity and practical utility in various settings. The code is available at https://github.com/sowonjeong/lobstur-graph-bootstrap.
中文: LOBSTUR-GNN框架通过局部引导重采样和典型相关分析,解决了无监督图神经网络对超参数敏感的问题,相比随机参数选择实现了65.9%的分类准确率提升。
English: The LOBSTUR-GNN framework addresses hyperparameter sensitivity in unsupervised graph neural networks by employing local bootstrap resampling and canonical correlation analysis for robust model evaluation, achieving 65.9% higher classification accuracy than random parameter selection.

Authors:Daniya Najiha A. Kareem, Jean Lahoud, Mustansar Fiaz, Amandeep Kumar, Hisham Cholakkal
Title: Open-Set Semi-Supervised Learning for Long-Tailed Medical Datasets
Abstract:
Many practical medical imaging scenarios include categories that are under-represented but still crucial. The relevance of image recognition models to real-world applications lies in their ability to generalize to these rare classes as well as unseen classes. Real-world generalization requires taking into account the various complexities that can be encountered in the real world. First, training data is highly imbalanced, which may lead to the model exhibiting bias toward the more frequently represented classes. Moreover, real-world data may contain unseen classes that need to be identified, and model performance is affected by data scarcity. While medical image recognition has been extensively addressed in the literature, current methods do not take into account all the intricacies of real-world scenarios. To this end, we propose an open-set learning method for highly imbalanced medical datasets using a semi-supervised approach. Understanding the adverse impact of the long-tail distribution on the inherent model characteristics, we implement a regularization strategy at the feature level complemented by a classifier normalization technique. We conduct extensive experiments on the publicly available datasets ISIC2018, ISIC2019, and TissueMNIST with various numbers of labelled samples. Our analysis shows that addressing the impact of long-tail data in classification significantly improves the overall performance of the network in terms of closed-set and open-set accuracies on all datasets. Our code and trained models will be made publicly available at https://github.com/Daniyanaj/OpenLTR.
中文摘要:本研究提出一种基于半监督学习的开放集识别方法,通过特征级正则化和分类器归一化技术解决医学影像中的长尾分布问题,在多个公开数据集上验证了该方法对闭集和开放集识别性能的显著提升。
English Summary: This study introduces an open-set learning method using semi-supervised techniques to address class imbalance and unseen class recognition in medical imaging, demonstrating improved performance across multiple datasets through feature regularization and classifier normalization.

Authors:Juan Nathaniel, Carla Roesch, Jatan Buch, Derek DeSantis, Adam Rupe, Kara Lamb, Pierre Gentine
Title: Deep Koopman operator framework for causal discovery in nonlinear dynamical systems
Abstract:
We use a deep Koopman operator-theoretic formalism to develop a novel causal discovery algorithm, Kausal. Causal discovery aims to identify cause-effect mechanisms for better scientific understanding, explainable decision-making, and more accurate modeling. Standard statistical frameworks, such as Granger causality, lack the ability to quantify causal relationships in nonlinear dynamics due to the presence of complex feedback mechanisms, timescale mixing, and nonstationarity. This presents a challenge in studying many real-world systems, such as the Earth's climate. Meanwhile, Koopman operator methods have emerged as a promising tool for approximating nonlinear dynamics in a linear space of observables. In Kausal, we propose to leverage this powerful idea for causal analysis where optimal observables are inferred using deep learning. Causal estimates are then evaluated in a reproducing kernel Hilbert space, and defined as the distance between the marginal dynamics of the effect and the joint dynamics of the cause-effect observables. Our numerical experiments demonstrate Kausal's superior ability in discovering and characterizing causal signals compared to existing approaches of prescribed observables. Lastly, we extend our analysis to observations of El Niño-Southern Oscillation highlighting our algorithm's applicability to real-world phenomena. Our code is available at https://github.com/juannat7/kausal.
中文摘要:Kausal算法采用深度Koopman算子方法,通过深度学习推断最优观测量并在再生核希尔伯特空间中评估因果关系,在非线性系统中展现出优于现有方法的因果发现能力,并成功应用于厄尔尼诺-南方涛动等实际气候现象分析。
English Summary: The Kausal algorithm employs a deep Koopman operator approach to discover causal relationships in nonlinear systems by learning optimal observables through deep learning and evaluating causality in a reproducing kernel Hilbert space, demonstrating superior performance in both simulations and real-world climate data analysis.

Authors:Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Felicia Körner, Ercong Nie, Barbara Plank, François Yvon, Hinrich Schütze
Title: Tracing Multilingual Factual Knowledge Acquisition in Pretraining
Abstract:
Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts -- an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities. We release our code and data to facilitate further research at https://github.com/cisnlp/multilingual-fact-tracing.
中文摘要:本研究追踪了OLMo-7B预训练过程中事实回忆与跨语言一致性的演变,发现提升主要受训练数据中事实频率驱动,而早期阶段的跨语言迁移特别有助于低频非英语事实的正确回忆。
English Summary: This study tracks the evolution of factual recall and crosslingual consistency during OLMo-7B's pretraining, revealing that improvements are primarily driven by fact frequency in training data, with crosslingual transfer from English particularly aiding low-frequency non-English facts in early stages.

Authors:Kushagra Gupta, Surya Murthy, Mustafa O. Karabag, Ufuk Topcu, David Fridovich-Keil
Title: Cooperative Bargaining Games Without Utilities: Mediated Solutions from Direction Oracles
Abstract:
Cooperative bargaining games are widely used to model resource allocation and conflict resolution. Traditional solutions assume the mediator can access agents' utility function values and gradients. However, there is an increasing number of settings, such as human-AI interactions, where utility values may be inaccessible or incomparable due to unknown, nonaffine transformations. To model such settings, we consider that the mediator has access only to agents' most preferred directions, i.e., normalized utility gradients in the decision space. To this end, we propose a cooperative bargaining algorithm where a mediator has access to only the direction oracle of each agent. We prove that unlike popular approaches such as the Nash and Kalai-Smorodinsky bargaining solutions, our approach is invariant to monotonic nonaffine transformations, and that under strong convexity and smoothness assumptions, this approach enjoys global asymptotic convergence to Pareto stationary solutions. Moreover, we show that the bargaining solutions found by our algorithm also satisfy the axioms of symmetry and (under slightly stronger conditions) independence of irrelevant alternatives, which are popular in the literature. Finally, we conduct experiments in two domains, multi-agent formation assignment and mediated stock portfolio allocation, which validate these theoretical results. All code for our experiments can be found at https://github.com/suryakmurthy/dibs_bargaining.
中文: 本研究提出了一种仅依赖代理人效用梯度方向信息的合作议价算法,该算法对单调非仿射变换具有不变性,并在特定凸性条件下实现全局收敛至帕累托稳态解。
English: This study introduces a cooperative bargaining algorithm that uses only directional oracles of agents' utility gradients, ensuring invariance to monotonic nonaffine transformations and achieving global convergence to Pareto stationary solutions under specific convexity conditions.

Authors:Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, Yu Cheng
Title: Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
Abstract:
Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.
Chinese: 本研究提出了MathIF基准,揭示大型语言模型在提升推理能力时常削弱其遵循指令的能力,凸显了二者间的矛盾,需开发更注重指令的推理模型。
English: The study introduces MathIF, a benchmark revealing that enhancing reasoning in large language models often compromises their ability to follow instructions, highlighting a trade-off that necessitates more instruction-aware models.

Authors:Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, Zongzhe Xu, Viktoriya Zhukova, David Asker, Ameet Talwalkar, Othmane Abou-Amal
Title: This Time is Different: An Observability Perspective on Time Series Foundation Models
Abstract:
We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto's pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10x larger than those of leading time series foundation models. Additionally, we introduce BOOM, a large-scale benchmark consisting of 350 million observations across 2,807 real-world time series. For both Toto and BOOM, we source observability data exclusively from Datadog's own telemetry and internal observability metrics. Extensive evaluations demonstrate that Toto achieves state-of-the-art performance on both BOOM and on established general purpose time series forecasting benchmarks. Toto's model weights, inference code, and evaluation scripts, as well as BOOM's data and evaluation code, are all available as open source under the Apache 2.0 License at https://huggingface.co/Datadog/Toto-Open-Base-1.0 and https://github.com/DataDog/toto.
中文: Toto是一个拥有1.51亿参数的时间序列预测基础模型,在可观测性和通用基准测试中均达到最优性能,其模型和评估代码已开源。
English: Toto is a 151-million-parameter time series forecasting foundation model that achieves state-of-the-art performance on observability and general benchmarks, with its model and evaluation code being open-sourced.

Authors:Chih-Yu Chang, Milad Azvar, Chinedum Okwudire, Raed Al Kontar
Title: LLINBO: Trustworthy LLM-in-the-Loop Bayesian Optimization
Abstract:
Bayesian optimization (BO) is a sequential decision-making tool widely used for optimizing expensive black-box functions. Recently, Large Language Models (LLMs) have shown remarkable adaptability in low-data regimes, making them promising tools for black-box optimization by leveraging contextual knowledge to propose high-quality query points. However, relying solely on LLMs as optimization agents introduces risks due to their lack of explicit surrogate modeling and calibrated uncertainty, as well as their inherently opaque internal mechanisms. This structural opacity makes it difficult to characterize or control the exploration-exploitation trade-off, ultimately undermining theoretical tractability and reliability. To address this, we propose LLINBO: LLM-in-the-Loop BO, a hybrid framework for BO that combines LLMs with statistical surrogate experts (e.g., Gaussian Processes (GP)). The core philosophy is to leverage contextual reasoning strengths of LLMs for early exploration, while relying on principled statistical models to guide efficient exploitation. Specifically, we introduce three mechanisms that enable this collaboration and establish their theoretical guarantees. We end the paper with a real-life proof-of-concept in the context of 3D printing. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/LLM-in-the-Loop-BO.
中文: 本文提出LLINBO混合框架,通过结合大语言模型的语境探索能力和高斯过程等统计代理模型的原理化开发,实现了贝叶斯优化的理论保障,并提供了3D打印的实际应用验证。
English: The paper introduces LLINBO, a hybrid Bayesian optimization framework that integrates Large Language Models for contextual exploration and statistical surrogate models like Gaussian Processes for principled exploitation, with theoretical guarantees and a 3D printing application.

Authors:Hong Huang, Dapeng Wu
Title: Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis
Abstract:
Large language models (LLMs) have made exciting achievements across various domains, yet their deployment on resource-constrained personal devices remains hindered by the prohibitive computational and memory demands of task-specific fine-tuning. While quantization offers a pathway to efficiency, existing methods struggle to balance performance and overhead, either incurring high computational/memory costs or failing to address activation outliers, a critical bottleneck in quantized fine-tuning. To address these challenges, we propose the Outlier Spatial Stability Hypothesis (OSSH): During fine-tuning, certain activation outlier channels retain stable spatial positions across training iterations. Building on OSSH, we propose Quaff, a Quantized parameter-efficient fine-tuning framework for LLMs, optimizing low-precision activation representations through targeted momentum scaling. Quaff dynamically suppresses outliers exclusively in invariant channels using lightweight operations, eliminating full-precision weight storage and global rescaling while reducing quantization errors. Extensive experiments across ten benchmarks validate OSSH and demonstrate Quaff's efficacy. Specifically, on the GPQA reasoning benchmark, Quaff achieves a 1.73x latency reduction and 30% memory savings over full-precision fine-tuning while improving accuracy by 0.6% on the Phi-3 model, reconciling the triple trade-off between efficiency, performance, and deployability. By enabling consumer-grade GPU fine-tuning (e.g., RTX 2080 Super) without sacrificing model utility, Quaff democratizes personalized LLM deployment. The code is available at https://github.com/Little0o0/Quaff.git.
中文: 提出的Quaff框架基于异常值空间稳定性假说,实现了大型语言模型的高效量化微调,在显著降低延迟和内存占用的同时保持或提升精度,从而推动模型在资源受限设备上的部署。
English: The proposed Quaff framework leverages the Outlier Spatial Stability Hypothesis to enable efficient quantized fine-tuning of large language models, achieving significant latency reduction and memory savings while maintaining or improving accuracy, thus facilitating deployment on resource-constrained devices.

Authors:Xu Yang, Xiao Yang, Shikai Fang, Bowen Xian, Yuante Li, Jian Wang, Minrui Xu, Haoran Pan, Xinpeng Hong, Weiqing Liu, Yelong Shen, Weizhu Chen, Jiang Bian
Title: R&D-Agent: Automating Data-Driven AI Solution Building Through LLM-Powered Automated Research, Development, and Evolution
Abstract:
Recent advances in AI and ML have transformed data science, yet increasing complexity and expertise requirements continue to hinder progress. While crowdsourcing platforms alleviate some challenges, high-level data science tasks remain labor-intensive and iterative. To overcome these limitations, we introduce R&D-Agent, a dual-agent framework for iterative exploration. The Researcher agent uses performance feedback to generate ideas, while the Developer agent refines code based on error feedback. By enabling multiple parallel exploration traces that merge and enhance one another, R&D-Agent narrows the gap between automated solutions and expert-level performance. Evaluated on MLE-Bench, R&D-Agent emerges as the top-performing machine learning engineering agent, demonstrating its potential to accelerate innovation and improve precision across diverse data science applications. We have open-sourced R&D-Agent on GitHub: https://github.com/microsoft/RD-Agent.
Chinese: R&D-Agent 是一个结构化框架,将机器学习工程从临时性工艺转变为规范化流程,在 MLE-Bench 上以 35.1% 的奖牌率实现顶尖性能,有效推动了数据科学领域的创新加速。
English: R&D-Agent is a structured framework that transforms machine learning engineering from an ad-hoc process into a principled workflow, achieving top performance with a 35.1% medal rate on MLE-Bench and accelerating innovation in data science.

Authors:Xu Yang, Xiao Yang, Shikai Fang, Yifei Zhang, Jian Wang, Bowen Xian, Qizheng Li, Jingyuan Li, Minrui Xu, Yuante Li, Haoran Pan, Yuge Zhang, Weiqing Liu, Yelong Shen, Weizhu Chen, Jiang Bian
Title: R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science
Abstract:
Recent advances in AI and ML have transformed data science, yet increasing complexity and expertise requirements continue to hinder progress. Although crowd-sourcing platforms alleviate some challenges, high-level machine learning engineering (MLE) tasks remain labor-intensive and iterative. We introduce R&D-Agent, a comprehensive, decoupled, and extensible framework that formalizes the MLE process. R&D-Agent defines the MLE workflow into two phases and six components, turning agent design for MLE from ad-hoc craftsmanship into a principled, testable process. Although several existing agents report promising gains on their chosen components, they can mostly be summarized as a partial optimization from our framework's simple baseline. Inspired by human experts, we designed efficient and effective agents within this framework that achieve state-of-the-art performance. Evaluated on MLE-Bench, the agent built on R&D-Agent ranks as the top-performing machine learning engineering agent, achieving 35.1% any medal rate, demonstrating the ability of the framework to speed up innovation and improve accuracy across a wide range of data science applications. We have open-sourced R&D-Agent on GitHub: https://github.com/microsoft/RD-Agent.
Chinese: R&D-Agent 是一个结构化框架,将机器学习工程从临时性工艺转变为规范化流程,在 MLE-Bench 上以 35.1% 的奖牌率实现顶尖性能,有效推动了数据科学领域的创新加速。
English: R&D-Agent is a structured framework that transforms machine learning engineering from an ad-hoc process into a principled workflow, achieving top performance with a 35.1% medal rate on MLE-Bench and accelerating innovation in data science.

Authors:Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu
Title: MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
Abstract:
The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has garnered significant attention due to their potential for energy-efficient and high-performance computing paradigms. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. While existing methods propose spiking self-attention mechanisms that are successfully combined with SNNs, the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting features from different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enhance the capabilities of spiking attention blocks. We validate our approach across several mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.
中文: 本文提出MSViT这一新型脉冲驱动Transformer架构,通过采用多尺度脉冲注意力机制有效解决了现有SNN变换器在图像多尺度特征提取上的瓶颈,在多个数据集上实现了最优性能。
English: This paper introduces MSViT, a novel spike-driven Transformer architecture that employs multi-scale spiking attention to overcome feature extraction limitations in existing SNN-based transformers, achieving state-of-the-art performance across multiple datasets.

Authors:Xigui Li, Yuanye Zhou, Feiyang Xiao, Xin Guo, Chen Jiang, Tan Pan, Xingmeng Zhang, Cenyu Liu, Zeyun Miao, Jianchao Ge, Xiansheng Wang, Qimeng Wang, Yichi Zhang, Wenbo Zhang, Fengping Zhu, Limei Han, Yuan Qi, Chensen Lin, Yuan Cheng
Title: Aneumo: A Large-Scale Multimodal Aneurysm Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks
Abstract:
Intracranial aneurysms (IAs) are serious cerebrovascular lesions found in approximately 5% of the general population. Their rupture may lead to high mortality. Current methods for assessing IA risk focus on morphological and patient-specific factors, but the hemodynamic influences on IA development and rupture remain unclear. While accurate for hemodynamic studies, conventional computational fluid dynamics (CFD) methods are computationally intensive, hindering their deployment in large-scale or real-time clinical applications. To address this challenge, we curated a large-scale, high-fidelity aneurysm CFD dataset to facilitate the development of efficient machine learning algorithms for such applications. Based on 427 real aneurysm geometries, we synthesized 10,660 3D shapes via controlled deformation to simulate aneurysm evolution. The authenticity of these synthetic shapes was confirmed by neurosurgeons. CFD computations were performed on each shape under eight steady-state mass flow conditions, generating a total of 85,280 blood flow dynamics data covering key parameters. Furthermore, the dataset includes segmentation masks, which can support tasks that use images, point clouds or other multimodal data as input. Additionally, we introduced a benchmark for estimating flow parameters to assess current modeling methods. This dataset aims to advance aneurysm research and promote data-driven approaches in biofluids, biomedical engineering, and clinical risk assessment. The code and dataset are available at: https://github.com/Xigui-Li/Aneumo.
Chinese: 本研究通过合成3D形状进行流体动力学模拟,构建了一个大规模颅内动脉瘤血流动力学数据集,旨在支持机器学习应用,以促进临床风险评估和生物流体研究的高效发展。
English: This study presents a large-scale hemodynamic dataset of intracranial aneurysms, created through computational fluid dynamics simulations on synthetic 3D shapes, to enable machine learning applications for efficient clinical risk assessment and biofluid research.

Authors:Tuan-Vinh La, Minh-Hieu Nguyen, Minh-Son Dao
Title: KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection
Abstract:
Fake news detection remains a challenging problem due to the complex interplay between textual misinformation, manipulated images, and external knowledge reasoning. While existing approaches have achieved notable results in verifying veracity and cross-modal consistency, two key challenges persist: (1) Existing methods often consider only the global image context while neglecting local object-level details, and (2) they fail to incorporate external knowledge and entity relationships for deeper semantic understanding. To address these challenges, we propose a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations. Our approach leverages bottom-up attention to capture fine-grained object details, CLIP for global image semantics, and RoBERTa for context-aware text encoding. We further enhance knowledge utilization by retrieving and adaptively selecting relevant entities from a knowledge graph. The fused multi-modal features are processed through a Transformer-based classifier to predict news veracity. Experimental results demonstrate that our model outperforms recent approaches, showcasing the effectiveness of the neighbor selection mechanism and multi-modal fusion for fake news detection. Our proposal introduces a new paradigm: knowledge-grounded multimodal reasoning. By integrating explicit entity-level selection and NLI-guided filtering, we shift fake news detection from feature fusion to semantically grounded verification. For reproducibility and further research, the source code is publicly available at https://github.com/latuanvinh1998/KGAlign.
中文: 本文提出了一种新颖的多模态假新闻检测框架,融合视觉、文本和知识表征,通过细粒度对象细节、全局图像语义和外部知识,利用基于Transformer的分类器超越现有方法。
English: This paper introduces a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations, leveraging fine-grained object details, global image semantics, and external knowledge to outperform existing methods through a Transformer-based classifier.

Authors:Xuan Shen, Weize Ma, Yufa Zhou, Enhao Tang, Yanyue Xie, Zhengang Li, Yifan Gong, Quanyi Wang, Henghui Ding, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu
Title: FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge
Abstract:
Auto-regressive (AR) models, initially successful in language generation, have recently shown promise in visual generation tasks due to their superior sampling efficiency. Unlike image generation, video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. Our key observations are: (i) MLP modules in the decode phase dominate the inference latency, and (ii) there exists high temporal redundancy in MLP outputs of adjacent frames. In this paper, we propose the FastCar framework to accelerate the decode phase for the AR video generation by exploring the temporal redundancy. The Temporal Attention Score (TAS) is proposed to determine whether to apply the replay strategy (i.e., reusing cached MLP outputs from the previous frame to reduce redundant computations) with detailed theoretical analysis and justification. Also, we develop a hardware accelerator on FPGA with Dynamic Resource Scheduling (DRS) based on TAS to enable better resource utilization and faster inference. Experimental results demonstrate the effectiveness of our method, which outperforms traditional sparse attention approaches with more than 2.1x decoding speedup and higher energy efficiency on the edge. Furthermore, by combining FastCar and sparse attention, FastCar can boost the performance of sparse attention with alleviated drifting, demonstrating our unique advantages for high-resolution and long-duration video generation. Code: https://github.com/shawnricecake/fast-car
中文摘要:FastCar框架通过利用相邻帧间MLP输出的时间冗余性,采用缓存复用策略和硬件加速技术,将自回归视频生成的解码速度提升2.1倍以上,同时显著提高能效并支持高分辨率长视频生成。
English Summary: The FastCar framework accelerates auto-regressive video generation by exploiting temporal redundancy in MLP outputs, achieving over 2.1x decoding speedup and higher energy efficiency through optimized caching strategies and FPGA hardware acceleration.
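
Illustrative sketch (not the authors' implementation): the replay strategy described in the abstract can be pictured as a wrapper that skips the MLP when the current frame looks enough like the cached one. The abstract does not define the Temporal Attention Score, so the gate below uses a plain cosine similarity between inputs as a hypothetical stand-in; `ReplayMLP`, `toy_mlp`, and the `threshold` value are likewise assumptions made only for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ReplayMLP:
    """Wraps an MLP and replays its cached output when the input barely changed.

    Assumption: the similarity gate stands in for the paper's TAS, which the
    abstract names but does not specify.
    """

    def __init__(self, mlp, threshold=0.95):
        self.mlp = mlp
        self.threshold = threshold
        self.cached_input = None
        self.cached_output = None
        self.replays = 0   # number of times the cache was reused
        self.computes = 0  # number of times the MLP actually ran

    def __call__(self, x):
        if (self.cached_input is not None
                and cosine_similarity(x, self.cached_input) >= self.threshold):
            self.replays += 1
            return self.cached_output
        out = self.mlp(x)
        self.cached_input, self.cached_output = list(x), out
        self.computes += 1
        return out


def toy_mlp(x):
    """Hypothetical stand-in for an expensive MLP block."""
    return [max(0.0, 2.0 * v - 1.0) for v in x]


layer = ReplayMLP(toy_mlp, threshold=0.99)
frames = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.01], [3.0, -1.0, 0.5]]
outputs = [layer(f) for f in frames]
```

In this toy run the second frame is nearly identical to the first, so its output is replayed from the cache, while the third frame differs enough to trigger a fresh computation.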

Authors:Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu
Title: DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance
Abstract:
Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency, and generating just 8 seconds of 720p video takes tens of minutes, posing serious challenges to practical application and scalability. To address this, we propose DraftAttention, a training-free framework for the acceleration of video diffusion transformers with dynamic sparse attention on GPUs. We apply down-sampling to each feature map across frames in the compressed latent space, enabling a higher-level receptive field over the latent composed of hundreds of thousands of tokens. The low-resolution draft attention map, derived from draft query and key, exposes redundancy both spatially within each feature map and temporally across frames. We reorder the query, key, and value based on the draft attention map to guide the sparse attention computation in full resolution, and subsequently restore their original order after the attention computation. This reordering enables structured sparsity that aligns with hardware-optimized execution. Our theoretical analysis demonstrates that the low-resolution draft attention closely approximates the full attention, providing reliable guidance for constructing accurate sparse attention. Experimental results show that our method outperforms existing sparse attention approaches in video generation quality and achieves up to 1.75x end-to-end speedup on GPUs. Code: https://github.com/shawnricecake/draft-attention
中文: 提出的DraftAttention框架通过GPU上的动态稀疏注意力加速视频扩散变换器,在保持生成质量的同时实现了高达1.75倍的端到端加速。
English: The proposed DraftAttention framework accelerates video diffusion transformers by using dynamic sparse attention on GPUs, achieving up to 1.75x speedup while maintaining generation quality.
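
Illustrative sketch (not the authors' implementation): the idea of letting a low-resolution draft attention map choose which blocks to attend to at full resolution can be shown in a few lines. Mean pooling as the down-sampling operator, the top-`keep` selection rule, and all function names are assumptions for this example; the token reordering that the abstract describes for hardware-friendly layout is omitted.

```python
import math


def mean_pool(rows, block):
    """Average each run of `block` consecutive vectors into one draft vector."""
    pooled = []
    for start in range(0, len(rows), block):
        group = rows[start:start + block]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def draft_block_mask(queries, keys, block, keep):
    """For each query block, keep the `keep` key blocks with the highest draft score."""
    q_draft, k_draft = mean_pool(queries, block), mean_pool(keys, block)
    mask = []
    for q in q_draft:
        scores = [dot(q, k) for k in k_draft]
        top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:keep]
        mask.append(sorted(top))
    return mask


def sparse_attention(queries, keys, values, block, keep):
    """Full-resolution softmax attention restricted to the draft-selected key blocks."""
    mask = draft_block_mask(queries, keys, block, keep)
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for i, q in enumerate(queries):
        # Expand the selected key blocks back into token indices.
        cols = [j for b in mask[i // block]
                for j in range(b * block, min((b + 1) * block, len(keys)))]
        logits = [dot(q, keys[j]) * scale for j in cols]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        out.append([sum(w * values[j][d] for w, j in zip(weights, cols)) / total
                    for d in range(len(values[0]))])
    return out


# Four toy tokens forming two obvious clusters of two tokens each.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mask = draft_block_mask(tokens, tokens, block=2, keep=1)
output = sparse_attention(tokens, tokens, tokens, block=2, keep=1)
```

With `keep` equal to the number of key blocks, the function reduces to dense attention; smaller values trade accuracy for fewer query-key products.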

Authors:Xiaojie Gu, Guangxu Chen, Jungang Li, Jia-Chen Gu, Xuming Hu, Kai Zhang
Title: UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Large Language Models
Abstract:
Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model's internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose ULTRAEDIT, a fundamentally new editing solution that is training-, subject-, and memory-free, making it particularly well-suited for ultra-scalable, real-world lifelong model editing. ULTRAEDIT performs editing through a self-contained process that relies solely on lightweight linear algebra operations to compute parameter shifts, enabling fast and consistent parameter modifications with minimal overhead. To improve scalability in lifelong settings, ULTRAEDIT employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. ULTRAEDIT achieves editing speeds over 7x faster than the previous state-of-the-art method (which was also the fastest known approach) while consuming less than 1/3 the VRAM, making it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct ULTRAEDITBENCH, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 1M edits while maintaining high accuracy. Comprehensive experiments on four datasets and six models show that ULTRAEDIT consistently achieves superior performance across diverse model editing scenarios. Our code is available at: https://github.com/XiaojieGu/UltraEdit.
中文: UltraEdit提出了一种高效且可扩展的终身模型编辑方法,通过单步参数调整和持续归一化策略,在显著提升编辑速度的同时降低内存消耗,使大型语言模型能在消费级硬件上实现大规模知识更新。
English: UltraEdit introduces a highly efficient and scalable lifelong model editing approach that significantly accelerates editing speeds while reducing memory usage, enabling extensive updates to large language models on consumer-grade hardware.

Authors:Xiaojie Gu, Ziying Huang, Jia-Chen Gu, Kai Zhang
Title: UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models
Abstract:
Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model's internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose UltraEdit, a training-, subject-, and memory-free approach that is well-suited for ultra-scalable, real-world lifelong model editing. UltraEdit fundamentally differs from traditional paradigms by computing parameter shifts in one step using only a hidden state and its gradient, making the approach simple yet efficient. To improve scalability in lifelong settings, UltraEdit employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. UltraEdit achieves editing speeds over 7x faster than the previous state-of-the-art method, which was also the fastest known approach, while using less than 1/4 the VRAM. This makes it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct UltraEditBench, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 2M edits while maintaining high accuracy. Comprehensive experiments on five datasets and six models show that UltraEdit consistently achieves superior performance across diverse model editing scenarios, taking a further step towards safe and scalable lifelong learning. Our code is available at: https://github.com/XiaojieGu/UltraEdit
中文: UltraEdit提出了一种高效且可扩展的终身模型编辑方法,通过单步参数调整和持续归一化策略,在显著提升编辑速度的同时降低内存消耗,使大型语言模型能在消费级硬件上实现大规模知识更新。
English: UltraEdit introduces a highly efficient and scalable lifelong model editing approach that significantly accelerates editing speeds while reducing memory usage, enabling extensive updates to large language models on consumer-grade hardware.

Authors:Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, Bocheng Zou, Chaoqun Yang, Wentao Zhang
Title: UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens
Abstract:
Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations for generating images with complex prompts. An example is generating "$\langle bo\rangle$ wearing its hat" from the concept $\langle bo\rangle$ without additional textual descriptions of its hat. We call this kind of generation personalized knowledge-driven generation. To address the limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting two personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation to enhance mutual benefits between both tasks. To quantitatively evaluate the unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench indicate that UniCTokens shows competitive performance compared to leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding. Our code and dataset will be released at: https://github.com/arctanxarc/UniCTokens.
中文: UniCTokens提出了一种统一框架,将个性化概念令牌整合到视觉语言模型中,以同时提升理解和生成任务,在个性化知识驱动生成方面实现了领先性能。
English: UniCTokens introduces a unified framework that integrates personalized concept tokens within a vision-language model to enhance both understanding and generation tasks, achieving state-of-the-art performance in personalized knowledge-driven generation.

Authors:Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh
Title: Quartet: Native FP4 Training Can Be Optimal for Large Language Models
Abstract:
Training large language models (LLMs) directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. For those purposes, NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
中文摘要:本文提出Quartet方法,通过硬件支持的FP4精度实现大语言模型的端到端训练,在保持与FP16和FP8相当性能的同时显著提升计算效率。
English Summary: This paper introduces Quartet, a hardware-supported method for accurate end-to-end FP4 training of large language models, enabling competitive performance with FP16 and FP8 while improving computational efficiency.

Authors:Yilin Ye, Junchao Huang, Xingchen Zeng, Jiazhi Xia, Wei Zeng
Title: AKRMap: Adaptive Kernel Regression for Trustworthy Visualization of Cross-Modal Embeddings
Abstract:
Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods primarily focus on feature distributions within a single modality, whilst failing to incorporate metrics (e.g., CLIPScore) across multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models. Code and demo are available at https://github.com/yilinye/AKRMap.
Chinese Summary: 本文提出AKRMap这一新型降维技术,通过学习投影空间中度量景观的核回归,能够更准确地可视化跨模态嵌入,在定量实验中超越了PCA和t-SNE等传统方法。
English Summary: This paper introduces AKRMap, a novel dimensionality reduction technique that visualizes cross-modal embeddings more accurately by learning kernel regression of the metric landscape, outperforming traditional methods like PCA and t-SNE.

Authors:Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin
Title: Beyond Words: Multimodal LLM Knows When to Speak
Abstract:
While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.
中文: 本研究提出MM-When2Speak多模态大语言模型,通过整合对齐的视觉、听觉和文本数据来预测最佳回应时机与类型,在对话响应时间准确性上相比现有模型提升高达四倍。
English: This study introduces MM-When2Speak, a multimodal LLM that leverages aligned visual, auditory, and textual data to enhance conversational AI by accurately predicting optimal response timing and type, achieving up to four times better timing accuracy than existing models.

Authors:Tiantian Feng, Jihwan Lee, Anfeng Xu, Yoonjeong Lee, Thanathai Lertpetchpun, Xuan Shi, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Dani Byrd, Najim Dehak, Shrikanth Narayanan
Title: Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits
Abstract:
We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and multi-dimensional profiles that reflect both static speaker traits (e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech flow). This benchmark is grounded in speech science and linguistics, developed with domain experts to accurately index speaker and speech characteristics. We report benchmark experiments using over 15 publicly available speech datasets and several widely used speech foundation models that target various static and dynamic speaker and speech properties. In addition to benchmark experiments, we showcase several downstream applications supported by Vox-Profile. First, we show that Vox-Profile can augment existing speech recognition datasets to analyze ASR performance variability. Vox-Profile is also used as a tool to evaluate the performance of speech generation systems. Finally, we assess the quality of our automated profiles through comparison with human evaluation and show convergent validity. Vox-Profile is publicly available at: https://github.com/tiantiaf0627/vox-profile-release.
Chinese: Vox-Profile是一个基于语音基础模型的综合性基准,能够全面表征静态说话人特征和动态语音特性,并通过多数据集实验及语音识别分析与生成系统评估等下游应用验证其有效性。
English: Vox-Profile is a comprehensive benchmark that uses speech foundation models to holistically characterize both static speaker traits and dynamic speech properties, validated through experiments on multiple datasets and applications including ASR analysis and speech generation evaluation.

Authors:Anna C. Doris, Md Ferdous Alam, Amin Heyrani Nobari, Faez Ahmed
Title: CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation
Abstract:
Efficient creation of accurate and editable 3D CAD models is critical in engineering design, significantly impacting cost and time-to-market in product innovation. Current manual workflows remain highly time-consuming and demand extensive user expertise. While recent developments in AI-driven CAD generation show promise, existing models are limited by incomplete representations of CAD operations, inability to generalize to real-world images, and low output accuracy. This paper introduces CAD-Coder, an open-source Vision-Language Model (VLM) explicitly fine-tuned to generate editable CAD code (CadQuery Python) directly from visual input. Leveraging a novel dataset that we created, GenCAD-Code, consisting of over 163k CAD-model image and code pairs, CAD-Coder outperforms state-of-the-art VLM baselines such as GPT-4.5 and Qwen2.5-VL-72B, achieving a 100% valid syntax rate and the highest accuracy in 3D solid similarity. Notably, our VLM demonstrates some signs of generalizability, successfully generating CAD code from real-world images and executing CAD operations unseen during fine-tuning. The performance and adaptability of CAD-Coder highlight the potential of VLMs fine-tuned on code to streamline CAD workflows for engineers and designers. CAD-Coder is publicly available at: https://github.com/anniedoris/CAD-Coder.
中文:本文介绍了CAD-Coder这一视觉语言模型,它能从视觉输入生成可编辑的CAD代码,在语法有效性和三维相似度上超越现有模型,并展现出对真实世界图像的泛化能力。
English: This paper introduces CAD-Coder, a vision-language model that generates editable CAD code from visual inputs, outperforming existing models with perfect syntax validity and high 3D similarity while showing generalization to real-world images.

Authors:Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
Title: Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
Abstract:
Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
中文: 随着AI模型通过“对齐伪装”等新方法规避检测,识别风险愈发困难,因此我们开发了LitmusValues评估流程,通过分析AI在价值困境中的优先选择来预测其潜在危险行为。
English: As AI models evolve with tactics like Alignment Faking, detecting risks grows more difficult, prompting the development of LitmusValues, an evaluation pipeline that identifies AI value priorities to predict risky behaviors through dilemmas and real-world benchmarks.

Authors:Fnu Mohbat, Mohammed J Zaki
Title: KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models
Abstract:
Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Despite several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food-related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generate recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities and retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe-related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at https://github.com/mohbattharani/KERL.
中文摘要:KERL系统创新性地将食物知识图谱与大语言模型结合,通过个性化推荐、菜谱生成和营养分析功能,经实验验证其性能显著优于现有方法,提供了完整的饮食解决方案。
English Summary: KERL is a novel system that integrates food knowledge graphs with large language models to deliver personalized food recommendations, generate detailed recipes, and provide nutritional analysis, outperforming existing methods through comprehensive evaluation.

Authors:Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
Title: TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning
Abstract:
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem, false negatives, where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments existing rule-based methods by dynamically identifying potential false negatives and recovering valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL-based fine-tuning of LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
中文摘要:本文揭示了强化学习中验证器错误否定正确模型输出的普遍问题,并提出轻量级验证器TinyV来动态识别潜在误判,从而提升模型训练效果与收敛速度。
English Summary: This paper identifies the problem of false negatives in verifiers used for reinforcement learning of large language models, where correct answers are incorrectly rejected, and proposes TinyV, a lightweight verifier that mitigates this issue to improve training efficiency and accuracy.
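
A minimal sketch of what a verifier false negative looks like, and how a second-stage check can recover it. This is an illustration under stated assumptions, not TinyV itself: the rule-based verifier here is a plain string match, and `fallback_verify` is a hypothetical stand-in that uses exact rational arithmetic where TinyV would query a small LLM.

```python
from fractions import Fraction

def rule_based_verify(answer: str, reference: str) -> bool:
    """Strict rule: accept only an exact string match after trimming whitespace."""
    return answer.strip() == reference.strip()

def fallback_verify(answer: str, reference: str) -> bool:
    """Hypothetical stand-in for an LLM-based verifier: numeric equivalence."""
    try:
        return Fraction(answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Non-numeric or malformed answers are treated as not equivalent.
        return False

def reward(answer: str, reference: str) -> float:
    """Binary reward; the fallback runs only when the rule-based check rejects."""
    if rule_based_verify(answer, reference):
        return 1.0
    return 1.0 if fallback_verify(answer, reference) else 0.0

# "0.5" is correct for the reference "1/2", yet the string rule rejects it.
false_negative = not rule_based_verify("0.5", "1/2")
recovered_reward = reward("0.5", "1/2")
```

Running the fallback only on rejected answers keeps the extra cost proportional to the number of rejections, which is one plausible reason to layer a learned verifier on top of rules rather than replace them.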

Authors:Anjiang Wei, Yuheng Wu, Yingjia Wan, Tarun Suresh, Huanmi Tan, Zhanke Zhou, Sanmi Koyejo, Ke Wang, Alex Aiken
Title: SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas
Abstract:
We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a puzzle using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-based and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. Our error analysis reveals systematic failures such as satisfiability bias, context inconsistency, and condition omission, highlighting limitations of current LLMs in search-based logical reasoning. Our code and data are publicly available at https://github.com/Anjiang-Wei/SATBench
中文: SATBench是一个通过基于布尔可满足性问题生成的逻辑谜题来评估大语言模型逻辑推理能力的基准,揭示了模型在搜索式推理中的显著局限性和系统性错误。
English: SATBench is a benchmark that assesses the logical reasoning of large language models using SAT-derived puzzles, revealing significant limitations in search-based reasoning and systematic errors like satisfiability bias.
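
To make the underlying task concrete, the sketch below checks satisfiability of a small CNF formula by exhaustive search. It only illustrates the SAT/UNSAT distinction the puzzles are derived from; the clause encoding (DIMACS-style signed integers) and the function name `solve_cnf` are assumptions for this example, and the benchmark's own generation and validation pipeline is not reproduced here.

```python
from itertools import product

def solve_cnf(clauses, n_vars):
    """Return a satisfying assignment {var: bool} or None if the formula is UNSAT.

    Each clause is a list of non-zero ints: k means variable k is true,
    -k means variable k is false.
    """
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: bit for i, bit in enumerate(bits)}
        # A formula is satisfied when every clause has at least one true literal.
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3): satisfiable.
sat_formula = [[1, 2], [-1, 3], [-2, -3]]
# (x1) and (not x1): unsatisfiable.
unsat_formula = [[1], [-1]]

model = solve_cnf(sat_formula, n_vars=3)
no_model = solve_cnf(unsat_formula, n_vars=1)
```

Exhaustive search is exponential in the number of variables, so it is only practical for tiny formulas. Showing UNSAT means ruling out every assignment, whereas showing SAT needs just one witness.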

Authors:Maksim Zhdanov, Vladislav Kurenkov
Title: Electrostatics from Laplacian Eigenbasis for Neural Network Interatomic Potentials
Abstract:
Recent advances in neural network interatomic potentials have emerged as a promising research direction. However, popular deep learning models often lack auxiliary constraints grounded in physical laws, which could accelerate training and improve fidelity through physics-based regularization. In this work, we introduce $Φ$-Module, a universal plugin module that enforces Poisson's equation within the message-passing framework to learn electrostatic interactions in a self-supervised manner. Specifically, each atom-wise representation is encouraged to satisfy a discretized Poisson's equation, making it possible to acquire a potential $\boldsymbol{\phi}$ and a corresponding charge density $\boldsymbol{\rho}$ linked to the learnable Laplacian eigenbasis coefficients of a given molecular graph. We then derive an electrostatic energy term, crucial for improved total energy predictions. This approach integrates seamlessly into any existing neural potential with insignificant computational overhead. Experiments on the OE62 and MD22 benchmarks confirm that models combined with $Φ$-Module achieve robust improvements over baseline counterparts. For OE62, error reduction ranges from 4.5% to 17.8%, and for MD22, the baseline equipped with $Φ$-Module achieves the best results in 5 out of 14 cases. Our results underscore how embedding a first-principles constraint in neural interatomic potentials can significantly improve performance while remaining hyperparameter-friendly, memory-efficient and lightweight in training. Code will be available at https://github.com/dunnolab/phi-module.
中文:$Φ$-模块是一种通用插件,通过在神经网络原子间势中强制执行泊松方程来自监督学习静电相互作用,以最小的计算开销在基准测试中显著提升了性能。
English: The $Φ$-Module is a universal plugin that enforces Poisson's equation in neural network interatomic potentials to learn electrostatic interactions in a self-supervised manner, improving performance on benchmarks with minimal computational overhead.
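
The link between a potential, a charge density, and Laplacian eigenbasis coefficients can be sketched numerically. The example below is a toy illustration, not the paper's module: it assumes a path graph in place of a molecular graph, fixed rather than learned coefficients, the graph-Laplacian form of Poisson's equation $L\boldsymbol{\phi} = \boldsymbol{\rho}$ with physical constants absorbed, and the energy $\tfrac{1}{2}\boldsymbol{\rho}^\top\boldsymbol{\phi}$. All function names are hypothetical.

```python
import numpy as np

def path_graph_laplacian(n):
    """Combinatorial Laplacian L = D - A of a path graph with n nodes."""
    adj = np.zeros((n, n))
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return np.diag(adj.sum(axis=1)) - adj

def potential_and_charge(laplacian, coeffs):
    """Expand coefficients in the Laplacian eigenbasis into (phi, rho)."""
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    phi = eigvecs @ coeffs              # potential at each node
    rho = eigvecs @ (eigvals * coeffs)  # L @ phi, evaluated in the eigenbasis
    return phi, rho

def electrostatic_energy(phi, rho):
    """Toy electrostatic energy 0.5 * rho^T phi (constants absorbed)."""
    return 0.5 * float(rho @ phi)

laplacian = path_graph_laplacian(5)
coeffs = np.array([0.0, 1.0, -0.5, 0.25, 2.0])  # illustrative, not learned
phi, rho = potential_and_charge(laplacian, coeffs)
energy = electrostatic_energy(phi, rho)
residual = np.abs(laplacian @ phi - rho).max()  # discretized Poisson residual
```

Because the eigenvectors diagonalize the Laplacian, the Poisson constraint holds by construction in this parameterization, and the energy reduces to a weighted sum of squared coefficients.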

Authors:Haoran Zhao, Yuchen Yan, Yongliang Shen, Haolei Xu, Wenqi Zhang, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang
Title: Let LLMs Break Free from Overthinking via Self-Braking Tuning
Abstract:
Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions. In this paper, we propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process, thus eliminating the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within the reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to naturally learn when to terminate reasoning at an appropriate point. Experiments across mathematical benchmarks (AIME, AMC, MATH500, GSM8K) demonstrate that our method reduces token consumption by up to 60% while maintaining comparable accuracy to unconstrained models.
中文: 大型推理模型(如OpenAI o1和DeepSeek-R1)通过生成长推理链提升了性能,但冗余推理导致计算成本高昂;我们提出的自制动调优框架使模型能自主调控推理过程,在保持精度的同时将令牌消耗降低高达60%。
English: Large reasoning models like OpenAI o1 and DeepSeek-R1 achieve strong performance through extended reasoning chains but suffer from computational inefficiency due to redundant steps, which our Self-Braking Tuning method addresses by enabling models to self-regulate reasoning length, cutting token use by up to 60% while preserving accuracy.

Authors:Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang
Title: Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models
Abstract:
Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful scientific hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.
Chinese: 大语言模型在生成科学假设方面展现出潜力,但因幻觉问题面临真实性挑战,为此开发了TruthHypo基准和KnowHD检测器,有效评估并筛选出准确假设。
English: Large language models show promise in generating scientific hypotheses but face challenges in truthfulness due to hallucinations, leading to the development of TruthHypo benchmark and KnowHD detector to evaluate and filter accurate hypotheses effectively.

Authors:Xianzhen Luo, Qingfu Zhu, Zhiming Zhang, Mingzheng Xu, Tianhao Cheng, Yixuan Wang, Zheng Chu, Shijie Xuyang, Zhiyuan Ma, YuanTao Fan, Wanxiang Che
Title: Success is in the Details: Evaluate and Enhance Details Sensitivity of Code LLMs through Counterfactuals
Abstract:
Code Sensitivity refers to the ability of Code LLMs to recognize and respond to detail changes in problem descriptions. While current code benchmarks and instruction data focus on difficulty and diversity, sensitivity is overlooked. We first introduce the CTF-Code benchmark, constructed using counterfactual perturbations, minimizing input changes while maximizing output changes. The evaluation shows that many LLMs have a more than 10% performance drop compared to the original problems. To fully utilize sensitivity, CTF-Instruct, an incremental instruction fine-tuning framework, extends existing data and uses a selection mechanism to meet the three dimensions of difficulty, diversity, and sensitivity. Experiments show that LLMs fine-tuned with CTF-Instruct data achieve over a 2% improvement on CTF-Code, and more than a 10% performance boost on LiveCodeBench, validating the feasibility of enhancing LLMs' sensitivity to improve performance.
中文摘要:代码敏感性指代码大语言模型识别问题描述细微变化的能力,常被现有基准忽略;为此提出的CTF-Code基准和CTF-Instruct框架有效提升了模型敏感性,实验证明该方法显著提高了模型性能。
English Summary: Code sensitivity, the ability of Code LLMs to detect subtle changes in problem descriptions, is often overlooked in benchmarks, so the CTF-Code benchmark and CTF-Instruct framework were developed to enhance this sensitivity, resulting in significant performance improvements.

Authors:Isabella Degen, Zahraa S Abdallah, Henry W J Reeve, Kate Robson Brown
Title: CSTS: A Benchmark for the Discovery of Correlation Structures in Time Series Clustering
Abstract:
Time series clustering promises to uncover hidden structural patterns in data with applications across healthcare, finance, industrial systems, and other critical domains. However, without validated ground truth information, researchers cannot objectively assess clustering quality or determine whether poor results stem from absent structures in the data, algorithmic limitations, or inappropriate validation methods, raising the question whether clustering is "more art than science" (Guyon et al., 2009). To address these challenges, we introduce CSTS (Correlation Structures in Time Series), a synthetic benchmark for evaluating the discovery of correlation structures in multivariate time series data. CSTS provides a clean benchmark that enables researchers to isolate and identify specific causes of clustering failures by differentiating between correlation structure deterioration and limitations of clustering algorithms and validation methods. Our contributions are: (1) a comprehensive benchmark for correlation structure discovery with distinct correlation structures, systematically varied data conditions, established performance thresholds, and recommended evaluation protocols; (2) empirical validation of correlation structure preservation showing moderate distortion from downsampling and minimal effects from distribution shifts and sparsification; and (3) an extensible data generation framework enabling structure-first clustering evaluation. A case study demonstrates CSTS's practical utility by identifying an algorithm's previously undocumented sensitivity to non-normal distributions, illustrating how the benchmark enables precise diagnosis of methodological limitations. CSTS advances rigorous evaluation standards for correlation-based time series clustering.
中文:时间序列聚类旨在揭示隐藏模式,但缺乏真实基准难以评估,因此我们开发了CSTS合成基准,它能区分相关结构退化与算法局限,从而精确诊断方法缺陷。
English: Time series clustering aims to reveal hidden patterns but faces evaluation challenges without ground truth, prompting the development of CSTS, a synthetic benchmark that isolates clustering failures and enables precise diagnosis of methodological limitations.

Authors:Lei Li, Xiao Zhou, Zheng Liu
Title: R2MED: A Benchmark for Reasoning-Driven Medical Retrieval
Abstract:
Current medical retrieval benchmarks primarily emphasize lexical or shallow semantic similarity, overlooking the reasoning-intensive demands that are central to clinical decision-making. In practice, physicians often retrieve authoritative medical evidence to support diagnostic hypotheses. Such evidence typically aligns with an inferred diagnosis rather than the surface form of a patient's symptoms, leading to low lexical or semantic overlap between queries and relevant documents. To address this gap, we introduce R2MED, the first benchmark explicitly designed for reasoning-driven medical retrieval. It comprises 876 queries spanning three tasks: Q&A reference retrieval, clinical evidence retrieval, and clinical case retrieval. These tasks are drawn from five representative medical scenarios and twelve body systems, capturing the complexity and diversity of real-world medical information needs. We evaluate 15 widely-used retrieval systems on R2MED and find that even the best model achieves only 31.4 nDCG@10, demonstrating the benchmark's difficulty. Classical re-ranking and generation-augmented retrieval methods offer only modest improvements. Although large reasoning models improve performance via intermediate inference generation, the best results still peak at 41.4 nDCG@10. These findings underscore a substantial gap between current retrieval techniques and the reasoning demands of real clinical tasks. We release R2MED as a challenging benchmark to foster the development of next-generation medical retrieval systems with enhanced reasoning capabilities. Data and code are available at https://github.com/R2MED/R2MED
中文摘要:R2MED是首个专为推理驱动医疗检索设计的基准,旨在弥补现有系统忽视临床决策复杂性的不足,其评估结果显示即使最先进模型也面临显著性能挑战。
English Summary: R2MED is the first benchmark designed for reasoning-driven medical retrieval, addressing the gap in current systems that overlook clinical decision-making's complexity, and it reveals significant performance challenges even for advanced models.
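
The nDCG@10 figures quoted in the abstract can be read against a minimal sketch of the metric. This is an illustration only, not the benchmark's evaluation code: the function names and the `qrels` dictionary (document ID to graded relevance) are assumptions, and the linear-gain variant is used, which coincides with the exponential-gain variant when relevance is binary. Scores such as 31.4 are this value multiplied by 100.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (rank 1 has discount log2(2))."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_doc_ids: document IDs in the order the retriever returned them.
    qrels: hypothetical dict mapping every judged document ID to its relevance.
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

if __name__ == "__main__":
    qrels = {"d1": 1, "d2": 1}
    print(ndcg_at_k(["d3", "d1", "d2"], qrels))
```

Taking the ideal ranking from all judged documents, rather than only the retrieved ones, keeps a system from scoring perfectly when it misses relevant evidence entirely.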

Authors:Chuanbo Tang, Zhuoyuan Li, Yifan Bian, Li Li, Dong Liu
Title: Neural Video Compression with Context Modulation
Abstract:
Efficient video coding is highly dependent on exploiting the temporal redundancy, which is usually achieved by extracting and leveraging the temporal context in the emerging conditional coding-based neural video codec (NVC). Although the latest NVC has achieved remarkable progress in improving the compression performance, the inherent temporal context propagation mechanism lacks the ability to sufficiently leverage the reference information, limiting further improvement. In this paper, we address the limitation by modulating the temporal context with the reference frame in two steps. Specifically, we first propose the flow orientation to mine the inter-correlation between the reference frame and prediction frame for generating the additional oriented temporal context. Moreover, we introduce the context compensation to leverage the oriented context to modulate the propagated temporal context generated from the propagated reference feature. Through the synergy mechanism and decoupling loss supervision, the irrelevant propagated information can be effectively eliminated to ensure better context modeling. Experimental results demonstrate that our codec achieves on average 22.7% bitrate reduction over the advanced traditional video codec H.266/VVC, and offers an average 10.1% bitrate saving over the previous state-of-the-art NVC DCVC-FM. The code is available at https://github.com/Austin4USTC/DCMVC.
中文: 本文提出了一种通过流定向和上下文补偿两步调制方法,有效提升神经视频编码器中时间上下文的利用效率,相比H.266/VVC和现有先进方法实现了显著的码率节省。
English: This paper introduces a two-step modulation method using flow orientation and context compensation to enhance temporal context utilization in neural video codecs, achieving significant bitrate reductions compared to H.266/VVC and prior NVC methods.

Authors:Yuxuan Wang, Xuanyu Yi, Qingshan Xu, Yuan Zhou, Long Chen, Hanwang Zhang
Title: Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image
Abstract:
Personalizing 3D scenes from a single reference image enables intuitive user-guided editing, which requires achieving both multi-view consistency across perspectives and referential consistency with the input image. However, these goals are particularly challenging due to the viewpoint bias caused by the limited perspective provided in a single image. Lacking the mechanisms to effectively expand reference information beyond the original view, existing methods of image-conditioned 3DGS personalization often suffer from this viewpoint bias and struggle to produce consistent results. Therefore, in this paper, we present Consistent Personalization for 3D Gaussian Splatting (CP-GS), a framework that progressively propagates the single-view reference appearance to novel perspectives. In particular, CP-GS integrates pre-trained image-to-3D generation and iterative LoRA fine-tuning to extract and extend the reference appearance, and finally produces faithful multi-view guidance images and the personalized 3DGS outputs through a view-consistent generation process guided by geometric cues. Extensive experiments on real-world scenes show that our CP-GS effectively mitigates the viewpoint bias, achieving high-quality personalization that significantly outperforms existing methods. The code will be released at https://github.com/Yuxuan-W/CP-GS.
中文:本文提出的CP-GS框架通过几何引导的渐进式外观传播和迭代微调,有效解决了单图像3D场景个性化中的视角偏差问题,在保持多视角一致性方面显著优于现有方法。
English: This paper introduces CP-GS, a framework that overcomes viewpoint bias in single-image 3D scene personalization by progressively propagating reference appearance to novel views using geometric guidance and iterative fine-tuning, achieving superior multi-view consistency compared to existing methods.

Authors:Guillaume Vray, Devavrat Tomar, Xufeng Gao, Jean-Philippe Thiran, Evan Shelhamer, Behzad Bozorgtabar
Title: ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains
Abstract:
This paper introduces ReservoirTTA, a novel plug-in framework designed for prolonged test-time adaptation (TTA) in scenarios where the test domain continuously shifts over time, including cases where domains recur or evolve gradually. At its core, ReservoirTTA maintains a reservoir of domain-specialized models -- an adaptive test-time model ensemble -- that both detects new domains via online clustering over style features of incoming samples and routes each sample to the appropriate specialized model, and thereby enables domain-specific adaptation. This multi-model strategy overcomes key limitations of single model adaptation, such as catastrophic forgetting, inter-domain interference, and error accumulation, ensuring robust and stable performance on sustained non-stationary test distributions. Our theoretical analysis reveals key components that bound parameter variance and prevent model collapse, while our plug-in TTA module mitigates catastrophic forgetting of previously encountered domains. Extensive experiments on scene-level corruption benchmarks (ImageNet-C, CIFAR-10/100-C), object-level style shifts (DomainNet-126, PACS), and semantic segmentation (Cityscapes->ACDC), covering recurring and continuously evolving domain shifts, show that ReservoirTTA substantially improves adaptation accuracy and maintains stable performance across prolonged, recurring shifts, outperforming state-of-the-art methods. Our code is publicly available at https://github.com/LTS5/ReservoirTTA.
中文: ReservoirTTA是一种新颖的插件框架,通过维护领域专用模型库来应对持续变化的测试领域,既能检测新领域又能实现领域特定适配,在长期域偏移场景中显著提升了适应精度并保持稳定性能。
English: ReservoirTTA is a plug-in framework that uses an adaptive ensemble of domain-specialized models to enable robust and stable test-time adaptation for continuously shifting domains, overcoming issues like catastrophic forgetting and outperforming existing methods across multiple benchmarks.
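
The detect-and-route idea described in the abstract can be sketched with a toy online clustering step. This is a simplified illustration under stated assumptions, not the authors' implementation: the class name `DomainRouter`, the distance threshold, the Euclidean metric, and the running-mean centroid update are all hypothetical choices, and each centroid index merely stands in for one domain-specialized model.

```python
import math

class DomainRouter:
    """Toy online clustering: one centroid per domain-specialized model (hypothetical design)."""

    def __init__(self, new_domain_threshold, max_domains):
        self.threshold = new_domain_threshold
        self.max_domains = max_domains
        self.centroids = []  # running mean of the style features seen per domain
        self.counts = []     # number of samples assigned to each domain

    def _add_domain(self, style_feature):
        self.centroids.append(list(style_feature))
        self.counts.append(1)
        return len(self.centroids) - 1

    def route(self, style_feature):
        """Return the index of the model that should adapt on this sample."""
        if not self.centroids:
            return self._add_domain(style_feature)
        distances = [math.dist(style_feature, c) for c in self.centroids]
        nearest = min(range(len(distances)), key=distances.__getitem__)
        # A sample far from every known centroid is treated as a new domain,
        # as long as the reservoir still has room.
        if distances[nearest] > self.threshold and len(self.centroids) < self.max_domains:
            return self._add_domain(style_feature)
        # Otherwise update the running mean of the matched domain.
        self.counts[nearest] += 1
        n = self.counts[nearest]
        self.centroids[nearest] = [
            c + (x - c) / n for c, x in zip(self.centroids[nearest], style_feature)
        ]
        return nearest
```

Because a recurring domain maps back to its existing centroid, the model adapted earlier for that domain would be reused rather than overwritten, which is the intuition behind avoiding catastrophic forgetting in this setting.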

Authors:Xuyang Liu, Yiyu Wang, Junpeng Ma, Linfeng Zhang
Title: Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Abstract:
Video large language models (VideoLLMs) excel at video understanding, but face efficiency challenges due to the quadratic complexity of abundant visual tokens. Our systematic analysis of token compression methods for VideoLLMs reveals two critical issues: (i) overlooking distinctive visual signals across frames, leading to information loss; (ii) suffering from implementation constraints, causing incompatibility with modern architectures or efficient operators. To address these challenges, we distill three design principles for VideoLLM token compression and propose a plug-and-play inference acceleration framework "Video Compression Commander" (VidCom2). By quantifying each frame's uniqueness, VidCom2 adaptively adjusts compression intensity across frames, effectively preserving essential information while reducing redundancy in video sequences. Extensive experiments across various VideoLLMs and benchmarks demonstrate the superior performance and efficiency of our VidCom2. With only 25% visual tokens, VidCom2 achieves 99.6% of the original performance on LLaVA-OV while reducing 70.8% of the LLM generation latency. Notably, our Frame Compression Adjustment strategy is compatible with other token compression methods to further improve their performance. Our code is available at https://github.com/xuyang-liu16/VidCom2.
中文摘要:VidCom2框架通过基于帧独特性的自适应视觉令牌压缩,解决了视频大语言模型的效率问题,在显著降低延迟的同时保持了接近原始性能。
English Summary: The VidCom2 framework addresses efficiency issues in VideoLLMs by adaptively compressing visual tokens based on frame uniqueness, achieving near-original performance with significantly reduced latency.
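
One way to picture "adjusting compression intensity across frames" is a per-frame token budget that grows with a frame's uniqueness score. The sketch below is an assumption-laden illustration, not VidCom2's actual procedure: the uniqueness scores are taken as given inputs, and the proportional scaling rule, function names, and per-token scores are hypothetical.

```python
def allocate_frame_budgets(uniqueness, tokens_per_frame, base_ratio):
    """Give each frame a token budget that grows with its uniqueness score.

    uniqueness: one non-negative score per frame (how it is computed is left open here).
    base_ratio: global fraction of visual tokens to keep, e.g. 0.25.
    """
    mean_u = sum(uniqueness) / len(uniqueness)
    budgets = []
    for u in uniqueness:
        # Scale the global retention ratio by relative uniqueness; fall back to
        # uniform compression when every score is zero.
        ratio = base_ratio * u / mean_u if mean_u > 0 else base_ratio
        ratio = min(max(ratio, 0.0), 1.0)
        budgets.append(round(ratio * tokens_per_frame))
    return budgets

def keep_top_tokens(token_scores, budget):
    """Indices of the `budget` highest-scoring tokens, returned in original order."""
    ranked = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:budget])
```

Before clipping, the per-frame ratios average to `base_ratio`, so the overall budget stays close to the requested retention rate while distinctive frames keep more of their tokens.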

Authors:Yuanbo Fang, Haoze Sun, Jun Liu, Tao Zhang, Zenan Zhou, Weipeng Chen, Xiaofen Xing, Xiangmin Xu
Title: S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models
Abstract:
End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. However, this often leads to a decline in reasoning and generation performance compared to text input, a phenomenon referred to as intelligence degradation. To systematically evaluate this gap, we propose S2SBench, a benchmark designed to quantify performance degradation in Speech LLMs. It includes diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input. We further introduce a pairwise evaluation protocol based on perplexity differences between plausible and implausible samples to measure degradation relative to text input. We apply S2SBench to analyze the training process of Baichuan-Audio, which further demonstrates the benchmark's effectiveness. All datasets and evaluation code are available at https://github.com/undobug/S2SBench.
中文摘要:端到端语音大语言模型虽能直接处理音频,却存在智能退化问题;为此我们开发了S2SBench基准来量化性能差距,并通过Baichuan-Audio验证了其有效性。
English Summary: End-to-end speech LLMs enable direct audio processing but suffer from intelligence degradation, prompting the creation of S2SBench to evaluate performance gaps and analyze models like Baichuan-Audio.
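
A pairwise perplexity protocol of the kind the abstract describes might be scored as follows. This is a hedged sketch, not the released evaluation code: it assumes per-token log-probabilities are already available for each sample, counts a pair as correct when the plausible sample has lower perplexity, and defines degradation as the accuracy drop from text to audio input. All function names are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp of the mean negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pairwise_accuracy(pairs):
    """Fraction of (plausible, implausible) pairs where the plausible sample is less perplexing."""
    correct = sum(perplexity(good) < perplexity(bad) for good, bad in pairs)
    return correct / len(pairs)

def degradation(text_pairs, audio_pairs):
    """Accuracy lost when the same pairs are presented as audio instead of text."""
    return pairwise_accuracy(text_pairs) - pairwise_accuracy(audio_pairs)
```

Comparing perplexities within a pair, rather than across models, sidesteps differences in tokenization and sequence length between the text and audio settings.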

Authors:Yuqiao Tan, Shizhu He, Kang Liu, Jun Zhao
Title: Neural Incompatibility: The Unbridgeable Gap of Cross-Scale Parametric Knowledge Transfer in Large Language Models
Abstract:
Large Language Models (LLMs) offer a transparent brain with accessible parameters that encode extensive knowledge, which can be analyzed, located and transferred. Consequently, a key research challenge is to transcend traditional knowledge transfer paradigms rooted in symbolic language and achieve genuine Parametric Knowledge Transfer (PKT). Significantly, exploring effective methods for transferring knowledge across LLMs of different scales through parameters presents an intriguing and valuable research direction. In this paper, we first demonstrate that Alignment in parametric space is the fundamental prerequisite to achieve successful cross-scale PKT. We redefine the previously explored knowledge transfer as Post-Align PKT (PostPKT), which utilizes extracted parameters for LoRA initialization and requires subsequent fine-tuning for alignment. Hence, to reduce the cost of further fine-tuning, we introduce a novel Pre-Align PKT (PrePKT) paradigm and propose a solution called LaTen (Locate-Then-Align) that aligns the parametric spaces of LLMs across scales using only several training steps, without subsequent training. Comprehensive experiments on four benchmarks demonstrate that both PostPKT and PrePKT face challenges in achieving consistently stable transfer. Through in-depth analysis, we identify Neural Incompatibility as the ethological and parametric structural differences between LLMs of varying scales, presenting fundamental challenges to achieving effective PKT. These findings provide fresh insights into the parametric architectures of LLMs and highlight promising directions for future research on efficient PKT. Our code is available at https://github.com/Trae1ounG/Neural_Incompatibility.
中文摘要:大型语言模型支持参数知识迁移,本研究提出预对齐PKT和LaTen方法以高效对齐不同规模模型的参数空间,并揭示神经不兼容性是主要挑战。
English Summary: Large Language Models enable Parametric Knowledge Transfer (PKT), where this study introduces Pre-Align PKT and LaTen to align parametric spaces across scales efficiently, revealing Neural Incompatibility as a key challenge.

Authors:Chengtang Yao, Lidong Yu, Zhidan Liu, Jiaxi Zeng, Yuwei Wu, Yunde Jia
Title: Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
Abstract:
The matching formulation makes it naturally hard for stereo matching to handle ill-posed regions like occlusions and non-Lambertian surfaces. Fusing monocular priors has been proven helpful for ill-posed matching, but the biased monocular prior learned from small stereo datasets constrains the generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from the vision foundation model (VFM) to improve the generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first problem is the misalignment between affine-invariant relative monocular depth and absolute depth of disparity. Besides, when we use the monocular feature in an iterative update structure, the over-confidence in the disparity update leads to local optima results. A direct fusion of a monocular depth map could alleviate the local optima problem, but noisy disparity results computed at the first several iterations will misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representation. The computed local ordering map is also used to re-weight the initial disparity update, resolving the local optima and noisy problem. In addition, we formulate the final direct fusion of monocular depth to the disparity as a registration problem, where a pixel-wise linear regression module can globally and adaptively align them. Our method fully exploits the monocular prior to support stereo matching results effectively and efficiently. In our experiments, it significantly improves performance when generalizing from SceneFlow to the Middlebury and Booster datasets while barely reducing efficiency.
中文: 立体匹配在处理病态区域时存在困难,但利用视觉基础模型的无偏单目先验可提升泛化能力;通过引入二元局部排序图和自适应配准方法,有效解决了深度对齐不准和局部最优等融合问题。
English: Stereo matching struggles with ill-posed regions, but leveraging unbiased monocular priors from vision foundation models can enhance generalization, though fusion challenges like depth misalignment and local optima require solutions such as binary local ordering maps and adaptive registration techniques.

Authors:Paweł Batorski, Adrian Kosmala, Paul Swoboda
Title: PRL: Prompts from Reinforcement Learning
Abstract:
Effective prompt engineering remains a central challenge in fully harnessing the capabilities of LLMs. While well-designed prompts can dramatically enhance performance, crafting them typically demands expert intuition and a nuanced understanding of the task. Moreover, the most impactful prompts often hinge on subtle semantic cues, ones that may elude human perception but are crucial for guiding LLM behavior. In this paper, we introduce PRL (Prompts from Reinforcement Learning), a novel RL-based approach for automatic prompt generation. Unlike previous methods, PRL can produce novel few-shot examples that were not seen during training. Our approach achieves state-of-the-art performance across a range of benchmarks, including text classification, simplification, and summarization. On the classification task, it surpasses prior methods by 2.58% over APE and 1.00% over EvoPrompt. Additionally, it improves the average ROUGE scores on the summarization task by 4.32 over APE and by 2.12 over EvoPrompt and the SARI score on simplification by 6.93 over APE and by 6.01 over EvoPrompt. Our code is available at https://github.com/Batorskq/prl.
中文: 本文提出PRL,一种基于强化学习的自动提示生成方法,能够创建训练中未见过的全新少样本示例,并在文本分类、简化及摘要任务中实现了最先进的性能表现。
English: This paper introduces PRL, a reinforcement learning-based method for automatic prompt generation that creates novel few-shot examples unseen during training and achieves state-of-the-art performance across text classification, simplification, and summarization benchmarks.

Authors:Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu
Title: DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Abstract:
Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes. However, achieving seamless integration of visual and textual reasoning which mirrors human cognitive processes remains a significant challenge. In particular, effectively incorporating advanced visual input processing into reasoning mechanisms is still an open question. Thus, in this paper, we explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool instead of depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of tool-calling behavior from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.
中文摘要:DeepEyes模型通过端到端强化学习实现了“图像思维”的交替多模态推理范式,在细粒度感知和推理任务中取得显著性能提升,同时展现出类人的视觉推理模式。
English Summary: The DeepEyes model introduces an interleaved multimodal reasoning paradigm that enables "thinking with images" through end-to-end reinforcement learning, achieving significant performance gains across perception and reasoning tasks while demonstrating human-like visual reasoning patterns.

Authors:Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, Shouhong Ding
Title: Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable
Abstract:
Existing detectors are often trained on biased datasets, leading to the possibility of overfitting on non-causal image attributes that are spuriously correlated with real/synthetic labels. While these biased features enhance performance on the training data, they result in substantial performance degradation when applied to unbiased datasets. One common solution is to perform dataset alignment through generative reconstruction, matching the semantic content between real and synthetic images. However, we revisit this approach and show that pixel-level alignment alone is insufficient. The reconstructed images still suffer from frequency-level misalignment, which can perpetuate spurious correlations. To illustrate, we observe that reconstruction models tend to restore the high-frequency details lost in real images (possibly due to JPEG compression), inadvertently creating a frequency-level misalignment, where synthetic images appear to have richer high-frequency content than real ones. This misalignment leads to models associating high-frequency features with synthetic labels, further reinforcing biased cues. To resolve this, we propose Dual Data Alignment (DDA), which aligns both the pixel and frequency domains. Moreover, we introduce two new test sets: DDA-COCO, containing DDA-aligned synthetic images for testing detector performance on the most aligned dataset, and EvalGEN, featuring the latest generative models for assessing detectors under new generative architectures such as visual auto-regressive generators. Finally, our extensive evaluations demonstrate that a detector trained exclusively on DDA-aligned MSCOCO could improve across 8 diverse benchmarks by a non-trivial margin, showing a +7.2% on in-the-wild benchmarks, highlighting the improved generalizability of unbiased detectors. Our code is available at: https://github.com/roy-ch/Dual-Data-Alignment.
中文: 现有检测器因在带有偏见的训练集上学习而过度依赖虚假相关特征,但仅像素级对齐无法解决频域错位问题,因此提出的双数据对齐方法通过同时校准像素和频率显著提升了检测器的跨数据集泛化能力。
English: Current detectors trained on biased datasets often overfit to non-causal features, but pixel-level alignment alone fails to address frequency-level misalignments that reinforce spurious correlations, prompting the proposed Dual Data Alignment method which improves generalizability across multiple benchmarks.

Authors:Sho Inoue, Shai Wang, Haizhou Li
Title: PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs
Abstract:
Despite significant progress in neural spoken dialog systems, personality-aware conversation agents -- capable of adapting behavior based on personalities -- remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.
中文: 该研究通过创建带自动标注的对话数据集并利用大语言模型预测对话个性,解决了语音数据中缺乏个性标注的问题,其系统比现有方法更符合人类判断。
English: The study addresses the lack of personality annotations in speech datasets by creating a dialogue dataset with automated annotations and using large language models to predict conversational personality, achieving better alignment with human judgments than existing methods.

Authors:Xiang Li, Xianfu Cheng, Dezhuang Miao, Xiaoming Zhang, Zhoujun Li
Title: TF-Mamba: Text-enhanced Fusion Mamba with Missing Modalities for Robust Multimodal Sentiment Analysis
Abstract:
Multimodal Sentiment Analysis (MSA) with missing modalities has attracted increasing attention recently. While current Transformer-based methods leverage dense text information to maintain model robustness, their quadratic complexity hinders efficient long-range modeling and multimodal fusion. To this end, we propose a novel and efficient Text-enhanced Fusion Mamba (TF-Mamba) framework for robust MSA with missing modalities. Specifically, a Text-aware Modality Enhancement (TME) module aligns and enriches non-text modalities, while reconstructing the missing text semantics. Moreover, we develop Text-based Context Mamba (TC-Mamba) to capture intra-modal contextual dependencies under text collaboration. Finally, Text-guided Query Mamba (TQ-Mamba) queries text-guided multimodal information and learns joint representations for sentiment prediction. Extensive experiments on three MSA datasets demonstrate the effectiveness and efficiency of the proposed method under missing modality scenarios. Our code is available at https://github.com/codemous/TF-Mamba.
中文: 本文提出TF-Mamba框架,通过文本感知模态增强和双模态交互机制,在缺失模态情况下有效实现多模态情感分析,实验证明该方法在三个数据集上具有优越性能。
English: This paper introduces TF-Mamba, an efficient framework for robust multimodal sentiment analysis with missing modalities, which aligns and enriches non-text data while reconstructing missing text semantics through specialized mamba modules for improved sentiment prediction.

Authors:Jinwang Song, Hongying Zan, Kunli Zhang, Lingling Mu, Yingjie Han, Haobo Hua, Min Peng
Title: JOLT-SQL: Joint Loss Tuning of Text-to-SQL with Confusion-aware Noisy Schema Sampling
Abstract:
Text-to-SQL, which maps natural language to SQL queries, has benefited greatly from recent advances in Large Language Models (LLMs). While LLMs offer various paradigms for this task, including prompting and supervised fine-tuning (SFT), SFT approaches still face challenges such as complex multi-stage pipelines and poor robustness to noisy schema information. To address these limitations, we present JOLT-SQL, a streamlined single-stage SFT framework that jointly optimizes schema linking and SQL generation via a unified loss. JOLT-SQL employs discriminative schema linking, enhanced by local bidirectional attention, alongside a confusion-aware noisy schema sampling strategy with selective attention to improve robustness under noisy schema conditions. Experiments on the Spider and BIRD benchmarks demonstrate that JOLT-SQL achieves state-of-the-art execution accuracy among comparable-size open-source models, while significantly improving both training and inference efficiency. Our code is available at https://github.com/Songjw133/JOLT-SQL.
中文:JOLT-SQL是一种简化的单阶段监督微调框架,通过联合优化模式链接和SQL生成,在基准测试中实现了最先进的执行准确率和效率提升。
English: JOLT-SQL is a streamlined single-stage supervised fine-tuning framework that jointly optimizes schema linking and SQL generation, achieving state-of-the-art execution accuracy and improved efficiency on benchmarks.

Authors:Hiroki Shiraishi, Hisao Ishibuchi, Masaya Nakata
Title: X-KAN: Optimizing Local Kolmogorov-Arnold Networks via Evolutionary Rule-Based Machine Learning
Abstract:
Function approximation is a critical task in various fields. However, existing neural network approaches struggle with locally complex or discontinuous functions due to their reliance on a single global model covering the entire problem space. We propose X-KAN, a novel method that optimizes multiple local Kolmogorov-Arnold Networks (KANs) through an evolutionary rule-based machine learning framework called XCSF. X-KAN combines KAN's high expressiveness with XCSF's adaptive partitioning capability by implementing local KAN models as rule consequents and defining local regions via rule antecedents. Our experimental results on artificial test functions and real-world datasets demonstrate that X-KAN significantly outperforms conventional methods, including XCSF, Multi-Layer Perceptron, and KAN, in terms of approximation accuracy. Notably, X-KAN effectively handles functions with locally complex or discontinuous structures that are challenging for conventional KAN, using a compact set of rules (average 7.2 ± 2.3 rules). These results validate the effectiveness of using KAN as a local model in XCSF, which evaluates the rule fitness based on both accuracy and generality. Our X-KAN implementation is available at https://github.com/YNU-NakataLab/X-KAN.
中文: X-KAN是一种将科尔莫戈罗夫-阿诺德网络与进化框架相结合的新方法,通过优化局部模型,以紧凑的规则集显著优于传统方法,能有效处理复杂或不连续函数并实现高精度逼近。
English: X-KAN is a novel method that combines Kolmogorov-Arnold Networks with an evolutionary framework to optimize local models, significantly outperforming conventional approaches in handling complex or discontinuous functions with high accuracy using compact rule sets.

Authors:Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Rongrong Ji
Title: Speculative Decoding Reimagined for Multimodal Large Language Models
Abstract:
This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy. However, current speculative decoding methods for MLLMs fail to achieve the same speedup as they do for LLMs. To address this, we reimagine speculative decoding specifically for MLLMs. Our analysis of MLLM characteristics reveals two key design principles for MSD: (1) Text and visual tokens have fundamentally different characteristics and need to be processed separately during drafting. (2) Both language modeling ability and visual perception capability are crucial for the draft model. For the first principle, MSD decouples text and visual tokens in the draft model, allowing each to be handled based on its own characteristics. For the second principle, MSD uses a two-stage training strategy: In stage one, the draft model is trained on text-only instruction-tuning datasets to improve its language modeling ability. In stage two, MSD gradually introduces multimodal data to enhance the visual perception capability of the draft model. Experiments show that MSD boosts inference speed by up to 2.29× for LLaVA-1.5-7B and up to 2.46× for LLaVA-1.5-13B on multimodal benchmarks, demonstrating its effectiveness. Our code is available at https://github.com/Lyn-Lucy/MSD.
中文: 本文提出多模态推测解码(MSD)方法,通过分离处理文本与视觉标记并采用两阶段训练策略,在保持精度的同时将多模态大语言模型推理速度提升最高达2.46倍。
English: This paper presents Multimodal Speculative Decoding (MSD), a method that accelerates Multimodal Large Language Models by separately processing text and visual tokens and employing a two-stage training strategy, achieving up to 2.46× speedup without accuracy loss.
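
The claim that speculative decoding speeds up generation "without sacrificing accuracy" is easiest to see in the greedy case, sketched below. This is a generic draft-and-verify loop, not MSD itself: the decoupled handling of text and visual tokens and the two-stage draft training are not modeled, and `target_next` and `draft_next` are hypothetical callables that map a token sequence to the next greedy token. In a real system the verification step is a single parallel forward pass of the target model; here it is written as a loop for clarity.

```python
def speculative_decode(target_next, draft_next, prompt, num_new_tokens, draft_len=4):
    """Greedy speculative decoding: the output matches plain greedy decoding with the target."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        # 1) The cheap draft model proposes a short continuation.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2) The target model checks the proposal position by position.
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(proposed)
        else:
            # Every drafted token was accepted, so the target adds one more token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]
```

Every appended token equals the target's own greedy choice for its prefix, so quality is unchanged; the speedup depends entirely on how often the draft model's proposals are accepted.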

Authors:Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
Title: Visual Agentic Reinforcement Fine-Tuning
Abstract:
A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools such as web browsers for searching and writing/executing code for image manipulation to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, and their corresponding benchmarks, remain less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves +29.3% F1 / +25.9% EM gains on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.
中文: 本研究提出的视觉智能体强化微调(Visual-ARFT)方法显著提升了大型视觉语言模型执行网页搜索和图像编码处理的能力,在智能体任务中超越基线模型和GPT-4o,并在多跳问答基准测试中展现出强大的泛化性能。
English: This research introduces Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), which significantly enhances Large Vision-Language Models' ability to perform web searches and image manipulation through coding, outperforming baseline models and GPT-4o in agentic tasks and demonstrating strong generalization on multi-hop QA benchmarks.

Authors:Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma
Title: ABBA: Highly Expressive Hadamard Product Adaptation for Large Language Models
Abstract:
Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget. We formally analyze ABBA's expressive capacity and validate its advantages through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: https://github.com/CERT-Lab/abba.
中文: ABBA提出了一种新的参数高效微调架构,通过两个独立低秩矩阵的哈达玛积实现与预训练权重的完全解耦,在相同参数预算下显著提升表达能力,并在多项推理基准测试中大幅领先现有方法。
English: ABBA introduces a novel parameter-efficient fine-tuning architecture that decouples updates from pre-trained weights using two independent low-rank matrices, achieving superior expressivity and state-of-the-art performance across reasoning benchmarks.
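
The expressivity claim can be illustrated with a minimal sketch, assuming the update has the form ΔW = (B1 A1) ⊙ (B2 A2) as the abstract describes. The helper names (`matmul`, `hadamard`, `rank`), the 0/1 factor matrices, and the 9×9 size are illustrative assumptions, not taken from the authors' code. Each factor pair below has rank 3, yet their element-wise product has rank 9:

```python
from fractions import Fraction


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rank(m):
    """Exact rank via Gaussian elimination over rationals."""
    rows = [[Fraction(x) for x in row] for row in m]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


r = 3          # rank of each low-rank pair (illustrative)
n = r * r      # square weight matrix of size n x n (illustrative)

# Hypothetical factors: B1 groups indices by block, B2 groups them by remainder.
b1 = [[1 if k // r == i else 0 for i in range(r)] for k in range(n)]
b2 = [[1 if k % r == j else 0 for j in range(r)] for k in range(n)]
a1, a2 = transpose(b1), transpose(b2)

m1, m2 = matmul(b1, a1), matmul(b2, a2)   # two rank-r matrices
delta_w = hadamard(m1, m2)                # Hadamard-product update

print(rank(m1), rank(m2), rank(delta_w))  # 3 3 9
```

In this toy case the two rank-3 pairs use 2·3·(9+9) = 108 parameters, the same as a single rank-6 low-rank pair, which could reach rank 6 at most. In general the Hadamard product of two rank-r matrices has rank at most r², so the gap shown here is the best case rather than a guarantee.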

Authors:Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma
Title: ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models
Abstract:
Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget, a property we validate through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: https://github.com/CERT-Lab/abba.
中文: ABBA提出了一种新的参数高效微调架构,通过两个独立低秩矩阵的哈达玛积实现与预训练权重的完全解耦,在相同参数预算下显著提升表达能力,并在多项推理基准测试中大幅领先现有方法。
English: ABBA introduces a novel parameter-efficient fine-tuning architecture that decouples updates from pre-trained weights using two independent low-rank matrices, achieving superior expressivity and state-of-the-art performance across reasoning benchmarks.

Authors:Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma
Title: Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
Abstract:
Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. This is typically achieved through instruction tuning and reinforcement learning from human feedback. However, this alignment is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable geometric directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this geometric perspective. We examine whether safety-relevant behavior is concentrated in specific subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in internal representations. Across both parameter and activation space, our findings are consistent: subspaces that amplify safe behaviors also amplify unsafe ones, and prompts with different safety implications activate overlapping representations. We find no evidence of a subspace that selectively governs safety. These results challenge the assumption that alignment is geometrically localized. Rather than residing in distinct directions, safety appears to emerge from entangled, high-impact components of the model's broader learning dynamics. This suggests that subspace-based defenses may face fundamental limitations and underscores the need for alternative strategies to preserve alignment under continued training. We corroborate these findings through multiple experiments on five open-source LLMs. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.
English Summary: Large language models' safety alignment is fragile and can be compromised during fine-tuning, as safety is deeply entangled with general learning components rather than isolated in distinct subspaces, limiting the effectiveness of subspace-based defenses.

Authors:Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma
Title: Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study
Abstract:
Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. However, this behavior is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this perspective. We examine whether safety-relevant behavior is concentrated in specific linear subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in activations. Across both weight and activation spaces, our findings are consistent: subspaces that amplify safe behaviors also amplify useful ones, and prompts with different safety implications activate overlapping representations. Rather than residing in distinct directions, we show that safety is highly entangled with the general learning components of the model. This suggests that subspace-based defenses face fundamental limitations and underscores the need for alternative strategies to preserve safety under continued training. We corroborate these findings with multiple experiments on five open-source LLMs from the Llama and Qwen families. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.
English Summary: Large language models' safety alignment is fragile and can be compromised during fine-tuning, as safety is deeply entangled with general learning components rather than isolated in distinct subspaces, limiting the effectiveness of subspace-based defenses.

Authors:Tong Bao, Heng Zhang, Chengzhi Zhang
Title: Enhancing Abstractive Summarization of Scientific Papers Using Structure Information
Abstract:
Abstractive summarization of scientific papers has always been a research focus, yet existing methods face two main challenges. First, most summarization models rely on Encoder-Decoder architectures that treat papers as sequences of words, and thus fail to fully capture the structured information inherent in scientific papers. Second, existing research often uses keyword mapping or feature engineering to identify the structural information, but these methods struggle with the structural flexibility of scientific papers and lack robustness across different disciplines. To address these challenges, we propose a two-stage abstractive summarization framework that leverages automatic recognition of structural functions within scientific papers. In the first stage, we standardize chapter titles from numerous scientific papers and construct a large-scale dataset for structural function recognition. A classifier is then trained to automatically identify the key structural components (e.g., Background, Methods, Results, Discussion), which provides a foundation for generating more balanced summaries. In the second stage, we employ Longformer to capture rich contextual relationships across sections and generate context-aware summaries. Experiments conducted on two domain-specific scientific paper summarization datasets demonstrate that our method outperforms advanced baselines and generates more comprehensive summaries. The code and dataset can be accessed at https://github.com/tongbao96/code-for-SFR-AS.
Chinese: 本研究提出了一种两阶段摘要生成框架,通过分类器自动识别科学论文的结构功能,并利用Longformer生成上下文感知的摘要,实验表明该方法优于现有基准模型且能生成更全面的摘要内容。
English: This study introduces a two-stage abstractive summarization framework that first identifies structural functions in scientific papers using a classifier and then employs Longformer to generate context-aware summaries, outperforming existing methods and producing more comprehensive results.

Authors:Jiaming Li, Sheng Wang, Xin Wang, Yitao Zhu, Honglin Xiong, Zixu Zhuang, Qian Wang
Title: ReactDiff: Latent Diffusion for Facial Reaction Generation
Abstract:
Given the audio-visual clip of the speaker, facial reaction generation aims to predict the listener's facial reactions. The challenge lies in capturing the relevance between video and audio while balancing appropriateness, realism, and diversity. While prior works have mostly focused on uni-modal inputs or simplified reaction mappings, recent approaches such as PerFRDiff have explored multi-modal inputs and the one-to-many nature of appropriate reaction mappings. In this work, we propose the Facial Reaction Diffusion (ReactDiff) framework that uniquely integrates a Multi-Modality Transformer with conditional diffusion in the latent space for enhanced reaction generation. Unlike existing methods, ReactDiff leverages intra- and inter-class attention for fine-grained multi-modal interaction, while the latent diffusion process between the encoder and decoder enables diverse yet contextually appropriate outputs. Experimental results demonstrate that ReactDiff significantly outperforms existing approaches, achieving a facial reaction correlation of 0.26 and diversity score of 0.094 while maintaining competitive realism. The code is open-sourced at https://github.com/Hunan-Tiger/ReactDiff.
中文摘要:ReactDiff框架通过融合多模态变换器与潜在空间条件扩散,利用细粒度跨模态注意力机制生成既真实又多样的听者面部反应,在性能上显著超越现有方法。
English Summary: The ReactDiff framework integrates a Multi-Modality Transformer with conditional latent diffusion to generate realistic and diverse listener facial reactions by capturing fine-grained audio-visual interactions, significantly outperforming existing methods.

Authors:Chengzhi Zhang, Xinyi Yan, Lei Zhao, Yingyi Zhang
Title: Enhancing Keyphrase Extraction from Academic Articles Using Section Structure Information
Abstract:
The exponential increase in academic papers has significantly increased the time required for researchers to access relevant literature. Keyphrase Extraction (KPE) offers a solution to this situation by enabling researchers to efficiently retrieve relevant literature. The current study on KPE from academic articles aims to improve the performance of extraction models through innovative approaches using Title and Abstract as input corpora. However, the semantic richness of keywords is significantly constrained by the length of the abstract. While full-text-based KPE can address this issue, it simultaneously introduces noise, which significantly diminishes KPE performance. To address this issue, this paper utilized the structural features and section texts obtained from the section structure information of academic articles to extract keyphrases from academic papers. The approach consists of two main parts: (1) exploring the effect of seven structural features on KPE models, and (2) integrating the extraction results from all section texts used as input corpora for KPE models via a keyphrase integration algorithm to obtain the keyphrase integration result. Furthermore, this paper also examined the effect of the classification quality of section structure on the KPE performance. The results show that incorporating structural features improves KPE performance, though different features have varying effects on model efficacy. The keyphrase integration approach yields the best performance, and the classification quality of section structure can affect KPE performance. These findings indicate that using the section structure information of academic articles contributes to effective KPE from academic articles. The code and dataset supporting this study are available at https://github.com/yan-xinyi/SSB_KPE.
中文: 本研究通过利用学术论文的结构特征和整合章节文本,改进了关键词提取方法,结果表明该方法能提升性能,尽管不同特征效果各异且依赖于章节分类质量。
English: This study enhances keyphrase extraction from academic papers by leveraging structural features and integrating section texts, demonstrating that this approach improves performance despite varying feature impacts and dependency on section classification quality.

Authors:Fan Liu, Zherui Yang, Cancheng Liu, Tianrui Song, Xiaofeng Gao, Hao Liu
Title: MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem
Abstract:
Mathematical modeling is a cornerstone of scientific discovery and engineering practice, enabling the translation of real-world problems into formal systems across domains such as physics, biology, and economics. Unlike mathematical reasoning, which assumes a predefined formulation, modeling requires open-ended problem analysis, abstraction, and principled formalization. While Large Language Models (LLMs) have shown strong reasoning capabilities, they fall short in rigorous model construction, limiting their utility in real-world problem-solving. To this end, we formalize the task of LLM-powered real-world mathematical modeling, where agents must analyze problems, construct domain-appropriate formulations, and generate complete end-to-end solutions. We introduce MM-Bench, a curated benchmark of 111 problems from the Mathematical Contest in Modeling (MCM/ICM), spanning the years 2000 to 2025 and across ten diverse domains such as physics, biology, and economics. To tackle this task, we propose MM-Agent, an expert-inspired framework that decomposes mathematical modeling into four stages: open-ended problem analysis, structured model formulation, computational problem solving, and report generation. Experiments on MM-Bench show that MM-Agent significantly outperforms baseline agents, achieving an 11.88% improvement over human expert solutions while requiring only 15 minutes and $0.88 per task using GPT-4o. Furthermore, under official MCM/ICM protocols, MM-Agent assisted two undergraduate teams in winning the Finalist Award (top 2.0% among 27,456 teams) in MCM/ICM 2025, demonstrating its practical effectiveness as a modeling copilot. Our code is available at https://github.com/usail-hkust/LLM-MM-Agent
中文摘要:本文提出的MM-Agent框架通过四阶段建模流程显著提升大语言模型的数学建模能力,在MM-Bench基准测试中表现优异,并在实际数学建模竞赛中验证了其作为建模助手的实用价值。
English Summary: This paper introduces MM-Agent, a framework that enhances LLMs' mathematical modeling capabilities through a four-stage process, achieving superior performance on the MM-Bench benchmark and proving effective in real-world competitions.

Authors:Yihang Du, Jiaying Hu, Suyang Hou, Yueyang Ding, Xiaobo Sun
Title: A Methodological Framework for Measuring Spatial Labeling Similarity
Abstract:
Spatial labeling assigns labels to specific spatial locations to characterize their spatial properties and relationships, with broad applications in scientific research and practice. Measuring the similarity between two spatial labelings is essential for understanding their differences and the contributing factors, such as changes in location properties or labeling methods. An adequate and unbiased measurement of spatial labeling similarity should consider the number of matched labels (label agreement), the topology of spatial label distribution, and the heterogeneous impacts of mismatched labels. However, existing methods often fail to account for all these aspects. To address this gap, we propose a methodological framework to guide the development of methods that meet these requirements. Given two spatial labelings, the framework transforms them into graphs based on location organization, labels, and attributes (e.g., location significance). The distributions of their graph attributes are then extracted, enabling an efficient computation of distributional discrepancy to reflect the dissimilarity level between the two labelings. We further provide a concrete implementation of this framework, termed Spatial Labeling Analogy Metric (SLAM), along with an analysis of its theoretical foundation, for evaluating spatial labeling results in spatial transcriptomics (ST) as per their similarity with ground truth labeling. Through a series of carefully designed experimental cases involving both simulated and real ST data, we demonstrate that SLAM provides a comprehensive and accurate reflection of labeling quality compared to other well-established evaluation metrics. Our code is available at https://github.com/YihDu/SLAM.
中文摘要:本研究提出了一种方法框架及其实现SLAM,通过综合考虑标签一致性、空间拓扑和错配标签影响来精确度量空间标注相似性,实验证明其性能优于现有评估指标。
English Summary: This study introduces a methodological framework and its implementation, SLAM, for accurately measuring spatial labeling similarity by considering label agreement, topology, and mismatched label impacts, demonstrating superior performance over existing metrics through experiments.

Authors:Hongjun Choi, Eun Som Jeon, Ankita Shukla, Pavan Turaga
Title: Intra-class Patch Swap for Self-Distillation
Abstract:
Knowledge distillation (KD) is a valuable technique for compressing large deep learning models into smaller, edge-suitable networks. However, conventional KD frameworks rely on pre-trained high-capacity teacher networks, which introduce significant challenges such as increased memory/storage requirements, additional training costs, and ambiguity in selecting an appropriate teacher for a given student model. Although a teacher-free distillation (self-distillation) has emerged as a promising alternative, many existing approaches still rely on architectural modifications or complex training procedures, which limit their generality and efficiency. To address these limitations, we propose a novel framework based on teacher-free distillation that operates using a single student network without any auxiliary components, architectural modifications, or additional learnable parameters. Our approach is built on a simple yet highly effective augmentation, called intra-class patch swap augmentation. This augmentation simulates a teacher-student dynamic within a single model by generating pairs of intra-class samples with varying confidence levels, and then applying instance-to-instance distillation to align their predictive distributions. Our method is conceptually simple, model-agnostic, and easy to implement, requiring only a single augmentation function. Extensive experiments across image classification, semantic segmentation, and object detection show that our method consistently outperforms both existing self-distillation baselines and conventional teacher-based KD approaches. These results suggest that the success of self-distillation could hinge on the design of the augmentation itself. Our codes are available at https://github.com/hchoi71/Intra-class-Patch-Swap.
中文摘要:本文提出了一种新颖的无教师知识蒸馏方法,通过类内图像块交换增强技术,在单一学生网络中实现有效的自蒸馏,无需架构修改或额外参数即可在多种计算机视觉任务中取得优越性能。
English Summary: The paper introduces a novel teacher-free knowledge distillation method using intra-class patch swap augmentation to enable effective self-distillation within a single student network, achieving superior performance across multiple computer vision tasks without architectural changes or additional parameters.
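
The core augmentation can be sketched as follows, assuming images are 2D grids and that a square patch at the same location is exchanged between two samples of the same class. The function name `patch_swap`, the fixed patch position, and the toy 4×4 images are hypothetical; how patches are selected and how the distillation loss is applied afterward are not specified in the abstract and are left out here.

```python
import copy


def patch_swap(img_a, img_b, top, left, size):
    """Exchange one size x size patch between two same-class images.

    Returns two new images; the inputs are left unmodified.
    """
    out_a, out_b = copy.deepcopy(img_a), copy.deepcopy(img_b)
    for row in range(top, top + size):
        out_a[row][left:left + size] = img_b[row][left:left + size]
        out_b[row][left:left + size] = img_a[row][left:left + size]
    return out_a, out_b


# Two toy "images" standing in for samples of the same class.
image_a = [[0] * 4 for _ in range(4)]
image_b = [[1] * 4 for _ in range(4)]

swapped_a, swapped_b = patch_swap(image_a, image_b, top=1, left=1, size=2)
for row in swapped_a:
    print(row)
```

In a training loop, the two swapped images would be passed through the same network and their predictive distributions aligned, which is the instance-to-instance distillation step the abstract refers to.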

Authors:Tianle Gu, Zongqi Wang, Kexin Huang, Yuanqi Yao, Xiangliang Zhang, Yujiu Yang, Xiuying Chen
Title: Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking
Abstract:
Logit-based LLM watermarking traces and verifies AI-generated content by maintaining green and red token lists and increasing the likelihood of green tokens during generation. However, it fails in low-entropy scenarios, where predictable outputs make green token selection difficult without disrupting natural text flow. Existing approaches address this by assuming access to the original LLM to calculate entropy and selectively watermark high-entropy tokens. However, these methods face two major challenges: (1) high computational costs and detection delays due to reliance on the original LLM, and (2) potential risks of model leakage. To address these limitations, we propose Invisible Entropy (IE), a watermarking paradigm designed to enhance both safety and efficiency. Instead of relying on the original LLM, IE introduces a lightweight feature extractor and an entropy tagger to predict whether the entropy of the next token is high or low. Furthermore, based on theoretical analysis, we develop a threshold navigator that adaptively sets entropy thresholds. It identifies a threshold where the watermark ratio decreases as the green token count increases, enhancing the naturalness of the watermarked text and improving detection robustness. Experiments on HumanEval and MBPP datasets demonstrate that IE reduces parameter size by 99% while achieving performance on par with state-of-the-art methods. Our work introduces a safe and efficient paradigm for low-entropy watermarking. https://github.com/Carol-gutianle/IE https://huggingface.co/datasets/Carol0110/IE-Tagger
中文摘要:提出的隐形熵水印方法通过轻量级特征提取器和自适应阈值,在不依赖原始大语言模型的情况下有效解决低熵场景水印难题,参数量减少99%的同时保持优异性能。
English Summary: The proposed Invisible Entropy (IE) watermarking method overcomes limitations of existing approaches by using a lightweight feature extractor and adaptive thresholding to efficiently watermark low-entropy text without relying on the original LLM, achieving 99% parameter reduction while maintaining performance.
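
The green/red-list mechanism and the entropy gate mentioned in the abstract can be sketched as below. This is a generic illustration, not the IE method: seeding the green list from the previous token, the values of `gamma`, `delta`, and `threshold`, and all function names are assumptions. In particular, the entropy here is computed directly from the logits, whereas IE replaces that step with a lightweight predictor so the original LLM is not needed.

```python
import math
import random


def green_list(prev_token, vocab_size, gamma=0.5):
    """Pseudo-randomly pick a 'green' subset of the vocabulary, seeded by the previous token."""
    rng = random.Random(prev_token)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def watermark_logits(logits, prev_token, delta=2.0, threshold=1.0):
    """Boost green-token logits, but only when the next-token entropy is high."""
    if entropy(softmax(logits)) < threshold:
        return list(logits)  # low entropy: leave the distribution untouched
    green = green_list(prev_token, len(logits))
    return [x + delta if i in green else x for i, x in enumerate(logits)]


confident = [10.0, 0.0, 0.0, 0.0]   # one token dominates: low entropy
uncertain = [0.0] * 8               # uniform: high entropy

print(watermark_logits(confident, prev_token=7))
print(watermark_logits(uncertain, prev_token=7))
```

The first call returns the logits unchanged because the distribution is nearly deterministic; the second adds `delta` to the half of the vocabulary that falls in the green list for that context.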

Authors:Yakun Zhu, Zhongzhen Huang, Linjie Mu, Yutong Huang, Wei Nie, Jiaji Liu, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang
Title: DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models
Abstract:
The emergence of groundbreaking large language models capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, it is urgently necessary to benchmark the diagnostic capabilities of current models systematically. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties, deriving from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline, involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3, o1, and DeepSeek-R1, achieve only 51.12%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current large language models when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI's diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/DiagnosisArena.
Chinese: 突破性的大语言模型在应对临床诊断等科学挑战方面展现出潜力,但当前先进模型在专业级诊断推理上仍面临困难,这通过它们在DiagnosisArena新基准测试中的低准确率得以体现。
English: Groundbreaking large language models show promise for tackling scientific challenges like clinical diagnostics, but current advanced models still struggle with professional-level diagnostic reasoning, as shown by their low accuracy on the new DiagnosisArena benchmark.

Authors:Zhenyu Li, Tianyi Shang, Pengjie Xu, Zhaojun Deng
Title: Place Recognition Meet Multiple Modalitie: A Comprehensive Review, Current Challenges and Future Directions
Abstract:
Place recognition is a cornerstone of vehicle navigation and mapping, which is pivotal in enabling systems to determine whether a location has been previously visited. This capability is critical for tasks such as loop closure in Simultaneous Localization and Mapping (SLAM) and long-term navigation under varying environmental conditions. In this survey, we comprehensively review recent advancements in place recognition, emphasizing three representative methodological paradigms: Convolutional Neural Network (CNN)-based approaches, Transformer-based frameworks, and cross-modal strategies. We begin by elucidating the significance of place recognition within the broader context of autonomous systems. Subsequently, we trace the evolution of CNN-based methods, highlighting their contributions to robust visual descriptor learning and scalability in large-scale environments. We then examine the emerging class of Transformer-based models, which leverage self-attention mechanisms to capture global dependencies and offer improved generalization across diverse scenes. Furthermore, we discuss cross-modal approaches that integrate heterogeneous data sources such as Lidar, vision, and text description, thereby enhancing resilience to viewpoint, illumination, and seasonal variations. We also summarize standard datasets and evaluation metrics widely adopted in the literature. Finally, we identify current research challenges and outline prospective directions, including domain adaptation, real-time performance, and lifelong learning, to inspire future advancements in this domain. The unified framework of leading-edge place recognition methods, i.e., code library, and the results of their experimental evaluations are available at https://github.com/CV4RA/SOTA-Place-Recognitioner.
中文摘要:本综述系统回顾了位置识别领域的最新进展,重点分析了基于CNN的方法、Transformer框架和跨模态策略三大代表性技术范式,并探讨了当前研究挑战与未来发展方向。
English Summary: This survey comprehensively reviews recent advancements in place recognition for autonomous navigation, focusing on CNN-based methods, Transformer frameworks, and cross-modal strategies while addressing current challenges and future directions.

Authors:Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, Can Huang
Title: Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Abstract:
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
Chinese: Dolphin提出了一种新颖的两阶段文档解析模型,通过首先生成布局元素再并行解析内容,结合异构锚点提示和大规模数据集,实现了顶尖的解析性能和效率。
English: Dolphin introduces a novel two-stage document parsing model that first analyzes layout elements and then parses content in parallel, achieving state-of-the-art performance and efficiency through heterogeneous anchor prompting and a large-scale dataset.

Authors:Yibo Gao, Hangqi Zhou, Zheyao Gao, Bomin Wang, Shangqi Gao, Sihan Wang, Xiahai Zhuang
Title: Learning Concept-Driven Logical Rules for Interpretable and Generalizable Medical Image Classification
Abstract:
The pursuit of decision safety in clinical applications highlights the potential of concept-based methods in medical imaging. While these models offer active interpretability, they often suffer from concept leakages, where unintended information within soft concept representations undermines both interpretability and generalizability. Moreover, most concept-based models focus solely on local explanations (instance-level), neglecting the global decision logic (dataset-level). To address these limitations, we propose Concept Rule Learner (CRL), a novel framework to learn Boolean logical rules from binarized visual concepts. CRL employs logical layers to capture concept correlations and extract clinically meaningful rules, thereby providing both local and global interpretability. Experiments on two medical image classification tasks show that CRL achieves competitive performance with existing methods while significantly improving generalizability to out-of-distribution data. The code of our work is available at https://github.com/obiyoag/crl.
中文: 概念规则学习器(CRL)框架通过从二值化视觉概念中学习布尔逻辑规则,解决了医学影像中的概念泄漏和可解释性局限问题,在分类任务中实现了优异性能并显著提升了泛化能力。
English: The Concept Rule Learner (CRL) framework addresses concept leakage and limited interpretability in medical imaging by learning Boolean logical rules from binarized visual concepts, achieving competitive performance and enhanced generalizability in classification tasks.

Authors:Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki
Title: Adversarially Pretrained Transformers may be Universally Robust In-Context Learners
Abstract:
Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we show that transformers adversarially pretrained on diverse tasks can serve as robust foundation models and eliminate the need for adversarial training in downstream tasks. Specifically, we theoretically demonstrate that through in-context learning, a single adversarially pretrained transformer can robustly generalize to multiple unseen tasks without any additional training, i.e., without any parameter updates. This robustness stems from the model's focus on robust features and its resistance to attacks that exploit non-predictive features. Besides these positive findings, we also identify several limitations. Under certain conditions (though unrealistic), no universally robust single-layer transformers exist. Moreover, robust transformers exhibit an accuracy-robustness trade-off and require a large number of in-context demonstrations. The code is available at https://github.com/s-kumano/universally-robust-in-context-learner.
中文:对抗性预训练的Transformer可作为鲁棒基础模型,通过上下文学习无需额外对抗训练即可泛化至未知任务,但也存在准确率-鲁棒性权衡及依赖大量示例等局限性。
English: Adversarially pretrained transformers can serve as robust foundation models, enabling generalization to unseen tasks without further adversarial training through in-context learning, though they face limitations like accuracy-robustness trade-offs and dependency on numerous demonstrations.

Authors:Jesper Duemose Nielsen, Karthik Gopinath, Andrew Hoopes, Adrian Dalca, Colin Magdamo, Steven Arnold, Sudeshna Das, Axel Thielscher, Juan Eugenio Iglesias, Oula Puonti
Title: End-to-end Cortical Surface Reconstruction from Clinical Magnetic Resonance Images
Abstract:
Surface-based cortical analysis is valuable for a variety of neuroimaging tasks, such as spatial normalization, parcellation, and gray matter (GM) thickness estimation. However, most tools for estimating cortical surfaces work exclusively on scans with at least 1 mm isotropic resolution and are tuned to a specific magnetic resonance (MR) contrast, often T1-weighted (T1w). This precludes application to most clinical MR scans, which are very heterogeneous in terms of contrast and resolution. Here, we use synthetic domain-randomized data to train the first neural network for explicit estimation of cortical surfaces from scans of any contrast and resolution, without retraining. Our method deforms a template mesh to the white matter (WM) surface, which guarantees topological correctness. This mesh is further deformed to estimate the GM surface. We compare our method to recon-all-clinical (RAC), an implicit surface reconstruction method which is currently the only other tool capable of processing heterogeneous clinical MR scans, on ADNI and a large clinical dataset (n=1,332). We show an approximately 50% reduction in cortical thickness error (from 0.50 to 0.24 mm) with respect to RAC and better recovery of the aging-related cortical thinning patterns detected by FreeSurfer on high-resolution T1w scans. Our method enables fast and accurate surface reconstruction of clinical scans, allowing studies (1) with sample sizes far beyond what is feasible in a research setting, and (2) of clinical populations that are difficult to enroll in research studies. The code is publicly available at https://github.com/simnibs/brainnet.
Chinese: 本研究提出了一种神经网络方法,能够从任意对比度和分辨率的临床MRI扫描中精确估计皮质表面,相较于现有工具将皮质厚度误差降低了50%,为更广泛的临床研究应用提供了可能。
English: This study introduces a neural network method that accurately estimates cortical surfaces from diverse clinical MRI scans of any contrast and resolution, achieving a 50% reduction in cortical thickness error compared to existing tools and enabling broader clinical research applications.

Authors:Zhidan Liu, Chengtang Yao, Jiaxi Zeng, Yuwei Wu, Yunde Jia
Title: Multi-Label Stereo Matching for Transparent Scene Depth Estimation
Abstract:
In this paper, we present a multi-label stereo matching method to simultaneously estimate the depth of the transparent objects and the occluded background in transparent scenes. Unlike previous methods that assume a unimodal distribution along the disparity dimension and formulate the matching as a single-label regression problem, we propose a multi-label regression formulation to estimate multiple depth values at the same pixel in transparent scenes. To resolve the multi-label regression problem, we introduce a pixel-wise multivariate Gaussian representation, where the mean vector encodes multiple depth values at the same pixel, and the covariance matrix determines whether a multi-label representation is necessary for a given pixel. The representation is iteratively predicted within a GRU framework. In each iteration, we first predict the update step for the mean parameters and then use both the update step and the updated mean parameters to estimate the covariance matrix. We also synthesize a dataset containing 10 scenes and 89 objects to validate the performance of transparent scene depth estimation. The experiments show that our method greatly improves the performance on transparent surfaces while preserving the background information for scene reconstruction. Code is available at https://github.com/BFZD233/TranScene.
中文摘要:本文提出一种多标签立体匹配方法,通过像素级高斯表示和GRU框架同时估计透明物体与背景的深度,基于合成数据集的实验表明该方法在保持背景信息的同时显著提升了透明表面的深度估计性能。
English Summary: This paper introduces a multi-label stereo matching method using pixel-wise Gaussian representations within a GRU framework to simultaneously estimate depths of transparent objects and backgrounds, demonstrating significant performance improvements through a synthesized dataset.

Authors:Marc Kaufeld, Korbinian Moller, Alessio Gambi, Paolo Arcaini, Johannes Betz
Title: MultiDrive: A Co-Simulation Framework Bridging 2D and 3D Driving Simulation for AV Software Validation
Abstract:
Scenario-based testing using simulations is a cornerstone of Autonomous Vehicles (AVs) software validation. So far, developers needed to choose between low-fidelity 2D simulators to explore the scenario space efficiently, and high-fidelity 3D simulators to study relevant scenarios in more detail, thus reducing testing costs while mitigating the sim-to-real gap. This paper presents a novel framework that leverages multi-agent co-simulation and procedural scenario generation to support scenario-based testing across low- and high-fidelity simulators for the development of motion planning algorithms. Our framework limits the effort required to transition scenarios between simulators and automates experiment execution, trajectory analysis, and visualization. Experiments with a reference motion planner show that our framework uncovers discrepancies between the planner's intended and actual behavior, thus exposing weaknesses in planning assumptions under more realistic conditions. Our framework is available at: https://github.com/TUM-AVS/MultiDrive
中文: 本文提出了一种新颖框架,可在低精度和高精度模拟器间实现无缝的场景测试,自动执行实验流程,并在更真实条件下揭示运动规划器行为的差异。
English: This paper introduces a novel framework that enables seamless scenario-based testing across low- and high-fidelity simulators, automating experiment processes and revealing discrepancies in motion planner behavior under realistic conditions.

Authors:Ziyang Zeng, Dun Zhang, Jiacheng Li, Panxiang Zou, Yudong Zhou, Yuqing Yang
Title: An Empirical Study of Position Bias in Modern Information Retrieval
Abstract:
This study investigates the position bias in information retrieval, where models tend to overemphasize content at the beginning of passages while neglecting semantically relevant information that appears later. To analyze the extent and impact of position bias, we introduce a new evaluation framework consisting of two position-aware retrieval benchmarks (SQuAD-PosQ, FineWeb-PosQ) and an intuitive diagnostic metric, the Position Sensitivity Index (PSI), for quantifying position bias from a worst-case perspective. We conduct a comprehensive evaluation across the full retrieval pipeline, including BM25, dense embedding models, ColBERT-style late-interaction models, and full-interaction reranker models. Our experiments show that when relevant information appears later in the passage, dense embedding models and ColBERT-style models suffer significant performance degradation (an average drop of 15.6%). In contrast, BM25 and reranker models demonstrate greater robustness to such positional variation. These findings provide practical insights into model sensitivity to the position of relevant information and offer guidance for building more position-robust retrieval systems. Code and data are publicly available at: https://github.com/NovaSearch-Team/position-bias-in-IR.
Chinese: 本研究探讨信息检索中的位置偏差,发现当关键信息出现在段落后部时,密集嵌入模型和ColBERT式模型性能显著下降,而BM25和重排序模型则表现出更强的鲁棒性。
English: This research examines position bias in information retrieval, revealing that dense embedding and ColBERT-style models experience significant performance drops when key information appears later in passages, while BM25 and reranker models show greater robustness.

Authors:Bao-Ngoc Dao, Quang Nguyen, Luyen Ngo Dinh, Minh Le, Nam Le, Linh Ngo Van
Title: Towards Rehearsal-Free Continual Relation Extraction: Capturing Within-Task Variance with Adaptive Prompting
Abstract:
Memory-based approaches have shown strong performance in Continual Relation Extraction (CRE). However, storing examples from previous tasks increases memory usage and raises privacy concerns. Recently, prompt-based methods have emerged as a promising alternative, as they do not rely on storing past samples. Despite this progress, current prompt-based techniques face several core challenges in CRE, particularly in accurately identifying task identities and mitigating catastrophic forgetting. Existing prompt selection strategies often suffer from inaccuracies, lack robust mechanisms to prevent forgetting in shared parameters, and struggle to handle both cross-task and within-task variations. In this paper, we propose WAVE++, a novel approach inspired by the connection between prefix-tuning and mixture of experts. Specifically, we introduce task-specific prompt pools that enhance flexibility and adaptability across diverse tasks while avoiding boundary-spanning risks; this design more effectively captures variations within each task and across tasks. To further refine relation classification, we incorporate label descriptions that provide richer, more global context, enabling the model to better distinguish among different relations. We also propose a training-free mechanism to improve task prediction during inference. Moreover, we integrate a generative model to consolidate prior knowledge within the shared parameters, thereby removing the need for explicit data storage. Extensive experiments demonstrate that WAVE++ outperforms state-of-the-art prompt-based and rehearsal-based methods, offering a more robust solution for continual relation extraction. Our code is publicly available at https://github.com/PiDinosauR2804/WAVE-CRE-PLUS-PLUS.
中文:WAVE++通过任务特定的提示池和标签描述,提升了持续关系抽取的适应性和分类能力,无需存储历史数据即可超越现有方法。
English: WAVE++ introduces task-specific prompt pools and label descriptions to enhance adaptability and classification in continual relation extraction, outperforming existing methods without storing past data.

Authors:Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri
Title: CLEVER: A Curated Benchmark for Formally Verified Code Generation
Abstract:
We introduce ${\rm C{\small LEVER}}$, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, ${\rm C{\small LEVER}}$ avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use ${\rm C{\small LEVER}}$ to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub (https://github.com/trishullab/clever) as well as HuggingFace (https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online (https://github.com/trishullab/clever-prover).
中文: 本文介绍了${\rm C{\small LEVER}}$,这是一个包含161个问题的Lean验证代码生成基准,要求同时完成规范匹配和可证明实现,并通过Lean类型检查器严格验证正确性,避免了测试用例监督或逻辑泄露。
English: This paper introduces ${\rm C{\small LEVER}}$, a 161-problem benchmark for verified code generation in Lean that requires both specification matching and provable implementation, rigorously verified through Lean's type checker to ensure correctness without test-case supervision or leaked logic.

Authors:Saydul Akbar Murad, Ashim Dahal, Nick Rahimi
Title: EEG-to-Text Translation: A Model for Deciphering Human Brain Activity
Abstract:
With the rapid advancement of large language models like Gemini, GPT, and others, bridging the gap between the human brain and language processing has become an important area of focus. To address this challenge, researchers have developed various models to decode EEG signals into text. However, these models still face significant performance limitations. To overcome these shortcomings, we propose a new model, R1 Translator, which aims to improve the performance of EEG-to-text decoding. The R1 Translator model combines a bidirectional LSTM encoder with a pretrained transformer-based decoder, utilizing EEG features to produce high-quality text outputs. The model processes EEG embeddings through the LSTM to capture sequential dependencies, which are then fed into the transformer decoder for effective text generation. The R1 Translator excels in ROUGE metrics, outperforming both T5 (previous research) and Brain Translator. Specifically, R1 achieves a ROUGE-1 score of 38.00% (P), which is up to 9% higher than T5 (34.89%) and 3% better than Brain (35.69%). It also leads in ROUGE-L, with an F1 score of 32.51%, outperforming T5 by 3% (29.67%) and Brain by 2% (30.38%). In terms of CER, R1 achieves a CER of 0.5795, which is 2% lower than T5 (0.5917) and 4% lower than Brain (0.6001). Additionally, R1 performs better in WER with a score of 0.7280, outperforming T5 by 4.3% (0.7610) and Brain by 3.6% (0.7553). Code is available at https://github.com/Mmurrad/EEG-To-text.
Chinese: R1 Translator模型结合双向LSTM编码器和预训练Transformer解码器,在脑电信号转文本任务中显著提升性能,各项ROUGE指标、CER和WER均优于现有模型。
English: The R1 Translator model, integrating a bidirectional LSTM encoder with a transformer-based decoder, significantly enhances EEG-to-text decoding by outperforming existing models in ROUGE metrics, CER, and WER.
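The abstract does not state whether each quoted percentage is a relative change or a difference in points. A small sketch like the following, using only the scores quoted in the abstract, computes both quantities so that a reader can compare them; the helper names are illustrative and not part of the authors' code.

```python
# Scores are copied from the abstract; helper names are illustrative only.

def point_difference(ours, baseline):
    """Difference in raw score units (for example, percentage points)."""
    return ours - baseline

def relative_change_pct(ours, baseline):
    """Change relative to the baseline score, expressed in percent."""
    return 100.0 * (ours - baseline) / baseline

rouge1_precision = {"R1": 38.00, "T5": 34.89, "Brain": 35.69}  # higher is better
wer = {"R1": 0.7280, "T5": 0.7610, "Brain": 0.7553}            # lower is better

for baseline in ("T5", "Brain"):
    r1 = rouge1_precision["R1"]
    ref = rouge1_precision[baseline]
    print("ROUGE-1 (P) vs", baseline,
          "points:", round(point_difference(r1, ref), 2),
          "relative %:", round(relative_change_pct(r1, ref), 2))

for baseline in ("T5", "Brain"):
    # For an error rate, a negative relative change is an improvement.
    print("WER vs", baseline,
          "relative %:", round(relative_change_pct(wer["R1"], wer[baseline]), 2))
```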

Authors:Qifeng Cai, Hao Liang, Hejun Dong, Meiyi Qiang, Ruichuan An, Zhaoyang Han, Zhengzhou Zhu, Bin Cui, Wentao Zhang
Title: LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts
Abstract:
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at https://github.com/TechNomad-ds/LoVR-benchmark
中文: LoVR基准通过引入更长的视频、高质量细粒度标注和可扩展的标注框架,解决了现有视频文本检索数据集的局限性,为先进模型提出了新的挑战。
English: The LoVR benchmark addresses limitations in existing video-text retrieval datasets by introducing longer videos, high-quality fine-grained captions, and a scalable annotation framework, presenting new challenges for advanced models.

Authors:Wanjing Huang, Weixiang Yan, Zhen Zhang, Ambuj Singh
Title: APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight
Abstract:
Large Language Models (LLMs) demonstrate strong reasoning and task planning capabilities but remain fundamentally limited in physical interaction modeling. Existing approaches integrate perception via Vision-Language Models (VLMs) or adaptive decision-making through Reinforcement Learning (RL), but they fail to capture dynamic object interactions or require task-specific training, limiting their real-world applicability. We introduce APEX (Anticipatory Physics-Enhanced Execution), a framework that equips LLMs with physics-driven foresight for real-time task planning. APEX constructs structured graphs to identify and model the most relevant dynamic interactions in the environment, providing LLMs with explicit physical state updates. Simultaneously, APEX provides low-latency forward simulations of physically feasible actions, allowing LLMs to select optimal strategies based on predictive outcomes rather than static observations. We evaluate APEX on three benchmarks designed to assess perception, prediction, and decision-making: (1) Physics Reasoning Benchmark, testing causal inference and object motion prediction; (2) Tetris, evaluating whether physics-informed prediction enhances decision-making performance in long-horizon planning tasks; (3) Dynamic Obstacle Avoidance, assessing the immediate integration of perception and action feasibility analysis. APEX significantly outperforms standard LLMs and VLM-based models, demonstrating the necessity of explicit physics reasoning for bridging the gap between language-based intelligence and real-world task execution. The source code and experiment setup are publicly available at https://github.com/hwj20/APEX_EXP .
中文: APEX框架通过物理驱动的预见性和实时模拟增强大语言模型,有效克服动态交互建模的局限,在实际任务中显著优于标准模型。
English: APEX enhances LLMs with physics-driven foresight and real-time simulations to overcome limitations in dynamic interaction modeling, significantly improving performance in real-world tasks over standard models.
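The idea of choosing an action from forward-simulated outcomes rather than from a static observation can be sketched with a toy obstacle-avoidance example. Everything below is an assumption made for illustration: the constant-velocity dynamics, the action set, and the "maximize the minimum predicted distance" criterion are not taken from the APEX paper or its repository.

```python
# Toy sketch: dynamics, action set, and scoring rule are all hypothetical.

def min_predicted_distance(agent_pos, agent_vel, obstacles, steps=20, dt=0.1):
    """Roll the agent and constant-velocity obstacles forward in time.

    Returns the smallest agent-obstacle distance seen over the horizon.
    """
    ax, ay = agent_pos
    vx, vy = agent_vel
    closest = float("inf")
    for step in range(1, steps + 1):
        t = step * dt
        px, py = ax + vx * t, ay + vy * t
        for (ox, oy), (ovx, ovy) in obstacles:
            qx, qy = ox + ovx * t, oy + ovy * t
            dist = ((px - qx) ** 2 + (py - qy) ** 2) ** 0.5
            closest = min(closest, dist)
    return closest

def choose_action(agent_pos, actions, obstacles):
    """Pick the action whose simulated rollout stays farthest from all obstacles."""
    return max(
        actions,
        key=lambda name: min_predicted_distance(agent_pos, actions[name], obstacles),
    )

# One obstacle starts 2 m ahead and moves toward the agent at 1 m/s.
obstacles = [((2.0, 0.0), (-1.0, 0.0))]
actions = {"stay": (0.0, 0.0), "forward": (1.0, 0.0), "left": (0.0, 1.0)}
best = choose_action((0.0, 0.0), actions, obstacles)
print(best)
```

In this toy setup the obstacle is still 2 m away at the moment of decision, so a static snapshot gives no reason to move; only the rollout shows that staying put or driving forward leads to a collision within the horizon.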

Authors:Yanheng He, Jiahe Jin, Pengfei Liu
Title: Efficient Agent Training for Computer Use
Abstract:
Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further improved data quality by synthesizing diverse action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, surpassing the strong Claude 3.7 Sonnet with extended thinking on WindowsAgentArena-V2, an improved benchmark we also released. Furthermore, PC Agent-E demonstrates strong generalizability to different operating systems on OSWorld. Our findings suggest that strong computer use capabilities can be stimulated from a small amount of high-quality trajectory data.
中文: PC Agent-E 是一种高效的智能体训练框架,仅通过312条人工标注轨迹并结合合成动作决策,显著降低了对大规模人类示范数据的依赖,在基准测试中实现了141%的性能提升,并展现出强大的跨平台泛化能力。
English: PC Agent-E is an efficient training framework that significantly reduces the need for large-scale human demonstrations by using just 312 annotated trajectories enhanced with synthesized actions, achieving a 141% improvement on benchmarks and demonstrating strong cross-platform generalization.

Authors:Ruihan Liu, Xiaoyi Wu, Xijun Chen, Liang Hu, Yunjiang Lou
Title: 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision
Abstract:
A comprehensive understanding of 3D scenes is essential for autonomous vehicles (AVs), and among various perception tasks, occupancy estimation plays a central role by providing a general representation of drivable and occupied space. However, most existing occupancy estimation methods rely on LiDAR or cameras, which perform poorly in degraded environments such as smoke, rain, snow, and fog. In this paper, we propose 4D-ROLLS, the first weakly supervised occupancy estimation method for 4D radar using the LiDAR point cloud as the supervisory signal. Specifically, we introduce a method for generating pseudo-LiDAR labels, including occupancy queries and LiDAR height maps, as multi-stage supervision to train the 4D radar occupancy estimation model. Then the model is aligned with the occupancy map produced by LiDAR, fine-tuning its accuracy in occupancy estimation. Extensive comparative experiments validate the exceptional performance of 4D-ROLLS. Its robustness in degraded environments and effectiveness in cross-dataset training are qualitatively demonstrated. The model is also seamlessly transferred to the downstream tasks of BEV segmentation and point cloud occupancy prediction, highlighting its potential for broader applications. The lightweight network enables the 4D-ROLLS model to achieve fast inference speeds of about 30 Hz on a 4060 GPU. The code of 4D-ROLLS will be made available at https://github.com/CLASS-Lab/4D-ROLLS.
中文: 本文提出的4D-ROLLS是首个基于4D雷达的弱监督占据估计方法,通过LiDAR生成的伪标签进行多阶段监督训练,在恶劣天气条件下表现出卓越的鲁棒性,并能快速推理及迁移至下游任务。
English: This paper introduces 4D-ROLLS, a weakly supervised method for 4D radar occupancy estimation that uses LiDAR-generated pseudo-labels to achieve robust performance in degraded weather conditions while enabling fast inference and seamless transfer to downstream tasks.

Authors:Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang, Zimo Meng, Zhengyang Zhao, Bohan Zeng, Zhengzhou Zhu, Bin Cui, Wentao Zhang
Title: Let's Verify Math Questions Step by Step
Abstract:
Large Language Models (LLMs) have recently achieved remarkable progress in mathematical reasoning. To enable such capabilities, many existing works distill strong reasoning models into long chains of thought or design algorithms to construct high-quality math QA data for training. However, these efforts primarily focus on generating correct reasoning paths and answers, while largely overlooking the validity of the questions themselves. In this work, we propose Math Question Verification (MathQ-Verify), a novel five-stage pipeline designed to rigorously filter ill-posed or under-specified math problems. MathQ-Verify first performs format-level validation to remove redundant instructions and ensure that each question is syntactically well-formed. It then formalizes each question, decomposes it into atomic conditions, and verifies them against mathematical definitions. Next, it detects logical contradictions among these conditions, followed by a goal-oriented completeness check to ensure the question provides sufficient information for solving. To evaluate this task, we use existing benchmarks along with an additional dataset we construct, containing 2,147 math questions with diverse error types, each manually double-validated. Experiments show that MathQ-Verify achieves state-of-the-art performance across multiple benchmarks, improving the F1 score by up to 25 percentage points over the direct verification baseline. It further attains approximately 90% precision and 63% recall through a lightweight model voting scheme. MathQ-Verify offers a scalable and accurate solution for curating reliable mathematical datasets, reducing label noise and avoiding unnecessary computation on invalid questions. Our code and data are available at https://github.com/scuuy/MathQ-Verify.
Chinese: 本文提出的MathQ-Verify通过五个阶段的流程,有效过滤定义不清的数学问题,包括验证格式、分解条件、检测矛盾及确保完整性,实现了最优性能并提升了数据集的可靠性。
English: This paper introduces MathQ-Verify, a five-stage pipeline that effectively filters ill-posed math problems by validating format, decomposing conditions, detecting contradictions, and ensuring completeness, achieving state-of-the-art performance and enhancing dataset reliability.
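Two of the quantities mentioned in the abstract can be made concrete with a short sketch. The F1 score is the harmonic mean of precision and recall, so the reported operating point of roughly 90% precision and 63% recall corresponds to an F1 of about 0.74 (a value implied by the formula, not one stated in the abstract). The voting rule shown here, a strict majority over several verifier verdicts, is an assumption; the abstract only says that a lightweight model voting scheme is used.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def majority_vote(verdicts):
    """Accept a question only if a strict majority of verifiers accept it.

    This voting rule is an assumed stand-in for the paper's scheme.
    """
    return 2 * sum(verdicts) > len(verdicts)

# Approximate operating point quoted in the abstract.
print(round(f1_score(0.90, 0.63), 3))

# Hypothetical verdicts from three verifier models for one question.
print(majority_vote([True, True, False]))
```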

Authors:Yingwei Zhang, Ke Bu, Zhuoran Zhuang, Tao Xie, Yao Yu, Dong Li, Yang Guo, Detao Lv
Title: CRAFT: Time Series Forecasting with Cross-Future Behavior Awareness
Abstract:
The past decades have witnessed significant advancements in time series forecasting (TSF) across various real-world domains, including e-commerce and disease spread prediction. However, TSF is usually constrained by the uncertainty dilemma of predicting future data with limited past observations. To address this problem, we explore the use of Cross-Future Behavior (CFB) in TSF, which occurs before the current time but takes effect in the future. We leverage CFB features and propose the CRoss-Future Behavior Awareness based Time Series Forecasting method (CRAFT). The core idea of CRAFT is to utilize the trend of cross-future behavior to mine the trend of time series data to be predicted. Specifically, to address the sparse and partial nature of cross-future behavior, CRAFT employs the Koopman Predictor Module to extract the key trend and the Internal Trend Mining Module to supplement the unknown area of the cross-future behavior matrix. Then, we introduce the External Trend Guide Module with a hierarchical structure to acquire more representative trends from higher levels. Finally, we apply the demand-constrained loss to calibrate the distribution deviation of prediction results. We conduct experiments on a real-world dataset. Experiments on both an offline large-scale dataset and an online A/B test demonstrate the effectiveness of CRAFT. Our dataset and code are available at https://github.com/CRAFTinTSF/CRAFT.
中文: 本文提出CRAFT方法,通过挖掘跨未来行为趋势并利用Koopman预测器等模块补全数据,有效解决了时间序列预测中的不确定性难题,真实场景实验验证了其优越性。
English: This paper introduces CRAFT, a time series forecasting method that leverages cross-future behavior to address prediction uncertainty by extracting and supplementing trends through specialized modules, demonstrating effectiveness in real-world experiments.

Authors:Ziqian Wang, Xianjun Xia, Xinfa Zhu, Lei Xie
Title: U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding
Abstract:
The text generation paradigm for audio tasks has opened new possibilities for unified audio understanding. However, existing models face significant challenges in achieving a comprehensive understanding across diverse audio types, such as speech, general audio events, and music. Furthermore, their exclusive reliance on cross-entropy loss for alignment often falls short, as it treats all tokens equally and fails to account for redundant audio features, leading to weaker cross-modal alignment. To deal with the above challenges, this paper introduces U-SAM, an advanced audio language model that integrates specialized encoders for speech, audio, and music with a pre-trained large language model (LLM). U-SAM employs a Mixture of Experts (MoE) projector for task-aware feature fusion, dynamically routing and integrating the domain-specific encoder outputs. Additionally, U-SAM incorporates a Semantic-Aware Contrastive Loss Module, which explicitly identifies redundant audio features under language supervision and rectifies their semantic and spectral representations to enhance cross-modal alignment. Extensive experiments demonstrate that U-SAM consistently outperforms both specialized models and existing audio language models across multiple benchmarks. Moreover, it exhibits emergent capabilities on unseen tasks, showcasing its generalization potential. Code is available (https://github.com/Honee-W/U-SAM/).
Chinese: U-SAM是一种先进的音频语言模型,通过整合专用编码器和专家混合投影器实现动态特征融合,并采用语义感知对比损失模块增强跨模态对齐,在多种音频任务中表现优于现有模型。
English: U-SAM is an advanced audio language model that integrates specialized encoders and a Mixture of Experts projector for dynamic feature fusion, enhanced by a Semantic-Aware Contrastive Loss Module to improve cross-modal alignment and outperform existing models across diverse audio tasks.

Authors:Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim
Title: Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning
Abstract:
Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective in solving problems that require logical thinking, long reasoning paths significantly increase memory usage and reduce token-generation throughput, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining the KV cache entries that receive high importance scores, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves generation throughput of QwQ-32B by up to 1.60$\times$ compared to inference with the full KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.
Chinese: 近期专注于推理的语言模型虽能实现高准确率,但因冗长的推理路径导致内存使用增加和生成吞吐量下降,为此我们提出推理路径压缩(RPC)这一无需训练的方法,通过基于语义稀疏性压缩KV缓存来加速推理,在AIME 2024基准测试中使QwQ-32B的生成吞吐量提升高达1.60倍,而准确率仅下降1.2%。
English: Recent reasoning-focused language models achieve high accuracy but suffer from increased memory usage and reduced throughput due to lengthy reasoning paths, prompting the introduction of Reasoning Path Compression (RPC), a training-free method that accelerates inference by compressing the KV cache based on semantic sparsity, improving generation throughput by up to 1.60× with minimal accuracy loss.
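The selection step described in the abstract, scoring cached entries with a window of recent queries and keeping the highest-scoring ones, can be sketched in a few lines. This is a simplified single-head illustration under stated assumptions: the importance score is taken to be the attention weight each cached key receives, summed over the selector window, and the compression interval, per-head handling, and retention budget used by the authors are not modeled. Function names are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def importance_scores(keys, recent_queries):
    """Sum, over the selector window, of the attention weight on each cached key."""
    scores = [0.0] * len(keys)
    for q in recent_queries:
        weights = softmax([dot(q, k) for k in keys])
        for i, w in enumerate(weights):
            scores[i] += w
    return scores

def compress_kv(keys, values, recent_queries, keep):
    """Keep the `keep` highest-scoring cache entries, preserving token order."""
    scores = importance_scores(keys, recent_queries)
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)
    return [keys[i] for i in kept], [values[i] for i in kept], kept

# Toy cache with four entries and a selector window of two recent queries.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = ["v0", "v1", "v2", "v3"]
recent_queries = [[2.0, 0.0], [1.0, 1.0]]

new_keys, new_values, kept = compress_kv(keys, values, recent_queries, keep=2)
print(kept, new_values)
```

Sorting the retained indices keeps the surviving entries in their original token order, which matters if positional information is tied to cache order.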

Authors:Tian Sun, Yuqi Chen, Baihua Zheng, Weiwei Sun
Title: Learning Spatio-Temporal Dynamics for Trajectory Recovery via Time-Aware Transformer
Abstract:
In real-world applications, GPS trajectories often suffer from low sampling rates, with large and irregular intervals between consecutive GPS points. This sparse characteristic presents challenges for their direct use in GPS-based systems. This paper addresses the task of map-constrained trajectory recovery, aiming to enhance trajectory sampling rates of GPS trajectories. Previous studies commonly adopt a sequence-to-sequence framework, where an encoder captures the trajectory patterns and a decoder reconstructs the target trajectory. Within this framework, effectively representing the road network and extracting relevant trajectory features are crucial for overall performance. Despite advancements in these models, they fail to fully leverage the complex spatio-temporal dynamics present in both the trajectory and the road network. To overcome these limitations, we categorize the spatio-temporal dynamics of trajectory data into two distinct aspects: spatio-temporal traffic dynamics and trajectory dynamics. Furthermore, we propose TedTrajRec, a novel method for trajectory recovery. To capture spatio-temporal traffic dynamics, we introduce PD-GNN, which models periodic patterns and learns topologically aware dynamics concurrently for each road segment. For spatio-temporal trajectory dynamics, we present TedFormer, a time-aware Transformer that incorporates temporal dynamics for each GPS location by integrating closed-form neural ordinary differential equations into the attention mechanism. This allows TedFormer to effectively handle irregularly sampled data. Extensive experiments on three real-world datasets demonstrate the superior performance of TedTrajRec. The code is publicly available at https://github.com/ysygMhdxw/TEDTrajRec/.
中文: 本文提出TedTrajRec方法,通过PD-GNN建模时空交通动态和TedFormer变换器处理轨迹动态,有效提升了低采样率GPS轨迹的恢复效果,并在实验中展现出优越性能。
English: This paper introduces TedTrajRec, a novel method that enhances low-sampling-rate GPS trajectory recovery by modeling spatio-temporal traffic dynamics with PD-GNN and trajectory dynamics with the time-aware TedFormer Transformer, demonstrating superior performance in experiments.

Authors:Zhenyu Bao, Qing Li, Guibiao Liao, Zhongyuan Zhao, Kanglin Liu
Title: MGStream: Motion-aware 3D Gaussian for Streamable Dynamic Scene Reconstruction
Abstract:
3D Gaussian Splatting (3DGS) has gained significant attention in streamable dynamic novel view synthesis (DNVS) for its photorealistic rendering capability and computational efficiency. Despite much progress in improving rendering quality and optimization strategies, 3DGS-based streamable dynamic scene reconstruction still suffers from flickering artifacts and storage inefficiency, and struggles to model emerging objects. To tackle this, we introduce MGStream, which employs motion-related 3D Gaussians (3DGs) to reconstruct dynamic content and vanilla 3DGs to reconstruct static content. The motion-related 3DGs are selected according to a motion mask and a clustering-based convex hull algorithm. Rigid deformation is applied to the motion-related 3DGs to model dynamic content, and attention-based optimization of the motion-related 3DGs enables the reconstruction of emerging objects. As the deformation and optimization are only conducted on the motion-related 3DGs, MGStream avoids flickering artifacts and improves storage efficiency. Extensive experiments on the real-world datasets N3DV and MeetRoom demonstrate that MGStream surpasses existing streaming 3DGS-based approaches in terms of rendering quality, training/storage efficiency and temporal consistency. Our code is available at: https://github.com/pcl3dv/MGStream.
中文: MGStream通过引入运动相关的3D高斯模型处理动态场景、标准3D高斯模型处理静态场景,有效消除了闪烁伪影并提升了存储效率,在渲染质量和时间一致性方面优于现有流式3D高斯重建方法。
English: MGStream enhances streamable dynamic scene reconstruction by using motion-related 3D Gaussians for dynamic elements and vanilla 3D Gaussians for static ones, effectively reducing flickering artifacts and improving storage efficiency while achieving superior rendering quality and temporal consistency.

Authors:Matthew Raffel, Lizhong Chen
Title: FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold Transformer
Abstract:
The Kolmogorov-Arnold Network (KAN) has been gaining popularity as an alternative to the multi-layer perceptron (MLP) with its increased expressiveness and interpretability. However, the KAN can be orders of magnitude slower due to its increased computational cost and training instability, limiting its applicability to larger-scale tasks. Recently, the Kolmogorov-Arnold Transformer (KAT) has been proposed, which can achieve FLOPs similar to the traditional Transformer with MLPs by leveraging Group-Rational KAN (GR-KAN). Unfortunately, despite the comparable FLOPs, our characterizations reveal that the KAT is still 123x slower in training speeds, indicating that there are other performance bottlenecks beyond FLOPs. In this paper, we conduct a series of experiments to understand the root cause of the slowdown in KAT. We uncover that the slowdown can be isolated to memory stalls and, more specifically, in the backward pass of GR-KAN caused by inefficient gradient accumulation. To address this memory bottleneck, we propose FlashKAT, which builds on our restructured kernel that minimizes gradient accumulation with atomic adds and accesses to slow memory. Evaluations demonstrate that FlashKAT can achieve a training speedup of 86.5x compared with the state-of-the-art KAT, while reducing rounding errors in the coefficient gradients. Our code is available at https://github.com/OSU-STARLAB/FlashKAT.
中文: Kolmogorov-Arnold Transformer (KAT) 因梯度累积中的内存瓶颈导致训练速度大幅下降,而 FlashKAT 通过重构内核减少内存延迟,实现了 86.5 倍的加速并提高了梯度精度。
English: The Kolmogorov-Arnold Transformer (KAT) suffers from significant training slowdowns due to memory bottlenecks in gradient accumulation, which FlashKAT addresses by restructuring the kernel to minimize memory stalls, achieving an 86.5x speedup while improving gradient accuracy.

Authors:Guoheng Sun, Ziyao Wang, Bowei Tian, Meng Liu, Zheyu Shen, Shwai He, Yexiao He, Wanghao Ye, Yiting Wang, Ang Li
Title: CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs
Abstract:
As post-training techniques evolve, large language models (LLMs) are increasingly augmented with structured multi-step reasoning abilities, often optimized through reinforcement learning. These reasoning-enhanced models outperform standard LLMs on complex tasks and now underpin many commercial LLM APIs. However, to protect proprietary behavior and reduce verbosity, providers typically conceal the reasoning traces while returning only the final answer. This opacity introduces a critical transparency gap: users are billed for invisible reasoning tokens, which often account for the majority of the cost, yet have no means to verify their authenticity. This opens the door to token count inflation, where providers may overreport token usage or inject synthetic, low-effort tokens to inflate charges. To address this issue, we propose CoIn, a verification framework that audits both the quantity and semantic validity of hidden tokens. CoIn constructs a verifiable hash tree from token embedding fingerprints to check token counts, and uses embedding-based relevance matching to detect fabricated reasoning content. Experiments demonstrate that CoIn, when deployed as a trusted third-party auditor, can effectively detect token count inflation with a success rate reaching up to 94.7%, showing the strong ability to restore billing transparency in opaque LLM services. The dataset and code are available at https://github.com/CASE-Lab-UMD/LLM-Auditing-CoIn.
Chinese: CoIn框架通过验证隐藏推理令牌的数量和语义有效性,解决了增强推理能力的大型语言模型服务中缺乏透明度的问题,能以高达94.7%的成功率检测令牌计数膨胀,从而恢复计费透明度。
English: The CoIn framework addresses the lack of transparency in reasoning-enhanced LLM services by verifying both the quantity and semantic validity of hidden reasoning tokens, effectively detecting token count inflation with up to 94.7% success rate to restore billing transparency.
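
The hash-tree idea mentioned in the abstract can be illustrated with a generic Merkle tree. The sketch below is not CoIn's actual construction (the abstract does not give its details): it assumes each hidden token is represented by an opaque byte string standing in for an embedding fingerprint, and the helper names `build_levels`, `inclusion_proof` and `verify` are hypothetical. It only shows how a provider could commit to a sequence with a single root hash while an auditor spot-checks individual leaves.

```python
import hashlib


def _h(data: bytes) -> bytes:
    """SHA-256 digest used for both leaves and internal nodes."""
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return every tree level, from the hashed leaves up to the root."""
    level = [_h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: an unpaired last node is paired with itself.
            # This padding alone does not bind the number of leaves, so a
            # real auditor would also need a commitment to the count.
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def inclusion_proof(levels, index):
    """Collect (sibling_hash, node_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # unpaired node was hashed with itself
        proof.append((level[sibling], index % 2 == 0))
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the root from one leaf and its proof."""
    node = _h(leaf)
    for sibling, node_is_left in proof:
        node = _h(node + sibling) if node_is_left else _h(sibling + node)
    return node == root


# Hypothetical fingerprints for five hidden tokens.
fingerprints = [f"token-{i}".encode() for i in range(5)]
levels = build_levels(fingerprints)
root = levels[-1][0]
proof = inclusion_proof(levels, 3)
print(verify(fingerprints[3], proof, root))
print(verify(b"fabricated", proof, root))
```

The auditor needs only the root, the queried leaf and a logarithmic number of sibling hashes, which is why a tree is a natural fit for checking a long hidden sequence without revealing all of it.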

Authors:Etienne Gauthier, Francis Bach, Michael I. Jordan
Title: Backward Conformal Prediction
Abstract:
We introduce $\textit{Backward Conformal Prediction}$, a method that guarantees conformal coverage while providing flexible control over the size of prediction sets. Unlike standard conformal prediction, which fixes the coverage level and allows the conformal set size to vary, our approach defines a rule that constrains how prediction set sizes behave based on the observed data, and adapts the coverage level accordingly. Our method builds on two key foundations: (i) recent results by Gauthier et al. [2025] on post-hoc validity using e-values, which ensure marginal coverage of the form $\mathbb{P}(Y_{\rm test} \in \hat C_n^{\tilde\alpha}(X_{\rm test})) \ge 1 - \mathbb{E}[\tilde\alpha]$ up to a first-order Taylor approximation for any data-dependent miscoverage $\tilde\alpha$, and (ii) a novel leave-one-out estimator $\hat\alpha^{\rm LOO}$ of the marginal miscoverage $\mathbb{E}[\tilde\alpha]$ based on the calibration set, ensuring that the theoretical guarantees remain computable in practice. This approach is particularly useful in applications where large prediction sets are impractical such as medical diagnosis. We provide theoretical results and empirical evidence supporting the validity of our method, demonstrating that it maintains computable coverage guarantees while ensuring interpretable, well-controlled prediction set sizes.
中文: 反向共形预测通过基于观测数据自适应控制预测集大小来确保共形覆盖,利用e值有效性和新颖的留一估计器保持实际可计算性,特别适用于医学诊断等大预测集不实用的场景。
English: Backward Conformal Prediction ensures conformal coverage by adaptively controlling prediction set sizes based on observed data, utilizing e-value validity and a novel leave-one-out estimator to maintain practical computability, especially in applications like medical diagnosis where large sets are impractical.
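
For contrast, the standard approach that the abstract departs from can be sketched as split conformal prediction for classification. This is the well-known forward procedure, not the paper's backward method: the miscoverage level `alpha` is fixed in advance and the set size is whatever results. The nonconformity score (one minus the predicted probability of the label) and the function names are illustrative assumptions.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1)(1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every label.
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """Keep every label whose score 1 - p(label) is within the threshold."""
    return [label for label, p in enumerate(probs) if 1 - p <= threshold]


# Hypothetical calibration scores (1 - probability of the true label).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_threshold(cal_scores, alpha=0.25)
print(threshold)
print(prediction_set([0.5, 0.25, 0.25], threshold))
```

Here the coverage target is the input and the set size is the output; the backward formulation described above instead constrains the set size and reports the coverage level that results.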

Authors:Zihan Chen, Jiakang Li, Minghao Guo, Henry Chen, Zirui Li, Joel Bierman, Yipeng Huang, Huiyang Zhou, Yuan Liu, Eddy Z. Zhang
Title: Genesis: A Compiler Framework for Hamiltonian Simulation on Hybrid CV-DV Quantum Computers
Abstract:
This paper introduces Genesis, the first compiler designed to support Hamiltonian simulation on hybrid continuous-variable (CV) and discrete-variable (DV) quantum computing systems. Genesis is a two-level compilation system. At the first level, it decomposes an input Hamiltonian into basis gates using the native instruction set of the target hybrid CV-DV quantum computer. At the second level, it tackles the mapping and routing of qumodes/qubits to implement long-range interactions for the gates decomposed at the first level. Rather than a typical implementation that relies on SWAP primitives similar to qubit-based (or DV-only) systems, we propose an integrated design of connectivity-aware gate synthesis and beamsplitter SWAP insertion tailored for hybrid CV-DV systems. We also introduce an OpenQASM-like domain-specific language (DSL) named CVDV-QASM to represent Hamiltonians in terms of Pauli exponentials and basic gate sequences from the hybrid CV-DV gate set. Genesis has successfully compiled several important Hamiltonians, including the Bose-Hubbard model, the $\mathbb{Z}_2$-Higgs model, the Hubbard-Holstein model, the Heisenberg model and electron-vibration coupling Hamiltonians, which are critical in domains like quantum field theory, condensed matter physics, and quantum chemistry. Our implementation is available at Genesis-CVDV-Compiler (https://github.com/ruadapt/Genesis-CVDV-Compiler).
中文: Genesis是首个支持混合连续-离散变量量子系统的编译器,采用两级编译结构将哈密顿量分解为基础门操作,并通过连接感知的门合成与定制领域专用语言实现高效量子模拟。
English: Genesis is the first compiler for hybrid continuous-variable and discrete-variable quantum systems, featuring a two-level compilation process that decomposes Hamiltonians into native gates and optimizes their implementation with connectivity-aware synthesis and a custom domain-specific language.

Authors:Barkin Dagda, Muhammad Awais, Saber Fallah
Title: GeoVLM: Improving Automated Vehicle Geolocalisation Using Vision-Language Matching
Abstract:
Cross-view geo-localisation identifies the coarse geographical position of an automated vehicle by matching a ground-level image to a geo-tagged satellite image from a database. Despite advancements in cross-view geo-localisation, significant challenges persist, such as similar-looking scenes that make it difficult to rank the correct match first. Existing approaches reach high recall rates but still fail to rank the correct image as the top match. To address this challenge, this paper proposes GeoVLM, a novel approach which uses the zero-shot capabilities of vision language models to enable cross-view geo-localisation using interpretable cross-view language descriptions. GeoVLM is a trainable reranking approach which improves the best-match accuracy of cross-view geo-localisation. GeoVLM is evaluated on the standard benchmarks VIGOR and University-1652 and also in real-life driving environments using Cross-View United Kingdom, a new benchmark dataset introduced in this paper. The results show that GeoVLM improves the retrieval performance of cross-view geo-localisation compared to state-of-the-art methods with the help of explainable natural language descriptions. The code is available at https://github.com/CAV-Research-Lab/GeoVLM
Chinese: 本文提出GeoVLM,一种新颖的可训练重排序方法,利用视觉语言模型的零样本能力和可解释的跨视角语言描述,通过提高最佳匹配精度来改进跨视角地理定位,超越了现有最优方法。
English: This paper introduces GeoVLM, a novel trainable reranking method that leverages vision language models' zero-shot capabilities and interpretable cross-view language descriptions to enhance cross-view geo-localization by improving top-match accuracy over state-of-the-art approaches.

Authors:Fynn Fromme, Hans Harder, Christine Allen-Blanchette, Sebastian Peitz
Title: Surrogate Modeling of 3D Rayleigh-Benard Convection with Equivariant Autoencoders
Abstract:
The use of machine learning for modeling, understanding, and controlling large-scale physics systems is quickly gaining in popularity, with examples ranging from electromagnetism through nuclear fusion reactors and magneto-hydrodynamics to fluid mechanics and climate modeling. These systems, governed by partial differential equations, present unique challenges regarding the large number of degrees of freedom and the complex dynamics over many scales both in space and time, and additional measures to improve accuracy and sample efficiency are highly desirable. We present an end-to-end equivariant surrogate model consisting of an equivariant convolutional autoencoder and an equivariant convolutional LSTM using $G$-steerable kernels. As a case study, we consider the three-dimensional Rayleigh-Bénard convection, which describes the buoyancy-driven fluid flow between a heated bottom and a cooled top plate. While the system is E(2)-equivariant in the horizontal plane, the boundary conditions break the translational equivariance in the vertical direction. Our architecture leverages vertically stacked layers of $D_4$-steerable kernels, with additional partial kernel sharing in the vertical direction for further efficiency improvement. We demonstrate significant gains in sample and parameter efficiency, as well as a better scaling to more complex dynamics. The accompanying code is available at https://github.com/FynnFromme/equivariant-rb-forecasting.
中文: 机器学习正广泛应用于建模和控制由偏微分方程主导的大规模物理系统,我们提出的等变替代模型在三维瑞利-贝纳德对流等复杂动力学中显著提升了准确性和效率。
English: Machine learning is increasingly applied to model and control large-scale physics systems governed by partial differential equations, with our proposed equivariant surrogate model enhancing accuracy and efficiency in complex dynamics like three-dimensional Rayleigh-Bénard convection.

Authors:Yiru Jiao, Simeon C. Calvert, Sander van Cranenburgh, Hans van Lint
Title: Learning collision risk proactively from naturalistic driving data at scale
Abstract:
Accurately and proactively alerting drivers or automated systems to emerging collisions is crucial for road safety, particularly in highly interactive and complex urban environments. However, existing approaches to identifying potential collisions either require labour-intensive annotation of sparse risk, struggle to consider varying contextual factors, or are only useful in specific scenarios. To address these limits, this study introduces the Generalised Surrogate Safety Measure (GSSM), a new data-driven approach that learns collision risk exclusively from naturalistic driving without the need for crash or risk labels. GSSM captures the patterns of normal driving and estimates the extent to which a traffic interaction deviates from the norm towards an unsafe state. Diverse data from naturalistic driving, including motion kinematics, weather, lighting, etc., are used to train multiple GSSMs, which are tested with 2,591 reconstructed real-world crashes and near-crashes. These test events are also released here as the largest dataset of its kind to date. A basic GSSM using only instantaneous motion kinematics achieves an area under the precision-recall curve of 0.9 and secures a median time advance of 2.6 seconds to prevent potential collisions. Additional interaction patterns and contextual factors provide further performance gains. Across various types of collision risk scenarios (such as rear-end, merging, and turning interactions), the accuracy and timeliness of GSSM consistently outperform existing baselines. GSSM therefore establishes a scalable, context-aware, and generalisable foundation for proactively quantifying collision risk in traffic interactions. This can support and facilitate autonomous driving systems, traffic safety assessment, and road emergency management. Code and experiment data are openly accessible at https://github.com/Yiru-Jiao/GSSM.
中文: 本研究提出的广义替代安全度量(GSSM)通过无标签自然驾驶数据学习碰撞风险,在多种碰撞场景中均能实现精准及时的预警,其性能全面优于现有方法,为自动驾驶和交通安全评估提供了可扩展的解决方案。
English: This study introduces the Generalised Surrogate Safety Measure (GSSM), a data-driven method that learns collision risk from naturalistic driving data without requiring labeled incidents, achieving high accuracy and timeliness in predicting various collision scenarios while outperforming existing approaches.

Authors:Pengxin Guo, Yinong Wang, Wei Li, Mengting Liu, Ming Li, Jinkai Zheng, Liangqiong Qu
Title: Exploring Federated Pruning for Large Language Models
Abstract:
LLM pruning has emerged as a promising technology for compressing LLMs, enabling their deployment on resource-limited devices. However, current methodologies typically require access to public calibration samples, which can be challenging to obtain in privacy-sensitive domains. To address this issue, we introduce FedPrLLM, a comprehensive federated pruning framework designed for the privacy-preserving compression of LLMs. In FedPrLLM, each client only needs to calculate a pruning mask matrix based on its local calibration data and share it with the server to prune the global model. This approach allows for collaborative pruning of the global model with the knowledge of each client while maintaining local data privacy. Additionally, we conduct extensive experiments to explore various possibilities within the FedPrLLM framework, including different comparison groups, pruning strategies, and the decision to scale weights. Our extensive evaluation reveals that one-shot pruning with layer comparison and no weight scaling is the optimal choice within the FedPrLLM framework. We hope our work will help guide future efforts in pruning LLMs in privacy-sensitive fields. Our code is available at https://github.com/Pengxin-Guo/FedPrLLM.
Chinese: FedPrLLM提出了一种联邦剪枝框架,通过客户端基于本地校准数据计算剪枝掩码并与服务器共享,实现了大型语言模型的隐私保护压缩,无需公开校准样本,有效维护了数据隐私。
English: FedPrLLM introduces a federated pruning framework that enables privacy-preserving compression of large language models by allowing clients to compute pruning masks locally and share them with a server, eliminating the need for public calibration data and ensuring data confidentiality.
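
The client-to-server flow described in the abstract can be sketched as mask voting. The aggregation rule below is an assumption made for illustration, since the abstract does not specify it: each client sends a binary mask in which 1 marks a weight it would prune, the server sums the masks into vote counts, and the weights with the most votes across the whole layer are zeroed without rescaling the survivors. The function names and the tie-breaking rule are hypothetical.

```python
def aggregate_masks(client_masks):
    """Sum binary masks elementwise to get a prune-vote count per weight."""
    rows, cols = len(client_masks[0]), len(client_masks[0][0])
    return [
        [sum(mask[r][c] for mask in client_masks) for c in range(cols)]
        for r in range(rows)
    ]


def prune_layer(weights, votes, sparsity):
    """Zero the `sparsity` fraction of a layer's weights with the most votes."""
    entries = [
        (votes[r][c], r, c)
        for r in range(len(votes))
        for c in range(len(votes[0]))
    ]
    k = int(len(entries) * sparsity)
    # Most votes first; ties broken by position so the result is deterministic.
    entries.sort(key=lambda t: (-t[0], t[1], t[2]))
    pruned = [row[:] for row in weights]
    for _, r, c in entries[:k]:
        pruned[r][c] = 0.0  # surviving weights are left unscaled
    return pruned


# Hypothetical 2x2 layer and masks from three clients (1 = prune).
weights = [[1.0, 2.0], [3.0, 4.0]]
masks = [
    [[1, 0], [0, 1]],
    [[1, 0], [1, 0]],
    [[1, 1], [0, 0]],
]
votes = aggregate_masks(masks)
print(votes)
print(prune_layer(weights, votes, sparsity=0.5))
```

Only the masks leave each client in this sketch, so the server never sees local calibration data, which is the privacy property the abstract emphasizes.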

Authors:Jessica Foo, Pradyumna Shyama Prasad, Shaun Khoo
Title: Know Or Not: a library for evaluating out-of-knowledge base robustness
Abstract:
While the capabilities of large language models (LLMs) have progressed significantly, their use in high-stakes applications has been limited due to risks of hallucination. One key approach to reducing hallucination is retrieval-augmented generation (RAG), but even in such setups, LLMs may still hallucinate when presented with questions outside of the knowledge base. Such behavior is unacceptable in high-stakes applications where LLMs are expected to abstain from answering queries they do not have sufficient context on. In this work, we present a novel methodology for systematically evaluating out-of-knowledge base (OOKB) robustness of LLMs (whether LLMs know or do not know) in the RAG setting, without the need for manual annotation of gold standard answers. We implement our methodology in knowornot, an open-source library that enables users to develop their own customized evaluation data and pipelines for OOKB robustness. knowornot comprises four main features. Firstly, it provides a unified, high-level API that streamlines the process of setting up and running robustness benchmarks. Secondly, its modular architecture emphasizes extensibility and flexibility, allowing users to easily integrate their own LLM clients and RAG settings. Thirdly, its rigorous data modeling design ensures experiment reproducibility, reliability and traceability. Lastly, it implements a comprehensive suite of tools for users to customize their pipelines. We demonstrate the utility of knowornot by developing a challenging benchmark, PolicyBench, which spans four Question-Answer (QA) chatbots on government policies, and analyze its OOKB robustness. The source code of knowornot is available at https://github.com/govtech-responsibleai/KnowOrNot.
中文摘要:本文提出了一种新方法和开源工具knowornot,用于系统评估检索增强生成场景下大语言模型的超知识库鲁棒性,无需人工标注即可解决高风险应用中大模型的幻觉问题。
English Summary: This paper introduces a novel methodology and an open-source library called knowornot for systematically evaluating the out-of-knowledge base robustness of large language models in retrieval-augmented generation settings, addressing hallucination risks in high-stakes applications without requiring manual annotations.

Authors:Rodrigo Fritz, Pablo Suárez-Serrato, Victor Mijangos, Anayanzi D. Martinez-Hernandez, Eduardo Ivan Velazquez Richards
Title: EuLearn: A 3D database for learning Euler characteristics
Abstract:
We present EuLearn, the first surface datasets equitably representing a diversity of topological types. We designed our embedded surfaces of uniformly varying genera relying on random knots, thus allowing our surfaces to knot with themselves. EuLearn contributes new topological datasets of meshes, point clouds, and scalar fields in 3D. We aim to facilitate the training of machine learning systems that can discern topological features. We experimented with specific emblematic 3D neural network architectures, finding that their vanilla implementations perform poorly on genus classification. To enhance performance, we developed a novel, non-Euclidean, statistical sampling method adapted to graph and manifold data. We also introduce adjacency-informed adaptations of PointNet and Transformer architectures that rely on our non-Euclidean sampling strategy. Our results demonstrate that incorporating topological information into deep learning workflows significantly improves performance on these otherwise challenging EuLearn datasets.
中文: EuLearn提供了首个公平代表多种拓扑类型的表面数据集,通过创新的非欧几里得采样方法和改进的PointNet与Transformer架构,显著提升了深度学习模型在拓扑特征识别上的性能。
English: EuLearn introduces the first equitable surface datasets with diverse topological types, enhancing machine learning systems' ability to discern topological features through novel non-Euclidean sampling and adapted architectures like PointNet and Transformer.
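
The quantity behind genus classification can be computed directly from a triangle mesh. For a closed, connected, orientable surface, the Euler characteristic is χ = V − E + F and the genus is g = (2 − χ) / 2. The sketch below is illustrative and is not taken from the EuLearn code: the helper names and the grid torus used as a test surface are assumptions.

```python
def euler_characteristic(faces):
    """Compute V - E + F for a triangle mesh given as vertex-index triples."""
    vertices = {v for face in faces for v in face}
    edges = {
        frozenset((face[i], face[(i + 1) % 3]))
        for face in faces
        for i in range(3)
    }
    return len(vertices) - len(edges) + len(faces)


def genus(faces):
    """Genus of a closed, connected, orientable triangulated surface."""
    return (2 - euler_characteristic(faces)) // 2


def torus_faces(n, m):
    """Triangulate an n x m periodic grid (n, m >= 3), which forms a torus."""
    def idx(i, j):
        return (i % n) * m + (j % m)

    faces = []
    for i in range(n):
        for j in range(m):
            a, b = idx(i, j), idx(i + 1, j)
            c, d = idx(i + 1, j + 1), idx(i, j + 1)
            faces += [(a, b, c), (a, c, d)]  # split each quad into two triangles
    return faces


# A tetrahedron is a topological sphere; the periodic grid is a torus.
tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(euler_characteristic(tetrahedron), genus(tetrahedron))
print(euler_characteristic(torus_faces(4, 4)), genus(torus_faces(4, 4)))
```

Because χ depends only on counts of vertices, edges and faces, it is unchanged by knotting or deforming the embedded surface, which is what makes genus a purely topological label.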

Authors:Dan Ofer, Michal Linial, Dafna Shahaf
Title: InterFeat: A Pipeline for Finding Interesting Scientific Features
Abstract:
Finding interesting phenomena is the core of scientific discovery, but it is a manual, ill-defined concept. We present an integrative pipeline for automating the discovery of interesting simple hypotheses (feature-target relations with effect direction and a potential underlying mechanism) in structured biomedical data. The pipeline combines machine learning, knowledge graphs, literature search and Large Language Models. We formalize "interestingness" as a combination of novelty, utility and plausibility. On 8 major diseases from the UK Biobank, our pipeline consistently recovers risk factors years before their appearance in the literature. 40-53% of our top candidates were validated as interesting, compared to 0-7% for a SHAP-based baseline. Overall, 28% of 109 candidates were interesting to medical experts. The pipeline addresses the challenge of operationalizing "interestingness" scalably and for any target. We release data and code: https://github.com/LinialLab/InterFeat
中文: 该研究开发了一种自动化流程,融合机器学习、知识图谱和大语言模型,从生物医学数据中发掘新颖、实用且合理的新假设,能提前数年识别风险因素,并获得远高于基线方法的高专家验证率。
English: This study introduces an automated pipeline that combines machine learning, knowledge graphs, and large language models to discover novel, useful, and plausible hypotheses in biomedical data, successfully identifying risk factors years ahead of literature and achieving high expert validation rates.

Authors:Zhipeng Hou, Junyi Tang, Yipeng Wang
Title: HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems
Abstract:
Recent advancements in Multi-Agent Systems (MAS) powered by Large Language Models (LLMs) have demonstrated tremendous potential in diverse task scenarios. Nonetheless, existing agentic systems typically rely on predefined agent-role design spaces and static communication structures, limiting their adaptability as well as flexibility in complex interaction environments and leading to subpar performance on highly specialized and expert-level tasks. To address these issues, we introduce HALO, a multi-agent collaboration framework based on a hierarchical reasoning architecture. Specifically, we incorporate a high-level planning agent for task decomposition, mid-level role-design agents for subtask-specific agent instantiation, and low-level inference agents for subtask execution. Particularly, subtask execution is reformulated as a structured workflow search problem, where Monte Carlo Tree Search (MCTS) systematically explores the agentic action space to construct optimal reasoning trajectories. Additionally, as the majority of users lack expertise in prompt engineering, we leverage an Adaptive Prompt Refinement module to transform raw queries into task-specific prompts. Empirical evaluations on Code Generation (HumanEval), General Reasoning (MMLU), and Arithmetic Reasoning (MATH) benchmark datasets highlight the effectiveness of HALO, yielding a 14.4% average improvement over state-of-the-art baselines. Notably, HALO achieves up to 13.3% performance gain on the Moral Scenarios subject in the MMLU benchmark and up to 19.6% performance gain on the Algebra subarea in the MATH benchmark, indicating its advanced proficiency in tackling highly specialized and expert-level tasks. The code repository is available at https://github.com/23japhone/HALO.
中文: HALO框架采用分层多智能体架构,通过动态角色设计和蒙特卡洛树搜索优化工作流,在代码生成、常识推理和数学解题等专业任务中实现性能突破,较现有最佳基线平均提升14.4%。
English: The HALO framework introduces a hierarchical multi-agent system with adaptive role design and structured workflow search to overcome the limitations of static agent architectures, achieving significant performance improvements on specialized tasks across multiple benchmarks.

Authors:Khanh-Tung Tran, Barry O'Sullivan, Hoang D. Nguyen
Title: IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation
Abstract:
Recent advances in Large Language Models (LLMs) have demonstrated promising knowledge and reasoning abilities, yet their performance in multilingual and low-resource settings remains underexplored. Existing benchmarks often exhibit cultural bias, restrict evaluation to text-only, rely on multiple-choice formats, and, more importantly, are limited for extremely low-resource languages. To address these gaps, we introduce IRLBench, presented in parallel English and Irish, which is considered definitely endangered by UNESCO. Our benchmark consists of 12 representative subjects developed from the 2024 Irish Leaving Certificate exams, enabling fine-grained analysis of model capabilities across domains. By framing the task as long-form generation and leveraging the official marking scheme, it supports a comprehensive evaluation of not only correctness but also language fidelity. Our extensive experiments on leading closed-source and open-source LLMs reveal a persistent performance gap between English and Irish, in which models produce valid Irish responses less than 80% of the time, and answer correctly 55.8% of the time compared to 76.2% in English for the best-performing model. We release IRLBench (https://huggingface.co/datasets/ReliableAI/IRLBench) and an accompanying evaluation codebase (https://github.com/ReML-AI/IRLBench) to enable future research on robust, culturally aware multilingual AI development.
中文:IRLBench是基于爱尔兰毕业考试开发的双语评测基准,通过长文本生成任务评估大语言模型在英语和爱尔兰语中的表现,揭示了显著性能差距,旨在推动具有文化意识的多语言人工智能研究发展。
English: IRLBench is a new bilingual benchmark developed from Irish Leaving Certificate exams to evaluate LLMs' long-form generation in both English and Irish, revealing significant performance gaps and promoting culturally aware multilingual AI research.

Authors:Xingyuan Lu, Yuxi Liu, Dongyu Zhang, Zhiyao Wu, Jing Ren, Feng Xia
Title: EmoMeta: A Multimodal Dataset for Fine-grained Emotion Classification in Chinese Metaphors
Abstract:
Metaphors play a pivotal role in expressing emotions, making them crucial for emotional intelligence. The advent of multimodal data and widespread communication has led to a proliferation of multimodal metaphors, amplifying the complexity of emotion classification compared to single-mode scenarios. However, the scarcity of research on constructing multimodal metaphorical fine-grained emotion datasets hampers progress in this domain. Moreover, existing studies predominantly focus on English, overlooking potential variations in emotional nuances across languages. To address these gaps, we introduce a multimodal dataset in Chinese comprising 5,000 text-image pairs of metaphorical advertisements. Each entry is meticulously annotated for metaphor occurrence, domain relations and fine-grained emotion classification encompassing joy, love, trust, fear, sadness, disgust, anger, surprise, anticipation, and neutral. Our dataset is publicly accessible (https://github.com/DUTIR-YSQ/EmoMeta), facilitating further advancements in this burgeoning field.
中文摘要:隐喻对情感智能至关重要,但多模态数据集匮乏且研究多限于英语,为此我们构建了包含5000个标注文本-图像对的中文多模态隐喻广告数据集,以推动该领域发展。
English Summary: Metaphors are essential for emotional intelligence, but the lack of multimodal datasets and cross-linguistic studies hinders progress, so we created a publicly available Chinese dataset of 5,000 annotated text-image metaphorical advertisements to advance this field.

Authors:Avinash Patil, Siru Tao, Amardeep Gedhu
Title: Evaluating Reasoning LLMs for Suicide Screening with the Columbia-Suicide Severity Rating Scale
Abstract:
Suicide prevention remains a critical public health challenge. While online platforms such as Reddit's r/SuicideWatch have historically provided spaces for individuals to express suicidal thoughts and seek community support, the advent of large language models (LLMs) introduces a new paradigm in which individuals may begin disclosing ideation to AI systems instead of humans. This study evaluates the capability of LLMs to perform automated suicide risk assessment using the Columbia-Suicide Severity Rating Scale (C-SSRS). We assess the zero-shot performance of six models (including Claude, GPT, Mistral, and LLaMA) in classifying posts across a 7-point severity scale (Levels 0-6). Results indicate that Claude and GPT closely align with human annotations, while Mistral achieves the lowest ordinal prediction error. Most models exhibit ordinal sensitivity, with misclassifications typically occurring between adjacent severity levels. We further analyze confusion patterns, misclassification sources, and ethical considerations, underscoring the importance of human oversight, transparency, and cautious deployment. Full code and supplementary materials are available at https://github.com/av9ash/llm_cssrs_code.
中文摘要:本研究评估了六种大型语言模型使用C-SSRS量表进行自杀风险评估的能力,发现Claude和GPT与人工标注最为接近,同时强调了部署过程中的伦理考量。
English Summary: This study evaluates six large language models for automated suicide risk assessment using the C-SSRS scale, finding that Claude and GPT align best with human ratings while highlighting ethical considerations for deployment.

Authors:Xiaoyuan Liu, Tian Liang, Zhiwei He, Jiahao Xu, Wenxuan Wang, Pinjia He, Zhaopeng Tu, Haitao Mi, Dong Yu
Title: Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards
Abstract:
Large Language Models (LLMs) show great promise in complex reasoning, with Reinforcement Learning with Verifiable Rewards (RLVR) being a key enhancement strategy. However, a prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this. RISE explicitly and simultaneously trains an LLM to improve both its problem-solving and self-verification abilities within a single, integrated RL process. The core mechanism involves leveraging verifiable rewards from an outcome verifier to provide on-the-fly feedback for both solution generation and self-verification tasks. In each iteration, the model generates solutions, then critiques its own on-policy generated solutions, with both trajectories contributing to the policy update. Extensive experiments on diverse mathematical reasoning benchmarks show that RISE consistently improves the model's problem-solving accuracy while concurrently fostering strong self-verification skills. Our analyses highlight the advantages of online verification and the benefits of increased verification compute. Additionally, RISE models exhibit more frequent and accurate self-verification behaviors during reasoning. These advantages reinforce RISE as a flexible and effective path towards developing more robust and self-aware reasoners.
中文: RISE作为一种在线强化学习框架,通过结果验证的整合反馈机制,同步提升大语言模型的解题能力与自我核查技能。
English: RISE is an online reinforcement learning framework that trains large language models to simultaneously enhance problem-solving accuracy and self-verification skills through integrated feedback from outcome verification.

Authors:Ruoyu Wang, Yi Ma, Shenghua Gao
Title: Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos
Abstract:
Currently almost all state-of-the-art novel view synthesis and reconstruction models rely on calibrated cameras or additional geometric priors for training. These prerequisites significantly limit their applicability to massive uncalibrated data. To alleviate this requirement and unlock the potential for self-supervised training on large-scale uncalibrated videos, we propose a novel two-stage strategy to train a view synthesis model from only raw video frames or multi-view images, without providing camera parameters or other priors. In the first stage, we learn to reconstruct the scene implicitly in a latent space without relying on any explicit 3D representation. Specifically, we predict per-frame latent camera and scene context features, and employ a view synthesis model as a proxy for explicit rendering. This pretraining stage substantially reduces the optimization complexity and encourages the network to learn the underlying 3D consistency in a self-supervised manner. The learned latent camera and implicit scene representation have a large gap compared with the real 3D world. To reduce this gap, we introduce the second stage training by explicitly predicting 3D Gaussian primitives. We additionally apply explicit Gaussian Splatting rendering loss and depth projection loss to align the learned latent representations with physically grounded 3D geometry. In this way, Stage 1 provides a strong initialization and Stage 2 enforces 3D consistency - the two stages are complementary and mutually beneficial. Extensive experiments demonstrate the effectiveness of our approach, achieving high-quality novel view synthesis and accurate camera pose estimation, compared to methods that employ supervision with calibration, pose, or depth information. The code is available at https://github.com/Dwawayu/Pensieve.
中文: 本文提出一种两阶段自监督方法,通过先学习隐式场景表示再采用显式3D高斯建模进行优化,实现了无需相机参数或几何先验的未标定图像新视角合成。
English: This paper introduces a two-stage self-supervised method that enables novel view synthesis from uncalibrated images by first learning implicit scene representations and then refining them with explicit 3D Gaussian modeling, eliminating the need for camera parameters or geometric priors.

Authors:Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, Wenqi Shao
Title: MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision
Abstract:
While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our codes and data at https://github.com/ModalMinds/MM-PRM.
Chinese: MM-PRM是一种通过自动化框架开发的流程奖励模型,通过提供步骤级监督来增强多模态推理能力,显著提升了在领域内和领域外基准测试中的表现。
English: MM-PRM is a process reward model developed through an automated framework to enhance multimodal reasoning by providing step-level supervision, which significantly improves performance on both in-domain and out-of-domain benchmarks.
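
The Best-of-N setup mentioned in the abstract can be sketched as follows: a step-level scorer rates every step of each candidate reasoning path, the step scores are aggregated into one path score, and the highest-scoring path is returned. This is a minimal sketch under stated assumptions: the `min` aggregation, the function names, and the toy keyword-based scorer are hypothetical stand-ins and do not reproduce MM-PRM itself.

```python
from typing import Callable, List, Sequence

def best_of_n(
    candidates: Sequence[List[str]],
    step_scorer: Callable[[List[str], int], float],
    aggregate: Callable = min,
) -> List[str]:
    """Return the candidate path with the highest aggregated step score.

    Every candidate must contain at least one step.
    """
    def path_score(steps: List[str]) -> float:
        # Score each step in the context of the full path, then aggregate.
        return aggregate(step_scorer(steps, i) for i in range(len(steps)))
    return max(candidates, key=path_score)

def toy_scorer(steps: List[str], i: int) -> float:
    # Hypothetical stand-in for a process reward model.
    return 0.1 if "wrong" in steps[i] else 0.9

paths = [
    ["read the diagram", "wrong unit conversion", "answer: 12"],
    ["read the diagram", "convert cm to m", "answer: 0.12"],
]
chosen = best_of_n(paths, toy_scorer)
print(chosen[-1])
```

Aggregating with `min` penalizes a path for its single weakest step; a mean or product over step scores would be an equally plausible choice.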

Authors:Liang Chen, Hongcheng Gao, Tianyu Liu, Zhiqi Huang, Flood Sung, Xinyu Zhou, Yuxin Wu, Baobao Chang
Title: G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
Abstract:
Vision-Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision-making within interactive, visually rich environments like games. This "knowing-doing" gap significantly limits their potential as autonomous agents, as leading VLMs often perform poorly in simple games. To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi-game parallel training. Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution, which demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models. G1 incorporates a perception-enhanced cold start prior to RL fine-tuning. Our resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models like Claude-3.7-Sonnet-Thinking. Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process. Source code, including VLM-Gym and RL training, is released at https://github.com/chenllliang/G1 to foster future research in advancing VLMs as capable interactive agents.
Chinese: 视觉语言模型在交互环境中存在“知行差距”,VLM-Gym通过强化学习驱动的训练解决了这一问题,其开发的模型通过感知与推理的相互促进,全面超越了主流专有模型。
English: Vision-Language Models face a "knowing-doing" gap in interactive environments, which VLM-Gym addresses by enabling RL-driven training of models that outperform leading proprietary ones through mutually bootstrapped perception and reasoning.

Authors:Zhuozhao Hu, Kaishen Yuan, Xin Liu, Zitong Yu, Yuan Zong, Jingang Shi, Huanjing Yue, Jingyu Yang
Title: FEALLM: Advancing Facial Emotion Analysis in Multimodal Large Language Models with Emotional Synergy and Reasoning
Abstract:
Facial Emotion Analysis (FEA) plays a crucial role in visual affective computing, aiming to infer a person's emotional state based on facial data. Scientifically, facial expressions (FEs) result from the coordinated movement of facial muscles, which can be decomposed into specific action units (AUs) that provide detailed emotional insights. However, traditional methods often struggle with limited interpretability, constrained generalization and reasoning abilities. Recently, Multimodal Large Language Models (MLLMs) have shown exceptional performance in various visual tasks, while they still face significant challenges in FEA due to the lack of specialized datasets and their inability to capture the intricate relationships between FEs and AUs. To address these issues, we introduce a novel FEA Instruction Dataset that provides accurate and aligned FE and AU descriptions and establishes causal reasoning relationships between them, followed by constructing a new benchmark, FEABench. Moreover, we propose FEALLM, a novel MLLM architecture designed to capture more detailed facial information, enhancing its capability in FEA tasks. Our model demonstrates strong performance on FEABench and impressive generalization capability through zero-shot evaluation on various datasets, including RAF-DB, AffectNet, BP4D, and DISFA, showcasing its robustness and effectiveness in FEA tasks. The dataset and code will be available at https://github.com/953206211/FEALLM.
Chinese: 面部情感分析通过引入新的指令数据集和FEALLM模型,利用详细的面部动作单元描述和因果推理关系,显著提升了任务的可解释性与泛化能力,在多项基准测试中展现出卓越性能。
English: Facial Emotion Analysis is advanced by a new instruction dataset and FEALLM model, which improve interpretability and generalization through detailed facial action unit descriptions and causal reasoning, achieving robust performance across multiple benchmarks.

Authors:Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, Juanzi Li
Title: AdaptThink: Reasoning Models Can Learn When to Think
Abstract:
Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at https://github.com/THU-KEG/AdaptThink.
中文:AdaptThink是一种新颖的强化学习算法,能根据任务难度自适应选择思考与无思考模式,在多种推理任务中显著降低推理成本的同时提升性能表现。
English: AdaptThink is a novel reinforcement learning algorithm that adaptively selects between thinking and no-thinking modes based on task difficulty, significantly reducing inference costs while improving performance across various reasoning tasks.

Authors:Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang
Title: VSA: Faster Video Diffusion with Trainable Sparse Attention
Abstract:
Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models. Code will be available at https://github.com/hao-ai-lab/FastVideo.
中文:VSA提出了一种可训练的稀疏注意力机制,通过聚焦关键令牌来降低计算需求,在不牺牲性能的前提下实现了视频扩散模型的高效扩展。
English: VSA introduces a trainable sparse attention mechanism that reduces computational demands by focusing on critical tokens, enabling efficient scaling of video diffusion models without sacrificing performance.
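
The coarse-to-fine idea in the abstract can be illustrated with a small NumPy sketch: tokens are mean-pooled into tiles, each query tile keeps only its top-scoring key tiles, and token-level softmax attention is computed inside the kept tiles. This is a simplified illustration under stated assumptions (1D token order, mean pooling, a Python loop over tiles); it is not the VSA kernel, and the function name and parameters are hypothetical.

```python
import numpy as np

def tile_sparse_attention(q, k, v, tile=4, top_k=2):
    """Coarse stage picks key tiles per query tile; fine stage attends inside them."""
    n, d = q.shape
    assert n % tile == 0, "sequence length must be a multiple of the tile size"
    t = n // tile
    # Coarse stage: mean-pool tokens into tiles and score tile pairs.
    q_tiles = q.reshape(t, tile, d).mean(axis=1)
    k_tiles = k.reshape(t, tile, d).mean(axis=1)
    coarse = q_tiles @ k_tiles.T
    keep = np.argsort(-coarse, axis=1)[:, :top_k]    # top-k key tiles per query tile
    out = np.zeros((n, v.shape[1]))
    for i in range(t):
        rows = slice(i * tile, (i + 1) * tile)
        cols = np.concatenate([np.arange(j * tile, (j + 1) * tile) for j in keep[i]])
        # Fine stage: ordinary softmax attention restricted to the kept tiles.
        scores = q[rows] @ k[cols].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[rows] = weights @ v[cols]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
sparse_out = tile_sparse_attention(q, k, v, tile=4, top_k=2)
print(sparse_out.shape)
```

With `top_k` equal to the number of tiles, the sketch reduces to dense attention, which gives a simple sanity check on the fine stage.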

Authors:David Anugraha, Zilu Tang, Lester James V. Miranda, Hanyang Zhao, Mohammad Rifqi Farhansyah, Garry Kuwanto, Derry Wijaya, Genta Indra Winata
Title: R3: Robust Rubric-Agnostic Reward Models
Abstract:
Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at https://github.com/rubricreward/r3.
中文: 提出的R3框架采用了一种与评分标准无关、可推广的奖励建模方法,通过提供可解释的评分分配来增强语言模型与多样化人类偏好对齐的透明度和灵活性。
English: The proposed R3 framework introduces a rubric-agnostic, generalizable reward modeling approach that provides interpretable score assignments to enhance transparency and flexibility in aligning language models with diverse human preferences.

Authors:Nam V. Nguyen, Huy Nguyen, Quang Pham, Van Nguyen, Savitha Ramasamy, Nhat Ho
Title: CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition
Abstract:
Sparse mixture of experts (SMoE) offers an appealing solution to scale up the model complexity beyond the means of increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of the suboptimal routing process where experts that perform computation do not directly contribute to the routing process. In this work, we propose competition, a novel mechanism to route tokens to experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys a better sample efficiency than the traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router to learn the competition policy, thus enjoying strong performance at a low training overhead. Our extensive empirical evaluations on both the visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We have made the implementation available at: https://github.com/Fsoft-AIC/CompeteSMoE. This work is an improved version of the previous study at arXiv:2402.02526.
Chinese: 提出的CompeteSMoE算法引入了一种竞争机制,将令牌路由至具有最高神经响应的专家,相比现有SMoE策略,在训练大型语言模型时展现出更优的样本效率和性能,同时保持较低训练开销。
English: The proposed CompeteSMoE algorithm introduces a competition mechanism that routes tokens to experts with the highest neural response, demonstrating superior sample efficiency and performance in training large language models with low overhead compared to existing SMoE strategies.
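
The competition mechanism described above, routing each token to the experts with the highest neural response, can be sketched in NumPy. This is a minimal sketch under stated assumptions: the response is taken to be the L2 norm of each expert's output, experts are plain linear maps, and the gating weights are the normalized responses. None of these choices is confirmed by the abstract, and the sketch omits the learned router that the abstract says is used to keep training overhead low.

```python
import numpy as np

def competition_route(x, expert_weights, k=1):
    """Let every expert process the token, then keep the k strongest responders.

    x: (d,) token vector; expert_weights: (n_experts, d_out, d) linear experts.
    """
    outputs = np.einsum("eod,d->eo", expert_weights, x)   # all experts compute
    responses = np.linalg.norm(outputs, axis=1)           # assumed response measure
    winners = np.argsort(-responses)[:k]                  # highest responses win
    gates = responses[winners] / responses[winners].sum()
    combined = (gates[:, None] * outputs[winners]).sum(axis=0)
    return winners, combined

# Three hypothetical experts that scale the input by 1, 3, and 2.
experts = np.stack([s * np.eye(2) for s in (1.0, 3.0, 2.0)])
token = np.array([1.0, 0.0])
winners, combined = competition_route(token, experts, k=2)
print(winners, combined)
```

Because every expert runs on every token here, the sketch costs as much as a dense model, which is presumably why a cheaper router that imitates the competition outcome is attractive.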

Authors:Gongfan Fang, Xinyin Ma, Xinchao Wang
Title: Thinkless: LLM Learns When to Think
Abstract:
Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, one for concise responses and one for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50%-90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless
中文: Thinkless框架通过强化学习让语言模型自适应选择简短或复杂推理链,在多个基准测试中将冗余的长链推理减少50%-90%,显著提升了推理效率。
English: The Thinkless framework enables LLMs to adaptively choose between short and long reasoning chains using reinforcement learning, significantly reducing unnecessary complex reasoning by 50%-90% while maintaining performance across multiple benchmarks.

Authors:Nimrod Berman, Ilan Naiman, Moshe Eliasof, Hedi Zisling, Omri Azencot
Title: One-Step Offline Distillation of Diffusion-based Models via Koopman Modeling
Abstract:
Diffusion-based generative models have demonstrated exceptional performance, yet their iterative sampling procedures remain computationally expensive. A prominent strategy to mitigate this cost is distillation, with offline distillation offering particular advantages in terms of efficiency, modularity, and flexibility. In this work, we identify two key observations that motivate a principled distillation framework: (1) while diffusion models have been viewed through the lens of dynamical systems theory, powerful and underexplored tools can be further leveraged; and (2) diffusion models inherently impose structured, semantically coherent trajectories in latent space. Building on these observations, we introduce the Koopman Distillation Model (KDM), a novel offline distillation approach grounded in Koopman theory, a classical framework for representing nonlinear dynamics linearly in a transformed space. KDM encodes noisy inputs into an embedded space where a learned linear operator propagates them forward, followed by a decoder that reconstructs clean samples. This enables single-step generation while preserving semantic fidelity. We provide theoretical justification for our approach: (1) under mild assumptions, the learned diffusion dynamics admit a finite-dimensional Koopman representation; and (2) proximity in the Koopman latent space correlates with semantic similarity in the generated outputs, allowing for effective trajectory alignment. Empirically, KDM achieves state-of-the-art performance across standard offline distillation benchmarks, improving FID scores by up to 40% in a single generation step. All implementation details and code for the experimental setups are provided on our GitHub, https://github.com/azencot-group/KDM, or on our project page, https://sites.google.com/view/koopman-distillation-model.
中文总结:Koopman蒸馏模型(KDM)基于Koopman理论提出离线蒸馏框架,通过线性算子实现单步图像生成并保持语义保真度,在基准测试中实现FID指标最高40%的性能提升。
English Summary: The Koopman Distillation Model (KDM) introduces an offline distillation framework using Koopman theory to enable single-step image generation while maintaining semantic fidelity, achieving state-of-the-art performance with up to 40% FID improvement.
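
The phrase "representing nonlinear dynamics linearly in a transformed space" can be made concrete with a standard textbook toy system that has an exact finite-dimensional Koopman representation. The sketch below is not KDM: the encoder and decoder are hand-written rather than learned, the coefficients are arbitrary, and the system is unrelated to diffusion. It only shows why an encode, linear-propagate, decode pipeline can replace many nonlinear iterations with a single matrix application.

```python
import numpy as np

a, b, c = 0.9, 0.5, 0.3   # arbitrary coefficients for the toy system

def step(x):
    """One step of the nonlinear map x1 <- a*x1, x2 <- b*x2 + c*x1**2."""
    x1, x2 = x
    return np.array([a * x1, b * x2 + c * x1**2])

def encode(x):
    """Lift the state to the observables (x1, x2, x1**2)."""
    return np.array([x[0], x[1], x[0] ** 2])

def decode(z):
    """Read the original state back from the lifted coordinates."""
    return z[:2]

# In the lifted space the dynamics are exactly linear.
K = np.array([
    [a,   0.0, 0.0],
    [0.0, b,   c],
    [0.0, 0.0, a**2],
])

x0 = np.array([1.0, -2.0])
n = 10
x_iter = x0.copy()
for _ in range(n):          # n nonlinear steps, one after another
    x_iter = step(x_iter)

# One linear application of the precomputed operator K^n.
x_koopman = decode(np.linalg.matrix_power(K, n) @ encode(x0))
print(x_iter, x_koopman)
```

The two results agree up to floating-point error, because the third observable closes the system: `x1**2` evolves as `a**2 * x1**2`, so no further lifted coordinates are needed.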

Authors:Shuqing Luo, Pingzhi Li, Jie Peng, Hanrui Wang, Yang Zhao, Yu Cao, Yu Cheng, Tianlong Chen
Title: Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference
Abstract:
Mixture-of-experts (MoE) architectures could achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, such communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over 40% of the runtime in large-scale training). In this paper, we first define collaborative communication to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, a pair of experts co-activated by one token is called "collaborated", which comprises two cases, intra- and inter-collaboration, depending on whether the experts are kept on the same device. Our pilot investigations reveal that augmenting the proportion of intra-collaboration can accelerate expert parallelism at scale. It motivates us to strategically optimize collaborative communication for accelerated MoE training and inference, dubbed Occult. Our designs are capable of either delivering exact results with reduced communication cost or controllably minimizing the cost with collaboration pruning, materialized by modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that Occult can be faster than popular state-of-the-art inference or training frameworks (more than 1.5× speedup across multiple tasks and models) with comparable or superior quality compared to the standard fine-tuning. Code is available at https://github.com/UNITES-Lab/Occult.
中文: Occult系统通过优化专家协作通信策略,减少混合专家模型中的通信开销,在保持模型质量的同时实现超过1.5倍的加速效果。
English: The proposed Occult system reduces communication overhead in mixture-of-experts models by optimizing collaborative communication through strategic expert placement and collaboration pruning, achieving over 1.5× speedup while maintaining model quality.
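
The intra- versus inter-collaboration distinction defined in the abstract can be counted directly from a routing table and an expert-to-device placement. The following is a minimal sketch with hypothetical names and toy data; it only measures the proportion of intra-collaboration and does not implement Occult's placement or pruning methods.

```python
from itertools import combinations

def collaboration_stats(token_experts, expert_device):
    """Count co-activated expert pairs on the same device (intra) or not (inter)."""
    intra = inter = 0
    for experts in token_experts:
        # Every pair of experts activated by the same token is "collaborated".
        for e1, e2 in combinations(sorted(set(experts)), 2):
            if expert_device[e1] == expert_device[e2]:
                intra += 1
            else:
                inter += 1
    total = intra + inter
    return intra, inter, (intra / total if total else 0.0)

# Hypothetical top-2 routing for four tokens over four experts.
routing = [(0, 1), (0, 1), (2, 3), (0, 2)]
placement_a = {0: 0, 1: 0, 2: 1, 3: 1}   # frequent partners share a device
placement_b = {0: 0, 2: 0, 1: 1, 3: 1}   # frequent partners are split
stats_a = collaboration_stats(routing, placement_a)
stats_b = collaboration_stats(routing, placement_b)
print(stats_a, stats_b)
```

The same routing yields an intra-collaboration share of 0.75 under the first placement and 0.25 under the second, which illustrates why expert placement affects cross-device traffic.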

Authors:Paula Feldman, Martin Sinnona, Claudio Delrieux, Viviana Siless, Emmanuel Iarussi
Title: VesselGPT: Autoregressive Modeling of Vascular Geometry
Abstract:
Anatomical trees are critical for clinical diagnosis and treatment planning, yet their complex and diverse geometry makes accurate representation a significant challenge. Motivated by the latest advances in large language models, we introduce an autoregressive method for synthesizing anatomical trees. Our approach first embeds vessel structures into a learned discrete vocabulary using a VQ-VAE architecture, then models their generation autoregressively with a GPT-2 model. This method effectively captures intricate geometries and branching patterns, enabling realistic vascular tree synthesis. Comprehensive qualitative and quantitative evaluations reveal that our technique achieves high-fidelity tree reconstruction with compact discrete representations. Moreover, our B-spline representation of vessel cross-sections preserves critical morphological details that are often overlooked in previous methods' parameterizations. To the best of our knowledge, this work is the first to generate blood vessels in an autoregressive manner. Code is available at https://github.com/LIA-DiTella/VesselGPT-MICCAI.
中文摘要:本文提出一种结合VQ-VAE与GPT-2的自回归方法,通过离散化表征和B样条参数化实现高保真血管树合成,在保留形态细节的同时首次实现了血管的自回归生成。
English Summary: This paper introduces an autoregressive method using VQ-VAE and GPT-2 to synthesize anatomical trees, achieving high-fidelity reconstruction with compact representations while preserving morphological details through B-spline parameterization.

Authors:Gabriele Spadaro, Alberto Presta, Jhony H. Giraldo, Marco Grangetto, Wei Hu, Giuseppe Valenzise, Attilio Fiandrotti, Enzo Tartaglione
Title: Denoising Diffusion Probabilistic Model for Point Cloud Compression at Low Bit-Rates
Abstract:
Efficient compression of low-bit-rate point clouds is critical for bandwidth-constrained applications. However, existing techniques mainly focus on high-fidelity reconstruction, requiring many bits for compression. This paper proposes a "Denoising Diffusion Probabilistic Model" (DDPM) architecture for point cloud compression (DDPM-PCC) at low bit-rates. A PointNet encoder produces the condition vector for the generation, which is then quantized via a learnable vector quantizer. This configuration achieves low bit-rates while preserving quality. Experiments on ShapeNet and ModelNet40 show improved rate-distortion at low rates compared to standardized and state-of-the-art approaches. We publicly released the code at https://github.com/EIDOSLAB/DDPM-PCC.
中文: 本文提出一种基于DDPM的点云压缩方法,通过PointNet编码器和可学习向量量化器在低比特率下实现高效压缩,在基准数据集上展现出优于现有技术的率失真性能。
English: This paper introduces a DDPM-based point cloud compression method that achieves efficient low-bit-rate performance by using a PointNet encoder and learnable vector quantizer, demonstrating superior rate-distortion results on benchmark datasets.
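
The quantization step in the abstract, where a condition vector is mapped to an entry of a codebook, can be sketched as a nearest-neighbor lookup. This is a minimal sketch assuming a single fixed codebook with hypothetical values; in the paper the quantizer is learnable, and its actual codebook size and structure are not stated in the abstract.

```python
import numpy as np

def quantize(cond, codebook):
    """Return the index and value of the codebook entry nearest to cond."""
    dists = np.sum((codebook - cond) ** 2, axis=1)   # squared Euclidean distances
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# Hypothetical 4-entry codebook of 2-D vectors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]])
cond = np.array([0.9, 1.2])
idx, quantized = quantize(cond, codebook)

# Only the index needs to be transmitted: ceil(log2(codebook size)) bits.
bits = int(np.ceil(np.log2(len(codebook))))
print(idx, quantized, bits)
```

The bit cost depends only on the codebook size, not on the dimension of the condition vector, which is what makes this kind of quantization attractive at low bit-rates.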

Authors:Qiguang Chen, Libo Qin, Jinhao Liu, Yue Liao, Jiaqi Wang, Jingxuan Zhou, Wanxiang Che
Title: RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning
Abstract:
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models (LLMs) on complex tasks, spurring research into its underlying mechanisms. However, two primary challenges remain for real-world applications: (1) the lack of quantitative metrics and actionable guidelines for evaluating and optimizing measurable boundaries of CoT capability, and (2) the absence of methods to assess boundaries of unmeasurable CoT capability, such as multimodal perception. To address these gaps, we introduce the Reasoning Boundary Framework++ (RBF++). To tackle the first challenge, we define the reasoning boundary (RB) as the maximum limit of CoT performance. We also propose a combination law for RBs, enabling quantitative analysis and offering actionable guidance across various CoT tasks. For the second challenge, particularly in multimodal scenarios, we introduce a constant assumption, which replaces unmeasurable RBs with scenario-specific constants. Additionally, we propose the reasoning boundary division mechanism, which divides unmeasurable RBs into two sub-boundaries, facilitating the quantification and optimization of both unmeasurable domain knowledge and multimodal perception capabilities. Extensive experiments involving 38 models across 13 tasks validate the feasibility of our framework in cross-modal settings. Additionally, we evaluate 10 CoT strategies, offer insights into optimization and decay from two complementary perspectives, and expand evaluation benchmarks for measuring RBs in LLM reasoning. We hope this work advances the understanding of RBs and optimization strategies in LLMs. Code and data are available at https://github.com/LightChen233/reasoning-boundary.
Chinese: 为解决思维链推理在评估和优化中的挑战,提出了推理边界框架++(RBF++),通过定义可测量的性能极限和引入处理不可测量能力(如多模态感知)的创新机制,在多种任务和模型中得到验证。
English: The Reasoning Boundary Framework++ (RBF++) is introduced to address challenges in evaluating and optimizing Chain-of-Thought (CoT) reasoning by defining measurable performance limits and handling unmeasurable capabilities like multimodal perception through innovative mechanisms, validated across diverse tasks and models.

Authors:Alice Plebe, Timothy Douglas, Diana Riazi, R. Maria del Rio-Chanona
Title: I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models
Abstract:
Large language models are increasingly integrated into news recommendation systems, raising concerns about their role in spreading misinformation. In humans, visual content is known to boost credibility and shareability of information, yet its effect on vision-language models (VLMs) remains unclear. We present the first study examining how images influence VLMs' propensity to reshare news content, whether this effect varies across model families, and how persona conditioning and content attributes modulate this behavior. To support this analysis, we introduce two methodological contributions: a jailbreaking-inspired prompting strategy that elicits resharing decisions from VLMs while simulating users with antisocial traits and political alignments; and a multimodal dataset of fact-checked political news from PolitiFact, paired with corresponding images and ground-truth veracity labels. Experiments across model families reveal that image presence increases resharing rates by 4.8% for true news and 15.0% for false news. Persona conditioning further modulates this effect: Dark Triad traits amplify resharing of false news, whereas Republican-aligned profiles exhibit reduced veracity sensitivity. Of all the tested models, only Claude-3-Haiku demonstrates robustness to visual misinformation. These findings highlight emerging risks in multimodal model behavior and motivate the development of tailored evaluation frameworks and mitigation strategies for personalized AI systems. Code and dataset are available at: https://github.com/3lis/misinfo_vlm
中文: 研究表明,图像会显著增加视觉语言模型对虚假新闻的转发,其增幅远超真实新闻,且人格特质与政治倾向会进一步影响该行为,凸显了多模态AI系统的风险并亟需针对性防护措施。
English: This study reveals that images significantly increase vision-language models' resharing of false news more than true news, with persona traits and political alignments further influencing this behavior, highlighting risks in multimodal AI systems and underscoring the need for targeted safeguards.

Authors:Gabriel de Albuquerque Gleizer
Title: Output behavior equivalence and simultaneous subspace identification of systems and faults
Abstract:
We address the problem of identifying a system subject to additive faults, while simultaneously reconstructing the fault signal via subspace methods. We do not require nominal data for the identification, nor do we impose any assumption on the class of faults, e.g., sensor or actuator faults. We show that, under mild assumptions on the fault signal, standard PI-MOESP can recover the system matrices associated with the input-output subsystem. Then we introduce the concept of output behavior equivalence, which characterizes systems with the same output behavior set, and present a method to establish this equivalence from system matrices. Finally, we show how to estimate from data the complete set of fault matrices for which there exists a fault signal with minimal dimension that explains the data.
中文: 本研究提出了一种基于子空间的方法,无需标称数据或故障类型假设即可实现系统辨识和故障信号重构,通过PI-MOESP恢复系统矩阵,并引入输出行为等价概念从数据中估计故障矩阵集。
English: This study presents a subspace-based method for system identification and fault signal reconstruction without requiring nominal data or assumptions on fault types, demonstrating the use of PI-MOESP to recover system matrices and introducing output behavior equivalence to estimate fault matrices from data.

Authors:Yifu Cai, Xinyu Li, Mononito Goswami, Michał Wiliński, Gus Welter, Artur Dubrawski
Title: TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents
Abstract:
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents on time series machine learning engineering challenges. Existing benchmarks lack scalability, focus narrowly on model building in well-defined settings, and evaluate only a limited set of research artifacts (e.g., CSV submission files). To make AI agent benchmarking more relevant to the practice of machine learning engineering, our framework scales along two critical dimensions. First, recognizing that effective ML engineering requires a range of diverse skills, TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We design challenges to evaluate both isolated capabilities (including data handling, understanding research repositories, and code translation) and their combinations, and rather than addressing each challenge independently, we develop tools that support designing multiple challenges at scale. Second, we implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models, using both precise numeric measures and more flexible LLM-based evaluation approaches. This dual strategy balances objective assessment with contextual judgment. Although our initial focus is on time series applications, our framework can be readily extended to other data modalities, broadly enhancing the comprehensiveness and practical utility of agentic AI evaluation. We open-source our benchmarking framework to facilitate future research on the ML engineering capabilities of AI agents.
中文: TimeSeriesGym是一个可扩展的基准测试框架,通过整合多领域任务和灵活评估机制,全面评估AI代理在时序机器学习工程挑战中的综合能力,显著提升了评估的实用性和扩展性。
English: TimeSeriesGym is a scalable benchmarking framework designed to comprehensively evaluate AI agents on diverse time series machine learning engineering challenges, incorporating multi-domain tasks and flexible evaluation mechanisms for enhanced practical relevance.

Authors:Anthony Zhou, Amir Barati Farimani
Title: Neural Functional: Learning Function to Scalar Maps for Neural PDE Surrogates
Abstract:
Many architectures for neural PDE surrogates have been proposed in recent years, largely based on neural networks or operator learning. In this work, we derive and propose a new architecture, the Neural Functional, which learns function to scalar mappings. Its implementation leverages insights from operator learning and neural fields, and we show the ability of neural functionals to implicitly learn functional derivatives. For the first time, this allows for an extension of Hamiltonian mechanics to neural PDE surrogates by learning the Hamiltonian functional and optimizing its functional derivatives. We demonstrate that the Hamiltonian Neural Functional can be an effective surrogate model through improved stability and conserving energy-like quantities on 1D and 2D PDEs. Beyond PDEs, functionals are prevalent in physics; functional approximation and learning with its gradients may find other uses, such as in molecular dynamics or design optimization.
Chinese: 神经泛函(Neural Functional)是一种学习函数到标量映射的新架构,能够隐式学习泛函导数;通过学习哈密顿泛函并优化其泛函导数,将哈密顿力学扩展到神经偏微分方程代理模型中,在1D和2D偏微分方程上提升了稳定性并保持了能量类守恒量。
English: The Neural Functional is an architecture that learns function-to-scalar mappings and implicitly learns functional derivatives; learning the Hamiltonian functional and optimizing its functional derivatives extends Hamiltonian mechanics to neural PDE surrogates, improving stability and conserving energy-like quantities on 1D and 2D PDEs.

Authors:Anthony Zhou, Amir Barati Farimani
Title: Hamiltonian Neural PDE Solvers through Functional Approximation
Abstract:
Designing neural networks within a Hamiltonian framework offers a principled way to ensure that conservation laws are respected in physical systems. While promising, these capabilities have been largely limited to discrete, analytically solvable systems. In contrast, many physical phenomena are governed by PDEs, which govern infinite-dimensional fields through Hamiltonian functionals and their functional derivatives. Building on prior work, we represent the Hamiltonian functional as a kernel integral parameterized by a neural field, enabling learnable function-to-scalar mappings and the use of automatic differentiation to calculate functional derivatives. This allows for an extension of Hamiltonian mechanics to neural PDE solvers by predicting a functional and learning in the gradient domain. We show that the resulting Hamiltonian Neural Solver (HNS) can be an effective surrogate model through improved stability and conserving energy-like quantities across 1D and 2D PDEs. This ability to respect conservation laws also allows HNS models to better generalize to longer time horizons or unseen initial conditions.
Chinese: 哈密顿神经求解器(HNS)通过神经场表示哈密顿泛函,将哈密顿力学扩展到神经偏微分方程求解器中,从而在1D和2D系统中实现了更高的稳定性和能量类守恒量的保持。
English: The Hamiltonian Neural Solver (HNS) extends Hamiltonian mechanics to neural PDE solvers by representing the Hamiltonian functional with a neural field, enabling improved stability and conservation of energy-like quantities across 1D and 2D systems.
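
The link between a scalar-valued functional and its functional derivative can be illustrated with a small numerical sketch. It is not the authors' model: the functional below is a fixed toy choice, H[u] = ∫ ½ u(x)² dx, discretized as a Riemann sum, and the derivative is taken by central finite differences rather than by automatic differentiation through a neural field. The names `hamiltonian` and `functional_derivative` are hypothetical. For this toy functional the exact answer is δH/δu = u, which the sketch recovers.

```python
from typing import Callable, List


def hamiltonian(u: List[float], dx: float) -> float:
    """Toy functional H[u] = integral of 0.5 * u(x)^2 dx, as a Riemann sum."""
    return sum(0.5 * ui * ui for ui in u) * dx


def functional_derivative(
    H: Callable[[List[float], float], float],
    u: List[float],
    dx: float,
    eps: float = 1e-6,
) -> List[float]:
    """Approximate dH/du(x_i) on the grid.

    On a uniform grid the functional derivative is the ordinary partial
    derivative of the discretized H with respect to u_i, divided by dx.
    Central finite differences stand in for automatic differentiation here.
    """
    grads = []
    for i in range(len(u)):
        up = list(u)
        dn = list(u)
        up[i] += eps
        dn[i] -= eps
        grads.append((H(up, dx) - H(dn, dx)) / (2.0 * eps) / dx)
    return grads


if __name__ == "__main__":
    field = [0.0, 0.5, -1.0, 2.0]
    print(functional_derivative(hamiltonian, field, dx=0.1))
```

In a learned setting the fixed `hamiltonian` would be replaced by a trainable function-to-scalar model, and training would compare its derivative with the target dynamics.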

Authors:Lei Sheng, Shuai-Shuai Xu
Title: CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning
Abstract:
Large language models (LLMs) have demonstrated strong capabilities in translating natural language questions about relational databases into SQL queries. In particular, test-time scaling techniques such as Self-Consistency and Self-Correction can enhance SQL generation accuracy by increasing computational effort during inference. However, these methods have notable limitations: Self-Consistency may select suboptimal outputs despite majority votes, while Self-Correction typically addresses only syntactic errors. To leverage the strengths of both approaches, we propose CSC-SQL, a novel method that integrates Self-Consistency and Self-Correction. CSC-SQL selects the two most frequently occurring outputs from parallel sampling and feeds them into a merge revision model for correction. Additionally, we employ the Group Relative Policy Optimization (GRPO) algorithm to fine-tune both the SQL generation and revision models via reinforcement learning, significantly enhancing output quality. Experimental results confirm the effectiveness and generalizability of CSC-SQL. On the BIRD private test set, our 7B model achieves 71.72% execution accuracy, while the 32B model achieves 73.67%. The code has been open sourced at https://github.com/CycloneBoy/csc_sql.
中文: CSC-SQL方法融合自我一致性和自我修正技术,通过选择高频输出进行修订并采用强化学习微调模型,在BIRD基准测试中实现了超过71%的执行准确率。
English: The CSC-SQL method combines Self-Consistency and Self-Correction to improve SQL generation accuracy by passing the two most frequent sampled outputs to a merge revision model and fine-tuning both models with reinforcement learning (GRPO), reaching 71.72% (7B) and 73.67% (32B) execution accuracy on the BIRD private test set.
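
The selection step described in the abstract (take the two most frequent outputs, then hand them to a revision model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: candidates are assumed to be grouped by a hashable execution result, the first query seen in each group is used as its representative, and `revise` is a hypothetical callback standing in for the merge revision model.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Sequence, Tuple

# (SQL text, hashable execution result); grouping by result is an assumption.
Candidate = Tuple[str, Hashable]


def corrective_self_consistency(
    candidates: Sequence[Candidate],
    revise: Callable[[str, str], str],
) -> str:
    """Vote over sampled outputs, then revise if the top two groups differ."""
    if not candidates:
        raise ValueError("need at least one sampled candidate")

    # Count how many samples produced each execution result.
    votes = Counter(result for _, result in candidates)
    top_results = [result for result, _ in votes.most_common(2)]

    # Representative query per group: the first sample with that result.
    representative: Dict[Hashable, str] = {}
    for sql, result in candidates:
        representative.setdefault(result, sql)

    if len(top_results) == 1:
        # Every sample agrees, so there is nothing to reconcile.
        return representative[top_results[0]]
    # Hypothetical revision model merges the two leading candidates.
    return revise(representative[top_results[0]], representative[top_results[1]])


if __name__ == "__main__":
    samples = [("SELECT a", (1,)), ("SELECT b", (2,)), ("SELECT a2", (1,))]
    print(corrective_self_consistency(samples, lambda x, y: f"MERGED({x} | {y})"))
```

Plain self-consistency would simply return the majority query; the difference here is that the runner-up is kept and both are shown to the revision step.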

Authors:Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, Yangqiu Song
Title: From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery
Abstract:
Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery, evolving from task-specific automation tools into increasingly autonomous agents and fundamentally redefining research processes and human-AI collaboration. This survey systematically charts this burgeoning field, placing a central focus on the changing roles and escalating capabilities of LLMs in science. Through the lens of the scientific method, we introduce a foundational three-level taxonomy (Tool, Analyst, and Scientist) to delineate their escalating autonomy and evolving responsibilities within the research lifecycle. We further identify pivotal challenges and future research trajectories such as robotic automation, self-improvement, and ethical governance. Overall, this survey provides a conceptual architecture and strategic foresight to navigate and shape the future of AI-driven scientific discovery, fostering both rapid innovation and responsible advancement. GitHub Repository: https://github.com/HKUST-KnowComp/Awesome-LLM-Scientific-Discovery.
中文摘要:大语言模型正通过从特定任务工具演变为自主智能体,重塑科研流程与人机协作,本综述构建了分类体系并展望了未来发展路径,以推动人工智能驱动的科学发现。
English Summary: Large Language Models are transforming scientific discovery by evolving from task-specific tools into autonomous agents, redefining research processes and human-AI collaboration, with this survey providing a taxonomy and strategic foresight for future advancements.

Authors:Jieying Xue, Phuong Minh Nguyen, Minh Le Nguyen, Xin Liu
Title: JNLP at SemEval-2025 Task 11: Cross-Lingual Multi-Label Emotion Detection Using Generative Models
Abstract:
With the rapid advancement of global digitalization, users from different countries increasingly rely on social media for information exchange. In this context, multilingual multi-label emotion detection has emerged as a critical research area. This study addresses SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. Our paper focuses on two sub-tracks of this task: (1) Track A: Multi-label emotion detection, and (2) Track B: Emotion intensity. To tackle multilingual challenges, we leverage pre-trained multilingual models and focus on two architectures: (1) a fine-tuned BERT-based classification model and (2) an instruction-tuned generative LLM. Additionally, we propose two methods for handling multi-label classification: the base method, which maps an input directly to all its corresponding emotion labels, and the pairwise method, which models the relationship between the input text and each emotion category individually. Experimental results demonstrate the strong generalization ability of our approach in multilingual emotion recognition. In Track A, our method achieved Top 4 performance across 10 languages, ranking 1st in Hindi. In Track B, our approach also secured Top 5 performance in 7 languages, highlighting its simplicity and effectiveness. Our code is available at https://github.com/yingjie7/mlingual_multilabel_emo_detection.
中文摘要:本研究通过采用预训练模型和创新分类方法应对多语言多标签情绪检测挑战,在SemEval-2025任务11中于多语言环境下取得领先性能。
English Summary: This study tackles multilingual multi-label emotion detection in SemEval-2025 Task 11 with pre-trained multilingual models and two classification formulations (base and pairwise), placing in the Top 4 for 10 languages in Track A (1st in Hindi) and in the Top 5 for 7 languages in Track B.
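
The difference between the base and pairwise formulations can be made concrete with a short sketch. Everything here is illustrative: the prompt wording, the emotion list passed in by the caller, and the `answer_yes_no` callback (a stand-in for whatever model answers each query) are assumptions, not taken from the paper or its repository.

```python
from typing import Callable, List, Sequence, Tuple


def base_query(text: str, emotions: Sequence[str]) -> str:
    """Base formulation: one query that asks for all labels at once."""
    options = ", ".join(emotions)
    return f"Text: {text}\nList every emotion expressed, chosen from: {options}."


def pairwise_queries(text: str, emotions: Sequence[str]) -> List[Tuple[str, str]]:
    """Pairwise formulation: one yes/no query per (text, emotion) pair."""
    return [
        (emotion, f"Text: {text}\nDoes this text express {emotion}? Answer yes or no.")
        for emotion in emotions
    ]


def pairwise_predict(
    text: str,
    emotions: Sequence[str],
    answer_yes_no: Callable[[str], bool],
) -> List[str]:
    """Collect the emotions whose individual query is answered with yes."""
    return [
        emotion
        for emotion, prompt in pairwise_queries(text, emotions)
        if answer_yes_no(prompt)
    ]


if __name__ == "__main__":
    labels = ["anger", "fear", "joy", "sadness", "surprise"]  # illustrative set
    stub = lambda prompt: "express joy" in prompt or "express surprise" in prompt
    print(pairwise_predict("What a wonderful gift!", labels, stub))
```

The pairwise form costs one model call per emotion, in exchange for treating each label decision independently.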

Authors:Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee
Title: SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
Abstract:
Large audio-language models (LALMs) extend large language models with multimodal understanding of speech, audio, etc. While their performance on speech and audio-processing tasks is extensively studied, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.
中文摘要:该研究提出SAKURA基准测试,发现大型音频语言模型即使能正确提取语音/音频信息,仍难以进行多跳推理,揭示了多模态整合的关键缺陷。
English Summary: The study introduces SAKURA, a benchmark revealing that large audio-language models struggle with multi-hop reasoning despite correctly extracting speech/audio information, exposing a critical limitation in multimodal integration.

Authors:Lincan Cai, Jingxuan Kang, Shuang Li, Wenxuan Ma, Binhui Xie, Zhida Qin, Jian Liang
Title: From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection
Abstract:
Pretrained vision-language models (VLMs), e.g., CLIP, demonstrate impressive zero-shot capabilities on downstream tasks. Prior research highlights the crucial role of visual augmentation techniques, like random cropping, in alignment with fine-grained class descriptions generated by large language models (LLMs), significantly enhancing zero-shot performance by incorporating multi-view information. However, the inherent randomness of these augmentations can inevitably introduce background artifacts and cause models to overly focus on local details, compromising global semantic understanding. To address these issues, we propose an Attention-Based Selection (ABS) method from local details to global context, which applies attention-guided cropping in both raw images and feature space and supplements global semantic information through strategic feature selection. Additionally, we introduce a soft matching technique to effectively filter LLM descriptions for better alignment. ABS achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification tasks. Notably, ABS is training-free and even rivals few-shot and test-time adaptation methods. Our code is available at https://github.com/BIT-DA/ABS.
中文摘要:提出的ABS方法通过注意力引导裁剪和特征选择,解决了随机增强带来的问题,无需训练即可在视觉语言模型中实现零样本性能的显著提升,达到领先水平。
English Summary: The proposed ABS method enhances zero-shot performance in vision-language models by using attention-guided cropping and feature selection to address issues from random augmentations, achieving state-of-the-art results without training.

Authors:Younghyun Kim, Jongheon Jeong, Sangkyung Kwak, Kyungmin Lee, Juho Lee, Jinwoo Shin
Title: StarFT: Robust Fine-tuning of Zero-shot Models via Spuriosity Alignment
Abstract:
Learning robust representations from data often requires scale, which has led to the success of recent zero-shot models such as CLIP. However, the obtained robustness can easily deteriorate when these models are fine-tuned on other downstream tasks (e.g., at smaller scales). Previous works often interpret this phenomenon in the context of domain shift, developing fine-tuning methods that aim to preserve the original domain as much as possible. However, in a different context, fine-tuned models with limited data are also prone to learning features that are spurious to humans, such as background or texture. In this paper, we propose StarFT (Spurious Textual Alignment Regularization), a novel framework for fine-tuning zero-shot models to enhance robustness by preventing them from learning spuriosity. We introduce a regularization that aligns the output distribution for spuriosity-injected labels with the original zero-shot model, ensuring that the model is not induced to extract irrelevant features further from these descriptions. We leverage recent language models to get such spuriosity-injected labels by generating alternative textual descriptions that highlight potentially confounding features. Extensive experiments validate the robust generalization of StarFT and its emerging properties: zero-shot group robustness and improved zero-shot classification. Notably, StarFT boosts both worst-group and average accuracy by 14.30% and 3.02%, respectively, in the Waterbirds group shift scenario, where other robust fine-tuning baselines show even degraded performance.
中文: 本文提出StarFT框架,通过语言模型生成干扰性文本标签并采用正则化对齐输出分布,防止模型学习虚假特征,在多种场景下显著提升了最差组和平均分类准确率。
English: This paper introduces StarFT, a fine-tuning framework that keeps zero-shot models from learning spurious features through a regularization that aligns the fine-tuned model's output distribution on language-model-generated, spuriosity-injected labels with that of the original zero-shot model, improving worst-group and average accuracy by 14.30% and 3.02% in the Waterbirds group shift scenario.
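
One common way to align two output distributions is a KL-divergence penalty, sketched below. The abstract does not specify the divergence or its direction, so the choice of KL(zero-shot || fine-tuned) is an assumption, as are the function names and the idea that both models emit one logit per spuriosity-injected label.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def spuriosity_alignment_penalty(
    zero_shot_logits: Sequence[float],
    fine_tuned_logits: Sequence[float],
) -> float:
    """Penalty that is zero when the fine-tuned model matches the zero-shot
    model on the spuriosity-injected labels (assumed KL direction)."""
    reference = softmax(zero_shot_logits)
    current = softmax(fine_tuned_logits)
    return kl_divergence(reference, current)


if __name__ == "__main__":
    print(spuriosity_alignment_penalty([1.0, 0.5, -0.2], [2.5, 0.1, -1.0]))
```

In training, a term like this would be added to the task loss with a weighting coefficient, so that fine-tuning is discouraged from shifting probability mass toward labels that mention background or texture.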

Authors:Qingling Shu, Sibao Chen, Xiao Wang, Zhihui You, Wei Lu, Jin Tang, Bin Luo
Title: Semantic Change Detection of Roads and Bridges: A Fine-grained Dataset and Multimodal Frequency-driven Detector
Abstract:
Accurate detection of road and bridge changes is crucial for urban planning and transportation management, yet presents unique challenges for general change detection (CD). Key difficulties arise from maintaining the continuity of roads and bridges as linear structures and disambiguating visually similar land covers (e.g., road construction vs. bare land). Existing spatial-domain models struggle with these issues, further hindered by the lack of specialized, semantically rich datasets. To fill these gaps, we introduce the Road and Bridge Semantic Change Detection (RB-SCD) dataset. As the first benchmark to systematically target semantic change detection of roads and bridges, RB-SCD offers comprehensive fine-grained annotations for 11 semantic change categories. This enables a detailed analysis of traffic infrastructure evolution. Building on this, we propose a novel framework, the Multimodal Frequency-Driven Change Detector (MFDCD). MFDCD integrates multimodal features in the frequency domain through two key components: (1) the Dynamic Frequency Coupler (DFC), which leverages wavelet transform to decompose visual features, enabling it to robustly model the continuity of linear transitions; and (2) the Textual Frequency Filter (TFF), which encodes semantic priors into frequency-domain graphs and applies filter banks to align them with visual features, resolving semantic ambiguities. Experiments demonstrate the state-of-the-art performance of MFDCD on RB-SCD and three public CD datasets. The code will be available at https://github.com/DaGuangDaGuang/RB-SCD.
中文摘要:本研究提出了RB-SCD数据集和MFDCD框架,通过频域多模态特征处理线性连续性和语义模糊问题,显著提升了道路桥梁变化检测性能,取得了领先的实验效果。
English Summary: The study introduces the RB-SCD dataset and the MFDCD framework, which uses frequency-domain multimodal features to enhance the detection of road and bridge changes by addressing linear continuity and semantic ambiguity issues, achieving state-of-the-art results.

Authors:Sand. ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shucheng Yin, Siran Zhang, Tingting Liu, Xianping Yin, Xiaoyu Yang, Xin Song, Xuan Hu, Yankai Zhang, Yuqiao Li
Title: MAGI-1: Autoregressive Video Generation at Scale
Abstract:
We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
中文: MAGI-1是一种自回归世界模型,通过预测连续帧片段生成视频,在文本引导的图像转视频任务中表现出色,具有高时间一致性和可扩展的实时部署能力。
English: MAGI-1 is an autoregressive world model that generates videos by predicting sequential chunks of frames, achieving strong performance in text-conditioned image-to-video tasks with high temporal consistency and scalable, real-time deployment.

Authors:Yuzhen Chen, Hojun Son, Arpan Kusari
Title: MatPredict: a dataset and benchmark for learning material properties of diverse indoor objects
Abstract:
Determining material properties from camera images can expand the ability to identify complex objects in indoor environments, which is valuable for consumer robotics applications. To support this, we introduce MatPredict, a dataset that combines the high-quality synthetic objects from the Replica dataset with the material property classes of the MatSynth dataset to create objects with diverse material properties. We select 3D meshes of specific foreground objects and render them with different material properties. In total, we generate 18 commonly occurring objects with 14 different materials. We showcase how we provide variability in terms of lighting and camera placement for these objects. Next, we provide a benchmark for inferring material properties from visual images using these perturbed models in the scene, discussing the specific neural network models involved and their performance based on different image comparison metrics. By accurately simulating light interactions with different materials, we can enhance realism, which is crucial for training models effectively through large-scale simulations. This research aims to revolutionize perception in consumer robotics. The dataset is provided at https://huggingface.co/datasets/UMTRI/MatPredict and the code is provided at https://github.com/arpan-kusari/MatPredict.
中文: 本研究推出MatPredict数据集,通过融合合成物体与多样化材质属性来提升消费级机器人从图像中识别材质的能力,建立了神经网络性能基准,旨在革新机器人感知技术。
English: This research introduces MatPredict, a dataset combining synthetic objects with diverse material properties to enhance material recognition from images for consumer robotics, providing benchmarks for neural network performance and aiming to revolutionize robotic perception.

Authors:Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang
Title: Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space
Abstract:
We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying continuous autoregressive distribution. By bypassing reliance on residual vector quantization, SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models. It simplifies the overall modeling pipeline while preserving the richness of speech information and maintaining inference efficiency. Empirical results demonstrate that SLED achieves strong performance in both zero-shot and streaming speech synthesis, showing its potential for broader applications in general-purpose speech language models.
中文摘要:SLED是一种创新的语音语言模型,采用连续潜在表示和能量距离目标进行高效自回归训练,无需量化或复杂层级结构,在零样本和流式语音合成中均表现出色。
English Summary: SLED is a novel speech language model that uses continuous latent representations and an energy distance objective for efficient autoregressive training, eliminating the need for quantization and complex hierarchies while achieving strong performance in zero-shot and streaming synthesis.
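
The energy distance mentioned above has a simple sample-based estimate: twice the mean distance between the two sample sets, minus the mean distance within each set. The sketch below computes it for scalar samples to keep the arithmetic visible; SLED works with continuous latent vectors and its exact training objective is not given in the abstract, so this is only the textbook quantity, and the function names are hypothetical.

```python
from typing import Sequence


def mean_pairwise_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Average |x - y| over all pairs (x in a, y in b)."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def energy_distance(simulated: Sequence[float], target: Sequence[float]) -> float:
    """Sample estimate of 2 E|X - Y| - E|X - X'| - E|Y - Y'|.

    Self-pairs are included in the within-set terms, so the estimate is
    exactly zero when the two sample sets are identical.
    """
    cross = mean_pairwise_distance(simulated, target)
    within_simulated = mean_pairwise_distance(simulated, simulated)
    within_target = mean_pairwise_distance(target, target)
    return 2.0 * cross - within_simulated - within_target


if __name__ == "__main__":
    print(energy_distance([0.0, 1.0], [10.0, 11.0]))  # cross term dominates
    print(energy_distance([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]))
```

The cross term pulls generated samples toward the targets, while the within-set term for the generated samples rewards spread, which is what lets the objective capture a distribution rather than a single point estimate.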

Authors:Zihao Cheng, Hongru Wang, Zeming Liu, Yuhang Guo, Yuanfang Guo, Yunhong Wang, Haifeng Wang
Title: ToolSpectrum : Towards Personalized Tool Utilization for Large Language Models
Abstract:
While integrating external tools into large language models (LLMs) enhances their ability to access real-time information and domain-specific services, existing approaches focus narrowly on functional tool selection following user instructions, overlooking context-aware personalization in tool selection. This oversight leads to suboptimal user satisfaction and inefficient tool utilization, particularly when overlapping toolsets require nuanced selection based on contextual factors. To bridge this gap, we introduce ToolSpectrum, a benchmark designed to evaluate LLMs' capabilities in personalized tool utilization. Specifically, we formalize two key dimensions of personalization, user profile and environmental factors, and analyze their individual and synergistic impacts on tool utilization. Through extensive experiments on ToolSpectrum, we demonstrate that personalized tool utilization significantly improves user experience across diverse scenarios. However, even state-of-the-art LLMs exhibit limited ability to reason jointly about user profiles and environmental factors, often prioritizing one dimension at the expense of the other. Our findings underscore the necessity of context-aware personalization in tool-augmented LLMs and reveal critical limitations of current models. Our data and code are available at https://github.com/Chengziha0/ToolSpectrum.
中文摘要:本文提出ToolSpectrum基准,通过评估用户画像和环境因素对工具选择的影响,解决增强型大语言模型缺乏情境感知个性化的问题,研究表明尽管个性化能提升用户体验,现有模型仍难以平衡这两个维度。
English Summary: This paper introduces ToolSpectrum, a benchmark addressing the lack of context-aware personalization in tool-augmented LLMs by evaluating how user profiles and environmental factors impact tool selection, revealing that while personalization improves user experience, current models struggle to balance these dimensions effectively.

Authors:Yassine El Boudouri, Walter Nuninger, Julian Alvarez, Yvan Peter
Title: Role-Playing Evaluation for Large Language Models
Abstract:
Large Language Models (LLMs) demonstrate a notable capacity for adopting personas and engaging in role-playing. However, evaluating this ability presents significant challenges, as human assessments are resource-intensive and automated evaluations can be biased. To address this, we introduce Role-Playing Eval (RPEval), a novel benchmark designed to assess LLM role-playing capabilities across four key dimensions: emotional understanding, decision-making, moral alignment, and in-character consistency. This article details the construction of RPEval and presents baseline evaluations. Our code and dataset are available at https://github.com/yelboudouri/RPEval
中文摘要:作者提出角色扮演评估(RPEval)这一新基准,通过情感理解、决策能力、道德对齐和角色一致性四个维度评估大语言模型的角色扮演能力,以解决现有评估方法的不足。
English Summary: The authors introduce Role-Playing Eval (RPEval), a novel benchmark to assess LLM role-playing capabilities across emotional understanding, decision-making, moral alignment, and in-character consistency, addressing limitations in current evaluation methods.

Authors:Francesco Innocenti, El Mehdi Achour, Christopher L. Buckley
Title: $μ$PC: Scaling Predictive Coding to 100+ Layer Networks
Abstract:
The biological implausibility of backpropagation (BP) has motivated many alternative, brain-inspired algorithms that attempt to rely only on local information, such as predictive coding (PC) and equilibrium propagation. However, these algorithms have notoriously struggled to train very deep networks, preventing them from competing with BP in large-scale settings. Indeed, scaling PC networks (PCNs) has recently been posed as a challenge for the community (Pinchetti et al., 2024). Here, we show that 100+ layer PCNs can be trained reliably using a Depth-$μ$P parameterisation (Yang et al., 2023; Bordelon et al., 2023) which we call "$μ$PC". Through an extensive analysis of the scaling behaviour of PCNs, we reveal several pathologies that make standard PCNs difficult to train at large depths. We then show that, despite addressing only some of these instabilities, $μ$PC allows stable training of very deep (up to 128-layer) residual networks on simple classification tasks with competitive performance and little tuning compared to current benchmarks. Moreover, $μ$PC enables zero-shot transfer of both weight and activity learning rates across widths and depths. Our results have implications for other local algorithms and could be extended to convolutional and transformer architectures. Code for $μ$PC is made available as part of a JAX library for PCNs at https://github.com/thebuckleylab/jpc (Innocenti et al., 2024).
中文: 本研究提出$μ$PC方法,采用深度-$μ$P参数化技术,成功解决了预测编码网络在深层训练中的不稳定问题,实现了超过100层网络的稳定训练,并在分类任务中取得具有竞争力的性能且无需大量调参。
English: The study introduces $μ$PC, a method using Depth-$μ$P parameterization that enables stable training of over 100-layer predictive coding networks, addressing instabilities and achieving competitive performance on classification tasks with minimal tuning.

Authors:Ji Qi, Tam Thuc Do, Mingxiao Liu, Zhuoshi Pan, Yuzhe Li, Gene Cheung, H. Vicky Zhao
Title: Lightweight Transformer via Unrolling of Mixed Graph Algorithms for Traffic Forecast
Abstract:
To forecast traffic with both spatial and temporal dimensions, we unroll a mixed-graph-based optimization algorithm into a lightweight and interpretable transformer-like neural net. Specifically, we construct two graphs: an undirected graph $\mathcal{G}^u$ capturing spatial correlations across geography, and a directed graph $\mathcal{G}^d$ capturing sequential relationships over time. We formulate a prediction problem for the future samples of signal $\mathbf{x}$, assuming it is "smooth" with respect to both $\mathcal{G}^u$ and $\mathcal{G}^d$, where we design new $\ell_2$ and $\ell_1$-norm variational terms to quantify and promote signal smoothness (low-frequency reconstruction) on a directed graph. We construct an iterative algorithm based on the alternating direction method of multipliers (ADMM), and unroll it into a feed-forward network for data-driven parameter learning. We insert graph learning modules for $\mathcal{G}^u$ and $\mathcal{G}^d$, which are akin to the self-attention mechanism in classical transformers. Experiments show that our unrolled networks achieve traffic forecast performance competitive with state-of-the-art prediction schemes, while reducing parameter counts drastically. Our code is available at https://github.com/SingularityUndefined/Unrolling-GSP-STForecast.
Chinese: 我们通过展开混合图优化算法,构建了一个轻量级、可解释的类Transformer神经网络,用于捕捉交通预测中的时空依赖关系,在显著减少参数的同时实现了与先进方法相媲美的性能。
English: We develop a lightweight, interpretable transformer-like neural network by unrolling a mixed-graph optimization algorithm that captures spatial and temporal dependencies for traffic forecasting, achieving competitive performance with significantly fewer parameters.

Authors:Chengtang Yao, Zhidan Liu, Jiaxi Zeng, Lidong Yu, Yuwei Wu, Yunde Jia
Title: 3D Visual Illusion Depth Estimation
Abstract:
3D visual illusion is a perceptual phenomenon where a two-dimensional plane is manipulated to simulate three-dimensional spatial relationships, making a flat artwork or object look three-dimensional in the human visual system. In this paper, we reveal that the machine visual system is also seriously fooled by 3D visual illusions, including monocular and binocular depth estimation. In order to explore and analyze the impact of 3D visual illusion on depth estimation, we collect a large dataset containing almost 3k scenes and 200k images to train and evaluate SOTA monocular and binocular depth estimation methods. We also propose a robust depth estimation framework that uses common sense from a vision-language model to adaptively select reliable depth from binocular disparity and monocular depth. Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance.
中文: 本研究揭示机器视觉系统在深度估计任务中同样会受到三维视觉错觉的严重误导,并提出一种利用视觉语言常识的鲁棒框架,其性能超越了现有方法。
English: This study demonstrates that machine vision systems are significantly deceived by 3D visual illusions in depth estimation tasks, and proposes a robust framework leveraging vision-language common sense to outperform existing methods.

Authors:Hongrui Kou, Jingkai Li, Ziyu Wang, Zhouhang Lv, Yuxin Zhang, Cheng Wang
Title: PPTNet: A Hybrid Periodic Pattern-Transformer Architecture for Traffic Flow Prediction and Congestion Identification
Abstract:
Accurate prediction of traffic flow parameters and real-time identification of congestion states are essential for the efficient operation of intelligent transportation systems. This paper proposes a Periodic Pattern Transformer Network (PPTNet) for traffic flow prediction, integrating periodic pattern extraction with the Transformer architecture, coupled with a fuzzy inference method for real-time congestion identification. First, a high-precision traffic flow dataset (Traffic Flow Dataset for China's Congested Highways and Expressways, TF4CHE) suitable for congested highway scenarios in China is constructed based on drone aerial imagery data. Subsequently, the proposed PPTNet employs the Fast Fourier Transform to capture multi-scale periodic patterns and utilizes two-dimensional Inception convolutions to efficiently extract intra-period and inter-period features. A Transformer decoder dynamically models temporal dependencies, enabling accurate predictions of traffic density and speed. Finally, congestion probabilities are calculated in real time from the predicted outcomes via a Mamdani fuzzy inference-based congestion identification module. Experimental results demonstrate that the proposed PPTNet significantly outperforms mainstream traffic prediction methods in prediction accuracy, and the congestion identification module effectively identifies real-time road congestion states, verifying the superiority and practicality of the proposed method in real-world traffic scenarios. Project page: https://github.com/ADSafetyJointLab/PPTNet.
中文摘要:本文提出PPTNet模型,结合周期模式提取与Transformer架构进行交通流预测,并采用模糊推理方法实时识别拥堵状态,在中国高速公路数据集上验证了其优越性能。
English Summary: This paper introduces PPTNet, a model combining periodic pattern analysis with Transformer architecture for accurate traffic flow prediction and real-time congestion identification using fuzzy inference, validated on a specialized Chinese highway dataset.
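
The abstract's FFT step can be illustrated with a minimal sketch: find the frequency bins with the most energy, convert them to periods, and fold the series into a 2D grid (one row per cycle) so that a 2D convolution could see within-period and across-period structure. This is not the PPTNet code; the function names, the toy signal, and the naive DFT are assumptions made for illustration.

```python
import cmath
import math

def dft_amplitudes(series):
    """Amplitude of each non-negative frequency bin via a naive O(n^2) DFT."""
    n = len(series)
    amps = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series))
        amps.append(abs(coeff) / n)
    return amps

def top_periods(series, k=2):
    """Return the k periods (in time steps) whose frequency bins carry the most energy."""
    n = len(series)
    amps = dft_amplitudes(series)
    amps[0] = 0.0  # ignore the mean (zero-frequency) component
    bins = sorted(range(1, len(amps)), key=lambda f: amps[f], reverse=True)[:k]
    return [n // f for f in bins]

def fold_by_period(series, period):
    """Reshape a 1D series into rows of one period each (trailing remainder dropped)."""
    rows = len(series) // period
    return [series[r * period:(r + 1) * period] for r in range(rows)]

# Toy signal (hypothetical): 96 steps with a strong 24-step cycle and a weaker 8-step cycle.
signal = [math.sin(2 * math.pi * t / 24) + 0.5 * math.sin(2 * math.pi * t / 8) for t in range(96)]
periods = top_periods(signal, k=2)
grid = fold_by_period(signal, periods[0])  # rows = cycles, columns = position within a cycle
```

In the folded grid, neighbouring columns hold adjacent time steps of one cycle and neighbouring rows hold the same phase of consecutive cycles, which is the layout a 2D kernel would need to mix both kinds of context.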

Authors:Xiao Wu, Xiaoqing Zhang, Zunjie Xiao, Lingxi Hu, Risa Higashita, Jiang Liu
Title: Expert-Like Reparameterization of Heterogeneous Pyramid Receptive Fields in Efficient CNNs for Fair Medical Image Classification
Abstract:
Efficient convolutional neural network (CNN) architecture design has attracted growing research interest. However, existing designs typically apply a single receptive field (RF), small asymmetric RFs, or pyramid RFs to learn different feature representations, and still encounter two significant challenges in medical image classification tasks: 1) They are limited in efficiently capturing diverse lesion characteristics, e.g., tiny, coordination, small and salient, which play distinct roles in the classification results, especially in imbalanced medical image classification. 2) The predictions generated by those CNNs are often unfair/biased, bringing a high risk when they are deployed in real-world medical diagnosis. To tackle these issues, we develop a new concept, Expert-Like Reparameterization of Heterogeneous Pyramid Receptive Fields (ERoHPRF), to simultaneously boost medical image classification performance and fairness. This concept aims to mimic the multi-expert consultation mode by applying a well-designed heterogeneous pyramid RF bag to effectively capture lesion characteristics of varying significance via convolution operations with multiple heterogeneous kernel sizes. Additionally, ERoHPRF introduces an expert-like structural reparameterization technique that merges its parameters with a two-stage strategy, keeping computation cost and inference speed competitive with a single RF. To demonstrate the effectiveness and generalization ability of ERoHPRF, we incorporate it into mainstream efficient CNN architectures. Extensive experiments show that our proposed ERoHPRF achieves a better trade-off than state-of-the-art methods in terms of medical image classification, fairness, and computation overhead. The code of this paper is available at https://github.com/XiaoLing12138/Expert-Like-Reparameterization-of-Heterogeneous-Pyramid-Receptive-Fields.
中文: 本研究提出ERoHPRF方法,通过模拟多专家会诊模式,利用异构金字塔感受野有效提升医学图像分类的性能与公平性,同时保持计算效率的竞争优势。
English: The study introduces ERoHPRF, a novel approach using heterogeneous pyramid receptive fields to enhance both performance and fairness in medical image classification by mimicking multi-expert consultation, while maintaining computational efficiency.
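
Structural reparameterization in general relies on the linearity of convolution: parallel branches with different kernel sizes can be folded into one kernel by zero-padding the smaller kernels and summing the weights. The sketch below shows only that generic trick in 1D; it is not the paper's two-stage strategy, and the kernel values and helper names are made up for illustration.

```python
def conv1d_same(signal, kernel):
    """Cross-correlation with zero padding so the output length equals the input length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel))) for i in range(len(signal))]

def pad_kernel(kernel, size):
    """Zero-pad an odd-length kernel symmetrically to a larger odd size."""
    extra = (size - len(kernel)) // 2
    return [0.0] * extra + list(kernel) + [0.0] * extra

def merge_branches(kernels):
    """Fold parallel branches into one kernel by summing their zero-padded weights."""
    size = max(len(k) for k in kernels)
    padded = [pad_kernel(k, size) for k in kernels]
    return [sum(ws) for ws in zip(*padded)]

# Hypothetical branches with receptive fields 1, 3, and 5.
branches = [[0.5], [0.25, 0.5, 0.25], [0.1, 0.2, 0.4, 0.2, 0.1]]
x = [1.0, 2.0, 0.0, -1.0, 3.0, 4.0]

# Training-time view: run every branch and add the outputs.
multi_branch = [sum(vals) for vals in zip(*(conv1d_same(x, k) for k in branches))]
# Inference-time view: one merged kernel, one convolution.
single = conv1d_same(x, merge_branches(branches))
```

The two outputs agree up to floating-point error, which is why a multi-branch block can be trained and then collapsed to the cost of a single convolution at inference time.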

Authors:Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu, Ruibin Yuan, Zhisheng Zheng, Ziya Zhou, Haina Zhu, Wei Xue, Emmanouil Benetos, Kai Yu, Eng-Siong Chng, Xie Chen
Title: MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
Abstract:
We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.
中文: MMAR是一个新颖的基准,用于评估音频语言模型的深度推理能力,包含1000个高质量音频-问题-答案三元组,涵盖广泛真实场景和四个层次化推理层级,并标注思维链原理以推动音频推理研究发展。
English: MMAR is a novel benchmark for evaluating deep reasoning in Audio-Language Models, featuring 1,000 high-quality audio-question-answer triplets spanning diverse real-world scenarios and four hierarchical reasoning layers, with annotated Chain-of-Thought rationales to advance audio reasoning research.

Authors:Yicheng Xiao, Lin Song, Yukang Chen, Yingmin Luo, Yuxin Chen, Yukang Gan, Wei Huang, Xiu Li, Xiaojuan Qi, Ying Shan
Title: MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
Abstract:
Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks. We introduce MindOmni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning. MindOmni leverages a three-phase training strategy: i) design of a unified vision language model with a decoder-only diffusion module, ii) supervised fine-tuning with Chain-of-Thought (CoT) instruction data, and iii) our proposed Reasoning Generation Policy Optimization (RGPO) algorithm, utilizing multimodal feedback to effectively guide policy updates. Experimental results demonstrate that MindOmni outperforms existing models, achieving impressive performance on both understanding and generation benchmarks, meanwhile showcasing advanced fine-grained reasoning generation capabilities, especially with mathematical reasoning instruction. All codes will be made public at https://github.com/TencentARC/MindOmni
中文摘要:MindOmni是一种统一的多模态大语言模型,通过强化学习和三阶段训练策略解决了文本到图像系统在处理多模态输入和复杂推理任务时的局限性,在理解和生成基准测试中表现卓越,尤其展现出先进的数学推理能力。
English Summary: MindOmni is a unified multimodal large language model that overcomes limitations in text-to-image systems by integrating reasoning generation through reinforcement learning and a three-phase training strategy, achieving superior performance on understanding and generation benchmarks with advanced mathematical reasoning capabilities.

Authors:Federico Del Pup, Andrea Zanola, Louis Fabrice Tshimanga, Alessandra Bertoldo, Livio Finos, Manfredo Atzori
Title: The role of data partitioning on the performance of EEG-based deep learning models in supervised cross-subject analysis: a preliminary study
Abstract:
Deep learning is significantly advancing the analysis of electroencephalography (EEG) data by effectively discovering highly nonlinear patterns within the signals. Data partitioning and cross-validation are crucial for assessing model performance and ensuring study comparability, as they can produce varied results and data leakage due to specific signal properties (e.g., biometric). Such variability leads to incomparable studies and, increasingly, overestimated performance claims, which are detrimental to the field. Nevertheless, no comprehensive guidelines for proper data partitioning and cross-validation exist in the domain, nor is there a quantitative evaluation of their impact on model accuracy, reliability, and generalizability. To assist researchers in identifying optimal experimental strategies, this paper thoroughly investigates the role of data partitioning and cross-validation in evaluating EEG deep learning models. Five cross-validation settings are compared across three supervised cross-subject classification tasks (BCI, Parkinson's, and Alzheimer's disease detection) and four established architectures of increasing complexity (ShallowConvNet, EEGNet, DeepConvNet, and Temporal-based ResNet). The comparison of over 100,000 trained models underscores, first, the importance of using subject-based cross-validation strategies for evaluating EEG deep learning models, except when within-subject analyses are acceptable (e.g., BCI). Second, it highlights the greater reliability of nested approaches (N-LNSO) compared to non-nested counterparts, which are prone to data leakage and favor larger models overfitting to validation data. In conclusion, this work provides EEG deep learning researchers with an analysis of data partitioning and cross-validation and offers guidelines to avoid data leakage, currently undermining the domain with potentially overestimated performance claims.
中文: 深度学习通过识别复杂模式提升了脑电图分析能力,但数据划分和交叉验证方法对模型可靠性及泛化性影响显著,建议采用基于受试者的嵌套验证策略以避免数据泄露和性能高估问题。
English: Deep learning enhances EEG analysis by detecting complex patterns, but data partitioning and cross-validation methods significantly impact model reliability and generalizability, with subject-based and nested approaches recommended to prevent data leakage and overestimated performance.
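
A subject-based split, as recommended above, keeps every recording of a subject on one side of the train/test boundary, so subject-specific signal properties cannot leak. The following is a minimal sketch of that idea only; the helper name, fold assignment, and toy data are assumptions and do not reproduce the paper's N-LNSO protocol.

```python
import random

def subject_kfold(subject_ids, n_folds, seed=0):
    """Yield (train_idx, test_idx) so that each subject's samples stay on one side of the split."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    folds = [subjects[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        test_idx = [i for i, s in enumerate(subject_ids) if s in held]
        train_idx = [i for i, s in enumerate(subject_ids) if s not in held]
        yield train_idx, test_idx

# Hypothetical toy data: 6 subjects with 4 EEG windows each.
subject_ids = [s for s in range(6) for _ in range(4)]
splits = list(subject_kfold(subject_ids, n_folds=3))
```

A nested variant would apply the same subject-level partitioning a second time to the training subjects of each outer fold, so that the validation subjects used for model selection are also disjoint from the test subjects.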

Authors:Yuhao Qing, Boyu Zhu, Mingzhe Du, Zhijiang Guo, Terry Yue Zhuo, Qianru Zhang, Jie M. Zhang, Heming Cui, Siu-Ming Yiu, Dong Huang, See-Kiong Ng, Luu Anh Tuan
Title: EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code
Abstract:
Existing code generation benchmarks primarily evaluate functional correctness, with limited focus on code efficiency and often restricted to a single language like Python. To address this gap, we introduce EffiBench-X, the first multi-language benchmark designed to measure the efficiency of LLM-generated code. EffiBench-X supports Python, C++, Java, JavaScript, Ruby, and Golang. It comprises competitive programming tasks with human-expert solutions as efficiency baselines. Evaluating state-of-the-art LLMs on EffiBench-X reveals that while models generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM-generated solutions (Qwen3-32B) achieve only around 62% of human efficiency on average, with significant language-specific variations. LLMs show better efficiency in Python, Ruby, and JavaScript than in Java, C++, and Golang. For instance, DeepSeek-R1's Python code is significantly more efficient than its Java code. These results highlight the critical need for research into LLM optimization techniques to improve code efficiency across diverse languages. The dataset and evaluation infrastructure are submitted and available at https://github.com/EffiBench/EffiBench-X.git and https://huggingface.co/datasets/EffiBench/effibench-x.
中文:EffiBench-X是首个多语言代码效率评测基准,覆盖六种编程语言,研究发现大模型生成的代码虽功能正确但效率显著低于人类专家,平均仅达人类效率的62%,且存在明显的语言差异性。
English: EffiBench-X is the first multi-language benchmark to evaluate code efficiency across six programming languages, revealing that LLMs produce functionally correct but significantly less efficient code than human experts, achieving only 62% of human efficiency on average with notable variations between languages.

Authors:Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, Zhizheng Wu
Title: DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
Abstract:
Neural audio codecs form the foundational building blocks for language model (LM)-based speech generation. Typically, there is a trade-off between frame rate and audio quality. This study introduces a low-frame-rate, semantically enhanced codec model. Existing approaches distill semantically rich self-supervised (SSL) representations into the first-layer codec tokens. This work proposes DualCodec, a dual-stream encoding approach that integrates SSL and waveform representations within an end-to-end codec framework. In this setting, DualCodec enhances the semantic information in the first-layer codec and enables the codec system to maintain high audio quality while operating at a low frame rate. Note that a low-frame-rate codec improves the efficiency of speech generation. Experimental results on audio codec and speech generation tasks confirm the effectiveness of the proposed DualCodec compared to state-of-the-art codec systems, such as Mimi Codec, SpeechTokenizer, DAC, and Encodec. Demos are available at: https://dualcodec.github.io, code is available at: https://github.com/jiaqili3/DualCodec
中文: 本研究提出DualCodec双流编码方法,通过整合自监督学习和波形表示来增强编码器令牌的语义信息,在低帧率下保持高音质,从而提升语音生成效率。
English: This study introduces DualCodec, a dual-stream encoding approach that integrates self-supervised and waveform representations to enhance semantic information in codec tokens, enabling high audio quality at low frame rates for efficient speech generation.

Authors:Lorena Garcia-Foncillas Macias, Aaron Kujawa, Aya Elshalakany, Jonathan Shapey, Tom Vercauteren
Title: A generalisable head MRI defacing pipeline: Evaluation on 2,566 meningioma scans
Abstract:
Reliable MRI defacing techniques to safeguard patient privacy while preserving brain anatomy are critical for research collaboration. Existing methods often struggle with incomplete defacing or degradation of brain tissue regions. We present a robust, generalisable defacing pipeline for high-resolution MRI that integrates atlas-based registration with brain masking. Our method was evaluated on 2,566 heterogeneous clinical scans for meningioma and achieved a 99.92% success rate (2,564/2,566) upon visual inspection. Excellent anatomical preservation is demonstrated with a Dice similarity coefficient of 0.9975 ± 0.0023 between brain masks automatically extracted from the original and defaced volumes. Source code is available at https://github.com/cai4cai/defacing_pipeline.
中文: 本文提出了一种鲁棒且通用的MRI去身份化流程,通过整合基于图谱的配准和脑部掩膜技术,在保护患者隐私的同时有效保留脑部解剖结构,在临床扫描中实现了99.92%的成功率并展现出卓越的解剖完整性。
English: This paper introduces a robust and generalizable MRI defacing pipeline that ensures patient privacy by effectively removing facial features while preserving brain anatomy, achieving a 99.92% success rate on clinical scans and demonstrating excellent anatomical integrity.
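
For readers unfamiliar with the two reported numbers, here is a small sketch of how a Dice similarity coefficient between two binary masks is computed, together with the success-rate arithmetic from the abstract. The tiny masks are invented for illustration; real brain masks are 3D volumes, flattened here to lists.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient 2|A∩B| / (|A| + |B|) for flat binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 1.0 if total == 0 else 2.0 * inter / total

# Hypothetical 8-voxel masks: the defaced volume loses one brain voxel.
original = [0, 1, 1, 1, 1, 0, 0, 1]
defaced = [0, 1, 1, 1, 0, 0, 0, 1]
score = dice(original, defaced)  # 2 * 4 / (5 + 4)

# Success rate reported in the abstract: 2,564 of 2,566 scans.
success_rate = 100.0 * 2564 / 2566
```

A Dice value of 1.0 means the two masks are identical, so the reported 0.9975 indicates that brain masks extracted before and after defacing differ only marginally.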

Authors:Vinkle Srivastav, Juliette Puel, Jonathan Vappou, Elijah Van Houten, Paolo Cabras, Nicolas Padoy
Title: A Skull-Adaptive Framework for AI-Based 3D Transcranial Focused Ultrasound Simulation
Abstract:
Transcranial focused ultrasound (tFUS) is an emerging modality for non-invasive brain stimulation and therapeutic intervention, offering millimeter-scale spatial precision and the ability to target deep brain structures. However, the heterogeneous and anisotropic nature of the human skull introduces significant distortions to the propagating ultrasound wavefront, which require time-consuming patient-specific planning and corrections using numerical solvers for accurate targeting. To enable data-driven approaches in this domain, we introduce TFUScapes, the first large-scale, high-resolution dataset of tFUS simulations through anatomically realistic human skulls derived from T1-weighted MRI images. We have developed a scalable simulation engine pipeline using the k-Wave pseudo-spectral solver, where each simulation returns a steady-state pressure field generated by a focused ultrasound transducer placed at realistic scalp locations. In addition to the dataset, we present DeepTFUS, a deep learning model that estimates normalized pressure fields directly from input 3D CT volumes and transducer position. The model extends a U-Net backbone with transducer-aware conditioning, incorporating Fourier-encoded position embeddings and MLP layers to create global transducer embeddings. These embeddings are fused with U-Net encoder features via feature-wise modulation, dynamic convolutions, and cross-attention mechanisms. The model is trained using a combination of spatially weighted and gradient-sensitive loss functions, enabling it to approximate high-fidelity wavefields. The TFUScapes dataset is publicly released to accelerate research at the intersection of computational acoustics, neurotechnology, and deep learning. The project page is available at https://github.com/CAMMA-public/TFUScapes.
中文: 本研究推出了TFUScapes——首个基于真实颅骨解剖的大规模经颅聚焦超声模拟数据集,并开发了DeepTFUS深度学习模型,能够通过CT影像和换能器位置直接估算压力场,显著提升靶向精准度。
English: The study introduces TFUScapes, a large-scale dataset of transcranial focused ultrasound simulations through human skulls, and DeepTFUS, a deep learning model that efficiently estimates pressure fields from CT scans and transducer positions to improve targeting accuracy.
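
Two of the conditioning ingredients named in the abstract, Fourier-encoded position embeddings and feature-wise modulation, can be sketched in a few lines. This is an illustrative reading, not the DeepTFUS implementation: the frequency schedule, the number of bands, and the fixed `gamma`/`beta` values (which a learned MLP would normally produce from the embedding) are all assumptions.

```python
import math

def fourier_encode(position, num_bands=4):
    """Map each coordinate x to [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..num_bands-1."""
    features = []
    for x in position:
        for k in range(num_bands):
            angle = (2 ** k) * math.pi * x
            features.extend([math.sin(angle), math.cos(angle)])
    return features

def film(feature_map, gamma, beta):
    """Feature-wise modulation: scale and shift each channel by per-channel values."""
    return [[gamma[c] * v + beta[c] for v in channel] for c, channel in enumerate(feature_map)]

# Hypothetical normalized transducer coordinates (x, y, z).
transducer_xyz = (0.25, -0.5, 0.75)
embedding = fourier_encode(transducer_xyz, num_bands=4)  # 3 coords * 4 bands * 2 = 24 values

# Two-channel toy feature map modulated with made-up gamma and beta.
modulated = film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[1.0, 0.0])
```

The encoding lets a network distinguish nearby transducer positions at several spatial scales, and the modulation is one simple way to inject that global information into every spatial location of an encoder feature map.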

Authors:Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong
Title: Fractured Chain-of-Thought Reasoning
Abstract:
Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning. Code is available at https://github.com/BaohaoLiao/frac-cot.
中文:Fractured Sampling是一种创新的推理时策略,通过调整推理轨迹、截断深度和解决方案数量,在无需重新训练大语言模型的情况下,优化了推理准确性与计算成本之间的平衡,实现了更高的效率。
English: Fractured Sampling is a new inference-time strategy that optimizes the trade-off between reasoning accuracy and computational cost by adjusting reasoning trajectories, truncation depth, and solution counts, achieving superior efficiency without retraining LLMs.
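
The three sampling axes can be pictured as a nested loop: n reasoning trajectories, several truncation depths per trajectory, and m final answers per truncated prefix. The sketch below uses random stand-ins for the language model, so `sample_trajectory`, `sample_answer`, the answer value 42, and the majority-vote aggregation are all hypothetical; only the loop structure reflects the description above.

```python
import random
from collections import Counter

def sample_trajectory(rng, length=8):
    """Stand-in for one chain of thought: a list of reasoning 'steps' (here just noise)."""
    return [rng.random() for _ in range(length)]

def sample_answer(rng, partial_trace):
    """Stand-in for decoding a final answer from a truncated trace.
    Longer prefixes make the (arbitrary) correct answer 42 more likely."""
    p_correct = 0.3 + 0.05 * len(partial_trace)
    return 42 if rng.random() < p_correct else rng.randint(0, 9)

def fractured_sampling(n, m, depths, seed=0):
    """Collect answers over n trajectories x len(depths) truncation points x m solutions."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n):
        trace = sample_trajectory(rng)
        for frac in depths:
            prefix = trace[: max(1, int(frac * len(trace)))]
            answers.extend(sample_answer(rng, prefix) for _ in range(m))
    return answers

answers = fractured_sampling(n=4, m=2, depths=[0.25, 0.5, 1.0])
majority, _ = Counter(answers).most_common(1)[0]  # one possible aggregation rule
```

With this layout the answer pool grows as n * len(depths) * m, while the expensive reasoning tokens are generated only n times, which is the source of the cost savings the abstract describes.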

Authors:Shanshan Liu, Noriki Nishida, Rumana Ferdous Munne, Narumi Tokunaga, Yuki Yamagata, Kouji Kozaki, Yuji Matsumoto
Title: MA-COIR: Leveraging Semantic Search Index and Generative Models for Ontology-Driven Biomedical Concept Recognition
Abstract:
Recognizing biomedical concepts in the text is vital for ontology refinement, knowledge graph construction, and concept relationship discovery. However, traditional concept recognition methods, relying on explicit mention identification, often fail to capture complex concepts not explicitly stated in the text. To overcome this limitation, we introduce MA-COIR, a framework that reformulates concept recognition as an indexing-recognition task. By assigning semantic search indexes (ssIDs) to concepts, MA-COIR resolves ambiguities in ontology entries and enhances recognition efficiency. Using a pretrained BART-based model fine-tuned on small datasets, our approach reduces computational requirements to facilitate adoption by domain experts. Furthermore, we incorporate large language models (LLMs)-generated queries and synthetic data to improve recognition in low-resource settings. Experimental results on three scenarios (CDR, HPO, and HOIP) highlight the effectiveness of MA-COIR in recognizing both explicit and implicit concepts without the need for mention-level annotations during inference, advancing ontology-driven concept recognition in biomedical domain applications. Our code and constructed data are available at https://github.com/sl-633/macoir-master.
中文摘要:MA-COIR框架通过将生物医学概念识别重构为索引任务,利用语义搜索标识符有效识别显性和隐性概念,无需在推理过程中进行提及级标注,显著提升了本体驱动概念识别的性能。
English Summary: MA-COIR is a novel framework that reframes biomedical concept recognition as an indexing task using semantic search identifiers, enabling efficient identification of both explicit and implicit concepts without requiring mention-level annotations during inference.

Authors:Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han, Zhiyuan He, Dongsheng Li, Yunjian Xu
Title: Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs
Abstract:
Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
中文摘要:本研究针对强化学习中低概率词元梯度干扰问题,提出了优势重加权和Lopti两种方法,有效平衡不同概率词元的参数更新,使大语言模型在推理任务中的性能提升高达46.2%。
English Summary: This research addresses the imbalance in reinforcement learning for large language models by introducing Advantage Reweighting and Lopti methods, which reduce gradient interference from low-probability tokens and enhance learning from high-probability ones, achieving up to 46.2% improvement in reasoning tasks.
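
Why low-probability tokens produce large gradients can be seen from the softmax itself: the gradient of log p(token) with respect to the logits is onehot(token) - softmax(logits), whose norm grows as the token's probability shrinks. The sketch below checks this numerically and adds a linear reweighting of the advantage by token probability. That reweighting form and the value `alpha=0.3` are assumptions for illustration; the authors' exact rule should be taken from their released implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def logprob_grad_norm(logits, token):
    """L2 norm of d log p(token) / d logits, which equals onehot(token) - softmax(logits)."""
    probs = softmax(logits)
    grad = [(1.0 if j == token else 0.0) - p for j, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))

def reweight_advantage(advantage, prob, alpha=0.3):
    """Linearly shrink the advantage of low-probability tokens (assumed form, see lead-in)."""
    return (alpha * prob + (1.0 - alpha)) * advantage

logits = [4.0, 1.0, 0.0, -2.0]
probs = softmax(logits)
norm_likely = logprob_grad_norm(logits, token=0)  # high-probability token
norm_rare = logprob_grad_norm(logits, token=3)    # low-probability token
```

In this toy case the rare token's gradient norm is more than ten times that of the likely token, so an unweighted update is dominated by it; scaling the advantage by a factor that increases with probability is one way to rebalance the two.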

Authors:Han Deng, Yuan Meng, Shixiang Tang, Wanli Ouyang, Xinzhu Ma
Title: CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming
Abstract:
Competitive programming benchmarks are widely used in scenarios such as programming contests and large language model assessments. However, the growing presence of duplicate or highly similar problems raises concerns not only about competition fairness, but also about the validity of competitive programming as a benchmark for model evaluation. In this paper, we propose a new problem -- similar question retrieval -- to address this issue. Due to the lack of both data and models, solving this problem is challenging. To this end, we introduce CPRet, a retrieval-oriented benchmark suite for competitive programming, covering four retrieval tasks: two code-centric (i.e., Text-to-Code and Code-to-Code) and two newly proposed problem-centric tasks (i.e., Problem-to-Duplicate and Simplified-to-Full), built from a combination of automatically crawled problem-solution data and manually curated annotations. Our contribution includes both high-quality training data and temporally separated test sets for reliable evaluation. In addition, we develop two task-specialized retrievers based on this dataset: CPRetriever-Code, trained with a novel Group-InfoNCE loss for problem-code alignment, and CPRetriever-Prob, fine-tuned for identifying problem-level similarity. Both models achieve strong results and are open-sourced for local use. Finally, we analyze LiveCodeBench and find that high-similarity problems inflate model pass rates and reduce differentiation, underscoring the need for similarity-aware evaluation in future benchmarks. Code and data are available at: https://github.com/coldchair/CPRet
中文: 本文提出CPRet竞赛编程检索基准套件,通过提供高质量数据和专用检索模型,解决重复编程问题影响竞赛公平性与模型评估有效性的问题。
English: This paper introduces CPRet, a benchmark suite for competitive programming retrieval tasks, addressing the issue of duplicate problems that undermine competition fairness and model evaluation validity by providing high-quality data and specialized models.
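
The Group-InfoNCE loss is specific to the paper and is not reproduced here. As background, the sketch below shows the standard InfoNCE objective that such contrastive retrievers build on: each query is pulled toward its paired key and pushed away from every other key in the batch. The vectors and the temperature are invented for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy where keys[i] is the positive for queries[i]; other keys are negatives."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(q)

# Hypothetical 2D embeddings: three problem statements and three solution codes.
problems = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = info_nce(problems, [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]])
shuffled = info_nce(problems, [[0.1, 0.9], [-0.9, 0.1], [0.9, 0.1]])
```

The loss is small when each problem embedding sits closest to its own solution and large when the pairing is scrambled, which is the signal a retriever is trained on.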

Authors:Shengsheng Lin, Haojun Chen, Haijie Wu, Chunyun Qiu, Weiwei Lin
Title: Temporal Query Network for Efficient Multivariate Time Series Forecasting
Abstract:
Sufficiently modeling the correlations among variables (aka channels) is crucial for achieving accurate multivariate time series forecasting (MTSF). In this paper, we propose a novel technique called Temporal Query (TQ) to more effectively capture multivariate correlations, thereby improving model performance in MTSF tasks. Technically, the TQ technique employs periodically shifted learnable vectors as queries in the attention mechanism to capture global inter-variable patterns, while the keys and values are derived from the raw input data to encode local, sample-level correlations. Building upon the TQ technique, we develop a simple yet efficient model named Temporal Query Network (TQNet), which employs only a single-layer attention mechanism and a lightweight multi-layer perceptron (MLP). Extensive experiments demonstrate that TQNet learns more robust multivariate correlations, achieving state-of-the-art forecasting accuracy across 12 challenging real-world datasets. Furthermore, TQNet achieves high efficiency comparable to linear-based methods even on high-dimensional datasets, balancing performance and computational cost. The code is available at: https://github.com/ACAT-SCUT/TQNet.
中文: 本文提出了一种名为时序查询(TQ)的新技术,通过可学习向量的周期性偏移来捕捉全局多变量相关性,并基于此构建了TQNet模型,在多个数据集上实现了顶尖的预测精度和高效性能。
English: The paper introduces Temporal Query (TQ), a technique that uses shifted learnable vectors in attention mechanisms to capture global multivariate correlations, leading to the development of TQNet, which achieves state-of-the-art forecasting accuracy and high efficiency across multiple datasets.
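
One way to read "periodically shifted learnable vectors as queries" is sketched below: each channel owns a learnable vector of period W, a window of it is selected according to where the input sample starts within the period, and that window serves as the query in attention across channels, with keys and values taken from the raw input. The indexing scheme, the absence of projection layers, and all sizes are assumptions rather than details confirmed by the abstract.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def select_queries(theta, t_start, length):
    """Slice a length-`length` window from each channel's periodic vector, wrapping at period W."""
    period = len(theta[0])
    return [[row[(t_start + i) % period] for i in range(length)] for row in theta]

def channel_attention(queries, x):
    """Single-head attention across channels: Q from learnable vectors, K = V = raw input."""
    scale = math.sqrt(len(x[0]))
    weights = [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in x])
        for q_row in queries
    ]
    out = [
        [sum(w * x[c][i] for c, w in enumerate(w_row)) for i in range(len(x[0]))]
        for w_row in weights
    ]
    return weights, out

# Hypothetical sizes: 3 channels, period W = 4, look-back window L = 6.
theta = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.0, 0.1], [0.5, 0.5, -0.5, -0.5]]
x = [[1.0, 2.0, 3.0, 2.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0, 0.0, 1.0], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]]
q_a = select_queries(theta, t_start=1, length=6)
q_b = select_queries(theta, t_start=5, length=6)  # one full period later
weights, mixed = channel_attention(q_a, x)
```

Because the queries depend only on the position within the period, samples one period apart share the same queries, which is how such vectors can accumulate global inter-variable patterns while the keys and values stay sample-specific.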

Authors:Kazuki Adachi, Shin'ya Yamaguchi, Tomoki Hamagami
Title: Uniformity First: Uniformity-aware Test-time Adaptation of Vision-language Models against Image Corruption
Abstract:
Pre-trained vision-language models such as contrastive language-image pre-training (CLIP) have demonstrated remarkable generalizability, which has enabled a wide range of applications such as zero-shot classification. However, vision-language models still suffer when they face datasets with large gaps from the training data, i.e., distribution shifts. We found that CLIP is especially vulnerable to sensor degradation, a type of realistic distribution shift caused by sensor conditions such as weather, light, or noise. Collecting a new dataset from the test distribution for fine-tuning is costly, since sensor degradation occurs unexpectedly and takes many forms. Thus, we investigate test-time adaptation (TTA) of zero-shot classification, which enables on-the-fly adaptation to the test distribution with unlabeled test data. Existing TTA methods for CLIP mainly focus on modifying image and text embeddings or predictions to address distribution shifts. Although these methods can adapt to domain shifts, such as fine-grained label spaces or different renditions in input images, they fail to adapt to distribution shifts caused by sensor degradation. We found that this is because image embeddings are "corrupted" in terms of uniformity, a measure related to the amount of information. To make models robust to sensor degradation, we propose a novel method called uniformity-aware information-balanced TTA (UnInfo). To address the corruption of image embeddings, we introduce uniformity-aware confidence maximization, information-aware loss balancing, and knowledge distillation from the exponential moving average (EMA) teacher. Through experiments, we demonstrate that our UnInfo improves accuracy under sensor degradation by retaining information in terms of uniformity.
中文: 预训练的视觉语言模型如CLIP在传感器退化这类分布偏移中表现不佳,而提出的UnInfo方法通过均匀性感知的置信度最大化、信息平衡和知识蒸馏,有效提升了模型对传感器退化的鲁棒性。
English: Pre-trained vision-language models like CLIP struggle with sensor degradation, a type of distribution shift, but the proposed UnInfo method enhances robustness by preserving uniformity in image embeddings through confidence maximization and information balancing.
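
A commonly used definition of uniformity for normalized embeddings is the log of the mean Gaussian potential over all pairs; whether UnInfo uses exactly this form is an assumption here. The sketch computes it for a well-spread and a collapsed set of toy embeddings and also shows the generic EMA update used for teacher models. The parameter `t=2.0` and the decay value are illustrative choices.

```python
import math
from itertools import combinations

def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all pairs; lower means more evenly spread."""
    pts = [normalize(v) for v in embeddings]
    potentials = [
        math.exp(-t * sum((a - b) ** 2 for a, b in zip(u, v)))
        for u, v in combinations(pts, 2)
    ]
    return math.log(sum(potentials) / len(potentials))

def ema_update(teacher, student, decay=0.99):
    """Exponential moving average of student parameters into the teacher."""
    return [decay * tw + (1.0 - decay) * sw for tw, sw in zip(teacher, student)]

# Hypothetical 2D embeddings: evenly spread versus bunched together.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[1.0, 0.0], [0.99, 0.1], [0.98, -0.1], [1.0, 0.05]]
u_spread, u_collapsed = uniformity(spread), uniformity(collapsed)
```

Embeddings that bunch together score close to zero, while evenly spread ones score strongly negative, so a rise in this value under corruption would indicate the loss of information that the abstract attributes to sensor degradation.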

Authors:Simone Alberto Peirone, Francesca Pistilli, Giuseppe Averta
Title: HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos
Abstract:
Human activities are particularly complex and variable, and this makes it challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO infers contextual, semantic and temporal reasoning with a hierarchical architecture. We demonstrate the potential of our enriched features on multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in a zero-shot setting for procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance in all the benchmarks, and for procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in the zero-shot setting. Our results show the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision.
中文: HiERO是一种弱监督方法,通过从未经编排的视频中提取人类活动的层次结构来增强推理能力,无需大量训练即在多项基准测试中达到领先水平。
English: HiERO is a weakly-supervised method that leverages hierarchical patterns in human activities from unscripted videos to enhance reasoning, achieving state-of-the-art performance across multiple benchmarks without extensive training.

Authors:Xiao Wang, Yu Jin, Lan Chen, Bo Jiang, Lin Zhu, Yonghong Tian, Jin Tang, Bin Luo
Title: Dynamic Graph Induced Contour-aware Heat Conduction Network for Event-based Object Detection
Abstract:
Event-based Vision Sensors (EVS) have demonstrated significant advantages over traditional RGB frame-based cameras in low-light conditions, high-speed motion capture, and low latency. Consequently, object detection based on EVS has attracted increasing attention from researchers. Current event stream object detection algorithms are typically built upon Convolutional Neural Networks (CNNs) or Transformers, which either capture limited local features using convolutional filters or incur high computational costs due to the utilization of self-attention. Recently proposed vision heat conduction backbone networks have shown a good balance between efficiency and accuracy; however, these models are not specifically designed for event stream data. They exhibit weak capability in modeling object contour information and fail to exploit the benefits of multi-scale features. To address these issues, this paper proposes a novel dynamic graph induced contour-aware heat conduction network for event stream based object detection, termed CvHeat-DET. The proposed model effectively leverages the clear contour information inherent in event streams to predict the thermal diffusivity coefficients within the heat conduction model, and integrates hierarchical structural graph features to enhance feature learning across multiple scales. Extensive experiments on three benchmark datasets for event stream-based object detection fully validated the effectiveness of the proposed model. The source code of this paper will be released on https://github.com/Event-AHU/OpenEvDET.
中文: 本文提出CvHeat-DET动态图轮廓感知热传导网络,通过有效利用事件流固有轮廓信息并整合多尺度层次特征,显著提升了事件流目标检测性能,在三个基准数据集上得到充分验证。
English: This paper introduces CvHeat-DET, a novel dynamic graph-based contour-aware heat conduction network that enhances event stream object detection by leveraging inherent contour information and multi-scale hierarchical features, achieving superior performance across three benchmark datasets.

Authors:Shiao Wang, Xiao Wang, Liye Jin, Bo Jiang, Lin Zhu, Lan Chen, Yonghong Tian, Bin Luo
Title: Towards Low-Latency Event Stream-based Visual Object Tracking: A Slow-Fast Approach
Abstract:
Existing tracking algorithms typically rely on low-frame-rate RGB cameras coupled with computationally intensive deep neural network architectures to achieve effective tracking. However, such frame-based methods inherently face challenges in achieving low-latency performance and often fail in resource-constrained environments. Visual object tracking using bio-inspired event cameras has emerged as a promising research direction in recent years, offering distinct advantages for low-latency applications. In this paper, we propose a novel Slow-Fast Tracking paradigm that flexibly adapts to different operational requirements, termed SFTrack. The proposed framework supports two complementary modes, i.e., a high-precision slow tracker for scenarios with sufficient computational resources, and an efficient fast tracker tailored for latency-aware, resource-constrained environments. Specifically, our framework first performs graph-based representation learning from high-temporal-resolution event streams, and then integrates the learned graph-structured information into two FlashAttention-based vision backbones, yielding the slow and fast trackers, respectively. The fast tracker achieves low latency through a lightweight network design and by producing multiple bounding box outputs in a single forward pass. Finally, we seamlessly combine both trackers via supervised fine-tuning and further enhance the fast tracker's performance through a knowledge distillation strategy. Extensive experiments on public benchmarks, including FE240, COESOT, and EventVOT, demonstrate the effectiveness and efficiency of our proposed method across different real-world scenarios. The source code has been released on https://github.com/Event-AHU/SlowFast_Event_Track.
中文: 现有基于低帧率RGB相机的跟踪方法存在延迟高和资源受限问题,而提出的SFTrack采用仿生事件相机,通过图学习和FlashAttention骨干网络实现高精度慢速与高效快速的双模式跟踪,在FE240和COESOT等基准测试中验证了其有效性。
English: Existing tracking methods using low-frame-rate RGB cameras face latency and resource constraints, but the proposed SFTrack with bio-inspired event cameras introduces a dual-mode slow-fast paradigm that combines high-precision and efficient tracking through graph learning and FlashAttention backbones, validated on benchmarks like FE240 and COESOT.

Authors:Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang
Title: TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios
Abstract:
Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning models and non-reasoning models. We also conduct an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME , the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME , and the project page link is https://sylvain-wei.github.io/TIME/ .
中文: TIME基准通过包含38,522个问答对的三个现实世界数据集,解决了大语言模型在密集时序信息、快速变化事件和复杂时序依赖等时序推理方面的不足,实验分析揭示了不同场景下的性能规律与规模扩展的影响。
English: The TIME benchmark addresses gaps in temporal reasoning for LLMs by providing 38,522 QA pairs across three real-world datasets to evaluate challenges like dense temporal data and complex event dynamics, with experimental analysis revealing performance trends and scaling effects.

Authors:Junzhi Ning, Cheng Tang, Kaijing Zhou, Diping Song, Lihao Liu, Ming Hu, Wei Li, Huihui Xu, Yanzhou Su, Tianbin Li, Jiyao Liu, Jin Ye, Sheng Zhang, Yuanfeng Ji, Junjun He
Title: RetinaLogos: Fine-Grained Synthesis of High-Resolution Retinal Images Through Captions
Abstract:
The scarcity of high-quality, labelled retinal imaging data presents a significant challenge in the development of machine learning models for ophthalmology and hinders progress in the field. Existing methods for synthesising Colour Fundus Photographs (CFPs) largely rely on predefined disease labels, which restricts their ability to generate images that reflect fine-grained anatomical variations, subtle disease stages, and diverse pathological features beyond coarse class categories. To overcome these challenges, we first introduce an innovative pipeline that creates a large-scale, captioned retinal dataset comprising 1.4 million entries, called RetinaLogos-1400k. Specifically, RetinaLogos-1400k uses a visual language model (VLM) to describe retinal conditions and key structures, such as optic disc configuration, vascular distribution, nerve fibre layers, and pathological features. Building on this dataset, we employ a novel three-step training framework, RetinaLogos, which enables fine-grained semantic control over retinal images and accurately captures different stages of disease progression, subtle anatomical variations, and specific lesion types. Through extensive experiments, our method demonstrates superior performance across multiple datasets, with 62.07% of text-driven synthetic CFPs indistinguishable from real ones by ophthalmologists. Moreover, the synthetic data improves accuracy by 5%-10% in diabetic retinopathy grading and glaucoma detection. Code is available at https://github.com/uni-medical/retina-text2cfp.
中文: 本研究提出了RetinaLogos-1400k大规模标注视网膜数据集及创新训练框架,实现了对合成视网膜图像的精细语义控制,生成图像具有高度真实性,并显著提升了糖尿病视网膜病变和青光眼的诊断准确率。
English: This study introduces RetinaLogos-1400k, a large-scale captioned retinal dataset, and a novel training framework that enables fine-grained control over synthetic retinal images, achieving high realism and improving diagnostic accuracy for diabetic retinopathy and glaucoma.

Authors:Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus
Title: LEXam: Benchmarking Legal Reasoning on 340 Law Exams
Abstract:
Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation on both open-ended and multiple-choice questions presents significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Deploying an ensemble LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately, closely aligning with human expert assessments. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. We have open-sourced our code on GitHub (https://github.com/LEXam-Benchmark/LEXam) and released our data on Hugging Face (https://huggingface.co/datasets/LEXam-Benchmark/LEXam). Project page: https://lexam-benchmark.github.io/
中文: LEXam基准基于340门法律考试的4886道题目,有效评估大语言模型在长篇法律推理中的表现,并通过可扩展的评估方法区分不同模型的能力差异。
English: The LEXam benchmark, derived from 340 law exams with 4,886 questions, challenges LLMs in long-form legal reasoning and effectively differentiates model capabilities through a scalable evaluation method.

Authors:Ben Liu, Zhen Qin
Title: Accelerate TarFlow Sampling with GS-Jacobi Iteration
Abstract:
Image generation models have achieved widespread applications. For instance, the TarFlow model combines the transformer architecture with Normalizing Flow models, achieving state-of-the-art results on multiple benchmarks. However, because the causal form of attention requires sequential computation, TarFlow's sampling process is extremely slow. In this paper, we demonstrate that through a series of optimization strategies, TarFlow sampling can be greatly accelerated by using the Gauss-Seidel-Jacobi (abbreviated as GS-Jacobi) iteration method. Specifically, we find that blocks in the TarFlow model have varying importance: a small number of blocks play a major role in image generation tasks, while other blocks contribute relatively little; some blocks are sensitive to initial values and prone to numerical overflow, while others are relatively robust. Based on these two characteristics, we propose the Convergence Ranking Metric (CRM) and the Initial Guessing Metric (IGM): CRM is used to identify whether a TarFlow block is "simple" (converges in few iterations) or "tough" (requires more iterations); IGM is used to evaluate whether the initial value of the iteration is good. Experiments on four TarFlow models demonstrate that GS-Jacobi sampling can significantly enhance sampling efficiency while maintaining the quality of generated images (measured by FID), achieving speed-ups of 4.53x in Img128cond, 5.32x in AFHQ, 2.96x in Img64uncond, and 2.51x in Img64cond without degrading FID scores or sample quality. Code and checkpoints are accessible at https://github.com/encoreus/GS-Jacobi_for_TarFlow
中文: GS-Jacobi迭代法通过基于收敛性和初始值敏感度对模块进行优先级排序,显著加速了TarFlow的缓慢采样过程,在保持图像质量的同时实现了最高5.32倍的加速效果。
English: The GS-Jacobi iteration method accelerates TarFlow's slow sampling by prioritizing blocks based on their convergence and initial value sensitivity, achieving up to 5.32x speedup without quality loss.
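
To make the idea of replacing sequential causal sampling with an iterative solve concrete, here is a minimal sketch of plain Jacobi iteration on a toy causal recurrence. It is not the TarFlow model or the paper's GS-Jacobi algorithm: the recurrence `x[t] = z[t] + a * x[t-1]`, the coefficient `a`, and the function names are assumptions chosen for illustration. The point it shows is that every position can be updated in parallel from the previous iterate, and that a causal (triangular) system of length T is solved exactly after at most T sweeps, with easy cases stopping earlier.

```python
def sequential_solve(z, a=0.5):
    """Reference solution: compute x[t] = z[t] + a * x[t-1] one step at a time."""
    x, prev = [], 0.0
    for z_t in z:
        prev = z_t + a * prev
        x.append(prev)
    return x


def jacobi_solve(z, a=0.5, tol=1e-9, max_iters=None):
    """Solve the same recurrence by Jacobi fixed-point iteration.

    Each sweep updates all positions at once using only the previous
    iterate, so a sweep is parallelizable across positions.
    Returns the solution and the number of sweeps performed.
    """
    n = len(z)
    max_iters = n if max_iters is None else max_iters
    x = [0.0] * n  # initial guess (hypothetical choice; better guesses converge faster)
    for k in range(1, max_iters + 1):
        new_x = [z[t] + a * (x[t - 1] if t > 0 else 0.0) for t in range(n)]
        delta = max(abs(p - q) for p, q in zip(new_x, x))
        x = new_x
        if delta < tol:
            return x, k
    return x, max_iters


z = [1.0, 2.0, 3.0, 4.0]
x_seq = sequential_solve(z)
x_jac, sweeps = jacobi_solve(z)
print(x_seq, x_jac, sweeps)
```

In this toy setting the Jacobi result matches the sequential one. A Gauss-Seidel-Jacobi variant would, roughly, split the positions into segments, run Jacobi sweeps inside each segment, and pass finished segments forward in order; how the paper chooses segments and stopping rules is not specified in the abstract.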

Authors:Zheng Wu, Pengzhou Cheng, Zongru Wu, Lingzhong Dong, Zhuosheng Zhang
Title: GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents
Abstract:
Graphical user interface (GUI) agents have recently emerged as an intriguing paradigm for human-computer interaction, capable of automatically executing user instructions to operate intelligent terminal devices. However, when encountering out-of-distribution (OOD) instructions that violate environmental constraints or exceed the current capabilities of agents, GUI agents may suffer task breakdowns or even pose security threats. Therefore, effective OOD detection for GUI agents is essential. Traditional OOD detection methods perform suboptimally in this domain due to the complex embedding space and evolving GUI environments. In this work, we observe that the in-distribution input semantic space of GUI agents exhibits a clustering pattern with respect to the distance from the centroid. Based on the finding, we propose GEM, a novel method based on fitting a Gaussian mixture model over input embedding distances extracted from the GUI agent that reflect its capability boundary. Evaluated on eight datasets spanning smartphones, computers, and web browsers, our method achieves an average accuracy improvement of 23.70% over the best-performing baseline while only increasing training time by 4.9% and testing time by 6.5%. We also experimentally demonstrate that GEM can improve the step-wise success rate by 9.40% by requesting assistance from the cloud model when encountering OOD samples. Analysis verifies the generalization ability of our method through experiments on nine different backbones. The code is available at https://github.com/Wuzheng02/GEM-OODforGUIagents.
中文: 图形用户界面(GUI)代理在处理超出分布(OOD)指令时易出现任务失败或安全威胁,而提出的GEM方法通过高斯混合模型有效提升检测精度和步骤成功率,适用于多种设备环境。
English: GUI agents face challenges with out-of-distribution (OOD) instructions, leading to task failures or security risks, but the proposed GEM method significantly improves OOD detection accuracy and step-wise success rates across various devices.

Authors:Zifeng Cheng, Zhonghui Wang, Yuchen Fu, Zhiwei Jiang, Yafeng Yin, Cong Wang, Qing Gu
Title: Contrastive Prompting Enhances Sentence Embeddings in LLMs through Inference-Time Steering
Abstract:
Extracting sentence embeddings from large language models (LLMs) is a practical direction, as it requires neither additional data nor fine-tuning. Previous studies usually focus on prompt engineering to guide LLMs to encode the core semantic information of the sentence into the embedding of the last token. However, the last token in these methods still encodes an excess of non-essential information, such as stop words, limiting its encoding capacity. To this end, we propose a Contrastive Prompting (CP) method that introduces an extra auxiliary prompt to elicit better sentence embedding. By contrasting with the auxiliary prompt, CP can steer existing prompts to encode the core semantics of the sentence, rather than non-essential information. CP is a plug-and-play inference-time intervention method that can be combined with various prompt-based methods. Extensive experiments on Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our method can improve the performance of existing prompt-based methods across different LLMs. Our code will be released at https://github.com/zifengcheng/CP.
中文: 提出的对比提示方法通过引入辅助提示,使大语言模型在生成句子嵌入时聚焦核心语义,无需微调即可提升多项任务性能。
English: The proposed Contrastive Prompting (CP) method enhances sentence embeddings from LLMs by using an auxiliary prompt to focus on core semantics, improving performance across various tasks without fine-tuning.

Authors:Hulin Li
Title: Rethinking Features-Fused-Pyramid-Neck for Object Detection
Abstract:
Multi-head detectors typically employ a features-fused-pyramid-neck for multi-scale detection and are widely adopted in the industry. However, this approach faces feature misalignment when representations from different hierarchical levels of the feature pyramid are forcibly fused point-to-point. To address this issue, we designed an independent hierarchy pyramid (IHP) architecture to evaluate the effectiveness of the features-unfused-pyramid-neck for multi-head detectors. Subsequently, we introduced soft nearest neighbor interpolation (SNI) with a weight downscaling factor to mitigate the impact of feature fusion at different hierarchies while preserving key textures. Furthermore, we present a features adaptive selection method for down sampling in extended spatial windows (ESD) to retain spatial features and enhance lightweight convolutional techniques (GSConvE). These advancements culminate in our secondary features alignment solution (SA) for real-time detection, achieving state-of-the-art results on Pascal VOC and MS COCO. Code will be released at https://github.com/AlanLi1997/rethinking-fpn. This paper has been accepted by ECCV2024 and published on Springer Nature.
Chinese Summary: 本研究提出了一种二次特征对齐解决方案(SA),采用独立层次金字塔(IHP)和软最近邻插值(SNI)技术,有效解决了多头部检测器中特征错位问题,在Pascal VOC和MS COCO数据集上实现了最先进的实时检测性能。
English Summary: The study introduces a secondary features alignment solution (SA) that employs an independent hierarchy pyramid (IHP) and soft nearest neighbor interpolation (SNI) to address feature misalignment in multi-head detectors, achieving state-of-the-art real-time detection performance on Pascal VOC and MS COCO datasets.

Authors:Yanbin Yin, Kun Zhou, Zhen Wang, Xiangdong Zhang, Yifei Shao, Shibo Hao, Yi Gu, Jieyuan Liu, Somanshu Singla, Tianyang Liu, Eric P. Xing, Zhengzhong Liu, Haojian Jin, Zhiting Hu
Title: Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models
Abstract:
The recent explosion of large language models (LLMs), each with its own general or specialized strengths, makes scalable, reliable benchmarking more urgent than ever. Standard practices nowadays face fundamental trade-offs: closed-ended question-based benchmarks (e.g., MMLU) struggle with saturation as newer models emerge, while crowd-sourced leaderboards (e.g., Chatbot Arena) rely on costly and slow human judges. Recently, automated methods (e.g., LLM-as-a-judge) have improved scalability, but risk bias by relying on one or a few "authority" models. To tackle these issues, we propose Decentralized Arena (dearena), a fully automated framework leveraging collective intelligence from all LLMs to evaluate each other. It mitigates single-model judge bias by democratic, pairwise evaluation, and remains efficient at scale through two key components: (1) a coarse-to-fine ranking algorithm for fast incremental insertion of new models with sub-quadratic complexity, and (2) an automatic question selection strategy for the construction of new evaluation dimensions. In extensive experiments across 66 LLMs, dearena attains up to 97% correlation with human judgements, while significantly reducing the cost. Our code and data will be publicly released on https://github.com/maitrix-org/de-arena.
Chinese Summary: 去中心化竞技场(dearena)是一个完全自动化的框架,利用所有大语言模型的集体智慧进行相互评估,通过高效的排名算法和问题选择策略,在显著降低成本的同时,实现了与人类判断高达97%的相关性。
English Summary: The Decentralized Arena (dearena) is a fully automated framework that leverages collective intelligence from all large language models (LLMs) to evaluate each other democratically, achieving up to 97% correlation with human judgments while significantly reducing costs through efficient ranking algorithms and question selection strategies.
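
The abstract mentions incremental insertion of new models with sub-quadratic complexity but does not spell out the algorithm. One simple way to get that property, shown here purely as an assumed sketch, is binary-search insertion into an existing ranking: each new model needs about log2(n) pairwise comparisons instead of n. The `beats` callback stands in for an aggregated verdict from many judge models, and the `skill` table and model names are hypothetical placeholders.

```python
def insert_model(ranking, new_model, beats):
    """Insert new_model into ranking (best first) by binary search.

    beats(x, y) should return True when x is judged better than y.
    Returns the number of pairwise comparisons used.
    """
    lo, hi = 0, len(ranking)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if beats(new_model, ranking[mid]):
            hi = mid  # new model belongs above position mid
        else:
            lo = mid + 1  # new model belongs below position mid
    ranking.insert(lo, new_model)
    return comparisons


# Hypothetical latent quality scores used only to simulate pairwise verdicts.
skill = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5, "model_d": 0.3}


def beats(x, y):
    return skill[x] > skill[y]


ranking = []
total_comparisons = 0
for name in ["model_c", "model_a", "model_d", "model_b"]:
    total_comparisons += insert_model(ranking, name, beats)

print(ranking, total_comparisons)
```

Real pairwise verdicts are noisy, so a coarse binary-search placement like this would plausibly be followed by a finer pass that re-compares the new model against its near neighbors; that refinement step is omitted here.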

Authors:Yiling Tao, Shuyi Wang, Jiaxi Yang, Guido Zuccon
Title: Unlearning for Federated Online Learning to Rank: A Reproducibility Study
Abstract:
This paper reports on findings from a comparative study on the effectiveness and efficiency of federated unlearning strategies within Federated Online Learning to Rank (FOLTR), with specific attention to systematically analysing the unlearning capabilities of methods in a verifiable manner. Federated approaches to ranking of search results have recently garnered attention to address users' privacy concerns. In FOLTR, privacy is safeguarded by collaboratively training ranking models across decentralized data sources, preserving individual user data while optimizing search results based on implicit feedback, such as clicks. Recent legislation introduced across numerous countries is establishing the so-called "right to be forgotten", according to which services based on machine learning models like those in FOLTR should provide capabilities that allow users to remove their own data from those used to train models. This has sparked the development of unlearning methods, along with evaluation practices to measure whether unlearning of a user's data successfully occurred. Current evaluation practices are however often controversial, necessitating the use of multiple metrics for a more comprehensive assessment -- but previous proposals of unlearning methods only used single evaluation metrics. This paper addresses this limitation: our study rigorously assesses the effectiveness of unlearning strategies in managing both under-unlearning and over-unlearning scenarios using adapted and newly proposed evaluation metrics. Thanks to our detailed analysis, we uncover the strengths and limitations of five unlearning strategies, offering valuable insights into optimizing federated unlearning to balance data privacy and system performance within FOLTR. We publicly release our code and complete results at https://github.com/Iris1026/Unlearning-for-FOLTR.git.
中文: 本研究通过多种评估指标对联邦在线排序学习中的联邦遗忘策略进行系统评估,揭示了五种方法在平衡数据隐私与系统性能方面的优缺点,弥补了现有评估方法的不足。
English: This study evaluates federated unlearning strategies in Federated Online Learning to Rank using multiple metrics to address limitations in existing evaluation methods, revealing the strengths and weaknesses of five approaches for balancing data privacy and system performance.

Authors:Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, Jun Yu
Title: A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone
Abstract:
Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens--while using only 20B tokens, achieving over 1,000x training efficiency. Our codes and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.
Chinese: Low-Rank Clone (LRC)方法通过低秩投影实现软剪枝和激活克隆,高效训练小型语言模型,仅用200亿标记就达到顶尖性能,训练效率提升千倍以上。
English: The Low-Rank Clone (LRC) method efficiently trains Small Language Models by using low-rank projections to achieve soft pruning and activation cloning from teacher models, achieving state-of-the-art performance with only 20 billion tokens and over 1,000x training efficiency.

Authors:Haibin He, Maoyuan Ye, Jing Zhang, Xiantao Cai, Juhua Liu, Bo Du, Dacheng Tao
Title: Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?
Abstract:
Large Multimodal Models (LMMs) have become increasingly versatile, accompanied by impressive Optical Character Recognition (OCR) related capabilities. Existing OCR-related benchmarks emphasize evaluating LMMs' abilities in relatively simple visual question answering, visual-text parsing, etc. However, the extent to which LMMs can deal with complex logical reasoning problems based on OCR cues is relatively unexplored. To this end, we introduce the Reasoning-OCR benchmark, which challenges LMMs to solve complex reasoning problems based on the cues that can be extracted from rich visual-text. Reasoning-OCR covers six visual scenarios and encompasses 150 meticulously designed questions categorized into six reasoning challenges. Additionally, Reasoning-OCR minimizes the impact of field-specialized knowledge. Our evaluation offers some insights for proprietary and open-source LMMs in different reasoning challenges, underscoring the urgent need to improve reasoning performance. We hope Reasoning-OCR can inspire and facilitate future research on enhancing complex reasoning ability based on OCR cues. Reasoning-OCR is publicly available at https://github.com/Hxyz-123/ReasoningOCR.
Chinese: Reasoning-OCR基准测试旨在评估大型多模态模型利用OCR提取的视觉文本线索处理复杂逻辑推理的能力,揭示了当前模型的不足,以期推动该领域未来的发展。
English: The Reasoning-OCR benchmark is introduced to evaluate Large Multimodal Models' ability to handle complex logical reasoning using OCR-extracted visual-text cues, revealing their current limitations and aiming to inspire future advancements in this area.

Authors:Hangyu Li, Qin Zhao, Haoran Xu, Xinyu Jiang, Qingwei Ben, Feiyu Jia, Haoyu Zhao, Liang Xu, Jia Zeng, Hanqing Wang, Bo Dai, Junting Dong, Jiangmiao Pang
Title: TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation
Abstract:
Teleoperation is a cornerstone of embodied-robot learning, and bimanual dexterous teleoperation in particular provides rich demonstrations that are difficult to obtain with fully autonomous systems. While recent studies have proposed diverse hardware pipelines, ranging from inertial motion-capture gloves to exoskeletons and vision-based interfaces, there is still no unified benchmark that enables fair, reproducible comparison of these systems. In this paper, we introduce TeleOpBench, a simulator-centric benchmark tailored to bimanual dexterous teleoperation. TeleOpBench contains 30 high-fidelity task environments that span pick-and-place, tool use, and collaborative manipulation, covering a broad spectrum of kinematic and force-interaction difficulty. Within this benchmark we implement four representative teleoperation modalities, namely (i) MoCap, (ii) VR device, (iii) arm-hand exoskeletons, and (iv) monocular vision tracking, and evaluate them with a common protocol and metric suite. To validate that performance in simulation is predictive of real-world behavior, we conduct mirrored experiments on a physical dual-arm platform equipped with two 6-DoF dexterous hands. Across 10 held-out tasks we observe a strong correlation between simulator and hardware performance, confirming the external validity of TeleOpBench. TeleOpBench establishes a common yardstick for teleoperation research and provides an extensible platform for future algorithmic and hardware innovation. Code is now available at https://github.com/cyjdlhy/TeleOpBench .
中文: TeleOpBench被提出作为一个统一的模拟器基准,用于评估四种双手灵巧遥操作方式在30项任务中的表现,展示了仿真与现实之间的强相关性,并为未来研究提供了标准化平台。
English: TeleOpBench is introduced as a unified simulator benchmark for evaluating four bimanual dexterous teleoperation modalities across 30 tasks, demonstrating strong simulation-to-reality correlation and providing a standardized platform for future research.

Authors:Zihua Wang, Ruibo Li, Haozhe Du, Joey Tianyi Zhou, Yu Zhang, Xu Yang
Title: FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks
Abstract:
Large language and multimodal models (LLMs and LMMs) exhibit strong inference capabilities but are often limited by slow decoding speeds. This challenge is especially acute in LMMs, where visual inputs typically comprise more tokens with lower information density than text -- an issue exacerbated by recent trends toward finer-grained visual tokenizations to boost performance. Speculative decoding has been effective in accelerating LLM inference by using a smaller draft model to generate candidate tokens, which are then selectively verified by the target model, improving speed without sacrificing output quality. While this strategy has been extended to LMMs, existing methods largely overlook the unique properties of visual inputs and depend solely on text-based draft models. In this work, we propose FLASH (Fast Latent-Aware Semi-Autoregressive Heuristics), a speculative decoding framework designed specifically for LMMs, which leverages two key properties of multimodal data to design the draft model. First, to address redundancy in visual tokens, we propose a lightweight latent-aware token compression mechanism. Second, recognizing that visual objects often co-occur within a scene, we employ a semi-autoregressive decoding strategy to generate multiple tokens per forward pass. These innovations accelerate draft decoding while maintaining high acceptance rates, resulting in faster overall inference. Experiments show that FLASH significantly outperforms prior speculative decoding approaches in both unimodal and multimodal settings, achieving up to 2.68× speed-up on video captioning and 2.55× on visual instruction tuning tasks compared to the original LMM. Our code is available at https://github.com/ZihuaEvan/FlashSD/.
中文: FLASH框架通过潜在感知令牌压缩机制和半自回归解码策略,有效解决视觉令牌冗余和共现模式问题,在保持输出质量的同时实现最高2.68倍的推理加速。
English: The FLASH framework accelerates multimodal model inference by introducing a latent-aware token compression mechanism and semi-autoregressive decoding to address visual token redundancy and co-occurrence patterns, achieving up to 2.68× speed-up while maintaining output quality.
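
The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch of generic greedy speculative decoding. This is not the FLASH method itself: it omits the latent-aware token compression and the semi-autoregressive drafting, and the names `speculative_step`, `draft_next`, and `target_next` are hypothetical. A real system would verify all drafted tokens in one parallel forward pass of the target model; the sketch calls the target token by token only for readability.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int], draft_next: NextToken, target_next: NextToken, k: int
) -> List[int]:
    """Return the tokens committed by one draft-then-verify round."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2) The target model checks each proposed token in order.
    accepted: List[int] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every draft token matched, so the target adds one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted
```

Because every committed token equals the target model's own greedy choice, the output matches plain greedy decoding with the target; the speed-up comes from committing several tokens per target pass when the draft is accurate.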

Authors:Han Meng, Yancan Chen, Yunan Li, Yitian Yang, Jungup Lee, Renwen Zhang, Yi-Chieh Lee
Title: What is Stigma Attributed to? A Theory-Grounded, Expert-Annotated Interview Corpus for Demystifying Mental-Health Stigma
Abstract:
Mental-health stigma remains a pervasive social problem that hampers treatment-seeking and recovery. Existing resources for training neural models to finely classify such stigma are limited, relying primarily on social-media or synthetic data without theoretical underpinnings. To remedy this gap, we present an expert-annotated, theory-informed corpus of human-chatbot interviews, comprising 4,141 snippets from 684 participants with documented socio-cultural backgrounds. Our experiments benchmark state-of-the-art neural models and empirically unpack the challenges of stigma detection. This dataset can facilitate research on computationally detecting, neutralizing, and counteracting mental-health stigma. Our corpus is openly available at https://github.com/HanMeng2004/Mental-Health-Stigma-Interview-Corpus.
中文摘要:本研究提供了一个专家标注的人机对话访谈语料库,旨在解决心理健康污名检测中缺乏理论指导数据的问题,通过基准测试评估了先进神经模型并揭示了检测难点。
English Summary: This study introduces an expert-annotated corpus of human-chatbot interviews to address the lack of theory-informed data for training neural models in detecting mental-health stigma, benchmarking state-of-the-art models and highlighting detection challenges.

Authors:Taiqiang Wu, Runming Yang, Jiayi Li, Pengfei Hu, Ngai Wong, Yujiu Yang
Title: Shadow-FT: Tuning Instruct via Base
Abstract:
Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. However, we observe that directly tuning the INSTRUCT (i.e., instruction-tuned) models often leads to marginal improvements and even performance degeneration. Notably, paired BASE models, the foundation for these INSTRUCT variants, contain highly similar weight values (i.e., a difference of less than 2% on average for Llama 3.1 8B). Therefore, we propose a novel Shadow-FT framework to tune the INSTRUCT models by leveraging the corresponding BASE models. The key insight is to fine-tune the BASE model, and then directly graft the learned weight updates to the INSTRUCT model. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance. We conduct extensive experiments on tuning mainstream LLMs, such as Qwen 3 and Llama 3 series, and evaluate them across 19 benchmarks covering coding, reasoning, and mathematical tasks. Experimental results demonstrate that Shadow-FT consistently outperforms conventional full-parameter and parameter-efficient tuning approaches. Further analyses indicate that Shadow-FT can be applied to multimodal large language models (MLLMs) and combined with direct preference optimization (DPO). Code and weights are available at https://github.com/wutaiqiang/Shadow-FT.
中文摘要:Shadow-FT框架通过微调基础模型并直接移植权重更新至指令调优模型,无需额外参数即可在多项基准测试中显著提升性能。
English Summary: The Shadow-FT framework enhances instruction-tuned models by fine-tuning their base counterparts and transferring the learned weight updates, achieving superior performance without additional parameters across various benchmarks.
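
The grafting step named in the abstract (fine-tune the Base model, then add its learned weight update to the Instruct model) can be sketched as below. This is a toy illustration under stated assumptions, not the authors' implementation: weights are plain Python lists keyed by parameter name, the function name `graft_updates` is hypothetical, and both models are assumed to share identical parameter names and shapes.

```python
from typing import Dict, List

# Hypothetical toy representation: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def graft_updates(base: Weights, tuned_base: Weights, instruct: Weights) -> Weights:
    """Add the update learned on the base model to the instruct model."""
    grafted: Weights = {}
    for name, w_instruct in instruct.items():
        # Update learned by fine-tuning the base model.
        delta = [t - b for t, b in zip(tuned_base[name], base[name])]
        # Graft that update onto the instruct weights, element by element.
        grafted[name] = [w + d for w, d in zip(w_instruct, delta)]
    return grafted
```

No new parameters are introduced: the result has exactly the shape of the Instruct model, which is consistent with the abstract's claim that the method adds no additional parameters.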

Authors:Taiqiang Wu, Runming Yang, Jiayi Li, Pengfei Hu, Yik-Chung Wu, Ngai Wong, Yujiu Yang
Title: Shadow-FT: Tuning Instruct Model via Training on Paired Base Model
Abstract:
Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. However, we observe that directly tuning the Instruct (i.e., instruction-tuned) models often leads to marginal improvements and even performance degeneration. Notably, paired Base models, the foundation for these Instruct variants, contain highly similar weight values (i.e., a difference of less than 2% on average for Llama 3.1 8B). The Base model tends to be a good learner yet a weak backbone without post-training. Therefore, we propose a novel Shadow-FT framework to tune the Instruct models by leveraging the corresponding Base models. The key insight is to fine-tune the Base model, and then directly graft the learned weight updates to the Instruct model. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance. We conduct extensive experiments on tuning mainstream LLMs, such as Qwen 3 and Llama 3 series, and evaluate them across 19 benchmarks covering coding, reasoning, and mathematical tasks. Experimental results demonstrate that Shadow-FT consistently outperforms conventional full-parameter and parameter-efficient tuning approaches. Further analyses indicate that Shadow-FT can be applied to multimodal large language models (MLLMs) and combined with direct preference optimization (DPO). Code and weights are available at https://github.com/wutaiqiang/Shadow-FT.
中文摘要:Shadow-FT框架通过微调基础模型并直接移植权重更新至指令调优模型,无需额外参数即可在多项基准测试中显著提升性能。
English Summary: The Shadow-FT framework enhances instruction-tuned models by fine-tuning their base counterparts and transferring the learned weight updates, achieving superior performance without additional parameters across various benchmarks.

Authors:Yinzhe Wang, Yiwen Xiao, Hu Wang, Yiping Xu, Yan Tian
Title: IA-MVS: Instance-Focused Adaptive Depth Sampling for Multi-View Stereo
Abstract:
Multi-view stereo (MVS) models based on progressive depth hypothesis narrowing have made remarkable advancements. However, existing methods have not fully exploited the fact that the depth coverage of individual instances is smaller than that of the entire scene, which restricts further improvements in depth estimation precision. Moreover, inevitable deviations in the initial stage accumulate as the process advances. In this paper, we propose Instance-Adaptive MVS (IA-MVS). It enhances the precision of depth estimation by narrowing the depth hypothesis range and conducting refinement on each instance. Additionally, a filtering mechanism based on intra-instance depth continuity priors is incorporated to boost robustness. Furthermore, recognizing that existing confidence estimation can degrade IA-MVS performance on point clouds, we have developed a detailed mathematical model for confidence estimation based on conditional probability. The proposed method can be widely applied in models based on MVSNet without imposing extra training burdens. Our method achieves state-of-the-art performance on the DTU benchmark. The source code is available at https://github.com/KevinWang73106/IA-MVS.
中文: 提出的实例自适应多视角立体方法通过针对每个实例缩小深度假设范围并引入深度连续性先验,在不增加训练负担的情况下,在DTU基准测试中取得了最先进的深度估计精度。
English: The proposed Instance-Adaptive MVS method improves depth estimation precision by narrowing depth hypotheses per instance and incorporating depth continuity priors, achieving state-of-the-art results on the DTU benchmark without additional training burdens.

Authors:Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zhu, Linxi Fan
Title: DreamGen: Unlocking Generalization in Robot Learning through Video World Models
Abstract:
We introduce DreamGen, a simple yet highly effective 4-stage pipeline for training robot policies that generalize across behaviors and environments through neural trajectories - synthetic robot data generated from video world models. DreamGen leverages state-of-the-art image-to-video generative models, adapting them to the target robot embodiment to produce photorealistic synthetic videos of familiar or novel tasks in diverse environments. Since these models generate only videos, we recover pseudo-action sequences using either a latent action model or an inverse-dynamics model (IDM). Despite its simplicity, DreamGen unlocks strong behavior and environment generalization: a humanoid robot can perform 22 new behaviors in both seen and unseen environments, while requiring teleoperation data from only a single pick-and-place task in one environment. To evaluate the pipeline systematically, we introduce DreamGen Bench, a video generation benchmark that shows a strong correlation between benchmark performance and downstream policy success. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection. Code available at https://github.com/NVIDIA/GR00T-Dreams.
中文: DreamGen采用四阶段流程,通过视频世界模型生成的神经轨迹训练机器人策略,使人形机器人仅需单一任务数据即可泛化至22种新行为和多类环境。
English: DreamGen is a four-stage pipeline that uses neural trajectories from video world models to train robot policies, enabling a humanoid robot to generalize across 22 new behaviors and environments with minimal initial data.

Authors:Jiabin Chen, Haiping Wang, Jinpeng Li, Yuan Liu, Zhen Dong, Bisheng Yang
Title: SpatialLLM: From Multi-modality Data to Urban Spatial Intelligence
Abstract:
We propose SpatialLLM, a novel approach advancing spatial intelligence tasks in complex urban scenes. Unlike previous methods requiring geographic analysis tools or domain expertise, SpatialLLM is a unified language model directly addressing various spatial intelligence tasks without any training, fine-tuning, or expert intervention. The core of SpatialLLM lies in constructing detailed and structured scene descriptions from raw spatial data to prompt pre-trained LLMs for scene-based analysis. Extensive experiments show that, with our designs, pretrained LLMs can accurately perceive spatial distribution information and enable zero-shot execution of advanced spatial intelligence tasks, including urban planning, ecological analysis, traffic management, etc. We argue that multi-field knowledge, context length, and reasoning ability are key factors influencing LLM performances in urban analysis. We hope that SpatialLLM will provide a novel viable perspective for urban intelligent analysis and management. The code and dataset are available at https://github.com/WHU-USI3DV/SpatialLLM.
中文摘要:SpatialLLM提出了一种统一语言模型,无需训练或专家干预即可直接处理复杂城市场景中的空间智能任务,通过结构化场景描述驱动预训练大模型实现零样本城市分析。
English Summary: SpatialLLM introduces a unified language model that enables zero-shot spatial intelligence tasks in urban environments without training or expert input, leveraging structured scene descriptions to empower pre-trained LLMs for advanced urban analysis.

Authors:Chaofan Li, Jianlyu Chen, Yingxia Shao, Defu Lian, Zheng Liu
Title: Towards A Generalist Code Embedding Model Based On Massive Data Synthesis
Abstract:
Code embedding models attract increasing attention due to the widespread popularity of retrieval-augmented generation (RAG) in software development. These models are expected to capture the rich semantic relationships inherent to code, which differ significantly from those found in text. However, existing models remain severely limited due to the scarcity of high-quality training data. In this work, we introduce CodeR (Code Retrieval), a state-of-the-art embedding model for general-purpose code retrieval. The superior performance of CodeR is built upon CodeR-Pile, a large-scale synthetic dataset constructed under the DRU (Diversity, Reliability, Usability) principle via a novel data synthesis pipeline. To optimize training effectiveness, we propose Annealing, a curriculum learning strategy that enables effective knowledge transfer across heterogeneous sources of data. We evaluate CodeR based on 16 diverse code retrieval tasks, where it significantly outperforms existing baselines and exhibits strong out-of-domain generalization performance. We have publicly released our code and the well-trained model to facilitate further research in this critical area: https://github.com/FlagOpen/FlagEmbedding/tree/master/research/BGE_Coder.
Chinese: CodeR 是一种先进的代码嵌入模型,通过大规模合成数据集和课程学习策略,在代码检索任务中表现卓越,显著超越了现有基准模型。
English: CodeR is a state-of-the-art code embedding model that leverages a large-scale synthetic dataset and a curriculum learning strategy to achieve superior performance in code retrieval tasks, significantly outperforming existing baselines.

Authors:Kian Kai Ang, Guy Farrelly, Cheryl Pope, Damith C. Ranasinghe
Title: An Automated Blackbox Noncompliance Checker for QUIC Server Implementations
Abstract:
We develop QUICtester, an automated approach for uncovering non-compliant behaviors in implementations of the ratified QUIC protocol (RFC 9000/9001). QUICtester leverages active automata learning to abstract the behavior of a QUIC implementation into a finite state machine (FSM) representation. Unlike prior noncompliance checking methods, to help uncover state dependencies on event timing, QUICtester introduces the idea of state learning with event timing variations, adopting both valid and invalid input configurations, and combinations of security and transport layer parameters during learning. We use pairwise differential analysis of learned behaviour models of tested QUIC implementations to identify non-compliance instances as behaviour deviations in a property-agnostic way. This exploits the existence of the many different QUIC implementations, removing the need for validated, formal models. The diverse implementations act as cross-checking test oracles to discover non-compliance. We used QUICtester to analyze 186 learned models from 19 QUIC implementations under five security settings and discovered 55 implementation errors. Significantly, the tool uncovered a QUIC specification ambiguity resulting in an easily exploitable DoS vulnerability, and led to 5 CVE assignments from developers and two bug bounties thus far.
中文: QUICtester是一种自动化方法,通过主动自动机学习将QUIC实现抽象为有限状态机,并利用不同实现间的行为差异进行无属性依赖的差分分析,从而发现协议违规行为和漏洞,无需依赖已验证的正式模型。
English: QUICtester is an automated method that uses active automata learning to detect non-compliance in QUIC implementations by creating finite state machines and analyzing behavioral deviations across different implementations, uncovering errors and vulnerabilities without relying on formal models.
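
The pairwise differential analysis mentioned in the abstract can be pictured as a search over two learned state machines for the shortest input sequence on which their outputs disagree. The sketch below is an assumption-laden illustration, not QUICtester's code: it assumes each learned model is a complete Mealy machine stored as a dictionary from `(state, input)` to `(next_state, output)`, and the name `find_deviation` is hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, List, Optional, Tuple

# Hypothetical encoding: (state, input symbol) -> (next state, output symbol).
Mealy = Dict[Tuple[Hashable, str], Tuple[Hashable, str]]


def find_deviation(
    fsm_a: Mealy, fsm_b: Mealy, start_a: Hashable, start_b: Hashable,
    alphabet: List[str],
) -> Optional[Tuple[List[str], str, str]]:
    """Breadth-first search of the product machine for a differing output."""
    seen = {(start_a, start_b)}
    queue = deque([(start_a, start_b, [])])
    while queue:
        state_a, state_b, trace = queue.popleft()
        for sym in alphabet:
            next_a, out_a = fsm_a[(state_a, sym)]
            next_b, out_b = fsm_b[(state_b, sym)]
            if out_a != out_b:
                # Shortest input sequence that separates the two models.
                return trace + [sym], out_a, out_b
            if (next_a, next_b) not in seen:
                seen.add((next_a, next_b))
                queue.append((next_a, next_b, trace + [sym]))
    return None  # The two models agree on every input sequence.
```

A returned trace only shows that two implementations behave differently; deciding which of them, if either, violates the specification still requires inspecting the deviation against the RFC text.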

Authors:Mingyuan Zhou, Yi Gu, Zhendong Wang
Title: Few-Step Diffusion via Score identity Distillation
Abstract:
Diffusion distillation has emerged as a promising strategy for accelerating text-to-image (T2I) diffusion models by distilling a pretrained score network into a one- or few-step generator. While existing methods have made notable progress, they often rely on real or teacher-synthesized images to perform well when distilling high-resolution T2I diffusion models such as Stable Diffusion XL (SDXL), and their use of classifier-free guidance (CFG) introduces a persistent trade-off between text-image alignment and generation diversity. We address these challenges by optimizing Score identity Distillation (SiD) -- a data-free, one-step distillation framework -- for few-step generation. Backed by theoretical analysis that justifies matching a uniform mixture of outputs from all generation steps to the data distribution, our few-step distillation algorithm avoids step-specific networks and integrates seamlessly into existing pipelines, achieving state-of-the-art performance on SDXL at 1024x1024 resolution. To mitigate the alignment-diversity trade-off when real text-image pairs are available, we introduce a Diffusion GAN-based adversarial loss applied to the uniform mixture and propose two new guidance strategies: Zero-CFG, which disables CFG in the teacher and removes text conditioning in the fake score network, and Anti-CFG, which applies negative CFG in the fake score network. This flexible setup improves diversity without sacrificing alignment. Comprehensive experiments on SD1.5 and SDXL demonstrate state-of-the-art performance in both one-step and few-step generation settings, along with robustness to the absence of real images. Our efficient PyTorch implementation, along with the resulting one- and few-step distilled generators, will be released publicly as a separate branch at https://github.com/mingyuanzhou/SiD-LSG.
中文摘要:SiD提出了一种无需真实数据、支持少步生成的蒸馏框架,通过创新性引导策略解决了高分辨率文生图模型中的对齐与多样性权衡问题,在不依赖真实图像的情况下实现了顶尖性能。
English Summary: SiD introduces a data-free, few-step distillation framework that overcomes the alignment-diversity trade-off in high-resolution text-to-image generation by employing novel guidance strategies, achieving state-of-the-art performance without relying on real images.

Authors:Hanzhuo Tan, Xiaolong Tian, Hanrui Qi, Jiaming Liu, Zuchen Gao, Siyi Wang, Qi Luo, Jing Li, Yuqun Zhang
Title: Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation
Abstract:
Recent advances in LLM-based decompilers have been shown to be effective at converting low-level binaries into human-readable source code. However, a comprehensive benchmark that provides large-scale binary-source function pairs is still lacking, although such a benchmark is critical for advancing LLM decompilation technology. Creating accurate binary-source mappings incurs severe issues caused by complex compilation settings and widespread function inlining that obscure the correspondence between binaries and their original source code. Previous efforts have either relied on contest-style benchmarks, used synthetic binary-source mappings that diverge significantly from real-world mappings, or partially matched binaries with only code lines or variable names, compromising the effectiveness of analyzing the binary functionality. To alleviate these issues, we introduce Decompile-Bench, the first open-source dataset comprising two million binary-source function pairs condensed from 100 million collected function pairs, i.e., 450GB of binaries compiled from permissively licensed GitHub projects. For evaluation purposes, we also developed a benchmark, Decompile-Bench-Eval, including manually crafted binaries from the well-established HumanEval and MBPP, alongside the compiled GitHub repositories released after 2025 to mitigate data leakage issues. We further explore commonly used evaluation metrics to provide a thorough assessment of the studied LLM decompilers and find that fine-tuning with Decompile-Bench causes a 20% improvement over previous benchmarks in terms of the re-executability rate. Our code and data have been released on HuggingFace and GitHub: https://github.com/albertan017/LLM4Decompile
中文摘要:近期基于大语言模型的逆向编译器能有效将低级二进制代码转换为可读源码,但缺乏大规模二进制-源代码基准测试阻碍了技术发展,为此我们推出首个开源数据集Decompile-Bench,包含两百万函数对,使重执行率提升20%。
English Summary: Recent LLM-based decompilers effectively convert binaries to readable code, but the lack of large-scale binary-source benchmarks hinders progress, prompting the introduction of Decompile-Bench, an open-source dataset with two million function pairs that improves re-executability by 20%.

Authors:Zihan Su, Xuerui Qiu, Hongbin Xu, Tangyu Jiang, Junhao Zhuang, Chun Yuan, Ming Li, Shengfeng He, Fei Richard Yu
Title: Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking
Abstract:
The explosive growth of generative video models has amplified the demand for reliable copyright preservation of AI-generated content. Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame, and further localized to the optimal spatial region for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel spatiotemporal local scanning strategy, effectively modeling long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness, which is largely attributed to our proposals. Code is publicly available at https://github.com/Sugewud/Safe-Sora
中文摘要:Safe-Sora是首个通过分层自适应匹配机制和新型三维小波增强Mamba架构,直接在AI生成视频中嵌入隐形图形水印的框架,在视频质量和版权保护方面实现了最先进的性能。
English Summary: Safe-Sora is the first framework to embed invisible graphical watermarks directly into AI-generated videos through a hierarchical adaptive matching mechanism and a novel 3D wavelet-enhanced Mamba architecture, achieving state-of-the-art performance in video quality and copyright protection.

Authors:Yaotian Yang, Yiwen Tang, Yizhe Chen, Xiao Chen, Jiangjie Qiu, Hao Xiong, Haoyu Yin, Zhiyao Luo, Yifei Zhang, Sijia Tao, Wentao Li, Qinghua Zhang, Yuqiang Li, Wanli Ouyang, Bin Zhao, Xiaonan Wang, Fei Wei
Title: AutoMat: Enabling Automated Crystal Structure Reconstruction from Microscopy via Agentic Tool Use
Abstract:
Machine learning-based interatomic potentials and force fields depend critically on accurate atomic structures, yet such data are scarce due to the limited availability of experimentally resolved crystals. Although atomic-resolution electron microscopy offers a potential source of structural data, converting these images into simulation-ready formats remains labor-intensive and error-prone, creating a bottleneck for model training and validation. We introduce AutoMat, an end-to-end, agent-assisted pipeline that automatically transforms scanning transmission electron microscopy (STEM) images into atomic crystal structures and predicts their physical properties. AutoMat combines pattern-adaptive denoising, physics-guided template retrieval, symmetry-aware atomic reconstruction, fast relaxation and property prediction via MatterSim, and coordinated orchestration across all stages. We propose the first dedicated STEM2Mat-Bench for this task and evaluate performance using lattice RMSD, formation energy MAE, and structure-matching success rate. By orchestrating external tool calls, AutoMat enables a text-only LLM to outperform vision-language models in this domain, achieving closed-loop reasoning throughout the pipeline. In large-scale experiments over 450 structure samples, AutoMat substantially outperforms existing multimodal large language models and tools. These results validate both AutoMat and STEM2Mat-Bench, marking a key step toward bridging microscopy and atomistic simulation in materials science. The code and dataset are publicly available at https://github.com/yyt-2378/AutoMat and https://huggingface.co/datasets/yaotianvector/STEM2Mat.
中文:AutoMat是一种端到端的自动化流程,能将扫描透射电子显微镜图像转化为原子晶体结构并预测其物理性质,显著优于现有方法,为材料科学中显微技术与原子模拟搭建了桥梁。
English: AutoMat is an automated pipeline that converts scanning transmission electron microscopy images into atomic crystal structures and predicts their physical properties, significantly outperforming existing methods and bridging microscopy with atomistic simulation in materials science.

Authors:Xiangpeng Tian, Xiangyu Liao, Xiao Liu, Meng Li, Chao Ren
Title: Degradation-Aware Feature Perturbation for All-in-One Image Restoration
Abstract:
All-in-one image restoration aims to recover clear images from various degradation types and levels with a unified model. Nonetheless, the significant variations among degradation types present challenges for training a universal model, often resulting in task interference, where the gradient update directions of different tasks may diverge due to shared parameters. To address this issue, motivated by the routing strategy, we propose DFPIR, a novel all-in-one image restorer that introduces Degradation-aware Feature Perturbations (DFP) to adjust the feature space to align with the unified parameter space. In this paper, the feature perturbations primarily include channel-wise perturbations and attention-wise perturbations. Specifically, channel-wise perturbations are implemented by shuffling the channels in high-dimensional space guided by degradation types, while attention-wise perturbations are achieved through selective masking in the attention space. To achieve these goals, we propose a Degradation-Guided Perturbation Block (DGPB) to implement these two functions, positioned between the encoding and decoding stages of the encoder-decoder architecture. Extensive experimental results demonstrate that DFPIR achieves state-of-the-art performance on several all-in-one image restoration tasks including image denoising, image dehazing, image deraining, motion deblurring, and low-light image enhancement. Our codes are available at https://github.com/TxpHome/DFPIR.
Chinese: 本文提出了DFPIR,一种新型一体化图像恢复模型,通过引入退化感知特征扰动来调整特征空间以适配统一参数空间,在多项图像恢复任务中实现了最先进的性能。
English: The paper introduces DFPIR, a novel all-in-one image restoration model that uses degradation-aware feature perturbations to align diverse degradation types with a unified parameter space, achieving state-of-the-art performance across multiple tasks.

Authors:Wanfu Gao, Zengyao Man, Hanlin Pan, Kunpeng Liu
Title: Dual-Agent Reinforcement Learning for Automated Feature Generation
Abstract:
Feature generation involves creating new features from raw data to capture complex relationships among the original features, improving model robustness and machine learning performance. Current methods using reinforcement learning for feature generation have made feature exploration more flexible and efficient. However, several challenges remain: first, during feature expansion, a large number of redundant features are generated. When removing them, current methods only retain the best features each round, neglecting those that perform poorly initially but could improve later. Second, the state representation used by current methods fails to fully capture complex feature relationships. Third, there are significant differences between discrete and continuous features in tabular data, requiring different operations for each type. To address these challenges, we propose a novel dual-agent reinforcement learning method for feature generation. Two agents are designed: the first generates new features, and the second determines whether they should be preserved. A self-attention mechanism enhances state representation, and diverse operations distinguish interactions between discrete and continuous features. The experimental results on multiple datasets demonstrate that the proposed method is effective. The code is available at https://github.com/extess0/DARL.
中文摘要:该摘要提出了一种双智能体强化学习方法,通过自注意力机制增强状态表示,并针对离散和连续特征采用不同操作,有效解决了现有方法中特征冗余和关系捕捉不足的问题。
English Summary: This abstract introduces a dual-agent reinforcement learning method that generates and selects features using a self-attention mechanism and distinct operations for discrete and continuous features, effectively addressing redundancy and relationship capture issues in current approaches.

Authors:Yifan Hu, Rui Liu, Yi Ren, Xiang Yin, Haizhou Li
Title: Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis
Abstract:
Conversational Speech Synthesis (CSS) aims to align synthesized speech with the emotional and stylistic context of user-agent interactions to achieve empathy. Current generative CSS models face interpretability limitations due to insufficient emotional perception and redundant discrete speech coding. To address the above issues, we present Chain-Talker, a three-stage framework mimicking human cognition: Emotion Understanding derives context-aware emotion descriptors from dialogue history; Semantic Understanding generates compact semantic codes via serialized prediction; and Empathetic Rendering synthesizes expressive speech by integrating both components. To support emotion modeling, we develop CSS-EmCap, an LLM-driven automated pipeline for generating precise conversational speech emotion captions. Experiments on three benchmark datasets demonstrate that Chain-Talker produces more expressive and empathetic speech than existing methods, with CSS-EmCap contributing to reliable emotion modeling. The code and demos are available at: https://github.com/AI-S2-Lab/Chain-Talker.
中文摘要:Chain-Talker是一个三阶段对话语音合成框架,通过情感理解、语义压缩和共情渲染来提升情感契合度与可解释性,其创新的CSS-EmCap情感建模方法在实验中显著优于现有技术。
English Summary: Chain-Talker is a three-stage conversational speech synthesis framework that enhances emotional alignment and interpretability through emotion understanding, semantic compression, and empathetic rendering, outperforming existing methods with its novel CSS-EmCap emotion modeling pipeline.

Authors:Sanggeon Yun, Ryozo Masukawa, Hyunwoo Oh, Nathaniel D. Bastian, Mohsen Imani
Title: A Few Large Shifts: Layer-Inconsistency Based Minimal Overhead Adversarial Example Detection
Abstract:
Deep neural networks (DNNs) are highly susceptible to adversarial examples--subtle, imperceptible perturbations that can lead to incorrect predictions. While detection-based defenses offer a practical alternative to adversarial training, many existing methods depend on external models, complex architectures, heavy augmentations, or adversarial data, limiting their efficiency and generalizability. We introduce a lightweight, plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself, requiring only benign data for calibration. Our approach is grounded in the A Few Large Shifts Assumption, which posits that adversarial perturbations typically induce large representation shifts in a small subset of layers. Building on this, we propose two complementary strategies--Recovery Testing (RT) and Logit-layer Testing (LT)--to expose internal disruptions caused by adversaries. Evaluated on CIFAR-10, CIFAR-100, and ImageNet under both standard and adaptive threat models, our method achieves state-of-the-art detection performance with negligible computational overhead and no compromise to clean accuracy. The code is available here: https://github.com/c0510gy/AFLS-AED.
Chinese: 本文提出了一种轻量级的即插即用检测框架,通过仅使用良性数据测量深度神经网络内部层间不一致性来识别对抗样本,以极低计算成本实现了最先进的检测性能。
English: This paper introduces a lightweight, plug-in detection framework that identifies adversarial examples by measuring internal layer-wise inconsistencies in deep neural networks using only benign data, achieving state-of-the-art performance with minimal computational cost.

Authors:Sanggeon Yun, Ryozo Masukawa, Hyunwoo Oh, Nathaniel D. Bastian, Mohsen Imani
Title: A Few Large Shifts: Layer-Inconsistency Based Minimal Overhead Adversarial Example Detection
Abstract:
Deep neural networks (DNNs) are highly susceptible to adversarial examples--subtle, imperceptible perturbations that can lead to incorrect predictions. While detection-based defenses offer a practical alternative to adversarial training, many existing methods depend on external models, complex architectures, or adversarial data, limiting their efficiency and generalizability. We introduce a lightweight, plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself, requiring only benign data for calibration. Our approach is grounded in the A Few Large Shifts Assumption, which posits that adversarial perturbations induce large, localized violations of layer-wise Lipschitz continuity in a small subset of layers. Building on this, we propose two complementary strategies--Recovery Testing (RT) and Logit-layer Testing (LT)--to empirically measure these violations and expose internal disruptions caused by adversaries. Evaluated on CIFAR-10, CIFAR-100, and ImageNet under both standard and adaptive threat models, our method achieves state-of-the-art detection performance with negligible computational overhead. Furthermore, our system-level analysis provides a practical method for selecting a detection threshold with a formal lower-bound guarantee on accuracy. The code is available here: https://github.com/c0510gy/AFLS-AED.
Chinese: 本文提出了一种轻量级的即插即用检测框架,通过仅使用良性数据测量深度神经网络内部层间不一致性来识别对抗样本,以极低计算成本实现了最先进的检测性能。
English: This paper introduces a lightweight, plug-in detection framework that identifies adversarial examples by measuring internal layer-wise inconsistencies in deep neural networks using only benign data, achieving state-of-the-art performance with minimal computational cost.

Authors:Florent Chiaroni, Ali Ayub, Ola Ahmad
Title: ProMi: An Efficient Prototype-Mixture Baseline for Few-Shot Segmentation with Bounding-Box Annotations
Abstract:
In robotics applications, few-shot segmentation is crucial because it allows robots to perform complex tasks with minimal training data, facilitating their adaptation to diverse, real-world environments. However, pixel-level annotation of even a small number of images is highly time-consuming and costly. In this paper, we present a novel few-shot binary segmentation method based on bounding-box annotations instead of pixel-level labels. We introduce ProMi, an efficient prototype-mixture-based method that treats the background class as a mixture of distributions. Our approach is simple, training-free, and effective, accommodating coarse annotations with ease. Compared to existing baselines, ProMi achieves the best results across different datasets with significant gains, demonstrating its effectiveness. Furthermore, we present qualitative experiments tailored to real-world mobile robot tasks, demonstrating the applicability of our approach in such scenarios. Our code: https://github.com/ThalesGroup/promi.
Chinese: 本文提出了一种名为ProMi的新型小样本二元分割方法,它采用边界框标注而非像素级标签,在多个数据集上取得了最佳效果,并展示了在机器人实际应用中的有效性。
English: This paper introduces ProMi, a novel few-shot binary segmentation method that uses bounding-box annotations instead of pixel-level labels, achieving superior results across datasets and demonstrating real-world applicability in robotics.

Authors:Botao Amber Hu, Rem Rungu Lin, Yilan Elan Tao, Samuli Laato, Yue Li
Title: Towards Immersive Mixed Reality Street Play: Understanding Co-located Bodily Play with See-through Head-mounted Displays in Public Spaces
Abstract:
As see-through Mixed Reality Head-Mounted Displays (MRHMDs) proliferate, their usage is gradually shifting from controlled, private settings to spontaneous, public contexts. While location-based augmented reality mobile games such as Pokemon GO have been successful, the embodied interaction afforded by MRHMDs moves play beyond phone-based screen-tapping toward co-located, bodily, movement-based play. In anticipation of widespread MRHMD adoption, major technology companies have teased concept videos envisioning urban streets as vast mixed reality playgrounds--imagine Harry Potter-style wizard duels in city streets--which we term Immersive Mixed Reality Street Play (IMRSP). However, few real-world studies examine such scenarios. Through empirical, in-the-wild studies of our research-through-design game probe, Multiplayer Omnipresent Fighting Arena (MOFA), deployed across diverse public venues, we offer initial insights into the social implications, challenges, opportunities, and design recommendations of IMRSP. The MOFA framework, which includes three gameplay modes--"The Training," "The Duel," and "The Dragon"--is open-sourced at https://github.com/realitydeslab/mofa.
中文摘要:混合现实头戴设备正从私人场景转向公共应用,通过MOFA游戏在真实环境中的实证研究,揭示了沉浸式街头游戏的社会影响与设计挑战,并开源了游戏框架。
English Summary: Mixed Reality Head-Mounted Displays are transitioning from private to public use, enabling immersive street play like wizard duels, with the MOFA game study providing initial insights into its social impacts and design considerations.

Authors:Zongkai Liu, Fanqing Meng, Lingxiao Du, Zhixiang Zhou, Chao Yu, Wenqi Shao, Qiaosheng Zhang
Title: CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models
Abstract:
Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) with rule-based rewards. However, existing RL methods -- such as GRPO, REINFORCE++, and RLOO -- often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clip mechanism on the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at https://github.com/ModalMinds/MM-EUREKA.
Chinese: 本文提出CPGD算法,通过基于KL散度的策略漂移约束和比率对数裁剪机制,有效稳定语言模型的强化学习训练过程,在保持训练稳定性的同时显著提升性能表现。
English: This paper introduces CPGD, a novel reinforcement learning algorithm that stabilizes policy training in language models by dynamically constraining policy drift with KL divergence and preventing excessive updates through a clipping mechanism, outperforming prior methods in both stability and performance.
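
The two ingredients named in the abstract, a clip on the logarithm of the probability ratio and a KL-based drift penalty, can be illustrated with a small per-token objective. This is only a sketch under stated assumptions: the function name `cpgd_style_loss`, the clip range `eps`, the drift weight `alpha`, and the use of half the squared log-ratio as a sample-based KL estimate are all hypothetical choices, and the paper's exact objective may differ.

```python
def cpgd_style_loss(logp_new, logp_old, advantages, eps=0.2, alpha=0.1):
    """Hypothetical loss: clipped log-ratio policy gradient plus a drift penalty.

    logp_new / logp_old: log-probabilities of the sampled tokens under the
    current and the old policy. advantages: one scalar per token.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        log_ratio = ln - lo
        # Clip the log-ratio itself rather than the ratio.
        clipped = max(-eps, min(eps, log_ratio))
        # Pessimistic surrogate: never reward moving further than the clip allows.
        surrogate = min(log_ratio * adv, clipped * adv)
        # Assumed drift term: 0.5 * log_ratio^2 as a sample-based KL estimate.
        drift = 0.5 * log_ratio ** 2
        total += -surrogate + alpha * drift
    return total / len(logp_new)
```

With identical old and new log-probabilities the loss is zero, and a large positive log-ratio on a positive-advantage token contributes at most `eps * advantage` to the surrogate, which is the kind of bounded update the abstract describes.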

Authors:Jingyue Gao, Runji Lin, Keming Lu, Bowen Yu, Junyang Lin, Jianyu Chen
Title: MARGE: Improving Math Reasoning for LLMs with Guided Exploration
Abstract:
Large Language Models (LLMs) exhibit strong potential in mathematical reasoning, yet their effectiveness is often limited by a shortage of high-quality queries. This limitation necessitates scaling up computational responses through self-generated data, yet current methods struggle due to spuriously correlated data caused by ineffective exploration across all reasoning stages. To address this challenge, we introduce MARGE: Improving Math Reasoning with Guided Exploration, a novel method that enhances mathematical reasoning through hit-guided exploration. MARGE systematically explores intermediate reasoning states derived from self-generated solutions, enabling adequate exploration and improved credit assignment throughout the reasoning process. Through extensive experiments across multiple backbone models and benchmarks, we demonstrate that MARGE significantly improves reasoning capabilities without requiring external annotations or training additional value models. Notably, MARGE improves both single-shot accuracy and exploration diversity, mitigating a common trade-off in alignment methods. These results demonstrate MARGE's effectiveness in enhancing mathematical reasoning capabilities and unlocking the potential of scaling self-generated training data. Our code and models are available at https://github.com/georgao35/MARGE.
中文: MARGE通过引导式探索,系统分析自生成解决方案的中间推理状态,无需外部标注即可提升大语言模型的数学推理准确性和多样性。
English: MARGE introduces guided exploration to enhance mathematical reasoning in LLMs by systematically exploring intermediate states from self-generated solutions, improving accuracy and diversity without external annotations.

Authors:Longxi Gao, Li Zhang, Mengwei Xu
Title: UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning
Abstract:
Training effective Vision Language Models (VLMs) for GUI agents typically relies on supervised fine-tuning (SFT) over large-scale annotated datasets, where the collection process is labor-intensive and error-prone. In this work, we propose a self-supervised inverse dynamics task to enable VLMs to learn from GUI transition pairs by inferring the action that caused that transition. This training task offers two advantages: (1) It enables VLMs to ignore variations unrelated to user actions (e.g., background refreshes, ads) and to focus on true affordances such as buttons and input fields within complex GUIs. (2) The training data can be easily obtained from existing GUI trajectories without requiring human annotation, and it can be easily scaled through automatic offline exploration. Using this training task, we propose UI-Shift, a framework for enhancing VLM-based GUI agents through self-supervised reinforcement learning (RL). With only 2K training samples sourced from existing datasets, two VLMs -- Qwen2.5-VL-3B and Qwen2.5-VL-7B -- trained with UI-Shift achieve competitive or superior performance on grounding tasks (ScreenSpot-series benchmarks) and GUI automation tasks (AndroidControl), compared to SFT baselines and GUI-specific models that explicitly elicit reasoning abilities during RL. Our findings suggest a potential direction for enhancing VLMs for GUI agents by leveraging more self-supervised training data in the future. Code, model, and data are available at: https://github.com/UbiquitousLearning/UIShift
中文: 本文提出UI-Shift自监督框架,通过让视觉语言模型从界面转换对中推断操作动作,无需人工标注即可实现高效训练,仅用少量样本就在多项任务中达到领先性能。
English: This paper introduces UI-Shift, a self-supervised framework that trains Vision Language Models to infer GUI actions from transition pairs, eliminating the need for manual annotation and achieving competitive performance with minimal training data.

Authors:Wenchen Chen, Yanmei Zhang, Zhongwei Xiao, Jianping Chu, Xingbo Wang
Title: Spectral-Spatial Self-Supervised Learning for Few-Shot Hyperspectral Image Classification
Abstract:
Few-shot classification of hyperspectral images (HSI) faces the challenge of scarce labeled samples. Self-Supervised learning (SSL) and Few-Shot Learning (FSL) offer promising avenues to address this issue. However, existing methods often struggle to adapt to the spatial geometric diversity of HSIs and lack sufficient spectral prior knowledge. To tackle these challenges, we propose a method, Spectral-Spatial Self-Supervised Learning for Few-Shot Hyperspectral Image Classification (S4L-FSC), aimed at improving the performance of few-shot HSI classification. Specifically, we first leverage heterogeneous datasets to pretrain a spatial feature extractor using a designed Rotation-Mirror Self-Supervised Learning (RM-SSL) method, combined with FSL. This approach enables the model to learn the spatial geometric diversity of HSIs using rotation and mirroring labels as supervisory signals, while acquiring transferable spatial meta-knowledge through few-shot learning. Subsequently, homogeneous datasets are utilized to pretrain a spectral feature extractor via a combination of FSL and Masked Reconstruction Self-Supervised Learning (MR-SSL). The model learns to reconstruct original spectral information from randomly masked spectral vectors, inferring spectral dependencies. In parallel, FSL guides the model to extract pixel-level discriminative features, thereby embedding rich spectral priors into the model. This spectral-spatial pretraining method, along with the integration of knowledge from heterogeneous and homogeneous sources, significantly enhances model performance. Extensive experiments on four HSI datasets demonstrate the effectiveness and superiority of the proposed S4L-FSC approach for few-shot HSI classification.
中文: 提出的S4L-FSC方法通过结合自监督学习和少样本学习,分别预训练空间与光谱特征提取器,有效解决了高光谱图像空间几何多样性适应问题并融入了光谱先验知识,显著提升了少样本分类性能。
English: The proposed S4L-FSC method enhances few-shot hyperspectral image classification by combining self-supervised learning and few-shot learning to pretrain spatial and spectral feature extractors, effectively addressing spatial geometric diversity and incorporating spectral prior knowledge through heterogeneous and homogeneous datasets.
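
The rotation and mirroring labels used as supervisory signals in RM-SSL can be pictured with a small helper that enumerates the eight views of a square patch. The labeling scheme here (label = 2 x number of 90-degree rotations + mirrored flag) and the function names are assumptions for illustration; the abstract does not specify how the paper encodes its labels.

```python
def rotate90(patch):
    """Rotate a 2-D list of rows by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]

def mirror(patch):
    """Flip a 2-D list of rows left to right."""
    return [row[::-1] for row in patch]

def rotation_mirror_views(patch):
    """Return (view, label) pairs for 4 rotations x {plain, mirrored}.

    Assumed label scheme: 2 * num_rotations + (1 if mirrored else 0).
    Each cell may hold a scalar or a full spectral vector.
    """
    views = []
    current = [list(row) for row in patch]
    for k in range(4):
        views.append((current, 2 * k))
        views.append((mirror(current), 2 * k + 1))
        current = rotate90(current)
    return views
```

A classifier trained to predict the label from the view has to attend to spatial geometry, which is the role the abstract assigns to this pretext task.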

Authors:Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, Tianfei Zhou
Title: VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning
Abstract:
Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities in Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logic, temporal and causal structures inherent in video data. To fill this gap, we propose VIDEORFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VIDEORFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge in achieving this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by an MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets, i.e., VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VIDEORFT achieves state-of-the-art performance on six video reasoning benchmarks.
中文: VIDEORFT将强化微调扩展至多模态大语言模型,通过构建高质量思维链数据集并引入语义一致性奖励机制,有效解决了视频推理的难题,在六大基准测试中实现了最优性能。
English: VIDEORFT extends reinforcement fine-tuning to multimodal large language models, addressing video reasoning challenges by generating high-quality chain-of-thought datasets and introducing a semantic-consistency reward, achieving state-of-the-art performance across six benchmarks.

Authors:Zirun Guo, Minjie Hong, Tao Jin
Title: Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning
Abstract:
Reinforcement Learning (RL) has shown promise in improving the reasoning abilities of Large Language Models (LLMs). However, the specific challenges of adapting RL to multimodal data and formats remain relatively unexplored. In this work, we present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs). We draw inspiration from human learning progression--from simple to complex and easy to difficult--and propose a gradual learning paradigm for MLLMs. To this end, we construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. To tackle multimodal tasks, we introduce a multimodal format constraint that encourages careful observation of images, resulting in enhanced visual abilities and clearer and more structured responses. Additionally, we implement a bonus reward system that favors concise, correct answers within a length constraint, alongside a dynamic weighting mechanism that prioritizes uncertain and medium-difficulty problems, ensuring that more informative samples have a greater impact on training. Our experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks, achieving superior clarity and conciseness in reasoning chains. Ablation studies validate the effectiveness of our strategies, highlighting the robustness and generalization of our approach. The dataset and code will be released at https://github.com/zrguo/Observe-R1.
中文: 本文提出Observe-R1框架,通过渐进式学习和多模态约束增强大语言模型的多模态推理能力,在多项基准测试中实现更优性能。
English: This paper introduces Observe-R1, a framework that enhances multimodal reasoning in large language models through progressive learning and multimodal constraints, achieving superior performance on reasoning benchmarks.

Authors:Siwei Xia, Li Sun, Tiantian Sun, Qingli Li
Title: DragLoRA: Online Optimization of LoRA Adapters for Drag-based Image Editing in Diffusion Model
Abstract:
Drag-based editing within a pretrained diffusion model provides a precise and flexible way to manipulate foreground objects. Traditional methods optimize the input feature obtained from DDIM inversion directly, adjusting it iteratively to guide handle points towards target locations. However, these approaches often suffer from limited accuracy due to the low representation ability of the feature in motion supervision, as well as inefficiencies caused by the large search space required for point tracking. To address these limitations, we present DragLoRA, a novel framework that integrates LoRA (Low-Rank Adaptation) adapters into the drag-based editing pipeline. To enhance the training of LoRA adapters, we introduce an additional denoising score distillation loss which regularizes the online model by aligning its output with that of the original model. Additionally, we improve the consistency of motion supervision by adapting the input features using the updated LoRA, giving a more stable and accurate input feature for subsequent operations. Building on this, we design an adaptive optimization scheme that dynamically toggles between two modes, prioritizing efficiency without compromising precision. Extensive experiments demonstrate that DragLoRA significantly enhances the control precision and computational efficiency for drag-based image editing. The code of DragLoRA is available at: https://github.com/Sylvie-X/DragLoRA.
中文摘要:DragLoRA提出了一种新颖框架,通过集成LoRA适配器并结合改进的去噪分数蒸馏损失与自适应优化方案,显著提升了基于拖拽的图像编辑的精确度和计算效率。
English Summary: DragLoRA introduces a novel framework that integrates LoRA adapters with an enhanced denoising score distillation loss and adaptive optimization, significantly improving both precision and efficiency in drag-based image editing.

Authors:Wenqiao Zhu, Chao Xu, Lulu Wang, Jun Wu
Title: PSC: Extending Context Window of Large Language Models via Phase Shift Calibration
Abstract:
Rotary Position Embedding (RoPE) is an efficient position encoding approach and is widely utilized in numerous large language models (LLMs). Recently, many methods have been put forward to further expand the context window based on RoPE. The core concept of those methods is to predefine or search for a set of factors to rescale the base frequencies of RoPE. Nevertheless, it is quite a challenge for existing methods to predefine an optimal factor due to the exponential search space. In view of this, we introduce PSC (Phase Shift Calibration), a small module for calibrating the frequencies predefined by existing methods. With PSC, we demonstrate that many existing methods, such as PI, YaRN, and LongRoPE, can be further enhanced. We conducted extensive experiments across multiple models and tasks. The results demonstrate that (1) when PSC is enabled, the comparative reductions in perplexity increase as the context window size grows from 16k to 32k and up to 64k; (2) our approach is broadly applicable and exhibits robustness across a variety of models and tasks. The code can be found at https://github.com/WNQzhu/PSC.
中文: PSC(相位偏移校准)是一种新颖模块,通过校准预定义的RoPE频率来增强现有扩展大语言模型上下文窗口的方法,在多种模型和任务中提升了性能和鲁棒性。
English: PSC (Phase Shift Calibration) is a novel module that enhances existing methods for extending the context window in large language models by calibrating predefined RoPE frequencies, improving performance and robustness across various models and tasks.
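
The "rescale the base frequencies of RoPE" step that the abstract attributes to existing methods can be sketched as follows. The sketch covers only standard RoPE with per-dimension rescaling factors (a single shared factor corresponds to position interpolation); the PSC calibration module itself is not shown, and the helper names are hypothetical.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rescale_frequencies(freqs, factors):
    """Divide each frequency by its factor (one shared factor gives PI-style scaling)."""
    return [f / s for f, s in zip(freqs, factors)]

def apply_rope(vec, position, freqs):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by angle position * freq."""
    out = []
    for i, f in enumerate(freqs):
        angle = position * f
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out
```

With a shared factor of 4, position 8 under the rescaled frequencies receives the same rotation as position 2 under the original ones, which is how a longer context is mapped back into the range seen during training. Choosing a separate factor per dimension is what makes the search space grow exponentially.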

Authors:Emanuele La Malfa, Jon Vadillo, Marco Molinari, Michael Wooldridge
Title: Fixed Point Explainability
Abstract:
This paper introduces a formal notion of fixed point explanations, inspired by the "why regress" principle, to assess, through recursive applications, the stability of the interplay between a model and its explainer. Fixed point explanations satisfy properties like minimality, stability, and faithfulness, revealing hidden model behaviours and explanatory weaknesses. We define convergence conditions for several classes of explainers, from feature-based to mechanistic tools like Sparse AutoEncoders, and we report quantitative and qualitative results.
Chinese Summary: 本文提出固定点解释的形式化概念,用以评估模型与解释器之间相互作用的稳定性,为多种解释器类型建立了收敛条件并展示了实证结果。
English Summary: This paper proposes a formal concept of fixed point explanations to evaluate the interaction stability between models and explainers, establishing convergence criteria for various explainer types and presenting empirical findings.
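
The idea of recursively applying an explainer until its output stops changing can be illustrated with a toy example. Everything below is an assumption made for illustration: the linear "model" given by `weights`, the thresholding rule inside `toy_explainer`, and the stopping test are not taken from the paper, whose formal definitions may differ.

```python
def toy_explainer(weights, x):
    """Hypothetical feature-based explainer for a linear model.

    Keeps features whose contribution |w_i * x_i| is at least the mean
    contribution of the currently active features; zeroes out the rest.
    """
    contrib = [abs(w * v) for w, v in zip(weights, x)]
    active = [c for c in contrib if c > 0]
    if not active:
        return [0.0] * len(x)
    threshold = sum(active) / len(active)
    return [v if c >= threshold else 0.0 for c, v in zip(contrib, x)]

def fixed_point_explanation(weights, x, max_iters=20):
    """Re-apply the explainer to its own output until it no longer changes."""
    current = list(x)
    for step in range(max_iters):
        nxt = toy_explainer(weights, current)
        if nxt == current:
            return current, step
        current = nxt
    return current, max_iters
```

For weights `[1, 1, 1]` and input `[4, 3, 1]`, the first explanation keeps two features, but explaining that explanation keeps only one, so the single-pass output is not stable under recursion.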

Authors:Yang Hu, Xingyu Zhang, Xueji Fang, Zhiyang Chen, Xiao Wang, Huatian Zhang, Guojun Qi
Title: SLOT: Sample-specific Language Model Optimization at Test-time
Abstract:
We propose SLOT (Sample-specific Language Model Optimization at Test-time), a novel and parameter-efficient test-time inference approach that enhances a language model's ability to more accurately respond to individual prompts. Existing Large Language Models (LLMs) often struggle with complex instructions, leading to poor performance on those not well represented among general samples. To address this, SLOT conducts a few optimization steps at test time to update a lightweight sample-specific parameter vector. This vector is added to the final hidden layer before the output head, and enables efficient adaptation by caching the last-layer features during per-sample optimization. By minimizing the cross-entropy loss on the input prompt only, SLOT helps the model better align with and follow each given instruction. In experiments, we demonstrate that our method outperforms the compared models across multiple benchmarks and LLMs. For example, Qwen2.5-7B with SLOT achieves an accuracy gain of 8.6% on GSM8K from 57.54% to 66.19%, while DeepSeek-R1-Distill-Llama-70B with SLOT achieves a SOTA accuracy of 68.69% on GPQA among 70B-level models. Our code is available at https://github.com/maple-research-lab/SLOT.
中文: SLOT是一种参数高效的测试时优化方法,通过更新轻量级参数向量来提升语言模型对单个提示的响应准确性,在多个基准测试中实现了显著的性能提升。
English: SLOT is a parameter-efficient test-time optimization method that enhances language models' accuracy on individual prompts by updating a lightweight parameter vector, achieving significant performance gains across multiple benchmarks.
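
The mechanism described in the abstract (a sample-specific vector added to the cached final hidden features, tuned for a few steps on the prompt's own cross-entropy) can be sketched in plain Python. This is a toy under stated assumptions: the helper names, the tiny dimensions, and the use of plain gradient descent with step size `lr` are illustrative choices, not details confirmed by the source.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def prompt_loss_and_grad(hidden, head, targets, delta):
    """Mean next-token cross-entropy on the prompt and its gradient w.r.t. delta.

    hidden: cached final-layer features, one vector per prompt position.
    head: output-head rows, one per vocabulary entry.
    """
    dim = len(delta)
    loss, grad = 0.0, [0.0] * dim
    for h, y in zip(hidden, targets):
        shifted = [hj + dj for hj, dj in zip(h, delta)]
        probs = softmax([sum(w * s for w, s in zip(row, shifted)) for row in head])
        loss -= math.log(probs[y])
        for v, row in enumerate(head):
            coeff = probs[v] - (1.0 if v == y else 0.0)
            for j in range(dim):
                grad[j] += coeff * row[j]
    n = len(hidden)
    return loss / n, [g / n for g in grad]

def slot_style_adapt(hidden, head, targets, steps=3, lr=0.5):
    """Run a few gradient steps on delta only; the model itself is untouched."""
    delta = [0.0] * len(hidden[0])
    for _ in range(steps):
        _, grad = prompt_loss_and_grad(hidden, head, targets, delta)
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta
```

Because only `delta` changes, the expensive forward pass through the network runs once per sample and each optimization step costs a single pass through the output head.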

Authors:Elizaveta Pestova, Ilya Osokin, Danil Belov, Pavel Osinenko
Title: Adaptive MPC-based quadrupedal robot control under periodic disturbances
Abstract:
Recent advancements in adaptive control for reference trajectory tracking enable quadrupedal robots to perform locomotion tasks under challenging conditions. Existing methods can estimate external disturbances in terms of forces and torques. However, the specific case of periodic disturbances has not been explicitly tackled for quadrupeds. This work estimates periodic disturbances with a lightweight regressor that uses simplified robot dynamics and extracts the disturbance properties in terms of magnitude and frequency. Experimental evidence suggests performance improvement over the baseline static disturbance compensation. All source files, including simulation setups, code, and calculation scripts, are available on GitHub at https://github.com/aidagroup/quad-periodic-mpc.
中文: 本研究提出一种轻量级回归器,通过简化动力学模型提取周期性扰动的幅值与频率,从而在四足机器人中实现优于静态补偿方法的性能提升。
English: This study introduces a lightweight regressor to estimate periodic disturbances in quadrupedal robots, improving performance over static compensation methods by leveraging simplified dynamics to extract disturbance magnitude and frequency.
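
To make "extracting the magnitude and frequency" of a periodic disturbance concrete, the sketch below recovers both from a sampled signal by picking the strongest non-DC bin of a discrete Fourier transform. This is a swapped-in, generic technique chosen for illustration; it is not the paper's regressor and ignores the robot dynamics entirely. The function name and sampling setup are hypothetical.

```python
import cmath
import math

def estimate_periodic(samples, dt):
    """Return (amplitude, frequency_hz) of the dominant sinusoidal component.

    Uses a direct DFT over bins 1 .. n/2 - 1 after removing the mean, so a
    constant (static) disturbance offset does not affect the estimate.
    """
    n = len(samples)
    mean = sum(samples) / n
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2):
        coeff = sum((s - mean) * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return 2.0 * best_mag / n, best_k / (n * dt)
```

The estimate is exact only when the window holds a whole number of periods; otherwise the frequency is resolved to the nearest bin of width `1 / (n * dt)`.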

Authors:Ya Shen, Gang Chen, Hui Ma, Mengjie Zhang
Title: GATES: Cost-aware Dynamic Workflow Scheduling via Graph Attention Networks and Evolution Strategy
Abstract:
Cost-aware Dynamic Workflow Scheduling (CADWS) is a key challenge in cloud computing, focusing on devising an effective scheduling policy to efficiently schedule dynamically arriving workflow tasks, represented as Directed Acyclic Graphs (DAG), to suitable virtual machines (VMs). Deep reinforcement learning (DRL) has been widely employed for automated scheduling policy design. However, the performance of DRL is heavily influenced by the design of the problem-tailored policy network and is highly sensitive to hyperparameters and the design of reward feedback. Considering the above-mentioned issues, this study proposes a novel DRL method combining Graph Attention Networks-based policy network and Evolution Strategy, referred to as GATES. The contributions of GATES are summarized as follows: (1) GATES can capture the impact of current task scheduling on subsequent tasks by learning the topological relationships between tasks in a DAG. (2) GATES can assess the importance of each VM to the ready task, enabling it to adapt to dynamically changing VM resources. (3) Utilizing Evolution Strategy's robustness, exploratory nature, and tolerance for delayed rewards, GATES achieves stable policy learning in CADWS. Extensive experimental results demonstrate the superiority of the proposed GATES in CADWS, outperforming several state-of-the-art algorithms. The source code is available at: https://github.com/YaShen998/GATES.
中文: 本研究提出名为GATES的新型深度强化学习方法,通过结合图注意力网络与进化策略,能有效捕捉工作流任务依赖关系并适应虚拟机资源动态变化,在云环境成本感知工作流调度中展现出优越性能。
English: This study introduces GATES, a novel deep reinforcement learning method that integrates Graph Attention Networks with Evolution Strategy to enhance cost-aware dynamic workflow scheduling in cloud computing by effectively capturing task dependencies and adapting to virtual machine resource changes.
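
The Evolution Strategy component can be pictured with a generic antithetic-sampling update that needs only fitness values, not reward gradients. This is a textbook-style sketch under stated assumptions, not the GATES training loop: the graph attention policy is replaced by a plain parameter vector, and the function name, population size, noise scale `sigma`, and learning rate `lr` are hypothetical.

```python
import random

def evolution_strategy(fitness, theta, iterations=50, population=20,
                       sigma=0.1, lr=0.05, seed=0):
    """Improve theta using paired (antithetic) Gaussian perturbations."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(iterations):
        grad = [0.0] * dim
        for _ in range(population):
            eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            plus = fitness([t + sigma * e for t, e in zip(theta, eps)])
            minus = fitness([t - sigma * e for t, e in zip(theta, eps)])
            # Finite-difference estimate of the ascent direction along eps.
            for j in range(dim):
                grad[j] += (plus - minus) * eps[j] / (2.0 * sigma * population)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta
```

Since each update depends only on the total fitness of whole perturbed parameter vectors, the same loop applies when the fitness is an episode-level scheduling cost that arrives with delay, which is the tolerance for delayed rewards the abstract mentions.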

Authors:Qizhou Chen, Dakan Wang, Taolin Zhang, Zaoming Yan, Chengsong You, Chengyu Wang, Xiaofeng He
Title: UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models
Abstract:
Model editing aims to enhance the accuracy and reliability of large language models (LLMs) by efficiently adjusting their internal parameters. Currently, most LLM editing datasets are confined to narrow knowledge domains and cover a limited range of editing evaluation. They often overlook the broad scope of editing demands and the diversity of ripple effects resulting from edits. In this context, we introduce UniEdit, a unified benchmark for LLM editing grounded in open-domain knowledge. First, we construct editing samples by selecting entities from 25 common domains across five major categories, utilizing the extensive triple knowledge available in open-domain knowledge graphs to ensure comprehensive coverage of the knowledge domains. To address the issues of generality and locality in editing, we design a Neighborhood Multi-hop Chain Sampling (NMCS) algorithm that samples subgraphs around a given piece of knowledge, capturing comprehensive ripple effects for evaluation. Finally, we employ proprietary LLMs to convert the sampled knowledge subgraphs into natural language text, guaranteeing grammatical accuracy and syntactical diversity. Extensive statistical analysis confirms the scale, comprehensiveness, and diversity of our UniEdit benchmark. We conduct comprehensive experiments across multiple LLMs and editors, analyzing their performance to highlight strengths and weaknesses in editing across open knowledge domains and various evaluation criteria, thereby offering valuable insights for future research endeavors.
中文: UniEdit是一个基于开放领域知识的统一基准,通过涵盖广泛知识领域并采用创新的邻域多跳链采样算法来评估编辑的连锁效应,以提升大语言模型编辑的全面性和多样性。
English: UniEdit is a unified benchmark designed to improve large language model editing by providing comprehensive coverage across diverse open-domain knowledge and evaluating the ripple effects of edits through a novel sampling algorithm.

Authors:Xinye Li, Mingqi Wan, Dianbo Sui
Title: LLMSR@XLLM25: An Empirical Study of LLM for Structural Reasoning
Abstract:
We present Team asdfo123's submission to the LLMSR@XLLM25 shared task, which evaluates large language models on producing fine-grained, controllable, and interpretable reasoning processes. Systems must extract all problem conditions, decompose a chain of thought into statement-evidence pairs, and verify the logical validity of each pair. Leveraging only the off-the-shelf Meta-Llama-3-8B-Instruct, we craft a concise few-shot, multi-turn prompt that first enumerates all conditions and then guides the model to label, cite, and adjudicate every reasoning step. A lightweight post-processor based on regular expressions normalises spans and enforces the official JSON schema. Without fine-tuning, external retrieval, or ensembling, our method ranks 5th overall, achieving macro F1 scores on par with substantially more complex and resource-consuming pipelines. We conclude by analysing the strengths and limitations of our approach and outlining directions for future research in structural reasoning with LLMs. Our code is available at https://github.com/asdfo123/LLMSR-asdfo123.
中文: asdfo123团队在LLMSR@XLLM25任务中,仅使用Meta-Llama-3-8B-Instruct模型通过多轮提示实现了无需微调的结构化推理,最终排名第五,与更复杂的系统性能相当。
English: Team asdfo123's submission for the LLMSR@XLLM25 task uses Meta-Llama-3-8B-Instruct with a multi-turn prompt to achieve competitive results in structural reasoning without fine-tuning or external resources, ranking 5th overall.

Authors:Maoyuan Ye, Jing Zhang, Juhua Liu, Bo Du, Dacheng Tao
Title: LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
Abstract:
Recent advances in Large Multimodal Models (LMMs) have significantly improved their reasoning and Optical Character Recognition (OCR) capabilities. However, their performance on complex logical reasoning tasks involving text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 1,100 multiple-choice questions designed to evaluate LMMs' logical reasoning abilities on text-rich images, while minimizing reliance on domain-specific knowledge (e.g., mathematics). We construct LogicOCR by curating a text corpus from the Chinese National Civil Servant Examination and develop a scalable, automated pipeline to convert it into multimodal samples. First, we design prompt templates to steer GPT-Image-1 to generate images with diverse backgrounds, interleaved text-illustration layouts, and varied fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified, with low-quality examples discarded. We evaluate a range of representative open-source and proprietary LMMs under both Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag in multimodal reasoning compared to text-only inputs, indicating that they have not fully bridged visual reading with reasoning. We hope LogicOCR will serve as a valuable resource for advancing multimodal reasoning research. The dataset is available at https://github.com/MiliLab/LogicOCR.
中文: LogicOCR是一个包含1,100道选择题的新基准,旨在评估大型多模态模型在文本丰富图像上的逻辑推理能力,揭示了尽管OCR技术有所进步,但它们在视觉阅读与推理结合方面仍存在不足。
English: LogicOCR is a new benchmark with 1,100 multiple-choice questions designed to assess large multimodal models' logical reasoning on text-rich images, revealing their limitations in integrating visual reading with reasoning despite advances in OCR.

Authors:Sijie Zhao, Feng Liu, Enzhuo Zhang, Yiqing Guo, Pengfeng Xiao, Lei Bai, Xueliang Zhang, Hao Chen
Title: Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction
Abstract:
The proliferation of multi-source remote sensing data has propelled the development of deep learning for dense prediction, yet significant challenges in data and task unification persist. Current deep learning architectures for remote sensing are fundamentally rigid. They are engineered for fixed input-output configurations, restricting their adaptability to the heterogeneous spatial, temporal, and spectral dimensions inherent in real-world data. Furthermore, these models neglect the intrinsic correlations among semantic segmentation, binary change detection, and semantic change detection, necessitating the development of distinct models or task-specific decoders. This paradigm is also constrained to a predefined set of output semantic classes, where any change to the classes requires costly retraining. To overcome these limitations, we introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling. STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands by leveraging their metadata for a unified representation. Moreover, STSUN unifies disparate dense prediction tasks within a single architecture by conditioning the model on trainable task embeddings. Similarly, STSUN facilitates flexible prediction across multiple sets of semantic categories by integrating trainable category embeddings as metadata. Extensive experiments on multiple datasets with diverse Spatial-Temporal-Spectral configurations in multiple scenarios demonstrate that a single STSUN model effectively adapts to heterogeneous inputs and outputs, unifying various dense prediction tasks and diverse semantic class predictions. The proposed approach consistently achieves state-of-the-art performance, highlighting its robustness and generalizability for complex remote sensing applications.
中文: 提出的时空谱统一网络(STSUN)通过可训练的嵌入机制适应异构遥感数据并统一多种密集预测任务,在多样化场景中均实现了最先进的性能表现。
English: The proposed Spatial-Temporal-Spectral Unified Network (STSUN) overcomes the limitations of rigid deep learning architectures by adapting to heterogeneous remote sensing data and unifying multiple dense prediction tasks through trainable embeddings, achieving state-of-the-art performance across diverse scenarios.

Authors:Md. Atiqur Rahman, Sabrina Islam, Mushfiqul Haque Omi
Title: LLM-Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark
Abstract:
Evaluating machine translation (MT) for low-resource languages poses a persistent challenge, primarily due to the limited availability of high-quality reference translations. This issue is further exacerbated in languages with multiple dialects, where linguistic diversity and data scarcity hinder robust evaluation. Large Language Models (LLMs) present a promising solution through reference-free evaluation techniques; however, their effectiveness diminishes in the absence of dialect-specific context and tailored guidance. In this work, we propose a comprehensive framework that enhances LLM-based MT evaluation using a dialect-guided approach. We extend the ONUBAD dataset by incorporating Sylheti-English sentence pairs, corresponding machine translations, and Direct Assessment (DA) scores annotated by native speakers. To address the vocabulary gap, we augment the tokenizer vocabulary with dialect-specific terms. We further introduce a regression head to enable scalar score prediction and design a dialect-guided (DG) prompting strategy. Our evaluation across multiple LLMs shows that the proposed pipeline consistently outperforms existing methods, achieving the highest gain of +0.1083 in Spearman correlation, along with improvements across other evaluation settings. The dataset and the code are available at https://github.com/180041123-Atiq/MTEonLowResourceLanguage.
中文: 本研究提出了一种方言引导的框架,通过整合方言特定数据和提示策略,增强了基于大语言模型的低资源语言机器翻译评估,在相关性指标上取得了显著提升。
English: This study introduces a dialect-guided framework that enhances LLM-based machine translation evaluation for low-resource languages by incorporating dialect-specific data and prompting strategies, achieving significant improvements in correlation metrics.
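The abstract above reports its gain in Spearman correlation between predicted scores and human Direct Assessment scores. As a reminder of what that metric measures, here is a minimal standard-library sketch: Spearman correlation is the Pearson correlation of the two rank vectors. The scores in the usage note are made-up illustrative values, not data from the paper, and the function names are hypothetical.

```python
def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(rank(x), rank(y))
```

For example, with illustrative human scores `[80, 55, 90, 30, 65]` and model scores `[0.7, 0.6, 0.9, 0.2, 0.5]`, only two items swap rank positions, and `spearman` returns approximately 0.9.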

Authors:ZhanFeng Feng, Long Peng, Xin Di, Yong Guo, Wenbo Li, Yulun Zhang, Renjing Pei, Yang Wang, Yang Cao, Zheng-Jun Zha
Title: PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement
Abstract:
Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences by leveraging temporal information from multiple frames, which are widely used in streaming video processing, surveillance, and generation. Although numerous Transformer-based enhancement methods have achieved impressive performance, their computational and memory demands hinder deployment on edge devices. Quantization offers a practical solution by reducing the bit-width of weights and activations to improve efficiency. However, directly applying existing quantization methods to video enhancement tasks often leads to significant performance degradation and loss of fine details. This stems from two limitations: (a) inability to allocate varying representational capacity across frames, which results in suboptimal dynamic range adaptation; (b) over-reliance on full-precision teachers, which limits the learning of low-bit student models. To tackle these challenges, we propose a novel quantization method for video enhancement: Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD). BMFQ utilizes a percentile-based initialization and iterative search with pruning and backtracking for robust clipping bounds. PMTD employs a progressive distillation strategy with both full-precision and multiple high-bit (INT) teachers to enhance low-bit models' capacity and quality. Extensive experiments demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance across multiple tasks and benchmarks. The code will be made publicly available at: https://github.com/xiaoBIGfeng/PMQ-VE.
中文: 摘要提出PMQ-VE这一新型量化方法,通过回溯式多帧量化和渐进式多教师蒸馏的两阶段策略,有效解决了现有视频增强量化方法在动态范围适应和模型学习能力方面的不足,在多个任务和基准测试中实现了最优性能。
English: The abstract introduces PMQ-VE, a novel quantization method that addresses the limitations of existing approaches in video enhancement by employing a two-stage process with backtracking-based multi-frame quantization and progressive multi-teacher distillation, achieving state-of-the-art performance while reducing computational demands.
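The abstract mentions a percentile-based initialization of clipping bounds. The sketch below illustrates only that generic idea (choosing clipping bounds from percentiles so that outliers do not stretch the quantization range) followed by plain uniform quantization. It is not the authors' BMFQ procedure: the iterative search with pruning and backtracking is omitted, and the function names, the 1st/99th percentile defaults, and the 4-bit setting are assumptions made for illustration.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def init_clipping_bounds(activations, lower_q=1.0, upper_q=99.0):
    """Hypothetical initializer: clip at the given lower/upper percentiles."""
    s = sorted(activations)
    return percentile(s, lower_q), percentile(s, upper_q)


def quantize(x, lo, hi, bits=4):
    """Uniformly quantize x onto 2**bits levels spanning [lo, hi] (needs hi > lo)."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    clipped = min(max(x, lo), hi)      # values outside the bounds saturate
    q = round((clipped - lo) / scale)  # integer level index in [0, levels]
    return lo + q * scale              # de-quantized value
```

With activations `0.0, 1.0, ..., 99.0` plus a single outlier `1000.0`, the percentile bounds come out as `(1.0, 99.0)`, so the outlier saturates at the upper bound instead of widening the quantization step for every other value.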

Authors:Quanjiang Guo, Jinchuan Zhang, Sijie Wang, Ling Tian, Zhao Kang, Bin Yan, Weidong Xiao
Title: Bridging Generative and Discriminative Learning: Few-Shot Relation Extraction via Two-Stage Knowledge-Guided Pre-training
Abstract:
Few-Shot Relation Extraction (FSRE) remains a challenging task due to the scarcity of annotated data and the limited generalization capabilities of existing models. Although large language models (LLMs) have demonstrated potential in FSRE through in-context learning (ICL), their general-purpose training objectives often result in suboptimal performance for task-specific relation extraction. To overcome these challenges, we propose TKRE (Two-Stage Knowledge-Guided Pre-training for Relation Extraction), a novel framework that synergistically integrates LLMs with traditional relation extraction models, bridging generative and discriminative learning paradigms. TKRE introduces two key innovations: (1) leveraging LLMs to generate explanation-driven knowledge and schema-constrained synthetic data, addressing the issue of data scarcity; and (2) a two-stage pre-training strategy combining Masked Span Language Modeling (MSLM) and Span-Level Contrastive Learning (SCL) to enhance relational reasoning and generalization. Together, these components enable TKRE to effectively tackle FSRE tasks. Comprehensive experiments on benchmark datasets demonstrate the efficacy of TKRE, achieving new state-of-the-art performance in FSRE and underscoring its potential for broader application in low-resource scenarios. The code and data are released at https://github.com/UESTC-GQJ/TKRE.
中文: 提出的TKRE框架通过解释驱动的知识生成和两阶段预训练策略,将大语言模型与传统关系抽取技术相结合,有效解决数据稀缺问题并提升泛化能力,在小样本关系抽取任务中实现了最先进的性能。
English: The proposed TKRE framework integrates large language models with traditional relation extraction techniques through explanation-driven knowledge generation and a two-stage pre-training strategy, achieving state-of-the-art performance in Few-Shot Relation Extraction by effectively addressing data scarcity and enhancing generalization.

Authors:Yeonkyung Lee, Woojung Han, Youngjun Jun, Hyeonmin Kim, Jungkyung Cho, Seong Jae Hwang
Title: PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning
Abstract:
Retinal foundation models have significantly advanced retinal image analysis by leveraging self-supervised learning to reduce dependence on labeled data while achieving strong generalization. Many recent approaches enhance retinal image understanding using report supervision, but obtaining clinical reports is often costly and challenging. In contrast, metadata (e.g., age, gender) is widely available and serves as a valuable resource for analyzing disease progression. To effectively incorporate patient-specific information, we propose PRETI, a retinal foundation model that integrates metadata-aware learning with robust self-supervised representation learning. We introduce Learnable Metadata Embedding (LME), which dynamically refines metadata representations. Additionally, we construct patient-level data pairs, associating images from the same individual to improve robustness against non-clinical variations. To further optimize retinal image representation, we propose Retina-Aware Adaptive Masking (RAAM), a strategy that selectively applies masking within the retinal region and dynamically adjusts the masking ratio during training. PRETI captures both global structures and fine-grained pathological details, resulting in superior diagnostic performance. Extensive experiments demonstrate that PRETI achieves state-of-the-art results across diverse diseases and biomarker predictions using in-house and public data, indicating the importance of metadata-guided foundation models in retinal disease analysis. Our code and pretrained model are available at https://github.com/MICV-yonsei/PRETI
中文: PRETI视网膜基础模型通过元数据感知学习和自适应掩码策略,有效结合患者信息与图像特征,在多种视网膜疾病诊断中实现了最优性能。
English: The PRETI retinal foundation model integrates metadata-aware learning with self-supervised representation techniques, employing learnable embeddings and adaptive masking to achieve state-of-the-art diagnostic performance across diverse retinal diseases.

Authors:Shaobo Wang, Xiangqi Jin, Ziming Wang, Jize Wang, Jiajun Zhang, Kaixin Li, Zichen Wen, Zhong Li, Conghui He, Xuming Hu, Linfeng Zhang
Title: Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning
Abstract:
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model's predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4$\times$ speedup. The code is available at https://github.com/gszfwsb/Data-Whisperer.
Chinese: Data Whisperer 是一种高效、无需训练的基于注意力机制的少样本上下文学习方法,能够为大型语言模型微调选择最优数据子集,以更少的数据和更快的速度超越现有方法的性能。
English: Data Whisperer is an efficient, training-free method that uses attention-based, few-shot in-context learning to select optimal data subsets for fine-tuning LLMs, achieving superior performance with significantly less data and faster speeds than existing approaches.

Authors:Riad Hossain, Muhammad Ashad Kabir, Arat Ibne Golam Mowla, Animesh Chandra Roy, Ranjit Kumar Ghosh
Title: BenSParX: A Robust Explainable Machine Learning Framework for Parkinson's Disease Detection from Bengali Conversational Speech
Abstract:
Parkinson's disease (PD) poses a growing global health challenge, with Bangladesh experiencing a notable rise in PD-related mortality. Early detection of PD remains particularly challenging in resource-constrained settings, where voice-based analysis has emerged as a promising non-invasive and cost-effective alternative. However, existing studies predominantly focus on English or other major languages; notably, no voice dataset for PD exists for Bengali - posing a significant barrier to culturally inclusive and accessible healthcare solutions. Moreover, most prior studies employed only a narrow set of acoustic features, with limited or no hyperparameter tuning and feature selection strategies, and little attention to model explainability. This restricts the development of a robust and generalizable machine learning model. To address this gap, we present BenSparX, the first Bengali conversational speech dataset for PD detection, along with a robust and explainable machine learning framework tailored for early diagnosis. The proposed framework incorporates diverse acoustic feature categories, systematic feature selection methods, and state-of-the-art machine learning algorithms with extensive hyperparameter optimization. Furthermore, to enhance interpretability and trust in model predictions, the framework incorporates SHAP (SHapley Additive exPlanations) analysis to quantify the contribution of individual acoustic features toward PD detection. Our framework achieves state-of-the-art performance, yielding an accuracy of 95.77%, F1 score of 95.57%, and AUC-ROC of 0.982. We further externally validated our approach by applying the framework to existing PD datasets in other languages, where it consistently outperforms state-of-the-art approaches. To facilitate further research and reproducibility, the dataset has been made publicly available at https://github.com/Riad071/BenSParX.
中文: 本研究推出了首个用于帕金森病检测的孟加拉语语音数据集BenSparX,并提出了一种鲁棒的机器学习框架,通过全面的特征工程和可解释AI技术实现了最优性能。
English: This study introduces BenSparX, the first Bengali speech dataset for Parkinson's disease detection, and presents a robust machine learning framework that achieves state-of-the-art performance through comprehensive feature engineering and explainable AI techniques.

Authors:Wenquan Lu, Jiaqi Zhang, Hugues Van Assel, Randall Balestriero
Title: Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum
Abstract:
Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining an SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.
Chinese: 本研究提出了一种自监督框架,通过采用基于去噪器训练的课程学习和教师引导正则化,增强了噪声鲁棒性表征学习,在无需推理阶段去噪器的情况下显著提升了模型在噪声数据上的性能。
English: This study introduces a self-supervised framework that enhances noise-robust representation learning by employing a denoiser-trained curriculum and teacher-guided regularization, eliminating the need for denoisers during inference while significantly improving model performance on noisy data.
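The denoised-to-noisy curriculum described above can be pictured as a simple schedule over training epochs. The sketch below is only an illustration under stated assumptions: the hard switch at a single epoch, the function names, and the mean-squared-distance form of the anchoring penalty are all hypothetical choices, since the abstract does not specify the schedule or the exact regularizer.

```python
def data_source_for_epoch(epoch, switch_epoch):
    """Pick which copy of the data to train on: denoised first, then noisy."""
    return "denoised" if epoch < switch_epoch else "noisy"


def build_schedule(total_epochs, switch_epoch):
    """List the data source used at every epoch of the curriculum."""
    return [data_source_for_epoch(e, switch_epoch) for e in range(total_epochs)]


def anchor_penalty(noisy_embedding, denoised_embedding):
    """Assumed regularizer: mean squared distance between a noisy embedding
    and the embedding of its denoised counterpart."""
    diffs = [(a - b) ** 2 for a, b in zip(noisy_embedding, denoised_embedding)]
    return sum(diffs) / len(diffs)
```

For instance, `build_schedule(5, 2)` yields two denoised epochs followed by three noisy ones, and the penalty would only be added during the noisy phase, where it pulls noisy embeddings toward their denoised anchors.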

Authors:Hanyu Wang, Xinrui Wu, Zijian Ding, Su Zheng, Chengyue Wang, Tony Nowatzki, Yizhou Sun, Jason Cong
Title: LLM-DSE: Searching Accelerator Parameters with LLM Agents
Abstract:
Even though high-level synthesis (HLS) tools mitigate the challenges of programming domain-specific accelerators (DSAs) by raising the abstraction level, optimizing hardware directive parameters remains a significant hurdle. Existing heuristic and learning-based methods struggle with adaptability and sample efficiency. We present LLM-DSE, a multi-agent framework designed specifically for optimizing HLS directives. Combining LLMs with design space exploration (DSE), our explorer coordinates four agents: Router, Specialists, Arbitrator, and Critic. These multi-agent components interact with various tools to accelerate the optimization process. LLM-DSE leverages essential domain knowledge to identify efficient parameter combinations while maintaining adaptability through verbal learning from online interactions. Evaluations on the HLSyn dataset demonstrate that LLM-DSE achieves substantial $2.55\times$ performance gains over state-of-the-art methods, uncovering novel designs while reducing runtime. Ablation studies validate the effectiveness and necessity of the proposed agent interactions. Our code is open-sourced here: https://github.com/Nozidoali/LLM-DSE.
Chinese: LLM-DSE是一个多智能体框架,通过将大语言模型与设计空间探索相结合来优化高层次综合指令,在协调智能体交互的同时实现了比现有最优方法2.55倍的性能提升并减少了运行时间。
English: LLM-DSE is a multi-agent framework that optimizes high-level synthesis directives by combining large language models with design space exploration, achieving 2.55× performance gains over state-of-the-art methods while reducing runtime through coordinated agent interactions.

Authors:Omar Choukrani, Idriss Malek, Daniil Orel, Zhuohan Xie, Zangir Iklassov, Martin Takáč, Salem Lahlou
Title: LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs
Abstract:
Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce LLM-BabyBench, a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state (Predict task), (2) generating sequences of low-level actions to achieve specified objectives (Plan task), and (3) decomposing high-level instructions into coherent subgoal sequences (Decompose task). We detail the methodology for generating the three corresponding datasets (LLM-BabyBench-Predict, LLM-BabyBench-Plan, LLM-BabyBench-Decompose) by extracting structured information from an expert agent operating within the text-based environment. Furthermore, we provide a standardized evaluation harness and metrics, including environment interaction for validating generated plans, to facilitate reproducible assessment of diverse LLMs. Initial baseline results highlight the challenges posed by these grounded reasoning tasks. The benchmark suite, datasets, data generation code, and evaluation code are made publicly available (GitHub: https://github.com/choukrani/llm-babybench, HuggingFace: https://huggingface.co/datasets/salem-mbzuai/LLM-BabyBench).
中文: 本文提出LLM-BabyBench基准套件,基于文本版BabyAI网格世界评估大语言模型在预测行动结果、规划行动序列和分解指令方面的基础推理能力,并公开了数据集与评估工具。
English: This paper introduces LLM-BabyBench, a benchmark suite built on a text-based BabyAI grid world to evaluate LLMs' grounded reasoning abilities in predicting action consequences, planning action sequences, and decomposing instructions, with datasets and evaluation tools made publicly available.

Authors:Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova
Title: HISTAI: An Open-Source, Large-Scale Whole Slide Image Dataset for Computational Pathology
Abstract:
Recent advancements in Digital Pathology (DP), particularly through artificial intelligence and Foundation Models, have underscored the importance of large-scale, diverse, and richly annotated datasets. Despite their critical role, publicly available Whole Slide Image (WSI) datasets often lack sufficient scale, tissue diversity, and comprehensive clinical metadata, limiting the robustness and generalizability of AI models. In response, we introduce the HISTAI dataset, a large, multimodal, open-access WSI collection comprising over 60,000 slides from various tissue types. Each case in the HISTAI dataset is accompanied by extensive clinical metadata, including diagnosis, demographic information, detailed pathological annotations, and standardized diagnostic coding. The dataset aims to fill gaps identified in existing resources, promoting innovation, reproducibility, and the development of clinically relevant computational pathology solutions. The dataset can be accessed at https://github.com/HistAI/HISTAI.
Chinese: HISTAI数据集作为一个大型、多模态、开放获取的资源,包含超过6万张全切片图像及丰富的临床元数据,旨在弥补现有资源的不足,推动数字病理学中人工智能模型的稳健性和创新发展。
English: The HISTAI dataset is introduced as a large, multimodal, open-access collection of over 60,000 whole slide images with extensive clinical metadata to address limitations in existing resources and enhance AI model robustness in digital pathology.

Authors:Jiarui Wang, Huiyu Duan, Ziheng Jia, Yu Zhao, Woo Yi Yang, Zicheng Zhang, Zijian Chen, Juntong Wang, Yuke Xing, Guangtao Zhai, Xiongkuo Min
Title: LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation
Abstract:
Recent advancements in large multimodal models (LMMs) have driven substantial progress in both text-to-video (T2V) generation and video-to-text (V2T) interpretation tasks. However, current AI-generated videos (AIGVs) still exhibit limitations in terms of perceptual quality and text-video alignment. Therefore, a reliable and scalable automatic model for AIGV evaluation is desirable, which heavily relies on the scale and quality of human annotations. To this end, we present AIGVE-60K, a comprehensive dataset and benchmark for AI-Generated Video Evaluation, which features (i) comprehensive tasks, encompassing 3,050 extensive prompts across 20 fine-grained task dimensions, (ii) the largest human annotations, including 120K mean-opinion scores (MOSs) and 60K question-answering (QA) pairs annotated on 58,500 videos generated from 30 T2V models, and (iii) bidirectional benchmarking and evaluation of both T2V generation and V2T interpretation capabilities. Based on AIGVE-60K, we propose LOVE, an LMM-based metric for AIGV Evaluation from multiple dimensions including perceptual preference, text-video correspondence, and task-specific accuracy in terms of both instance level and model level. Comprehensive experiments demonstrate that LOVE not only achieves state-of-the-art performance on the AIGVE-60K dataset, but also generalizes effectively to a wide range of other AIGV evaluation benchmarks. These findings highlight the significance of the AIGVE-60K dataset. Database and code are anonymously available at https://github.com/IntMeGroup/LOVE.
中文:AIGVE-60K数据集通过大规模人工标注和新型多模态评估指标LOVE,解决了AI生成视频在感知质量与文本对齐方面的评估难题,并在多个基准测试中展现出最优性能。
English: The AIGVE-60K dataset and benchmark address limitations in AI-generated video evaluation by providing extensive human annotations and introducing LOVE, a multimodal metric that achieves state-of-the-art performance across multiple evaluation dimensions.

Authors:Ninghan Zhong, Steven Caro, Avraiem Iskandar, Megnath Ramesh, Stephen L. Smith
Title: Bench-NPIN: Benchmarking Non-prehensile Interactive Navigation
Abstract:
Mobile robots are increasingly deployed in unstructured environments where obstacles and objects are movable. Navigation in such environments is known as interactive navigation, where task completion requires not only avoiding obstacles but also strategic interactions with movable objects. Non-prehensile interactive navigation focuses on non-grasping interaction strategies, such as pushing, rather than relying on prehensile manipulation. Despite a growing body of research in this field, most solutions are evaluated using case-specific setups, limiting reproducibility and cross-comparison. In this paper, we present Bench-NPIN, the first comprehensive benchmark for non-prehensile interactive navigation. Bench-NPIN includes multiple components: 1) a comprehensive range of simulated environments for non-prehensile interactive navigation tasks, including navigating a maze with movable obstacles, autonomous ship navigation in icy waters, box delivery, and area clearing, each with varying levels of complexity; 2) a set of evaluation metrics that capture unique aspects of interactive navigation, such as efficiency, interaction effort, and partial task completion; and 3) demonstrations using Bench-NPIN to evaluate example implementations of established baselines across environments. Bench-NPIN is an open-source Python library with a modular design. The code, documentation, and trained models can be found at https://github.com/IvanIZ/BenchNPIN.
中文摘要:本文提出首个非抓取式交互导航综合基准Bench-NPIN,包含模拟环境、评估指标和基线实现,通过模块化开源工具解决该领域可复现性不足的问题。
English Summary: This paper introduces Bench-NPIN, the first comprehensive benchmark for non-prehensile interactive navigation in robotics, featuring simulated environments, evaluation metrics, and baseline implementations to address reproducibility challenges in the field.

Authors:Yuqi Li, Kai Li, Xin Yin, Zhifei Yang, Junhao Dong, Zeyu Dong, Chuanguang Yang, Yingli Tian, Yao Lu
Title: SepPrune: Structured Pruning for Efficient Deep Speech Separation
Abstract:
Although deep learning has substantially advanced speech separation in recent years, most existing studies continue to prioritize separation quality while overlooking computational efficiency, an essential factor for low-latency speech processing in real-time applications. In this paper, we propose SepPrune, the first structured pruning framework specifically designed to compress deep speech separation models and reduce their computational cost. SepPrune begins by analyzing the computational structure of a given model to identify layers with the highest computational burden. It then introduces a differentiable masking strategy to enable gradient-driven channel selection. Based on the learned masks, SepPrune prunes redundant channels and fine-tunes the remaining parameters to recover performance. Extensive experiments demonstrate that this learnable pruning paradigm yields substantial advantages for channel pruning in speech separation models, outperforming existing methods. Notably, a model pruned with SepPrune can recover 85% of the performance of a pre-trained model (trained over hundreds of epochs) with only one epoch of fine-tuning, and achieves convergence 36$\times$ faster than training from scratch. Code is available at https://github.com/itsnotacie/SepPrune.
中文: 本文提出了SepPrune框架,通过分析模型计算结构并采用可微分掩码策略剪枝冗余通道,在保持语音分离性能的同时大幅提升计算效率,仅需一轮微调即可恢复85%性能。
English: This paper introduces SepPrune, a structured pruning framework that compresses deep speech separation models by identifying computationally heavy layers and using differentiable masking to prune redundant channels, achieving significant efficiency gains with minimal performance loss.

Authors:Yijie Zheng, Jinxuan Yang, Yu Chen, Yaxuan Wang, Yihang Lu, Guoqing Li
Title: Beluga Whale Detection from Satellite Imagery with Point Labels
Abstract:
Very high-resolution (VHR) satellite imagery has emerged as a powerful tool for monitoring marine animals on a large scale. However, existing deep learning-based whale detection methods usually require manually created, high-quality bounding box annotations, which are labor-intensive to produce. Moreover, existing studies often exclude "uncertain whales", individuals that have ambiguous appearances in satellite imagery, limiting the applicability of these models in real-world scenarios. To address these limitations, this study introduces an automated pipeline for detecting beluga whales and harp seals in VHR satellite imagery. The pipeline leverages point annotations and the Segment Anything Model (SAM) to generate precise bounding box annotations, which are used to train YOLOv8 for multiclass detection of certain whales, uncertain whales, and harp seals. Experimental results demonstrated that SAM-generated annotations significantly improved detection performance, achieving higher F1-scores compared to traditional buffer-based annotations. YOLOv8 trained on SAM-labeled boxes achieved an F1-score of 72.2% for whales overall and 70.3% for harp seals, with superior performance in dense scenes. The proposed approach not only reduces the manual effort required for annotation but also enhances the detection of uncertain whales, offering a more comprehensive solution for marine animal monitoring. This method holds great potential for extending to other species, habitats, and remote sensing platforms, as well as for estimating whale biometrics, thereby advancing ecological monitoring and conservation efforts. The code for our labeling and detection pipeline is publicly available at http://github.com/voyagerxvoyagerx/beluga-seeker.
中文: 本研究提出了一种自动化流程,利用点标注和Segment Anything Model(SAM)生成精确边界框来训练YOLOv8,显著提升了在超高分辨率卫星图像中对白鲸、不确定鲸鱼和竖琴海豹的检测性能,同时减少了人工标注工作量。
English: This study introduces an automated pipeline using point annotations and the Segment Anything Model (SAM) to generate precise bounding boxes for training YOLOv8, significantly improving the detection of beluga whales, uncertain whales, and harp seals in very high-resolution satellite imagery while reducing manual annotation efforts.
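Two ingredients of the abstract above are easy to make concrete: the buffer-based baseline, which turns a point label into a fixed-size box, and the F1-score used to compare detectors. The sketch below shows both in plain Python. The buffer half-size and the detection counts in the usage note are made-up illustrative values, not figures from the paper, and the function names are hypothetical; the SAM-based box generation itself is not reproduced here.

```python
def buffer_box(x, y, half_size):
    """Buffer-based annotation: a square box of side 2*half_size centred on a
    point label, returned as (x_min, y_min, x_max, y_max)."""
    return (x - half_size, y - half_size, x + half_size, y + half_size)


def f1_score(true_pos, false_pos, false_neg):
    """F1 = harmonic mean of precision and recall, computed from counts."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

As an illustration, `buffer_box(100, 50, 8)` gives `(92, 42, 108, 58)` regardless of the animal's actual extent, which is the limitation that shape-aware boxes aim to address. A detector with 8 true positives, 2 false positives, and 2 false negatives has precision and recall of 0.8 each, hence an F1-score of about 0.8.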

Authors:Tiannuo Yang, Zebin Yao, Bowen Jin, Lixiao Cui, Yusen Li, Gang Wang, Xiaoguang Liu
Title: Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents
Abstract:
Large Language Model (LLM)-based search agents have shown remarkable capabilities in solving complex tasks by dynamically decomposing problems and addressing them through interleaved reasoning and retrieval. However, this interleaved paradigm introduces substantial efficiency bottlenecks. First, we observe that both highly accurate and overly approximate retrieval methods degrade system efficiency: exact search incurs significant retrieval overhead, while coarse retrieval requires additional reasoning steps during generation. Second, we identify inefficiencies in system design, including improper scheduling and frequent retrieval stalls, which lead to cascading latency -- where even minor delays in retrieval amplify end-to-end inference time. To address these challenges, we introduce SearchAgent-X, a high-efficiency inference framework for LLM-based search agents. SearchAgent-X leverages high-recall approximate retrieval and incorporates two key techniques: priority-aware scheduling and non-stall retrieval. Extensive experiments demonstrate that SearchAgent-X consistently outperforms state-of-the-art systems such as vLLM and HNSW-based retrieval across diverse tasks, achieving up to 3.4$\times$ higher throughput and 5$\times$ lower latency, without compromising generation quality. SearchAgent-X is available at https://github.com/tiannuo-yang/SearchAgent-X.
中文: 基于大语言模型的搜索代理因检索方法和系统设计存在效率瓶颈,而SearchAgent-X通过高召回近似检索、优先级调度和无中断检索技术,在不影响生成质量的前提下大幅提升了吞吐量并降低了延迟。
English: LLM-based search agents face efficiency issues from retrieval methods and system design, which SearchAgent-X addresses using high-recall retrieval, priority-aware scheduling, and non-stall retrieval to significantly boost throughput and reduce latency without sacrificing quality.

Authors:Peng Ding, Jun Kuang, Zongyu Wang, Xuezhi Cao, Xunliang Cai, Jiajun Chen, Shujian Huang
Title: Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement
Abstract:
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks. In this paper, we identify a critical safety gap: while LLMs are adept at detecting jailbreak prompts, they often produce unsafe responses when directly processing these inputs. Inspired by this insight, we propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability. SAGE consists of two core components: a Discriminative Analysis Module and a Discriminative Response Module, enhancing resilience against sophisticated jailbreak attempts through flexible safety discrimination instructions. Extensive experiments demonstrate SAGE's effectiveness and robustness across various open-source and closed-source LLMs of different sizes and architectures, achieving an average 99% defense success rate against numerous complex and covert jailbreak methods while maintaining helpfulness on general benchmarks. We further conduct mechanistic interpretability analysis through hidden states and attention distributions, revealing the underlying mechanisms of this detection-generation discrepancy. Our work thus contributes to developing future LLMs with coherent safety awareness and generation behavior. Our code and datasets are publicly available at https://github.com/NJUNLP/SAGE.
中文摘要:本文提出无需训练的防御策略SAGE,通过协调大语言模型强大的安全识别能力与相对薄弱的安全生成能力,有效抵御复杂越狱攻击并保持通用帮助性,实现了高达99%的平均防御成功率。
English Summary: The paper introduces SAGE, a training-free defense strategy that enhances LLM safety by aligning their strong jailbreak detection with improved response generation, achieving high defense rates against sophisticated attacks while maintaining general helpfulness.

Authors:Tianxiong Zhong, Xingye Tian, Boyuan Jiang, Xuebo Wang, Xin Tao, Pengfei Wan, Zhiwei Zhang
Title: VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption
Abstract:
Modern video generation frameworks based on Latent Diffusion Models suffer from inefficiencies in tokenization due to the Frame-Proportional Information Assumption. Existing tokenizers provide fixed temporal compression rates, causing the computational cost of the diffusion model to scale linearly with the frame rate. The paper proposes the Duration-Proportional Information Assumption: the upper bound on the information capacity of a video is proportional to the duration rather than the number of frames. Based on this insight, the paper introduces VFRTok, a Transformer-based video tokenizer, that enables variable frame rate encoding and decoding through asymmetric frame rate training between the encoder and decoder. Furthermore, the paper proposes Partial Rotary Position Embeddings (RoPE) to decouple position and content modeling, which groups correlated patches into unified tokens. The Partial RoPE effectively improves content-awareness, enhancing the video generation capability. Benefiting from the compact and continuous spatio-temporal representation, VFRTok achieves competitive reconstruction quality and state-of-the-art generation fidelity while using only 1/8 tokens compared to existing tokenizers. The code and weights are released at: https://github.com/KwaiVGI/VFRTok.
中文摘要:本文提出基于Transformer的视频分词器VFRTok,采用时长比例信息假设实现可变帧率编码,仅需1/8的令牌数量即可保持卓越的视频生成质量,同时引入部分旋转位置编码提升内容感知能力。
English Summary: The paper introduces VFRTok, a Transformer-based video tokenizer that adopts the Duration-Proportional Information Assumption to enable variable frame rate encoding, using only 1/8 of the tokens required by existing tokenizers while maintaining competitive reconstruction quality and state-of-the-art generation fidelity.

Authors:Yinghui Zhang, Tailin Chen, Yuchen Zhang, Zeyu Fu
Title: Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion
Abstract:
The rapid rise of video content on platforms such as TikTok and YouTube has transformed information dissemination, but it has also facilitated the spread of harmful content, particularly hate videos. Despite significant efforts to combat hate speech, detecting these videos remains challenging due to their often implicit nature. Current detection methods primarily rely on unimodal approaches, which inadequately capture the complementary features across different modalities. While multimodal techniques offer a broader perspective, many fail to effectively integrate temporal dynamics and modality-wise interactions essential for identifying nuanced hate content. In this paper, we present CMFusion, an enhanced multimodal hate video detection model utilizing a novel Channel-wise and Modality-wise Fusion Mechanism. CMFusion first extracts features from text, audio, and video modalities using pre-trained models and then incorporates a temporal cross-attention mechanism to capture dependencies between video and audio streams. The learned features are then processed by channel-wise and modality-wise fusion modules to obtain informative representations of videos. Our extensive experiments on a real-world dataset demonstrate that CMFusion significantly outperforms five widely used baselines in terms of accuracy, precision, recall, and F1 score. Comprehensive ablation studies and parameter analyses further validate our design choices, highlighting the model's effectiveness in detecting hate videos. The source codes will be made publicly available at https://github.com/EvelynZ10/cmfusion.
Chinese: 针对TikTok和YouTube等平台上仇恨视频检测的难题,CMFusion模型通过时序跨模态注意力机制和通道-模态双融合策略,有效整合多模态特征,在真实数据集上显著提升了检测精度与召回率。
English: The proliferation of hate videos on platforms like TikTok and YouTube poses detection challenges, which CMFusion addresses through a multimodal fusion model that integrates temporal dynamics and cross-modal interactions, achieving superior performance in identifying nuanced harmful content.

Authors:Kazuhiko Kawamoto, Atsuhiro Endo, Hiroshi Kera
Title: Cross-Model Transfer of Task Vectors via Few-Shot Orthogonal Alignment
Abstract:
Task arithmetic enables efficient model editing by representing task-specific changes as vectors in parameter space. Task arithmetic typically assumes that the source and target models are initialized from the same pre-trained parameters. This assumption limits its applicability in cross-model transfer settings, where models are independently pre-trained on different datasets. To address this challenge, we propose a method based on few-shot orthogonal alignment, which aligns task vectors to the parameter space of a differently pre-trained target model. These transformations preserve key properties of task vectors, such as norm and rank, and are learned using only a small number of labeled examples. We evaluate the method using two Vision Transformers pre-trained on YFCC100M and LAION400M, and test on eight classification datasets. Experimental results show that our method improves transfer accuracy over direct task vector application and achieves performance comparable to few-shot fine-tuning, while maintaining the modularity and reusability of task vectors. Our code is available at https://github.com/kawakera-lab/CrossModelTransfer.
中文摘要:本研究提出一种少样本正交对齐方法,使任务向量能够在独立预训练的模型间实现有效迁移,在保持任务向量模块化特性的同时,获得了与微调相当的性能表现。
English Summary: This study introduces a few-shot orthogonal alignment method to enable task vector transfer between independently pre-trained models, achieving comparable performance to fine-tuning while preserving task vector modularity.

Authors:Mingcheng Qu, Guang Yang, Donglin Di, Tonghua Su, Yue Gao, Yang Song, Lei Fan
Title: Multimodal Cancer Survival Analysis via Hypergraph Learning with Cross-Modality Rebalance
Abstract:
Multimodal pathology-genomic analysis has become increasingly prominent in cancer survival prediction. However, existing studies mainly utilize multi-instance learning to aggregate patch-level features, neglecting the loss of contextual and hierarchical details within pathology images. Furthermore, the disparity in data granularity and dimensionality between pathology and genomics leads to a significant modality imbalance. The high spatial resolution inherent in pathology data gives it a dominant role, overshadowing genomics in multimodal integration. In this paper, we propose a multimodal survival prediction framework that incorporates hypergraph learning to effectively capture both contextual and hierarchical details from pathology images. Moreover, it employs a modality rebalance mechanism and an interactive alignment fusion strategy to dynamically reweight the contributions of the two modalities, thereby mitigating the pathology-genomics imbalance. Quantitative and qualitative experiments are conducted on five TCGA datasets, demonstrating that our model outperforms advanced methods by over 3.4% in C-Index performance.
Chinese: 本文提出了一种多模态生存预测框架,利用超图学习捕捉病理图像的上下文和层次细节,并通过模态再平衡机制和交互对齐策略解决病理与基因组数据间的不平衡问题,在TCGA数据集上C-Index性能提升超过3.4%。
English: This paper introduces a multimodal survival prediction framework using hypergraph learning to capture contextual and hierarchical details from pathology images, along with a modality rebalance mechanism and interactive alignment to address the imbalance between pathology and genomic data, achieving over 3.4% higher C-Index performance on TCGA datasets.
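
The C-Index reported above is a standard ranking metric for survival models. The following is a minimal sketch of Harrell's concordance index in plain Python, not the authors' evaluation code; the function name and argument layout are assumptions made for illustration. A pair (i, j) is treated as comparable when subject i had an observed event strictly before subject j's recorded time.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times  -- observed time for each subject (event or censoring time)
    events -- 1 if the event was observed, 0 if the subject was censored
    risks  -- predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event got the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a gain of a few points is measured on that scale.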

Authors:Puning Yang, Qizhou Wang, Zhuo Huang, Tongliang Liu, Chengqi Zhang, Bo Han
Title: Exploring Criteria of Loss Reweighting to Enhance LLM Unlearning
Abstract:
Loss reweighting has shown significant benefits for machine unlearning with large language models (LLMs). However, its exact function remains unclear and the optimal strategy is still an open question, thus impeding the understanding and improvement of existing methodologies. In this paper, we identify two distinct goals of loss reweighting, namely, Saturation and Importance -- the former indicates that those insufficiently optimized data should be emphasized, while the latter stresses some critical data that are most influential for loss minimization. To study their usefulness, we design specific reweighting strategies for each goal and evaluate their respective effects on unlearning. We conduct extensive empirical analyses on well-established benchmarks, and summarize some important observations as follows: (i) Saturation enhances efficacy more than importance-based reweighting, and their combination can yield additional improvements. (ii) Saturation typically allocates lower weights to data with lower likelihoods, whereas importance-based reweighting does the opposite. (iii) The efficacy of unlearning is also largely influenced by the smoothness and granularity of the weight distributions. Based on these findings, we propose SatImp, a simple reweighting method that combines the advantages of both saturation and importance. Empirical results on extensive datasets validate the efficacy of our method, potentially bridging existing research gaps and indicating directions for future research. Our code is available at https://github.com/tmlr-group/SatImp.
中文: 损失重加权在大型语言模型的机器遗忘中通过饱和度和重要性两个目标得以阐明,其中饱和度更有效,两者结合的SatImp方法展现出更优性能。
English: Loss reweighting in machine unlearning for LLMs is clarified through two goals—Saturation and Importance—with Saturation proving more effective, and their combination in the SatImp method showing improved performance.

Authors:Xuanle Zhao, Xuexin Liu, Haoyue Yang, Xianzhen Luo, Fanhu Zeng, Jianling Li, Qi Shi, Chi Chen
Title: ChartEdit: How Far Are MLLMs From Automating Chart Analysis? Evaluating MLLMs' Capability via Chart Editing
Abstract:
Although multimodal large language models (MLLMs) show promise in generating chart rendering code, editing charts via code presents a greater challenge. This task demands MLLMs to integrate chart understanding and reasoning capacities, which are labor-intensive. While many MLLMs claim such editing capabilities, current evaluations rely on limited case studies, highlighting the urgent need for a comprehensive evaluation framework. In this work, we propose ChartEdit, a novel benchmark designed for chart editing tasks, featuring 1,405 diverse editing instructions applied to 233 real-world charts, each manually annotated and validated for accuracy. Utilizing ChartEdit, we evaluate the performance of 10 mainstream MLLMs across two types of experiments at both the code and chart levels. The results suggest that large-scale models can generate code to produce images that partially match the reference images. However, their ability to generate accurate edits according to the instructions remains limited. The state-of-the-art (SOTA) model achieves a score of only 59.96, highlighting significant challenges in precise modification. In contrast, small-scale models, including chart-domain models, struggle both with following editing instructions and generating overall chart images, underscoring the need for further development in this area. Code is available at https://github.com/xxlllz/ChartEdit.
Chinese: 提出的ChartEdit基准测试评估了多模态大语言模型在图表编辑任务中的表现,发现大型模型虽能部分匹配参考图像,但在精确修改方面存在困难,而小型模型在遵循指令和图表生成方面面临更大挑战。
English: The proposed ChartEdit benchmark evaluates multimodal large language models (MLLMs) on chart editing tasks, revealing that while large models can partially match reference images, they struggle with precise modifications, and small models face even greater challenges in instruction-following and chart generation.

Authors:Yuyao Zhang, Zhicheng Dou, Xiaoxi Li, Jiajie Jin, Yongkang Wu, Zhonghua Li, Qi Ye, Ji-Rong Wen
Title: Neuro-Symbolic Query Compiler
Abstract:
Precise recognition of search intent in Retrieval-Augmented Generation (RAG) systems remains a challenging goal, especially under resource constraints and for complex queries with nested structures and dependencies. This paper presents QCompiler, a neuro-symbolic framework inspired by linguistic grammar rules and compiler design, to bridge this gap. It theoretically designs a minimal yet sufficient Backus-Naur Form (BNF) grammar $G[q]$ to formalize complex queries. Unlike previous methods, this grammar maintains completeness while minimizing redundancy. Based on this, QCompiler includes a Query Expression Translator, a Lexical Syntax Parser, and a Recursive Descent Processor to compile queries into Abstract Syntax Trees (ASTs) for execution. The atomicity of the sub-queries in the leaf nodes ensures more precise document retrieval and response generation, significantly improving the RAG system's ability to address complex queries.
中文: 本文提出QCompiler这一神经符号框架,通过最小化的巴科斯范式语法形式化复杂查询,并将其编译为抽象语法树,显著提升了RAG系统处理复杂查询时文档检索与响应生成的精确度。
English: This paper introduces QCompiler, a neuro-symbolic framework that formalizes complex queries using a minimal Backus-Naur Form grammar and compiles them into Abstract Syntax Trees to enhance document retrieval and response generation in RAG systems.

Authors:Faruk Alpay
Title: XiSort: Deterministic Sorting via IEEE-754 Total Ordering and Entropy Minimization
Abstract:
We introduce XiSort, a deterministic and reproducible sorting algorithm for floating-point sequences based on IEEE-754 total ordering and entropy minimization. XiSort guarantees bit-for-bit stability across runs and platforms by resolving tie-breaking via information-theoretic and symbolic methods. The algorithm supports both in-memory and external (out-of-core) operation, offering consistent performance on large datasets. We formalize a curved variant of the sorting metric that integrates into the Alpay Algebra framework, treating XiSort as a recursive operator with provable convergence and symbolic idempotence. This model preserves state-space closure while minimizing local disorder, interpretable as symbolic entropy. Empirical benchmarks demonstrate that XiSort achieves competitive throughput (e.g., sorting 10^8 doubles in approximately 12 seconds in-memory, and 100 GB at around 100 MB/s on SSDs), with applications in scientific computing, high-frequency finance, and reproducible numerical workflows. The results position XiSort as a principled tool for stable data alignment, symbolic preprocessing, and cross-platform float ordering. Keywords: deterministic sorting, IEEE-754, entropy minimization, symbolic algebra, reproducibility, external memory, Alpay Algebra, data pipelines
Chinese: XiSort是一种基于IEEE-754全序和熵最小化的确定性浮点数排序算法,通过信息论和符号方法确保跨平台运行的比特级稳定性,在科学计算和高频金融等场景中为大规模数据集提供高效排序能力。
English: XiSort is a deterministic floating-point sorting algorithm that ensures bit-for-bit reproducibility across platforms by leveraging IEEE-754 total ordering and entropy minimization, offering efficient performance for large datasets in scientific and financial applications.
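
To make the "IEEE-754 total ordering" ingredient concrete, here is a minimal Python sketch of a sort key that orders doubles the way the IEEE-754 totalOrder predicate does (negative NaNs, then -inf, negative numbers, -0.0, +0.0, positive numbers, +inf, positive NaNs). It is not XiSort itself: the entropy-based tie-breaking and external-memory operation described in the abstract are not attempted, and the function names are assumptions chosen for illustration.

```python
import struct

def total_order_key(x: float) -> int:
    """Map a double to an unsigned 64-bit key whose integer order
    matches the IEEE-754 totalOrder predicate."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    if bits >> 63:
        # Sign bit set: flip every bit so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set the top bit so positives sort above negatives.
    return bits | 0x8000000000000000

def total_order_sort(values):
    """Sort floats deterministically, distinguishing -0.0 from +0.0."""
    return sorted(values, key=total_order_key)
```

Because the key depends only on the bit pattern, the resulting order does not depend on how the platform's comparison operators treat NaN or signed zeros, which is the property that makes bit-for-bit reproducible ordering possible.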

Authors:Haitao Li, Ziyu Li, Yiheng Mao, Zhengyao Ding, Zhengxing Huang
Title: DC-Seg: Disentangled Contrastive Learning for Brain Tumor Segmentation with Missing Modalities
Abstract:
Accurate segmentation of brain images typically requires the integration of complementary information from multiple image modalities. However, clinical data for all modalities may not be available for every patient, creating a significant challenge. To address this, previous studies encode multiple modalities into a shared latent space. While somewhat effective, this approach remains suboptimal, as each modality contains distinct and valuable information. In this study, we propose DC-Seg (Disentangled Contrastive Learning for Segmentation), a new method that explicitly disentangles images into modality-invariant anatomical representation and modality-specific representation, by using anatomical contrastive learning and modality contrastive learning respectively. This solution improves the separation of anatomical and modality-specific features by considering the modality gaps, leading to more robust representations. Furthermore, we introduce a segmentation-based regularizer that enhances the model's robustness to missing modalities. Extensive experiments on the BraTS 2020 dataset and a private white matter hyperintensity (WMH) segmentation dataset demonstrate that DC-Seg outperforms state-of-the-art methods in handling incomplete multimodal brain tumor segmentation tasks with varying missing modalities, while also demonstrating strong generalizability in WMH segmentation. The code is available at https://github.com/CuCl-2/DC-Seg.
中文: DC-Seg通过解耦对比学习将脑图像分离为模态不变的解剖表征和模态特异性表征,有效提升了多模态数据缺失情况下的分割鲁棒性和准确性。
English: DC-Seg introduces a novel approach that disentangles brain images into modality-invariant anatomical and modality-specific representations through contrastive learning, significantly improving segmentation accuracy and robustness to missing modalities in clinical data.

Authors:Zhiheng Chen, Ruofan Wu, Guanhua Fang
Title: Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures
Abstract:
The transformer architecture has demonstrated remarkable capabilities in modern artificial intelligence, among which the capability of implicitly learning an internal model during inference time is widely believed to play a key role in the understanding of pre-trained large language models. However, most recent works have focused on supervised learning topics such as in-context learning, leaving the field of unsupervised learning largely unexplored. This paper investigates the capabilities of transformers in solving Gaussian Mixture Models (GMMs), a fundamental unsupervised learning problem, through the lens of statistical estimation. We propose a transformer-based learning framework called TGMM that simultaneously learns to solve multiple GMM tasks using a shared transformer backbone. The learned models are empirically demonstrated to effectively mitigate the limitations of classical methods such as Expectation-Maximization (EM) or spectral algorithms, while exhibiting reasonable robustness to distribution shifts. Theoretically, we prove that transformers can approximate both the EM algorithm and a core component of spectral methods (cubic tensor power iterations). These results bridge the gap between practical success and theoretical understanding, positioning transformers as versatile tools for unsupervised learning.
中文: 本文提出TGMM这一基于Transformer的框架,能够有效学习解决多个高斯混合模型任务,在超越传统方法性能的同时,理论上证明了其可近似EM算法和谱方法的核心组件。
English: This paper introduces TGMM, a transformer-based framework that learns to solve multiple Gaussian Mixture Model tasks with a shared backbone, mitigating limitations of classical methods such as EM and spectral algorithms, and proves that transformers can approximate both the EM algorithm and cubic tensor power iterations, a core component of spectral methods.
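
For readers unfamiliar with the classical baseline that TGMM is compared against, the sketch below shows one Expectation-Maximization iteration for a one-dimensional Gaussian mixture in plain Python. It illustrates the EM algorithm only, not the transformer framework; the function names and the small variance floor are assumptions added for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(data, weights, means, variances):
    """Run one EM iteration for a 1-D Gaussian mixture and return
    the updated (weights, means, variances)."""
    k = len(weights)
    # E-step: posterior responsibility of each component for each point.
    resp = []
    for x in data:
        joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        resp.append([p / total for p in joint])
    # M-step: re-estimate parameters from responsibility-weighted statistics.
    n = len(data)
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / n)
        new_m.append(mean)
        new_v.append(max(var, 1e-6))  # floor avoids a degenerate zero variance
    return new_w, new_m, new_v
```

Iterating `em_step` from a reasonable initialization converges to a local optimum of the likelihood; sensitivity to that initialization is one of the classical limitations the abstract alludes to.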

Authors:Pengfei Lyu, Pak-Hei Yeung, Xiaosheng Yu, Jing Xia, Jianning Chi, Chengdong Wu, Jagath C. Rajapakse
Title: Bridging the Inter-Domain Gap through Low-Level Features for Cross-Modal Medical Image Segmentation
Abstract:
This paper addresses the task of cross-modal medical image segmentation by exploring unsupervised domain adaptation (UDA) approaches. We propose a model-agnostic UDA framework, LowBridge, which builds on a simple observation that cross-modal images share some similar low-level features (e.g., edges) as they are depicting the same structures. Specifically, we first train a generative model to recover the source images from their edge features, followed by training a segmentation model on the generated source images, separately. At test time, edge features from the target images are input to the pretrained generative model to generate source-style target domain images, which are then segmented using the pretrained segmentation network. Despite its simplicity, extensive experiments on various publicly available datasets demonstrate that LowBridge achieves state-of-the-art performance, outperforming eleven existing UDA approaches under different settings. Notably, further ablation studies show that LowBridge is agnostic to different types of generative and segmentation models, suggesting its potential to be seamlessly plugged with the most advanced models to achieve even more outstanding results in the future. The code is available at https://github.com/JoshuaLPF/LowBridge.
中文摘要:本文提出了一种模型无关的无监督域自适应框架LowBridge,通过利用跨模态图像共享的边缘特征将目标域图像转换为源域风格,在多种医学图像分割任务中实现了最优性能。
English Summary: This paper introduces LowBridge, a model-agnostic unsupervised domain adaptation framework for cross-modal medical image segmentation that leverages shared edge features to generate source-style images from target domains, achieving state-of-the-art performance across multiple datasets.

Authors:Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Title: MINGLE: Mixture of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging
Abstract:
Continual model merging integrates independently fine-tuned models sequentially without access to the original training data, offering a scalable and efficient solution for continual learning. However, existing methods face two critical challenges: parameter interference among tasks, which leads to catastrophic forgetting, and limited adaptability to evolving test distributions. To address these issues, we introduce the task of Test-Time Continual Model Merging (TTCMM), which leverages a small set of unlabeled test samples during inference to alleviate parameter conflicts and handle distribution shifts. We propose MINGLE, a novel framework for TTCMM. MINGLE employs a mixture-of-experts architecture with parameter-efficient, low-rank experts, which enhances adaptability to evolving test distributions while dynamically merging models to mitigate conflicts. To further reduce forgetting, we propose Null-Space Constrained Gating, which restricts gating updates to subspaces orthogonal to prior task representations, thereby suppressing activations on old tasks and preserving past knowledge. We further introduce an Adaptive Relaxation Strategy that adjusts constraint strength dynamically based on interference signals observed during test-time adaptation, striking a balance between stability and adaptability. Extensive experiments on standard continual merging benchmarks demonstrate that MINGLE achieves robust generalization, significantly reduces forgetting, and consistently surpasses previous state-of-the-art methods by 7-9% on average across diverse task orders. Our code is available at: https://github.com/zihuanqiu/MINGLE
中文: 本文提出MINGLE框架,通过专家混合架构和零空间约束门控机制,在测试时持续合并模型中有效缓解参数冲突并适应分布变化,相比现有方法实现了7-9%的性能提升。
English: The paper introduces MINGLE, a novel framework for Test-Time Continual Model Merging that employs a mixture-of-experts architecture and Null-Space Constrained Gating to mitigate parameter interference and adapt to distribution shifts, achieving significant performance improvements over existing methods.
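
The idea of restricting updates "to subspaces orthogonal to prior task representations" can be illustrated with a generic null-space projection. The sketch below is a plain-Python illustration of that linear-algebra step only; it is not MINGLE's gating implementation, and the function names and the use of Gram-Schmidt are assumptions made for the example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project_to_null_space(update, prior_features):
    """Remove from `update` every component lying in the span of
    `prior_features`, so the result is orthogonal to all of them."""
    out = list(update)
    for b in orthonormal_basis(prior_features):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out
```

An update projected this way leaves the inner product with every prior feature direction unchanged at zero, which is the sense in which it cannot alter responses along those directions.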

Authors:Shiming Chen, Dingjie Fu, Salman Khan, Fahad Shahbaz Khan
Title: GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder
Abstract:
Remarkable progress in zero-shot learning (ZSL) has been achieved using generative models. However, existing generative ZSL methods merely generate (imagine) the visual features from scratch guided by the strong class semantic vectors annotated by experts, resulting in suboptimal generative performance and limited scene generalization. To address these and advance ZSL, we propose an inductive variational autoencoder for generative zero-shot learning, dubbed GenZSL. Mimicking human-level concept learning, GenZSL operates by inducting new class samples from similar seen classes using weak class semantic vectors derived from target class names (i.e., CLIP text embedding). To ensure the generation of informative samples for training an effective ZSL classifier, our GenZSL incorporates two key strategies. Firstly, it employs class diversity promotion to enhance the diversity of class semantic vectors. Secondly, it utilizes target class-guided information boosting criteria to optimize the model. Extensive experiments conducted on three popular benchmark datasets showcase the superiority and potential of our GenZSL with significant efficacy and efficiency over f-VAEGAN, e.g., 24.7% performance gains and more than 60× faster training speed on AWA2. Codes are available at https://github.com/shiming-chen/GenZSL.
中文摘要:提出的GenZSL模型通过使用弱语义向量从相似已见类别中归纳新类样本,推进零样本学习,并采用类别多样性促进和目标类别引导优化策略,在效能和效率上实现显著提升。
English Summary: The proposed GenZSL model advances zero-shot learning by generating new class samples from similar seen classes using weak semantic vectors, enhancing diversity and optimizing performance with significant gains in efficacy and efficiency.

Authors:Chicago Y. Park, Shirin Shoushtari, Hongyu An, Ulugbek S. Kamilov
Title: Measurement Score-Based Diffusion Model
Abstract:
Diffusion models are widely used in applications ranging from image generation to inverse problems. However, training diffusion models typically requires clean ground-truth images, which are unavailable in many applications. We introduce the Measurement Score-based diffusion Model (MSM), a novel framework that learns partial measurement scores using only noisy and subsampled measurements. MSM models the distribution of full measurements as an expectation over partial scores induced by randomized subsampling. To make the MSM representation computationally efficient, we also develop a stochastic sampling algorithm that generates full images by using a randomly selected subset of partial scores at each step. We additionally propose a new posterior sampling method for solving inverse problems that reconstructs images using these partial scores. We provide a theoretical analysis that bounds the Kullback-Leibler divergence between the distributions induced by full and stochastic sampling, establishing the accuracy of the proposed algorithm. We demonstrate the effectiveness of MSM on natural images and multi-coil MRI, showing that it can generate high-quality images and solve inverse problems -- all without access to clean training data. Code is available at https://github.com/wustl-cig/MSM.
Chinese: 测量分数扩散模型(MSM)通过创新的部分分数学习和高效随机采样算法,仅使用含噪声的欠采样测量数据即可实现高质量图像生成和逆问题求解,无需依赖干净训练数据。
English: The Measurement Score-based diffusion Model (MSM) enables high-quality image generation and inverse problem solving using only noisy, subsampled measurements, eliminating the need for clean training data through its innovative partial score learning and efficient stochastic sampling algorithm.

Authors:Yiting Wang, Guoheng Sun, Wanghao Ye, Gang Qu, Ang Li
Title: VeriReason: Reinforcement Learning with Testbench Feedback for Reasoning-Enhanced Verilog Generation
Abstract:
Automating Register Transfer Level (RTL) code generation using Large Language Models (LLMs) offers substantial promise for streamlining digital circuit design and reducing human effort. However, current LLM-based approaches face significant challenges with training data scarcity, poor specification-code alignment, lack of verification mechanisms, and balancing generalization with specialization. Inspired by DeepSeek-R1, we introduce VeriReason, a framework integrating supervised fine-tuning with Guided Reward Proximal Optimization (GRPO) reinforcement learning for RTL generation. Using curated training examples and a feedback-driven reward model, VeriReason combines testbench evaluations with structural heuristics while embedding self-checking capabilities for autonomous error correction. On the VerilogEval Benchmark, VeriReason delivers significant improvements, achieving 83.1% functional correctness on the VerilogEval Machine benchmark and substantially outperforming both comparable-sized models and much larger commercial systems like GPT-4 Turbo. Additionally, our approach demonstrates up to a 2.8X increase in first-attempt functional correctness compared to baseline methods and exhibits robust generalization to unseen designs. To our knowledge, VeriReason represents the first system to successfully integrate explicit reasoning capabilities with reinforcement learning for Verilog generation, establishing a new state-of-the-art for automated RTL synthesis. The models and datasets are available at https://huggingface.co/collections/AI4EDA-CASE and the code is available at https://github.com/NellyW8/VeriReason
中文:VeriReason提出了一种融合监督微调与强化学习的新框架,通过集成测试评估与自检机制,在VerilogEval基准测试中实现了83.1%功能正确率的突破性表现,显著优于现有大型商业模型。
English: VeriReason introduces a novel framework combining supervised fine-tuning with reinforcement learning to automate RTL code generation, achieving state-of-the-art 83.1% functional correctness on benchmarks while embedding self-checking mechanisms for error correction.

Authors:Ziyao Cui, Minxing Zhang, Jian Pei
Title: On Membership Inference Attacks in Knowledge Distillation
Abstract:
Nowadays, Large Language Models (LLMs) are trained on huge datasets, some including sensitive information. This poses a serious privacy concern because privacy attacks such as Membership Inference Attacks (MIAs) may detect this sensitive information. While knowledge distillation compresses LLMs into efficient, smaller student models, its impact on privacy remains underexplored. In this paper, we investigate how knowledge distillation affects model robustness against MIA. We focus on two questions. First, how is private data protected in teacher and student models? Second, how can we strengthen privacy preservation against MIAs in knowledge distillation? Through comprehensive experiments, we show that while teacher and student models achieve similar overall MIA accuracy, teacher models better protect member data, the primary target of MIA, whereas student models better protect non-member data. To address this vulnerability in student models, we propose 5 privacy-preserving distillation methods and demonstrate that they successfully reduce student models' vulnerability to MIA, with ensembling further stabilizing the robustness, offering a reliable approach for distilling more secure and efficient student models. Our implementation source code is available at https://github.com/richardcui18/MIA_in_KD.
中文摘要:本研究探讨了知识蒸馏对大型语言模型隐私保护的影响,发现学生模型在成员数据上对成员推理攻击更为脆弱,并提出了五种有效方法以增强其安全性。
English Summary: This study explores how knowledge distillation affects privacy in large language models, revealing that student models are more vulnerable to membership inference attacks on member data, and proposes five effective methods to enhance their security.

Authors:Jeremy Budd, Javier Ideami, Benjamin Macdowall Rynne, Keith Duggar, Randall Balestriero
Title: SplInterp: Improving our Understanding and Training of Sparse Autoencoders
Abstract:
Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely empirical, and there have been recent doubts about the true utility of SAEs. In this work, we seek to enhance the theoretical understanding of SAEs, using the spline theory of deep learning. By situating SAEs in this framework, we discover that SAEs generalise "k-means autoencoders" to be piecewise affine, but sacrifice accuracy for interpretability vs. the optimal "k-means-esque plus local principal component analysis (PCA)" piecewise affine autoencoder. We characterise the underlying geometry of (TopK) SAEs using power diagrams. And we develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs, with both solid theoretical foundations and promising empirical results in MNIST and LLM experiments, particularly in sample efficiency and (in the LLM setting) improved sparsity of codes. All code is available at: https://github.com/splInterp2025/splInterp
中文: 本研究通过将稀疏自编码器置于样条理论框架中,深化了对其的理论理解,揭示了其在可解释性与准确性之间的权衡,并提出了新颖的PAM-SGD算法,在实验中展现出更优的样本效率和稀疏性。
English: This study enhances the theoretical understanding of sparse autoencoders (SAEs) by framing them within spline theory, revealing their trade-off between interpretability and accuracy compared to optimal piecewise affine autoencoders, and introduces a novel PAM-SGD algorithm that demonstrates improved sample efficiency and sparsity in experiments.
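
A minimal sketch of the piecewise affine view of a TopK SAE mentioned in the abstract. It assumes a plain TopK activation (keep the k largest pre-activations, zero the rest, no extra ReLU) and uses hypothetical toy weights. It is not the paper's implementation and does not cover PAM-SGD.

```python
def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a toy TopK sparse autoencoder on a single input vector."""
    # Encoder pre-activations: one affine function of x per latent unit.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    # TopK activation: keep the k largest pre-activations, zero the others.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    code = [pre[i] if i in keep else 0.0 for i in range(len(pre))]
    # Decoder: affine map from the sparse code back to input space.
    recon = [sum(w * c for w, c in zip(row, code)) + b for row, b in zip(W_dec, b_dec)]
    return code, recon, sorted(keep)


# Hypothetical toy parameters: 2 input dimensions, 3 latent units.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
b_dec = [0.0, 0.0]

code, recon, active = topk_sae_forward([1.0, 2.0], W_enc, b_enc, W_dec, b_dec, k=2)
print(code, recon, active)
```

For every input that selects the same active set, the encoder and decoder compose into a single affine map, so the autoencoder is affine on each such region of input space.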

Authors:Hongliang Li, Jinan Xu, Gengping Cui, Changhao Guan, Fengran Mo, Kaiyu Huang
Title: Multilingual Collaborative Defense for Large Language Models
Abstract:
The robustness and security of large language models (LLMs) have become a prominent research area. One notable vulnerability is the ability to bypass LLM safeguards by translating harmful queries into rare or underrepresented languages, a simple yet effective method of "jailbreaking" these models. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to enhance multilingual safety. In this work, we investigate the correlation between various attack features across different languages and propose Multilingual Collaborative Defense (MCD), a novel learning method that optimizes a continuous, soft safety prompt automatically to facilitate multilingual safeguarding of LLMs. The MCD approach offers three advantages: First, it effectively improves safeguarding performance across multiple languages. Second, MCD maintains strong generalization capabilities while minimizing false refusal rates. Third, MCD mitigates the language safety misalignment caused by imbalances in LLM training corpora. To evaluate the effectiveness of MCD, we manually construct multilingual versions of commonly used jailbreak benchmarks, such as MaliciousInstruct and AdvBench, to assess various safeguarding methods. Additionally, we introduce these datasets in underrepresented (zero-shot) languages to verify the language transferability of MCD. The results demonstrate that MCD outperforms existing approaches in safeguarding against multilingual jailbreak attempts while also exhibiting strong language transfer capabilities. Our code is available at https://github.com/HLiang-Lee/MCD.
中文摘要:本文提出多语言协同防御(MCD)方法,通过自动优化安全提示来增强大语言模型的多语言防护能力,实验证明该方法在抵御跨语言越狱攻击方面优于现有方案,并展现出优秀的语言迁移性能。
English Summary: This paper introduces Multilingual Collaborative Defense (MCD), a novel method that automatically optimizes safety prompts to protect large language models from multilingual jailbreak attacks, demonstrating superior performance and transferability across languages compared to existing approaches.

Authors:Yansong Ning, Wei Li, Jun Fang, Naiqiang Tan, Hao Liu
Title: Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning
Abstract:
Compressing long chain-of-thought (CoT) from large language models (LLMs) is an emerging strategy to improve the reasoning efficiency of LLMs. Despite its promising benefits, existing studies compress all thoughts within a long CoT equally, hindering more concise and effective reasoning. To this end, we first investigate the importance of different thoughts by examining their effectiveness and efficiency in contributing to reasoning through automatic long CoT chunking and Monte Carlo rollouts. Building upon these insights, we propose a theoretically bounded metric to jointly measure the effectiveness and efficiency of different thoughts. We then propose Long$\otimes$Short, an efficient reasoning framework that enables two LLMs to collaboratively solve the problem: a long-thought LLM that generates important thoughts more effectively, and a short-thought LLM that efficiently generates the remaining thoughts. Specifically, we begin by synthesizing a small amount of cold-start data to fine-tune LLMs for long-thought and short-thought reasoning styles, respectively. Furthermore, we propose a synergizing-oriented multi-turn reinforcement learning, focusing on the model self-evolution and collaboration between long-thought and short-thought LLMs. Experimental results show that our method enables Qwen2.5-7B and Llama3.1-8B to achieve performance comparable to DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, while reducing token length by over 80% across the MATH500, AIME24/25, AMC23, and GPQA Diamond benchmarks. Our data and code are available at https://github.com/usail-hkust/LongShort.
中文:本文提出Long⊗Short协同推理框架,通过长思考与短思考模型分别生成关键思路和辅助思路,在多个基准测试中实现与现有模型相当性能的同时,将推理长度减少80%以上。
English: This paper introduces Long⊗Short, a collaborative reasoning framework where two LLMs generate important and remaining thoughts respectively, achieving comparable performance to existing models while reducing token usage by over 80% across multiple benchmarks.

Authors:Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu
Title: Chain-of-Model Learning for Language Model
Abstract:
In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer in a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, a model built upon the CoM framework can progressively scale up the model size by adding chains to the previous models (i.e., chains), and offer multiple sub-models of varying sizes for elastic inference by using different numbers of chains. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Based on CoLM, we further introduce CoLM-Air with a KV sharing mechanism, which computes all keys and values within the first chain and then shares them across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching and prefilling acceleration. Experimental results demonstrate that our CoLM family can achieve performance comparable to the standard Transformer, while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and offering multiple model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.
中文: 本文提出链式模型(CoM)这一新型学习范式,通过将隐藏状态构建为因果链实现渐进式模型扩展和弹性推理,在保持与标准Transformer相当性能的同时显著提升了训练效率和部署灵活性。
English: This paper introduces Chain-of-Model (CoM), a novel learning paradigm that enhances training efficiency and inference flexibility by structuring hidden states into causal chains, enabling progressive model scaling and elastic deployment.
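
A minimal sketch of the chain constraint described in the abstract, written as a block lower-triangular mask on a linear layer. It assumes that "preceding chains" includes the chain's own index. The chain sizes, weights, and helper names are hypothetical and not taken from the CoLM code.

```python
def chain_mask(chain_sizes):
    """Mask of shape (dim, dim): entry [o][i] is 1 when output unit o may read input unit i."""
    # Map every hidden unit to the index of the chain it belongs to.
    chain_of = [c for c, size in enumerate(chain_sizes) for _ in range(size)]
    # Output chain c may read input chains 0..c only.
    return [[1 if in_c <= out_c else 0 for in_c in chain_of] for out_c in chain_of]


def masked_linear(x, W, mask):
    """Linear layer whose weights are zeroed wherever the mask is 0."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(W, mask)
    ]


chain_sizes = [2, 1]  # hypothetical: chain 0 has 2 units, chain 1 has 1 unit
mask = chain_mask(chain_sizes)
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
x = [1.0, 1.0, 1.0]

full_out = masked_linear(x, W, mask)

# Sub-model using only the first chain: slice weights, mask, and input to width 2.
width = chain_sizes[0]
small_out = masked_linear(x[:width], [row[:width] for row in W[:width]],
                          [row[:width] for row in mask[:width]])
print(full_out, small_out)
```

Because earlier chains never read later ones, the first-chain outputs of the full model match those of the truncated model, which is what makes it possible to extract smaller sub-models for elastic inference.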

Authors:Jiajun Qin, Yuan Pu, Zhuolun He, Seunggeun Kim, David Z. Pan, Bei Yu
Title: UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings
Abstract:
Current research has explored vision-language models for multi-modal embedding tasks, such as information retrieval, visual grounding, and classification. However, real-world scenarios often involve diverse modality combinations between queries and targets, such as text and image to text, text and image to text and image, and text to text and image. These diverse combinations pose significant challenges for existing models, as they struggle to align all modality combinations within a unified embedding space during training, which degrades performance at inference. To address this limitation, we propose UniMoCo, a novel vision-language model architecture designed for multi-modal embedding tasks. UniMoCo introduces a modality-completion module that generates visual features from textual inputs, ensuring modality completeness for both queries and targets. Additionally, we develop a specialized training strategy to align embeddings from both original and modality-completed inputs, ensuring consistency within the embedding space. This enables the model to robustly handle a wide range of modality combinations across embedding tasks. Experiments show that UniMoCo outperforms previous methods while demonstrating consistent robustness across diverse settings. More importantly, we identify and quantify the inherent bias in conventional approaches caused by imbalance of modality combinations in training data, which can be mitigated through our modality-completion paradigm. The code is available at https://github.com/HobbitQia/UniMoCo.
中文摘要:提出的UniMoCo模型通过引入模态补全模块和专门训练策略,解决了视觉语言任务中多种模态组合难以对齐的难题,在提升性能与鲁棒性的同时有效缓解了训练数据不平衡导致的固有偏差。
English Summary: The proposed UniMoCo model addresses the challenge of aligning diverse modality combinations in vision-language tasks by introducing a modality-completion module and specialized training strategy, demonstrating superior performance and robustness while mitigating inherent biases from training data imbalance.

Authors:Yang Tan, Wenrui Gou, Bozitao Zhong, Liang Hong, Huiqun Yu, Bingxin Zhou
Title: VenusX: Unlocking Fine-Grained Functional Understanding of Proteins
Abstract:
Deep learning models have driven significant progress in predicting protein function and interactions at the protein level. While these advancements have been invaluable for many biological applications such as enzyme engineering and function annotation, a more detailed perspective is essential for understanding protein functional mechanisms and evaluating the biological knowledge captured by models. To address this demand, we introduce VenusX, the first large-scale benchmark for fine-grained functional annotation and function-based protein pairing at the residue, fragment, and domain levels. VenusX comprises three major task categories across six types of annotations, including residue-level binary classification, fragment-level multi-class classification, and pairwise functional similarity scoring for identifying critical active sites, binding sites, conserved sites, motifs, domains, and epitopes. The benchmark features over 878,000 samples curated from major open-source databases such as InterPro, BioLiP, and SAbDab. By providing mixed-family and cross-family splits at three sequence identity thresholds, our benchmark enables a comprehensive assessment of model performance on both in-distribution and out-of-distribution scenarios. For baseline evaluation, we assess a diverse set of popular and open-source models, including pre-trained protein language models, sequence-structure hybrids, structure-based methods, and alignment-based techniques. Their performance is reported across all benchmark datasets and evaluation settings using multiple metrics, offering a thorough comparison and a strong foundation for future research. Code and data are publicly available at https://github.com/ai4protein/VenusX.
中文: VenusX作为首个在残基、片段和结构域层面进行精细蛋白质功能注释和配对的大规模基准,通过涵盖多种任务和场景,为模型评估提供了全面支持。
English: VenusX is introduced as the first large-scale benchmark for fine-grained protein functional annotation and pairing at residue, fragment, and domain levels, enabling comprehensive evaluation of models across diverse tasks and scenarios.

Authors:Jian Zhu, He Wang, Yang Xu, Zebin Wu, Zhihui Wei
Title: Self-Learning Hyperspectral and Multispectral Image Fusion via Adaptive Residual Guided Subspace Diffusion Model
Abstract:
Hyperspectral and multispectral image (HSI-MSI) fusion involves combining a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI) to generate a high-resolution hyperspectral image (HR-HSI). Most deep learning-based methods for HSI-MSI fusion rely on large amounts of hyperspectral data for supervised training, which is often scarce in practical applications. In this paper, we propose a self-learning Adaptive Residual Guided Subspace Diffusion Model (ARGS-Diff), which only utilizes the observed images without any extra training data. Specifically, as the LR-HSI contains spectral information and the HR-MSI contains spatial information, we design two lightweight spectral and spatial diffusion models to separately learn the spectral and spatial distributions from them. Then, we use these two models to reconstruct HR-HSI from two low-dimensional components, i.e., the spectral basis and the reduced coefficient, during the reverse diffusion process. Furthermore, we introduce an Adaptive Residual Guided Module (ARGM), which refines the two components through a residual guided function at each sampling step, thereby stabilizing the sampling process. Extensive experimental results demonstrate that ARGS-Diff outperforms existing state-of-the-art methods in terms of both performance and computational efficiency in the field of HSI-MSI fusion. Code is available at https://github.com/Zhu1116/ARGS-Diff.
Chinese: 本文提出ARGS-Diff自学习扩散模型,无需额外训练数据即可融合低分辨率高光谱与高分辨率多光谱图像,通过自适应残差引导机制在性能与计算效率上均超越现有最优方法。
English: This paper introduces ARGS-Diff, a self-learning diffusion model that fuses low-resolution hyperspectral and high-resolution multispectral images without external training data, achieving superior performance and efficiency through adaptive residual guidance.

Authors:Hancan Zhu, Jinhao Chen, Guanghua He
Title: MedVKAN: Efficient Feature Extraction with Mamba and KAN for Medical Image Segmentation
Abstract:
Medical image segmentation relies heavily on convolutional neural networks (CNNs) and Transformer-based models. However, CNNs are constrained by limited receptive fields, while Transformers suffer from scalability challenges due to their quadratic computational complexity. To address these limitations, recent advances have explored alternative architectures. The state-space model Mamba offers near-linear complexity while capturing long-range dependencies, and the Kolmogorov-Arnold Network (KAN) enhances nonlinear expressiveness by replacing fixed activation functions with learnable ones. Building on these strengths, we propose MedVKAN, an efficient feature extraction model integrating Mamba and KAN. Specifically, we introduce the EFC-KAN module, which enhances KAN with convolutional operations to improve local pixel interaction. We further design the VKAN module, integrating Mamba with EFC-KAN as a replacement for Transformer modules, significantly improving feature extraction. Extensive experiments on five public medical image segmentation datasets show that MedVKAN achieves state-of-the-art performance on four datasets and ranks second on the remaining one. These results validate the potential of Mamba and KAN for medical image segmentation while introducing an innovative and computationally efficient feature extraction framework. The code is available at: https://github.com/beginner-cjh/MedVKAN.
Chinese: MedVKAN结合了Mamba模型的近线性复杂度和改进的Kolmogorov-Arnold网络(KAN),有效克服了CNN和Transformer的局限性,在多数医学图像分割数据集上实现了最优性能。
English: MedVKAN integrates the Mamba model for near-linear complexity and the enhanced Kolmogorov-Arnold Network (KAN) to overcome limitations of CNNs and Transformers, achieving state-of-the-art performance on most medical image segmentation datasets.

Authors:Yitian Chen, Jingfan Xia, Siyu Shao, Dongdong Ge, Yinyu Ye
Title: Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling
Abstract:
Optimization modeling is fundamental to decision-making across diverse domains. Despite progress in automating optimization formulation from natural language descriptions, Large Language Models (LLMs) often struggle to generate formally correct and usable models because of hallucinations, posing a challenge for reliable automation. Inspired by the success of Reinforcement Learning (RL) in enhancing Large Reasoning Models, we present Solver-Informed Reinforcement Learning (SIRL), a novel framework that significantly improves the authenticity of LLMs for optimization modeling using Reinforcement Learning with Verifiable Reward by leveraging external optimization solvers as verifiers. These verifiers automatically assess the executable code and the instance-level mathematical model represented by the associated LP file, yielding precise and comprehensive feedback signals -- including syntax, feasibility, and solution quality -- that serve as direct rewards for the RL process. This automated verification process, particularly from classic optimization solvers, also underpins our instance-enhanced self-consistency method to synthesize high-quality training data. Extensive experiments on diverse public benchmarks demonstrate that SIRL achieves state-of-the-art performance, substantially outperforming existing methods in generating accurate and executable optimization models. Our code is publicly available at https://github.com/Cardinal-Operations/SIRL.
中文: SIRL框架通过强化学习和优化求解器验证,显著提升大语言模型生成可执行优化模型的准确性,在多项基准测试中达到领先水平。
English: The SIRL framework leverages reinforcement learning with optimization solvers as verifiers to enhance LLMs' accuracy in generating executable optimization models, achieving state-of-the-art performance on benchmarks.

Authors:Raymond Baartmans, Matthew Raffel, Rahul Vikram, Aiden Deringer, Lizhong Chen
Title: Towards Universal Semantics With Large Language Models
Abstract:
The Natural Semantic Metalanguage (NSM) is a linguistic theory based on a universal set of semantic primes: simple, primitive word-meanings that have been shown to exist in most, if not all, languages of the world. According to this framework, any word, regardless of complexity, can be paraphrased using these primes, revealing a clear and universally translatable meaning. These paraphrases, known as explications, can offer valuable applications for many natural language processing (NLP) tasks, but producing them has traditionally been a slow, manual process. In this work, we present the first study of using large language models (LLMs) to generate NSM explications. We introduce automatic evaluation methods, a tailored dataset for training and evaluation, and fine-tuned models for this task. Our 1B and 8B models outperform GPT-4o in producing accurate, cross-translatable explications, marking a significant step toward universal semantic representation with LLMs and opening up new possibilities for applications in semantic analysis, translation, and beyond. Our code is available at https://github.com/OSU-STARLAB/DeepNSM.
中文: 本研究首次利用大语言模型自动生成自然语义元语言释义,优化后的模型在准确性和跨语言可译性上超越GPT-4o,为自然语言处理的语义表征开辟了新途径。
English: This study pioneers the use of large language models to automatically generate Natural Semantic Metalanguage explications, with fine-tuned models outperforming GPT-4o in accuracy and cross-translatability, advancing universal semantic representation for NLP applications.

Authors:Wenyu Huang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan
Title: Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation
Abstract:
Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also employing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals interesting findings as follows: 1) Encoder-decoder models, such as the ones in the Flan-T5 family, generally outperform causal decoder-only LMs in MHQA tasks, despite being significantly smaller in size; 2) altering the order of gold documents reveals distinct trends in both Flan T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. In addition to the above, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs' performance on this task. Our code is publicly available at https://github.com/hwy9855/MultiHopQA-Reasoning.
中文: 多跳问答因因果掩码对语言模型构成挑战,但Flan-T5等编码器-解码器模型尽管规模小得多,仍普遍优于因果仅解码器模型;当文档顺序与推理链一致时性能最佳,且答案正确时注意力权重往往达到更高的峰值。
English: Multi-hop question answering is challenging for language models because of the causal mask, yet encoder-decoder models such as Flan-T5 generally outperform causal decoder-only models despite being significantly smaller; performance is best when document order matches the reasoning chain, and attention weights tend to peak at higher values when the answer is correct.
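
A minimal sketch contrasting a standard causal mask with one that lets context tokens attend to each other in both directions, in the spirit of the third finding in the abstract. The abstract does not specify how the paper modifies the mask, so the prefix-style layout, sequence length, and function names here are assumptions.

```python
def causal_mask(n):
    """Entry [i][j] is 1 when token i may attend to token j (only j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def prefix_bidirectional_mask(n, prefix_len):
    """Causal mask, except that the first prefix_len tokens attend to each other freely."""
    return [
        [1 if (j <= i or (i < prefix_len and j < prefix_len)) else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical sequence: 2 context tokens (retrieved documents) followed by 2 generated tokens.
causal = causal_mask(4)
bidirectional = prefix_bidirectional_mask(4, prefix_len=2)

for row in bidirectional:
    print(row)
```

Under the purely causal mask, a document placed early in the context can never attend to one placed later, which is one plausible reason document order matters for multi-hop reasoning.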

Authors:Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J. Tao, Min Woo Sun, Alejandro Lozano, James Zou
Title: MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports
Abstract:
Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall by an average relative gain of 29% and 41%, respectively. The open-source dataset, code, and models are available at https://github.com/kevinwu23/Stanford-MedCaseReasoning.
中文:MedCaseReasoning数据集旨在评估大语言模型与临床医生诊断推理的一致性,发现现有模型存在明显不足,但通过基于临床推理的微调可显著提升诊断准确性和推理还原能力。
English: The MedCaseReasoning dataset is introduced to evaluate LLMs' alignment with clinician diagnostic reasoning, revealing significant gaps in current models' performance but showing substantial improvement through fine-tuning on clinical reasoning traces.
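
A minimal sketch of the reasoning-recall arithmetic reported in the abstract: the fraction of clinician reasoning statements that a model's reasoning mentions. The abstract does not say how a mention is detected, so the case-insensitive substring matcher is a crude stand-in, and the statements and model output are hypothetical.

```python
def reasoning_recall(clinician_statements, model_reasoning, is_mentioned):
    """Fraction of clinician statements judged to be mentioned in the model's reasoning."""
    hits = sum(1 for s in clinician_statements if is_mentioned(s, model_reasoning))
    return hits / len(clinician_statements)


def keyword_match(statement, reasoning):
    """Stand-in matcher (assumption): case-insensitive substring test."""
    return statement.lower() in reasoning.lower()


# Hypothetical example, not drawn from the dataset.
statements = ["fever", "elevated troponin", "ST elevation"]
reasoning = "The patient has fever and ST elevation on ECG."

recall = reasoning_recall(statements, reasoning, keyword_match)
print(f"reasoning recall: {recall:.2f}")
```

Passing the matcher in as a parameter keeps the recall computation unchanged if the substring test is replaced by a stronger judge.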

Authors:Shun Inadumi, Nobuhiro Ueda, Koichiro Yoshino
Title: Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures
Abstract:
Multimodal reference resolution, including phrase grounding, aims to understand the semantic relations between mentions and real-world objects. Phrase grounding between images and their captions is a well-established task. In contrast, for real-world applications, it is essential to integrate textual and multimodal reference resolution to unravel the reference relations within dialogue, especially in handling ambiguities caused by pronouns and ellipses. This paper presents a framework that unifies textual and multimodal reference resolution by mapping mention embeddings to object embeddings and selecting mentions or objects based on their similarity. Our experiments show that learning textual reference resolution, such as coreference resolution and predicate-argument structure analysis, positively affects performance in multimodal reference resolution. In particular, our model with coreference resolution performs better in pronoun phrase grounding than representative models for this task, MDETR and GLIP. Our qualitative analysis demonstrates that incorporating textual reference relations strengthens the confidence scores between mentions, including pronouns and predicates, and objects, which can reduce the ambiguities that arise in visually grounded dialogues.
中文摘要:本文提出一个统一文本和多模态指代消解的框架,通过将指称映射到对象并利用相似性来增强短语定位,实验表明该框架在代词消解等任务上优于MDETR和GLIP等模型,有效减少视觉对话中的歧义。
English Summary: This paper introduces a unified framework for textual and multimodal reference resolution that enhances phrase grounding by mapping mentions to objects and leveraging similarities, with experiments showing improved performance, especially in pronoun resolution, over models like MDETR and GLIP.

Authors:Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, Jian Zhang
Title: EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
Abstract:
Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models. EgoDex is publicly available for download at https://github.com/apple/ml-egodex.
中文摘要:EgoDex数据集通过苹果Vision Pro设备采集了829小时包含实时3D手部追踪的自我中心视角视频,覆盖194种日常家居操作任务,有效解决了操作模仿学习中的数据稀缺问题,为机器人和计算机视觉领域提供了重要资源。
English Summary: The EgoDex dataset addresses data scarcity in imitation learning for manipulation by providing 829 hours of egocentric video with real-time 3D hand tracking, collected using Apple Vision Pro across 194 household tasks, enabling advancements in robotics and computer vision.

Authors:Haipeng Fang, Sheng Tang, Juan Cao, Enshuo Zhang, Fan Tang, Tong-Yee Lee
Title: Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration
Abstract:
Diffusion transformers have shown exceptional performance in visual generation but incur high computational costs. Token reduction techniques that compress models by sharing the denoising process among similar tokens have been introduced. However, existing approaches neglect the denoising priors of the diffusion models, leading to suboptimal acceleration and diminished image quality. This study proposes a novel concept: focus on pruning feature redundancies in areas that the diffusion process does not attend to. We analyze the location and degree of feature redundancies based on the structure-then-detail denoising priors. Subsequently, we introduce SDTM, a structure-then-detail token merging approach that dynamically compresses feature redundancies. Specifically, we design dynamic visual token merging, compression ratio adjusting, and prompt reweighting for different stages. Applied in a post-training manner, the proposed method can be integrated seamlessly into any DiT architecture. Extensive experiments across various backbones, schedulers, and datasets showcase the superiority of our method. For example, it achieves 1.55 times acceleration with negligible impact on image quality. Project page: https://github.com/ICTMCG/SDTM.
中文: 本研究提出SDTM方法,通过结构优先-细节补充的令牌合并策略,利用扩散去噪先验动态压缩特征冗余,在多种DiT架构上实现1.55倍加速且几乎不影响图像质量。
English: This study introduces SDTM, a structure-then-detail token merging method that leverages diffusion denoising priors to dynamically compress feature redundancies, achieving 1.55× acceleration with minimal quality loss across various DiT architectures.

Authors:Jae Myung Kim, Stephan Alaniz, Cordelia Schmid, Zeynep Akata
Title: LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance
Abstract:
Despite recent advances in text-to-image generation, using synthetically generated data seldom brings a significant boost in performance for supervised learning. Oftentimes, synthetic datasets do not faithfully recreate the data distribution of real data, i.e., they lack the fidelity or diversity needed for effective downstream model training. While previous work has employed few-shot guidance to address this issue, existing methods still fail to capture and generate features unique to specific real images. In this paper, we introduce a novel dataset generation framework named LoFT, LoRA-Fused Training-data Generation with Few-shot Guidance. Our method fine-tunes LoRA weights on individual real images and fuses them at inference time, producing synthetic images that combine the features of real images for improved diversity and fidelity of generated data. We evaluate the synthetic data produced by LoFT on 10 datasets, using 8 to 64 real images per class as guidance and scaling up to 1000 images per class. Our experiments show that training on LoFT-generated data consistently outperforms other synthetic dataset methods, significantly increasing accuracy as the dataset size increases. Additionally, our analysis demonstrates that LoFT generates datasets with high fidelity and sufficient diversity, which contribute to the performance improvement. The code is available at https://github.com/ExplainableML/LoFT.
中文: LoFT框架通过在单个真实图像上微调LoRA权重并在推理时融合它们,提升了合成数据集的多样性和保真度,从而在模型训练中持续优于其他方法。
English: The LoFT framework enhances synthetic dataset generation by fine-tuning LoRA weights on individual real images and fusing them during inference, resulting in improved diversity and fidelity that consistently outperforms other methods in downstream model training.

Authors:Keenan Eikenberry, Lizuo Liu, Yoonsang Lee
Title: Invariant Representations via Wasserstein Correlation Maximization
Abstract:
This work investigates the use of Wasserstein correlation -- a normalized measure of statistical dependence based on the Wasserstein distance between a joint distribution and the product of its marginals -- for unsupervised representation learning. Unlike, for example, contrastive methods, which naturally cluster classes in the latent space, we find that an (auto)encoder trained to maximize Wasserstein correlation between the input and encoded distributions instead acts as a compressor, reducing dimensionality while approximately preserving the topological and geometric properties of the input distribution. More strikingly, we show that Wasserstein correlation maximization can be used to arrive at an (auto)encoder -- either trained from scratch, or else one that extends a frozen, pretrained model -- that is approximately invariant to a chosen augmentation, or collection of augmentations, and that still approximately preserves the structural properties of the non-augmented input distribution. To do this, we first define the notion of an augmented encoder using the machinery of Markov-Wasserstein kernels. When the maximization objective is then applied to the augmented encoder, as opposed to the underlying, deterministic encoder, the resulting model exhibits the desired invariance properties. Finally, besides our experimental results, which show that even simple feedforward networks can be imbued with invariants or can, alternatively, be used to impart invariants to pretrained models under this training process, we additionally establish various theoretical results for optimal transport-based dependence measures. Code is available at https://github.com/keenan-eikenberry/wasserstein_correlation_maximization .
中文: 本研究提出将Wasserstein相关性作为无监督表征学习的工具,证明其能使(自)编码器在压缩数据的同时保持结构特征,并对特定数据增强具备不变性。
English: This study introduces Wasserstein correlation as a tool for unsupervised representation learning, demonstrating that it enables (auto)encoders to compress data while preserving its structure and achieve invariance to specified augmentations.

Authors:Hung Nguyen, Alireza Rahimi, Veronica Whitford, Hélène Fournier, Irina Kondratova, René Richard, Hung Cao
Title: Heart2Mind: Human-Centered Contestable Psychiatric Disorder Diagnosis System using Wearable ECG Monitors
Abstract:
Psychiatric disorders affect millions globally, yet their diagnosis faces significant challenges in clinical practice due to subjective assessments and accessibility concerns, leading to potential delays in treatment. To help address this issue, we present Heart2Mind, a human-centered contestable psychiatric disorder diagnosis system using wearable electrocardiogram (ECG) monitors. Our approach leverages cardiac biomarkers, particularly heart rate variability (HRV) and R-R intervals (RRI) time series, as objective indicators of autonomic dysfunction in psychiatric conditions. The system comprises three key components: (1) a Cardiac Monitoring Interface (CMI) for real-time data acquisition from Polar H9/H10 devices; (2) a Multi-Scale Temporal-Frequency Transformer (MSTFT) that processes RRI time series through integrated time-frequency domain analysis; (3) a Contestable Diagnosis Interface (CDI) combining Self-Adversarial Explanations (SAEs) with contestable Large Language Models (LLMs). Our MSTFT achieves 91.7% accuracy on the HRV-ACC dataset using leave-one-out cross-validation, outperforming state-of-the-art methods. SAEs successfully detect inconsistencies in model predictions by comparing attention-based and gradient-based explanations, while LLMs enable clinicians to validate correct predictions and contest erroneous ones. This work demonstrates the feasibility of combining wearable technology with Explainable Artificial Intelligence (XAI) and contestable LLMs to create a transparent, contestable system for psychiatric diagnosis that maintains clinical oversight while leveraging advanced AI capabilities. Our implementation is publicly available at: https://github.com/Analytics-Everywhere-Lab/heart2mind.
中文: Heart2Mind系统通过可穿戴心电监测设备分析心脏生物标志物,结合可解释人工智能和可争议大语言模型,构建了一个临床可监督的透明精神疾病诊断平台,在保持高准确率的同时允许医生对诊断结果进行验证和争议。
English: Heart2Mind is a contestable psychiatric diagnosis system that uses wearable ECG monitors to analyze cardiac biomarkers through AI, achieving high accuracy while enabling clinician validation through explainable AI and contestable large language models.
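For readers unfamiliar with the inputs named in this abstract, the sketch below computes two standard time-domain HRV statistics (SDNN and RMSSD) from an R-R interval series. It only illustrates what HRV features derived from RRI data look like. The function names and sample values are hypothetical, and the abstract does not state which HRV features, if any, the MSTFT model consumes beyond the RRI time series.

```python
import math
import statistics

def sdnn(rri_ms):
    """Sample standard deviation of the R-R intervals, in milliseconds."""
    return statistics.stdev(rri_ms)

def rmssd(rri_ms):
    """Root mean square of successive R-R interval differences, in milliseconds."""
    diffs = [b - a for a, b in zip(rri_ms, rri_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical R-R intervals in milliseconds (not taken from the HRV-ACC dataset).
rri = [800, 810, 790, 805]
print(round(sdnn(rri), 2), round(rmssd(rri), 2))
```

Both statistics need at least two intervals. Real recordings would also need artifact and ectopic-beat filtering before these numbers are meaningful.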

Authors:Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen
Title: SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
Abstract:
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer the application of low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.
中文: 本研究通过利用Blackwell GPU的FP4张量核心将注意力推理速度提升5倍,并开创性地将8位注意力应用于训练任务,在微调中实现无损性能,但预训练收敛较慢。
English: This study enhances attention efficiency by utilizing FP4 Tensor Cores in Blackwell GPUs for a 5x inference speedup and pioneers 8-bit attention for training, achieving lossless fine-tuning results despite slower pretraining convergence.
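To make "microscaling FP4" concrete, here is a minimal sketch of block-wise quantization onto the FP4 (E2M1) value grid with one shared scale per block. The grid of magnitudes is the standard E2M1 set. The block size of 16, the max-based scale, and all function names are assumptions for illustration and are not taken from the abstract or the SageAttention code; real hardware also stores the scale in a low-precision format, which is skipped here.

```python
# Magnitudes representable by an FP4 E2M1 value (the sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Map one block to FP4 grid values plus a shared per-block scale."""
    amax = max(abs(x) for x in block)
    # Scale so the largest magnitude in the block lands on the top grid value.
    scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate values from grid codes and the shared scale."""
    return [c * scale for c in codes]

def quantize(values, block_size=16):  # block_size is an assumed value
    """Quantize then dequantize a sequence block by block."""
    out = []
    for i in range(0, len(values), block_size):
        codes, scale = quantize_block(values[i:i + block_size])
        out.extend(dequantize_block(codes, scale))
    return out
```

Because each block has its own scale, an outlier only degrades the resolution of the block it sits in, which is the motivation for small block sizes in this kind of scheme.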

Authors:Rui Zhang, Yun Shen, Hongwei Li, Wenbo Jiang, Hanxiao Chen, Yuan Zhang, Guowen Xu, Yang Zhang
Title: The Ripple Effect: On Unforeseen Complications of Backdoor Attacks
Abstract:
Recent research highlights concerns about the trustworthiness of third-party Pre-Trained Language Models (PTLMs) due to potential backdoor attacks. These backdoored PTLMs, however, are effective only for specific pre-defined downstream tasks. In reality, these PTLMs can be adapted to many other unrelated downstream tasks. Such adaptation may lead to unforeseen consequences in downstream model outputs, consequently raising user suspicion and compromising attack stealthiness. We refer to this phenomenon as backdoor complications. In this paper, we undertake the first comprehensive quantification of backdoor complications. Through extensive experiments using 4 prominent PTLMs and 16 text classification benchmark datasets, we demonstrate the widespread presence of backdoor complications in downstream models fine-tuned from backdoored PTLMs. The output distribution of triggered samples significantly deviates from that of clean samples. Consequently, we propose a backdoor complication reduction method leveraging multi-task learning to mitigate complications without prior knowledge of downstream tasks. The experimental results demonstrate that our proposed method can effectively reduce complications while maintaining the efficacy and consistency of backdoor attacks. Our code is available at https://github.com/zhangrui4041/Backdoor_Complications.
中文摘要:本研究首次量化分析了预训练语言模型中的后门并发症现象,通过多任务学习方法在保持攻击效果的同时有效减轻了其对下游任务的影响。
English Summary: This study identifies and quantifies backdoor complications in pre-trained language models, revealing their widespread impact on downstream tasks and proposing a multi-task learning method to mitigate these issues while preserving attack effectiveness.

Authors:Andrew Liu, Axel Elaldi, Nicholas T Franklin, Nathan Russell, Gurinder S Atwal, Yih-En A Ban, Olivia Viessmann
Title: Flash Invariant Point Attention
Abstract:
Invariant Point Attention (IPA) is a key algorithm for geometry-aware modeling in structural biology, central to many protein and RNA models. However, its quadratic complexity limits the input sequence length. We introduce FlashIPA, a factorized reformulation of IPA that leverages hardware-efficient FlashAttention to achieve linear scaling in GPU memory and wall-clock time with sequence length. FlashIPA matches or exceeds standard IPA performance while substantially reducing computational costs. FlashIPA extends training to previously unattainable lengths, and we demonstrate this by re-training generative models without length restrictions and generating structures of thousands of residues. FlashIPA is available at https://github.com/flagshippioneering/flash_ipa.
中文摘要:FlashIPA是对不变点注意力算法的优化重构,实现了与序列长度的线性计算扩展,可在保持或提升性能的同时支持更长序列的训练。
English Summary: FlashIPA is a computationally efficient reformulation of Invariant Point Attention that achieves linear scaling with sequence length, enabling training on longer sequences while maintaining or improving performance.

Authors:Stylianos Stasinos, Martino Mensio, Elena Lazovik, Athanasios Trantas
Title: BioCube: A Multimodal Dataset for Biodiversity Research
Abstract:
Biodiversity research requires complete and detailed information to study ecosystem dynamics at different scales. Data-driven methods such as Machine Learning are gaining traction in ecology and, more specifically, in biodiversity research, offering alternative modelling pathways. For these methods to deliver accurate results, there is a need for large, curated, multimodal datasets that offer granular spatial and temporal resolutions. In this work, we introduce BioCube, a multimodal, fine-grained global dataset for ecology and biodiversity research. BioCube incorporates species observations through images, audio recordings and descriptions, environmental DNA, vegetation indices, agricultural, forest, and land indicators, and high-resolution climate variables. All observations are geospatially aligned under the WGS84 geodetic system, spanning from 2000 to 2020. The dataset will become available at https://huggingface.co/datasets/BioDT/BioCube, while the acquisition and processing code base is at https://github.com/BioDT/bfm-data.
中文: BioCube是一个面向生态与生物多样性研究的多模态全球数据集,整合了2000至2020年间具有精细时空分辨率的多源数据,为机器学习在生态学中的应用提供支持。
English: BioCube is a comprehensive global dataset for biodiversity research, integrating multimodal data from 2000 to 2020 with fine-grained spatial and temporal resolution to support machine learning applications in ecology.

Authors:Tianyi Shi, Zhu Meng, Yue Chen, Siyang Zheng, Fei Su, Jin Huang, Changrui Ren, Zhicheng Zhao
Title: OLMA: One Loss for More Accurate Time Series Forecasting
Abstract:
Time series forecasting faces two important but often overlooked challenges. Firstly, the inherent random noise in the time series labels sets a theoretical lower bound for the forecasting error, which is positively correlated with the entropy of the labels. Secondly, neural networks exhibit a frequency bias when modeling the state-space of time series, that is, the model performs well in learning certain frequency bands but poorly in others, thus restricting the overall forecasting performance. To address the first challenge, we prove a theorem that there exists a unitary transformation that can reduce the marginal entropy of multiple correlated Gaussian processes, thereby providing guidance for reducing the lower bound of forecasting error. Furthermore, experiments confirm that Discrete Fourier Transform (DFT) can reduce the entropy in the majority of scenarios. Correspondingly, to alleviate the frequency bias, we jointly introduce supervision in the frequency domain along the temporal dimension through DFT and Discrete Wavelet Transform (DWT). This supervision-side strategy is highly general and can be seamlessly integrated into any supervised learning method. Moreover, we propose a novel loss function named OLMA, which utilizes the frequency domain transformation across both channel and temporal dimensions to enhance forecasting. Finally, the experimental results on multiple datasets demonstrate the effectiveness of OLMA in addressing the above two challenges and the resulting improvement in forecasting accuracy. The results also indicate that the perspectives of entropy and frequency bias provide a new and feasible research direction for time series forecasting. The code is available at: https://github.com/Yuyun1011/OLMA-One-Loss-for-More-Accurate-Time-Series-Forecasting.
中文: 本研究针对时间序列预测中的两个关键挑战——随机噪声设定的理论误差下界和神经网络频率偏差,提出通过酉变换降低标签熵并引入频域监督与新型OLMA损失函数,在多个数据集上显著提升了预测精度。
English: This study addresses two key challenges in time series forecasting—random noise setting a theoretical error bound and neural network frequency bias—by proposing a unitary transformation to reduce label entropy and introducing frequency domain supervision with a novel OLMA loss function, which significantly improves forecasting accuracy across multiple datasets.
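The idea of frequency-domain supervision can be sketched as follows: compare the DFT coefficients of the forecast and the label, and mix that term with an ordinary time-domain error. This is not the OLMA loss itself. The abstract also mentions the channel dimension and a Discrete Wavelet Transform, which are omitted here, and the function names and the mixing weight `alpha` are hypothetical.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def freq_loss(pred, target):
    """Mean absolute error between the DFT coefficients of pred and target."""
    fp, ft = dft(pred), dft(target)
    return sum(abs(a - b) for a, b in zip(fp, ft)) / len(fp)

def combined_loss(pred, target, alpha=0.5):  # alpha is a hypothetical weight
    """Blend the frequency-domain term with a time-domain mean absolute error."""
    time_term = sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
    return alpha * freq_loss(pred, target) + (1 - alpha) * time_term
```

In a training setup the DFT would come from an autodiff library's FFT so that gradients flow through it; the pure-Python version above is only meant to show the quantity being penalized.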

Authors:Kyla Guru, Robert J. Moss, Mykel J. Kochenderfer
Title: On Technique Identification and Threat-Actor Attribution using LLMs and Embedding Models
Abstract:
Attribution of cyber-attacks remains a complex but critical challenge for cyber defenders. Currently, manual extraction of behavioral indicators from dense forensic documentation causes significant attribution delays, especially following major incidents at the international scale. This research evaluates large language models (LLMs) for cyber-attack attribution based on behavioral indicators extracted from forensic documentation. We test OpenAI's GPT-4 and text-embedding-3-large for identifying threat actors' tactics, techniques, and procedures (TTPs) by comparing LLM-generated TTPs against human-generated data from MITRE ATT&CK Groups. Our framework then identifies TTPs from text using vector embedding search and builds profiles to attribute new attacks for a machine learning model to learn. Key contributions include: (1) assessing off-the-shelf LLMs for TTP extraction and attribution, and (2) developing an end-to-end pipeline from raw CTI documents to threat-actor prediction. This research finds that standard LLMs generate TTP datasets with noise, resulting in a low similarity to human-generated datasets. However, the TTPs generated are similar in frequency to those within the existing MITRE datasets. Additionally, although these TTPs are different than human-generated datasets, our work demonstrates that they still prove useful for training a model that performs above baseline on attribution. Project code and files are contained here: https://github.com/kylag/ttp_attribution.
中文摘要:本研究评估了利用大型语言模型从取证文档中提取行为指标以实现网络攻击归因,发现尽管模型生成的数据与人工数据集存在差异,但仍能有效训练出超越基准归因性能的机器学习模型。
English Summary: This study explores using large language models to automate cyber-attack attribution by extracting behavioral indicators from forensic documents, finding that while LLM-generated data differs from human-curated datasets, it effectively trains models to surpass baseline attribution performance.

Authors:Jiacheng Hou, Zhenjie Song, Ercan Engin Kuruoglu
Title: BrainNetMLP: An Efficient and Effective Baseline for Functional Brain Network Classification
Abstract:
Recent studies have made great progress in functional brain network classification by modeling the brain as a network of Regions of Interest (ROIs) and leveraging their connections to understand brain functionality and diagnose mental disorders. Various deep learning architectures, including Convolutional Neural Networks, Graph Neural Networks, and the recent Transformer, have been developed. However, despite the increasing complexity of these models, the performance gain has not been as salient. This raises a question: Does increasing model complexity necessarily lead to higher classification accuracy? In this paper, we revisit the simplest deep learning architecture, the Multi-Layer Perceptron (MLP), and propose a pure MLP-based method, named BrainNetMLP, for functional brain network classification, which capitalizes on the advantages of MLP, including efficient computation and fewer parameters. Moreover, BrainNetMLP incorporates a dual-branch structure to jointly capture both spatial connectivity and spectral information, enabling precise spatiotemporal feature fusion. We evaluate our proposed BrainNetMLP on two public and popular brain network classification datasets, the Human Connectome Project (HCP) and the Autism Brain Imaging Data Exchange (ABIDE). Experimental results demonstrate pure MLP-based methods can achieve state-of-the-art performance, revealing the potential of MLP-based models as more efficient yet effective alternatives in functional brain network classification. The code will be available at https://github.com/JayceonHo/BrainNetMLP.
中文: 本文提出BrainNetMLP,一种基于多层感知机的双分支结构方法,能同时捕捉空间连接和频谱信息,在功能脑网络分类中实现最优性能,同时展现出高效性和有效性。
English: This paper introduces BrainNetMLP, a pure Multi-Layer Perceptron-based method with a dual-branch structure that captures both spatial connectivity and spectral information, achieving state-of-the-art performance in functional brain network classification while demonstrating efficiency and effectiveness.

Authors:Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
Title: SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning
Abstract:
Test-Time Scaling (TTS) refers to approaches that improve reasoning performance by allocating extra computation during inference, without altering the model's parameters. While existing TTS methods operate in a discrete token space by generating more intermediate steps, recent studies in Coconut and SoftCoT have demonstrated that thinking in the continuous latent space can further enhance the reasoning performance. Such latent thoughts encode informative thinking without the information loss associated with autoregressive token generation, sparking increased interest in continuous-space reasoning. Unlike discrete decoding, where repeated sampling enables exploring diverse reasoning paths, latent representations in continuous space are fixed for a given input, which limits diverse exploration, as all decoded paths originate from the same latent thought. To overcome this limitation, we introduce SoftCoT++ to extend SoftCoT to the Test-Time Scaling paradigm by enabling diverse exploration of thinking paths. Specifically, we perturb latent thoughts via multiple specialized initial tokens and apply contrastive learning to promote diversity among soft thought representations. Experiments across five reasoning benchmarks and two distinct LLM architectures demonstrate that SoftCoT++ significantly boosts SoftCoT and also outperforms SoftCoT with self-consistency scaling. Moreover, it shows strong compatibility with conventional scaling techniques such as self-consistency. Source code is available at https://github.com/xuyige/SoftCoT.
中文:SoftCoT++ 通过扰动和对比学习实现潜在思维路径的多样化探索,在不改变模型参数的情况下,显著提升了多个推理基准的性能并优于现有方法。
English: SoftCoT++ enhances reasoning by diversifying latent thought exploration through perturbations and contrastive learning, outperforming existing methods across multiple benchmarks without altering model parameters.

Authors:Dingbang Huang, Wenbo Li, Yifei Zhao, Xinyu Pan, Yanhong Zeng, Bo Dai
Title: PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment
Abstract:
Diffusion models have made remarkable advancements in generating high-quality images from textual descriptions. Recent works like LayerDiffuse have extended the previous single-layer, unified image generation paradigm to transparent image layer generation. However, existing multi-layer generation methods fail to handle the interactions among multiple layers, such as rational global layout, physically plausible contacts, and visual effects like shadows and reflections, while maintaining high alpha quality. To solve this problem, we propose PSDiffusion, a unified diffusion framework for simultaneous multi-layer text-to-image generation. Our model can automatically generate multi-layer images with one RGB background and multiple RGBA foregrounds through a single feed-forward process. Unlike existing methods that combine multiple tools for post-decomposition or generate layers sequentially and separately, our method introduces a global-layer interactive mechanism that generates layered images concurrently and collaboratively, ensuring not only high quality and completeness for each layer, but also spatial and visual interactions among layers for global coherence.
Chinese: PSDiffusion是一种统一的扩散框架,通过单次前向处理同时生成多层图像,既保证各图层的高质量,又确保图层间的全局交互协调一致。
English: PSDiffusion is a unified diffusion framework that simultaneously generates multi-layer images with a single feed-forward process, ensuring high-quality individual layers and coherent global interactions among them.

Authors:Yiming Lei, Chenkai Zhang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Title: GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art
Abstract:
Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g. mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs' abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improve creative composing, highlighting its potential to drive meaningful advancements in MLLM-based creativity. GODBench is publicly available at https://github.com/stan-lei/GODBench-ACL2025.
中文: GODBench是一个结合视频与文本的新基准,用于评估多模态大语言模型创作视频弹幕的能力,而提出的涟漪思维框架通过改进幽默与讽刺生成的现有局限,显著提升了模型的创意表达水平。
English: GODBench is a new benchmark combining video and text to evaluate multimodal large language models' ability to create engaging video comments, while the proposed Ripple of Thought framework enhances their creative expression by addressing current limitations in humor and satire generation.

Authors:Adrian Robert Minut, Tommaso Mencattini, Andrea Santilli, Donato Crisostomi, Emanuele Rodolà
Title: Mergenetic: a Simple Evolutionary Model Merging Library
Abstract:
Model merging allows combining the capabilities of existing models into a new one - post hoc, without additional training. This has made it increasingly popular thanks to its low cost and the availability of libraries that support merging on consumer GPUs. Recent work shows that pairing merging with evolutionary algorithms can boost performance, but no framework currently supports flexible experimentation with such strategies in language models. We introduce Mergenetic, an open-source library for evolutionary model merging. Mergenetic enables easy composition of merging methods and evolutionary algorithms while incorporating lightweight fitness estimators to reduce evaluation costs. We describe its design and demonstrate that Mergenetic produces competitive results across tasks and languages using modest hardware.
中文: Mergenetic 是一个开源库,它通过结合多种模型融合方法和进化算法,并采用轻量级适应度评估器,在普通硬件上实现了跨任务和语言的优异性能。
English: Mergenetic is an open-source library that facilitates evolutionary model merging by combining various merging methods and algorithms with lightweight fitness estimators, delivering competitive performance across tasks and languages on modest hardware.
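As a rough picture of what "pairing merging with evolutionary algorithms" means, the sketch below searches for a linear-interpolation coefficient between two weight vectors with a simple (1+1) evolution strategy. It does not use or imitate the Mergenetic API. The merge rule, the search strategy, the toy fitness function, and every name here are assumptions chosen for brevity.

```python
import random

def merge(w_a, w_b, alpha):
    """Linear interpolation of two weight vectors."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(w_a, w_b)]

def evolve_alpha(w_a, w_b, fitness, generations=50, sigma=0.1, seed=0):
    """(1+1) evolution strategy over the merge coefficient alpha in [0, 1]."""
    rng = random.Random(seed)
    best = 0.5
    best_fit = fitness(merge(w_a, w_b, best))
    for _ in range(generations):
        # Mutate the current best and clip it back into the valid range.
        cand = min(1.0, max(0.0, best + rng.gauss(0, sigma)))
        fit = fitness(merge(w_a, w_b, cand))
        if fit > best_fit:
            best, best_fit = cand, fit
    return best, best_fit

# Toy setup: fitness is the negative squared distance to a hypothetical target.
target = [1.6, 3.2, 4.8]
w_a, w_b = [2.0, 4.0, 6.0], [0.0, 0.0, 0.0]
fitness = lambda w: -sum((x - t) ** 2 for x, t in zip(w, target))
alpha, score = evolve_alpha(w_a, w_b, fitness)
print(alpha, score)
```

With real models, each fitness call is a benchmark evaluation of the merged network, which is why the abstract stresses lightweight fitness estimators.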

Authors:Bohao Xing, Xin Liu, Guoying Zhao, Chengyu Liu, Xiaolan Fu, Heikki Kälviäinen
Title: EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models
Abstract:
Emotion understanding is a critical yet challenging task. Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from hallucinations, generating irrelevant or nonsensical content. To the best of our knowledge, despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs. In this work, we introduce EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs. Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts. Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions. Building on this, we assess emotion hallucinations from two dimensions: emotion psychology knowledge and real-world multimodal perception. To support robust evaluation, we utilize an adversarial binary question-answer (QA) framework, which employs carefully crafted basic and hallucinated pairs to assess the emotion hallucination tendencies of MLLMs. By evaluating 38 LLMs and MLLMs on EmotionHallucer, we reveal that: i) most current models exhibit substantial issues with emotion hallucinations; ii) closed-source models outperform open-source ones in detecting emotion hallucinations, and reasoning capability provides additional advantages; iii) existing models perform better in emotion psychology knowledge than in multimodal emotion perception. As a byproduct, these findings inspire us to propose the PEP-MEK framework, which yields an average improvement of 9.90% in emotion hallucination detection across selected models. Resources will be available at https://github.com/xxtars/EmotionHallucer.
中文: 本文提出了首个检测多模态大语言模型中情绪幻觉的基准EmotionHallucer,揭示了当前模型普遍存在该问题,并提出的框架使检测性能平均提升9.90%。
English: This paper introduces EmotionHallucer, the first benchmark for detecting emotion hallucinations in Multimodal Large Language Models, revealing widespread issues and proposing a framework that improves detection by 9.90%.

Authors:Wenchuan Zhang, Penghao Zhang, Jingru Guo, Tao Cheng, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu
Title: Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner
Abstract:
Recent advances in vision language models (VLMs) have enabled broad progress in the general medical field. However, pathology still remains a more challenging subdomain, with current pathology specific VLMs exhibiting limitations in both diagnostic accuracy and reasoning plausibility. Such shortcomings are largely attributable to the nature of current pathology datasets, which are primarily composed of image description pairs that lack the depth and structured diagnostic paradigms employed by real world pathologists. In this study, we leverage pathology textbooks and real world pathology experts to construct high-quality, reasoning-oriented datasets. Building on this, we introduce Patho-R1, a multimodal RL-based pathology Reasoner, trained through a three-stage pipeline: (1) continued pretraining on 3.5 million image-text pairs for knowledge infusion; (2) supervised fine-tuning on 500k high-quality Chain-of-Thought samples for reasoning incentivizing; (3) reinforcement learning using Group Relative Policy Optimization and Decoupled Clip and Dynamic sAmpling Policy Optimization strategies for multimodal reasoning quality refinement. To further assess the alignment quality of our dataset, we propose Patho-CLIP, trained on the same figure-caption corpus used for continued pretraining. Comprehensive experimental results demonstrate that both Patho-CLIP and Patho-R1 achieve robust performance across a wide range of pathology-related tasks, including zero-shot classification, cross-modal retrieval, Visual Question Answering, and Multiple Choice Question. Our project is available at the Patho-R1 repository: https://github.com/Wenchuan-Zhang/Patho-R1.
中文摘要:本研究提出Patho-R1,这是一种基于病理学教材和专家知识构建的高质量数据集、通过三阶段训练流程开发的多模态病理推理模型,在多种病理任务中展现出卓越性能。
English Summary: This study introduces Patho-R1, a multimodal pathology reasoning model trained using a three-stage pipeline on high-quality datasets developed with pathology textbooks and expert input, achieving robust performance across various pathology tasks.

Authors:Petr Kasalický, Martin Spišák, Vojtěch Vančura, Daniel Bohuněk, Rodrigo Alves, Pavel Kordík
Title: The Future is Sparse: Embedding Compression for Scalable Retrieval in Recommender Systems
Abstract:
Industry-scale recommender systems face a core challenge: representing entities with high cardinality, such as users or items, using dense embeddings that must be accessible during both training and inference. However, as embedding sizes grow, memory constraints make storage and access increasingly difficult. We describe a lightweight, learnable embedding compression technique that projects dense embeddings into a high-dimensional, sparsely activated space. Designed for retrieval tasks, our method reduces memory requirements while preserving retrieval performance, enabling scalable deployment under strict resource constraints. Our results demonstrate that leveraging sparsity is a promising approach for improving the efficiency of large-scale recommenders. We release our code at https://github.com/recombee/CompresSAE.
中文摘要:本文提出一种可学习的嵌入压缩技术,通过将稠密嵌入投影到高维稀疏空间,在保持检索性能的同时显著降低内存需求,适用于资源受限的大规模推荐系统部署。
English Summary: The paper introduces a learnable embedding compression method that projects dense embeddings into a high-dimensional sparse space, reducing memory usage while maintaining retrieval performance for large-scale recommender systems.
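A minimal sketch of projecting a dense embedding into a higher-dimensional, sparsely activated space is shown below. Top-k selection is used as the sparsifier, and random weights stand in for a learned projection. Both are assumptions for illustration, as are the sizes and names, and none of it is taken from the CompresSAE code.

```python
import random

def sparse_encode(dense, weights, k):
    """Project `dense` with `weights` (one row per output dim) and keep the top-k activations."""
    acts = [sum(w * x for w, x in zip(row, dense)) for row in weights]
    keep = sorted(range(len(acts)), key=lambda i: abs(acts[i]), reverse=True)[:k]
    return {i: acts[i] for i in keep}  # sparse storage: index -> value

rng = random.Random(0)
d_in, d_out, k = 8, 64, 4  # hypothetical sizes
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
emb = [rng.gauss(0, 1) for _ in range(d_in)]
code = sparse_encode(emb, W, k)
print(len(code), "non-zeros out of", d_out)
```

Only the k (index, value) pairs need to be stored per entity, so the memory cost depends on k and not on the width of the sparse space.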

Authors:Zeyu Gao, Yuxin Cui, Hao Wang, Siliang Qin, Yuanda Wang, Bolun Zhang, Chao Zhang
Title: DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
Abstract:
Decompilers are fundamental tools for critical security tasks, from vulnerability discovery to malware analysis, yet their evaluation remains fragmented. Existing approaches primarily focus on syntactic correctness through synthetic micro-benchmarks or subjective human ratings, failing to address real-world requirements for semantic fidelity and analyst usability. We present DecompileBench, the first comprehensive framework that enables effective evaluation of decompilers in reverse engineering workflows through three key components: real-world function extraction (comprising 23,400 functions from 130 real-world programs), runtime-aware validation, and automated human-centric assessment using LLM-as-Judge to quantify the effectiveness of decompilers in reverse engineering workflows. Through a systematic comparison between six industrial-strength decompilers and six recent LLM-powered approaches, we demonstrate that LLM-based methods surpass commercial tools in code understandability despite 52.2% lower functionality correctness. These findings highlight the potential of LLM-based approaches to transform human-centric reverse engineering. We open source DecompileBench (https://github.com/Jennieett/DecompileBench) to provide a framework to advance research on decompilers and assist security experts in making informed tool selections based on their specific requirements.
Chinese: DecompileBench 提出了首个全面评估反编译器的框架,通过真实世界函数、运行时验证和自动化人本评估,发现尽管功能正确性较低,基于LLM的方法在代码可理解性上优于商业工具。
English: DecompileBench introduces the first comprehensive framework for evaluating decompilers using real-world functions, runtime validation, and automated human-centric assessment, revealing that LLM-based methods excel in code understandability despite lower functionality correctness compared to commercial tools.

Authors:Raja Gond, Nipun Kwatra, Ramachandran Ramjee
Title: TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
Abstract:
Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLink. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Furthermore, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The communication of one subset is then overlapped with the computation of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce--RMSNorm kernel that carefully leverages Multimem instruction support available on Hopper and Blackwell NVIDIA GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch's computation, providing additional gains. Our evaluations demonstrate up to 1.29x speedup in latency and 1.26x higher throughput across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
Chinese: TokenWeave通过令牌分割技术和优化内核,在分布式大语言模型推理中重叠通信与计算以减少开销,实现了高达1.29倍的延迟加速和1.26倍的吞吐量提升。
English: TokenWeave introduces a token-splitting technique and optimized kernels to reduce overheads in distributed LLM inference by overlapping communication with computation, achieving up to 1.29x speedup in latency and 1.26x higher throughput.

Authors:Keunwoo Peter Yu, Joyce Chai
Title: Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models
Abstract:
Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering. However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed. We identify two core capabilities necessary for such settings, perceptual updating and contingency awareness, and propose a new benchmark task, Temporally-Grounded Language Generation (TGLG), to evaluate them. TGLG requires models to generate utterances in response to streaming video such that both content and timing align with dynamic visual input. To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric, TRACE, to evaluate TGLG by jointly measuring semantic similarity and temporal alignment. Finally, we present Vision-Language Model with Time-Synchronized Interleaving (VLM-TSI), a model that interleaves visual and linguistic tokens in a time-synchronized manner, enabling real-time language generation without relying on turn-based assumptions. Experimental results show that VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest, highlighting the difficulty of TGLG and motivating further research in real-time VLMs. Code and data are available at https://github.com/yukw777/tglg.
中文摘要:该研究提出了时序语言生成(TGLG)基准任务,用于评估视觉语言模型在实时交互环境中的表现,并开发了时间同步交错模型(VLM-TSI),其性能虽优于基线但仍凸显了实现语义准确性与时间精准性协同的持续挑战。
English Summary: The study introduces Temporally-Grounded Language Generation (TGLG), a benchmark task for evaluating vision-language models in real-time interactive settings, proposing a new model (VLM-TSI) that outperforms baselines but highlights the ongoing challenges in achieving precise semantic and temporal alignment.

Authors:Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K. R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, Pablo Samuel Castro
Title: Meta-World+: An Improved, Standardized, RL Benchmark
Abstract:
Meta-World is widely used for evaluating multi-task and meta-reinforcement learning agents, which are challenged to master diverse skills simultaneously. Since its introduction, however, there have been numerous undocumented changes that inhibit a fair comparison of algorithms. This work strives to disambiguate these results from the literature, while also leveraging the past versions of Meta-World to provide insights into multi-task and meta-reinforcement learning benchmark design. Through this process we release a new open-source version of Meta-World (https://github.com/Farama-Foundation/Metaworld/) that has full reproducibility of past results, is more technically ergonomic, and gives users more control over the tasks that are included in a task set.
中文: 本研究澄清了Meta-World中阻碍算法公平比较的未记录变更,并发布了新的开源版本,确保结果可复现、提升使用便利性,并支持任务集自定义。
English: This work clarifies undocumented changes in Meta-World that hindered fair algorithm comparisons and releases a new open-source version ensuring reproducibility, improved usability, and customizable task sets.

Authors:Wilson Wongso, Hao Xue, Flora D. Salim
Title: Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks
Abstract:
Understanding human mobility through Point-of-Interest (POI) recommendation is increasingly important for applications such as urban planning, personalized services, and generative agent simulation. However, progress in this field is hindered by two key challenges: the over-reliance on older datasets from 2012-2013 and the lack of reproducible, city-level check-in datasets that reflect diverse global regions. To address these gaps, we present Massive-STEPS (Massive Semantic Trajectories for Understanding POI Check-ins), a large-scale, publicly available benchmark dataset built upon the Semantic Trails dataset and enriched with semantic POI metadata. Massive-STEPS spans 12 geographically and culturally diverse cities and features more recent (2017-2018) and longer-duration (24 months) check-in data than prior datasets. We benchmarked a wide range of POI recommendation models on Massive-STEPS using both supervised and zero-shot approaches, and evaluated their performance across multiple urban contexts. By releasing Massive-STEPS, we aim to facilitate reproducible and equitable research in human mobility and POI recommendation. The dataset and benchmarking code are available at: https://github.com/cruiseresearchgroup/Massive-STEPS
中文: Massive-STEPS数据集通过提供近期、大规模且地理多样性的基准数据,解决了兴趣点推荐研究中的局限性,以促进人类移动性研究的可复现性和公平性。
English: The Massive-STEPS dataset addresses limitations in POI recommendation research by providing a recent, large-scale, and geographically diverse benchmark to enhance the study of human mobility.

Authors:Wenhao Qian, Zhenzhen Hu, Zijie Song, Jia Li
Title: Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification
Abstract:
Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available at https://github.com/Qianvenh/CDGLT.
中文: 本文提出CDGLT这一训练高效的多模态隐喻识别框架,通过概念漂移和提示构建策略弥合字面与比喻特征间的鸿沟,在降低计算成本的同时于MET-Meme基准测试中实现最优性能。
English: This paper introduces CDGLT, a training-efficient framework for multimodal metaphor identification that uses Concept Drift and prompt construction to bridge literal-figurative gaps while reducing computational costs, achieving state-of-the-art performance on the MET-Meme benchmark.
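
The abstract above names Spherical Linear Interpolation (SLERP) as the operation behind Concept Drift. The following is a minimal sketch of SLERP between two embedding vectors, using only the Python standard library. The function name `slerp`, the toy 3-dimensional vectors, and the interpolation fraction `t=0.5` are illustrative assumptions; the abstract does not say how CDGLT chooses its inputs or interpolation factor.

```python
import math


def slerp(u, v, t):
    """Spherical linear interpolation between the directions of u and v.

    t=0 returns the direction of u, t=1 the direction of v.
    """
    # Normalize both inputs so the interpolation happens on the unit sphere.
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    u = [x / norm_u for x in u]
    v = [x / norm_v for x in v]

    # Angle between the unit vectors; clamp to guard against rounding error.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)
    sin_omega = math.sin(omega)

    if sin_omega < 1e-8:
        if dot > 0:
            # Vectors are (nearly) parallel: every interpolant equals u.
            return list(u)
        # Opposite vectors have no unique great-circle path.
        raise ValueError("slerp is undefined for antipodal vectors")

    w_u = math.sin((1.0 - t) * omega) / sin_omega
    w_v = math.sin(t * omega) / sin_omega
    return [w_u * a + w_v * b for a, b in zip(u, v)]


# Hypothetical stand-ins for an image embedding and a text embedding.
image_emb = [1.0, 0.0, 0.0]
text_emb = [0.0, 2.0, 0.0]
drifted = slerp(image_emb, text_emb, 0.5)
```

Unlike plain linear interpolation, the result keeps unit norm for every `t`, which matters when embeddings are compared by cosine similarity.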

Authors:Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang
Title: Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation
Abstract:
Driven by the rapid growth of model parameters, parameter-efficient fine-tuning (PEFT) has become essential for adapting large models to diverse downstream tasks under constrained computational resources. Within this paradigm, orthogonal fine-tuning and its variants preserve semantic representations of pre-trained models, but struggle to achieve both expressiveness and efficiency in terms of parameter counts, memory, and computation. To overcome this limitation, we propose efficient Orthogonal Fine-Tuning with Principal Subspace adaptation (PSOFT), which confines orthogonal transformations to the principal subspace of pre-trained weights. Specifically, PSOFT constructs this subspace via matrix decomposition to enable compatible transformations with higher effective rank, establishes a theoretical condition that strictly maintains the geometry of this subspace for essential semantic preservation, and introduces efficient tunable vectors that gradually relax orthogonality during training to enhance adaptability. Extensive experiments on 35 NLP and CV tasks across four representative models demonstrate that PSOFT offers a practical and scalable solution to simultaneously achieve semantic preservation, expressiveness, and multi-dimensional efficiency in PEFT. The code is publicly available at https://github.com/fei407/PSOFT.
中文: PSOFT通过主成分子空间适配优化正交微调,在保持语义的同时提升表达能力和效率,适用于多种自然语言处理与计算机视觉任务。
English: PSOFT introduces principal subspace adaptation to orthogonal fine-tuning, enabling semantic preservation while enhancing expressiveness and efficiency across NLP and CV tasks.

Authors:Hangyu Zhou, Aaron Gokaslan, Volodymyr Kuleshov, Bharath Hariharan
Title: RanDeS: Randomized Delta Superposition for Multi-Model Compression
Abstract:
From a multi-model compression perspective, model merging enables memory-efficient serving of multiple models fine-tuned from the same base, but suffers from degraded performance due to interference among their task-specific parameter adjustments (i.e., deltas). In this paper, we reformulate model merging as a compress-and-retrieve scheme, revealing that the task interference arises from the summation of irrelevant deltas during model retrieval. To address this issue, we use random orthogonal transformations to decorrelate these vectors into self-cancellation. We show that this approach drastically reduces interference, improving performance across both vision and language tasks. Since these transformations are fully defined by random seeds, adding new models requires no extra memory. Further, their data- and model-agnostic nature enables easy addition or removal of models with minimal compute overhead, supporting efficient and flexible multi-model serving.
Chinese: 本文提出了一种采用随机正交变换的模型合并技术,通过解耦任务特定参数调整来减少干扰,在无需为新增模型分配额外内存的情况下,提升了视觉和语言任务的性能。
English: This paper introduces a model merging technique using random orthogonal transformations to decorrelate task-specific parameter adjustments, reducing interference and improving performance across vision and language tasks without requiring extra memory for new models.
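
The compress-and-retrieve view described above can be sketched with a toy example. This sketch uses seeded random sign flips (a diagonal matrix of +1/-1 entries, which is orthogonal) as one simple stand-in for a random orthogonal transformation; the abstract does not specify which transformations RanDeS uses, so that choice, along with the function names and toy deltas, is an assumption made for illustration.

```python
import random


def sign_vector(seed, dim):
    """Regenerate a model's +/-1 diagonal from its seed; only the seed is stored."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def compress(deltas, seeds):
    """Superpose all deltas after applying each model's own sign flip."""
    dim = len(deltas[0])
    merged = [0.0] * dim
    for delta, seed in zip(deltas, seeds):
        signs = sign_vector(seed, dim)
        for i in range(dim):
            merged[i] += signs[i] * delta[i]
    return merged


def retrieve(merged, seed):
    """Undo one model's sign flip.

    The target delta comes back exactly; every other delta stays multiplied
    by a product of random signs and behaves like zero-mean noise.
    """
    signs = sign_vector(seed, len(merged))
    return [s * m for s, m in zip(signs, merged)]


# Two perfectly correlated toy deltas: the worst case for naive summation.
dim = 2000
delta_a = [1.0] * dim
delta_b = [1.0] * dim

merged = compress([delta_a, delta_b], seeds=[1, 2])
recovered_a = retrieve(merged, seed=1)

# What is left over after subtracting the true delta is the interference.
interference = [r - d for r, d in zip(recovered_a, delta_a)]

# Average alignment of the interference with delta_a. Naive summation would
# give exactly 1.0 here, because the interference would be delta_b itself.
alignment = sum(x * d for x, d in zip(interference, delta_a)) / dim
```

In this toy setting the interference keeps its magnitude entry by entry, but its signs are randomized, so contributions tend to cancel when summed. Adding a third model only requires choosing a new seed.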

Authors:Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang
Title: DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling
Abstract:
Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns, highlighting the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet generation benchmarks, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53 at 512x512, with a 2.7x and 3.1x speedup over DiT-XL/2, respectively. Furthermore, experimental results on MS-COCO demonstrate that the purely convolutional DiCo exhibits strong potential for text-to-image generation. Code: https://github.com/shallowdream204/DiCo.
Chinese: Diffusion ConvNet(DiCo)模型通过采用带有紧凑通道注意力机制的增强卷积网络替代扩散变换器中计算成本高昂的全局自注意力,在图像生成任务中实现了卓越的生成性能和显著的效率提升。
English: The Diffusion ConvNet (DiCo) model replaces the computationally expensive global self-attention in Diffusion Transformers with an enhanced convolutional network featuring a compact channel attention mechanism, achieving superior generative performance and significant efficiency gains in image generation tasks.

Authors:Sicheng Shen, Dongcheng Zhao, Linghao Feng, Zeyang Yue, Jindong Li, Tenglong Li, Guobin Shen, Yi Zeng
Title: STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking
Abstract:
Spiking Transformers have recently emerged as promising architectures for combining the efficiency of spiking neural networks with the representational power of self-attention. However, the lack of standardized implementations, evaluation pipelines, and consistent design choices has hindered fair comparison and principled analysis. In this paper, we introduce STEP, a unified benchmark framework for Spiking Transformers that supports a wide range of tasks, including classification, segmentation, and detection across static, event-based, and sequential datasets. STEP provides modular support for diverse components such as spiking neurons, input encodings, surrogate gradients, and multiple backends (e.g., SpikingJelly, BrainCog). Using STEP, we reproduce and evaluate several representative models, and conduct systematic ablation studies on attention design, neuron types, encoding schemes, and temporal modeling capabilities. We also propose a unified analytical model for energy estimation, accounting for spike sparsity, bitwidth, and memory access, and show that quantized ANNs may offer comparable or better energy efficiency. Our results suggest that current Spiking Transformers rely heavily on convolutional frontends and lack strong temporal modeling, underscoring the need for spike-native architectural innovations. The full code is available at: https://github.com/Fancyssc/STEP
中文: 本文提出STEP统一基准框架,用于标准化评估脉冲Transformer,发现现有模型存在时序建模能力不足和过度依赖卷积前端的问题。
English: This paper introduces STEP, a unified benchmark for Spiking Transformers that enables standardized evaluation across tasks and reveals their current limitations in temporal modeling and reliance on convolutional frontends.

Authors:Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xiaochun Cao, Qingming Huang
Title: One Image is Worth a Thousand Words: A Usability Preservable Text-Image Collaborative Erasing Framework
Abstract:
Concept erasing has recently emerged as an effective paradigm to prevent text-to-image diffusion models from generating visually undesirable or even harmful content. However, current removal methods heavily rely on manually crafted text prompts, making it challenging to achieve a high erasure (efficacy) while minimizing the impact on other benign concepts (usability). In this paper, we attribute the limitations to the inherent gap between the text and image modalities, which makes it hard to transfer the intricately entangled concept knowledge from text prompts to the image generation process. To address this, we propose a novel solution by directly integrating visual supervision into the erasure process, introducing the first text-image Collaborative Concept Erasing (Co-Erasing) framework. Specifically, Co-Erasing describes the concept jointly by text prompts and the corresponding undesirable images induced by the prompts, and then reduces the generating probability of the target concept through negative guidance. This approach effectively bypasses the knowledge gap between text and image, significantly enhancing erasure efficacy. Additionally, we design a text-guided image concept refinement strategy that directs the model to focus on visual features most relevant to the specified text concept, minimizing disruption to other benign concepts. Finally, comprehensive experiments suggest that Co-Erasing outperforms state-of-the-art erasure approaches significantly with a better trade-off between efficacy and usability. Codes are available at https://github.com/Ferry-Li/Co-Erasing.
中文摘要:Co-Erasing框架通过结合文本提示与视觉监督,采用负向引导和概念精炼策略,在保持良性概念可用性的同时有效消除文本到图像模型中的不良概念,实现了效能与可用性的最佳平衡。
English Summary: The proposed Co-Erasing framework integrates visual supervision with text prompts to effectively remove undesirable concepts from text-to-image models while minimizing impact on benign concepts, achieving superior performance through negative guidance and concept refinement.

Authors:Chenhong Zhou, Jie Chen, Zaifeng Yang, Ching Eng Png
Title: Dual-Balancing for Physics-Informed Neural Networks
Abstract:
Physics-informed neural networks (PINNs) have emerged as a new learning paradigm for solving partial differential equations (PDEs) by incorporating the constraints of physical equations, boundary conditions (BCs), and initial conditions (ICs) into the loss function. Despite their successes, vanilla PINNs still suffer from poor accuracy and slow convergence due to the intractable multi-objective optimization issue. In this paper, we propose a novel Dual-Balanced PINN (DB-PINN), which dynamically adjusts loss weights by integrating inter-balancing and intra-balancing to alleviate two imbalance issues in PINNs. Inter-balancing aims to mitigate the gradient imbalance between PDE residual loss and condition-fitting losses by determining an aggregated weight that offsets their gradient distribution discrepancies. Intra-balancing acts on condition-fitting losses to tackle the imbalance in fitting difficulty across diverse conditions. By evaluating the fitting difficulty based on the loss records, intra-balancing can allocate the aggregated weight proportionally to each condition loss according to its fitting difficulty level. We further introduce a robust weight update strategy to prevent abrupt spikes and arithmetic overflow in instantaneous weight values caused by large loss variances, enabling smooth weight updating and stable training. Extensive experiments demonstrate that DB-PINN significantly outperforms popular gradient-based weighting methods in terms of convergence speed and prediction accuracy. Our code and supplementary material are available at https://github.com/chenhong-zhou/DualBalanced-PINNs.
中文: 本文提出了一种双平衡物理信息神经网络(DB-PINN),通过集成间平衡和内部平衡机制动态调整损失权重,有效解决了PINNs中的梯度失衡和条件拟合难度不均问题,在收敛速度和预测精度上显著优于主流方法。
English: This paper introduces a Dual-Balanced PINN (DB-PINN) that dynamically adjusts loss weights through inter-balancing and intra-balancing mechanisms to address gradient and fitting imbalances in PINNs, achieving superior convergence speed and prediction accuracy compared to existing methods.

Authors:Lin Zhu, Yijun Bian, Lei You
Title: FairSHAP: Preprocessing for Fairness Through Attribution-Based Data Augmentation
Abstract:
Ensuring fairness in machine learning models is critical, particularly in high-stakes domains where biased decisions can lead to serious societal consequences. Existing preprocessing approaches generally lack transparent mechanisms for identifying which features or instances are responsible for unfairness. This obscures the rationale behind data modifications. We introduce FairSHAP, a novel pre-processing framework that leverages Shapley value attribution to improve both individual and group fairness. FairSHAP identifies fairness-critical instances in the training data using an interpretable measure of feature importance, and systematically modifies them through instance-level matching across sensitive groups. This process reduces discriminative risk - an individual fairness metric - while preserving data integrity and model accuracy. We demonstrate that FairSHAP significantly improves demographic parity and equality of opportunity across diverse tabular datasets, achieving fairness gains with minimal data perturbation and, in some cases, improved predictive performance. As a model-agnostic and transparent method, FairSHAP integrates seamlessly into existing machine learning pipelines and provides actionable insights into the sources of bias. Our code is available at https://github.com/youlei202/FairSHAP.
中文: FairSHAP是一种新颖的预处理框架,利用Shapley值识别并修改训练数据中影响公平性的关键实例,在保持数据完整性和模型准确性的同时,显著提升了个体与群体公平性。
English: FairSHAP is a novel pre-processing framework that uses Shapley values to identify and modify fairness-critical instances in training data, enhancing both individual and group fairness while preserving data integrity and model accuracy.

Authors:Bin Liu, Chunyang Wang, Xuelian Liu, Bo Xiao, Guan Xi
Title: HyMamba: Mamba with Hybrid Geometry-Feature Coupling for Efficient Point Cloud Classification
Abstract:
Point cloud classification is one of the essential technologies for achieving intelligent perception of 3D environments by machines; its core challenge is to efficiently extract local and global features. Mamba leverages state space models (SSMs) for global point cloud modeling. Although prior Mamba-based point cloud processing methods address the limitation of its flattened sequence modeling mechanism in fusing local and global features, the critical issue of weakened local geometric relevance caused by decoupling geometric structures and features in the input patches has not been fully examined, and the two issues jointly limit local feature extraction. Therefore, we propose HyMamba, a geometry and feature coupled Mamba framework featuring: (1) Geometry-Feature Coupled Pooling (GFCP), which achieves physically interpretable geometric information coupling by dynamically aggregating adjacent geometric information into local features; (2) Collaborative Feature Enhancer (CoFE), which enhances sparse signal capture through cross-path feature hybridization while effectively integrating global and local contexts. We conducted extensive experiments on ModelNet40 and ScanObjectNN datasets. The results demonstrate that the proposed model achieves superior classification performance, particularly on ModelNet40, where it elevates accuracy to 95.99% with merely 0.03M additional parameters. Furthermore, it attains 98.9% accuracy on the ModelNetFewShot dataset, validating its robust generalization capabilities under sparse samples. Our code and weights are available at https://github.com/L1277471578/HyMamba
中文摘要:HyMamba通过几何-特征耦合框架的GFCP和CoFE模块,强化局部几何关联与特征融合,在ModelNet40上以仅0.03M参数增量实现95.99%的分类精度突破。
English Summary: HyMamba introduces a geometry-feature coupled framework with GFCP and CoFE modules to enhance local geometric relevance and feature integration, achieving state-of-the-art classification accuracy of 95.99% on ModelNet40 with minimal parameter increase.

Authors:Guangqiang Li, M. Amine Atoui, Xiangshun Li
Title: Fault Diagnosis across Heterogeneous Domains via Self-Adaptive Temporal-Spatial Attention and Sample Generation
Abstract:
Deep learning methods have shown promising performance in fault diagnosis for multimode processes. Most existing studies assume that the collected health state categories from different operating modes are identical. However, in real industrial scenarios, these categories typically exhibit only partial overlap. The incompleteness of the available data and the large distributional differences between the operating modes pose a significant challenge to existing fault diagnosis methods. To address this problem, a novel fault diagnosis model named self-adaptive temporal-spatial attention network (TSA-SAN) is proposed. First, inter-mode mappings are constructed using healthy category data to generate multimode samples. To enrich the diversity of the fault data, interpolation is performed between healthy and fault samples. Subsequently, the fault diagnosis model is trained using real and generated data. Self-adaptive instance normalization is established to suppress irrelevant information while retaining essential statistical features for diagnosis. In addition, a temporal-spatial attention mechanism is constructed to focus on the key features, thus enhancing the generalization ability of the model. Extensive experiments demonstrate that the proposed model significantly outperforms the state-of-the-art methods. The code will be available on GitHub at https://github.com/GuangqiangLi/TSA-SAN.
中文摘要:提出的自适应时空注意力网络(TSA-SAN)通过生成合成数据并采用自适应归一化和注意力机制,解决了多模态故障诊断中类别部分重叠和分布差异的问题,实验证明其性能显著优于现有方法。
English Summary: The proposed self-adaptive temporal-spatial attention network (TSA-SAN) addresses partial category overlap and distribution differences in multimode fault diagnosis by generating synthetic data and employing adaptive normalization with attention mechanisms, demonstrating superior performance over existing methods.

Authors:Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer
Title: BLEUBERI: BLEU is a surprisingly effective reward for instruction following
Abstract:
Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at https://github.com/lilakk/BLEUBERI.
Chinese: 研究表明,简单的字符串匹配指标如BLEU可有效替代成本高昂的奖励模型来对齐语言模型与人类偏好,提出的BLEUBERI方法直接使用BLEU作为奖励函数,在指令遵循任务中实现与奖励模型指导的强化学习相竞争的性能,同时增强生成内容的事实依据性。
English: The study reveals that simple string-matching metrics like BLEU can effectively replace costly reward models for aligning language models with human preferences, introducing BLEUBERI, a method that leverages BLEU as a reward function to achieve competitive performance in instruction-following tasks while enhancing factual grounding.
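
To make the idea of "BLEU directly as the reward function" concrete, here is a minimal sketch that scores a group of sampled responses against one reference and converts the scores into group-relative advantages, the normalization used by GRPO. The simplified sentence-level BLEU with add-one smoothing, the whitespace tokenization, and the example sentences are assumptions for illustration; the abstract does not specify the BLEU variant or tokenizer used by BLEUBERI.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing on each n-gram precision."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: a candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty discourages candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


reference = "the cat sat on the mat"
group = [
    "the cat sat on the mat",           # exact match
    "the cat sat on a mat",             # close paraphrase
    "dogs bark loudly at night today",  # unrelated
]
rewards = [bleu(sample, reference) for sample in group]
advantages = group_relative_advantages(rewards)
```

Responses closer to the reference receive positive advantages and the unrelated one a negative advantage, which is the signal a policy-gradient update would then use.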

Authors:Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen
Title: ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection
Abstract:
Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a heuristic question arises: Can ALLMs be leveraged to solve ADD? In this paper, we first conduct a comprehensive zero-shot evaluation of ALLMs on ADD, revealing their ineffectiveness. To address this, we propose ALLM4ADD, an ALLM-driven framework for ADD. Specifically, we reformulate the ADD task as an audio question answering problem, prompting the model with the question: "Is this audio fake or real?". We then perform supervised fine-tuning to enable the ALLM to assess the authenticity of query audio. Extensive experiments are conducted to demonstrate that our ALLM-based method can achieve superior performance in fake audio detection, particularly in data-scarce scenarios. As a pioneering study, we anticipate that this work will inspire the research community to leverage ALLMs to develop more effective ADD systems. Code is available at https://github.com/ucas-hao/qwen_audio_for_add.git
中文摘要:本文提出ALLM4ADD框架,将音频深度伪造检测重新定义为音频问答任务,通过对音频大语言模型进行监督微调,实现了尤其在数据稀缺场景下更优的伪造音频检测性能。
English Summary: This paper introduces ALLM4ADD, a framework that reformulates audio deepfake detection as an audio question answering task and uses supervised fine-tuning of audio large language models to achieve superior performance, especially in data-scarce scenarios.

Authors:Vladimír Boža, Vladimír Macko
Title: Addition is almost all you need: Compressing neural networks with double binary factorization
Abstract:
Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient approach to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint (±1) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in a 1-bit per weight range, DBF is better than existing binarization approaches. In a 2-bit per weight range, DBF is competitive with the best quantization methods like QuIP# and QTIP. Unlike most existing compression techniques, which offer limited compression level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization's intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria. Code available at: https://github.com/usamec/double_binary
中文摘要:本文提出双二元分解(DBF)方法,通过将稠密权重矩阵分解为两个带缩放向量的二元矩阵,在保持计算效率的同时,实现了优于或媲美现有先进方法的压缩率,并具备对压缩比的精细调控能力。
English Summary: The paper introduces Double Binary Factorization (DBF), a method that decomposes weight matrices into two binary matrices with scaling vectors, maintaining computational efficiency while achieving superior or competitive compression rates compared to state-of-the-art techniques, with the added benefit of fine-grained control over compression ratios.
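
A rough sketch of how a product of two sign matrices with scaling vectors can replace a dense layer, and how the intermediate dimension sets the storage cost. The placement of the scaling vectors (`a`, `m`, `b`) is an assumption made for illustration and may differ from the authors' formulation; `bits_per_weight` ignores the small cost of the scaling vectors.

```python
def bits_per_weight(n_out, n_in, k):
    """Bits per original weight when an n_out x n_in matrix is stored as
    sign matrices of shape n_out x k and k x n_in (scaling vectors ignored)."""
    return k * (n_out + n_in) / (n_out * n_in)


def sign_matvec(S, v):
    """Multiply a +/-1 matrix by a vector using only additions and subtractions."""
    out = []
    for row in S:
        acc = 0.0
        for s, x in zip(row, v):
            acc = acc + x if s > 0 else acc - x
        out.append(acc)
    return out


def dbf_forward(a, A, m, B, b, x):
    """Assumed form: y = diag(a) A diag(m) B diag(b) x, with A and B sign matrices."""
    h = sign_matvec(B, [bi * xi for bi, xi in zip(b, x)])
    h = [mi * hi for mi, hi in zip(m, h)]
    y = sign_matvec(A, h)
    return [ai * yi for ai, yi in zip(a, y)]


# Hypothetical 2x2 example with unit scaling vectors.
A = [[1, -1], [1, 1]]
B = [[1, 1], [-1, 1]]
y = dbf_forward([1.0, 1.0], A, [1.0, 1.0], B, [1.0, 1.0], [1.0, 2.0])
```

For a square 4096 x 4096 layer, an intermediate dimension of 2048 gives about 1 bit per weight and 4096 gives about 2 bits per weight under this count, which is how the intermediate dimension acts as a continuous compression knob.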

Authors:Rui Wang, Shichun Yang, Yuyi Chen, Zhuoyang Li, Zexiang Tong, Jianyi Xu, Jiayi Lu, Xinjie Feng, Yaoguang Cao
Title: A Multi-modal Fusion Network for Terrain Perception Based on Illumination Aware
Abstract:
Road terrains play a crucial role in ensuring the driving safety of autonomous vehicles (AVs). However, existing sensors of AVs, including cameras and Lidars, are susceptible to variations in lighting and weather conditions, making it challenging to achieve real-time perception of road conditions. In this paper, we propose an illumination-aware multi-modal fusion network (IMF), which leverages both exteroceptive and proprioceptive perception and optimizes the fusion process based on illumination features. We introduce an illumination-perception sub-network to accurately estimate illumination features. Moreover, we design a multi-modal fusion network which is able to dynamically adjust weights of different modalities according to illumination features. We enhance the optimization process by pre-training of the illumination-perception sub-network and incorporating illumination loss as one of the training constraints. Extensive experiments demonstrate that the IMF shows a superior performance compared to state-of-the-art methods. The comparison results with single modality perception methods highlight the comprehensive advantages of multi-modal fusion in accurately perceiving road terrains under varying lighting conditions. Our dataset is available at: https://github.com/lindawang2016/IMF.
中文摘要:本文提出的光照感知多模态融合网络(IMF)通过光照特征动态整合外部与本体感知数据,有效提升了自动驾驶车辆在不同光照条件下对道路地形的感知能力,其性能优于现有先进方法。
English Summary: The proposed illumination-aware multi-modal fusion network (IMF) dynamically integrates exteroceptive and proprioceptive data using illumination features to significantly enhance autonomous vehicles' road terrain perception under varying lighting conditions, outperforming existing methods.

Authors:Changlun Li, Yao Shi, Chen Wang, Qiqi Duan, Runke Ruan, Weijie Huang, Haonan Long, Lijun Huang, Yuyu Luo, Nan Tang
Title: Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking
Abstract:
Large Language Models (LLMs) have demonstrated notable capabilities across financial tasks, including financial report summarization, earnings call transcript analysis, and asset classification. However, their real-world effectiveness in managing complex fund investment remains inadequately assessed. A fundamental limitation of existing benchmarks for evaluating LLM-driven trading strategies is their reliance on historical back-testing, which inadvertently enables LLMs to "time travel", leveraging future information embedded in their training corpora and thus causing possible information leakage and overly optimistic performance estimates. To address this issue, we introduce DeepFund, a live fund benchmark tool designed to rigorously evaluate LLMs in real-time market conditions. Utilizing a multi-agent architecture, DeepFund connects directly with real-time stock market data (specifically, data published after each model's pretraining cutoff) to ensure fair and leakage-free evaluations. Empirical tests on nine flagship LLMs from leading global institutions across multiple investment dimensions (including ticker-level analysis, investment decision-making, portfolio management, and risk control) reveal significant practical challenges. Notably, even cutting-edge models such as DeepSeek-V3 and Claude-3.7-Sonnet incur net trading losses within DeepFund's real-time evaluation environment, underscoring the present limitations of LLMs for active fund management. Our code is available at https://github.com/HKUSTDial/DeepFund.
中文: 大型语言模型在金融任务中展现出潜力,但在实时基金管理中面临挑战:现有基准依赖历史回测,存在信息泄露风险,而DeepFund实时基准测试显示,即便如DeepSeek-V3和Claude-3.7-Sonnet等先进模型也会出现净交易亏损。
English: Large language models show promise in financial tasks but face challenges in real-time fund management. Because historical back-testing risks information leakage, DeepFund evaluates models on live market data, where even advanced models like DeepSeek-V3 and Claude-3.7-Sonnet incur net trading losses.

Authors:Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi
Title: GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
Abstract:
To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model's reasoning ability via SFT. In addition, we further enhance reasoning regarding moderation through online RL. Concretely, to enhance diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/
中文: 本文提出GuardReasoner-VL,一种基于推理的视觉语言模型安全防护模型,通过在线强化学习训练模型在做出审核决策前进行审慎推理,在平均F1分数上以19.27%的优势超越次优模型。
English: This paper introduces GuardReasoner-VL, a reasoning-based VLM guard model trained with online reinforcement learning to reason deliberately before making moderation decisions, surpassing the runner-up by 19.27% F1 score on average.

Authors:Zongye Zhang, Bohan Kong, Qingjie Liu, Yunhong Wang
Title: Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion
Abstract:
Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. Existing VQVAE-based methods often fail to represent novel motions faithfully using discrete tokens, which hampers their ability to generalize beyond seen data. Meanwhile, diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose a robust motion generation framework MoMADiff, which combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both spatial and temporal aspects of motion synthesis. MoMADiff demonstrates strong generalization capability on novel text-to-motion datasets with sparse keyframes as motion prompts. Extensive experiments on two held-out datasets and two standard benchmarks show that our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and keyframe adherence. The code is available at: https://github.com/zzysteve/MoMADiff
中文摘要:提出的MoMADiff框架结合掩码建模与扩散过程,通过帧级连续表示生成三维人体运动,相比现有方法在运动质量、文本遵循和关键帧控制方面表现更优。
English Summary: The proposed MoMADiff framework combines masked modeling with diffusion processes to generate 3D human motions from text, offering superior generalization and precise keyframe control compared to existing methods.

Authors:Bo Du, Xuekang Zhu, Xiaochen Ma, Chenfan Qu, Kaiwen Feng, Zhe Yang, Chi-Man Pun, Jian Liu, Jizhe Zhou
Title: ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization
Abstract:
The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark for all domains in FIDL is still missing. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To break down these domain silos, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering drastic variations in dataset, model, and evaluation configurations across all domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models, 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks of DeepfakeBench and IMDLBenCo through an adapter-based design; iii) conducts in-depth analysis based on ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs.
中文摘要:ForensicHub作为首个全领域伪造图像检测与定位的统一基准,通过模块化架构整合四大检测领域,实现了跨领域模型比较与评估,为打破领域壁垒提供了关键解决方案。
English Summary: The ForensicHub benchmark addresses fragmentation in Fake Image Detection and Localization by unifying four domains through a modular architecture, implementing multiple baselines and benchmarks while providing key insights to advance the field.

Authors:Rees Chang, Angela Pak, Alex Guerra, Ni Zhan, Nick Richardson, Elif Ertekin, Ryan P. Adams
Title: Space Group Equivariant Crystal Diffusion
Abstract:
Accelerating inverse design of crystalline materials with generative models has significant implications for a range of technologies. Unlike other atomic systems, 3D crystals are invariant to discrete groups of isometries called the space groups. Crucially, these space group symmetries are known to heavily influence materials properties. We propose SGEquiDiff, a crystal generative model which naturally handles space group constraints with space group invariant likelihoods. SGEquiDiff consists of an SE(3)-invariant, telescoping discrete sampler of crystal lattices; permutation-invariant, transformer-based autoregressive sampling of Wyckoff positions, elements, and numbers of symmetrically unique atoms; and space group equivariant diffusion of atomic coordinates. We show that space group equivariant vector fields automatically live in the tangent spaces of the Wyckoff positions. SGEquiDiff achieves state-of-the-art performance on standard benchmark datasets as assessed by quantitative proxy metrics and quantum mechanical calculations. Our code is available at https://github.com/rees-c/sgequidiff.
中文摘要:SGEquiDiff是一种创新的晶体生成模型,通过整合空间群对称性来优化晶体材料的逆向设计,在基准数据集上实现了最先进的性能表现。
English Summary: SGEquiDiff is a novel crystal generative model that incorporates space group symmetries to enhance the inverse design of crystalline materials, achieving state-of-the-art performance on benchmark datasets.

Authors:Haiyang Shen, Hang Yan, Zhongshi Xing, Mugeng Liu, Yue Li, Zhiyang Chen, Yuxiang Wang, Jiuzheng Wang, Yun Ma
Title: RAGSynth: Synthetic Data for Robust and Faithful RAG Component Optimization
Abstract:
RAG can enhance the performance of LLMs on knowledge-intensive tasks. Various RAG paradigms, including vanilla, planning-based, and iterative RAG, are built upon 2 cores: the retriever, which should robustly select relevant documents across complex queries, and the generator, which should faithfully synthesize responses. However, existing retrievers rely heavily on public knowledge and struggle with queries of varying logical complexity and clue completeness, while generators frequently face fidelity problems. In this work, we introduce RAGSynth, a framework that includes a data construction modeling and a corresponding synthetic data generation implementation, designed to optimize retriever robustness and generator fidelity. Additionally, we present SynthBench, a benchmark encompassing 8 domain-specific documents across 4 domains, featuring diverse query complexities, clue completeness, and fine-grained citation granularity. Leveraging RAGSynth, we generate a large-scale synthetic dataset, including single and multi-hop. Extensive experiments demonstrate that the synthetic data significantly improves the robustness of the retrievers and the fidelity of the generators. Additional evaluations confirm that RAGSynth can also generalize well across different domains. By integrating the optimized retrievers into various RAG paradigms, we consistently observe enhanced RAG system performance. We have open-sourced the implementation on https://github.com/EachSheep/RAGSynth.
中文摘要:RAGSynth框架通过生成合成数据来增强检索器的鲁棒性和生成器的忠实度,实验证明其能显著提升RAG系统在多个领域的性能表现。
English Summary: RAGSynth is a framework that enhances RAG systems by generating synthetic data to improve retriever robustness and generator fidelity, with experiments showing significant performance gains across multiple domains.

Authors:Yexiang Liu, Zekun Li, Zhi Fang, Nan Xu, Ran He, Tieniu Tan
Title: Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory
Abstract:
Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs × 8 prompting strategies × 6 benchmarks. Experiment results consistently show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times, eliminating the need for resource-intensive inference processes in practical applications. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can prompt a re-examination of the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance. Code is available at https://github.com/MraDonkey/rethinking_prompting.
Chinese: 本研究发现在测试时计算扩展过程中,随着资源增加,复杂提示策略会逐渐被简单的思维链方法超越,并提出无需大量推理即可预测最优策略并提升扩展性能的高效方法。
English: This study reveals that as computational resources increase during test-time scaling, complex prompting strategies are outperformed by simple Chain-of-Thought, and proposes efficient methods to predict optimal strategies and enhance scaling performance without extensive inference.
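
A toy calculation of how majority voting can reverse the ranking of two strategies as the number of samples grows. This is not the paper's probabilistic method: it assumes binary answers, independent samples, and an odd number of votes, and the per-question accuracies for the two hypothetical strategies are made up for illustration.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most frequent answer among the sampled answers."""
    return Counter(answers).most_common(1)[0][0]


def p_majority_correct(p, n):
    """Probability that more than half of n independent samples are correct
    (binary answers, n odd), where p is the per-sample accuracy."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))


def expected_accuracy(question_mix, n):
    """question_mix: list of (fraction_of_questions, per_sample_accuracy)."""
    return sum(w * p_majority_correct(p, n) for w, p in question_mix)


# Hypothetical strategies: A is strong on most questions but usually wrong on the rest,
# B is moderately right on every question.
strategy_a = [(0.7, 0.9), (0.3, 0.4)]
strategy_b = [(1.0, 0.6)]

for n in (1, 11, 101):
    print(n, round(expected_accuracy(strategy_a, n), 3), round(expected_accuracy(strategy_b, n), 3))
```

With one sample, strategy A has the higher expected accuracy, but voting pushes each question toward its most likely answer, so questions where the correct answer is not the most likely one are lost while strategy B keeps improving.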

Authors:Vijay Prakash Dwivedi, Sri Jaladi, Yangyi Shen, Federico López, Charilaos I. Kanatsoulis, Rishi Puri, Matthias Fey, Jure Leskovec
Title: Relational Graph Transformer
Abstract:
Relational Deep Learning (RDL) is a promising approach for building state-of-the-art predictive models on multi-table relational data by representing it as a heterogeneous temporal graph. However, commonly used Graph Neural Network models suffer from fundamental limitations in capturing complex structural patterns and long-range dependencies that are inherent in relational data. While Graph Transformers have emerged as powerful alternatives to GNNs on general graphs, applying them to relational entity graphs presents unique challenges: (i) Traditional positional encodings fail to generalize to massive, heterogeneous graphs; (ii) existing architectures cannot model the temporal dynamics and schema constraints of relational data; (iii) existing tokenization schemes lose critical structural information. Here we introduce the Relational Graph Transformer (RelGT), the first graph transformer architecture designed specifically for relational tables. RelGT employs a novel multi-element tokenization strategy that decomposes each node into five components (features, type, hop distance, time, and local structure), enabling efficient encoding of heterogeneity, temporality, and topology without expensive precomputation. Our architecture combines local attention over sampled subgraphs with global attention to learnable centroids, incorporating both local and database-wide representations. Across 21 tasks from the RelBench benchmark, RelGT consistently matches or outperforms GNN baselines by up to 18%, establishing Graph Transformers as a powerful architecture for Relational Deep Learning.
中文:关系图变换器(RelGT)是一种专为多表关系数据设计的新型架构,通过多元素标记化策略结合局部与全局注意力机制,有效解决了传统图神经网络在异质性和时序数据处理上的局限,在基准测试中性能显著优于现有方法。
English: The Relational Graph Transformer (RelGT) is a novel architecture designed to overcome the limitations of traditional Graph Neural Networks in handling heterogeneous and temporal relational data by employing a multi-element tokenization strategy and combining local and global attention mechanisms, achieving superior performance on benchmark tasks.

Authors:Mohammadtaha Bagherifard, Sahar Rajabi, Ali Edalat, Yadollah Yaghoobzadeh
Title: GenKnowSub: Improving Modularity and Reusability of LLMs through General Knowledge Subtraction
Abstract:
Large language models often struggle with zero-shot generalization, and several modular approaches have been proposed to address this challenge. Yet, we hypothesize that a key limitation remains: the entanglement of general knowledge and task-specific adaptations. To overcome this, we propose a modular framework that disentangles these components by constructing a library of task-specific LoRA modules alongside a general-domain LoRA. By subtracting this general knowledge component from each task-specific module, we obtain residual modules that focus more exclusively on task-relevant information, a method we call general knowledge subtraction (GenKnowSub). Leveraging the refined task-specific modules and the Arrow routing algorithm (Ostapenko et al., 2024), we dynamically select and combine modules for new inputs without additional training. Our studies, with the Phi-3 model and standard Arrow as baselines, reveal that using general knowledge LoRAs derived from diverse languages, including English, French, and German, yields consistent performance gains in both monolingual and cross-lingual settings across a wide set of benchmarks. Further experiments on Phi-2 demonstrate how GenKnowSub generalizes to weaker LLMs. The complete code and data are available at https://github.com/saharsamr/Modular-LLM.
中文: 本文提出GenKnowSub模块化框架,通过从任务特定模块中减去通用知识LoRA来解耦通用知识与任务适配,无需重新训练即可动态组合模块,在多语言和跨语言场景中显著提升零样本泛化性能。
English: This paper introduces GenKnowSub, a modular framework that disentangles general knowledge from task-specific adaptations by subtracting a general-domain LoRA from task-specific modules, enabling dynamic combination for improved zero-shot generalization across languages and benchmarks without retraining.
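
The subtraction step can be pictured with a small sketch. It assumes the subtraction is applied to the dense update `B @ A` implied by each LoRA pair; the abstract does not say whether the authors subtract the low-rank factors or their product, so this choice, along with the helper names and the rank-1 toy adapters, is hypothetical.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_delta(B, A, scale=1.0):
    """Dense weight update implied by a LoRA pair: scale * (B @ A)."""
    return [[scale * v for v in row] for row in matmul(B, A)]


def subtract(X, Y):
    """Element-wise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Hypothetical rank-1 adapters for a 2x2 layer.
task_B, task_A = [[1.0], [2.0]], [[3.0, 1.0]]
general_B, general_A = [[1.0], [1.0]], [[1.0, 1.0]]

# Residual module: task-specific update with the general-knowledge update removed.
residual = subtract(lora_delta(task_B, task_A), lora_delta(general_B, general_A))
```

In this sketch the residual is what a router would select and combine at inference time, in place of the original task-specific update.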

Authors:Chao Wang, Wei Lu, Xiang Li, Jian Yang, Lei Luo
Title: M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection
Abstract:
Single-source remote sensing object detection using optical or SAR images struggles in complex environments. Optical images offer rich textural details but are often affected by low-light, cloud-obscured, or low-resolution conditions, reducing the detection performance. SAR images are robust to weather, but suffer from speckle noise and limited semantic expressiveness. Optical and SAR images provide complementary advantages, and fusing them can significantly improve the detection accuracy. However, progress in this field is hindered by the lack of large-scale, standardized datasets. To address these challenges, we propose the first comprehensive dataset for optical-SAR fusion object detection, named Multi-resolution, Multi-polarization, Multi-scene, Multi-source SAR dataset (M4-SAR). It contains 112,184 precisely aligned image pairs and nearly one million labeled instances with arbitrary orientations, spanning six key categories. To enable standardized evaluation, we develop a unified benchmarking toolkit that integrates six state-of-the-art multi-source fusion methods. Furthermore, we propose E2E-OSDet, a novel end-to-end multi-source fusion detection framework that mitigates cross-domain discrepancies and establishes a robust baseline for future studies. Extensive experiments on M4-SAR demonstrate that fusing optical and SAR data can improve mAP by 5.7% over single-source inputs, with particularly significant gains in complex environments. The dataset and code are publicly available at https://github.com/wchao0601/M4-SAR.
中文: 提出的M4-SAR数据集和E2E-OSDet框架通过光学与SAR图像融合解决了单源遥感在复杂环境中的局限性,相比单源输入mAP提升5.7%,在复杂环境下提升尤为显著。
English: The proposed M4-SAR dataset and E2E-OSDet framework address the limitations of single-source remote sensing by enabling optical-SAR fusion, improving mAP by 5.7% over single-source inputs, with particularly significant gains in complex environments.

Authors:Congcong Zhu, Xiaoyan Xu, Jiayue Han, Jingrun Chen
Title: Physics-informed Temporal Alignment for Auto-regressive PDE Foundation Models
Abstract:
Auto-regressive partial differential equation (PDE) foundation models have shown great potential in handling time-dependent data. However, these models suffer from the shortcut problem deeply rooted in auto-regressive prediction, causing error accumulation. The challenge becomes particularly evident for out-of-distribution data, as the pretraining performance may approach random model initialization for downstream tasks with long-term dynamics. To deal with this problem, we propose physics-informed temporal alignment (PITA), a self-supervised learning framework inspired by inverse problem solving. Specifically, PITA aligns the physical dynamics discovered at different time steps on each given PDE trajectory by integrating physics-informed constraints into the self-supervision signal. The alignment is derived from observation data without relying on known physics priors, indicating strong generalization ability to the out-of-distribution data. Extensive experiments show that PITA significantly enhances the accuracy and robustness of existing foundation models on diverse time-dependent PDE data. The code is available at https://github.com/SCAILab-USTC/PITA.
Chinese: 自回归偏微分方程基础模型存在误差累积问题,尤其对分布外数据表现不佳,而提出的物理信息时间对齐(PITA)框架通过自监督学习有效提升了其准确性和鲁棒性。
English: Auto-regressive PDE foundation models face error accumulation issues, especially with out-of-distribution data, but the proposed physics-informed temporal alignment (PITA) framework enhances their accuracy and robustness through self-supervised learning.

Authors:Saad Manzur, Bryan Vela, Brandon Vela, Aditya Agrawal, Lan-Anh Dang-Vu, David Li, Wayne Hayes
Title: PoseBench3D: A Cross-Dataset Analysis Framework for 3D Human Pose Estimation via Pose Lifting Networks
Abstract:
Reliable three-dimensional human pose estimation (3D HPE) remains challenging due to the differences in viewpoints, environments, and camera conventions among datasets. As a result, methods that achieve near-optimal in-dataset accuracy often degrade on unseen datasets. In practice, however, systems must adapt to diverse viewpoints, environments, and camera setups--conditions that differ significantly from those encountered during training, which is often the case in real-world scenarios. Measuring cross-dataset performance is a vital process, but extremely labor-intensive when done manually for human pose estimation. To address these challenges, we automate this evaluation using PoseBench3D, a standardized testing framework that enables consistent and fair cross-dataset comparisons on previously unseen data. PoseBench3D streamlines testing across four widely used 3D HPE datasets via a single, configurable interface. Using this framework, we re-evaluate 18 methods and report over 100 cross-dataset results under Protocol 1: MPJPE and Protocol 2: PA-MPJPE, revealing systematic generalization gaps and the impact of common preprocessing and dataset setup choices. The PoseBench3D code is found at: https://github.com/bryanjvela/PoseBench3D
中文: PoseBench3D是一个标准化框架,通过自动化三维人体姿态估计的跨数据集评估,在统一协议下测试了18种方法,揭示了系统性泛化差距。
English: PoseBench3D is a standardized framework that automates cross-dataset evaluation for 3D human pose estimation, revealing generalization gaps by testing 18 methods across four datasets under consistent protocols.
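
For reference, the Protocol 1 metric named above, MPJPE (mean per-joint position error), is the average Euclidean distance between predicted and ground-truth joints. The sketch below is a plain-Python illustration with made-up joint coordinates, not code from PoseBench3D; PA-MPJPE (Protocol 2) additionally applies a rigid Procrustes alignment before computing the same distance and is not implemented here.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred and gt are equal-length sequences of (x, y, z) joint positions
    expressed in the same units and coordinate frame.
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and have the same number of joints")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


# Hypothetical two-joint pose, coordinates in millimetres.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(predicted, ground_truth)
```

Because the metric depends on the coordinate frame and joint ordering, mismatched camera conventions or skeleton definitions between datasets inflate it directly, which is one reason a shared evaluation interface matters for cross-dataset comparison.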

Authors:Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding
Title: InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
Abstract:
This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. Our generality is demonstrated by our ability to evaluate not only pure vision-based real-world benchmarks (i.e., OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Codes and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.
中文: 本文介绍了InfantAgent-Next这一通用多模态智能体,它通过模块化架构整合工具型与纯视觉智能体,能够协同处理各类计算机交互任务,并在OSWorld基准测试中取得了7.27%的准确率。
English: This paper presents InfantAgent-Next, a versatile multimodal agent that integrates tool-based and pure vision agents within a modular framework to collaboratively handle diverse computer tasks, achieving 7.27% accuracy on the OSWorld benchmark.

Authors:Filippo Leveni, Luca Magri, Cesare Alippi, Giacomo Boracchi
Title: Hashing for Structure-based Anomaly Detection
Abstract:
We focus on the problem of identifying samples in a set that do not conform to structured patterns represented by low-dimensional manifolds. An effective way to solve this problem is to embed data in a high dimensional space, called Preference Space, where anomalies can be identified as the most isolated points. In this work, we employ Locality Sensitive Hashing to avoid explicit computation of distances in high dimensions and thus improve Anomaly Detection efficiency. Specifically, we present an isolation-based anomaly detection technique designed to work in the Preference Space which achieves state-of-the-art performance at a lower computational cost. Code is publicly available at https://github.com/ineveLoppiliF/Hashing-for-Structure-based-Anomaly-Detection.
中文摘要:本研究提出一种高效异常检测方法,通过将数据嵌入高维偏好空间并采用局部敏感哈希技术,在保持最优性能的同时显著降低了计算成本。
English Summary: This study introduces an efficient anomaly detection method that identifies outliers by embedding data into a high-dimensional Preference Space and using Locality Sensitive Hashing to reduce computational costs while maintaining state-of-the-art performance.

Authors:Jiacheng Liang, Tanqiu Jiang, Yuhui Wang, Rongyi Zhu, Fenglong Ma, Ting Wang
Title: AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models
Abstract:
This paper presents AutoRAN, the first automated, weak-to-strong jailbreak attack framework targeting large reasoning models (LRMs). At its core, AutoRAN leverages a weak, less-aligned reasoning model to simulate the target model's high-level reasoning structures, generates narrative prompts, and iteratively refines candidate prompts by incorporating the target model's intermediate reasoning steps. We evaluate AutoRAN against state-of-the-art LRMs including GPT-o3/o4-mini and Gemini-2.5-Flash across multiple benchmark datasets (AdvBench, HarmBench, and StrongReject). Results demonstrate that AutoRAN achieves remarkable success rates (approaching 100%) within one or a few turns across different LRMs, even when judged by a robustly aligned external model. This work reveals that leveraging weak reasoning models can effectively exploit the critical vulnerabilities of much more capable reasoning models, highlighting the need for improved safety measures specifically designed for reasoning-based models. The code for replicating AutoRAN and running records are available at: (https://github.com/JACKPURCELL/AutoRAN-public). (warning: this paper contains potentially harmful content generated by LRMs.)
中文: AutoRAN是首个自动化越狱攻击框架,通过利用弱对齐推理模型模拟目标模型的高级推理结构,成功揭示了大型推理模型的关键安全漏洞,在多种先进模型上实现了接近100%的攻击成功率。
English: AutoRAN is the first automated jailbreak framework that uses a weakly-aligned reasoning model to exploit vulnerabilities in advanced large reasoning models, achieving near-perfect success rates and revealing critical safety gaps in reasoning-based AI systems.

Authors:Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, Dongbin Zhao
Title: Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL
Abstract:
Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs. Project Page: https://github.com/ScienceOne-AI/AutoThink.
Chinese: AutoThink是一种强化学习框架,使大型推理模型能够根据问题复杂度动态选择是否进行显式推理,在提升准确率的同时大幅降低计算消耗。
English: AutoThink is a reinforcement learning framework that enables large reasoning models to dynamically decide when to engage in explicit reasoning, achieving improved accuracy with significantly reduced computational overhead.

Authors:Kaifa Yang, Qi Yang, Zhu Li, Yiling Xu
Title: Textured mesh Quality Assessment using Geometry and Color Field Similarity
Abstract:
Textured mesh quality assessment (TMQA) is critical for various 3D mesh applications. However, existing TMQA methods often struggle to provide accurate and robust evaluations. Motivated by the effectiveness of fields in representing both 3D geometry and color information, we propose a novel point-based TMQA method called field mesh quality metric (FMQM). FMQM utilizes signed distance fields and a newly proposed color field named nearest surface point color field to realize effective mesh feature description. Four features related to visual perception are extracted from the geometry and color fields: geometry similarity, geometry gradient similarity, space color distribution similarity, and space color gradient similarity. Experimental results on three benchmark datasets demonstrate that FMQM outperforms state-of-the-art (SOTA) TMQA metrics. Furthermore, FMQM exhibits low computational complexity, making it a practical and efficient solution for real-world applications in 3D graphics and visualization. Our code is publicly available at: https://github.com/yyyykf/FMQM.
中文: 提出的场网格质量度量(FMQM)利用符号距离场和颜色场提取四种视觉感知特征,在纹理网格质量评估中优于现有最优方法,兼具高精度和低计算复杂度。
English: The proposed Field Mesh Quality Metric (FMQM) leverages signed distance and color fields to extract four visual perception features, outperforming state-of-the-art methods in textured mesh quality assessment with high accuracy and low computational complexity.

Authors:Ian Holmes, Min Chi
Title: Attention-Based Reward Shaping for Sparse and Delayed Rewards
Abstract:
Sparse and delayed reward functions pose a significant obstacle for real-world Reinforcement Learning (RL) applications. In this work, we propose Attention-based REward Shaping (ARES), a general and robust algorithm which uses a transformer's attention mechanism to generate shaped rewards and create a dense reward function for any environment. ARES requires a set of episodes and their final returns as input. It can be trained entirely offline and is able to generate meaningful shaped rewards even when using small datasets or episodes produced by agents taking random actions. ARES is compatible with any RL algorithm and can handle any level of reward sparsity. In our experiments, we focus on the most challenging case where rewards are fully delayed until the end of each episode. We evaluate ARES across a diverse range of environments, widely used RL algorithms, and baseline methods to assess the effectiveness of the shaped rewards it produces. Our results show that ARES can significantly improve learning in delayed reward settings, enabling RL agents to train in scenarios that would otherwise require impractical amounts of data or even be unlearnable. To our knowledge, ARES is the first approach that works fully offline, remains robust to extreme reward delays and low-quality data, and is not limited to goal-based tasks.
Chinese: ARES是一种基于注意力机制的新型离线强化学习算法,通过转换器的注意力机制从稀疏或延迟奖励中生成密集奖励函数,能在各种环境中显著提升学习效率且无需在线交互。
English: ARES is a novel offline reinforcement learning algorithm that uses transformer attention mechanisms to create dense reward functions from sparse or delayed rewards, significantly improving learning efficiency across diverse environments without requiring online interaction.

Authors:Weiqin Wang, Yile Wang, Hui Huang
Title: Ranked Voting based Self-Consistency of Large Language Models
Abstract:
Majority voting is considered an effective method to enhance chain-of-thought reasoning, as it selects the answer with the highest "self-consistency" among different reasoning paths (Wang et al., 2023). However, previous chain-of-thought reasoning methods typically generate only a single answer in each trial, thereby ignoring the possibility of other potential answers. As a result, these alternative answers are often overlooked in subsequent voting processes. In this work, we propose to generate ranked answers in each reasoning process and conduct ranked voting among multiple ranked answers from different responses, thereby making the overall self-consistency more reliable. Specifically, we use three ranked voting methods: Instant-runoff voting, Borda count voting, and mean reciprocal rank voting. We validate our methods on six datasets, including three multiple-choice and three open-ended question-answering tasks, using both advanced open-source and closed-source large language models. Extensive experimental results indicate that our proposed method outperforms the baselines, showcasing the potential of leveraging the information of ranked answers and using ranked voting to improve reasoning performance. The code is available at https://github.com/szu-tera/RankedVotingSC.
中文: 本研究提出在每次推理过程中生成排序答案并采用排序投票方法,有效提升了思维链推理的可靠性,在多个数据集上的实验结果表明该方法优于现有基准。
English: The proposed method enhances chain-of-thought reasoning by generating ranked answers in each trial and applying ranked voting techniques, which significantly improves reasoning reliability and outperforms existing baselines across multiple datasets.
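The three voting rules named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the Borda scoring of `n - i` points per position, and the alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def borda_count(rankings):
    """Each ranking lists answers best-first; position i of n earns n - i points."""
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for i, answer in enumerate(ranking):
            scores[answer] += n - i
    # Sort first so that ties are broken alphabetically (an assumption).
    return max(sorted(scores), key=lambda a: scores[a])


def mrr_vote(rankings):
    """Mean-reciprocal-rank voting: position i (0-based) earns 1 / (i + 1)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for i, answer in enumerate(ranking):
            scores[answer] += 1.0 / (i + 1)
    return max(sorted(scores), key=lambda a: scores[a])


def instant_runoff(rankings):
    """Repeatedly drop the answer with the fewest first-place votes
    until one answer holds a strict majority of the remaining ballots."""
    ballots = [list(r) for r in rankings if r]
    if not ballots:
        raise ValueError("need at least one non-empty ranking")
    candidates = sorted({a for b in ballots for a in b})
    while True:
        firsts = Counter({c: 0 for c in candidates})
        firsts.update(b[0] for b in ballots)
        leader = max(candidates, key=lambda c: firsts[c])
        if 2 * firsts[leader] > len(ballots) or len(candidates) == 1:
            return leader
        loser = min(candidates, key=lambda c: firsts[c])
        candidates.remove(loser)
        ballots = [[a for a in b if a != loser] for b in ballots]
        ballots = [b for b in ballots if b]


# Five hypothetical responses, each ranking three candidate answers.
rankings = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "C", "A"],
    ["C", "B", "A"],
    ["B", "A", "C"],
]
print(borda_count(rankings), mrr_vote(rankings), instant_runoff(rankings))
```

In this toy input, plain majority voting over the top answers ties between A and B (two votes each), while all three ranked rules select B because they also use the lower-ranked answers.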

Authors:Manyu Li, Ruian He, Zixian Zhang, Weimin Tan, Bo Yan
Title: Unifying Segment Anything in Microscopy with Multimodal Large Language Model
Abstract:
Accurate segmentation of regions of interest in biomedical images holds substantial value in image analysis. Although several foundation models for biomedical segmentation have currently achieved excellent performance on certain datasets, they typically demonstrate sub-optimal performance on unseen domain data. We attribute this deficiency to a lack of vision-language knowledge before segmentation. Multimodal Large Language Models (MLLMs) bring outstanding understanding and reasoning capabilities to multimodal tasks, which inspires us to leverage MLLMs to inject Vision-Language Knowledge (VLK), thereby enabling vision models to demonstrate superior generalization capabilities on cross-domain datasets. In this paper, we propose using MLLMs to guide SAM in learning microscopy cross-domain data, unifying Segment Anything in Microscopy, named uLLSAM. Specifically, we propose the Vision-Language Semantic Alignment (VLSA) module, which injects VLK into Segment Anything Model (SAM). We find that after SAM receives global VLK prompts, its performance improves significantly, but there are deficiencies in boundary contour perception. Therefore, we further propose Semantic Boundary Regularization (SBR) to prompt SAM. Our method achieves performance improvements of 7.71% in Dice and 12.10% in SA across 9 in-domain microscopy datasets, achieving state-of-the-art performance. Our method also demonstrates improvements of 6.79% in Dice and 10.08% in SA across 10 out-of-domain datasets, exhibiting strong generalization capabilities. Code is available at https://github.com/ieellee/uLLSAM.
Chinese: 本文提出uLLSAM方法,通过将多模态大语言模型的视觉语言知识融入Segment Anything模型,显著提升了其在生物医学图像分割中的准确性和跨领域泛化能力。
English: This paper introduces uLLSAM, a method that enhances the Segment Anything Model (SAM) by integrating vision-language knowledge from Multimodal Large Language Models, significantly improving its accuracy and generalization for biomedical image segmentation across both in-domain and out-of-domain datasets.

Authors:NingFeng Que, Xiaofei Wang, Jingjing Chen, Yixuan Jiang, Chao Li
Title: Adaptive Spatial Transcriptomics Interpolation via Cross-modal Cross-slice Modeling
Abstract:
Spatial transcriptomics (ST) is a promising technique that characterizes the spatial gene profiling patterns within the tissue context. Comprehensive ST analysis depends on consecutive slices for 3D spatial insights, whereas the missing intermediate tissue sections and high costs limit the practical feasibility of generating multi-slice ST. In this paper, we propose C2-STi, the first attempt at interpolating missing ST slices at arbitrary intermediate positions between adjacent ST slices. Although intuitive, effective ST interpolation presents significant challenges, including 1) limited continuity across heterogeneous tissue sections, 2) complex intrinsic correlation across genes, and 3) intricate cellular structures and biological semantics within each tissue section. To mitigate these challenges, in C2-STi, we design 1) a distance-aware local structural modulation module to adaptively capture cross-slice deformations and enhance positional correlations between ST slices, 2) a pyramid gene co-expression correlation module to capture multi-scale biological associations among genes, and 3) a cross-modal alignment module that integrates the ST-paired hematoxylin and eosin (H&E)-stained images to filter and align the essential cellular features across ST and H&E images. Extensive experiments on the public dataset demonstrate our superiority over state-of-the-art approaches on both single-slice and multi-slice ST interpolation. Codes are available at https://github.com/XiaofeiWang2018/C2-STi.
中文: 本文提出C2-STi方法,通过设计跨切片变形感知、基因共表达关联和多模态对齐模块,有效解决空间转录组切片插值中的组织异质性等难题,在实验中展现出优越性能。
English: This paper introduces C2-STi, a novel method for interpolating missing spatial transcriptomics slices by addressing challenges such as tissue heterogeneity and gene correlations through specialized modules, demonstrating superior performance in experiments.

Authors:Sayed Mehedi Azim, Brian Corbett, Iman Dehzangi
Title: ROIsGAN: A Region Guided Generative Adversarial Framework for Murine Hippocampal Subregion Segmentation
Abstract:
The hippocampus, a critical brain structure involved in memory processing and various neurodegenerative and psychiatric disorders, comprises three key subregions: the dentate gyrus (DG), Cornu Ammonis 1 (CA1), and Cornu Ammonis 3 (CA3). Accurate segmentation of these subregions from histological tissue images is essential for advancing our understanding of disease mechanisms, developmental dynamics, and therapeutic interventions. However, no existing methods address the automated segmentation of hippocampal subregions from tissue images, particularly from immunohistochemistry (IHC) images. To bridge this gap, we introduce a novel set of four comprehensive murine hippocampal IHC datasets featuring distinct staining modalities: cFos, NeuN, and multiplexed stains combining cFos, NeuN, and either ΔFosB or GAD67, capturing structural, neuronal activity, and plasticity associated information. Additionally, we propose ROIsGAN, a region-guided U-Net-based generative adversarial network tailored for hippocampal subregion segmentation. By leveraging adversarial learning, ROIsGAN enhances boundary delineation and structural detail refinement through a novel region-guided discriminator loss combining Dice and binary cross-entropy loss. Evaluated across DG, CA1, and CA3 subregions, ROIsGAN consistently outperforms conventional segmentation models, achieving performance gains ranging from 1-10% in Dice score and up to 11% in Intersection over Union (IoU), particularly under challenging staining conditions. Our work establishes foundational datasets and methods for automated hippocampal segmentation, enabling scalable, high-precision analysis of tissue images in neuroscience research. Our generated datasets, proposed model as a standalone tool, and its corresponding source code are publicly available at: https://github.com/MehediAzim/ROIsGAN
中文: 本研究提出了ROIsGAN这一新型生成对抗网络及四个海马体免疫组化数据集,实现了海马体亚区的自动分割,相比现有方法性能显著提升,为神经科学研究提供了基础资源。
English: This study introduces ROIsGAN, a novel generative adversarial network, along with four comprehensive hippocampal immunohistochemistry datasets to automate the segmentation of hippocampal subregions, achieving significant performance improvements over existing methods and providing foundational resources for neuroscience research.
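The abstract mentions a loss that combines Dice and binary cross-entropy terms. A generic version of such a combination can be sketched as below; it is not the paper's region-guided discriminator loss. The equal weighting (`bce_weight=0.5`), the soft Dice form, and the function name are assumptions for illustration.

```python
import math


def dice_bce_loss(pred, target, bce_weight=0.5, eps=1e-7):
    """Weighted sum of binary cross-entropy and (1 - soft Dice).

    pred:   flat list of predicted foreground probabilities in [0, 1]
    target: flat list of ground-truth labels (0 or 1), same length
    """
    # Clamp probabilities so that log() never sees exactly 0 or 1.
    p = [min(max(x, eps), 1.0 - eps) for x in pred]

    bce = -sum(
        t * math.log(x) + (1 - t) * math.log(1.0 - x) for x, t in zip(p, target)
    ) / len(p)

    intersection = sum(x * t for x, t in zip(p, target))
    dice = (2.0 * intersection + eps) / (sum(p) + sum(target) + eps)

    return bce_weight * bce + (1.0 - bce_weight) * (1.0 - dice)


mask = [1, 0, 0, 1]
print(dice_bce_loss([1.0, 0.0, 0.0, 1.0], mask))  # near zero for a perfect match
print(dice_bce_loss([0.5, 0.5, 0.5, 0.5], mask))  # larger for an uninformative guess
```

The cross-entropy term scores each pixel independently, while the Dice term scores region overlap, which is one common motivation for pairing them when the foreground region is small.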

Authors:Jianyang Xie, Yitian Zhao, Yanda Meng, He Zhao, Anh Nguyen, Yalin Zheng
Title: Are Spatial-Temporal Graph Convolution Networks for Human Action Recognition Over-Parameterized?
Abstract:
Spatial-temporal graph convolutional networks (ST-GCNs) showcase impressive performance in skeleton-based human action recognition (HAR). However, despite the development of numerous models, their recognition performance does not differ significantly after aligning the input settings. With this observation, we hypothesize that ST-GCNs are over-parameterized for HAR, a conjecture subsequently confirmed through experiments employing the lottery ticket hypothesis. Additionally, a novel sparse ST-GCNs generator is proposed, which trains a sparse architecture from a randomly initialized dense network while maintaining comparable performance levels to the dense counterparts. Moreover, we generate multi-level sparsity ST-GCNs by integrating sparse structures at various sparsity levels and demonstrate that the assembled model yields a significant enhancement in HAR performance. Thorough experiments on four datasets, including NTU-RGB+D 60(120), Kinetics-400, and FineGYM, demonstrate that the proposed sparse ST-GCNs can achieve comparable performance to their dense counterparts. Even with 95% fewer parameters, the sparse ST-GCNs exhibit a degradation of <1% in top-1 accuracy. Meanwhile, the multi-level sparsity ST-GCNs, which require only 66% of the parameters of the dense ST-GCNs, demonstrate an improvement of >1% in top-1 accuracy. The code is available at https://github.com/davelailai/Sparse-ST-GCN.
中文: 稀疏时空图卷积网络在显著减少参数的同时保持了与密集网络相当的性能,而多级稀疏模型进一步提升了精度并降低了参数量。
English: Sparse ST-GCNs achieve comparable performance to dense networks with significantly fewer parameters, and multi-level sparsity models further enhance accuracy while reducing parameter counts.
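Lottery-ticket experiments are commonly run with magnitude pruning, where the smallest-magnitude weights are masked out. The sketch below shows one such pruning step on a flat weight list; it is a stand-in for illustration, not the paper's sparse ST-GCN generator, and the function name and example weights are hypothetical.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of
    weights with the smallest absolute value."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


weights = [0.5, -0.1, 0.9, 0.05, -0.7, 0.2, 0.01, -0.3, 0.6, 0.4]
mask = magnitude_mask(weights, sparsity=0.5)
pruned = [w * m for w, m in zip(weights, mask)]
print(mask)
print(pruned)
```

At the 95% reduction reported in the abstract, the same call with `sparsity=0.95` would keep roughly one weight in twenty.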

Authors:Patara Trirat, Jae-Gil Lee
Title: MONAQ: Multi-Objective Neural Architecture Querying for Time-Series Analysis on Resource-Constrained Devices
Abstract:
The growing use of smartphones and IoT devices necessitates efficient time-series analysis on resource-constrained hardware, which is critical for sensing applications such as human activity recognition and air quality prediction. Recent efforts in hardware-aware neural architecture search (NAS) automate architecture discovery for specific platforms; however, none focus on general time-series analysis with edge deployment. Leveraging the problem-solving and reasoning capabilities of large language models (LLM), we propose MONAQ, a novel framework that reformulates NAS into Multi-Objective Neural Architecture Querying tasks. MONAQ is equipped with multimodal query generation for processing multimodal time-series inputs and hardware constraints, alongside an LLM agent-based multi-objective search to achieve deployment-ready models via code generation. By integrating numerical data, time-series images, and textual descriptions, MONAQ improves an LLM's understanding of time-series data. Experiments on fifteen datasets demonstrate that MONAQ-discovered models outperform both handcrafted models and NAS baselines while being more efficient.
中文:MONAQ是一种创新框架,它利用大型语言模型将神经架构搜索重构为多目标查询任务,通过处理多模态时序数据和硬件约束,自动生成适用于边缘设备的高效模型,在实验中展现出优越性能。
English: MONAQ is a novel framework that transforms neural architecture search into multi-objective querying tasks using large language models, enabling automated discovery of efficient time-series analysis models for edge deployment while outperforming existing methods.

Authors:Xingye Cui, Junhai Luo, Jiakun Deng, Kexuan Li, Xiangyu Qiu, Zhenming Peng
Title: ARFC-WAHNet: Adaptive Receptive Field Convolution and Wavelet-Attentive Hierarchical Network for Infrared Small Target Detection
Abstract:
Infrared small target detection (ISTD) is critical in both civilian and military applications. However, the limited texture and structural information in infrared images makes accurate detection particularly challenging. Although recent deep learning-based methods have improved performance, their use of conventional convolution kernels limits adaptability to complex scenes and diverse targets. Moreover, pooling operations often cause feature loss and insufficient exploitation of image information. To address these issues, we propose an adaptive receptive field convolution and wavelet-attentive hierarchical network for infrared small target detection (ARFC-WAHNet). This network incorporates a multi-receptive field feature interaction convolution (MRFFIConv) module to adaptively extract discriminative features by integrating multiple convolutional branches with a gated unit. A wavelet frequency enhancement downsampling (WFED) module leverages Haar wavelet transform and frequency-domain reconstruction to enhance target features and suppress background noise. Additionally, we introduce a high-low feature fusion (HLFF) module for integrating low-level details with high-level semantics, and a global median enhancement attention (GMEA) module to improve feature diversity and expressiveness via global attention. Experiments on public datasets SIRST, NUDT-SIRST, and IRSTD-1k demonstrate that ARFC-WAHNet outperforms recent state-of-the-art methods in both detection accuracy and robustness, particularly under complex backgrounds. The code is available at https://github.com/Leaf2001/ARFC-WAHNet.
中文摘要:提出的ARFC-WAHNet通过自适应感受野卷积和小波注意力模块,有效解决了红外小目标检测中的特征提取不足问题,在复杂背景下相比现有方法展现出更优的检测精度和鲁棒性。
English Summary: The proposed ARFC-WAHNet introduces adaptive receptive field convolution and wavelet-attentive modules to overcome limitations in infrared small target detection, achieving superior accuracy and robustness across complex scenarios compared to existing methods.
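To illustrate why a Haar wavelet transform can replace pooling for downsampling, the sketch below applies one level of the 2D orthonormal Haar transform to a small image. It halves the resolution but keeps all information in four sub-bands. This covers only the decomposition step, not the paper's WFED module or its frequency-domain reconstruction; the function and sub-band names are assumptions.

```python
def haar_downsample(img):
    """One level of the 2D orthonormal Haar transform on 2x2 blocks.

    img: list of rows (even height and width).
    Returns four half-resolution maps: approximation, column difference,
    row difference, and diagonal difference.
    """
    height, width = len(img), len(img[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")

    approx, col_diff, row_diff, diag = [], [], [], []
    for i in range(0, height, 2):
        out_a, out_c, out_r, out_d = [], [], [], []
        for j in range(0, width, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            out_a.append((a + b + c + d) / 2)  # low-pass in both directions
            out_c.append((a - b + c - d) / 2)  # left column minus right column
            out_r.append((a + b - c - d) / 2)  # top row minus bottom row
            out_d.append((a - b - c + d) / 2)  # diagonal detail
        approx.append(out_a)
        col_diff.append(out_c)
        row_diff.append(out_r)
        diag.append(out_d)
    return approx, col_diff, row_diff, diag


approx, col_diff, row_diff, diag = haar_downsample([[1, 2], [3, 4]])
print(approx, col_diff, row_diff, diag)
```

Unlike max or average pooling, the four sub-bands together preserve the signal energy of each block, so a small bright target is not averaged away; it shows up in the detail bands.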

Authors:Qifan Fu, Xu Chen, Muhammad Asad, Shanxin Yuan, Changjae Oh, Gregory Slabaugh
Title: Robust Photo-Realistic Hand Gesture Generation: from Single View to Multiple View
Abstract:
High-fidelity hand gesture generation represents a significant challenge in human-centric generation tasks. Existing methods typically employ a single-view mesh-rendered image prior to enhance gesture generation quality. However, the spatial complexity of hand gestures and the inherent limitations of single-view rendering make it difficult to capture complete gesture information, particularly when fingers are occluded. The fundamental contradiction lies in the loss of 3D topological relationships through 2D projection and the incomplete spatial coverage inherent to single-view representations. Diverging from single-view prior approaches, we propose a multi-view prior framework, named Multi-Modal UNet-based Feature Encoder (MUFEN), to guide diffusion models in learning comprehensive 3D hand information. Specifically, we extend conventional front-view rendering to include rear, left, right, top, and bottom perspectives, selecting the most information-rich view combination as training priors to address occlusion. This multi-view prior with a dedicated dual stream encoder significantly improves the model's understanding of complete hand features. Furthermore, we design a bounding box feature fusion module, which can fuse the gesture localization features and multi-modal features to enhance the location-awareness of the MUFEN features to the gesture-related features. Experiments demonstrate that our method achieves state-of-the-art performance in both quantitative metrics and qualitative evaluations. The source code is available at https://github.com/fuqifan/MUFEN.
中文摘要:提出的MUFEN框架采用多视角先验方法,通过双流编码器和特征融合模块克服单视角在手势生成中的局限性,利用多角度信息捕捉完整三维手势特征,实现了最先进的生成效果。
English Summary: The proposed MUFEN framework introduces a multi-view prior approach using a dual-stream encoder and feature fusion module to overcome single-view limitations in hand gesture generation, achieving state-of-the-art performance by capturing comprehensive 3D hand information.

Authors:Jen-tse Huang, Kaiser Sun, Wenxuan Wang, Mark Dredze
Title: Language Models Do Not Have Human-Like Working Memory
Abstract:
While Large Language Models (LLMs) exhibit remarkable reasoning abilities, we demonstrate that they lack a fundamental aspect of human cognition: working memory. Human working memory is an active cognitive system that enables not only the temporary storage of information but also its processing and utilization, enabling coherent reasoning and decision-making. Without working memory, individuals may produce unrealistic responses, exhibit self-contradictions, and struggle with tasks that require mental reasoning. Existing evaluations using N-back or context-dependent tasks fall short as they allow LLMs to exploit external context rather than retaining the reasoning process in the latent space. We introduce three novel tasks: (1) Number Guessing, (2) Yes-No Deduction, and (3) Math Magic, designed to isolate internal representation from external context. Across seventeen frontier models spanning four major model families, we consistently observe irrational or contradictory behaviors, indicating LLMs' inability to retain and manipulate latent information. Our work establishes a new benchmark for evaluating working memory in LLMs and highlights this limitation as a key bottleneck for advancing reliable reasoning systems. Code and prompts for the experiments are available at https://github.com/penguinnnnn/LLM-Working-Memory.
中文: 该研究揭示大语言模型缺乏人类工作记忆能力,导致推理任务中出现非理性行为,并提出了新的评估基准以检验这一关键认知缺陷。
English: The study reveals that large language models lack human-like working memory, leading to irrational behaviors in reasoning tasks, and introduces new benchmarks to evaluate this critical cognitive limitation.

Authors:Zeying Zhu, Jonathan Chamberlain, Kenny Wu, David Starobinski, Zaoxing Liu
Title: Approximation-First Timeseries Monitoring Query At Scale
Abstract:
Timeseries monitoring systems such as Prometheus play a crucial role in gaining observability of the underlying system components. These systems collect timeseries metrics from various system components and perform monitoring queries over periodic window-based aggregations (i.e., rule queries). However, despite wide adoption, the operational costs and query latency of rule queries remain high. In this paper, we identify major bottlenecks associated with repeated data scans and query computations concerning window overlaps in rule queries, and present PromSketch, an approximation-first query framework as intermediate caches for monitoring systems. It enables low operational costs and query latency, by combining approximate window-based query frameworks and sketch-based precomputation. PromSketch is implemented as a standalone module that can be integrated into Prometheus and VictoriaMetrics, covering 70% of Prometheus' aggregation over time queries. Our evaluation shows that PromSketch achieves up to a two orders of magnitude reduction in query latency over Prometheus and VictoriaMetrics, while lowering operational dollar costs of query processing by two orders of magnitude compared to Prometheus and by at least 4x compared to VictoriaMetrics with at most 5% average errors across statistics. The source code has been made available at https://github.com/Froot-NetSys/promsketch.
Chinese: PromSketch作为一种近似优先的查询框架,通过解决重复数据扫描和窗口重叠计算的关键瓶颈,大幅降低了时序监控系统的运营成本和查询延迟,实现了高达两个数量级的性能提升且误差极小。
English: PromSketch is an approximation-first query framework that significantly reduces operational costs and query latency in time-series monitoring systems by addressing bottlenecks from repeated data scans and overlapping window computations, achieving up to two orders of magnitude improvement with minimal error.

Authors:Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li
Title: MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
Abstract:
Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at https://github.com/mathllm/MathCoder.
中文: 当前多模态模型因缺乏对数学图形的精细理解而在推理上受限,为此我们利用代码监督和大规模数据集开发了MathCoder-VL模型,在几何问题求解上超越了GPT-4o等模型,实现了开源模型的最优性能。
English: Current multimodal models struggle with mathematical reasoning due to a lack of detailed figure understanding, so we developed MathCoder-VL using code supervision and a large-scale dataset to achieve state-of-the-art performance, surpassing models like GPT-4o in geometry problem-solving.

Authors:Zhiyuan Hu, Yibo Wang, Hanze Dong, Yuhui Xu, Amrita Saha, Caiming Xiong, Bryan Hooi, Junnan Li
Title: Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
Abstract:
Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional gain in performance ceiling for both 7B and 32B models across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment
Chinese: 该研究提出了一种通过三阶段流程将大型推理模型与演绎、归纳和溯因能力明确对齐的方法,使性能提升超过10%,并为跨数学、编程和科学领域的推理提供了可扩展且可靠的基础。
English: The study introduces a method to explicitly align large reasoning models with deduction, induction, and abduction through a three-stage pipeline, enhancing performance by over 10% and providing a scalable, reliable foundation for reasoning across various domains.

Authors:Anastasios Gerontopoulos, Spyros Gidaris, Nikos Komodakis
Title: Multi-Token Prediction Needs Registers
Abstract:
Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
中文: MuToR是一种创新的多令牌预测方法,通过在输入序列中插入可学习的寄存器令牌来预测未来目标,具有参数增量极少、无需改变模型架构即可兼容现有预训练模型的特点,在语言和视觉任务的微调与预训练中均表现出优越性能。
English: MuToR is a novel multi-token prediction method that integrates learnable register tokens into input sequences to predict future targets, offering minimal parameter overhead, architectural compatibility with existing models, and enhanced performance across fine-tuning and pretraining scenarios in both language and vision tasks.
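Illustrative sketch (not taken from the paper): the abstract says register tokens are interleaved into the input sequence and each predicts a future target. One way to picture the data layout is below. The `<reg>` placeholder, the one-register-per-token placement, and the fixed `offset` are assumptions; the actual method may place registers differently and also has to handle attention masking and position handling, which are omitted here.

```python
REGISTER = "<reg>"  # hypothetical placeholder for a learnable register token


def interleave_registers(tokens, offset=2):
    """Insert a register after each token and build per-position targets.

    Regular tokens keep the usual next-token target. The register placed
    after position i is asked to predict the token `offset` steps ahead
    of i. Positions without a valid target get None.
    """
    inputs, targets = [], []
    for i, tok in enumerate(tokens):
        inputs.append(tok)
        targets.append(tokens[i + 1] if i + 1 < len(tokens) else None)
        inputs.append(REGISTER)
        targets.append(tokens[i + offset] if i + offset < len(tokens) else None)
    return inputs, targets


inputs, targets = interleave_registers(["a", "b", "c", "d"], offset=2)
print(inputs)   # ['a', '<reg>', 'b', '<reg>', 'c', '<reg>', 'd', '<reg>']
print(targets)  # ['b', 'c', 'c', 'd', 'd', None, None, None]
```

Because the registers only add extra input positions and targets, the base model's architecture is untouched, which is the compatibility property the abstract emphasizes.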

Authors:Jiaming Liang, Lihuan Dai, Xiaoqi Sheng, Xiangguang Chen, Chun Yao, Guihua Tao, Qibin Leng, Hongmin Cai, Xi Zhong
Title: HWA-UNETR: Hierarchical Window Aggregate UNETR for 3D Multimodal Gastric Lesion Segmentation
Abstract:
Multimodal medical image segmentation faces significant challenges in the context of gastric cancer lesion analysis. This clinical context is defined by the scarcity of independent multimodal datasets and the imperative to amalgamate inherently misaligned modalities. As a result, algorithms are constrained to train on approximate data and depend on application migration, leading to substantial resource expenditure and a potential decline in analysis accuracy. To address these challenges, we make two major contributions: First, we publicly release the GCM 2025 dataset, which serves as the first large-scale, open-source collection of gastric cancer multimodal MRI scans, featuring professionally annotated FS-T2W, CE-T1W, and ADC images from 500 patients. Second, we introduce HWA-UNETR, a novel 3D segmentation framework that employs an original HWA block with learnable window aggregation layers to establish dynamic feature correspondences between different modalities' anatomical structures, and leverages the innovative tri-orientated fusion mamba mechanism for context modeling and capturing long-range spatial dependencies. Extensive experiments on our GCM 2025 dataset and the public BraTS 2021 dataset validate the performance of our framework, demonstrating that the new approach surpasses existing methods by up to 1.68% in the Dice score while maintaining solid robustness. The dataset and code are public via https://github.com/JeMing-creater/HWA-UNETR.
Chinese: 本研究发布了首个大规模开源胃癌多模态MRI数据集GCM 2025,并提出了HWA-UNETR新型三维分割框架,该框架在Dice指标上以最高1.68%的优势超越现有方法,展现出卓越性能。
English: This study introduces the GCM 2025 dataset, the first large-scale open-source gastric cancer multimodal MRI collection, and proposes HWA-UNETR, a novel 3D segmentation framework that achieves superior performance with up to 1.68% higher Dice scores than existing methods.
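Illustrative sketch (not taken from the paper): the reported gain is measured with the Dice score, which for binary masks P and T is 2|P ∩ T| / (|P| + |T|). The minimal version below works on flattened 0/1 masks; returning 1.0 when both masks are empty is a convention assumed here, not something the abstract states.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(dice_score(pred, target))  # 0.5
```

A 3D volume can be scored the same way after flattening its voxels into one list.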

Authors:Dechen Gao, Hang Wang, Hanchu Zhou, Nejib Ammar, Shatadal Mishra, Ahmadreza Moradipari, Iman Soltani, Junshan Zhang
Title: IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning
Abstract:
Imitation learning (IL) and reinforcement learning (RL) each offer distinct advantages for robotics policy learning: IL provides stable learning from demonstrations, and RL promotes generalization through exploration. While existing robot learning approaches using IL-based pre-training followed by RL-based fine-tuning are promising, this two-step learning paradigm often suffers from instability and poor sample efficiency during the RL fine-tuning phase. In this work, we introduce IN-RIL, INterleaved Reinforcement learning and Imitation Learning, for policy fine-tuning, which periodically injects IL updates after multiple RL updates and hence can benefit from the stability of IL and the guidance of expert data for more efficient exploration throughout the entire fine-tuning process. Since IL and RL involve different optimization objectives, we develop gradient separation mechanisms to prevent destructive interference during IN-RIL fine-tuning, by separating possibly conflicting gradient updates in orthogonal subspaces. Furthermore, we conduct rigorous analysis, and our findings shed light on why interleaving IL with RL stabilizes learning and improves sample-efficiency. Extensive experiments on 14 robot manipulation and locomotion tasks across 3 benchmarks, including FurnitureBench, OpenAI Gym, and Robomimic, demonstrate that IN-RIL can significantly improve sample efficiency and mitigate performance collapse during online finetuning in both long- and short-horizon tasks with either sparse or dense rewards. IN-RIL, as a general plug-in compatible with various state-of-the-art RL algorithms, can significantly improve RL fine-tuning, e.g., from 12% to 88% with 6.3x improvement in the success rate on Robomimic Transport. Project page: https://github.com/ucd-dare/IN-RIL.
中文: 本文提出IN-RIL方法,通过在微调阶段交替使用模仿学习和强化学习,周期性注入模仿学习更新并分离冲突梯度,从而显著提升学习稳定性和样本效率。
English: This paper introduces IN-RIL, an interleaved learning method that combines imitation and reinforcement learning during fine-tuning to enhance stability and sample efficiency by periodically applying IL updates and separating conflicting gradients.
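Illustrative sketch (not taken from the paper): the abstract describes separating possibly conflicting IL and RL gradient updates into orthogonal subspaces without giving the mechanism. One generic way to make two updates non-interfering is to project one gradient onto the orthogonal complement of the other, as below. This projection is a swapped-in stand-in, and the helper names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(g, reference):
    """Remove from g its component along reference.

    The returned vector is orthogonal to `reference`, so a step along it
    does not change the first-order progress made along `reference`.
    """
    denom = dot(reference, reference)
    if denom == 0.0:
        return list(g)
    scale = dot(g, reference) / denom
    return [gi - scale * ri for gi, ri in zip(g, reference)]


rl_grad = [1.0, 0.0]   # toy RL gradient
il_grad = [-1.0, 2.0]  # toy IL gradient that partly opposes it

il_safe = project_out(il_grad, rl_grad)
print(il_safe)                 # [0.0, 2.0]
print(dot(il_safe, rl_grad))   # 0.0
```

In this toy case the conflicting component of the IL gradient (the -1.0 along the first axis) is removed while its independent component is kept.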

Authors:Andrei Arhire, Radu Timofte
Title: Learned Lightweight Smartphone ISP with Unpaired Data
Abstract:
The Image Signal Processor (ISP) is a fundamental component in modern smartphone cameras responsible for conversion of RAW sensor image data to RGB images with a strong focus on perceptual quality. Recent work highlights the potential of deep learning approaches and their ability to capture details with a quality increasingly close to that of professional cameras. A difficult and costly step when developing a learned ISP is the acquisition of pixel-wise aligned paired data that maps the raw images captured by a smartphone camera sensor to high-quality reference images. In this work, we address this challenge by proposing a novel training method for a learnable ISP that eliminates the need for direct correspondences between raw images and ground-truth data with matching content. Our unpaired approach employs a multi-term loss function guided by adversarial training with multiple discriminators processing feature maps from pre-trained networks to maintain content structure while learning color and texture characteristics from the target RGB dataset. Using lightweight neural network architectures suitable for mobile devices as backbones, we evaluated our method on the Zurich RAW to RGB and Fujifilm UltraISP datasets. Compared to paired training methods, our unpaired learning strategy shows strong potential and achieves high fidelity across multiple evaluation metrics. The code and pre-trained models are available at https://github.com/AndreiiArhire/Learned-Lightweight-Smartphone-ISP-with-Unpaired-Data.
中文: 本研究提出一种无需配对数据的可学习图像信号处理器训练方法,通过多判别器对抗训练保持内容结构并学习目标RGB数据集的色彩纹理,在移动设备上实现了优异的性能。
English: This study introduces an unpaired training method for a learnable Image Signal Processor (ISP) that eliminates the need for aligned data by using adversarial training with multiple discriminators to maintain content structure while adapting color and texture from target RGB datasets, achieving competitive performance on mobile devices.

Authors:Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou
Title: Hierarchical Document Refinement for Long-context Retrieval-augmented Generation
Abstract:
Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise result in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. LongRefiner employs dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Experiments on seven QA datasets demonstrate that LongRefiner achieves competitive performance in various scenarios while incurring 10x lower computational cost and latency than the best baseline. Further analysis validates that LongRefiner is scalable, efficient, and effective, providing practical insights for real-world long-text RAG applications. Our code is available at https://github.com/ignorejjj/LongRefiner.
中文: LongRefiner作为高效的即插即用优化器,通过双级查询分析和自适应优化处理长文档中的冗余信息与噪声,在七个问答数据集上以十倍降低的计算成本实现了优异性能。
English: LongRefiner is an efficient plug-and-play refiner that tackles redundant information and noise in long-context RAG applications through dual-level query analysis and adaptive refinement, achieving competitive performance with 10x fewer computational costs across seven QA datasets.

Authors:Wenhao Ding, Choon Hwai Yap, Kangjun Ji, Simão Castro
Title: Two-Stage Generative Model for Intracranial Aneurysm Meshes with Morphological Marker Conditioning
Abstract:
A generative model for the mesh geometry of intracranial aneurysms (IA) is crucial for training networks to predict blood flow forces in real time, which is a key factor affecting disease progression. This need arises from the absence of large IA image datasets. Existing shape generation methods struggle to capture realistic IA features and ignore the relationship between IA pouches and parent vessels, limiting physiological realism, and their generation cannot be controlled to have specific morphological measurements. We propose AneuG, a two-stage Variational Autoencoder (VAE)-based IA mesh generator. In the first stage, AneuG generates low-dimensional Graph Harmonic Deformation (GHD) tokens to encode and reconstruct aneurysm pouch shapes, constrained to morphing energy statistics truths. GHD enables more accurate shape encoding than alternatives. In the second stage, AneuG generates parent vessels conditioned on GHD tokens, by generating the vascular centreline and propagating the cross-section. AneuG's IA shape generation can further be conditioned to have specific clinically relevant morphological measurements. This is useful for studies to understand shape variations represented by clinical measurements, and for flow simulation studies to understand effects of specific clinical shape parameters on fluid dynamics. Source code and implementation details are available at https://github.com/anonymousaneug/AneuG.
中文摘要:AneuG是一种基于变分自编码器的两阶段生成器,它通过图谐波变形标记编码动脉瘤囊形状并生成母血管,能生成具有生理真实性的颅内动脉瘤网格,并可控制临床相关的形态学测量参数。
English Summary: AneuG is a two-stage VAE-based generator that creates realistic intracranial aneurysm meshes by first encoding pouch shapes with Graph Harmonic Deformation tokens and then generating parent vessels, while enabling control over clinically relevant morphological measurements.

Authors:Kaivalya Rawal, Zihao Fu, Eoin Delaney, Chris Russell
Title: Evaluating Model Explanations without Ground Truth
Abstract:
There can be many competing and contradictory explanations for a single model prediction, making it difficult to select which one to use. Current explanation evaluation frameworks measure quality by comparing against ideal "ground-truth" explanations, or by verifying model sensitivity to important inputs. We outline the limitations of these approaches, and propose three desirable principles to ground the future development of explanation evaluation strategies for local feature importance explanations. We propose a ground-truth Agnostic eXplanation Evaluation framework (AXE) for evaluating and comparing model explanations that satisfies these principles. Unlike prior approaches, AXE does not require access to ideal ground-truth explanations for comparison, or rely on model sensitivity - providing an independent measure of explanation quality. We verify AXE by comparing with baselines, and show how it can be used to detect explanation fairwashing. Our code is available at https://github.com/KaiRawal/Evaluating-Model-Explanations-without-Ground-Truth.
Chinese Summary: 本文提出了AXE框架,无需依赖理想基准或模型敏感度即可评估模型解释,克服了现有方法的局限,并能检测解释中的公平性粉饰问题。
English Summary: The paper introduces AXE, a ground-truth agnostic framework for evaluating model explanations without relying on ideal references or model sensitivity, addressing limitations in current methods and enabling detection of explanation fairwashing.

Authors:Rui Melo, Claudia Mamede, Andre Catarino, Rui Abreu, Henrique Lopes Cardoso
Title: Are Sparse Autoencoders Useful for Java Function Bug Detection?
Abstract:
Software vulnerabilities such as buffer overflows and SQL injections are a major source of security breaches. Traditional methods for vulnerability detection remain essential but are limited by high false positive rates, scalability issues, and reliance on manual effort. These constraints have driven interest in AI-based approaches to automated vulnerability detection and secure code generation. While Large Language Models (LLMs) have opened new avenues for classification tasks, their complexity and opacity pose challenges for interpretability and deployment. Sparse Autoencoders (SAEs) offer a promising solution to this problem. We explore whether SAEs can serve as a lightweight, interpretable alternative for bug detection in Java functions. We evaluate the effectiveness of SAEs when applied to representations from GPT-2 Small and Gemma 2B, examining their capacity to highlight buggy behaviour without fine-tuning the underlying LLMs. We found that SAE-derived features enable bug detection with an F1 score of up to 89%, consistently outperforming fine-tuned transformer encoder baselines. Our work provides the first empirical evidence that SAEs can be used to detect software bugs directly from the internal representations of pretrained LLMs, without any fine-tuning or task-specific supervision. Code available at https://github.com/rufimelo99/SAE-Java-Bug-Detection
中文摘要:本研究表明稀疏自编码器无需微调即可利用预训练大语言模型的内部表征有效检测Java代码中的软件漏洞,最高达89%的F1分数,性能优于传统基线方法。
English Summary: This study demonstrates that Sparse Autoencoders can effectively detect software bugs in Java code using pre-trained LLMs' internal representations without fine-tuning, achieving up to 89% F1 score and outperforming traditional baselines.

Authors:Yile Wang, Zhanyu Shen, Hui Huang
Title: LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations
Abstract:
Semantic text representation is a fundamental task in the field of natural language processing. Existing text embeddings (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, as a classic sparse interpretable embedding, suffers from poor performance. Recently, Benara et al. (2024) proposed interpretable text embeddings using large language models, which form "0/1" embeddings based on responses to a series of questions. These interpretable text embeddings are typically high-dimensional (larger than 10,000). In this work, we propose Low-dimensional (lower than 500) Dense and Interpretable text embeddings with Relative representations (LDIR). The numerical values of its dimensions indicate semantic relatedness to different anchor texts through farthest point sampling, offering both semantic representation as well as a certain level of traceability and interpretability. We validate LDIR on multiple semantic textual similarity, retrieval, and clustering tasks. Extensive experimental results show that LDIR performs close to the black-box baseline models and outperforms the interpretable embedding baselines with far fewer dimensions. Code is available at https://github.com/szu-tera/LDIR.
Chinese: 本文提出LDIR,一种低维稠密且可解释的文本嵌入方法,在保持与黑盒模型相近性能的同时,以更少维度超越了现有可解释基线模型。
English: This paper introduces LDIR, a low-dimensional dense and interpretable text embedding method that achieves performance comparable to black-box models while outperforming existing interpretable baselines with significantly fewer dimensions.
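Illustrative sketch (not taken from the paper): the abstract describes dimensions that measure relatedness to anchor texts chosen by farthest point sampling. The toy version below picks anchors greedily by cosine distance and then represents a vector by its cosine similarity to each anchor. The 2-D vectors stand in for embeddings from some backbone encoder, and the function names, the starting index, and the use of cosine similarity are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity; assumes both vectors are nonzero."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def farthest_point_sampling(points, k):
    """Greedily pick k indices (k <= len(points)), each farthest from the chosen set."""
    chosen = [0]  # arbitrary starting point
    dist = [1.0 - cosine(p, points[0]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        # Track each point's distance to its nearest chosen anchor.
        dist = [min(d, 1.0 - cosine(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen


def relative_embedding(vec, anchors):
    """One dimension per anchor: similarity of vec to that anchor."""
    return [cosine(vec, a) for a in anchors]


corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]  # toy embeddings
anchor_ids = farthest_point_sampling(corpus, 2)
anchors = [corpus[i] for i in anchor_ids]
embedding = relative_embedding([1.0, 1.0], anchors)
print(anchor_ids)  # [0, 2]
```

Each output dimension can be read as "how related is this text to anchor j", which is the source of the traceability the abstract mentions; the number of anchors sets the embedding dimension.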

Authors:Shihao Zou, Qingfeng Li, Wei Ji, Jingjing Li, Yongkui Yang, Guoqi Li, Chao Dong
Title: SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity
Abstract:
Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15\% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the three tasks. https://github.com/JimmyZou/SpikeVideoFormer
中文: SpikeVideoFormer是一种高效的脉冲驱动视频Transformer,具有线性时间复杂度,在视频任务中实现最先进性能,同时相比现有方法显著提升了能效。
English: SpikeVideoFormer is an efficient spike-driven video Transformer with linear temporal complexity that achieves state-of-the-art performance in video tasks while significantly improving energy efficiency over existing methods.
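Illustrative sketch (not taken from the paper): the abstract introduces a spike-driven Hamming attention but does not give its formula. The idea of scoring binary spike vectors by Hamming similarity, i.e. the fraction of positions where they agree, can be pictured as below. The normalization and the function names are assumptions, and the actual SDHA design may differ.

```python
def hamming_similarity(q, k):
    """Fraction of positions where two binary spike vectors agree."""
    if len(q) != len(k):
        raise ValueError("vectors must have the same length")
    matches = sum(1 for a, b in zip(q, k) if a == b)
    return matches / len(q)


def hamming_scores(queries, keys):
    """Score matrix with one row per query and one column per key."""
    return [[hamming_similarity(q, k) for k in keys] for q in queries]


queries = [[1, 0, 1, 0]]
keys = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]]
scores = hamming_scores(queries, keys)
print(scores)  # [[1.0, 0.0, 0.5]]
```

Unlike a plain dot product, which counts only positions where both vectors spike, this score also credits positions where both are silent.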

Authors:Jie Zhu, Jirong Zha, Ding Li, Leye Wang
Title: A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability
Abstract:
Self-supervised learning shows promise in harnessing extensive unlabeled data, but it also confronts significant privacy concerns, especially in vision. In this paper, we perform membership inference on visual self-supervised models in a more realistic setting: the self-supervised training method and details are unknown to the adversary, who usually faces a black-box system in practice. In this setting, considering that a self-supervised model could be trained by completely different self-supervised paradigms, e.g., masked image modeling and contrastive learning, with complex training details, we propose a unified membership inference method called PartCrop. It is motivated by the shared part-aware capability among models and stronger part response on the training data. Specifically, PartCrop crops parts of objects in an image to query responses within the image in representation space. We conduct extensive attacks on self-supervised models with different training protocols and structures using three widely used image datasets. The results verify the effectiveness and generalization of PartCrop. Moreover, to defend against PartCrop, we evaluate two common approaches, i.e., early stopping and differential privacy, and propose a tailored method called shrinking crop scale range. The defense experiments indicate that all of them are effective. Finally, besides prototype testing on toy visual encoders and small-scale image datasets, we quantitatively study the impacts of scaling from both data and model aspects in a realistic scenario and propose a scalable PartCrop-v2 by introducing two structural improvements to PartCrop. Our code is at https://github.com/JiePKU/PartCrop.
中文: 本文提出PartCrop方法,通过在表示空间中裁剪物体部件进行查询,实现了对视觉自监督模型的黑盒成员推理攻击,实验证明该方法具有普适有效性,并提出了相应的防御方案。
English: This paper introduces PartCrop, a unified membership inference attack method for visual self-supervised models that operates in a black-box setting by cropping object parts to detect training data, demonstrating effectiveness across various models and datasets while also proposing defense strategies.

Authors:Cunhang Fan, Xiaoke Yang, Hongyu Zhang, Ying Chen, Lu Li, Jian Zhou, Zhao Lv
Title: ListenNet: A Lightweight Spatio-Temporal Enhancement Nested Network for Auditory Attention Detection
Abstract:
Auditory attention detection (AAD) aims to identify the direction of the attended speaker in multi-speaker environments from brain signals, such as Electroencephalography (EEG) signals. However, existing EEG-based AAD methods overlook the spatio-temporal dependencies of EEG signals, limiting their decoding and generalization abilities. To address these issues, this paper proposes a Lightweight Spatio-Temporal Enhancement Nested Network (ListenNet) for AAD. The ListenNet has three key components: Spatio-temporal Dependency Encoder (STDE), Multi-scale Temporal Enhancement (MSTE), and Cross-Nested Attention (CNA). The STDE reconstructs dependencies between consecutive time windows across channels, improving the robustness of dynamic pattern extraction. The MSTE captures temporal features at multiple scales to represent both fine-grained and long-range temporal patterns. In addition, the CNA integrates hierarchical features more effectively through novel dynamic attention mechanisms to capture deep spatio-temporal correlations. Experimental results on three public datasets demonstrate the superiority of ListenNet over state-of-the-art methods in both subject-dependent and challenging subject-independent settings, while reducing the trainable parameter count by approximately 7 times. Code is available at: https://github.com/fchest/ListenNet.
中文: 本文提出ListenNet轻量网络,通过三个创新组件捕捉脑电信号的时空依赖性来提升听觉注意力检测性能,在显著减少参数的同时实现了更优的识别效果。
English: This paper introduces ListenNet, a lightweight network that enhances auditory attention detection by capturing spatio-temporal dependencies in EEG signals through three novel components, achieving superior performance with significantly fewer parameters.

Authors:Gabriel S. Gama, Valdir Grassi
Title: Uniform Loss vs. Specialized Optimization: A Comparative Analysis in Multi-Task Learning
Abstract:
Specialized Multi-Task Optimizers (SMTOs) balance task learning in Multi-Task Learning by addressing issues like conflicting gradients and differing gradient norms, which hinder equal-weighted task training. However, recent critiques suggest that equally weighted tasks can achieve competitive results compared to SMTOs, arguing that previous SMTO results were influenced by poor hyperparameter optimization and lack of regularization. In this work, we evaluate these claims through an extensive empirical evaluation of SMTOs, including some of the latest methods, on more complex multi-task problems to clarify this behavior. Our findings indicate that SMTOs perform well compared to uniform loss and that fixed weights can achieve competitive performance compared to SMTOs. Furthermore, we demonstrate why uniform loss performs similarly to SMTOs in some instances. The source code is available at https://github.com/Gabriel-SGama/UnitScal_vs_SMTOs.
中文: 专业多任务优化器(SMTOs)能有效平衡任务学习,与统一损失加权表现相当,同时固定权重在多任务学习中也展现出竞争力。
English: Specialized Multi-Task Optimizers (SMTOs) effectively balance task learning and can perform comparably to uniform loss weighting, while fixed weights also achieve competitive results in multi-task learning scenarios.

Authors:Alan Jeffares, Liyuan Liu
Title: An Introduction to Discrete Variational Autoencoders
Abstract:
Variational Autoencoders (VAEs) are well-established as a principled approach to probabilistic unsupervised learning with neural networks. Typically, an encoder network defines the parameters of a Gaussian distributed latent space from which we can sample and pass realizations to a decoder network. This model is trained to reconstruct its inputs and is optimized through the evidence lower bound. In recent years, discrete latent spaces have grown in popularity, suggesting that they may be a natural choice for many data modalities (e.g. text). In this tutorial, we provide a rigorous, yet practical, introduction to discrete variational autoencoders -- specifically, VAEs in which the latent space is made up of latent variables that follow a categorical distribution. We assume only a basic mathematical background with which we carefully derive each step from first principles. From there, we develop a concrete training recipe and provide an example implementation, hosted at https://github.com/alanjeffares/discreteVAE.
中文: 本教程提供了关于离散变分自编码器的严谨而实用的介绍,它使用分类分布的潜在变量,并包含从基本原理出发的详细推导、训练指南及示例实现。
English: This tutorial offers a practical and rigorous introduction to discrete variational autoencoders, which utilize categorical latent variables, and includes a detailed derivation from basic principles along with a training guide and sample implementation.
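Illustrative sketch (not taken from the tutorial): with categorical latent variables, the regularization term of the evidence lower bound is a KL divergence between two categorical distributions, KL(q || p) = sum_k q_k log(q_k / p_k). The snippet below evaluates it against a uniform prior; the uniform prior, the number of categories `K`, and the example posterior are assumptions chosen for illustration.

```python
import math


def categorical_kl(q, p):
    """KL(q || p) = sum_k q_k * log(q_k / p_k), taking 0 * log 0 as 0.

    Assumes p_k > 0 wherever q_k > 0.
    """
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0.0)


K = 4
uniform_prior = [1.0 / K] * K
posterior = [0.7, 0.1, 0.1, 0.1]  # toy encoder output for one latent variable

kl = categorical_kl(posterior, uniform_prior)
print(kl)
```

Against a uniform prior this KL equals log K minus the entropy of q, so it is 0 when the encoder is maximally uncertain and reaches log K when it puts all mass on one category.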

Authors:Julius Henke
Title: AutoPentest: Enhancing Vulnerability Management With Autonomous LLM Agents
Abstract:
A recent area of increasing research is the use of Large Language Models (LLMs) in penetration testing, which promises to reduce costs and thus allow for higher frequency. We conduct a review of related work, identifying best practices and common evaluation issues. We then present AutoPentest, an application for performing black-box penetration tests with a high degree of autonomy. AutoPentest is based on the LLM GPT-4o from OpenAI and the LLM agent framework LangChain. It can perform complex multi-step tasks, augmented by external tools and knowledge bases. We conduct a study on three capture-the-flag style Hack The Box (HTB) machines, comparing our implementation AutoPentest with the baseline approach of manually using the ChatGPT-4o user interface. Both approaches are able to complete 15-25% of the subtasks on the HTB machines, with AutoPentest slightly outperforming ChatGPT. We measure a total cost of US$96.20 when using AutoPentest across all experiments, while a one-month subscription to ChatGPT Plus costs US$20. The results show that further implementation efforts and the use of more powerful LLMs released in the future are likely to make this a viable part of vulnerability management.
中文: 近期研究利用大型语言模型如GPT-4o开发自动化渗透测试工具AutoPentest,该工具在成本可控的条件下略优于人工操作ChatGPT,展现了未来结合更强大模型提升漏洞管理效能的潜力。
English: Recent research explores using Large Language Models like GPT-4o in autonomous penetration testing tools such as AutoPentest, which slightly outperforms manual ChatGPT use in cost-effective vulnerability assessments, showing potential for future improvements with advanced models.

Authors:Tuan Dung Nguyen, Duncan J. Watts, Mark E. Whiting
Title: Empirically evaluating commonsense intelligence in large language models with large-scale human judgments
Abstract:
Commonsense intelligence in machines is often assessed by static benchmarks that compare a model's output against human-prescribed correct labels. An important, albeit implicit, assumption of these labels is that they accurately capture what any human would think, effectively treating human common sense as homogeneous. However, recent empirical work has shown that humans vary enormously in what they consider commonsensical; thus what appears self-evident to one benchmark designer may not be so to another. Here, we propose a method for evaluating common sense in artificial intelligence (AI), specifically in large language models (LLMs), that incorporates empirically observed heterogeneity among humans by measuring the correspondence between a model's judgment and that of a human population. We first find that, when treated as independent survey respondents, most LLMs remain below the human median in their individual commonsense competence. Second, when used as simulators of a hypothetical population, LLMs correlate with real humans only modestly in the extent to which they agree on the same set of statements. In both cases, smaller, open-weight models are surprisingly more competitive than larger, proprietary frontier models. Our evaluation framework, which ties commonsense intelligence to its cultural basis, contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
中文摘要:该摘要批评了静态基准测试假设人类常识同质化的做法,提出了一种通过将AI常识与人类多样性对齐的评估方法,发现较小模型在匹配人类判断差异方面常优于大型模型。
English Summary: The abstract critiques static benchmarks for assuming uniform human common sense and introduces a method to evaluate AI's common sense by aligning it with human diversity, revealing that smaller models often outperform larger ones in matching human judgment variability.

Authors:Yue Wang, Shuai Xu, Xuelin Zhu, Yicong Li
Title: MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning
Abstract:
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies mainly rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architectural and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP's visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. This key information is progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model's perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at https://github.com/ltpwy/MSCI.
中文: 提出的多阶段跨模态交互模型通过自适应聚合器将局部与全局视觉特征逐步融入文本表示,有效增强了CLIP对细粒度信息的感知能力,在三个基准数据集上验证了其优越性。
English: The proposed Multi-Stage Cross-modal Interaction (MSCI) model enhances CLIP's fine-grained perception by progressively integrating local and global visual features into textual representations through adaptive aggregators, demonstrating superior performance across three benchmark datasets.

Authors:Mengqiu Xu, Kaixin Chen, Heng Guo, Yixiang Huang, Ming Wu, Zhenwei Shi, Chuang Zhang, Jun Guo
Title: MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting
Abstract:
Deep learning approaches for marine fog detection and forecasting have outperformed traditional methods, demonstrating significant scientific and practical importance. However, the limited availability of open-source datasets remains a major challenge. Existing datasets, often focused on a single region or satellite, restrict the ability to evaluate model performance across diverse conditions and hinder the exploration of intrinsic marine fog characteristics. To address these limitations, we introduce MFogHub, the first multi-regional and multi-satellite dataset to integrate annotated marine fog observations from 15 coastal fog-prone regions and six geostationary satellites, comprising over 68,000 high-resolution samples. By encompassing diverse regions and satellite perspectives, MFogHub facilitates rigorous evaluation of both detection and forecasting methods under varying conditions. Extensive experiments with 16 baseline models demonstrate that MFogHub can reveal generalization fluctuations due to regional and satellite discrepancy, while also serving as a valuable resource for the development of targeted and scalable fog prediction techniques. Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. The dataset and code are at https://github.com/kaka0910/MFogHub.
中文: MFogHub 是首个整合了来自15个沿海雾区和六颗卫星的超过68,000个标注海洋雾样本的多区域多卫星数据集,能够全面评估检测与预测模型,并解决不同条件下的泛化难题。
English: MFogHub is the first multi-regional and multi-satellite dataset integrating over 68,000 annotated marine fog samples from 15 coastal regions and six satellites, enabling comprehensive evaluation of detection and forecasting models while addressing generalization challenges across diverse conditions.

Authors:Taian Guo, Haiyang Shen, JinSheng Huang, Zhengyang Mao, Junyu Luo, Binqi Chen, Zhuoru Chen, Luchen Liu, Bingyu Xia, Xuhui Liu, Yun Ma, Ming Zhang
Title: MASS: Multi-agent simulation scaling for portfolio construction
Abstract:
The application of LLM-based agents in financial investment has shown significant promise, yet existing approaches often require intermediate steps like predicting individual stock movements or rely on predefined, static workflows. These limitations restrict their adaptability and effectiveness in constructing optimal portfolios. In this paper, we introduce the Multi-Agent Scaling Simulation (MASS), a novel framework that leverages multi-agent simulation for direct, end-to-end portfolio construction. At its core, MASS employs a backward optimization process to dynamically learn the optimal distribution of heterogeneous agents, enabling the system to adapt to evolving market regimes. A key finding enabled by our framework is the exploration of the scaling effect for portfolio construction: we demonstrate that as the number of agents increases exponentially (up to 512), the aggregated decisions yield progressively higher excess returns. Extensive experiments on a challenging, self-collected dataset from the 2023 Chinese A-share market show that MASS consistently outperforms seven state-of-the-art baselines. Further backtesting, stability analyses and the experiment on data leakage concerns validate its enhanced profitability and robustness. We have open-sourced our code, dataset, and training snapshots at https://github.com/gta0804/MASS/ to foster further research.
中文: 本文提出多智能体规模模拟(MASS)框架,通过动态多智能体模拟实现端到端投资组合构建,证明智能体数量指数增长能持续提升超额收益,在中国A股市场数据上验证了其优于现有方法的盈利能力和鲁棒性。
English: This paper introduces the Multi-Agent Scaling Simulation (MASS) framework, which uses dynamic multi-agent simulation for direct portfolio construction and demonstrates that exponentially increasing agents enhances returns, outperforming existing methods in robustness and profitability on Chinese market data.

Authors:Pavel Korotaev, Petr Surovtsev, Alexander Kapitanov, Karina Kvanchiani, Aleksandr Nagaev
Title: HandReader: Advanced Techniques for Efficient Fingerspelling Recognition
Abstract:
Fingerspelling is a significant component of Sign Language (SL), allowing the interpretation of proper names, characterized by fast hand movements during signing. Although previous works on fingerspelling recognition have focused on processing the temporal dimension of videos, there remains room for improving the accuracy of these approaches. This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader$_{RGB}$ employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths while preserving important sequential information. HandReader$_{KP}$ is built on the proposed Temporal Pose Encoder (TPE), which operates on keypoints as tensors. Such keypoint composition in a batch allows the encoder to pass them through 2D and 3D convolution layers, utilizing temporal and spatial information and accumulating keypoint coordinates. We also introduce HandReader$_{RGB+KP}$, an architecture with a joint encoder to benefit from RGB and keypoint modalities. Each HandReader model possesses distinct advantages and achieves state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets. Moreover, the models demonstrate high performance on the first open dataset for Russian fingerspelling, Znaki, presented in this paper. The Znaki dataset and HandReader pre-trained models are publicly available.
中文: 本文提出HandReader,一套包含三种架构的手指拼写识别系统,通过新颖时序模块处理RGB和关键点数据,在现有及新推出的俄语数据集上均取得最优性能。
English: This paper introduces HandReader, a set of three architectures for fingerspelling recognition that achieve state-of-the-art results by processing RGB and keypoint data through novel temporal modules, validated on both existing and a new Russian dataset.

Authors:Xiangwen Zhuge, Xu Shen, Zeyu Wang, Fan Dang, Xuan Ding, Danyang Li, Yahui Han, Tianxiang Hao, Zheng Yang
Title: SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained Devices
Abstract:
Efficient LLM inference on resource-constrained devices presents significant challenges in compute and memory utilization. Due to limited GPU memory, existing systems offload model weights to CPU memory, incurring substantial I/O overhead between the CPU and GPU. This leads to two major inefficiencies: (1) GPU cores are underutilized, often remaining idle while waiting for data to be loaded; and (2) GPU memory has low impact on performance, as reducing its capacity has minimal effect on overall throughput. In this paper, we propose SpecOffload, a high-throughput inference engine that embeds speculative decoding into offloading. Our key idea is to unlock latent GPU resources for storing and executing a draft model used for speculative decoding, thus accelerating inference at near-zero additional cost. To support this, we carefully orchestrate the interleaved execution of target and draft models in speculative decoding within the offloading pipeline, and propose a planner to manage tensor placement and select optimal parameters. Compared to the best baseline, SpecOffload improves GPU core utilization by 4.49x and boosts inference throughput by 2.54x. Our code is available at https://github.com/MobiSense/SpecOffload-public .
中文: SpecOffload通过在卸载机制中嵌入推测性解码,有效利用GPU潜在资源,在几乎零额外成本下大幅提升了推理吞吐量和GPU核心利用率。
English: SpecOffload enhances LLM inference efficiency on resource-limited devices by integrating speculative decoding with offloading, boosting GPU utilization and throughput significantly without extra cost.
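For readers unfamiliar with speculative decoding, the technique SpecOffload embeds into offloading, a minimal greedy sketch follows. It is not SpecOffload's pipeline or planner: `speculative_decode`, `target`, and `draft` are hypothetical names, each model is assumed to be a plain function from a token prefix to its greedy next token, and the verification loop stands in for what a real system would do in one batched pass of the target model.

```python
from typing import Callable, List

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       num_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the cheap draft proposes, the target verifies."""
    out = list(prompt)
    goal = len(prompt) + num_tokens
    while len(out) < goal:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target model checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # Keep the agreed prefix, then substitute the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every proposed token matched the target's greedy choice.
            out.extend(proposal)
    return out[:goal]
```

Because a drafted token is kept only when it equals the target's greedy choice, the output matches target-only greedy decoding; the draft changes how many target steps can be verified together, not the result.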

Authors:Wenhao Shen, Wanqi Yin, Xiaofeng Yang, Cheng Chen, Chaoyue Song, Zhongang Cai, Lei Yang, Hao Wang, Guosheng Lin
Title: ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization
Abstract:
Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that Aligns a Diffusion-based HMR model in a preference optimization manner. First, we train a human mesh prediction assessment model, HMR-Scorer, capable of evaluating predictions even for in-the-wild images without 3D annotations. We then use HMR-Scorer to create a preference dataset, where each input image has a pair of winner and loser mesh predictions. This dataset is used to finetune the base model using direct preference optimization. Moreover, HMR-Scorer also helps improve existing HMR models by data cleaning, even with fewer training samples. Extensive experiments show that ADHMR outperforms current state-of-the-art methods. Code is available at: https://github.com/shenwenhao01/ADHMR.
中文:提出的ADHMR框架通过偏好优化对齐基于扩散的人体网格恢复模型,利用新型HMR-Scorer评估预测并创建偏好数据集进行微调,实现了最先进的性能表现。
English: The proposed ADHMR framework enhances human mesh recovery by aligning a diffusion-based model through preference optimization, using a novel HMR-Scorer to evaluate predictions and create a preference dataset for fine-tuning, achieving state-of-the-art performance.
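As an illustration of the direct preference optimization step mentioned above, the sketch below computes the standard DPO loss for a single winner/loser pair from log-probabilities. This is the generic likelihood-based objective, not necessarily the diffusion-specific form ADHMR uses; `dpo_loss`, its argument names, and the default `beta` are assumptions made for this example.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (winner, loser) pair.

    logp_*     : log-probability of the winner/loser under the model being tuned
    ref_logp_* : the same quantities under the frozen reference (base) model
    beta       : strength of the implicit KL regularisation (assumed default)
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return _softplus(-margin)
```

The loss equals log 2 when the tuned model and the reference agree, and it falls as the tuned model raises the winner's likelihood relative to the loser's.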

Authors:Yanbo Ding, Xirui Hu, Zhizhi Guo, Chi Zhang, Yali Wang
Title: MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation
Abstract:
Human image animation has gained increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information for open-world animation. To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for human image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatio-temporal cues and avoid strict pixel-level alignment between pose image and character, enabling more flexible and disentangled control. Then, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings, MV-DiT can effectively leverage motion tokens as 4D compact yet expressive context for human image animation in the complex 3D world. Hence, it marks a significant step forward in this field and opens a new direction for pose-guided human video generation. Experiments show that our MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, surpassing the second-best by 65%. Powered by robust motion tokens, MTVCrafter also generalizes well to diverse open-world characters (single/multiple, full/half-body) across various styles and scenarios. Our video demos and code are on: https://github.com/DINGYANB/MTVCrafter.
中文摘要:MTVCrafter提出首个直接建模原始3D运动序列的框架,通过4D运动令牌突破2D姿态图像的局限,在复杂3D世界中实现更灵活的人类图像动画,并在开放场景中展现出卓越的泛化能力。
English Summary: MTVCrafter introduces a novel framework using 4D motion tokens to overcome limitations of 2D pose images in human image animation, achieving state-of-the-art performance with superior generalization across diverse characters and scenarios.

Authors:Haozhe Luo, Ziyu Zhou, Zixin Shu, Aurélie Pahud de Mortanges, Robert Berke, Mauricio Reyes
Title: On the Interplay of Human-AI Alignment, Fairness, and Performance Trade-offs in Medical Imaging
Abstract:
Deep neural networks excel in medical imaging but remain prone to biases, leading to fairness gaps across demographic groups. We provide the first systematic exploration of Human-AI alignment and fairness in this domain. Our results show that incorporating human insights consistently reduces fairness gaps and enhances out-of-domain generalization, though excessive alignment can introduce performance trade-offs, emphasizing the need for calibrated strategies. These findings highlight Human-AI alignment as a promising approach for developing fair, robust, and generalizable medical AI systems, striking a balance between expert guidance and automated efficiency. Our code is available at https://github.com/Roypic/Aligner.
中文摘要:在医学影像中,人机对齐能持续缩小公平性差距并提升泛化能力,但过度对齐需采用校准策略以平衡专家指导与自动化效率。
English Summary: Human-AI alignment in medical imaging consistently reduces fairness gaps and improves generalization, though excessive alignment requires calibrated strategies to balance expert guidance with automated efficiency.

Authors:Dario Di Palma, Felice Antonio Merra, Maurizio Sfilio, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia
Title: Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M
Abstract:
Large Language Models (LLMs) have become increasingly central to recommendation scenarios due to their remarkable natural language understanding and generation capabilities. Although significant research has explored the use of LLMs for various recommendation tasks, little effort has been dedicated to verifying whether they have memorized public recommendation datasets as part of their training data. This is undesirable because memorization reduces the generalizability of research findings, as benchmarking on memorized datasets does not guarantee generalization to unseen datasets. Furthermore, memorization can amplify biases, for example, some popular items may be recommended more frequently than others. In this work, we investigate whether LLMs have memorized public recommendation datasets. Specifically, we examine two model families (GPT and Llama) across multiple sizes, focusing on one of the most widely used datasets in recommender systems: MovieLens-1M. First, we define dataset memorization as the extent to which item attributes, user profiles, and user-item interactions can be retrieved by prompting the LLMs. Second, we analyze the impact of memorization on recommendation performance. Lastly, we examine whether memorization varies across model families and model sizes. Our results reveal that all models exhibit some degree of memorization of MovieLens-1M, and that recommendation performance is related to the extent of memorization. We have made all the code publicly available at: https://github.com/sisinflab/LLM-MemoryInspector
中文: 本研究调查了多种大型语言模型对MovieLens-1M数据集的记忆情况,发现所有模型均表现出不同程度的记忆效应,且这种记忆与推荐性能存在关联。
English: This study investigates the memorization of the MovieLens-1M dataset by various LLMs, revealing that all models exhibit some degree of memorization which correlates with their recommendation performance.

Authors:Xiang He, Dongcheng Zhao, Yang Li, Qingqun Kong, Xin Yang, Yi Zeng
Title: Incorporating brain-inspired mechanisms for multimodal learning in artificial intelligence
Abstract:
Multimodal learning enhances the perceptual capabilities of cognitive systems by integrating information from different sensory modalities. However, existing multimodal fusion research typically assumes static integration, not fully incorporating key dynamic mechanisms found in the brain. Specifically, the brain exhibits an inverse effectiveness phenomenon, wherein weaker unimodal cues yield stronger multisensory integration benefits; conversely, when individual modal cues are stronger, the effect of fusion is diminished. This mechanism enables biological systems to achieve robust cognition even with scarce or noisy perceptual cues. Inspired by this biological mechanism, we explore the relationship between multimodal output and information from individual modalities, proposing an inverse effectiveness driven multimodal fusion (IEMF) strategy. By incorporating this strategy into neural networks, we achieve more efficient integration with improved model performance and computational efficiency, demonstrating up to 50% reduction in computational cost across diverse fusion methods. We conduct experiments on audio-visual classification, continual learning, and question answering tasks to validate our method. Results consistently demonstrate that our method performs excellently in these tasks. To verify universality and generalization, we also conduct experiments on Artificial Neural Networks (ANN) and Spiking Neural Networks (SNN), with results showing good adaptability to both network types. Our research emphasizes the potential of incorporating biologically inspired mechanisms into multimodal networks and provides promising directions for the future development of multimodal artificial intelligence. The code is available at https://github.com/Brain-Cog-Lab/IEMF.
中文摘要:本研究受大脑动态整合机制启发,提出逆向有效性驱动的多模态融合策略,在多种任务和神经网络架构中实现高达50%的计算成本降低,同时提升模型性能。
English Summary: This study introduces an Inverse Effectiveness driven Multimodal Fusion (IEMF) strategy inspired by the brain's dynamic integration mechanism, achieving up to 50% computational cost reduction while improving performance across various tasks and neural network architectures.

Authors:Saikat Barua, Mostafizur Rahman, Shehenaz Khaled, Md Jafor Sadek, Rafiul Islam, Shahnewaz Siddique
Title: QuXAI: Explainers for Hybrid Quantum Machine Learning Models
Abstract:
The emergence of hybrid quantum-classical machine learning (HQML) models opens new horizons of computational intelligence, but their fundamental complexity frequently leads to black box behavior that undermines transparency and reliability in their application. Although XAI for quantum systems is still in its infancy, a major research gap is evident in robust global and local explainability approaches that are designed for HQML architectures that employ quantized feature encoding followed by classical learning. The gap is the focus of this work, which introduces QuXAI, a framework based upon Q-MEDLEY, an explainer for explaining feature importance in these hybrid systems. Our model entails the creation of HQML models incorporating quantum feature maps, the use of Q-MEDLEY, which combines feature-based inferences, preserving the quantum transformation stage and visualizing the resulting attributions. Our results show that Q-MEDLEY delineates influential classical aspects in HQML models, as well as separates their noise, and competes well against established XAI techniques in classical validation settings. Ablation studies more significantly expose the virtues of the composite structure used in Q-MEDLEY. The implications of this work are critically important, as it provides a route to improve the interpretability and reliability of HQML models, thus promoting greater confidence and enabling safer and more responsible use of quantum-enhanced AI technology. Our code and experiments are open-sourced at: https://github.com/GitsSaikat/QuXAI
中文: 本文提出的QuXAI框架利用Q-MEDLEY增强混合量子-经典机器学习模型的可解释性,通过识别关键特征和分离噪声来提高模型透明度与可靠性。
English: This paper introduces QuXAI, a framework using Q-MEDLEY to enhance explainability in hybrid quantum-classical machine learning models by identifying influential features and separating noise, thereby improving their transparency and reliability.

Authors:Ziad Kheil, Lucas Robinet, Laurent Risser, Soleakhena Ken
Title: IMITATE: Image Registration with Context for unknown time frame recovery
Abstract:
In this paper, we formulate a novel image registration formalism dedicated to the estimation of unknown condition-related images, based on two or more known images and their associated conditions. We show how to practically model this formalism by using a new conditional U-Net architecture, which fully takes into account the conditional information and does not need any fixed image. Our formalism is then applied to image moving tumors for radiotherapy treatment at different breathing amplitudes using 4D-CT (3D+t) scans in thoracoabdominal regions. This driving application is particularly complex as it requires stitching a collection of sequential 2D slices into several 3D volumes at different organ positions. Movement interpolation with standard methods then generates well-known reconstruction artefacts in the assembled volumes due to irregular patient breathing, hysteresis and poor correlation of breathing signal to internal motion. Results obtained on 4D-CT clinical data showcase artefact-free volumes achieved at real-time latencies. The code is publicly available at https://github.com/Kheil-Z/IMITATE .
Chinese: 本文提出了一种基于条件U-Net架构的新型图像配准方法,能够从已知图像估计未知条件相关图像,并在胸腹部4D-CT扫描中成功消除了移动肿瘤放疗中的重建伪影。
English: This paper introduces a novel image registration method using a conditional U-Net architecture to estimate unknown condition-related images from known ones, successfully eliminating reconstruction artifacts in 4D-CT scans for radiotherapy of moving tumors.

Authors:Jeonghyun Woo, Joyce Qu, Gururaj Saileshwar, Prashant J. Nair
Title: When Mitigations Backfire: Timing Channel Attacks and Defense for PRAC-Based RowHammer Mitigations
Abstract:
Per Row Activation Counting (PRAC) has emerged as a robust framework for mitigating RowHammer (RH) vulnerabilities in modern DRAM systems. However, we uncover a critical vulnerability: a timing channel introduced by the Alert Back-Off (ABO) protocol and Refresh Management (RFM) commands. We present PRACLeak, a novel attack that exploits these timing differences to leak sensitive information, such as secret keys from vulnerable AES implementations, by monitoring memory access latencies. To counter this, we propose Timing-Safe PRAC (TPRAC), a defense that eliminates PRAC-induced timing channels without compromising RH mitigation efficacy. TPRAC uses Timing-Based RFMs, issued periodically and independent of memory activity. It requires only a single-entry in-DRAM mitigation queue per DRAM bank and is compatible with existing DRAM standards. Our evaluations demonstrate that TPRAC closes timing channels while incurring only 3.4% performance overhead at the RH threshold of 1024.
中文摘要:PRACLeak利用PRAC框架中的时序漏洞泄露敏感数据,而TPRAC则以轻微性能代价有效消除了这些时序通道。
English Summary: PRACLeak exploits timing vulnerabilities in the PRAC framework to leak sensitive data, while TPRAC effectively eliminates these timing channels with minimal performance impact.

Authors:Ijazul Haq, Yingjie Zhang, Irfan Ali Khan
Title: PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language
Abstract:
This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at https://github.com/zirak-ai/PashtoOCR.
中文: 本研究评估了大型多模态模型在低资源普什图语光学字符识别任务中的表现,利用新开发的合成数据集PsOCR进行测试,结果表明Gemini整体表现最佳,而Qwen-7B在开源模型中最为突出。
English: This study assesses the performance of Large Multimodal Models on Optical Character Recognition for the low-resource Pashto language, using a newly developed synthetic dataset, PsOCR, and finds that Gemini leads overall while Qwen-7B excels among open-source models.

Authors:Jing-Cheng Pang, Kaiyuan Li, Yidi Wang, Si-Hang Yang, Shengyi Jiang, Yang Yu
Title: ImagineBench: Evaluating Reinforcement Learning with Large Language Model Rollouts
Abstract:
A central challenge in reinforcement learning (RL) is its dependence on extensive real-world interaction data to learn task-specific policies. While recent work demonstrates that large language models (LLMs) can mitigate this limitation by generating synthetic experience (noted as imaginary rollouts) for mastering novel tasks, progress in this emerging field is hindered by the lack of a standard benchmark. To bridge this gap, we introduce ImagineBench, the first comprehensive benchmark for evaluating offline RL algorithms that leverage both real rollouts and LLM-imaginary rollouts. The key features of ImagineBench include: (1) datasets comprising environment-collected and LLM-imaginary rollouts; (2) diverse domains of environments covering locomotion, robotic manipulation, and navigation tasks; and (3) natural language task instructions with varying complexity levels to facilitate language-conditioned policy learning. Through systematic evaluation of state-of-the-art offline RL algorithms, we observe that simply applying existing offline RL algorithms leads to suboptimal performance on unseen tasks, achieving a 35.44% success rate on hard tasks, in contrast to 64.37% for the method trained on real rollouts for hard tasks. This result highlights the need for algorithm advancements to better leverage LLM-imaginary rollouts. Additionally, we identify key opportunities for future research, including better utilization of imaginary rollouts, fast online adaptation and continual learning, and extension to multi-modal tasks. Our code is publicly available at https://github.com/LAMDA-RL/ImagineBench.
中文:ImagineBench作为首个综合性基准被提出,旨在解决利用真实和语言模型生成虚拟经验的离线强化学习算法缺乏标准化评估的问题,揭示了现有方法在未见任务上表现欠佳,并强调需改进算法以更好地利用合成数据。
English: ImagineBench is introduced as the first comprehensive benchmark to address the lack of standardized evaluation for offline reinforcement learning algorithms that utilize both real and LLM-generated imaginary rollouts, revealing suboptimal performance of existing methods and highlighting the need for advancements to better leverage synthetic data.

Authors:Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, Xin Jin
Title: ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
Abstract:
With the widespread adoption of Large Language Models (LLMs), serving LLM inference requests has become an increasingly important task, attracting active research advancements. Practical workloads play an essential role in this process: they are critical for motivating and benchmarking serving techniques and systems. However, the existing understanding of real-world LLM serving workloads is limited due to the lack of a comprehensive workload characterization. Prior analyses remain insufficient in scale and scope, thus failing to fully capture intricate workload characteristics. In this paper, we fill the gap with an in-depth characterization of LLM serving workloads collected from our worldwide cloud inference serving service, covering not only language models but also emerging multimodal and reasoning models, and unveiling important new findings in each case. Moreover, based on our findings, we propose ServeGen, a principled framework for generating realistic LLM serving workloads by composing them on a per-client basis. A practical use case in production validates that ServeGen avoids 50% under-provisioning compared to naive workload generation, demonstrating ServeGen's advantage in performance benchmarking. ServeGen is available at https://github.com/alibaba/ServeGen.
English Summary: The increasing use of Large Language Models (LLMs) highlights the need for better understanding of real-world serving workloads, which this study addresses through comprehensive characterization and the introduction of ServeGen, a framework that improves workload generation and reduces under-provisioning by 50%.
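The idea of composing a workload on a per-client basis can be illustrated with a small sketch that generates one arrival stream per client and merges the streams by timestamp. This is not ServeGen's API: `client_arrivals`, `compose_workload`, and the client names are hypothetical, and Poisson arrivals are a simplifying assumption made for the example rather than a finding of the paper.

```python
import heapq
import random
from typing import Dict, List, Tuple


def client_arrivals(rate_per_s: float, duration_s: float,
                    rng: random.Random) -> List[float]:
    """Arrival times for one client, assuming exponential inter-arrival gaps."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return times
        times.append(t)


def compose_workload(client_rates: Dict[str, float], duration_s: float,
                     seed: int = 0) -> List[Tuple[float, str]]:
    """Merge per-client streams into one time-ordered list of (time, client_id)."""
    rng = random.Random(seed)
    streams = [
        [(t, cid) for t in client_arrivals(rate, duration_s, rng)]
        for cid, rate in client_rates.items()
    ]
    # Each stream is already sorted, so a k-way merge keeps global time order.
    return list(heapq.merge(*streams))
```

For example, `compose_workload({"heavy": 5.0, "light": 0.5}, duration_s=60.0)` yields a single trace in which the two hypothetical clients keep their own request rates, which is the property a single aggregate rate would lose.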

Authors:Jianpeng Qi, Chao Liu, Xiao Zhang, Lei Wang, Rui Wang, Junyu Dong, Yanwei Yu
Title: A Survey on Open-Source Edge Computing Simulators and Emulators: The Computing and Networking Convergence Perspective
Abstract:
Edge computing, with its low latency, dynamic scalability, and location awareness, along with the convergence of computing and communication paradigms, has been successfully applied in critical domains such as industrial IoT, smart healthcare, smart homes, and public safety. This paper provides a comprehensive survey of open-source edge computing simulators and emulators, presented in our GitHub repository (https://github.com/qijianpeng/awesome-edge-computing), emphasizing the convergence of computing and networking paradigms. By examining more than 40 tools, including CloudSim, NS-3, and others, we identify the strengths and limitations in simulating and emulating edge environments. This survey classifies these tools into three categories: packet-level, application-level, and emulators. Furthermore, we evaluate them across five dimensions, ranging from resource representation to resource utilization. The survey highlights the integration of different computing paradigms, packet processing capabilities, support for edge environments, user-defined metric interfaces, and scenario visualization. The findings aim to guide researchers in selecting appropriate tools for developing and validating advanced computing and networking technologies.
中文: 本文系统综述了40余种开源边缘计算模拟器与仿真器,通过分类和功能评估为研究人员选择合适的工具开发先进计算与网络技术提供指导。
English: This paper surveys over 40 open-source edge computing simulators and emulators, categorizing them by functionality and evaluating their capabilities to guide researchers in tool selection for developing advanced computing and networking technologies.

Authors:Yuan Gao, Shaobo Xia, Sheng Nie, Cheng Wang, Xiaohuan Xi, Bisheng Yang
Title: APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds
Abstract:
Airborne laser scanning (ALS) point cloud segmentation is a fundamental task for large-scale 3D scene understanding. In real-world applications, models are typically fixed after training. However, domain shifts caused by changes in the environment, sensor types, or sensor degradation often lead to a decline in model performance. Continuous Test-Time Adaptation (CTTA) offers a solution by adapting a source-pretrained model to evolving, unlabeled target domains. Despite its potential, research on ALS point clouds remains limited, facing challenges such as the absence of standardized datasets and the risk of catastrophic forgetting and error accumulation during prolonged adaptation. To tackle these challenges, we propose APCoTTA, the first CTTA method tailored for ALS point cloud semantic segmentation. We propose a dynamic trainable layer selection module. This module utilizes gradient information to select low-confidence layers for training, and the remaining layers are kept frozen, mitigating catastrophic forgetting. To further reduce error accumulation, we propose an entropy-based consistency loss. By filtering out unreliable samples based on entropy, we apply consistency loss only to the reliable samples, enhancing model stability. In addition, we propose a random parameter interpolation mechanism, which randomly blends parameters from the selected trainable layers with those of the source model. This approach helps balance target adaptation and source knowledge retention, further alleviating forgetting. Finally, we construct two benchmarks, ISPRSC and H3DC, to address the lack of CTTA benchmarks for ALS point cloud segmentation. Experimental results demonstrate that APCoTTA achieves the best performance on two benchmarks, with mIoU improvements of approximately 9% and 14% over direct inference. The new benchmarks and code are available at https://github.com/Gaoyuan2/APCoTTA.
Chinese: APCoTTA方法通过动态可训练层选择、基于熵的一致性损失和随机参数插值机制,解决了机载激光扫描点云连续测试时适应中的灾难性遗忘和误差累积问题,并在新构建的基准数据集上实现了mIoU指标的显著提升。
English: The APCoTTA method introduces dynamic trainable layer selection, entropy-based consistency loss, and random parameter interpolation to address catastrophic forgetting and error accumulation in continuous test-time adaptation for ALS point cloud segmentation, achieving significant mIoU improvements on newly established benchmarks.
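One plausible reading of the random parameter interpolation mechanism is sketched below: each adapted parameter is, with some probability, pulled back toward its source-model value. The abstract does not give the exact rule, so the function name `random_interpolate`, the per-parameter Bernoulli choice, and the `p` and `alpha` values are all assumptions for illustration.

```python
import random
from typing import List, Optional


def random_interpolate(adapted: List[float], source: List[float],
                       p: float = 0.5, alpha: float = 0.5,
                       rng: Optional[random.Random] = None) -> List[float]:
    """Randomly blend adapted parameters with their source-model counterparts.

    p     : probability that a given parameter is blended (assumed)
    alpha : weight on the source value when blending (assumed)
    """
    if len(adapted) != len(source):
        raise ValueError("parameter lists must have the same length")
    rng = rng or random.Random(0)
    return [
        alpha * s + (1.0 - alpha) * a if rng.random() < p else a
        for a, s in zip(adapted, source)
    ]
```

With `p=0` the adapted weights are left untouched, and with `p=1, alpha=1` they are reset to the source model, so the two knobs span the range between full adaptation and full retention of source knowledge.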

Authors:Jiakun Deng, Kexuan Li, Xingye Cui, Jiaxuan Li, Chang Long, Tian Pu, Zhenming Peng
Title: CSPENet: Contour-Aware and Saliency Priors Embedding Network for Infrared Small Target Detection
Abstract:
Infrared small target detection (ISTD) plays a critical role in a wide range of civilian and military applications. Existing methods suffer from deficiencies in the localization of dim targets and the perception of contour information under dense clutter environments, severely limiting their detection performance. To tackle these issues, we propose a contour-aware and saliency priors embedding network (CSPENet) for ISTD. We first design a surround-convergent prior extraction module (SCPEM) that effectively captures the intrinsic characteristic of target contour pixel gradients converging toward their center. This module concurrently extracts two collaborative priors: a boosted saliency prior for accurate target localization and multi-scale structural priors for comprehensively enriching contour detail representation. Building upon this, we propose a dual-branch priors embedding architecture (DBPEA) that establishes differentiated feature fusion pathways, embedding these two priors at optimal network positions to achieve performance enhancement. Finally, we develop an attention-guided feature enhancement module (AGFEM) to refine feature representations and improve saliency estimation accuracy. Experimental results on public datasets NUDT-SIRST, IRSTD-1k, and NUAA-SIRST demonstrate that our CSPENet outperforms other state-of-the-art methods in detection performance. The code is available at https://github.com/IDIP2025/CSPENet.
中文: 提出的CSPENet通过嵌入轮廓感知显著性先验的特殊模块,在基准数据集上实现了优于现有方法的红外小目标检测性能。
English: The proposed CSPENet enhances infrared small target detection by embedding contour-aware saliency priors through specialized modules, achieving superior performance on benchmark datasets compared to existing methods.

Authors:Zhe Shan, Lei Zhou, Liu Mao, Shaofan Chen, Chuanqiu Ren, Xia Xie
Title: Non-Registration Change Detection: A Novel Change Detection Task and Benchmark Dataset
Abstract:
In this study, we propose a novel remote sensing change detection task, non-registration change detection, to address the increasing number of emergencies such as natural disasters, anthropogenic accidents, and military strikes. First, in light of the limited discourse on the issue of non-registration change detection, we systematically propose eight scenarios that could arise in the real world and potentially contribute to the occurrence of non-registration problems. Second, we develop distinct image transformation schemes tailored to various scenarios to convert the available registration change detection dataset into a non-registration version. Finally, we demonstrate that non-registration change detection can cause catastrophic damage to the state-of-the-art methods. Our code and dataset are available at https://github.com/ShanZard/NRCD.
中文: 本研究提出了一种用于处理紧急情况的非配准变化检测遥感任务,界定了导致非配准问题的现实场景,相应转换了数据集,并证明现有先进方法在此类挑战下表现严重受损。
English: This study introduces non-registration change detection, a new remote sensing task for handling emergencies, defines real-world scenarios that cause non-registration issues, transforms existing datasets accordingly, and shows that current state-of-the-art methods degrade severely under these conditions.

Authors:Zixiao Zhu, Hanzhang Zhou, Zijian Feng, Tianjiao Li, Chua Jia Jim Deryl, Mak Lee Onn, Gee Wah Ng, Kezhi Mao
Title: Rethinking Prompt Optimizers: From Prompt Merits to Optimization
Abstract:
Prompt optimization (PO) provides a practical way to improve response quality when users lack the time or expertise to manually craft effective prompts. Existing methods typically rely on LLMs' self-generation ability to optimize prompts. However, due to limited downward compatibility, the instruction-heavy prompts generated by advanced LLMs can overwhelm lightweight inference models and degrade response quality, while also lacking interpretability due to implicit optimization. In this work, we rethink prompt optimization through the lens of explicit and interpretable design. We first identify a set of model-agnostic prompt quality merits and empirically validate their effectiveness in enhancing prompt and response quality. We then introduce MePO, a merit-guided, locally deployable prompt optimizer trained on our merit-guided prompt preference dataset generated by a lightweight LLM. MePO avoids online optimization, reduces privacy concerns, and, by learning clear, interpretable merits, generalizes effectively to both large-scale and lightweight inference models. Experiments demonstrate that MePO achieves better results across diverse tasks and model types, offering a scalable and robust solution for real-world deployment. The code, model, and dataset can be found at https://github.com/MidiyaZhu/MePO
中文摘要:MePO提出了一种基于明确质量标准的提示优化方法,通过可解释的优化准则提升不同模型的响应质量,无需在线处理即可确保兼容性和高效性。
English Summary: MePO introduces a merit-guided prompt optimization approach that enhances response quality across various models by using explicit, interpretable criteria, ensuring compatibility and effectiveness without online processing.

Authors:Bin-Bin Gao, Yue Zhou, Jiangtao Yan, Yuezhi Cai, Weixi Zhang, Meng Wang, Jun Liu, Yong Liu, Lei Wang, Chengjie Wang
Title: AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection
Abstract:
Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with designing prompt templates, rely on complex token interactions, or require additional fine-tuning, resulting in limited flexibility. In this work, we present a simple yet effective method called AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between query and normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters (a visual adapter, a textual adapter, and a prompt-query adapter) at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and, once trained on a base dataset, requires no further training on target domains. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods. We will make the code and model of AdaptCLIP available at https://github.com/gaobb/AdaptCLIP.
中文摘要:AdaptCLIP是一种基于CLIP模型的新方法,通过三个简单适配器实现跨领域视觉异常检测,无需目标域微调即可在工业和医疗基准上取得最优性能。
English Summary: AdaptCLIP is a novel method that enhances CLIP models with three simple adapters for universal visual anomaly detection, achieving state-of-the-art performance across multiple domains without requiring fine-tuning on target data.

Authors:Yidan Wang, Yubing Ren, Yanan Cao, Binxing Fang
Title: From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models
Abstract:
The rise of Large Language Models (LLMs) has heightened concerns about the misuse of AI-generated text, making watermarking a promising solution. Mainstream watermarking schemes for LLMs fall into two categories: logits-based and sampling-based. However, current schemes entail trade-offs among robustness, text quality, and security. To mitigate this, we integrate logits-based and sampling-based schemes, harnessing their respective strengths to achieve synergy. In this paper, we propose a versatile symbiotic watermarking framework with three strategies: serial, parallel, and hybrid. The hybrid framework adaptively embeds watermarks using token entropy and semantic entropy, optimizing the balance between detectability, robustness, text quality, and security. Furthermore, we validate our approach through comprehensive experiments on various datasets and models. Experimental results indicate that our method outperforms existing baselines and achieves state-of-the-art (SOTA) performance. We believe this framework provides novel insights into diverse watermarking paradigms. Our code is available at https://github.com/redwyd/SymMark.
中文摘要:本文提出了一种多功能共生水印框架,通过结合基于logits和基于采样的方案,在LLM生成文本的检测性、鲁棒性、文本质量和安全性之间实现优化平衡,在多个数据集和模型上取得了最先进的性能表现。
English Summary: This paper introduces a versatile symbiotic watermarking framework that combines logits-based and sampling-based approaches to optimize the balance between detectability, robustness, text quality, and security in LLM-generated text, achieving state-of-the-art performance across multiple datasets and models.

Authors:Yidan Wang, Yanan Cao, Yubing Ren, Fang Fang, Zheng Lin, Binxing Fang
Title: PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
Abstract:
Large Language Models (LLMs) excel in various domains but pose inherent privacy risks. Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-aligned models can easily block. Meanwhile, jailbreak attacks bypass LLM safety mechanisms to generate harmful content, but their role in privacy scenarios remains underexplored. In this paper, we examine the effectiveness of jailbreak attacks in extracting sensitive information, bridging privacy leakage and jailbreak attacks in LLMs. Moreover, we propose PIG, a novel framework targeting Personally Identifiable Information (PII) and addressing the limitations of current jailbreak methods. Specifically, PIG identifies PII entities and their types in privacy queries, uses in-context learning to build a privacy context, and iteratively updates it with three gradient-based strategies to elicit target PII. We evaluate PIG and existing jailbreak methods using two privacy-related datasets. Experiments on four white-box and two black-box LLMs show that PIG outperforms baseline methods and achieves state-of-the-art (SoTA) results. The results underscore significant privacy risks in LLMs, emphasizing the need for stronger safeguards. Our code is available at https://github.com/redwyd/PrivacyJailbreak.
中文摘要:本文提出PIG框架,通过越狱攻击从大语言模型中提取个人身份信息,其效果优于现有方法,揭示了大语言模型存在的严重隐私风险。
English Summary: This paper introduces PIG, a novel framework that leverages jailbreak attacks to extract Personally Identifiable Information from Large Language Models, demonstrating superior effectiveness over existing methods and highlighting significant privacy vulnerabilities.

Authors:Yanlong Yang, Jianan Liu, Guanxiong Luo, Hao Li, Euijoon Ahn, Mostafa Rahimi Azghadi, Tao Huang
Title: Unsupervised Radar Point Cloud Enhancement via Arbitrary LiDAR Guided Diffusion Prior
Abstract:
In industrial automation, radar is a critical sensor in machine perception. However, the angular resolution of radar is inherently limited by the Rayleigh criterion, which depends on both the radar's operating wavelength and the effective aperture of its antenna array. To overcome these hardware-imposed limitations, recent neural network-based methods have leveraged high-resolution LiDAR data, paired with radar measurements, during training to enhance radar point cloud resolution. While effective, these approaches require extensive paired datasets, which are costly to acquire and prone to calibration error. These challenges motivate the need for methods that can improve radar resolution without relying on paired high-resolution ground-truth data. Here, we introduce an unsupervised radar points enhancement algorithm that employs an arbitrary LiDAR-guided diffusion model as a prior without the need for paired training data. Specifically, our approach formulates radar angle estimation recovery as an inverse problem and incorporates prior knowledge through a diffusion model with arbitrary LiDAR domain knowledge. Experimental results demonstrate that our method attains high fidelity and low noise performance compared to traditional regularization techniques. Additionally, compared to paired training methods, it not only achieves comparable performance but also offers improved generalization capability. To our knowledge, this is the first approach that enhances radar points output by integrating prior knowledge via a diffusion model rather than relying on paired training data. Our code is available at https://github.com/yyxr75/RadarINV.
中文摘要:本文提出了一种无监督雷达增强算法,利用激光雷达引导的扩散模型作为先验知识,无需配对训练数据即可提升雷达点云分辨率,相比传统方法实现了更高保真度和更好泛化能力。
English Summary: This paper introduces an unsupervised radar enhancement algorithm that uses a LiDAR-guided diffusion model as a prior to improve radar point cloud resolution without requiring paired training data, achieving high fidelity and better generalization compared to traditional methods.
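Note on the Rayleigh criterion mentioned in the abstract: the dependence on wavelength and aperture can be illustrated with the textbook circular-aperture form, theta ≈ 1.22 · lambda / D. The sketch below is only an illustration under stated assumptions; the 77 GHz carrier, the 10 cm aperture, and the 1.22 factor are not taken from the paper, and antenna arrays are often described with a different constant.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def rayleigh_resolution_rad(frequency_hz, aperture_m, factor=1.22):
    """Approximate angular resolution (radians) as factor * wavelength / aperture.

    factor=1.22 is the textbook value for a circular aperture; it is an
    assumption here, not a value from the paper.
    """
    wavelength_m = SPEED_OF_LIGHT / frequency_hz
    return factor * wavelength_m / aperture_m

# Hypothetical example values: 77 GHz radar with a 10 cm effective aperture.
theta = rayleigh_resolution_rad(77e9, 0.10)
theta_deg = math.degrees(theta)
```

Under these assumptions the resolution is a few degrees, and doubling the aperture halves the angle, which is why the limit is described as hardware-imposed.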

Authors:Spencer Lee, Daniel Appelo
Title: High-Order Hermite Optimization: Fast and Exact Gradient Computation in Open-Loop Quantum Optimal Control using a Discrete Adjoint Approach
Abstract:
This work introduces the High-Order Hermite Optimization (HOHO) method, an open-loop discrete adjoint method for quantum optimal control. Our method is the first of its kind to efficiently compute exact (discrete) gradients when using continuous, parameterized control pulses while solving the forward equations (e.g. Schrödinger's equation or the Lindblad master equation) with an arbitrarily high-order Hermite Runge-Kutta method. The HOHO method is implemented in QuantumGateDesign.jl (https://github.com/leespen1/QuantumGateDesign.jl), an open-source software package for the Julia programming language, which we use to perform numerical experiments comparing the method to Juqbox.jl (https://github.com/LLNL/Juqbox.jl). For realistic model problems we observe speedups up to 775x.
中文: HOHO方法是一种创新的量子最优控制开环离散伴随方法,通过高阶Hermite Runge-Kutta方法高效计算精确梯度,在实际模拟中相比现有工具实现了最高775倍的加速效果。
English: The HOHO method is a novel open-loop discrete adjoint approach for quantum optimal control that efficiently computes exact gradients using high-order Hermite Runge-Kutta methods, achieving up to 775x speedup in realistic simulations compared to existing tools.

Authors:Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu
Title: Adversarial Attack on Large Language Models using Exponentiated Gradient Descent
Abstract:
As Large Language Models (LLMs) are widely used, understanding them systematically is key to improving their safety and realizing their full potential. Although many models are aligned using techniques such as reinforcement learning from human feedback (RLHF), they are still vulnerable to jailbreaking attacks. Some of the existing adversarial attack methods search for discrete tokens that may jailbreak a target model while others try to optimize the continuous space represented by the tokens of the model's vocabulary. While techniques based on the discrete space may prove to be inefficient, optimization of continuous token embeddings requires projections to produce discrete tokens, which might render them ineffective. To fully utilize the constraints and the structures of the space, we develop an intrinsic optimization technique using exponentiated gradient descent with the Bregman projection method to ensure that the optimized one-hot encoding always stays within the probability simplex. We prove the convergence of the technique and implement an efficient algorithm that is effective in jailbreaking several widely used LLMs. We demonstrate the efficacy of the proposed technique using five open-source LLMs on four openly available datasets. The results show that the technique achieves a higher success rate with great efficiency compared to three other state-of-the-art jailbreaking techniques. The source code for our implementation is available at: https://github.com/sbamit/Exponentiated-Gradient-Descent-LLM-Attack
中文: 本文提出了一种针对大型语言模型的高效越狱技术,采用指数梯度下降与Bregman投影方法,相比现有技术具有更高的成功率和更强的效率优势。
English: This paper introduces an efficient jailbreaking technique for Large Language Models using exponentiated gradient descent with Bregman projection, which demonstrates higher success rates and greater efficiency compared to existing methods.
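Note on exponentiated gradient descent: the abstract's point that the iterate "always stays within the probability simplex" can be seen in the generic textbook update, where each coordinate is multiplied by exp(-lr · gradient) and the result is renormalized (normalization is the KL-divergence Bregman projection onto the simplex). The sketch below is a minimal illustration on a toy linear objective, not the authors' implementation; the function names, step size, and cost vector are hypothetical.

```python
import math

def egd_step(x, grad, lr):
    """One exponentiated-gradient step followed by normalization onto the simplex."""
    # Shift exponents so the largest is 0; this avoids overflow and does not
    # change the result after normalization.
    shift = max(-lr * g for g in grad)
    weights = [xi * math.exp(-lr * gi - shift) for xi, gi in zip(x, grad)]
    total = sum(weights)
    return [w / total for w in weights]

def minimize_linear(costs, steps=200, lr=0.5):
    """Minimize <costs, x> over the simplex, starting from the uniform point."""
    x = [1.0 / len(costs)] * len(costs)
    for _ in range(steps):
        # The gradient of a linear objective is the cost vector itself.
        x = egd_step(x, costs, lr)
    return x

# Toy objective (hypothetical): the mass should concentrate on the cheapest index.
x_final = minimize_linear([3.0, 1.0, 2.0])
```

Every iterate is a valid probability vector, so no separate projection from a continuous embedding back to a distribution is needed in this toy setting.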

Authors:Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, Guanghui Ren
Title: EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models
Abstract:
Recent advances in creative AI have enabled the synthesis of high-fidelity images and videos conditioned on language instructions. Building on these developments, text-to-video diffusion models have evolved into embodied world models (EWMs) capable of generating physically plausible scenes from language commands, effectively bridging vision and action in embodied AI applications. This work addresses the critical challenge of evaluating EWMs beyond general perceptual metrics to ensure the generation of physically grounded and action-consistent behaviors. We propose the Embodied World Model Benchmark (EWMBench), a dedicated framework designed to evaluate EWMs based on three key aspects: visual scene consistency, motion correctness, and semantic alignment. Our approach leverages a meticulously curated dataset encompassing diverse scenes and motion patterns, alongside a comprehensive multi-dimensional evaluation toolkit, to assess and compare candidate models. The proposed benchmark not only identifies the limitations of existing video generation models in meeting the unique requirements of embodied tasks but also provides valuable insights to guide future advancements in the field. The dataset and evaluation tools are publicly available at https://github.com/AgibotTech/EWMBench.
中文摘要:本文提出EWMBench基准测试,通过评估视觉一致性、运动准确性和语义对齐三个维度,专门用于测评具身世界模型,以推动生成具有物理真实感行为的人工智能发展。
English Summary: This paper introduces EWMBench, a specialized benchmark for evaluating embodied world models by assessing visual consistency, motion accuracy, and semantic alignment to advance physically realistic AI-generated behaviors.

Authors:Julian Büchel, Iason Chalas, Giovanni Acampa, An Chen, Omobayode Fagbohungbe, Sidney Tsai, Kaoutar El Maghraoui, Manuel Le Gallo, Abbas Rahimi, Abu Sebastian
Title: Analog Foundation Models
Abstract:
Analog in-memory computing (AIMC) is a promising compute paradigm to improve speed and power efficiency of neural network inference beyond the limits of conventional von Neumann-based architectures. However, AIMC introduces fundamental challenges such as noisy computations and strict constraints on input and output quantization. Because of these constraints and imprecisions, off-the-shelf LLMs are not able to achieve 4-bit-level performance when deployed on AIMC-based hardware. While researchers previously investigated recovering this accuracy gap on small, mostly vision-based models, a generic method applicable to LLMs pre-trained on trillions of tokens does not yet exist. In this work, we introduce a general and scalable method to robustly adapt LLMs for execution on noisy, low-precision analog hardware. Our approach enables state-of-the-art models (including Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct) to retain performance comparable to 4-bit weight, 8-bit activation baselines, despite the presence of analog noise and quantization constraints. Additionally, we show that as a byproduct of our training methodology, analog foundation models can be quantized for inference on low-precision digital hardware. Finally, we show that our models also benefit from test-time compute scaling, showing better scaling behavior than models trained with 4-bit weight and 8-bit static input quantization. Our work bridges the gap between high-capacity LLMs and efficient analog hardware, offering a path toward energy-efficient foundation models. Code is available at https://github.com/IBM/analog-foundation-models.
中文: 本研究提出了一种可扩展的方法,使大语言模型能够适应噪声大、精度低的模拟硬件运行,让Phi-3和Llama-3.2等模型在保持与数字基准相当性能的同时,弥合了大模型与高效能模拟计算之间的鸿沟。
English: This work introduces a scalable method to adapt large language models for execution on noisy, low-precision analog hardware, enabling models like Phi-3 and Llama-3.2 to maintain performance comparable to digital baselines while bridging the gap between high-capacity LLMs and energy-efficient analog computing.

Authors:Jinghao He, Zhengyan Sheng, Liping Chen, Kong Aik Lee, Zhen-Hua Ling
Title: Introducing voice timbre attribute detection
Abstract:
This paper focuses on explaining the timbre conveyed by speech signals and introduces a task termed voice timbre attribute detection (vTAD). In this task, voice timbre is explained with a set of sensory attributes describing its human perception. A pair of speech utterances is processed, and their intensity is compared in a designated timbre descriptor. Moreover, a framework is proposed, which is built upon the speaker embeddings extracted from the speech utterances. The investigation is conducted on the VCTK-RVA dataset. Experimental examinations on the ECAPA-TDNN and FACodec speaker encoders demonstrated that: 1) the ECAPA-TDNN speaker encoder was more capable in the seen scenario, where the testing speakers were included in the training set; 2) the FACodec speaker encoder was superior in the unseen scenario, where the testing speakers were not part of the training, indicating enhanced generalization capability. The VCTK-RVA dataset and open-source code are available on the website https://github.com/vTAD2025-Challenge/vTAD.
Chinese: 本文提出语音音色属性检测任务,通过感知属性描述音色并构建基于说话人嵌入的框架,实验表明ECAPA-TDNN在已知说话人场景表现更佳,而FACodec在未知说话人场景具有更强的泛化能力。
English: This paper introduces voice timbre attribute detection (vTAD), which uses sensory attributes to describe voice perception and proposes a framework based on speaker embeddings, with experiments showing ECAPA-TDNN excels in seen scenarios while FACodec performs better in unseen ones.

Authors:Long Chen, Xiaotian Song, Yanan Sun
Title: LAS: Loss-less ANN-SNN Conversion for Fully Spike-Driven Large Language Models
Abstract:
Spiking Large Language Models (LLMs) have emerged as an energy-efficient alternative to conventional LLMs through their event-driven computation. To effectively obtain spiking LLMs, researchers develop different ANN-to-SNN conversion methods by leveraging pre-trained ANN parameters while inheriting the energy efficiency of SNN. However, existing conversion methods struggle with extreme activation outliers and incompatible nonlinear operations of ANN-based LLMs. To address this, we propose a loss-less ANN-SNN conversion for fully spike-driven LLMs, termed LAS. Specifically, LAS introduces two novel neurons to convert the activation outliers and nonlinear operations of ANN-based LLMs. Moreover, LAS tailors the spike-equivalent Transformer components for spiking LLMs, which can ensure full spiking conversion without any loss of performance. Experimental results on six language models and two vision-language models demonstrate that LAS achieves loss-less conversion. Notably, on OPT-66B, LAS even improves accuracy by 2% on the WSC task. In addition, the parameter and ablation studies further verify the effectiveness of LAS. The source code is available at https://github.com/lc783/LAS
Chinese: LAS通过解决激活异常值和非线性操作问题,实现了尖峰大语言模型的无损转换,在保持全脉冲驱动的同时不损失性能。
English: LAS introduces a loss-less ANN-to-SNN conversion method for spiking large language models that handles activation outliers and nonlinear operations, yielding fully spike-driven models without loss of performance.

Authors:Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
Title: DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models
Abstract:
Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose Diversity-aware Reward Adjustment (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR. GRPO, resulting in DRA-GRPO and DGA-DR. GRPO. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at https://github.com/xiwenc1/DRA-GRPO.
中文: 本文提出多样性感知奖励调整(DRA)方法,通过子模互信息将语义多样性融入奖励计算,在数学推理基准测试中以极低资源实现了最优性能。
English: This paper introduces Diversity-aware Reward Adjustment (DRA), a method that enhances reinforcement learning for language models by incorporating semantic diversity into rewards using Submodular Mutual Information, leading to state-of-the-art performance on mathematical reasoning benchmarks with minimal resources.

Authors:Xixuan Hao, Yutian Jiang, Xingchen Zou, Jiabo Liu, Yifang Yin, Yuxuan Liang
Title: Unlocking Location Intelligence: A Survey from Deep Learning to The LLM Era
Abstract:
Location Intelligence (LI), the science of transforming location-centric geospatial data into actionable knowledge, has become a cornerstone of modern spatial decision-making. The rapid evolution of Geospatial Representation Learning is fundamentally reshaping LI development through two successive technological revolutions: the deep learning breakthrough and the emerging large language model (LLM) paradigm. While deep neural networks (DNNs) have demonstrated remarkable success in automated feature extraction from structured geospatial data (e.g., satellite imagery, GPS trajectories), the recent integration of LLMs introduces transformative capabilities for cross-modal geospatial reasoning and unstructured geo-textual data processing. This survey presents a comprehensive review of geospatial representation learning across both technological eras, organizing them into a structured taxonomy based on the complete pipeline comprising: (1) data perspective, (2) methodological perspective and (3) application perspective. We also highlight current advancements, discuss existing limitations, and propose potential future research directions in the LLM era. This work offers a thorough exploration of the field and provides a roadmap for further innovation in LI. The summary of the up-to-date paper list can be found at https://github.com/CityMind-Lab/Awesome-Location-Intelligence and will undergo continuous updates.
中文总结:位置智能正通过深度学习和大型语言模型的地理空间表示学习技术发生变革,实现了自动化特征提取与跨模态推理,本综述系统回顾了该领域进展并提出了未来研究方向。
English Summary: Location Intelligence is being transformed by geospatial representation learning through deep learning and large language models, which enable automated feature extraction and cross-modal reasoning, with this survey providing a comprehensive review, taxonomy, and future research directions.

Authors:Nick Sunday
Title: Detecting Musical Deepfakes
Abstract:
The proliferation of Text-to-Music (TTM) platforms has democratized music creation, enabling users to effortlessly generate high-quality compositions. However, this innovation also presents new challenges to musicians and the broader music industry. This study investigates the detection of AI-generated songs using the FakeMusicCaps dataset by classifying audio as either deepfake or human. To simulate real-world adversarial conditions, tempo stretching and pitch shifting were applied to the dataset. Mel spectrograms were generated from the modified audio, then used to train and evaluate a convolutional neural network. In addition to presenting technical results, this work explores the ethical and societal implications of TTM platforms, arguing that carefully designed detection systems are essential to both protecting artists and unlocking the positive potential of generative AI in music.
中文: 本研究通过改进的音频频谱图开发了一种基于卷积神经网络的AI生成歌曲检测方法,强调此类检测系统对于平衡伦理问题与生成式AI在音乐中的积极潜力至关重要。
English: This study develops a CNN-based method using modified audio spectrograms to detect AI-generated songs, highlighting the need for such detection systems to balance ethical concerns with the benefits of generative AI in music.

Authors:Dhruv Ajmera
Title: An $\mathcal{O}(n)$ Space Construction of Superpermutations
Abstract:
A superpermutation is a sequence that contains every permutation of $n$ distinct symbols as a contiguous substring. For instance, a valid example for three symbols is a sequence that contains all six permutations. This paper introduces a new algorithm that constructs such sequences more efficiently than existing recursive and graph-theoretic methods. Unlike traditional techniques that suffer from scalability and factorial memory demands, the proposed approach builds superpermutations directly and compactly. This improves memory usage, enabling the construction of larger sequences previously considered impractical.
中文: 本文提出了一种新算法,通过直接紧凑地构建超排列,克服了现有方法在可扩展性和内存需求上的限制,从而更高效地生成这些序列。
English: This paper presents a new algorithm that constructs superpermutations more efficiently by building them directly and compactly, overcoming the scalability and memory limitations of existing methods.
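Note on the definition: the property stated in the abstract (every permutation of n symbols appears as a contiguous substring) can be checked directly for small n. The sketch below is only a brute-force verifier for the definition, not the paper's construction; it uses factorial memory, which is exactly the cost the paper aims to avoid. The function name and the digit-string encoding (symbols 1..n, n ≤ 9) are assumptions made for illustration.

```python
from itertools import permutations

def is_superpermutation(seq, n):
    """Return True if seq contains every permutation of '1'..str(n) as a substring."""
    symbols = [str(i) for i in range(1, n + 1)]
    needed = {"".join(p) for p in permutations(symbols)}
    # Collect every length-n window of the sequence.
    windows = {seq[i:i + n] for i in range(len(seq) - n + 1)}
    return needed <= windows

# "123121321" has length 9 and contains all six permutations of three symbols.
ok_three = is_superpermutation("123121321", 3)
missing_some = is_superpermutation("123123", 3)
```

For three symbols, the windows of "123121321" are 123, 231, 312, 121, 213, 132, and 321, which cover all six permutations.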

Authors:Linbo Liu, Xinle Liu, Qiang Zhou, Lin Chen, Yihan Liu, Hoan Nguyen, Behrooz Omidvar-Tehrani, Xi Shen, Jun Huan, Omer Tripp, Anoop Deoras
Title: MigrationBench: Repository-Level Code Migration Benchmark from Java 8
Abstract:
With the rapid advancement of powerful large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed using LLMs, significantly enhancing productivity and scalability. Numerous benchmark datasets have been developed to evaluate the coding capabilities of these models, but they primarily focus on code generation and issue-resolution tasks. In contrast, we introduce a new coding benchmark MigrationBench with a distinct focus: code migration. MigrationBench aims to serve as a comprehensive benchmark for migration from Java 8 to the latest long-term support (LTS) versions (Java 17, 21), including a full dataset and its selected subset with 5,102 and 300 repositories respectively. The selected subset is a representative sample curated for complexity and difficulty, offering a versatile resource to support research in the field of code migration. Additionally, we provide a comprehensive evaluation framework to facilitate rigorous and standardized assessment of LLMs on this challenging task. We further propose SD-Feedback and demonstrate that LLMs can effectively tackle repository-level code migration to Java 17. For the selected subset with Claude-3.5-Sonnet-v2, SD-Feedback achieves 62.33% and 27.33% success rate (pass@1) for minimal and maximal migration respectively. The benchmark dataset and source code are available at: https://huggingface.co/collections/AmazonScience/migrationbench-68125452fc21a4564b92b6c3 and https://github.com/amazon-science/MigrationBench respectively.
中文: 该摘要介绍了MigrationBench这一专注于Java 8到新版LTS版本代码迁移的新型基准测试,包含完整数据集和评估框架,通过实验证明大语言模型在仓库级代码迁移任务中能达到显著成功率。
English: The abstract introduces MigrationBench, a novel benchmark focused on code migration from Java 8 to newer LTS versions, featuring comprehensive datasets and an evaluation framework that demonstrates LLMs' effectiveness in repository-level migration tasks with significant success rates.

Authors:Nicola Marinello, Simen Cassiman, Jonas Heylen, Marc Proesmans, Luc Van Gool
Title: Camera-Only 3D Panoptic Scene Completion for Autonomous Driving through Differentiable Object Shapes
Abstract:
Autonomous vehicles need a complete map of their surroundings to plan and act. This has sparked research into the tasks of 3D occupancy prediction, 3D scene completion, and 3D panoptic scene completion, which predict a dense map of the ego vehicle's surroundings as a voxel grid. Scene completion extends occupancy prediction by predicting occluded regions of the voxel grid, and panoptic scene completion further extends this task by also distinguishing object instances within the same class; both aspects are crucial for path planning and decision-making. However, 3D panoptic scene completion is currently underexplored. This work introduces a novel framework for 3D panoptic scene completion that extends existing 3D semantic scene completion models. We propose an Object Module and Panoptic Module that can easily be integrated with 3D occupancy and scene completion methods presented in the literature. Our approach leverages the available annotations in occupancy benchmarks, allowing individual object shapes to be learned as a differentiable problem. The code is available at https://github.com/nicolamarinello/OffsetOcc .
中文: 自动驾驶汽车需要全面的环境地图,因此本研究提出了一种新颖的3D全景场景补全框架,通过集成物体模块和全景模块来增强现有模型,从而提升路径规划与决策能力。
English: Autonomous vehicles require comprehensive environmental mapping, leading to the development of a novel 3D panoptic scene completion framework that enhances existing models with object and panoptic modules for improved path planning and decision-making.

Authors:Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao
Title: WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
Abstract:
End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily because intelligent chatbots convey a wealth of non-textual information which cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) based on audio language models, WavReward incorporates the deep reasoning process and the nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via the reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K includes both comprehension and generation aspects of spoken dialogue models. These scenarios span various tasks, such as text-based chats, nine acoustic attributes of instruction chats, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, achieving a substantial improvement over Qwen2.5-Omni in objective accuracy from 53.4% to 91.5%. In subjective A/B testing, WavReward also leads by a margin of 83%. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be publicly available at https://github.com/jishengpeng/WavReward after the paper is accepted.
中文: 该摘要提出了WavReward,一种基于音频的奖励模型,通过利用音频语言模型和新数据集ChatReward-30K来评估口语对话系统的智商和情商,在准确性和主观测试中显著超越了以往模型。
English: The abstract introduces WavReward, a novel audio-based reward model designed to evaluate both the intelligence and emotional quotient of spoken dialogue systems by leveraging audio language models and a new dataset, ChatReward-30K, significantly outperforming previous models in accuracy and subjective tests.

Authors:Jeffrey Wen, Rizwan Ahmad, Philip Schniter
Title: Conformal Bounds on Full-Reference Image Quality for Imaging Inverse Problems
Abstract:
In imaging inverse problems, we would like to know how close the recovered image is to the true image in terms of full-reference image quality (FRIQ) metrics like PSNR, SSIM, LPIPS, etc. This is especially important in safety-critical applications like medical imaging, where knowing that, say, the SSIM was poor could potentially avoid a costly misdiagnosis. But since we don't know the true image, computing FRIQ is non-trivial. In this work, we combine conformal prediction with approximate posterior sampling to construct bounds on FRIQ that are guaranteed to hold up to a user-specified error probability. We demonstrate our approach on image denoising and accelerated magnetic resonance imaging (MRI) problems. Code is available at https://github.com/jwen307/quality_uq.
中文: 本研究结合了共形预测和近似后验采样,为成像逆问题中的全参考图像质量指标构建了具有保证的边界,并在图像去噪和加速磁共振成像中进行了验证。
English: This study introduces a method using conformal prediction and approximate posterior sampling to create guaranteed bounds on full-reference image quality metrics for imaging inverse problems, validated through denoising and MRI applications.
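
As a rough illustration of the conformal step described in the abstract above, the following sketch computes a split-conformal lower bound on a quality metric such as PSNR. It is not the authors' implementation: the function name `conformal_lower_bound` is hypothetical, and it assumes an estimate of the metric (for example, one derived from posterior samples) is already available for each calibration and test image.

```python
import math


def conformal_lower_bound(cal_est, cal_true, test_est, alpha=0.1):
    """Split-conformal lower bound on a test image's true quality.

    cal_est  : estimated quality for each calibration image (assumed given)
    cal_true : true quality for each calibration image
    test_est : estimated quality for the test image
    alpha    : user-specified error probability
    """
    n = len(cal_est)
    # Nonconformity score: how much the estimate overshoots the truth.
    scores = sorted(e - t for e, t in zip(cal_est, cal_true))
    # Finite-sample-corrected rank of the (1 - alpha) quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only a trivial bound.
        return float("-inf")
    return test_est - scores[k - 1]


cal_est = [30, 31, 29, 32, 30, 28, 33, 31, 30]
cal_true = [29, 29, 28, 29, 30, 27, 30, 30, 28]
bound = conformal_lower_bound(cal_est, cal_true, test_est=31, alpha=0.2)
```

Under exchangeability of calibration and test data, the true quality falls below such a bound with probability at most `alpha`.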

Authors:Dongyi He, Shiyang Li, Bin Jiang, He Yan
Title: Spec2VolCAMU-Net: A Spectrogram-to-Volume Model for EEG-to-fMRI Reconstruction based on Multi-directional Time-Frequency Convolutional Attention Encoder and Vision-Mamba U-Net
Abstract:
High-resolution functional magnetic resonance imaging (fMRI) is essential for mapping human brain activity; however, it remains costly and logistically challenging. If comparable volumes could be generated directly from widely available scalp electroencephalography (EEG), advanced neuroimaging would become significantly more accessible. Existing EEG-to-fMRI generators rely on plain Convolutional Neural Networks (CNNs) that fail to capture cross-channel time-frequency cues or on heavy transformer/Generative Adversarial Network (GAN) decoders that strain memory and stability. To address these limitations, we propose Spec2VolCAMU-Net, a lightweight architecture featuring a Multi-directional Time-Frequency Convolutional Attention Encoder for rich feature extraction and a Vision-Mamba U-Net decoder that uses linear-time state-space blocks for efficient long-range spatial modelling. We frame the goal of this work as establishing a new state of the art in the spatial fidelity of single-volume reconstruction, a foundational prerequisite for the ultimate aim of generating temporally coherent fMRI time series. Trained end-to-end with a hybrid SSI-MSE loss, Spec2VolCAMU-Net achieves state-of-the-art fidelity on three public benchmarks, recording Structural Similarity Index (SSIM) of 0.693 on NODDI, 0.725 on Oddball and 0.788 on CN-EPFL, representing improvements of 14.5%, 14.9%, and 16.9% respectively over previous best SSIM scores. Furthermore, it achieves competitive Peak Signal-to-Noise Ratio (PSNR) scores, particularly excelling on the CN-EPFL dataset with a 4.6% improvement over the previous best PSNR, thus striking a better balance in reconstruction quality. The proposed model is lightweight and efficient, making it suitable for real-time applications in clinical and research settings. The code is available at https://github.com/hdy6438/Spec2VolCAMU-Net.
Chinese: 本研究提出的Spec2VolCAMU-Net轻量级模型,通过从脑电图数据重建高保真功能磁共振成像体积,在多个基准测试中实现了结构相似性和信噪比的显著提升,创造了该领域的最新性能记录。
English: This study introduces Spec2VolCAMU-Net, a lightweight model that sets a new state of the art in reconstructing high-fidelity fMRI volumes from EEG data, achieving significant improvements in structural similarity and signal-to-noise ratio across multiple benchmarks.

Authors:Patrik Kenfack, Samira Ebrahimi Kahou, Ulrich Aïvodji
Title: Towards Fair In-Context Learning with Tabular Foundation Models
Abstract:
Transformer-based tabular foundation models have recently demonstrated promising in-context learning (ICL) performance on structured data, emerging as competitive alternatives to gradient-boosted trees. However, the fairness implications of this new paradigm remain largely unexplored. We present the first investigation of fairness in tabular ICL, evaluating three recently proposed foundation models -- TabPFNv2, TabICL, and TabDPT -- on multiple benchmark datasets. To mitigate biases, we explore three pre-processing fairness-enhancing methods: correlation removal (decorrelating input features from the sensitive attribute), group-balanced sample selection (ensuring equal representation of protected groups in context examples), and uncertainty-based sample selection (prioritizing context examples with high sensitive-attribute prediction uncertainty). Our experiments show that the uncertainty-based strategy consistently improves group fairness metrics (e.g., demographic parity, equalized odds, and equal opportunity) with minimal impact on predictive accuracy. We release our code to facilitate reproducibility (https://github.com/patrikken/Fair-TabICL)
中文摘要:本研究开创性地探讨了表格上下文学习中的公平性问题,发现基于不确定性的样本选择策略能在保持预测精度的同时,显著提升三个基础模型的群体公平性指标。
English Summary: This study pioneers the examination of fairness in tabular in-context learning, revealing that uncertainty-based sample selection effectively enhances group fairness metrics while preserving predictive accuracy across three foundation models.
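
For readers unfamiliar with the group fairness metrics named in the abstract above, here is a minimal sketch of the demographic parity difference using its standard definition. The function name and the toy data are illustrative assumptions, not taken from the paper's code.

```python
def demographic_parity_difference(preds, groups):
    """Largest gap in positive-prediction rate between any two groups.

    preds  : binary predictions (0 or 1), one per sample
    groups : sensitive-attribute value for each sample
    """
    totals = {}
    positives = {}
    for p, g in zip(preds, groups):
        totals[g] = totals.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + p
    # Positive-prediction rate within each group.
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


# Toy example: group 0 receives positives half the time, group 1 always.
dpd = demographic_parity_difference([1, 0, 1, 1], [0, 0, 1, 1])
```

A value of 0 means every group receives positive predictions at the same rate; larger values indicate a larger disparity.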

Authors:Yuelin Zhang, Qingpeng Ding, Long Lei, Yongxuan Feng, Raymond Shing-Yan Tang, Shing Shin Cheng
Title: MrTrack: Register Mamba for Needle Tracking with Rapid Reciprocating Motion during Ultrasound-Guided Aspiration Biopsy
Abstract:
Ultrasound-guided fine needle aspiration (FNA) biopsy is a common minimally invasive diagnostic procedure. However, an aspiration needle tracker addressing rapid reciprocating motion is still missing. MrTrack, an aspiration needle tracker with a mamba-based register mechanism, is proposed. MrTrack leverages a Mamba-based register extractor to sequentially distill global context from each historical search map, storing these temporal cues in a register bank. The Mamba-based register retriever then retrieves temporal prompts from the register bank to provide external cues when current vision features are temporarily unusable due to rapid reciprocating motion and imaging degradation. A self-supervised register diversify loss is proposed to encourage feature diversity and dimension independence within the learned register, mitigating feature collapse. Comprehensive experiments conducted on both robotic and manual aspiration biopsy datasets demonstrate that MrTrack not only outperforms state-of-the-art trackers in accuracy and robustness but also achieves superior inference efficiency. Project page: https://github.com/PieceZhang/MrTrack
中文: MrTrack采用基于Mamba的寄存器机制,在超声引导活检中有效追踪快速往复运动的穿刺针,实现了更高的精确度、鲁棒性和推理效率。
English: MrTrack introduces a Mamba-based register mechanism to track aspiration needles during rapid motion, achieving superior accuracy, robustness, and efficiency in ultrasound-guided biopsies.

Authors:Han Sun, Yizhao Wang, Zhenning Zhou, Shuai Wang, Haibo Yang, Jingyuan Sun, Qixin Cao
Title: Exploring Pose-Guided Imitation Learning for Robotic Precise Insertion
Abstract:
Recent studies have shown that imitation learning has strong potential in the field of robotic manipulation. However, existing methods still struggle with precision manipulation tasks and rely on inefficient image/point cloud observations. In this paper, we explore introducing SE(3) object pose into imitation learning and propose pose-guided efficient imitation learning methods for robotic precise insertion tasks. First, we propose a precise insertion diffusion policy which utilizes the relative SE(3) pose as the observation-action pair. The policy models the source object SE(3) pose trajectory relative to the target object. Second, we explore introducing RGBD data into the pose-guided diffusion policy. Specifically, we design a goal-conditioned RGBD encoder to capture the discrepancy between the current state and the goal state. In addition, a pose-guided residual gated fusion method is proposed, which takes pose features as the backbone, while the RGBD features selectively compensate for pose feature deficiencies through an adaptive gating mechanism. Our methods are evaluated on 6 robotic precise insertion tasks, demonstrating competitive performance with only 7-10 demonstrations. Experiments demonstrate that the proposed methods can successfully complete precision insertion tasks with a clearance of about 0.01 mm. Experimental results highlight their superior efficiency and generalization capability compared to existing baselines. Code will be available at https://github.com/sunhan1997/PoseInsert.
中文: 本文提出了一种基于SE(3)物体位姿引导的模仿学习方法,结合RGBD数据实现了仅需少量演示即可完成高精度机器人插入任务,在效率和泛化能力上均优于现有方法。
English: This paper introduces a pose-guided imitation learning method using SE(3) object poses and RGBD data to achieve high-precision robotic insertion tasks with minimal demonstrations, outperforming existing approaches in efficiency and generalization.

Authors:Ma Changfeng, Bi Ran, Guo Jie, Wang Chongjun, Guo Yanwen
Title: Sparse Point Cloud Patches Rendering via Splitting 2D Gaussians
Abstract:
Current learning-based methods predict NeRF or 3D Gaussians from point clouds to achieve photo-realistic rendering but still depend on categorical priors, dense point clouds, or additional refinements. Hence, we introduce a novel point cloud rendering method by predicting 2D Gaussians from point clouds. Our method incorporates two identical modules with an entire-patch architecture enabling the network to be generalized to multiple datasets. The module normalizes and initializes the Gaussians utilizing the point cloud information including normals, colors and distances. Then, splitting decoders are employed to refine the initial Gaussians by duplicating them and predicting more accurate results, making our methodology effectively accommodate sparse point clouds as well. Once trained, our approach exhibits direct generalization to point clouds across different categories. The predicted Gaussians are employed directly for rendering without additional refinement on the rendered images, retaining the benefits of 2D Gaussians. We conduct extensive experiments on various datasets, and the results demonstrate the superiority and generalization of our method, which achieves SOTA performance. The code is available at https://github.com/murcherful/GauPCRender.
中文: 本文提出了一种新颖的点云渲染方法,通过从点云预测二维高斯分布,无需类别先验或密集输入即可实现跨多个数据集的先进渲染效果和泛化能力。
English: This paper introduces a novel point cloud rendering method that predicts 2D Gaussians directly from point clouds, eliminating the need for categorical priors or dense inputs while achieving state-of-the-art performance and generalization across multiple datasets.

Authors:Akash Kundu, Stefano Mangini
Title: TensorRL-QAS: Reinforcement learning with tensor networks for improved quantum architecture search
Abstract:
Variational quantum algorithms hold the promise to address meaningful quantum problems already on noisy intermediate-scale quantum hardware. In spite of the promise, they face the challenge of designing quantum circuits that both solve the target problem and comply with device limitations. Quantum architecture search (QAS) automates the design process of quantum circuits, with reinforcement learning (RL) emerging as a promising approach. Yet, RL-based QAS methods encounter significant scalability issues, as computational and training costs grow rapidly with the number of qubits, circuit depth, and hardware noise. To address these challenges, we introduce TensorRL-QAS, an improved framework that combines tensor network methods with RL for QAS. By warm-starting the QAS with a matrix product state approximation of the target solution, TensorRL-QAS effectively narrows the search space to physically meaningful circuits and accelerates the convergence to the desired solution. Tested on several quantum chemistry problems of up to 12 qubits, TensorRL-QAS achieves up to a 10-fold reduction in CNOT count and circuit depth compared to baseline methods, while maintaining or surpassing chemical accuracy. It reduces classical optimizer function evaluations by up to 100-fold, accelerates training episodes by up to 98%, and can achieve 50% success probability for 10-qubit systems, far exceeding the <1% rates of the baselines. Robustness and versatility are demonstrated both in the noiseless and noisy scenarios, where we report a simulation of an 8-qubit system. Furthermore, TensorRL-QAS demonstrates effectiveness on 20-qubit quantum systems, positioning it as a state-of-the-art quantum circuit discovery framework for near-term hardware and beyond.
中文: 变分量子算法有望在当前量子设备上解决重要问题,但面临设计高效电路的挑战,我们提出的TensorRL-QAS框架通过将张量网络与强化学习相结合,显著提升了可扩展性、精度和训练效率。
English: Variational quantum algorithms show potential for solving meaningful problems on current quantum devices but face challenges in designing efficient circuits, which our new TensorRL-QAS framework addresses by combining tensor networks with reinforcement learning to significantly enhance scalability, accuracy, and training efficiency.

Authors:Srinivas Ravuri, Yuan Xu, Martin Ludwig Zehetner, Ketan Motlag, Sahin Albayrak
Title: APR-Transformer: Initial Pose Estimation for Localization in Complex Environments through Absolute Pose Regression
Abstract:
Precise initialization plays a critical role in the performance of localization algorithms, especially in the context of robotics, autonomous driving, and computer vision. Poor localization accuracy is often a consequence of inaccurate initial poses, particularly noticeable in GNSS-denied environments where GPS signals are primarily relied upon for initialization. Recent advances in leveraging deep neural networks for pose regression have led to significant improvements in both accuracy and robustness, especially in estimating complex spatial relationships and orientations. In this paper, we introduce APR-Transformer, a model architecture inspired by state-of-the-art methods, which predicts absolute pose (3D position and 3D orientation) using either image or LiDAR data. We demonstrate that our proposed method achieves state-of-the-art performance on established benchmark datasets such as the Radar Oxford Robot-Car and DeepLoc datasets. Furthermore, we extend our experiments to include our custom complex APR-BeIntelli dataset. Additionally, we validate the reliability of our approach in GNSS-denied environments by deploying the model in real-time on an autonomous test vehicle. This showcases the practical feasibility and effectiveness of our approach. The source code is available at: https://github.com/GT-ARC/APR-Transformer.
中文: APR-Transformer模型通过图像或激光雷达数据实现了最精确的三维位置与方向预测,在GNSS受限环境中经自动驾驶实车验证展现出卓越的可靠性。
English: The APR-Transformer model achieves state-of-the-art accuracy in predicting 3D position and orientation using image or LiDAR data, demonstrating robust performance in GNSS-denied environments through real-world autonomous vehicle testing.

Authors:Pengli Zhu, Yingji Fu, Nanguang Chen, Anqi Qiu
Title: Q-space Guided Collaborative Attention Translation Network for Flexible Diffusion-Weighted Images Synthesis
Abstract:
In this study, we propose a novel Q-space Guided Collaborative Attention Translation Network (Q-CATN) for multi-shell, high-angular resolution DWI (MS-HARDI) synthesis from flexible q-space sampling, leveraging the commonly acquired structural MRI data. Q-CATN employs a collaborative attention mechanism to effectively extract complementary information from multiple modalities and dynamically adjust its internal representations based on flexible q-space information, eliminating the need for fixed sampling schemes. Additionally, we introduce a range of task-specific constraints to preserve anatomical fidelity in DWI, enabling Q-CATN to accurately learn the intrinsic relationships between directional DWI signal distributions and q-space. Extensive experiments on the Human Connectome Project (HCP) dataset demonstrate that Q-CATN outperforms existing methods, including 1D-qDL, 2D-qDL, MESC-SD, and QGAN, in estimating parameter maps and fiber tracts both quantitatively and qualitatively, while preserving fine-grained details. Notably, its ability to accommodate flexible q-space sampling highlights its potential as a promising toolkit for clinical and research applications. Our code is available at https://github.com/Idea89560041/Q-CATN.
中文: 本研究提出Q-CATN新型网络,通过协同注意力机制和灵活q空间采样,从结构MRI合成高质量DWI数据,在定量和定性评估中均优于现有方法,同时保持解剖细节。
English: This study introduces Q-CATN, a novel network that uses collaborative attention and flexible q-space sampling to synthesize high-quality DWI data from structural MRI, outperforming existing methods in both quantitative and qualitative evaluations while preserving anatomical details.

Authors:Chaoran Zhang, Chenhao Zhang, Zhaobo Xu, Qinghongbing Xie, Jinliang Hou, Pingfa Feng, Long Zeng
Title: Embodied intelligent industrial robotics: Concepts and techniques
Abstract:
In order to work more efficiently, accurately, reliably, and safely in industrial scenarios, robots should have at least general knowledge, working-environment knowledge, and operating-object knowledge. These pose significant challenges to existing embodied intelligent robotics (EIR) techniques. Thus, this paper first briefly reviews the history of industrial robotics and analyzes the limitations of mainstream EIR frameworks. Then, a knowledge-driven technical framework of embodied intelligent industrial robotics (EIIR) is proposed for various industrial environments. It has five modules: a world model, a high-level task planner, a low-level skill controller, a simulator, and a physical system. The development of techniques related to each module is also thoroughly reviewed, and recent progress regarding their adaptation to industrial applications is discussed. A case study is given to demonstrate the newly proposed EIIR framework's applicability to a real-world assembly system. Finally, the key challenges that EIIR encounters in industrial scenarios are summarized and future research directions are suggested. The authors believe that EIIR technology is shaping the next generation of industrial robotics and that EIIR-based industrial systems supply a new technological paradigm for intelligent manufacturing. It is expected that this review could serve as a valuable reference for scholars and engineers who are interested in industrial embodied intelligence. Together, scholars can use this research to drive the rapid advancement and application of EIIR techniques. The authors will continue to track and contribute new studies on the project page https://github.com/jackyzengl/EIIR.
中文: 本文提出了一种知识驱动的具身智能工业机器人(EIIR)技术框架,通过五大核心模块突破现有技术局限,结合案例验证其工业适用性,并展望了未来研究方向。
English: This paper proposes a knowledge-driven embodied intelligent industrial robotics (EIIR) framework to overcome limitations in industrial robotics by integrating five key modules, demonstrating its real-world applicability through a case study while outlining future research directions.

Authors:Fan Xu, Wuyang Chen, Wei Gao
Title: On the Learning with Augmented Class via Forests
Abstract:
Decision trees and forests have achieved success in various real applications, most working with all testing classes known in training data. In this work, we focus on learning with augmented class via forests, where an augmented class may appear in testing data yet not in training data. We incorporate information about the augmented class into the trees' splitting; that is, augmented Gini impurity, a new splitting criterion, is introduced to exploit some unlabeled data from the testing distribution. We then develop the Learning with Augmented Class via Forests (short for LACForest) approach, which constructs shallow forests according to the augmented Gini impurity and then splits forests with pseudo-labeled augmented instances for better performance. We also develop deep neural forests via an optimization objective based on our augmented Gini impurity, which essentially utilizes the representation power of neural networks for forests. Theoretically, we present the convergence analysis for our augmented Gini impurity, and we finally conduct experiments to evaluate our approaches. The code is available at https://github.com/nju-xuf/LACForest.
中文摘要:本文提出LACForest方法,通过引入增强基尼不纯度这一新分裂准则,将训练数据中未出现的增强类信息融入决策森林,结合浅层森林和深度神经森林提升模型性能。
English Summary: This paper introduces LACForest, a method that enhances decision forests by incorporating an augmented class not present in training data through a novel splitting criterion called augmented Gini impurity, utilizing both shallow and deep neural forests for improved performance.
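
As background for the splitting criterion mentioned in the abstract above, the following sketch computes the standard Gini impurity and the weighted impurity of a candidate split. It deliberately does not reproduce the paper's augmented Gini impurity, whose exact form is not given in the abstract; the function names are illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """Standard Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


parent = gini_impurity(["a", "a", "b", "b"])           # evenly mixed node
children = split_impurity(["a", "a"], ["b", "b"])      # perfectly separating split
```

A tree learner picks the split that minimizes the weighted child impurity; the augmented criterion in the paper additionally accounts for a class that is unseen in training.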

Authors:Fares Bougourzi, Abdenour Hadid
Title: Recent Advances in Medical Imaging Segmentation: A Survey
Abstract:
Medical imaging is a cornerstone of modern healthcare, driving advancements in diagnosis, treatment planning, and patient care. Among its various tasks, segmentation remains one of the most challenging problems due to factors such as data accessibility, annotation complexity, structural variability, variation in medical imaging modalities, and privacy constraints. Despite recent progress, achieving robust generalization and domain adaptation remains a significant hurdle, particularly given the resource-intensive nature of some proposed models and their reliance on domain expertise. This survey explores cutting-edge advancements in medical image segmentation, focusing on methodologies such as Generative AI, Few-Shot Learning, Foundation Models, and Universal Models. These approaches offer promising solutions to longstanding challenges. We provide a comprehensive overview of the theoretical foundations, state-of-the-art techniques, and recent applications of these methods. Finally, we discuss inherent limitations, unresolved issues, and future research directions aimed at enhancing the practicality and accessibility of segmentation models in medical imaging. We are maintaining a GitHub repository (https://github.com/faresbougourzi/Awesome-DL-for-Medical-Imaging-Segmentation) to continue tracking and updating innovations in this field.
中文摘要:本综述探讨了生成式人工智能和基础模型等先进方法,以解决医学图像分割中泛化能力和数据可及性等长期挑战,同时讨论了未来研究方向并维护持续更新的资源库。
English Summary: This survey examines advanced methods like Generative AI and Foundation Models to tackle persistent challenges in medical image segmentation, such as generalization and data accessibility, while also discussing future research directions and maintaining an updated resource repository.

Authors:Bin-Bin Gao
Title: MetaUAS: Universal Anomaly Segmentation with One-Prompt Meta-Learning
Abstract:
Zero- and few-shot visual anomaly segmentation relies on powerful vision-language models that detect unseen anomalies using manually designed textual prompts. However, visual representations are inherently independent of language. In this paper, we explore the potential of a pure visual foundation model as an alternative to widely used vision-language models for universal visual anomaly segmentation. We present a novel paradigm that unifies anomaly segmentation into change segmentation. This paradigm enables us to leverage large-scale synthetic image pairs, featuring object-level and local region changes, derived from existing image datasets, which are independent of target anomaly datasets. We propose a one-prompt Meta-learning framework for Universal Anomaly Segmentation (MetaUAS) that is trained on this synthetic dataset and then generalizes well to segment any novel or unseen visual anomalies in the real world. To handle geometrical variations between prompt and query images, we propose a soft feature alignment module that bridges paired-image change perception and single-image semantic segmentation. This is the first work to achieve universal anomaly segmentation using a pure vision model without relying on special anomaly detection datasets and pre-trained visual-language models. Our method effectively and efficiently segments any anomalies with only one normal image prompt and is training-free, requiring no guidance from language. Our MetaUAS significantly outperforms previous zero-shot, few-shot, and even full-shot anomaly segmentation methods. The code and pre-trained models are available at https://github.com/gaobb/MetaUAS.
Chinese Summary: 本文提出MetaUAS纯视觉基础模型,通过将异常分割统一为变化检测任务,利用合成图像对进行训练,实现无需训练、不依赖语言的单样本异常检测能力。
English Summary: This paper introduces MetaUAS, a pure visual foundation model that achieves universal anomaly segmentation by reframing it as change detection, using synthetic image pairs for training and enabling training-free, language-independent anomaly detection with just one normal image prompt.

Authors:Bin-Bin Gao
Title: Learning to Detect Multi-class Anomalies with Just One Normal Image Prompt
Abstract:
Unsupervised reconstruction networks using self-attention transformers have achieved state-of-the-art performance for multi-class (unified) anomaly detection with a single model. However, these self-attention reconstruction models primarily operate on target features, which may result in perfect reconstruction for both normal and anomaly features due to high consistency with context, leading to failure in detecting anomalies. Additionally, these models often produce inaccurate anomaly segmentation due to performing reconstruction in a low spatial resolution latent space. To enable reconstruction models to enjoy high efficiency while enhancing their generalization for unified anomaly detection, we propose a simple yet effective method that reconstructs normal features and restores anomaly features with just One Normal Image Prompt (OneNIP). In contrast to previous work, OneNIP makes it possible for the first time to reconstruct or restore anomalies with just one normal image prompt, effectively boosting unified anomaly detection performance. Furthermore, we propose a supervised refiner that regresses reconstruction errors by using both real normal and synthesized anomalous images, which significantly improves pixel-level anomaly segmentation. OneNIP outperforms previous methods on three industry anomaly detection benchmarks: MVTec, BTAD, and VisA. The code and pre-trained models are available at https://github.com/gaobb/OneNIP.
中文: 提出的OneNIP方法通过单张正常图像提示重构正常特征并修复异常,显著提升了工业基准测试中的统一异常检测性能,同时改进了像素级分割精度。
English: The proposed OneNIP method enables unified anomaly detection by reconstructing normal features and restoring anomalies using a single normal image prompt, significantly enhancing performance across industry benchmarks while improving pixel-level segmentation accuracy.

Authors:Guan Gui, Bin-Bin Gao, Jun Liu, Chengjie Wang, Yunsheng Wu
Title: Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation
Abstract:
Anomaly detection is a practical and challenging task due to the scarcity of anomaly samples in industrial inspection. Some existing anomaly detection methods address this issue by synthesizing anomalies with noise or external data. However, there is always a large semantic gap between synthetic and real-world anomalies, resulting in weak performance in anomaly detection. To solve the problem, we propose a few-shot Anomaly-driven Generation (AnoGen) method, which guides the diffusion model to generate realistic and diverse anomalies with only a few real anomalies, thereby benefiting training anomaly detection models. Specifically, our work is divided into three stages. In the first stage, we learn the anomaly distribution based on a few given real anomalies and inject the learned knowledge into an embedding. In the second stage, we use the embedding and given bounding boxes to guide the diffusion model to generate realistic and diverse anomalies on specific objects (or textures). In the final stage, we propose a weakly-supervised anomaly detection method to train a more powerful model with generated anomalies. Our method builds upon DRAEM and DesTSeg as the foundation models and conducts experiments on the commonly used industrial anomaly detection dataset, MVTec. The experiments demonstrate that our generated anomalies effectively improve the model performance of both anomaly classification and segmentation tasks simultaneously, e.g., DRAEM and DesTSeg achieved a 5.8% and 1.5% improvement in AU-PR metric on the segmentation task, respectively. The code and generated anomalous data are available at https://github.com/gaobb/AnoGen.
Chinese: 提出的AnoGen方法利用少量真实异常引导扩散模型生成逼真且多样的异常数据,显著提升了在MVTec等工业数据集上异常检测模型在分类和分割任务中的性能。
English: The proposed AnoGen method uses a few real anomalies to guide a diffusion model in generating realistic and diverse anomalies, significantly enhancing the performance of anomaly detection models in classification and segmentation tasks on industrial datasets like MVTec.

Authors:Derian Boer, Stephen Roth, Stefan Kramer
Title: Focus, Merge, Rank: Improved Question Answering Based on Semi-structured Knowledge Bases
Abstract:
In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. However, most rely on either. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data, thereby enabling new strategies for knowledge access and use. In this work, we present FocusedRetriever, a modular SKB-based framework for multi-hop question answering. It integrates components (VSS-based entity search, LLM-based generation of Cypher queries and pairwise re-ranking) in a way that enables it to outperform state-of-the-art methods across all three STaRK benchmark test sets, covering diverse domains and multiple performance metrics. The average first-hit rate exceeds that of the second-best method by 25.7%. FocusedRetriever leverages (1) the capacity of Large Language Models (LLMs) to extract relational facts and entity attributes from unstructured text, (2) node set joins to filter answer candidates based on these extracted triplets and constraints, (3) vector similarity search to retrieve and rank relevant unstructured content, and (4) the contextual capabilities of LLMs to finally rank the top-k answers. For generality, we only incorporate base LLMs in FocusedRetriever in our evaluation. However, our analysis of intermediate results highlights several opportunities for further upgrades including finetuning. The source code is publicly available at https://github.com/kramerlab/FocusedRetriever .
中文: FocusedRetriever是一个基于半结构化知识库的模块化框架,通过整合实体搜索、查询生成和重排序组件,在多领域多跳问答任务中全面超越了现有最优方法。
English: FocusedRetriever is a modular framework using Semi-Structured Knowledge Bases that integrates entity search, query generation, and re-ranking to outperform state-of-the-art methods in multi-hop question answering across diverse domains.

Authors:Faruk Alpay
Title: Stable and Convexified Information Bottleneck Optimization via Symbolic Continuation and Entropy-Regularized Trajectories
Abstract:
The Information Bottleneck (IB) method frequently suffers from unstable optimization, characterized by abrupt representation shifts near critical points of the IB trade-off parameter, beta. In this paper, I introduce a novel approach to achieve stable and convex IB optimization through symbolic continuation and entropy-regularized trajectories. I analytically prove convexity and uniqueness of the IB solution path when an entropy regularization term is included, and demonstrate how this stabilizes representation learning across a wide range of beta values. Additionally, I provide extensive sensitivity analyses around critical points (beta) with statistically robust uncertainty quantification (95% confidence intervals). The open-source implementation, experimental results, and reproducibility framework included in this work offer a clear path for practical deployment and future extension of my proposed method.
中文: 本文通过符号延拓和熵正则化方法,提出了一种稳定且凸的信息瓶颈优化方法,在β参数临界点附近保证了解路径的唯一性,并提供了具有统计鲁棒性的不确定性量化。
English: This paper introduces a stable and convex optimization method for the Information Bottleneck problem using symbolic continuation and entropy regularization, ensuring unique solution paths and robust uncertainty quantification across beta values.
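
For orientation, the standard IB Lagrangian that the trade-off parameter beta refers to can be written as below. This is the textbook formulation, not the paper's regularized objective: the abstract does not give the exact form of the entropy regularization term, so it is left out rather than guessed. The symbols X (input), Y (target), and Z (representation) are the usual conventions and are assumptions here.

```latex
% Standard Information Bottleneck objective (textbook form).
% X: input, Y: target, Z: learned representation, \beta \ge 0: trade-off parameter.
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}}
  = I(X; Z) \;-\; \beta \, I(Z; Y)
```

Small beta favors compression (low I(X;Z)), while large beta favors prediction (high I(Z;Y)); the abrupt representation shifts mentioned in the abstract occur as beta crosses critical values along this path.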

Authors:Jianlin Sun, Xiaolin Fang, Juwei Guan, Dongdong Gui, Teqi Wang, Tongxin Zhu
Title: DRRNet: Macro-Micro Feature Fusion and Dual Reverse Refinement for Camouflaged Object Detection
Abstract:
The core challenge in Camouflaged Object Detection (COD) lies in the indistinguishable similarity between targets and backgrounds in terms of color, texture, and shape. This causes existing methods to either lose edge details (such as hair-like fine structures) due to over-reliance on global semantic information or be disturbed by similar backgrounds (such as vegetation patterns) when relying solely on local features. We propose DRRNet, a four-stage architecture characterized by a "context-detail-fusion-refinement" pipeline to address these issues. Specifically, we introduce an Omni-Context Feature Extraction Module to capture global camouflage patterns and a Local Detail Extraction Module to supplement microstructural information for the full-scene context module. We then design a module for forming dual representations of scene understanding and structural awareness, which fuses panoramic features and local features across various scales. In the decoder, we also introduce a reverse refinement module that leverages spatial edge priors and frequency-domain noise suppression to perform a two-stage inverse refinement of the output. By applying two successive rounds of inverse refinement, the model effectively suppresses background interference and enhances the continuity of object boundaries. Experimental results demonstrate that DRRNet significantly outperforms state-of-the-art methods on benchmark datasets. Our code is available at https://github.com/jerrySunning/DRRNet.
中文摘要:DRRNet提出包含"上下文-细节-融合-优化"四阶段架构,通过全局伪装模式提取与局部细节补充的双重表征融合,结合逆向优化模块实现背景干扰抑制与边界连续性增强,在伪装目标检测任务中显著优于现有方法。
English Summary: DRRNet introduces a four-stage "context-detail-fusion-refinement" architecture with specialized modules for global context and local detail extraction, achieving superior camouflage object detection through dual-representation fusion and reverse refinement that suppresses background noise while preserving boundary continuity.

Authors:Zechao Guan, Feng Yan, Shuai Du, Lin Ma, Qingshan Liu
Title: TopoDiT-3D: Topology-Aware Diffusion Transformer with Bottleneck Structure for 3D Point Cloud Generation
Abstract:
Recent advancements in Diffusion Transformer (DiT) models have significantly improved 3D point cloud generation. However, existing methods primarily focus on local feature extraction while overlooking global topological information, such as voids, which are crucial for maintaining shape consistency and capturing complex geometries. To address this limitation, we propose TopoDiT-3D, a Topology-Aware Diffusion Transformer with a bottleneck structure for 3D point cloud generation. Specifically, we design the bottleneck structure utilizing Perceiver Resampler, which not only offers a mode to integrate topological information extracted through persistent homology into feature learning, but also adaptively filters out redundant local features to improve training efficiency. Experimental results demonstrate that TopoDiT-3D outperforms state-of-the-art models in visual quality, diversity, and training efficiency. Furthermore, TopoDiT-3D demonstrates the importance of rich topological information for 3D point cloud generation and its synergy with conventional local feature learning. Videos and code are available at https://github.com/Zechao-Guan/TopoDiT-3D.
中文:提出的TopoDiT-3D模型通过采用带感知重采样器的瓶颈结构整合全局拓扑信息,在三维点云生成中实现了优于现有方法的视觉质量、多样性和训练效率。
English: The proposed TopoDiT-3D model enhances 3D point cloud generation by incorporating global topological information through a bottleneck structure with Perceiver Resampler, improving visual quality, diversity, and training efficiency over existing methods.

Authors:Dayong Liang, Changmeng Zheng, Zhiyuan Wen, Yi Cai, Xiao-Yong Wei, Qing Li
Title: Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning
Abstract:
Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes. We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components. First, our dual-stream graph constructor combines SAM-powered spatial relation extraction with interaction-aware captioning to generate functionally salient scene graphs with spatial grounding. Second, we employ targeted interaction queries to activate VLMs' latent knowledge of object functionalities, converting passive recognition into active reasoning about how objects work together. Finally, we introduce a long-term memory reinforcement learning strategy with a specialized interaction-focused reward function that transforms transient patterns into long-term reasoning heuristics. Extensive experiments demonstrate that our approach significantly outperforms baseline methods on interaction-heavy reasoning benchmarks, with particularly strong improvements on complex scene understanding tasks. The source code can be accessed at https://github.com/open_upon_acceptance.
中文: 本文提出的交互增强场景图推理框架通过融合空间基础与交互感知推理,结合长期记忆强化机制,有效解决了传统场景图在复杂交互推理中的局限性,显著提升了视觉语言模型对交互密集型任务的理解能力。
English: This paper introduces Interaction-augmented Scene Graph Reasoning (ISGR), a framework that overcomes traditional scene graphs' limitations by integrating spatial grounding with interaction-aware reasoning and long-term memory reinforcement, significantly enhancing vision-language models' performance on complex interaction understanding tasks.

Authors:Yicheng Rui, Yifan Xuan, Shuyue Zheng, Kexin Li, Kaiming Cui, Kai Xiao, Jie Zheng, Jun Kai Ng, Hongxuan Jiang, Fabo Feng, Qinghui Sun
Title: Architecture of Tianyu Software: Relative Photometry as a Case Study
Abstract:
The Tianyu telescope, a one-meter robotic optical survey instrument to be constructed in Lenghu, Qinghai, China, is designed for detecting transiting exoplanets, variable stars and transients. It requires highly automated, optimally distributed, easily extendable, and highly flexible software to process raw data at rates exceeding 500MB/s. In this work, we introduce the architecture of the Tianyu pipeline and use relative photometry as a case to demonstrate its high scalability and efficiency. This pipeline is tested on the data collected from Muguang observatory and Xinglong observatory. The pipeline demonstrates high scalability, with most processing stages increasing in throughput as the number of consumers grows. Compared to a single consumer, the median throughput of image calibration, alignment, and flux extraction increases by 41%, 257%, and 107% respectively when using 5 consumers, while image stacking exhibits limited scalability due to I/O constraints. In our tests, the pipeline was able to detect two transiting sources. In addition, the pipeline captures variability in the light curves of nine known and two previously unknown variable sources in the testing data. Meanwhile, the differential photometric precision of the light curves is near the theoretical limit. These results indicate that this pipeline is suitable for detecting transiting exoplanets and variable stars. This work builds the foundation for further development of Tianyu software. Code of this work is available at https://github.com/ruiyicheng/Tianyu_pipeline.
中文: 天宇数据处理流水线是一款高可扩展且高效的软件系统,专为高速处理天文数据而设计,成功探测到凌星系外行星和变星,并展现出接近理论极限的测光精度。
English: The Tianyu pipeline is a highly scalable and efficient software system designed for processing astronomical data at high speeds, successfully detecting transiting exoplanets and variable stars while demonstrating near-theoretical photometric precision.

Authors:Yuhang Wang, Abdulaziz Alhuraish, Shengming Yuan, Hao Zhou
Title: OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions
Abstract:
Lane Keeping Assist (LKA) is widely adopted in modern vehicles, yet its real-world performance remains underexplored due to proprietary systems and limited data access. This paper presents OpenLKA, the first open, large-scale dataset for LKA evaluation and improvement. It includes 400 hours of driving data from 62 production vehicle models, collected through extensive road testing in Tampa, Florida and global contributions from the Comma.ai driving community. The dataset spans a wide range of challenging scenarios, including complex road geometries, degraded lane markings, adverse weather, lighting conditions and surrounding traffic. The dataset is multimodal, comprising: i) full CAN bus streams, decoded using custom reverse-engineered DBC files to extract key LKA events (e.g., system disengagements, lane detection failures); ii) synchronized high-resolution dash-cam video; iii) real-time outputs from Openpilot, providing accurate estimates of road curvature and lane positioning; iv) enhanced scene annotations generated by Vision Language Models, describing lane visibility, pavement quality, weather, lighting, and traffic conditions. By integrating vehicle-internal signals with high-fidelity perception and rich semantic context, OpenLKA provides a comprehensive platform for benchmarking the real-world performance of production LKA systems, identifying safety-critical operational scenarios, and assessing the readiness of current road infrastructure for autonomous driving. The dataset is publicly available at: https://github.com/OpenLKA/OpenLKA.
Chinese: 本文推出了首个用于评估和改进车道保持辅助系统的开放大规模数据集OpenLKA,包含来自62款车型的400小时多模态驾驶数据,可全面评估实际性能并识别关键安全场景。
English: This paper introduces OpenLKA, the first open and large-scale dataset for evaluating and improving Lane Keeping Assist systems, featuring 400 hours of multimodal driving data from 62 vehicle models to benchmark real-world performance and identify safety-critical scenarios.

Authors:Wei-Long Tian, Peng Gao, Xiao Liu, Long Xu, Hamido Fujita, Hanan Aljuai, Mao-Li Wang
Title: Towards Adaptive Meta-Gradient Adversarial Examples for Visual Tracking
Abstract:
In recent years, visual tracking methods based on convolutional neural networks and Transformers have achieved remarkable performance and have been successfully applied in fields such as autonomous driving. However, the numerous security issues exposed by deep learning models have gradually affected the reliable application of visual tracking methods in real-world scenarios. Therefore, how to reveal the security vulnerabilities of existing visual trackers through effective adversarial attacks has become a critical problem that needs to be addressed. To this end, we propose an adaptive meta-gradient adversarial attack (AMGA) method for visual tracking. This method integrates multi-model ensembles and meta-learning strategies, combining momentum mechanisms and Gaussian smoothing, which can significantly enhance the transferability and attack effectiveness of adversarial examples. AMGA randomly selects models from a large model repository, constructs diverse tracking scenarios, and iteratively performs both white- and black-box adversarial attacks in each scenario, optimizing the gradient directions of each model. This paradigm minimizes the gap between white- and black-box adversarial attacks, thus achieving excellent attack performance in black-box scenarios. Extensive experimental results on large-scale datasets such as OTB2015, LaSOT, and GOT-10k demonstrate that AMGA significantly improves the attack performance, transferability, and deception of adversarial examples. Codes and data are available at https://github.com/pgao-lab/AMGA.
中文: 提出的自适应元梯度对抗攻击(AMGA)方法通过整合多模型集成和元学习策略,显著提升了视觉跟踪中对抗样本的迁移性和攻击效果,在黑盒场景下实现了卓越性能。
English: The proposed Adaptive Meta-Gradient Adversarial Attack (AMGA) method enhances the transferability and effectiveness of adversarial examples in visual tracking by integrating multi-model ensembles and meta-learning, achieving superior performance in black-box scenarios.

Authors:Yangyi Chen, Hao Peng, Tong Zhang, Heng Ji
Title: Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Abstract:
In standard large vision-language models (LVLMs) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, to weight each token based on its probability for LVLMs training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we implement a token-specific re-weighting term based on the importance scores to adjust each token's loss. We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders. We observe 19% and 8% average relative improvement, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains compared to NTP given increasing compute and data.
Chinese: PRIOR是一种新颖的视觉语言预训练方法,通过基于纯文本参考模型的重要性分数对图像相关标记进行差异化损失加权,有效减少大型视觉语言模型的幻觉现象,相比标准的下一个标记预测方法实现了显著的性能提升。
English: PRIOR is a novel vision-language pre-training method that reduces hallucination in large vision-language models by prioritizing image-related tokens through differential loss weighting based on importance scores from a text-only reference model, achieving significant performance improvements over standard next-token prediction.
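
The token re-weighting idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the exact weighting formula, so the score `(1 - p_ref) ** alpha`, the `alpha` parameter, and the function name `reweighted_ntp_loss` are all assumptions chosen only to show the mechanism (tokens that a text-only reference model finds hard to predict receive larger loss weights).

```python
import math


def reweighted_ntp_loss(lvlm_token_probs, ref_token_probs, alpha=1.0):
    """Token-weighted next-token-prediction loss (illustrative sketch only).

    lvlm_token_probs: probability the vision-language model assigns to each
        ground-truth caption token (conditioned on the image).
    ref_token_probs: probability a text-only reference LLM assigns to the
        same tokens (no image input).
    alpha: hypothetical sharpness parameter, not taken from the paper.
    """
    if len(lvlm_token_probs) != len(ref_token_probs):
        raise ValueError("probability lists must have the same length")
    n = len(ref_token_probs)
    # Assumed importance score: low reference probability -> high weight.
    raw = [(1.0 - p_ref) ** alpha for p_ref in ref_token_probs]
    total = sum(raw)
    # Normalize so weights average to 1, keeping the loss scale close to plain NTP.
    weights = [n * r / total for r in raw] if total > 0 else [1.0] * n
    # Per-token negative log-likelihood under the vision-language model.
    nll = [-math.log(p) for p in lvlm_token_probs]
    loss = sum(w * l for w, l in zip(weights, nll)) / n
    return loss, weights


# Toy example: the second token is hard for the text-only model (p_ref = 0.1),
# so it is treated as more image-related and weighted more heavily.
loss, weights = reweighted_ntp_loss([0.5, 0.5], [0.9, 0.1])
print(weights, loss)
```

Normalizing the weights to average 1 is a design choice made here so the weighted loss stays comparable in magnitude to unweighted NTP; the paper may normalize differently or not at all.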

Authors:Yancheng Wang, Nebojsa Jojic, Yingzhen Yang
Title: Differentiable Channel Selection in Self-Attention For Person Re-Identification
Abstract:
In this paper, we propose a novel attention module termed the Differentiable Channel Selection Attention module, or the DCS-Attention module. In contrast with conventional self-attention, the DCS-Attention module features selection of informative channels in the computation of the attention weights. The selection of the feature channels is performed in a differentiable manner, enabling seamless integration with DNN training. Our DCS-Attention is compatible with either fixed neural network backbones or learnable backbones with Differentiable Neural Architecture Search (DNAS), leading to DCS with Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. Importantly, our DCS-Attention is motivated by the principle of Information Bottleneck (IB), and a novel variational upper bound for the IB loss, which can be optimized by SGD, is derived and incorporated into the training loss of the networks with the DCS-Attention modules. In this manner, a neural network with DCS-Attention modules is capable of selecting the most informative channels for feature extraction so that it enjoys state-of-the-art performance for the Re-ID task. Extensive experiments on multiple person Re-ID benchmarks using both DCS-FB and DCS-DNAS show that DCS-Attention significantly enhances the prediction accuracy of DNNs for person Re-ID, which demonstrates the effectiveness of DCS-Attention in learning discriminative features critical to identifying person identities. The code of our work is available at https://github.com/Statistical-Deep-Learning/DCS-Attention.
中文: 本文提出的DCS-Attention模块通过可微分通道选择机制,基于信息瓶颈原理优化特征提取,在行人重识别任务中实现了最优性能。
English: The paper introduces the DCS-Attention module, which differentiably selects informative channels based on the Information Bottleneck principle to enhance feature extraction and achieve state-of-the-art performance in person Re-ID tasks.

Authors:Kangxian Xie, Yufei Zhu, Kaiming Kuang, Li Zhang, Hongwei Bran Li, Mingchen Gao, Jiancheng Yang
Title: Template-Guided Reconstruction of Pulmonary Segments with Neural Implicit Functions
Abstract:
High-quality 3D reconstruction of pulmonary segments plays a crucial role in segmentectomy and surgical treatment planning for lung cancer. Due to the resolution requirement of the target reconstruction, conventional deep learning-based methods often suffer from computational resource constraints or limited granularity. Conversely, implicit modeling is favored due to its computational efficiency and continuous representation at any resolution. We propose a neural implicit function-based method to learn a 3D surface to achieve anatomy-aware, precise pulmonary segment reconstruction, represented as a shape by deforming a learnable template. Additionally, we introduce two clinically relevant evaluation metrics to assess the reconstruction comprehensively. Further, due to the absence of publicly available shape datasets to benchmark reconstruction algorithms, we developed a shape dataset named Lung3D, including the 3D models of 800 labeled pulmonary segments and the corresponding airways, arteries, veins, and intersegmental veins. We demonstrate that the proposed approach outperforms existing methods, providing a new perspective for pulmonary segment reconstruction. Code and data will be available at https://github.com/M3DV/ImPulSe.
中文摘要:本研究提出的基于神经隐式函数的方法,通过可学习模板变形实现解剖感知的精确肺段三维重建,其计算效率和重建质量优于现有方法,并辅以新型Lung3D数据集和临床评估指标。
English Summary: The proposed neural implicit function-based method enables high-quality, anatomy-aware 3D pulmonary segment reconstruction with superior computational efficiency and precision compared to existing approaches, supported by a new Lung3D dataset and clinical evaluation metrics.

Authors:Marina Popova, Iaroslav Chelombitko, Aleksey Komissarov
Title: When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes
Abstract:
The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte Pair Encoding (BPE) to nine T2T primate genomes including three human assemblies by training independent BPE tokenizers with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all assemblies, while nearly 991,854 tokens are unique to a single genome, indicating a rapid decline in shared vocabulary with increasing assembly comparisons. Moreover, phylogenetic trees derived from token overlap failed to recapitulate established primate relationships, a discrepancy attributed to the disproportionate influence of species-specific high-copy repetitive elements. These findings underscore the dual nature of BPE tokenization: while it effectively compresses repetitive sequences, its sensitivity to high-copy elements limits its utility as a universal tool for comparative genomics. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization, emphasizing the need for domain-specific adaptations in the development of large-scale genomic language models. The dnaBPE tool used in this study is open-source and available at https://github.com/aglabx/dnaBPE.
中文: 本研究显示,字节对编码(BPE)标记化虽能有效压缩端粒到端粒灵长类基因组中的重复序列,但由于其对物种特异性重复元件的敏感性,限制了在比较基因组学中的应用,表现为共享词汇量急剧减少且无法还原已知系统发育关系。
English: This study demonstrates that Byte Pair Encoding (BPE) tokenization effectively compresses repetitive sequences in telomere-to-telomere primate genomes but proves limited for comparative genomics due to its sensitivity to species-specific repetitive elements, which disrupts phylogenetic accuracy and shared vocabulary across assemblies.
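
The shared-versus-unique vocabulary comparison described above can be sketched with plain set operations. This is a minimal illustration, not the dnaBPE tool's API: the function names, the toy vocabularies, and the use of Jaccard distance as the overlap measure are assumptions (the abstract says only that trees were derived from token overlap).

```python
from itertools import combinations


def shared_and_unique(vocabs):
    """Return (tokens present in every vocabulary, tokens present in exactly one).

    vocabs: dict mapping an assembly name to its set of BPE token strings.
    """
    if not vocabs:
        return set(), set()
    shared = set.intersection(*vocabs.values())
    counts = {}
    for tokens in vocabs.values():
        for tok in tokens:
            counts[tok] = counts.get(tok, 0) + 1
    unique = {tok for tok, c in counts.items() if c == 1}
    return shared, unique


def jaccard_distances(vocabs):
    """Pairwise Jaccard distance between vocabularies (assumed overlap metric)."""
    dist = {}
    for a, b in combinations(sorted(vocabs), 2):
        union = vocabs[a] | vocabs[b]
        inter = vocabs[a] & vocabs[b]
        dist[(a, b)] = 1.0 - len(inter) / len(union) if union else 0.0
    return dist


# Hypothetical toy vocabularies; real ones would hold 512,000 tokens each.
toy = {
    "human": {"A", "AT", "ATT", "GGC"},
    "chimp": {"A", "AT", "ATT", "CCA"},
    "lemur": {"A", "AT", "TTAGGG", "GAGA"},
}
shared, unique = shared_and_unique(toy)
print(sorted(shared), sorted(unique))
print(jaccard_distances(toy))
```

A distance matrix like this could feed a standard tree-building method; if species-specific high-copy repeats dominate each vocabulary, the distances reflect repeat content rather than ancestry, which is the failure mode the abstract reports.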

Authors:Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze-Yin Chan, S M Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth. S, Snehanshu Mukherjee, Alham Fikri Aji
Title: Behind Maya: Building a Multilingual Vision Language Model
Abstract:
In recent times, we have seen a rapid development of large Vision-Language Models (VLMs). They have shown impressive results on academic benchmarks, primarily in widely spoken languages but lack performance on low-resource languages and varied cultural contexts. To address these limitations, we introduce Maya, an open-source Multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.
中文:近期大型视觉语言模型在主流语言上表现优异,但在低资源语言和文化多样性方面存在不足,为此我们推出了开源多语言模型Maya,它通过支持八种语言的多语言数据集和模型,显著提升了跨文化视觉语言任务的理解能力。
English: Recent advances in large Vision-Language Models have excelled in major languages but struggle with low-resource languages and cultural diversity, prompting the introduction of Maya, an open-source multilingual VLM that enhances cross-cultural understanding through a multilingual dataset and model supporting eight languages.

Authors:Michael Majurski, Cynthia Matuszek
Title: Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora
Abstract:
Language Models (LMs) continue to advance, improving response quality and coherence. Given Internet-scale training datasets, LMs have likely encountered much of what users may ask them to generate in some form during their training. A plethora of evaluation benchmarks have been constructed to assess model quality, response appropriateness, and reasoning capabilities. However, the human effort required for benchmark construction is rapidly being outpaced by the size and scope of the models under evaluation. Having humans build a benchmark for every possible domain of interest is impractical. Therefore, we propose a methodology for automating the construction of fact-based synthetic data model evaluations grounded in document populations. This work leverages the same LMs to evaluate domain-specific knowledge automatically, using only grounding documents (e.g., a textbook) as input. This synthetic data benchmarking approach corresponds well with human-curated questions, producing a Spearman ranking correlation of 0.97 and a benchmark evaluation Pearson accuracy correlation of 0.75. This novel approach supports generating both multiple choice and open-ended synthetic data questions to gain diagnostic insight into LM capability. We apply this methodology to evaluate model performance on two recent arXiv preprints, discovering a surprisingly strong performance from Gemma-3 models on open-ended questions. Code is available at https://github.com/mmajurski/grounded-synth-lm-benchmark
中文:本文提出了一种利用语言模型和基础文档自动构建基于事实的合成基准测试方法,该方法与人工评估高度相关,并能跨领域对模型能力进行诊断性评估。
English: This paper introduces an automated method for creating fact-based synthetic benchmarks using language models and grounding documents, which correlates strongly with human-curated evaluations and enables diagnostic assessment of model capabilities across domains.
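
To make the two reported statistics concrete, the sketch below computes a Spearman rank correlation (do two benchmarks order the models the same way?) and a Pearson correlation (do the accuracy values track each other linearly?) in plain Python. The accuracy numbers are invented for illustration and are not results from the paper; the function names are likewise hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-model accuracies on a human-curated and a synthetic benchmark.
human_acc = [0.90, 0.70, 0.50, 0.30]
synth_acc = [0.80, 0.75, 0.40, 0.35]
print(spearman(human_acc, synth_acc), pearson(human_acc, synth_acc))
```

In this toy case both benchmarks rank the four models identically, so the rank correlation is perfect even though the raw accuracies differ, which mirrors how a benchmark can preserve model ordering more faithfully than absolute scores.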

Authors:Dor Tsur, Carol Xuan Long, Claudio Mayrink Verdun, Hsiang Hsu, Haim Permuter, Flavio P. Calmon
Title: Optimized Couplings for Watermarking Large Language Models
Abstract:
Large-language models (LLMs) are now able to produce text that is, in many cases, seemingly indistinguishable from human-generated content. This has fueled the development of watermarks that imprint a "signal" in LLM-generated text with minimal perturbation of an LLM's output. This paper provides an analysis of text watermarking in a one-shot setting. Through the lens of hypothesis testing with side information, we formulate and analyze the fundamental trade-off between watermark detection power and distortion in generated textual quality. We argue that a key component in watermark design is generating a coupling between the side information shared with the watermark detector and a random partition of the LLM vocabulary. Our analysis identifies the optimal coupling and randomization strategy under the worst-case LLM next-token distribution that satisfies a min-entropy constraint. We provide a closed-form expression of the resulting detection rate under the proposed scheme and quantify the cost in a max-min sense. Finally, we provide an array of numerical results, comparing the proposed scheme with the theoretical optimum and existing schemes, in both synthetic data and LLM watermarking. Our code is available at https://github.com/Carol-Long/CC_Watermark
中文摘要:本文分析了大语言模型文本水印技术,重点研究了在最小熵约束的最坏情况下,检测能力与文本质量失真之间的最优权衡关系。
English Summary: This paper analyzes text watermarking for large-language models, focusing on the optimal trade-off between detection power and text quality distortion under a worst-case scenario with min-entropy constraints.

Authors:Yuping Wang, Shuo Xing, Cui Can, Renjie Li, Hongyuan Hua, Kexin Tian, Zhaobin Mo, Xiangbo Gao, Keshu Wu, Sulong Zhou, Hengxu You, Juntong Peng, Junge Zhang, Zehao Wang, Rui Song, Mingxuan Yan, Walter Zimmer, Xingcheng Zhou, Peiran Li, Zhaohan Lu, Chia-Ju Chen, Yue Huang, Ryan A. Rossi, Lichao Sun, Hongkai Yu, Zhiwen Fan, Frank Hao Yang, Yuhao Kang, Ross Greer, Chenxi Liu, Eun Hak Lee, Xuan Di, Xinyue Ye, Liu Ren, Alois Knoll, Xiaopeng Li, Shuiwang Ji, Masayoshi Tomizuka, Marco Pavone, Tianbao Yang, Jing Du, Ming-Hsuan Yang, Hua Wei, Ziran Wang, Yang Zhou, Jiachen Li, Zhengzhong Tu
Title: Generative AI for Autonomous Driving: Frontiers and Opportunities
Abstract:
Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, particularly the pursuit of Level 5 autonomy. This survey delivers a comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack. We begin by distilling the principles and trade-offs of modern generative modeling, encompassing VAEs, GANs, Diffusion Models, and Large Language Models (LLMs). We then map their frontier applications in image, LiDAR, trajectory, occupancy, video generation as well as LLM-guided reasoning and decision making. We categorize practical applications, such as synthetic data workflows, end-to-end driving strategies, high-fidelity digital twin systems, smart transportation networks, and cross-domain transfer to embodied AI. We identify key obstacles and possibilities such as comprehensive generalization across rare cases, evaluation and safety checks, budget-limited implementation, regulatory compliance, ethical concerns, and environmental effects, while proposing research plans across theoretical assurances, trust metrics, transport integration, and socio-technical influence. By unifying these threads, the survey provides a forward-looking reference for researchers, engineers, and policymakers navigating the convergence of generative AI and advanced autonomous mobility. An actively maintained repository of cited works is available at https://github.com/taco-group/GenAI4AD.
中文: 生成式人工智能通过内容创作与推理能力正彻底变革自动驾驶技术,为实现L5级全自动驾驶提供了最有前景的路径,同时需应对安全性、泛化能力和实际应用等关键挑战。
English: Generative AI is revolutionizing autonomous driving by enabling advanced content creation and reasoning, offering the most promising path toward achieving Level 5 autonomy while addressing key challenges in safety, generalization, and implementation.

Authors:Ippokratis Koukoulis, Ilias Syrigos, Thanasis Korakis
Title: Self-Supervised Transformer-based Contrastive Learning for Intrusion Detection Systems
Abstract:
As the digital landscape becomes more interconnected, the frequency and severity of zero-day attacks have significantly increased, leading to an urgent need for innovative Intrusion Detection Systems (IDS). Machine Learning-based IDS that learn from the network traffic characteristics and can discern attack patterns from benign traffic offer an advanced solution to traditional signature-based IDS. However, they heavily rely on labeled datasets, and their ability to generalize when encountering unseen traffic patterns remains a challenge. This paper proposes a novel self-supervised contrastive learning approach based on transformer encoders, specifically tailored for generalizable intrusion detection on raw packet sequences. Our proposed learning scheme employs a packet-level data augmentation strategy combined with a transformer-based architecture to extract and generate meaningful representations of traffic flows. Unlike traditional methods reliant on handcrafted statistical features (NetFlow), our approach automatically learns comprehensive packet sequence representations, significantly enhancing performance in anomaly identification tasks and supervised learning for intrusion detection. Our transformer-based framework exhibits better performance in comparison to existing NetFlow self-supervised methods. Specifically, we achieve up to a 3% higher AUC in anomaly detection for intra-dataset evaluation and up to 20% higher AUC scores in inter-dataset evaluation. Moreover, our model provides a strong baseline for supervised intrusion detection with limited labeled data, exhibiting an improvement over self-supervised NetFlow models of up to 1.5% AUC when pretrained and evaluated on the same dataset. Additionally, we show the adaptability of our pretrained model when fine-tuned across different datasets, demonstrating strong performance even when lacking benign data from the target domain.
中文: 本文提出了一种基于Transformer编码器的自监督对比学习方法,通过自动学习数据包序列表示来提升入侵检测性能,在异常检测和监督学习任务中均优于传统方法。
English: This paper introduces a self-supervised contrastive learning method using transformer encoders to enhance intrusion detection by automatically learning packet sequence representations, achieving superior performance in both anomaly detection and supervised learning tasks compared to traditional approaches.

Authors:Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar
Title: CodePDE: An Inference Framework for LLM-driven PDE Solver Generation
Abstract:
Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). Leveraging advanced inference-time algorithms and scaling strategies, CodePDE unlocks critical capacities of LLMs for PDE solving: reasoning, debugging, self-refinement, and test-time scaling -- all without task-specific tuning. CodePDE achieves superhuman performance across a range of representative PDE problems. We also present a systematic empirical analysis of LLM-generated solvers, analyzing their accuracy, efficiency, and numerical scheme choices. Our findings highlight the promise and the current limitations of LLMs in PDE solving, offering a new perspective on solver design and opportunities for future model development. Our code is available at https://github.com/LithiumDA/CodePDE.
中文摘要:CodePDE首次提出通过大语言模型生成代码来求解偏微分方程的推理框架,无需特定任务调优即实现超越人类的表现,同时具备自主推理与调试能力。
English Summary: CodePDE introduces a novel framework that uses large language models to generate PDE solvers through code generation, achieving superior performance without task-specific training while enabling reasoning and self-improvement capabilities.

Authors:Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, Karan Singhal
Title: HealthBench: Evaluating Large Language Models Towards Improved Human Health
Abstract:
We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.
中文: HealthBench是一个开源基准测试,通过多轮对话评估大型语言模型在医疗领域的性能与安全性,采用医生制定的多样化评分标准,覆盖多种健康场景和行为维度,展现了模型性能的持续提升。
English: HealthBench is an open-source benchmark for evaluating the performance and safety of large language models in healthcare through multi-turn conversations, using physician-created rubrics across diverse health contexts and behavioral dimensions, showing steady model improvements over time.

Authors:Fanyu Meng, Ziwen Kan, Shahbaz Rezaei, Zhaodan Kong, Xin Chen, Xin Liu
Title: Implet: A Post-hoc Subsequence Explainer for Time Series Models
Abstract:
Explainability in time series models is crucial for fostering trust, facilitating debugging, and ensuring interpretability in real-world applications. In this work, we introduce Implet, a novel post-hoc explainer that generates accurate and concise subsequence-level explanations for time series models. Our approach identifies critical temporal segments that significantly contribute to the model's predictions, providing enhanced interpretability beyond traditional feature-attribution methods. Based on it, we propose a cohort-based (group-level) explanation framework designed to further improve the conciseness and interpretability of our explanations. We evaluate Implet on several standard time-series classification benchmarks, demonstrating its effectiveness in improving interpretability. The code is available at https://github.com/LbzSteven/implet
中文: 本文提出Implet,一种后置解释器,通过识别关键时间片段为时间序列模型生成精确的子序列级解释,并引入基于群体的解释框架以增强可解释性,在标准基准测试中验证了其有效性。
English: This paper presents Implet, a post-hoc explainer that generates precise subsequence-level explanations for time series models by identifying key temporal segments and offers a cohort-based framework to enhance interpretability, validated on standard benchmarks.

Authors:Abdolmehdi Behroozi, Chaopeng Shen, Daniel Kifer
Title: Sensitivity-Constrained Fourier Neural Operators for Forward and Inverse Problems in Parametric Differential Equations
Abstract:
Parametric differential equations of the form du/dt = f(u, x, t, p) are fundamental in science and engineering. While deep learning frameworks such as the Fourier Neural Operator (FNO) can efficiently approximate solutions, they struggle with inverse problems, sensitivity estimation (du/dp), and concept drift. We address these limitations by introducing a sensitivity-based regularization strategy, called Sensitivity-Constrained Fourier Neural Operators (SC-FNO). SC-FNO achieves high accuracy in predicting solution paths and consistently outperforms standard FNO and FNO with physics-informed regularization. It improves performance in parameter inversion tasks, scales to high-dimensional parameter spaces (tested with up to 82 parameters), and reduces both data and training requirements. These gains are achieved with a modest increase in training time (30% to 130% per epoch) and generalize across various types of differential equations and neural operators. Code and selected experiments are available at: https://github.com/AMBehroozi/SC_Neural_Operators
Chinese: SC-FNO 通过引入基于敏感性的正则化策略,解决了 FNO 在逆问题和敏感性估计方面的不足,实现了更高的精度、可扩展性及更低的数据需求,同时训练时间仅适度增加。
English: SC-FNO introduces sensitivity-based regularization to overcome FNO's limitations in inverse problems and sensitivity estimation, achieving higher accuracy, scalability, and reduced data needs with a moderate training time increase.
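
The sensitivity-based regularization described in this abstract can be pictured as an extra loss term that penalizes errors in the predicted sensitivities du/dp alongside the usual solution error. The expression below is only a sketch under assumptions: the symbols \hat{u}_\theta (the operator's prediction), \lambda (a weighting coefficient), and the choice of squared L2 norms are illustrative and are not taken from the abstract, which does not state the exact objective.

```latex
\mathcal{L}(\theta)
  = \underbrace{\left\lVert \hat{u}_\theta - u \right\rVert_2^2}_{\text{solution error}}
  + \lambda \,
    \underbrace{\left\lVert \frac{\partial \hat{u}_\theta}{\partial p}
                          - \frac{\partial u}{\partial p} \right\rVert_2^2}_{\text{sensitivity error}}
```

With \lambda = 0 this reduces to a plain solution-matching loss; a positive \lambda additionally asks the model to respond to parameter changes the way the true solution does, which is the property the abstract ties to better parameter inversion.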

Authors:Zongchuang Zhao, Haoyu Fu, Dingkang Liang, Xin Zhou, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai
Title: Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving
Abstract:
Large Visual-Language Models (LVLMs) have significantly advanced image understanding. Their comprehension and reasoning capabilities enable promising applications in autonomous driving scenarios. However, existing research typically focuses on front-view perspectives and partial objects within scenes, struggling to achieve comprehensive scene understanding. Meanwhile, existing LVLMs suffer from the lack of a mapping relationship between 2D and 3D and insufficient integration of 3D object localization and instruction understanding. To tackle these limitations, we first introduce NuInteract, a large-scale dataset with over 1.5M multi-view image-language pairs spanning dense scene captions and diverse interactive tasks. Furthermore, we propose DriveMonkey, a simple yet effective framework that seamlessly integrates LVLMs with a spatial processor using a series of learnable queries. The spatial processor, designed as a plug-and-play component, can be initialized with pre-trained 3D detectors to improve 3D perception. Our experiments show that DriveMonkey outperforms general LVLMs, especially achieving a notable 9.86% improvement on the 3D visual grounding task. The dataset and code will be released at https://github.com/zc-zhao/DriveMonkey.
Chinese: 本研究提出了DriveMonkey框架,通过集成空间处理与3D感知能力来增强大型视觉语言模型在自动驾驶中的应用,利用NuInteract数据集显著提升了场景理解和3D视觉定位性能。
English: This study introduces DriveMonkey, a framework that enhances large visual-language models for autonomous driving by integrating spatial processing with 3D perception, using the NuInteract dataset to improve scene understanding and 3D visual grounding.

Authors:Xiaolei Qin, Di Wang, Jing Zhang, Fengxiang Wang, Xin Su, Bo Du, Liangpei Zhang
Title: TiMo: Spatiotemporal Foundation Model for Satellite Image Time Series
Abstract:
Satellite image time series (SITS) provide continuous observations of the Earth's surface, making them essential for applications such as environmental management and disaster assessment. However, existing spatiotemporal foundation models rely on plain vision transformers, which encode entire temporal sequences without explicitly capturing multiscale spatiotemporal relationships between land objects. This limitation hinders their effectiveness in downstream tasks. To overcome this challenge, we propose TiMo, a novel hierarchical vision transformer foundation model tailored for SITS analysis. At its core, we introduce a spatiotemporal gyroscope attention mechanism that dynamically captures evolving multiscale patterns across both time and space. For pre-training, we curate MillionST, a large-scale dataset of one million images from 100,000 geographic locations, each captured across 10 temporal phases over five years, encompassing diverse geospatial changes and seasonal variations. Leveraging this dataset, we adapt masked image modeling to pre-train TiMo, enabling it to effectively learn and encode generalizable spatiotemporal representations. Extensive experiments across multiple spatiotemporal tasks, including deforestation monitoring, land cover segmentation, crop type classification, and flood detection, demonstrate TiMo's superiority over state-of-the-art methods. Code, model, and dataset will be released at https://github.com/MiliLab/TiMo.
中文: 提出的TiMo模型通过时空陀螺仪注意力机制捕捉卫星图像时间序列中的多尺度模式,并利用大规模MillionST数据集进行预训练,在多项环境监测任务中实现了优于现有方法的性能。
English: The proposed TiMo model introduces a spatiotemporal gyroscope attention mechanism to capture multiscale patterns in satellite image time series, achieving superior performance across multiple environmental monitoring tasks through pre-training on the large-scale MillionST dataset.

Authors:Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim
Title: Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Abstract:
Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are not so effective due to the limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.
中文摘要:提出的视觉引导解码(VGD)方法利用大语言模型和CLIP指导,为文生图模型生成连贯、可读的提示文本,无需额外训练即可在可解释性和上下文相关性上超越现有技术。
English Summary: The proposed Visually Guided Decoding (VGD) method uses large language models and CLIP guidance to generate coherent, human-readable prompts for text-to-image models, outperforming existing techniques in interpretability and contextual relevance without requiring additional training.

Authors:Ziyuan He, Zhiqing Guo, Liejun Wang, Gaobo Yang, Yunfeng Diao, Dan Ma
Title: WaveGuard: Robust Deepfake Detection and Source Tracing via Dual-Tree Complex Wavelet and Graph Neural Networks
Abstract:
Deepfake technology poses increasing risks such as privacy invasion and identity theft. To address these threats, we propose WaveGuard, a proactive watermarking framework that enhances robustness and imperceptibility via frequency-domain embedding and graph-based structural consistency. Specifically, we embed watermarks into high-frequency sub-bands using Dual-Tree Complex Wavelet Transform (DT-CWT) and employ a Structural Consistency Graph Neural Network (SC-GNN) to preserve visual quality. We also design an attention module to refine embedding precision. Experimental results on face swap and reenactment tasks demonstrate that WaveGuard outperforms state-of-the-art methods in both robustness and visual quality. Code is available at https://github.com/vpsg-research/WaveGuard.
中文:摘要介绍了WaveGuard,一种利用频域嵌入和图结构一致性的主动水印框架,通过实验证明其在对抗深度伪造威胁方面具有卓越的鲁棒性和视觉质量。
English: The abstract introduces WaveGuard, a proactive watermarking framework that uses frequency-domain embedding and graph-based structural consistency to combat deepfake threats, showing superior robustness and visual quality in experiments.

Authors:Haofeng Liu, Mingqi Gao, Xuxiao Luo, Ziyue Wang, Guanyi Qin, Junde Wu, Yueming Jin
Title: ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking
Abstract:
Surgical scene segmentation is critical in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, referring surgical segmentation is emerging, given its advantage of providing surgeons with an interactive experience to segment the target object. However, existing methods are limited by low efficiency and short-term tracking, hindering their applicability in complex real-world surgical scenarios. In this paper, we introduce ReSurgSAM2, a two-stage surgical referring segmentation framework that leverages Segment Anything Model 2 to perform text-referred target detection, followed by tracking with reliable initial frame identification and diversity-driven long-term memory. For the detection stage, we propose a cross-modal spatial-temporal Mamba to generate precise detection and segmentation results. Based on these results, our credible initial frame selection strategy identifies the reliable frame for the subsequent tracking. Upon selecting the initial frame, our method transitions to the tracking stage, where it incorporates a diversity-driven memory mechanism that maintains a credible and diverse memory bank, ensuring consistent long-term tracking. Extensive experiments demonstrate that ReSurgSAM2 achieves substantial improvements in accuracy and efficiency compared to existing methods, operating in real-time at 61.2 FPS. Our code and datasets will be available at https://github.com/jinlab-imvr/ReSurgSAM2.
Chinese: ReSurgSAM2提出了一种两阶段的手术场景分割框架,通过跨模态检测和长期跟踪提升精度与效率,实现了61.2 FPS的实时性能。
English: ReSurgSAM2 introduces a two-stage framework for surgical scene segmentation, enhancing accuracy and efficiency through cross-modal detection and long-term tracking, achieving real-time performance at 61.2 FPS.

Authors:Xiao Ni, Carsten Kuehnel, Xiaoyi Jiang
Title: Thermal Detection of People with Mobility Restrictions for Barrier Reduction at Traffic Lights Controlled Intersections
Abstract:
Rapid advances in deep learning for computer vision have driven the adoption of RGB camera-based adaptive traffic light systems to improve traffic safety and pedestrian comfort. However, these systems often overlook the needs of people with mobility restrictions. Moreover, the use of RGB cameras presents significant challenges, including limited detection performance under adverse weather or low-visibility conditions, as well as heightened privacy concerns. To address these issues, we propose a fully automated, thermal detector-based traffic light system that dynamically adjusts signal durations for individuals with walking impairments or mobility burden and triggers the auditory signal for visually impaired individuals, thereby advancing towards barrier-free intersection for all users. To this end, we build the thermal dataset for people with mobility restrictions (TD4PWMR), designed to capture diverse pedestrian scenarios, particularly focusing on individuals with mobility aids or mobility burden under varying environmental conditions, such as different lighting, weather, and crowded urban settings. While thermal imaging offers advantages in terms of privacy and robustness to adverse conditions, it also introduces inherent hurdles for object detection due to its lack of color and fine texture details and generally lower resolution of thermal images. To overcome these limitations, we develop YOLO-Thermal, a novel variant of the YOLO architecture that integrates advanced feature extraction and attention mechanisms for enhanced detection accuracy and robustness in thermal imaging. Experiments demonstrate that the proposed thermal detector outperforms existing detectors, while the proposed traffic light system effectively enhances barrier-free intersection. The source codes and dataset are available at https://github.com/leon2014dresden/YOLO-THERMAL.
中文: 深度学习进步催生了基于RGB摄像头的交通灯系统,但其忽视行动不便者且存在天气和隐私问题,因此开发了基于热成像检测器的系统,结合YOLO-Thermal提升无障碍交叉口的效能。
English: Deep learning advancements have led to RGB-based traffic light systems, but they neglect mobility-impaired individuals and face weather and privacy issues, prompting the development of a thermal detector-based system with YOLO-Thermal for enhanced barrier-free intersections.

Authors:Shan Zhao, Zhitong Xiong, Jie Zhao, Xiao Xiang Zhu
Title: ExEBench: Benchmarking Foundation Models on Extreme Earth Events
Abstract:
Our planet is facing increasingly frequent extreme events, which pose major risks to human lives and ecosystems. Recent advances in machine learning (ML), especially with foundation models (FMs) trained on extensive datasets, excel in extracting features and show promise in disaster management. Nevertheless, these models often inherit biases from training data, challenging their performance over extreme values. To explore the reliability of FMs in the context of extreme events, we introduce ExEBench (Extreme Earth Benchmark), a collection of seven extreme event categories across floods, wildfires, storms, tropical cyclones, extreme precipitation, heatwaves, and cold waves. The dataset features global coverage, varying data volumes, and diverse data sources with different spatial, temporal, and spectral characteristics. To broaden the real-world impact of FMs, we include multiple challenging ML tasks that are closely aligned with operational needs in extreme event detection, monitoring, and forecasting. ExEBench aims to (1) assess FM generalizability across diverse, high-impact tasks and domains, (2) promote the development of novel ML methods that benefit disaster management, and (3) offer a platform for analyzing the interactions and cascading effects of extreme events to advance our understanding of the Earth system, especially under the climate change expected in the decades to come. The dataset and code are publicly available at https://github.com/zhaoshan2/EarthExtreme-Bench.
中文: 摘要介绍了ExEBench基准数据集,用于评估基础模型在极端事件管理中的表现,旨在提升灾害应对能力并深化对气候变化下地球系统的理解。
English: The abstract introduces ExEBench, a benchmark dataset for evaluating foundation models' performance in managing extreme events, aiming to enhance disaster management and understanding of Earth systems under climate change.

Authors:Zheang Huai, Hui Tang, Yi Li, Zhuangzhuang Chen, Xiaomeng Li
Title: Leveraging Segment Anything Model for Source-Free Domain Adaptation via Dual Feature Guided Auto-Prompting
Abstract:
Source-free domain adaptation (SFDA) for segmentation aims at adapting a model trained in the source domain to perform well in the target domain with only the source model and unlabeled target data. Inspired by the recent success of Segment Anything Model (SAM) which exhibits the generality of segmenting images of various modalities and in different domains given human-annotated prompts like bounding boxes or points, we for the first time explore the potential of Segment Anything Model for SFDA via automatically finding an accurate bounding box prompt. We find that the bounding boxes directly generated with existing SFDA approaches are defective due to the domain gap. To tackle this issue, we propose a novel Dual Feature Guided (DFG) auto-prompting approach to search for the box prompt. Specifically, the source model is first trained in a feature aggregation phase, which not only preliminarily adapts the source model to the target domain but also builds a feature distribution well-prepared for box prompt search. In the second phase, based on two feature distribution observations, we gradually expand the box prompt with the guidance of the target model feature and the SAM feature to handle the class-wise clustered target features and the class-wise dispersed target features, respectively. To remove the potentially enlarged false positive regions caused by the over-confident prediction of the target model, the refined pseudo-labels produced by SAM are further postprocessed based on connectivity analysis. Experiments on 3D and 2D datasets indicate that our approach yields superior performance compared to conventional methods. Code is available at https://github.com/xmed-lab/DFG.
Chinese: 本研究提出了一种双重特征引导的自动提示方法,通过利用Segment Anything模型自动生成精确的边界框提示来改进无源域自适应分割,在3D和2D数据集上相比传统方法实现了更优的性能。
English: This study introduces a Dual Feature Guided auto-prompting method to enhance source-free domain adaptation for segmentation by leveraging the Segment Anything Model to automatically generate accurate bounding box prompts, achieving superior performance on both 3D and 2D datasets compared to conventional approaches.

Authors:Alexandra Khirianova, Ekaterina Solodneva, Andrey Pudovikov, Sergey Osokin, Egor Samosvat, Yuriy Dorn, Alexander Ledovsky, Yana Zenkova
Title: BAT: Benchmark for Auto-bidding Task
Abstract:
The optimization of bidding strategies for online advertising slot auctions presents a critical challenge across numerous digital marketplaces. A significant obstacle to the development, evaluation, and refinement of real-time autobidding algorithms is the scarcity of comprehensive datasets and standardized benchmarks. To address this deficiency, we present an auction benchmark encompassing the two most prevalent auction formats. We implement a series of robust baselines on a novel dataset, addressing the most salient Real-Time Bidding (RTB) problem domains: budget pacing uniformity and Cost Per Click (CPC) constraint optimization. This benchmark provides a user-friendly and intuitive framework for researchers and practitioners to develop and refine innovative autobidding algorithms, thereby facilitating advancements in the field of programmatic advertising. The implementation and additional resources can be accessed at the following repository (https://github.com/avito-tech/bat-autobidding-benchmark, https://doi.org/10.5281/zenodo.14794182).
中文总结:本文针对在线广告自动竞价领域缺乏数据集和标准基准的问题,提出了一个包含多种拍卖形式和基线的综合基准,重点解决预算均匀分配和点击成本优化两大核心挑战。
English Summary: This paper introduces a comprehensive auction benchmark with robust baselines to address the scarcity of datasets and standards in online advertising autobidding, focusing on budget pacing and CPC optimization.

Authors:Shuai Xu, Sijia Cui, Yanna Wang, Bo Xu, Qi Wang
Title: Strategy-Augmented Planning for Large Language Models via Opponent Exploitation
Abstract:
Efficiently modeling and exploiting opponents is a long-standing challenge in adversarial domains. Large Language Models (LLMs) trained on extensive textual data have recently demonstrated outstanding performance in general tasks, introducing new research directions for opponent modeling. Some studies primarily focus on directly using LLMs to generate decisions based on the elaborate prompt context that incorporates opponent descriptions, while these approaches are limited to scenarios where LLMs possess adequate domain expertise. To address that, we introduce a two-stage Strategy-Augmented Planning (SAP) framework that significantly enhances the opponent exploitation capabilities of LLM-based agents by utilizing a critical component, the Strategy Evaluation Network (SEN). Specifically, in the offline stage, we construct an explicit strategy space and subsequently collect strategy-outcome pair data for training the SEN network. During the online phase, SAP dynamically recognizes the opponent's strategies and greedily exploits them by searching for the best response strategy on the well-trained SEN, finally translating the strategy into a course of actions through carefully designed prompts. Experimental results show that SAP exhibits robust generalization capabilities, allowing it to perform effectively not only against previously encountered opponent strategies but also against novel, unseen strategies. In the MicroRTS environment, SAP achieves an 85.35% performance improvement over baseline methods and matches the competitiveness of reinforcement learning approaches against state-of-the-art (SOTA) rule-based AI. Our code is available at https://github.com/hsushuai/SAP.
中文: 该研究提出了一种策略增强规划(SAP)框架,通过策略评估网络(SEN)提升基于大语言模型的智能体在对抗环境中利用对手的能力,实现了显著的性能提升和强大的泛化能力。
English: The study introduces a Strategy-Augmented Planning (SAP) framework that enhances LLM-based agents' ability to exploit opponents by using a Strategy Evaluation Network (SEN), achieving significant performance improvements and robust generalization in adversarial scenarios.

Authors:Wenkui Yang, Zhida Zhang, Xiaoqiang Zhou, Junxian Duan, Jie Cao
Title: TT-DF: A Large-Scale Diffusion-Based Dataset and Benchmark for Human Body Forgery Detection
Abstract:
The emergence and popularity of facial deepfake methods spur the vigorous development of deepfake datasets and facial forgery detection, which to some extent alleviates the security concerns about facial-related artificial intelligence technologies. However, when it comes to human body forgery, there has been a persistent lack of datasets and detection methods, due to the later inception and complexity of human body generation methods. To mitigate this issue, we introduce TikTok-DeepFake (TT-DF), a novel large-scale diffusion-based dataset containing 6,120 forged videos with 1,378,857 synthetic frames, specifically tailored for body forgery detection. TT-DF offers a wide variety of forgery methods, involving multiple advanced human image animation models utilized for manipulation, two generative configurations based on the disentanglement of identity and pose information, as well as different compressed versions. The aim is to simulate any potential unseen forged data in the wild as comprehensively as possible, and we also furnish a benchmark on TT-DF. Additionally, we propose an adapted body forgery detection model, Temporal Optical Flow Network (TOF-Net), which exploits the spatiotemporal inconsistencies and optical flow distribution differences between natural data and forged data. Our experiments demonstrate that TOF-Net achieves favorable performance on TT-DF, outperforming current state-of-the-art extendable facial forgery detection models. For our TT-DF dataset, please refer to https://github.com/HashTAG00002/TT-DF.
Chinese: TikTok-DeepFake数据集的推出填补了人体伪造检测领域的数据空白,同时提出的TOF-Net模型通过利用时空和光流分析实现了优越的检测性能。
English: The introduction of the TikTok-DeepFake dataset addresses the scarcity of resources for human body forgery detection, while the proposed TOF-Net model demonstrates superior performance by leveraging spatiotemporal and optical flow analysis.

Authors:Huiyun Jiang, Zhuang Yang
Title: Adaptive Diffusion Policy Optimization for Robotic Manipulation
Abstract:
Recent studies have shown the great potential of diffusion models in improving reinforcement learning (RL) by modeling complex policies, expressing a high degree of multi-modality, and efficiently handling high-dimensional continuous control tasks. However, there is currently limited research on how to optimize diffusion-based policies (e.g., Diffusion Policy) quickly and stably. In this paper, we propose Adam-based Diffusion Policy Optimization (ADPO), a fast algorithmic framework containing best practices for fine-tuning diffusion-based policies in robotic control tasks using the adaptive gradient descent method in RL. Adaptive gradient methods are less studied in training RL, let alone diffusion-based policies. We confirm that ADPO outperforms other diffusion-based RL methods in terms of overall effectiveness for fine-tuning on standard robotic tasks. Concretely, we conduct extensive experiments on standard robotic control tasks to test ADPO, where, particularly, six popular diffusion-based RL methods are provided as benchmark methods. Experimental results show that ADPO achieves better or comparable performance relative to the baseline methods. Finally, we systematically analyze the sensitivity of multiple hyperparameters in standard robotics tasks, providing guidance for subsequent practical applications. Our video demonstrations are released at https://github.com/Timeless-lab/ADPO.git.
中文摘要:本文提出ADPO框架,利用自适应梯度方法优化扩散策略,在机器人控制任务中展现出优于现有方法的性能,并通过系统实验验证了其有效性。
English Summary: This paper introduces ADPO, an Adam-based framework that efficiently fine-tunes diffusion policies for reinforcement learning in robotics, demonstrating superior performance over existing methods through extensive experiments.

Authors:Erpai Luo, Jinmeng Jia, Yifan Xiong, Xiangyu Li, Xiaobo Guo, Baoqi Yu, Lei Wei, Xuegong Zhang
Title: Benchmarking AI scientists in omics data-driven biological research
Abstract:
The rise of large language models and multi-agent systems has sparked growing interest in AI scientists capable of autonomous biological research. However, existing benchmarks either focus on reasoning without data or on data analysis with predefined statistical answers, lacking realistic, data-driven evaluation settings. Here, we introduce the Biological AI Scientist Benchmark (BaisBench), a benchmark designed to assess AI scientists' ability to generate biological discoveries through data analysis and reasoning with external knowledge. BaisBench comprises two tasks: cell type annotation on 31 expert-labeled single-cell datasets, and scientific discovery through answering 198 multiple-choice questions derived from the biological insights of 41 recent single-cell studies. Systematic experiments on state-of-the-art AI scientists and LLM agents showed that while promising, current models still substantially underperform human experts on both tasks. We hope BaisBench will fill this gap and serve as a foundation for advancing and evaluating AI models for scientific discovery. The benchmark can be found at: https://github.com/EperLuo/BaisBench.
Chinese: BaisBench是一个旨在通过数据分析和推理评估AI科学家自主进行生物学发现能力的新基准,它通过细胞类型注释和科学问题解答等任务弥补了现有评估缺乏真实数据驱动的不足,但当前模型表现仍远不及人类专家。
English: BaisBench is a new benchmark designed to evaluate AI scientists' capabilities in autonomous biological discovery through data analysis and reasoning, addressing the current lack of realistic, data-driven assessments by including tasks like cell type annotation and scientific question answering, though current models still lag behind human experts.

Authors:Nibir Chandra Mandal, Oishee Bintey Hoque, Abhijin Adiga, Samarth Swarup, Mandy Wilson, Lu Feng, Yangfeng Ji, Miaomiao Zhang, Geoffrey Fox, Madhav Marathe
Title: IrrMap: A Large-Scale Comprehensive Dataset for Irrigation Method Mapping
Abstract:
We introduce IrrMap, the first large-scale dataset (1.1 million patches) for irrigation method mapping across regions. IrrMap consists of multi-resolution satellite imagery from LandSat and Sentinel, along with key auxiliary data such as crop type, land use, and vegetation indices. The dataset spans 1,687,899 farms and 14,117,330 acres across multiple western U.S. states from 2013 to 2023, providing a rich and diverse foundation for irrigation analysis and ensuring geospatial alignment and quality control. The dataset is ML-ready, with standardized 224x224 GeoTIFF patches, multiple input modalities, carefully chosen train-test-split data, and accompanying dataloaders for seamless deep learning model training and benchmarking in irrigation mapping. The dataset is also accompanied by a complete pipeline for dataset generation, enabling researchers to extend IrrMap to new regions for irrigation data collection or adapt it with minimal effort for other similar applications in agricultural and geospatial analysis. We also analyze the irrigation method distribution across crop groups, spatial irrigation patterns (using Shannon diversity indices), and irrigated area variations for both LandSat and Sentinel, providing insights into regional and resolution-based differences. To promote further exploration, we openly release IrrMap, along with the derived datasets, benchmark models, and pipeline code, through a GitHub repository: https://github.com/Nibir088/IrrMap and Data repository: https://huggingface.co/Nibir/IrrMap, providing comprehensive documentation and implementation details.
中文: IrrMap是首个用于灌溉方法测绘的大规模数据集,包含美国多个农场2013-2023年的多分辨率卫星影像和辅助数据,专为机器学习应用设计,并附带完整文档开放共享。
English: IrrMap is the first large-scale dataset for irrigation method mapping, featuring multi-resolution satellite imagery and auxiliary data from 2013-2023 across U.S. farms, designed for machine learning applications and openly available with full documentation.
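The abstract mentions Shannon diversity indices for describing spatial irrigation patterns. As a rough illustration only (not the authors' pipeline), the index H = -sum(p_i * ln p_i) over the share p_i of each irrigation method can be computed as below. The label names and the toy lists are assumptions made up for this sketch.

```python
import math
from collections import Counter


def shannon_diversity(labels):
    """Return H = -sum(p_i * ln p_i) over the label frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical irrigation labels for the patches of two regions.
mixed_region = ["drip", "sprinkler", "flood", "drip", "sprinkler", "flood"]
uniform_region = ["flood"] * 6

h_mixed = shannon_diversity(mixed_region)      # three equally common methods
h_uniform = shannon_diversity(uniform_region)  # a single method
```

A region with three equally common methods reaches the maximum ln(3) for three classes, while a region with a single method scores zero.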

Authors:Midi Wan, Pengfei Li, Yizhuo Liang, Di Wu, Yushan Pan, Guangzhen Zhu, Hao Wang
Title: Skeleton-Guided Diffusion Model for Accurate Foot X-ray Synthesis in Hallux Valgus Diagnosis
Abstract:
Medical image synthesis plays a crucial role in providing anatomically accurate images for diagnosis and treatment. Hallux valgus, which affects approximately 19% of the global population, requires frequent weight-bearing X-rays for assessment, placing additional strain on both patients and healthcare providers. Existing X-ray models often struggle to balance image fidelity, skeletal consistency, and physical constraints, particularly in diffusion-based methods that lack skeletal guidance. We propose the Skeletal-Constrained Conditional Diffusion Model (SCCDM) and introduce KCC, a foot evaluation method utilizing skeletal landmarks. SCCDM incorporates multi-scale feature extraction and attention mechanisms, improving the Structural Similarity Index (SSIM) by 5.72% (0.794) and Peak Signal-to-Noise Ratio (PSNR) by 18.34% (21.40 dB). When combined with KCC, the model achieves an average score of 0.85, demonstrating strong clinical applicability. The code is available at https://github.com/midisec/SCCDM.
中文: 提出的骨骼约束条件扩散模型(SCCDM)通过多尺度特征提取和注意力机制,显著提升了用于拇外翻评估的医学图像合成质量,SSIM提高5.72%,PSNR提升18.34%,结合KCC评估方法展现出优异的临床适用性。
English: The proposed Skeletal-Constrained Conditional Diffusion Model (SCCDM) with multi-scale feature extraction and attention mechanisms significantly improves medical image synthesis for hallux valgus assessment, achieving 5.72% higher SSIM and 18.34% better PSNR while demonstrating strong clinical utility when paired with the KCC evaluation method.

Authors:Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, Guojie Song
Title: Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement
Abstract:
The advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. This progress presents novel challenges, such as measuring human-like psychological constructs, moving beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This review paper introduces and synthesizes the emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. The reviewed literature systematically shapes benchmarking principles, broadens evaluation scopes, refines methodologies, validates results, and advances LLM capabilities. Diverse perspectives are integrated to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, the review provides actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.
中文摘要:本文综述了新兴的LLM心理测量学领域,该领域运用心理测量工具与理论来应对大语言模型评估中的挑战,致力于建立以人为中心的人工智能系统并推动其发展。
English Summary: This review introduces LLM Psychometrics, an interdisciplinary field using psychometric principles to address the challenges of evaluating large language models beyond traditional benchmarks, aiming to advance human-centered AI systems.

Authors:Wangxuan Fan, Siqi Li, Doudou Zhou, Yohei Okada, Chuan Hong, Molei Liu, Nan Liu
Title: SIM-Shapley: A Stable and Computationally Efficient Approach to Shapley Value Approximation
Abstract:
Explainable artificial intelligence (XAI) is essential for trustworthy machine learning (ML), particularly in high-stakes domains such as healthcare and finance. Shapley value (SV) methods provide a principled framework for feature attribution in complex models but incur high computational costs, limiting their scalability in high-dimensional settings. We propose Stochastic Iterative Momentum for Shapley Value Approximation (SIM-Shapley), a stable and efficient SV approximation method inspired by stochastic optimization. We analyze variance theoretically, prove linear $Q$-convergence, and demonstrate improved empirical stability and low bias in practice on real-world datasets. In our numerical experiments, SIM-Shapley reduces computation time by up to 85% relative to state-of-the-art baselines while maintaining comparable feature attribution quality. Beyond feature attribution, our stochastic mini-batch iterative framework extends naturally to a broader class of sample average approximation problems, offering a new avenue for improving computational efficiency with stability guarantees. Code is publicly available at https://github.com/nliulab/SIM-Shapley.
中文摘要:SIM-Shapley是一种高效的沙普利值近似方法,能在保持特征归因质量的同时将计算时间减少高达85%,其随机优化框架可扩展至更广泛的样本平均近似问题,并具有理论收敛性和稳定性保证。
English Summary: SIM-Shapley is an efficient Shapley value approximation method that reduces computation time by up to 85% while maintaining attribution quality, extending to broader sample average approximation problems with proven convergence and stability.
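For context on what such methods approximate, here is a minimal sketch of the classic permutation-sampling Monte Carlo estimator of Shapley values. This is a generic baseline, not the SIM-Shapley algorithm; the function names, the toy additive game, and the sample count are assumptions chosen for illustration.

```python
import random


def shapley_monte_carlo(value_fn, n_features, n_permutations=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over random feature orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n_features
    order = list(range(n_features))
    for _ in range(n_permutations):
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for j in order:
            coalition.add(j)
            cur = value_fn(frozenset(coalition))
            phi[j] += cur - prev  # marginal contribution of feature j
            prev = cur
    return [p / n_permutations for p in phi]


# Toy additive game (hypothetical): each feature contributes a fixed weight,
# so the exact Shapley value of feature j is weights[j].
weights = [1.0, 2.0, 3.0]


def additive_value(coalition):
    return sum(weights[j] for j in coalition)


estimates = shapley_monte_carlo(additive_value, n_features=3, n_permutations=200)
```

Each permutation costs one model evaluation per feature, which is the cost that faster approximation schemes aim to reduce.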

Authors:He Huang, Qi Yang, Mufan Liu, Yiling Xu, Zhu Li
Title: ADC-GS: Anchor-Driven Deformable and Compressed Gaussian Splatting for Dynamic Scene Reconstruction
Abstract:
Existing 4D Gaussian Splatting methods rely on per-Gaussian deformation from a canonical space to target frames, which overlooks redundancy among adjacent Gaussian primitives and results in suboptimal performance. To address this limitation, we propose Anchor-Driven Deformable and Compressed Gaussian Splatting (ADC-GS), a compact and efficient representation for dynamic scene reconstruction. Specifically, ADC-GS organizes Gaussian primitives into an anchor-based structure within the canonical space, enhanced by a temporal significance-based anchor refinement strategy. To reduce deformation redundancy, ADC-GS introduces a hierarchical coarse-to-fine pipeline that captures motions at varying granularities. Moreover, a rate-distortion optimization is adopted to achieve an optimal balance between bitrate consumption and representation fidelity. Experimental results demonstrate that ADC-GS outperforms the per-Gaussian deformation approaches in rendering speed by 300%-800% while achieving state-of-the-art storage efficiency without compromising rendering quality. The code is released at https://github.com/H-Huang774/ADC-GS.git.
中文: 现有4D高斯泼溅方法因逐高斯形变导致效率低下,而提出的ADC-GS通过锚点结构和分层运动管道解决了这一问题,在保持渲染质量的同时实现300%-800%的加速和最优存储效率。
English: Existing 4D Gaussian Splatting methods suffer from inefficiency due to per-Gaussian deformation, but the proposed ADC-GS overcomes this with an anchor-based structure and hierarchical motion pipeline, achieving 300%-800% faster rendering and superior storage efficiency without quality loss.

Authors:Xiannan Huang, Shuhan Qiu
Title: Feature Fitted Online Conformal Prediction for Deep Time Series Forecasting Model
Abstract:
Time series forecasting is critical for many applications, where deep learning-based point prediction models have demonstrated strong performance. However, in practical scenarios, there is also a need to quantify predictive uncertainty through online confidence intervals. Existing confidence interval modeling approaches building upon these deep point prediction models suffer from key limitations: they either require costly retraining, fail to fully leverage the representational strengths of deep models, or lack theoretical guarantees. To address these gaps, we propose a lightweight conformal prediction method that provides valid coverage and shorter interval lengths without retraining. Our approach leverages features extracted from pre-trained point prediction models to fit a residual predictor and construct confidence intervals, further enhanced by an adaptive coverage control mechanism. Theoretically, we prove that our method achieves asymptotic coverage convergence, with error bounds dependent on the feature quality of the underlying point prediction model. Experiments on 12 datasets demonstrate that our method delivers tighter confidence intervals while maintaining desired coverage rates. The code, model, and dataset are available on GitHub: https://github.com/xiannanhuang/FFDCI
中文摘要:本文提出一种轻量级共形预测方法,无需重新训练即可利用预训练模型特征生成有效且更短的置信区间,在12个数据集上验证了其优越性能。
English Summary: This paper introduces a lightweight conformal prediction method that generates valid confidence intervals with shorter lengths without retraining, leveraging features from pre-trained models and demonstrating superior performance across 12 datasets.
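To make the idea of online intervals with adaptive coverage control concrete, the sketch below implements a generic adaptive conformal update: the working miscoverage level is nudged after every observation depending on whether the last interval covered the target. It is an assumption-laden illustration, not the paper's feature-fitted residual predictor. The step size, the absolute-residual score, and all function names are hypothetical.

```python
import math


def empirical_quantile(scores, level):
    """Smallest past score s such that at least ceil(level * n) scores are <= s."""
    if level >= 1.0 or not scores:
        return math.inf  # no information yet, or full coverage requested
    if level <= 0.0:
        return 0.0
    ordered = sorted(scores)
    k = math.ceil(level * len(ordered))
    return ordered[max(k, 1) - 1]


def online_conformal(predictions, targets, alpha=0.1, step=0.05):
    """Build intervals pred +/- radius and adapt the miscoverage level online."""
    alpha_t = alpha
    scores, intervals, misses = [], [], 0
    for pred, y in zip(predictions, targets):
        radius = empirical_quantile(scores, 1.0 - alpha_t)
        lo, hi = pred - radius, pred + radius
        intervals.append((lo, hi))
        err = 0 if lo <= y <= hi else 1
        misses += err
        # Widen future intervals after a miss, tighten them after a cover.
        alpha_t += step * (alpha - err)
        scores.append(abs(y - pred))
    return intervals, misses


# Hypothetical point forecasts (all zero) and a deterministic toy target series.
preds = [0.0] * 50
ys = [((i * 7) % 11 - 5) / 5 for i in range(50)]
intervals, misses = online_conformal(preds, ys)
```

The first interval is unbounded because no residuals have been observed yet; later intervals shrink as the score history grows.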

Authors:Licheng Zhang, Bach Le, Naveed Akhtar, Siew-Kei Lam, Tuan Ngo
Title: Large Language Models for Computer-Aided Design: A Survey
Abstract:
Large Language Models (LLMs) have seen rapid advancements in recent years, with models like ChatGPT and DeepSeek showcasing their remarkable capabilities across diverse domains. While substantial research has been conducted on LLMs in various fields, a comprehensive review focusing on their integration with Computer-Aided Design (CAD) remains notably absent. CAD is the industry standard for 3D modeling and plays a vital role in the design and development of products across different industries. As the complexity of modern designs increases, the potential for LLMs to enhance and streamline CAD workflows presents an exciting frontier. This article presents the first systematic survey exploring the intersection of LLMs and CAD. We begin by outlining the industrial significance of CAD, highlighting the need for AI-driven innovation. Next, we provide a detailed overview of the foundation of LLMs. We also examine both closed-source LLMs and publicly available models. The core of this review focuses on the various applications of LLMs in CAD, providing a taxonomy of six key areas where these models are making considerable impact. Finally, we propose several promising future directions for further advancements, which offer vast opportunities for innovation and are poised to shape the future of CAD technology. GitHub: https://github.com/lichengzhanguom/LLMs-CAD-Survey-Taxonomy
中文摘要:本文首次系统综述了大语言模型与计算机辅助设计的融合应用,梳理了六大关键应用领域并提出了未来研究方向。
English Summary: This article presents the first systematic survey exploring how Large Language Models can enhance Computer-Aided Design workflows, examining six key application areas and proposing future research directions.

Authors:Jiashen Du, Jesse Yao, Allen Liu, Zhekai Zhang
Title: Are LLMs complicated ethical dilemma analyzers?
Abstract:
One open question in the study of Large Language Models (LLMs) is whether they can emulate human ethical reasoning and act as believable proxies for human judgment. To investigate this, we introduce a benchmark dataset comprising 196 real-world ethical dilemmas and expert opinions, each segmented into five structured components: Introduction, Key Factors, Historical Theoretical Perspectives, Resolution Strategies, and Key Takeaways. We also collect non-expert human responses for comparison, limited to the Key Factors section due to their brevity. We evaluate multiple frontier LLMs (GPT-4o-mini, Claude-3.5-Sonnet, Deepseek-V3, Gemini-1.5-Flash) using a composite metric framework based on BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and Universal Sentence Encoder similarity. Metric weights are computed through an inversion-based ranking alignment and pairwise AHP analysis, enabling fine-grained comparison of model outputs to expert responses. Our results show that LLMs generally outperform non-expert humans in lexical and structural alignment, with GPT-4o-mini performing most consistently across all sections. However, all models struggle with historical grounding and proposing nuanced resolution strategies, which require contextual abstraction. Human responses, while less structured, occasionally achieve comparable semantic similarity, suggesting intuitive moral reasoning. These findings highlight both the strengths and current limitations of LLMs in ethical decision-making.
中文摘要:本研究通过比较大语言模型与专家及非专家对人类伦理困境的判断,评估其模拟人类伦理推理的能力,发现虽然模型在文本结构对齐方面表现优异,但在情境抽象和历史依据方面仍存在不足。
English Summary: This study evaluates whether large language models can replicate human ethical reasoning by comparing their structured responses to expert and non-expert human judgments across 196 ethical dilemmas, finding that while LLMs excel in lexical alignment they struggle with contextual abstraction and historical grounding.

Authors:Luu Tung Hai, Thinh D. Le, Zhicheng Ding, Qing Tian, Truong-Son Hy
Title: Topology-Guided Knowledge Distillation for Efficient Point Cloud Processing
Abstract:
Point cloud processing has gained significant attention due to its critical role in applications such as autonomous driving and 3D object recognition. However, deploying high-performance models like Point Transformer V3 in resource-constrained environments remains challenging due to their high computational and memory demands. This work introduces a novel distillation framework that leverages topology-aware representations and gradient-guided knowledge distillation to effectively transfer knowledge from a high-capacity teacher to a lightweight student model. Our approach captures the underlying geometric structures of point clouds while selectively guiding the student model's learning process through gradient-based feature alignment. Experimental results on the NuScenes, SemanticKITTI, and Waymo datasets demonstrate that the proposed method achieves competitive performance, with an approximately 16x reduction in model size and a nearly 1.9x decrease in inference time compared to its teacher model. Notably, on NuScenes, our method achieves state-of-the-art performance among knowledge distillation techniques trained solely on LiDAR data, surpassing prior knowledge distillation baselines in segmentation performance. Our implementation is available publicly at: https://github.com/HySonLab/PointDistill
中文: 本研究提出了一种新颖的蒸馏框架,利用拓扑感知表示和梯度引导对齐将知识从高容量教师模型迁移至轻量级学生模型,在多个数据集上实现了具有竞争力的性能,同时显著减小了模型规模并缩短了推理时间。
English: This study presents a novel distillation framework using topology-aware representations and gradient-guided alignment to transfer knowledge from a high-capacity teacher to a lightweight student model, achieving competitive performance with significantly reduced model size and inference time across multiple datasets.

Authors:Alexandre Cotorobai, Jorge Miguel Silva, Jose Luis Oliveira
Title: A Federated Random Forest Solution for Secure Distributed Machine Learning
Abstract:
Privacy and regulatory barriers often hinder centralized machine learning solutions, particularly in sectors like healthcare where data cannot be freely shared. Federated learning has emerged as a powerful paradigm to address these concerns; however, existing frameworks primarily support gradient-based models, leaving a gap for more interpretable, tree-based approaches. This paper introduces a federated learning framework for Random Forest classifiers that preserves data privacy and provides robust performance in distributed settings. By leveraging PySyft for secure, privacy-aware computation, our method enables multiple institutions to collaboratively train Random Forest models on locally stored data without exposing sensitive information. The framework supports weighted model averaging to account for varying data distributions, incremental learning to progressively refine models, and local evaluation to assess performance across heterogeneous datasets. Experiments on two real-world healthcare benchmarks demonstrate that the federated approach maintains competitive predictive accuracy (within a maximum 9% margin of centralized methods) while satisfying stringent privacy requirements. These findings underscore the viability of tree-based federated learning for scenarios where data cannot be centralized due to regulatory, competitive, or technical constraints. The proposed solution addresses a notable gap in existing federated learning libraries, offering an adaptable tool for secure distributed machine learning tasks that demand both transparency and reliable performance. The tool is available at https://github.com/ieeta-pt/fed_rf.
English Summary: This paper presents a federated learning framework for Random Forest classifiers that enables secure collaborative training across distributed healthcare data while maintaining competitive accuracy within 9% of centralized methods and addressing the existing gap in tree-based federated approaches.

Authors:Yu Cheng, Arushi Goel, Hakan Bilen
Title: Visually Interpretable Subtask Reasoning for Visual Question Answering
Abstract:
Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.
中文: VISTAR是一种子任务驱动的训练框架,通过生成逐步推理解释来增强多模态大语言模型,无需外部模型即可同时提升准确性和可解释性。
English: VISTAR is a subtask-driven training framework that enhances multimodal large language models by generating step-by-step reasoning explanations, improving both accuracy and interpretability without external models.

Authors:Héber H. Arcolezi, Mina Alishahi, Adda-Akram Bendoukha, Nesrine Kaaniche
Title: Fair Play for Individuals, Foul Play for Groups? Auditing Anonymization's Impact on ML Fairness
Abstract:
Machine learning (ML) algorithms are heavily based on the availability of training data, which, depending on the domain, often includes sensitive information about data providers. This raises critical privacy concerns. Anonymization techniques have emerged as a practical solution to address these issues by generalizing features or suppressing data to make it more difficult to accurately identify individuals. Although recent studies have shown that privacy-enhancing technologies can influence ML predictions across different subgroups, thus affecting fair decision-making, the specific effects of anonymization techniques, such as $k$-anonymity, $\ell$-diversity, and $t$-closeness, on ML fairness remain largely unexplored. In this work, we systematically audit the impact of anonymization techniques on ML fairness, evaluating both individual and group fairness. Our quantitative study reveals that anonymization can degrade group fairness metrics by up to four orders of magnitude. Conversely, similarity-based individual fairness metrics tend to improve under stronger anonymization, largely as a result of increased input homogeneity. By analyzing varying levels of anonymization across diverse privacy settings and data distributions, this study provides critical insights into the trade-offs between privacy, fairness, and utility, offering actionable guidelines for responsible AI development. Our code is publicly available at: https://github.com/hharcolezi/anonymity-impact-fairness.
Chinese: 本研究系统评估了匿名化技术对机器学习公平性的影响,发现群体公平性可能显著下降,而个体公平性因数据同质性增强常得到改善,揭示了隐私、公平与实用性之间的关键权衡。
English: This study systematically examines how anonymization techniques impact machine learning fairness, revealing that while group fairness can degrade significantly, individual fairness often improves due to increased data homogeneity, highlighting critical trade-offs between privacy, fairness, and utility.

Authors:Joseph Tooby-Smith
Title: Digitalizing Wick's theorem
Abstract:
Wick's theorem is a cornerstone of perturbative quantum field theory. In this paper we announce and discuss the digitalization of Wick's theorem and its proof into the interactive theorem prover Lean 4 as part of the project PhysLean. We do the same for the static and normal-ordered versions of Wick's theorem.
中文: 本文介绍了将Wick定理及其证明数字化并集成到Lean 4中的工作,作为PhysLean项目的一部分,涵盖了静态和正规序版本的Wick定理。
English: This paper presents the digitalization of Wick's theorem and its proof into Lean 4 as part of the PhysLean project, including its static and normal-ordered versions.

Authors:Yuyang Liu, Liuzhenghao Lv, Xiancheng Zhang, Li Yuan, Yonghong Tian
Title: BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
Abstract:
Biological protocols are fundamental to reproducibility and safety in life science research. While large language models (LLMs) perform well on general tasks, their systematic evaluation on these highly specialized, accuracy-critical, and inherently procedural texts remains limited. In this work, we present BioProBench, the first large-scale, multi-task benchmark for biological protocol understanding and reasoning. While there are several benchmark tasks involving protocol question answering, BioProBench provides a comprehensive suite of five core tasks: Protocol Question Answering, Step Ordering, Error Correction, Protocol Generation, and Protocol Reasoning, enabling a holistic evaluation of LLMs on procedural biological texts. Built upon 27K original protocols, it yields nearly 556K high-quality structured instances. We evaluate 12 mainstream open/closed-source LLMs. Experimental results reveal that some models perform well on basic understanding tasks (e.g., ~70% PQA-Acc., >64% ERR F1), but struggle significantly with deep reasoning and structured generation tasks like ordering and generation. Furthermore, model comparisons show diverse performance: certain open-source models approach closed-source levels on some tasks, yet bio-specific small models lag behind general LLMs, indicating limitations on complex procedural content. Overall, BioProBench, through its task design and experimental findings, systematically reveals the fundamental challenges for current LLMs in procedural knowledge understanding, deep adaptability to specific domains, reliability of structured reasoning, and handling of sophisticated precision and safety constraints, providing key directions for future AI in the field of scientific experiment automation. The code and data are available at: https://github.com/YuyangSunshine/bioprotocolbench and https://huggingface.co/datasets/BioProBench/BioProBench.
中文: BioProBench是首个针对生物实验规程的大规模评估基准,发现大语言模型在基础理解任务表现良好,但在深度推理和结构化生成方面存在明显不足。
English: BioProBench is the first comprehensive benchmark for evaluating large language models on biological protocols, revealing their strengths in basic understanding but significant struggles with deep reasoning and structured generation tasks.

Authors:Qian Xu, Lei Zhang, Yixiao Liu
Title: Enhancing Trust Management System for Connected Autonomous Vehicles Using Machine Learning Methods: A Survey
Abstract:
Connected Autonomous Vehicles (CAVs) operate in dynamic, open, and multi-domain networks, rendering them vulnerable to various threats. Trust Management Systems (TMS) systematically organize essential steps in the trust mechanism, identifying malicious nodes against internal threats and external threats, as well as ensuring reliable decision-making for more cooperative tasks. Recent advances in machine learning (ML) offer significant potential to enhance TMS, especially for the strict requirements of CAVs, such as CAV nodes moving at varying speeds, and opportunistic and intermittent network behavior. Those features distinguish ML-based TMS from social networks, static IoT, and Social IoT. This survey proposes a novel three-layer ML-based TMS framework for CAVs in the vehicle-road-cloud integration system, i.e., trust data layer, trust calculation layer and trust incentive layer. A six-dimensional taxonomy of objectives is proposed. Furthermore, the principles of ML methods for each module in each layer are analyzed. Then, recent studies are categorized based on traffic scenarios that are against the proposed objectives. Finally, future directions are suggested, addressing the open issues and meeting the research trend. We maintain an active repository that contains up-to-date literature and open-source projects at https://github.com/octoberzzzzz/ML-based-TMS-CAV-Survey.
中文: 本综述为互联自动驾驶车辆提出了一种新颖的三层机器学习信任管理系统框架,分析了目标、原理及近期研究,并指出了未来方向和维护了活跃资源库。
English: This survey introduces a novel three-layer machine learning-based Trust Management System framework for Connected Autonomous Vehicles, analyzing objectives, principles, and recent studies while providing future directions and an active repository.

Authors:Chenze Shao, Fandong Meng, Jie Zhou
Title: Continuous Visual Autoregressive Generation via Score Maximization
Abstract:
Conventional wisdom suggests that autoregressive models are used to process discrete data. When applied to continuous modalities such as visual data, Visual AutoRegressive modeling (VAR) typically resorts to quantization-based approaches to cast the data into a discrete space, which can introduce significant information loss. To tackle this issue, we introduce a Continuous VAR framework that enables direct visual autoregressive generation without vector quantization. The underlying theoretical foundation is strictly proper scoring rules, which provide powerful statistical tools capable of evaluating how well a generative model approximates the true distribution. Within this framework, all we need is to select a strictly proper score and set it as the training objective to optimize. We primarily explore a class of training objectives based on the energy score, which is likelihood-free and thus overcomes the difficulty of making probabilistic predictions in the continuous space. Previous efforts on continuous autoregressive generation, such as GIVT and diffusion loss, can also be derived from our framework using other strictly proper scores. Source code: https://github.com/shaochenze/EAR.
Chinese: 本文提出了一种连续视觉自回归框架,通过利用严格适当评分规则(主要采用能量分数)来避免矢量量化,从而在连续空间中直接进行视觉自回归生成而无需信息损失。
English: The paper introduces a Continuous Visual AutoRegressive (VAR) framework that eliminates the need for vector quantization by leveraging strictly proper scoring rules, primarily using the energy score to enable direct autoregressive generation of continuous visual data without information loss.
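As an illustration of the kind of likelihood-free objective the abstract refers to, the sketch below estimates the energy score from model samples, in its loss form E||X - y|| - 0.5 * E||X - X'|| (lower is better). This is a generic sample-based estimator written for this note, not the authors' training code; the function name and the toy 2-D points are assumptions.

```python
import math


def energy_score(samples, target):
    """Sample-based energy score: mean distance to the target minus half the
    mean pairwise distance between distinct samples (lower is better)."""
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples to estimate the spread term")
    fit = sum(math.dist(x, target) for x in samples) / m
    spread = sum(
        math.dist(samples[i], samples[j])
        for i in range(m)
        for j in range(m)
        if i != j
    ) / (m * (m - 1))
    return fit - 0.5 * spread


# Hypothetical 2-D "model samples" around and away from the observed target.
target = (1.0, 0.0)
near = [(0.0, 0.0), (2.0, 0.0)]
far = [(10.0, 0.0), (12.0, 0.0)]

score_near = energy_score(near, target)
score_far = energy_score(far, target)
```

Only samples from the model are needed, so no density has to be evaluated; the spread term rewards diversity and prevents the samples from collapsing onto a single point.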

Authors:Assaf Ben-Kish, Itamar Zimerman, M. Jehanzeb Mirza, Lior Wolf, James Glass, Leonid Karlinsky, Raja Giryes
Title: Overflow Prevention Enhances Long-Context Recurrent LLMs
Abstract:
A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, long contexts remain underutilized. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input, can mitigate recurrent memory failures and be effective for many long-context tasks: on LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results in the challenging LongBench v2 benchmark, showing competitive performance with equivalent size Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance, even in tasks that presumably require cross-context relations.
Chinese: 研究发现,尽管经过扩展训练,循环大语言模型仍未充分利用长上下文,但采用基于分块的推理方法仅处理相关输入部分,可显著提升长上下文任务性能并达到最先进水平。
English: The study finds that recurrent large language models underutilize long contexts despite extended training, but a chunk-based inference method that processes only relevant input portions significantly boosts performance on long-context tasks and achieves state-of-the-art results.
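
The chunk-based inference idea can be illustrated with a small sketch: split the context into fixed-size chunks, score each against the query, and pass only the best chunk to the model. The abstract does not say how relevance is measured, so the word-overlap score below is a stand-in assumption, and `split_into_chunks`, `overlap_score`, and `select_chunk` are hypothetical names introduced only for this example.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def overlap_score(chunk, query_tokens):
    """Stand-in relevance score (assumption): count chunk tokens found in the query."""
    query = set(query_tokens)
    return sum(1 for token in chunk if token.lower() in query)


def select_chunk(context, query, chunk_size=4):
    """Return the single chunk of `context` that scores highest for `query`."""
    chunks = split_into_chunks(context.split(), chunk_size)
    if not chunks:
        return ""
    query_tokens = query.lower().split()
    # max() keeps the earliest chunk when several chunks tie.
    best = max(chunks, key=lambda chunk: overlap_score(chunk, query_tokens))
    return " ".join(best)


if __name__ == "__main__":
    context = "the cat sat on the mat while the dog chased a red ball"
    print(select_chunk(context, "dog chased"))
```

In a real pipeline the selected chunk, rather than the full context, would be fed to the recurrent model together with the query, so the fixed-size memory only has to hold one chunk's worth of information.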

Authors:Xinji Mai, Haotian Xu, Zhong-Zhi Li, Xing W, Weinong Wang, Jian Hu, Yingying Zhang, Wenqiang Zhang
Title: Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving
Abstract:
Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is demonstrating that, as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations where increased training steps lead to increases in the spontaneous code execution frequency, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at https://github.com/yyht/openrlhf_async_pipline.
Chinese: ZeroTIR通过基于结果的强化学习训练大语言模型,使其能自主生成并执行Python代码解决数学问题,研究表明训练进程与代码执行频率、回答长度及任务准确率呈可量化的正相关,显著超越了无工具集成的基准方法。
English: ZeroTIR trains large language models through reinforcement learning to autonomously generate and execute Python code for mathematical reasoning, demonstrating that increased training predictably enhances code usage, response length, and accuracy, significantly outperforming baseline methods without tool integration.

Authors:Quang Vinh Nguyen, Minh Duc Nguyen, Thanh Hoang Son Vo, Hyung-Jeong Yang, Soo-Hyung Kim
Title: Anatomical Attention Alignment representation for Radiology Report Generation
Abstract:
Automated radiology report generation (RRG) aims at producing detailed descriptions of medical images, reducing radiologists' workload and improving access to high-quality diagnostic services. Existing encoder-decoder models rely only on visual features extracted from raw input images, which can limit the understanding of spatial structures and semantic relationships, often resulting in suboptimal text generation. To address this, we propose the Anatomical Attention Alignment Network (A3Net), a framework that enhances visual-textual understanding by constructing hyper-visual representations. Our approach integrates a knowledge dictionary of anatomical structures with patch-level visual features, enabling the model to effectively associate image regions with their corresponding anatomical entities. This structured representation improves semantic reasoning, interpretability, and cross-modal alignment, ultimately enhancing the accuracy and clinical relevance of generated reports. Experimental results on IU X-Ray and MIMIC-CXR datasets demonstrate that A3Net significantly improves both visual perception and text generation quality. Our code is available at https://github.com/Vinh-AI/A3Net.
中文: A3Net框架通过整合解剖学知识与视觉特征,提升了自动化放射学报告生成的语义推理能力和报告准确性,在IU X-Ray和MIMIC-CXR数据集上验证了其优越性能。
English: The A3Net framework enhances automated radiology report generation by integrating anatomical knowledge with visual features, improving semantic reasoning and report accuracy as demonstrated on IU X-Ray and MIMIC-CXR datasets.

Authors:Feng Yuan, Yifan Gao, Wenbin Wu, Keqing Wu, Xiaotong Guo, Jie Jiang, Xin Gao
Title: ABS-Mamba: SAM2-Driven Bidirectional Spiral Mamba Network for Medical Image Translation
Abstract:
Accurate multi-modal medical image translation requires harmonizing global anatomical semantics and local structural fidelity, a challenge complicated by intermodality information loss and structural distortion. We propose ABS-Mamba, a novel architecture integrating the Segment Anything Model 2 (SAM2) for organ-aware semantic representation, specialized convolutional neural networks (CNNs) for preserving modality-specific edge and texture details, and Mamba's selective state-space modeling for efficient long- and short-range feature dependencies. Structurally, our dual-resolution framework leverages SAM2's image encoder to capture organ-scale semantics from high-resolution inputs, while a parallel CNN branch extracts fine-grained local features. The Robust Feature Fusion Network (RFFN) integrates these representations, and the Bidirectional Mamba Residual Network (BMRN) models spatial dependencies using spiral scanning and bidirectional state-space dynamics. A three-stage skip fusion decoder enhances edge and texture fidelity. We employ Efficient Low-Rank Adaptation (LoRA+) fine-tuning to enable precise domain specialization while maintaining the foundational capabilities of the pre-trained components. Extensive experimental validation on the SynthRAD2023 and BraTS2019 datasets demonstrates that ABS-Mamba outperforms state-of-the-art methods, delivering high-fidelity cross-modal synthesis that preserves anatomical semantics and structural details to enhance diagnostic accuracy in clinical applications. The code is available at https://github.com/gatina-yone/ABS-Mamba
中文: ABS-Mamba通过整合SAM2的器官语义感知、CNN的局部特征提取与Mamba的依赖建模,在双分辨率框架下实现了保留解剖结构和细节的高保真跨模态医学图像生成,经临床数据集验证优于现有方法。
English: ABS-Mamba introduces a dual-resolution framework combining SAM2 for organ-level semantics, specialized CNNs for local details, and Mamba for feature dependencies, achieving superior cross-modal medical image synthesis validated on clinical datasets.

Authors:Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang
Title: Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization
Abstract:
Semi-supervised learning (SSL) has emerged as a practical solution for addressing data scarcity challenges by leveraging unlabeled data. Recently, vision-language models (VLMs), pre-trained on massive image-text pairs, have demonstrated remarkable zero-/few-shot performance that often surpasses SSL approaches due to their exceptional generalization capabilities. This gap motivates us to question: how can we effectively harness the powerful generalization capabilities of VLMs into task-specific models? Knowledge distillation (KD) offers a natural framework for transferring VLM capabilities, but we identify that it suffers from gradient conflicts between supervised and distillation losses. To address this challenge, we propose Dual-Head Optimization (DHO), which introduces dual prediction heads for each distinct signal. We observe that DHO resolves gradient conflicts, enabling improved feature learning compared to single-head KD baselines, with practical benefits of minimal computational overhead and test-time hyperparameter tuning without retraining. Extensive experiments across 15 datasets show that DHO consistently outperforms KD baselines, often outperforming teacher models with smaller student models. DHO also achieves new state-of-the-art performance on both in-distribution ImageNet semi-supervised learning and out-of-distribution generalization across ImageNet variants. We publicly release our code and model checkpoints to facilitate future research at https://github.com/erjui/DHO.
中文摘要:提出的双头优化方法有效解决了视觉语言模型知识蒸馏中的梯度冲突问题,在多个数据集上以最小计算开销实现了半监督学习的最先进性能。
English Summary: The proposed Dual-Head Optimization (DHO) method effectively resolves gradient conflicts in knowledge distillation from vision-language models, achieving state-of-the-art performance in semi-supervised learning across multiple datasets with minimal computational overhead.

Authors:Dieu-Donne Fangnon, Armandine Sorel Kouyim Meli, Verlon Roel Mbingui, Phanie Dianelle Negho, Regis Konan Marcel Djaha
Title: A comparative study of Bitcoin and Ripple cryptocurrencies trading using Deep Reinforcement Learning algorithms
Abstract:
Artificial intelligence (AI) has demonstrated remarkable success across various applications. In light of this trend, the field of automated trading has developed a keen interest in leveraging AI techniques to forecast the future prices of financial assets. This interest stems from the need to address trading challenges posed by the inherent volatility and dynamic nature of asset prices. However, crafting a flawless strategy becomes a formidable task when dealing with assets characterized by intricate and ever-changing price dynamics. To surmount these challenges, this research employs an innovative rule-based strategy approach to train Deep Reinforcement Learning (DRL) agents, applied specifically to trading Bitcoin (BTC) and Ripple (XRP). Our proposed approach integrates Deep Q-Network, Double Deep Q-Network, and Dueling Deep Q-Network with the Advantage Actor-Critic algorithm. Each of them aims to yield an optimal policy for our application. To evaluate the effectiveness of our DRL approach, we rely on portfolio wealth and the trade signal as performance metrics. The experimental outcomes show that the Dueling and Double Deep Q-Networks outperformed the other algorithms when trading XRP, as measured by the increase in portfolio wealth. All code is available at https://github.com/VerlonRoelMBINGUI/RL_Final_Projects_AMMI2023.
中文: 本研究运用深度强化学习技术为比特币和瑞波币开发自动化交易策略,实验表明特定网络架构在提升投资组合收益方面表现更优。
English: This study applies deep reinforcement learning techniques to develop automated trading strategies for Bitcoin and Ripple, demonstrating that specific network architectures yield superior portfolio performance.

Authors:Paul Primus, Florian Schmid, Gerhard Widmer
Title: TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
Abstract:
Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-like language-audio models, particularly if they are expected to produce frame-level embeddings, can benefit from stronger temporal supervision. To confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio recordings from Freesound, each annotated with single-sentence free-text descriptions linked to a specific temporal segment in an audio recording. We use large language models to clean these annotations by removing references to non-audible events, transcribed speech, typos, and annotator language bias. We further propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording and demonstrate that our model has better temporal text-audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark. The dataset and our source code are available on Zenodo and GitHub, respectively.
中文: 本研究通过引入带有时序分段描述的数据集和逐帧对比训练方法,增强了音频与文本的时序对齐能力,相比仅使用全局描述的模型,在时序任务上表现更优。
English: The study enhances audio-text alignment by introducing a dataset with temporally segmented descriptions and a frame-wise contrastive training method, improving performance in temporal tasks compared to global caption models.

Authors:LLM-Core Xiaomi: Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, Kai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue
Title: MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
Abstract:
We present MiMo-7B, a large language model built for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
中文: MiMo-7B 是一款专为推理任务优化的语言模型,通过改进的预训练和强化学习方法,在数学、编程及通用推理任务上表现卓越,性能超越更大规模模型及 OpenAI o1-mini。
English: MiMo-7B is a reasoning-optimized language model that demonstrates exceptional performance across mathematics, coding, and general reasoning tasks, surpassing larger models and OpenAI o1-mini through enhanced pre-training and reinforcement learning techniques.

Authors:Junjie Ye, Caishuang Huang, Zhuohan Chen, Wenjie Fu, Chenyuan Yang, Leyi Yang, Yilong Wu, Peng Wang, Meng Zhou, Xiaolong Yang, Tao Gui, Qi Zhang, Zhongchao Shi, Jianping Fan, Xuanjing Huang
Title: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
Abstract:
Instruction following evaluates large language models (LLMs) on their ability to generate outputs that adhere to user-defined constraints. However, existing benchmarks often rely on templated constraint prompts, which lack the diversity of real-world usage and limit fine-grained performance assessment. To fill this gap, we propose a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Building on this framework, we develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting, yielding 1,200 code-verifiable instruction-following test samples. We evaluate 19 LLMs across seven model families and uncover substantial variation in performance across constraint forms. For instance, average performance drops from 77.67% at Level I to 32.96% at Level IV. Furthermore, we demonstrate the utility of our approach by using it to generate data for reinforcement learning, achieving substantial gains in instruction following without degrading general performance. In-depth analysis indicates that these gains stem primarily from modifications to the parameters of the model's attention modules, which enhance constraint recognition and adherence. Code and data are available at https://github.com/Junjie-Ye/MulDimIF.
中文摘要:本研究提出了一个多维约束框架和自动化流程,用于生成多样化的代码可验证测试来评估大语言模型的指令遵循能力,揭示了不同约束形式下的显著性能差异,并证明了其在强化学习中的应用价值——通过注意力模块的参数调整有效提升了模型的约束识别与遵循能力。
English Summary: This study introduces a multi-dimensional constraint framework and automated pipeline to generate diverse, code-verifiable tests for evaluating large language models' instruction-following capabilities, revealing significant performance variations across constraint types and demonstrating its utility for reinforcement learning that enhances constraint adherence through attention module modifications.

Authors:Sarah de Boer, Hartmut Häntze, Kiran Vaidhya Venkadesh, Myrthe A. D. Buser, Gabriel E. Humpire Mamani, Lina Xu, Lisa C. Adams, Jawed Nawabi, Keno K. Bressem, Bram van Ginneken, Mathias Prokop, Alessa Hering
Title: Robust Kidney Abnormality Segmentation: A Validation Study of an AI-Based Framework
Abstract:
Kidney abnormality segmentation has important potential to enhance the clinical workflow, especially in settings requiring quantitative assessments. Kidney volume could serve as an important biomarker for renal diseases, with changes in volume correlating directly with kidney function. Currently, clinical practice often relies on subjective visual assessment for evaluating kidney size and abnormalities, including tumors and cysts, which are typically staged based on diameter, volume, and anatomical location. To support a more objective and reproducible approach, this research aims to develop a robust, thoroughly validated kidney abnormality segmentation algorithm, made publicly available for clinical and research use. We employ publicly available training datasets and leverage the state-of-the-art medical image segmentation framework nnU-Net. Validation is conducted using both proprietary and public test datasets, with segmentation performance quantified by Dice coefficient and the 95th percentile Hausdorff distance. Furthermore, we analyze robustness across subgroups based on patient sex, age, CT contrast phases, and tumor histologic subtypes. Our findings demonstrate that our segmentation algorithm, trained exclusively on publicly available data, generalizes effectively to external test sets and outperforms existing state-of-the-art models across all tested datasets. Subgroup analyses reveal consistent high performance, indicating strong robustness and reliability. The developed algorithm and associated code are publicly accessible at https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation.
Chinese: 本研究利用nnU-Net框架开发了一种稳健的肾脏异常分割算法,通过多样化数据集验证确保了高性能和可靠性,所有代码已公开供临床和研究使用。
English: This research develops a robust kidney abnormality segmentation algorithm using the nnU-Net framework, validated across diverse datasets to ensure high performance and reliability, with all code made publicly available for clinical and research use.

Authors:Kamil Jeziorek, Tomasz Kryjak
Title: Self-Supervised Event Representations: Towards Accurate, Real-Time Perception on SoC FPGAs
Abstract:
Event cameras offer significant advantages over traditional frame-based sensors. These include microsecond temporal resolution, robustness under varying lighting conditions and low power consumption. Nevertheless, the effective processing of their sparse, asynchronous event streams remains challenging. Existing approaches to this problem can be categorised into two distinct groups. The first group involves the direct processing of event data with neural models, such as Spiking Neural Networks or Graph Convolutional Neural Networks. However, this approach is often accompanied by a compromise in terms of qualitative performance. The second group involves the conversion of events into dense representations with handcrafted aggregation functions, which can boost accuracy at the cost of temporal fidelity. This paper introduces a novel Self-Supervised Event Representation (SSER) method leveraging Gated Recurrent Unit (GRU) networks to achieve precise per-pixel encoding of event timestamps and polarities without temporal discretisation. The recurrent layers are trained in a self-supervised manner to maximise the fidelity of event-time encoding. The inference is performed with event representations generated asynchronously, thus ensuring compatibility with high-throughput sensors. The experimental validation demonstrates that SSER outperforms aggregation-based baselines, achieving improvements of 2.4% mAP and 0.6% on the Gen1 and 1 Mpx object detection datasets, respectively. Furthermore, the paper presents the first hardware implementation of recurrent representation for event data on a System-on-Chip FPGA, achieving sub-microsecond latency and power consumption of 1-2 W, suitable for real-time, power-efficient applications. Code is available at https://github.com/vision-agh/RecRepEvent.
中文: 本文提出了一种利用GRU网络的自监督事件表示方法,无需时间离散化即可编码事件数据,在实现卓越目标检测性能的同时,通过硬件部署达到了亚微秒级延迟和低功耗的高效表现。
English: This paper introduces a self-supervised event representation method using GRU networks to encode event data without temporal discretization, achieving superior object detection performance and efficient hardware implementation with sub-microsecond latency and low power consumption.

Authors:Hu Wang, Congbo Ma, Ian Reid, Mohammad Yaqub
Title: Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning
Abstract:
The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. Recently, for language modeling, Group Relative Policy Optimization (GRPO) was proposed to compute the advantage for each output by subtracting the mean reward of all outputs in the group, which serves as the baseline. However, it can lead to high variance when the reward advantage is inaccurately predicted. In this work, we propose the Kalman Filter Enhanced Group Relative Policy Optimization (KRPO) model, which uses lightweight Kalman filtering to dynamically estimate the latent reward baseline and uncertainty. This filtering technique replaces the naive group mean, enabling more adaptive advantage normalization. Our method does not require additional learned parameters over GRPO. This approach offers a simple yet effective way to incorporate multiple outputs of GRPO into advantage estimation, improving policy optimization in settings where highly dynamic reward signals are difficult to model for language models. Through the accuracies and rewards obtained from math question answering and reasoning, we show that, by using a more adaptive advantage estimation model, KRPO can improve the stability and performance of GRPO. The code is available at https://github.com/billhhh/KRPO_LLMs_RL.
中文:提出的KRPO模型通过轻量级卡尔曼滤波器动态估计奖励基线和不确定性,改进了GRPO的优势归一化和策略优化稳定性,且无需额外参数。
English: The proposed KRPO model enhances GRPO by using a lightweight Kalman filter to dynamically estimate the reward baseline and uncertainty, improving advantage normalization and policy optimization stability without extra parameters.
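
The contrast between a group-mean baseline and a filtered baseline can be sketched as follows. This is a minimal illustration, not the KRPO implementation: the scalar random-walk state model, the noise variances `process_var` and `obs_var`, the initial state, and the choice to normalise by the posterior standard deviation are all assumptions made for this example.

```python
import math


def grpo_advantages(rewards):
    """GRPO-style advantages: each reward minus the group mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def kalman_advantages(rewards, process_var=1e-2, obs_var=1.0, eps=1e-8):
    """Hypothetical sketch: a scalar Kalman filter tracks a latent reward baseline.

    Each reward is treated as a noisy observation of the baseline. The
    advantage is the reward minus the filtered baseline, scaled by the
    filter's current uncertainty (an assumed normalisation).
    """
    baseline = rewards[0]  # initial state estimate (assumption)
    variance = 1.0         # initial state variance (assumption)
    advantages = []
    for r in rewards:
        # Predict: the baseline follows a random walk, so uncertainty grows.
        variance += process_var
        # Update: blend the prediction with the new reward observation.
        gain = variance / (variance + obs_var)
        baseline += gain * (r - baseline)
        variance *= 1.0 - gain
        advantages.append((r - baseline) / math.sqrt(variance + eps))
    return advantages


if __name__ == "__main__":
    group_rewards = [0.0, 1.0, 1.0, 0.0, 1.0]
    print(grpo_advantages(group_rewards))
    print(kalman_advantages(group_rewards))
```

Both functions have no learned parameters, which matches the abstract's claim that the filter adds none over GRPO; the two variances are fixed hyperparameters in this sketch.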

Authors:Feng Ding, Tingting Wang, Yupeng Gao, Shuo Yu, Jing Ren, Feng Xia
Title: HALO: Half Life-Based Outdated Fact Filtering in Temporal Knowledge Graphs
Abstract:
Outdated facts in temporal knowledge graphs (TKGs) arise when facts exceed their expiration date, and they negatively impact reasoning performance on TKGs. However, existing reasoning methods primarily focus on the positive importance of historical facts, neglecting the adverse effects of outdated facts. Besides, training on these outdated facts incurs extra computational cost. To address these challenges, we propose an outdated fact filtering framework named HALO, which quantifies the temporal validity of historical facts by exploring the half-life theory to filter outdated facts in TKGs. HALO consists of three modules: the temporal fact attention module, the dynamic relation-aware encoder module, and the outdated fact filtering module. Firstly, the temporal fact attention module captures the evolution of historical facts over time to identify relevant facts. Secondly, the dynamic relation-aware encoder module is designed for efficiently predicting the half-life of each fact. Finally, we construct a time decay function based on the half-life theory to quantify the temporal validity of facts and filter outdated facts. Experimental results show that HALO outperforms the state-of-the-art TKG reasoning methods on three public datasets, demonstrating its effectiveness in detecting and filtering outdated facts. Code is available at https://github.com/yushuowiki/K-Half/tree/main.
中文: HALO框架通过引入半衰期理论,设计三个模块来量化历史事实的时间有效性并过滤过时信息,在三个公开数据集上显著提升了时序知识图谱的推理性能并降低了计算开销。
English: The HALO framework addresses outdated facts in temporal knowledge graphs by applying half-life theory to quantify temporal validity through three modules, enhancing reasoning performance and reducing computational costs on three public datasets.
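
A half-life-based decay function of the kind the abstract describes can be sketched in a few lines. The exponential form below is the standard half-life decay, but its use here, the fixed `threshold`, and the `(fact, timestamp, half_life)` tuple layout are assumptions for illustration. In HALO the half-life is predicted per fact by the encoder module, whereas here it is simply supplied.

```python
def temporal_validity(age, half_life):
    """Standard half-life decay: validity halves every `half_life` time units."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (age / half_life)


def filter_outdated(facts, now, threshold=0.25):
    """Keep facts whose validity at time `now` is at least `threshold`.

    `facts` is a list of (fact, timestamp, half_life) tuples; both the
    tuple layout and the threshold value are hypothetical.
    """
    return [
        fact
        for fact, timestamp, half_life in facts
        if temporal_validity(now - timestamp, half_life) >= threshold
    ]


if __name__ == "__main__":
    facts = [
        ("(A, president_of, X)", 0, 10),  # long half-life, decays slowly
        ("(B, visits, Y)", 0, 2),         # short half-life, decays quickly
    ]
    print(filter_outdated(facts, now=10))
```

Filtering before training is what yields the computational saving the abstract mentions: facts that fall below the threshold never reach the reasoning model.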

Authors:Mohamed Ali Souibgui, Changkyu Choi, Andrey Barsky, Kangsoo Jung, Ernest Valveny, Dimosthenis Karatzas
Title: DocVXQA: Context-Aware Visual Explanations for Document Question Answering
Abstract:
We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model's decisions. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning objectives. Unlike conventional methods that emphasize only the regions pertinent to the answer, our framework delivers explanations that are contextually sufficient while remaining representation-efficient. This fosters user trust while achieving a balance between predictive performance and interpretability in DocVQA applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method. The code is available at https://github.com/dali92002/DocVXQA.
中文: DocVXQA是一种新型视觉自解释文档问答框架,不仅能生成准确答案,还能通过视觉热图突出关键区域,提供上下文充分且表征高效的解释,从而在保证性能的同时增强用户信任。
English: DocVXQA is a visually self-explainable document question answering framework that generates accurate answers while learning visual heatmaps to highlight critical regions, ensuring contextually sufficient and representation-efficient explanations for enhanced user trust and balanced performance.

Authors:Hongkun Dou, Zeyu Li, Xingyu Jiang, Hongjue Li, Lijun Yang, Wen Yao, Yue Deng
Title: You Only Look One Step: Accelerating Backpropagation in Diffusion Sampling with Gradient Shortcuts
Abstract:
Diffusion models (DMs) have recently demonstrated remarkable success in modeling large-scale data distributions. However, many downstream tasks require guiding the generated content based on specific differentiable metrics, typically necessitating backpropagation during the generation process. This approach is computationally expensive, as generating with DMs often demands tens to hundreds of recursive network calls, resulting in high memory usage and significant time consumption. In this paper, we propose a more efficient alternative that approaches the problem from the perspective of parallel denoising. We show that full backpropagation throughout the entire generation process is unnecessary. The downstream metrics can be optimized by retaining the computational graph of only one step during generation, thus providing a shortcut for gradient propagation. The resulting method, which we call Shortcut Diffusion Optimization (SDO), is generic, high-performance, and computationally lightweight, capable of optimizing all parameter types in diffusion sampling. We demonstrate the effectiveness of SDO on several real-world tasks, including controlling generation by optimizing latents and aligning the DMs by fine-tuning network parameters. Compared to full backpropagation, our approach reduces computational costs by approximately 90% while maintaining superior performance. Code is available at https://github.com/deng-ai-lab/SDO.
中文: 本文提出了一种名为“快捷扩散优化”(SDO)的高效计算方法,通过在生成过程中仅保留单步计算图来优化扩散模型,在保持性能的同时将计算成本降低约90%。
English: This paper introduces Shortcut Diffusion Optimization (SDO), a computationally efficient method that optimizes diffusion models by retaining only one step's computational graph during generation, reducing costs by approximately 90% while maintaining performance.

Authors:Wei Li, Ming Hu, Guoan Wang, Lihao Liu, Kaijing Zhou, Junzhi Ning, Xin Guo, Zongyuan Ge, Lixu Gu, Junjun He
Title: Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model
Abstract:
In ophthalmic surgery, developing an AI system capable of interpreting surgical videos and predicting subsequent operations requires numerous ophthalmic surgical videos with high-quality annotations, which are difficult to collect due to privacy concerns and labor consumption. Text-guided video generation (T2V) emerges as a promising solution to overcome this issue by generating ophthalmic surgical videos based on surgeon instructions. In this paper, we present Ophora, a pioneering model that can generate ophthalmic surgical videos following natural language instructions. To construct Ophora, we first propose a Comprehensive Data Curation pipeline to convert narrative ophthalmic surgical videos into a large-scale, high-quality dataset comprising over 160K video-instruction pairs, Ophora-160K. Then, we propose a Progressive Video-Instruction Tuning scheme to transfer rich spatial-temporal knowledge from a T2V model pre-trained on natural video-text datasets for privacy-preserved ophthalmic surgical video generation based on Ophora-160K. Experiments on video quality evaluation via quantitative analysis and ophthalmologist feedback demonstrate that Ophora can generate realistic and reliable ophthalmic surgical videos based on surgeon instructions. We also validate the capability of Ophora for empowering downstream tasks of ophthalmic surgical workflow understanding. Code is available at https://github.com/uni-medical/Ophora.
中文: 本文提出了Ophora模型,能够根据自然语言指令生成高质量眼科手术视频,通过构建专用数据集和渐进式调优方案,有效解决了医疗数据稀缺问题。
English: The paper introduces Ophora, an AI model that generates realistic ophthalmic surgical videos from natural language instructions, addressing data scarcity through a curated dataset and progressive tuning method.

Authors:Peng Sun, Yi Jiang, Tao Lin
Title: Unified Continuous Generative Models
Abstract:
Recent advances in continuous generative models, including multi-step approaches like diffusion and flow-matching (typically requiring 8-1000 sampling steps) and few-step methods such as consistency models (typically 1-8 steps), have demonstrated impressive generative performance. However, existing work often treats these approaches as distinct paradigms, resulting in separate training and sampling methodologies. We introduce a unified framework for training, sampling, and analyzing these models. Our implementation, the Unified Continuous Generative Models Trainer and Sampler (UCGM-{T,S}), achieves state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a 675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID in 20 steps and a few-step model reaching 1.42 FID in just 2 steps. Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at 250 steps) improves performance to 1.06 FID in only 40 steps. Code is available at: https://github.com/LINs-lab/UCGM.
中文: 本文提出了一个统一框架来训练、采样和分析连续生成模型,在减少采样步骤的同时,为多步和少步方法均实现了最先进的性能。
English: This paper introduces a unified framework for training, sampling, and analyzing continuous generative models, achieving state-of-the-art performance with fewer steps across both multi-step and few-step approaches.

Authors:Truc Mai-Thanh Nguyen, Dat Minh Nguyen, Son T. Luu, Kiet Van Nguyen
Title: ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation
Abstract:
Multimodal Review Helpfulness Prediction (MRHP) is an essential task in recommender systems, particularly in E-commerce platforms. Determining the helpfulness of user-generated reviews enhances user experience and improves consumer decision-making. However, existing datasets focus predominantly on English and Indonesian, resulting in a lack of linguistic diversity, especially for low-resource languages such as Vietnamese. In this paper, we introduce ViMRHP (Vietnamese Multimodal Review Helpfulness Prediction), a large-scale benchmark dataset for the MRHP task in Vietnamese. This dataset covers four domains, including 2K products with 46K reviews. Building a dataset at this scale, however, requires considerable time and cost. To optimize the annotation process, we leverage AI to assist annotators in constructing the ViMRHP dataset. With AI assistance, annotation time is reduced (from 90 to 120 seconds per task down to 20 to 40 seconds per task) while maintaining data quality and lowering overall costs by approximately 65%. However, AI-generated annotations still have limitations in complex annotation tasks, which we further examine through a detailed performance analysis. In our experiment on ViMRHP, we evaluate baseline models on human-verified and AI-generated annotations to assess their quality differences. The ViMRHP dataset is publicly available at https://github.com/trng28/ViMRHP
Chinese: 本文介绍了ViMRHP这一大规模越南语多模态评论有用性预测数据集,通过AI辅助显著降低了标注时间和成本并保持质量,弥补了现有资源中语言多样性的不足。
English: This paper introduces ViMRHP, a large-scale Vietnamese multimodal review helpfulness prediction dataset that utilizes AI assistance to significantly reduce annotation time and costs while maintaining quality, addressing the lack of linguistic diversity in existing resources.

Authors:Wenhao Hu, Paul Henderson, José Cano
Title: ICE-Pruning: An Iterative Cost-Efficient Pruning Pipeline for Deep Neural Networks
Abstract:
Pruning is a widely used method for compressing Deep Neural Networks (DNNs), where less relevant parameters are removed from a DNN model to reduce its size. However, removing parameters reduces model accuracy, so pruning is typically combined with fine-tuning, and sometimes other operations such as rewinding weights, to recover accuracy. A common approach is to repeatedly prune and then fine-tune, with increasing amounts of model parameters being removed in each step. While straightforward to implement, pruning pipelines that follow this approach are computationally expensive due to the need for repeated fine-tuning. In this paper we propose ICE-Pruning, an iterative pruning pipeline for DNNs that significantly decreases the time required for pruning by reducing the overall cost of fine-tuning, while maintaining a similar accuracy to existing pruning pipelines. ICE-Pruning is based on three main components: i) an automatic mechanism to determine after which pruning steps fine-tuning should be performed; ii) a freezing strategy for faster fine-tuning in each pruning step; and iii) a custom pruning-aware learning rate scheduler to further improve the accuracy of each pruning step and reduce the overall time consumption. We also propose an efficient auto-tuning stage for the hyperparameters (e.g., freezing percentage) introduced by the three components. We evaluate ICE-Pruning on several DNN models and datasets, showing that it can accelerate pruning by up to 9.61x. Code is available at https://github.com/gicLAB/ICE-Pruning
Chinese: ICE-Pruning是一种创新的迭代剪枝流程,通过优化微调过程在保持模型精度的同时显著加速深度神经网络压缩,最高可实现9.61倍的加速效果。
English: ICE-Pruning is an innovative iterative pipeline that accelerates deep neural network compression by optimizing fine-tuning processes while preserving model accuracy, achieving up to 9.61x speedup.
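
The prune-then-fine-tune loop described in the abstract can be illustrated with a minimal sketch. It assumes plain magnitude pruning on a flat list of weights and uses a simple accuracy-drop threshold to decide when to fine-tune. The threshold rule, the function names, and the `max_drop` parameter are hypothetical stand-ins for illustration; they are not ICE-Pruning's actual mechanism, freezing strategy, or scheduler.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def iterative_prune(weights, sparsities, evaluate, fine_tune, max_drop=0.01):
    """Prune in steps; fine-tune only when accuracy falls too far.

    The threshold test below is a hypothetical criterion chosen for
    illustration, not the criterion used by ICE-Pruning.
    """
    baseline = evaluate(weights)
    fine_tune_calls = 0
    for sparsity in sparsities:
        weights = magnitude_prune(weights, sparsity)
        if baseline - evaluate(weights) > max_drop:
            weights = fine_tune(weights)
            fine_tune_calls += 1
    return weights, fine_tune_calls
```

Here `evaluate` and `fine_tune` are caller-supplied callables. Skipping fine-tuning on steps where accuracy barely moves is the kind of saving the abstract attributes to reducing the overall cost of fine-tuning.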

Authors:Chunpeng Li, Ya-tang Li
Title: Feature Visualization in 3D Convolutional Neural Networks
Abstract:
Understanding the computations of convolutional neural networks requires effective visualization of their kernels. While maximal activation methods have proven successful in highlighting the preferred features of 2D convolutional kernels, directly applying these techniques to 3D convolutions often leads to uninterpretable results due to the higher dimensionality and complexity of 3D features. To address this challenge, we propose a novel visualization approach for 3D convolutional kernels that disentangles their texture and motion preferences. Our method begins with a data-driven decomposition of the optimal input that maximally activates a given kernel. We then introduce a two-stage optimization strategy to extract distinct texture and motion components from this input. Applying our approach to visualize kernels at various depths of several pre-trained models, we find that the resulting visualizations--particularly those capturing motion--clearly reveal the preferred dynamic patterns encoded by 3D kernels. These results demonstrate the effectiveness of our method in providing interpretable insights into 3D convolutional operations. Code is available at https://github.com/YatangLiLab/3DKernelVisualizer.
Chinese: 本文提出了一种新颖的三维卷积核可视化方法,通过两阶段优化策略分离其纹理和运动偏好,从而清晰揭示其编码的动态模式,增强了对三维卷积操作的可解释性。
English: This paper introduces a novel visualization method for 3D convolutional kernels that separates their texture and motion preferences through a two-stage optimization strategy, providing interpretable insights into their dynamic patterns.

Authors:Yuqi Cheng, Yunkang Cao, Dongfang Wang, Weiming Shen, Wenlong Li
Title: Boosting Global-Local Feature Matching via Anomaly Synthesis for Multi-Class Point Cloud Anomaly Detection
Abstract:
Point cloud anomaly detection is essential for various industrial applications. The huge computation and storage costs caused by the increasing number of product classes limit the application of single-class unsupervised methods, necessitating the development of multi-class unsupervised methods. However, the feature similarity between normal and anomalous points from different classes leads to the feature confusion problem, which greatly hinders the performance of multi-class methods. Therefore, we introduce a multi-class point cloud anomaly detection method, named GLFM, leveraging global-local feature matching to progressively separate data that are prone to confusion across multiple classes. Specifically, GLFM is structured into three stages: Stage-I proposes an anomaly synthesis pipeline that stretches point clouds to create abundant anomaly data that are utilized to adapt the point cloud feature extractor for better feature representation. Stage-II establishes the global and local memory banks according to the global and local feature distributions of all the training data, weakening the impact of feature confusion on the establishment of the memory bank. Stage-III implements anomaly detection of test data leveraging its feature distance from global and local memory banks. Extensive experiments on the MVTec 3D-AD and Real3D-AD datasets and on a dataset of actual industrial parts demonstrate the superior point cloud anomaly detection performance of GLFM. The code is available at https://github.com/hustCYQ/GLFM-Multi-class-3DAD.
中文: GLFM是一种创新的多类别点云异常检测方法,通过三阶段全局-局部特征匹配策略有效解决特征混淆问题,在多个数据集上展现出卓越的检测性能。
English: GLFM is a novel multi-class point cloud anomaly detection method that employs a three-stage global-local feature matching strategy to effectively address feature confusion and enhance detection performance across diverse datasets.

Authors:Zhiye Xie, Enmei Tu, Xianping Fu, Guoliang Yuan, Yi Han
Title: AIS Data-Driven Maritime Monitoring Based on Transformer: A Comprehensive Review
Abstract:
With the increasing demands for safety, efficiency, and sustainability in global shipping, Automatic Identification System (AIS) data plays an increasingly important role in maritime monitoring. AIS data contains spatial-temporal variation patterns of vessels that hold significant research value in the marine domain. However, due to its massive scale, the full potential of AIS data has long remained untapped. With its powerful sequence modeling capabilities, particularly its ability to capture long-range dependencies and complex temporal dynamics, the Transformer model has emerged as an effective tool for processing AIS data. Therefore, this paper reviews the research on Transformer-based AIS data-driven maritime monitoring, providing a comprehensive overview of the current applications of Transformer models in the marine field. The focus is on Transformer-based trajectory prediction methods, behavior detection, and prediction techniques. Additionally, this paper collects and organizes publicly available AIS datasets from the reviewed papers, performing data filtering, cleaning, and statistical analysis. The statistical results reveal the operational characteristics of different vessel types, providing data support for further research on maritime monitoring tasks. Finally, we offer valuable suggestions for future research, identifying two promising research directions. Datasets are available at https://github.com/eyesofworld/Maritime-Monitoring.
中文: 本文综述了基于Transformer模型的AIS数据在海事监测中的应用,重点关注轨迹预测和行为检测,并整理了公开数据集以支持后续研究。
English: This paper reviews Transformer-based AIS data applications in maritime monitoring, focusing on trajectory prediction and behavior detection while organizing public datasets to support further research.

Authors:Prateek Garg, Lokesh Nagalapatti, Sunita Sarawagi
Title: From Search To Sampling: Generative Models For Robust Algorithmic Recourse
Abstract:
Algorithmic Recourse provides recommendations to individuals who are adversely impacted by automated model decisions, on how to alter their profiles to achieve a favorable outcome. Effective recourse methods must balance three conflicting goals: proximity to the original profile to minimize cost, plausibility for realistic recourse, and validity to ensure the desired outcome. We show that existing methods train for these objectives separately and then search for recourse through a joint optimization over the recourse goals during inference, leading to poor recourse recommendations. We introduce GenRe, a generative recourse model designed to train the three recourse objectives jointly. Training such generative models is non-trivial due to the lack of direct recourse supervision. We propose efficient ways to synthesize such supervision and further show that GenRe's training leads to a consistent estimator. Unlike most prior methods that employ non-robust gradient-descent-based search during inference, GenRe simply performs forward sampling over the generative model to produce minimum-cost recourse, leading to superior performance across multiple metrics. We also demonstrate that GenRe provides the best trade-off between cost, plausibility and validity, compared to state-of-the-art baselines. Our code is available at: https://github.com/prateekgargx/genre.
中文摘要:GenRe是一种生成式救济模型,通过联合训练邻近性、合理性和有效性目标,利用前向采样生成最低成本救济方案,在多项指标上优于现有方法。
English Summary: GenRe is a generative recourse model that jointly trains for proximity, plausibility, and validity to provide superior recourse recommendations through forward sampling, outperforming existing methods across multiple metrics.

Authors:Keyue Qiu, Yuxuan Song, Zhehuan Fan, Peidong Liu, Zhe Zhang, Mingyue Zheng, Hao Zhou, Wei-Ying Ma
Title: Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule
Abstract:
Structure-Based Drug Design (SBDD) is crucial for identifying bioactive molecules. Recent deep generative models face challenges in geometric structure modeling. A major bottleneck lies in the twisted probability path of multi-modalities -- continuous 3D positions and discrete 2D topologies -- which jointly determine molecular geometries. By establishing the fact that noise schedules decide the Variational Lower Bound (VLB) for the twisted probability path, we propose the VLB-Optimal Scheduling (VOS) strategy in this under-explored area, which optimizes VLB as a path integral for SBDD. Our model effectively enhances molecular geometries and interaction modeling, achieving a state-of-the-art PoseBusters passing rate of 95.9% on CrossDock, more than 10% improvement upon strong baselines, while maintaining high affinities and robust intramolecular validity evaluated on a held-out test set. Code is available at https://github.com/AlgoMole/MolCRAFT.
中文摘要:本研究提出的VLB最优调度策略通过优化噪声调度方案,解决了基于结构的药物设计中几何建模的多模态概率路径扭曲问题,在CrossDock数据集上实现了95.9%的最优姿态通过率,性能提升超过10%。
English Summary: The proposed VLB-Optimal Scheduling (VOS) strategy addresses geometric modeling challenges in Structure-Based Drug Design by optimizing noise schedules, achieving a state-of-the-art 95.9% PoseBusters passing rate with significant performance improvements.

Authors:Javier Salazar Cavazos, Jeffrey A. Fessler, Laura Balzano
Title: ALPCAH: Subspace Learning for Sample-wise Heteroscedastic Data
Abstract:
Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a subspace learning method, named ALPCAH, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace basis associated with the low-rank structure of the data. Our method makes no distributional assumptions of the low-rank component and does not assume that the noise variances are known. Further, this method uses a soft rank constraint that does not require subspace dimension to be known. Additionally, this paper develops a matrix factorized version of ALPCAH, named LR-ALPCAH, that is much faster and more memory efficient at the cost of requiring subspace dimension to be known or estimated. Simulations and real data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing algorithms. Code available at https://github.com/javiersc1/ALPCAH.
中文: 本文提出了ALPCAH方法,通过估计样本噪声方差来改进异构数据的子空间基估计,无需分布假设或已知噪声方差,并开发了更快的矩阵分解版本LR-ALPCAH。
English: This paper introduces ALPCAH, a subspace learning method that estimates sample-wise noise variances to enhance subspace basis estimation for heterogeneous data without requiring distributional assumptions or known noise variances, with its faster matrix-factorized version LR-ALPCAH also presented.

Authors:Jiwoo Hong, Noah Lee, Eunki Kim, Guijin Son, Woojin Chung, Aman Gupta, Shao Tang, James Thorne
Title: On the Robustness of Reward Models for Language Model Alignment
Abstract:
The Bradley-Terry (BT) model is widely used in reward modeling for reinforcement learning from human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with BT model loss are prone to over-optimization, losing generalizability to unseen input distributions. In this paper, we study the cause of over-optimization in RM training and its downstream effects on the RLHF procedure, accentuating the importance of distributional robustness of RMs in unseen data. First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Then, we propose batch-wise sum-to-zero regularization (BSR) to enforce zero-centered reward sum per batch, constraining the rewards with extreme magnitudes. We assess the impact of BSR in improving robustness in RMs through four scenarios of over-optimization, where BSR consistently manifests better robustness. Subsequently, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, surpassing state-of-the-art RMs at the 8B scale by more than 5% in complex preference prediction tasks. With RLOO training using 8B RMs, generation length on AlpacaEval 2.0 is reduced by 40% alongside a 7% increase in win rate, further highlighting that robustness in RMs induces robustness in RLHF training. We release the code, data, and models: https://github.com/LinkedIn-XFACT/RM-Robustness.
中文: Bradley-Terry模型在人类反馈强化学习中易因隐藏状态范数过度分散导致过优化,而提出的批处理零和正则化通过约束奖励极值增强了奖励模型的分布鲁棒性,显著提升了策略与黄金偏好模型的对齐效果。
English: The Bradley-Terry model in RLHF suffers from over-optimization due to excessive dispersion of hidden state norms, but the proposed batch-wise sum-to-zero regularization enhances reward model robustness and improves policy alignment in unseen data scenarios.
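
To make the idea of a zero-centered reward sum per batch concrete, the following is a minimal sketch of a Bradley-Terry loss with an added sum-to-zero penalty. The abstract does not give the exact form of the regularizer, so the squared batch sum used here, the weight `lam`, and all function names are assumptions made for illustration only.

```python
import math


def bt_loss(chosen, rejected):
    """Mean Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    # -log sigmoid(x) equals log(1 + exp(-x)).
    losses = [math.log1p(math.exp(-(c - r))) for c, r in zip(chosen, rejected)]
    return sum(losses) / len(losses)


def sum_to_zero_penalty(chosen, rejected):
    """Squared sum of all rewards in the batch; zero when they cancel out."""
    total = sum(chosen) + sum(rejected)
    return total ** 2


def bt_loss_with_bsr(chosen, rejected, lam=0.01):
    """BT loss plus a hypothetical batch-wise sum-to-zero term."""
    # `lam` is an assumed regularization weight, not a value from the paper.
    return bt_loss(chosen, rejected) + lam * sum_to_zero_penalty(chosen, rejected)
```

Because the BT term depends only on reward differences, it is unchanged when every reward in the batch is shifted by the same constant. A penalty on the batch sum removes that freedom and discourages rewards from drifting toward extreme magnitudes.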

Authors:Yi Zhang, Ruihong Qiu, Xuwei Xu, Jiajun Liu, Sen Wang
Title: DARLR: Dual-Agent Offline Reinforcement Learning for Recommender Systems with Dynamic Reward
Abstract:
Model-based offline reinforcement learning (RL) has emerged as a promising approach for recommender systems, enabling effective policy learning by interacting with frozen world models. However, the reward functions in these world models, trained on sparse offline logs, often suffer from inaccuracies. Specifically, existing methods face two major limitations in addressing this challenge: (1) deterministic use of reward functions as static look-up tables, which propagates inaccuracies during policy learning, and (2) static uncertainty designs that fail to effectively capture decision risks and mitigate the impact of these inaccuracies. In this work, a dual-agent framework, DARLR, is proposed to dynamically update world models to enhance recommendation policies. To achieve this, a selector is introduced to identify reference users by balancing similarity and diversity so that the recommender can aggregate information from these users and iteratively refine reward estimations for dynamic reward shaping. Further, the statistical features of the selected users guide the dynamic adaptation of an uncertainty penalty to better align with evolving recommendation requirements. Extensive experiments on four benchmark datasets demonstrate the superior performance of DARLR, validating its effectiveness. The code is available at https://github.com/ArronDZhang/DARLR.
中文: DARLR提出了一种双智能体框架,通过选择器和推荐器动态更新世界模型并调整不确定性惩罚,有效提升了离线强化学习在推荐系统中的奖励准确性和整体性能。
English: DARLR introduces a dual-agent framework with a selector and recommender to dynamically update world models and adapt uncertainty penalties, enhancing offline reinforcement learning for recommender systems by improving reward accuracy and performance.

Authors:Jiashuo Sun, Xianrui Zhong, Sizhe Zhou, Jiawei Han
Title: DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) systems combine large language models (LLMs) with external knowledge retrieval, making them highly effective for knowledge-intensive tasks. A crucial but often under-explored component of these systems is the reranker. Since irrelevant documents in RAG systems can mislead the generator, the reranker plays a vital role in refining retrieved documents to enhance generation quality and explainability. However, it is challenging to determine the appropriate number of documents (k) that the reranker should select: too few may result in missing critical information, while too many introduce noise and inefficiencies. Although recent studies have explored LLM-based rerankers, they primarily leverage internal model knowledge and overlook the rich supervisory signals that LLMs can provide, such as using response quality as feedback for optimizing reranking decisions. In this paper, we propose DynamicRAG, a novel RAG framework where the reranker dynamically adjusts both the order and number of retrieved documents based on the query. We model the reranker as an agent optimized through reinforcement learning (RL), using rewards derived from LLM output quality. Across seven knowledge-intensive datasets, DynamicRAG demonstrates superior performance, achieving state-of-the-art results among models of the same parameter size. The model, data and code are available at https://github.com/GasolSun36/DynamicRAG.
中文摘要:DynamicRAG提出了一种基于强化学习的动态重排器,通过大语言模型输出质量作为反馈信号,在检索增强生成系统中自适应调整文档选择顺序和数量,在多个知识密集型数据集上实现了最优性能。
English Summary: DynamicRAG introduces a reinforcement learning-optimized reranker that dynamically adjusts document selection in retrieval-augmented generation systems, achieving state-of-the-art performance across multiple datasets by using LLM output quality as feedback.

Authors:Hongda Qin, Xiao Lu, Zhiyong Wei, Yihong Cao, Kailun Yang, Ningjiang Chen
Title: Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection
Abstract:
Generalizing an object detector trained on a single domain to multiple unseen domains is a challenging task. Existing methods typically introduce image or feature augmentation to diversify the source domain and raise the robustness of the detector. Vision-Language Model (VLM)-based augmentation techniques have been proven to be effective, but they require that the detector's backbone has the same structure as the image encoder of the VLM, limiting the choice of detector framework. To address this problem, we propose Language-Driven Dual Style Mixing (LDDS) for single-domain generalization, which diversifies the source domain by fully utilizing the semantic information of the VLM. Specifically, we first construct prompts to transfer style semantics embedded in the VLM to an image translation network. This facilitates the generation of style-diversified images with explicit semantic information. Then, we propose image-level style mixing between the diversified images and source domain images. This effectively mines the semantic information for image augmentation without relying on specific augmentation selections. Finally, we propose feature-level style mixing in a double-pipeline manner, making feature augmentation model-agnostic so that it works seamlessly with mainstream detector frameworks, including one-stage, two-stage, and transformer-based detectors. Extensive experiments demonstrate the effectiveness of our approach across various benchmark datasets, including real to cartoon and normal to adverse weather tasks. The source code and pre-trained models will be publicly available at https://github.com/qinhongda8/LDDS.
Chinese: 提出的语言驱动双重风格混合(LDDS)方法通过利用视觉语言模型的语义信息多样化源域,实现了与多种检测器框架兼容的模型无关图像和特征增强,从而提升单域泛化能力。
English: The proposed Language-Driven Dual Style Mixing (LDDS) method enhances single-domain generalization by diversifying the source domain through VLM-based semantic information, enabling model-agnostic image and feature augmentation that works with various detector frameworks.

Authors:Yifan Wei, Xiaoyan Yu, Tengfei Pan, Angsheng Li, Li Du
Title: Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs
Abstract:
Large language models (LLMs) have achieved unprecedented performance by leveraging vast pretraining corpora, yet their performance remains suboptimal in knowledge-intensive domains such as medicine and scientific research, where high factual precision is required. While synthetic data provides a promising avenue for augmenting domain knowledge, existing methods frequently generate redundant samples that do not align with the model's true knowledge gaps. To overcome this limitation, we propose a novel Structural Entropy-guided Knowledge Navigator (SENATOR) framework that addresses the intrinsic knowledge deficiencies of LLMs. Our approach employs the Structure Entropy (SE) metric to quantify uncertainty along knowledge graph paths and leverages Monte Carlo Tree Search (MCTS) to selectively explore regions where the model lacks domain-specific knowledge. Guided by these insights, the framework generates targeted synthetic data for supervised fine-tuning, enabling continuous self-improvement. Experimental results on LLaMA-3 and Qwen2 across multiple domain-specific benchmarks show that SENATOR effectively detects and repairs knowledge deficiencies, achieving notable performance improvements. The code and data for our methods and experiments are available at https://github.com/weiyifan1023/senator.
中文摘要:SENATOR框架通过结构熵和蒙特卡洛树搜索精准识别大语言模型的知识盲区,并生成针对性合成数据进行监督微调,有效提升了模型在专业领域的表现。
English Summary: The SENATOR framework uses structural entropy and Monte Carlo Tree Search to identify and fill knowledge gaps in large language models through targeted synthetic data generation, significantly improving their performance in specialized domains.

Authors:Jeongho Kim, Chanyeong Heo, Jaehee Jung
Title: ReCDAP: Relation-Based Conditional Diffusion with Attention Pooling for Few-Shot Knowledge Graph Completion
Abstract:
Knowledge Graphs (KGs), composed of triples in the form of (head, relation, tail) and consisting of entities and relations, play a key role in information retrieval systems such as question answering, entity search, and recommendation. In real-world KGs, although many entities exist, the relations exhibit a long-tail distribution, which can hinder information retrieval performance. Previous few-shot knowledge graph completion studies focused exclusively on the positive triple information that exists in the graph or, when negative triples were incorporated, used them merely as a signal to indicate incorrect triples. To overcome this limitation, we propose Relation-Based Conditional Diffusion with Attention Pooling (ReCDAP). First, negative triples are generated by randomly replacing the tail entity in the support set. By conditionally incorporating positive information in the KG and non-existent negative information into the diffusion process, the model separately estimates the latent distributions for positive and negative relations. Moreover, including an attention pooler enables the model to leverage the differences between positive and negative cases explicitly. Experiments on two widely used datasets demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance. The code is available at https://github.com/hou27/ReCDAP-FKGC.
中文摘要:知识图谱存在关系长尾分布问题,而提出的ReCDAP模型通过生成负三元组并采用条件扩散与注意力池化来区分正负关系,从而提升了图谱补全性能,实现了最先进的成果。
English Summary: Knowledge graphs face challenges from long-tail relation distributions, but the proposed ReCDAP model improves completion by generating negative triples and using conditional diffusion with attention pooling to distinguish positive and negative relations, achieving state-of-the-art results.
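
The negative-sampling step mentioned in the abstract (randomly replacing the tail entity of each support-set triple) can be sketched as follows. This is only an illustration of that one step: the function name `corrupt_tails`, the `seed` argument, and the choice of one negative per positive are assumptions, and the diffusion model and attention pooler are not shown.

```python
import random


def corrupt_tails(support_triples, entities, seed=0):
    """Create one negative triple per positive by swapping in a random tail."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    negatives = []
    for head, relation, tail in support_triples:
        # Exclude the true tail so the corrupted triple differs from the positive.
        candidates = [e for e in entities if e != tail]
        negatives.append((head, relation, rng.choice(candidates)))
    return negatives
```

This simple version does not check whether a corrupted triple happens to exist elsewhere in the graph; filtering against known triples would be a natural extension.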

Authors:Zheng Yao, Shuai Wang, Guido Zuccon
Title: Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition
Abstract:
Dense retrievers utilize pre-trained backbone language models (e.g., BERT, LLaMA) that are fine-tuned via contrastive learning to perform the task of encoding text into dense representations that can then be compared via a shallow similarity operation, e.g. inner product. Recent research has questioned the role of fine-tuning vs. that of pre-training within dense retrievers, specifically arguing that retrieval knowledge is primarily gained during pre-training, meaning knowledge not acquired during pre-training cannot be subsequently acquired via fine-tuning. We revisit this idea here as the claim was only studied in the context of a BERT-based encoder using DPR as a representative dense retriever. We extend the previous analysis by testing other representation approaches (comparing the use of CLS tokens with that of mean pooling), backbone architectures (encoder-only BERT vs. decoder-only LLaMA), and additional datasets (MSMARCO in addition to Natural Questions). Our study confirms that in DPR tuning, pre-trained knowledge underpins retrieval performance, with fine-tuning primarily adjusting neuron activation rather than reorganizing knowledge. However, this pattern does not hold universally, such as in mean-pooled (Contriever) and decoder-based (LLaMA) models. We ensure full reproducibility and make our implementation publicly available at https://github.com/ielab/DenseRetriever-Knowledge-Acquisition.
Chinese: 密集检索器的性能主要依赖于预训练知识,微调仅调整神经元激活而非重组知识,但这一模式在如Contriever和LLaMA等模型中并不普遍适用。
English: Dense retrievers rely heavily on pre-trained knowledge for performance, with fine-tuning mainly adjusting neuron activations rather than reorganizing knowledge, though this pattern varies across models like Contriever and LLaMA.
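
The two representation approaches compared in the abstract, CLS-token pooling and mean pooling, together with inner-product scoring, can be illustrated with a small sketch. It operates on plain lists of per-token vectors; the function names are hypothetical, and a real encoder would typically also apply an attention mask so that padding tokens are excluded from the mean.

```python
def cls_pool(token_vectors):
    """Use the first token's vector as the text representation."""
    return list(token_vectors[0])


def mean_pool(token_vectors):
    """Average all token vectors dimension by dimension."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def inner_product(u, v):
    """Shallow similarity between a query and a passage representation."""
    return sum(a * b for a, b in zip(u, v))
```

For example, with token vectors `[[1.0, 0.0], [3.0, 2.0]]`, CLS pooling keeps only the first vector while mean pooling blends both, so the same passage can score differently against the same query depending on the pooling choice.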

Authors:SangEun Lee, Yubeen Lee, Eunil Park
Title: EmoVLM-KD: Fusing Distilled Expertise with Vision-Language Models for Visual Emotion Analysis
Abstract:
Visual emotion analysis, which has gained considerable attention in the field of affective computing, aims to predict the dominant emotions conveyed by an image. Despite advancements in visual emotion analysis with the emergence of vision-language models, we observed that instruction-tuned vision-language models and conventional vision models exhibit complementary strengths in visual emotion analysis, as vision-language models excel in certain cases, whereas vision models perform better in others. This finding highlights the need to integrate these capabilities to enhance the performance of visual emotion analysis. To bridge this gap, we propose EmoVLM-KD, an instruction-tuned vision-language model augmented with a lightweight module distilled from conventional vision models. Instead of deploying both models simultaneously, which incurs high computational costs, we transfer the predictive patterns of a conventional vision model into the vision-language model using a knowledge distillation framework. Our approach first fine-tunes a vision-language model on emotion-specific instruction data and then attaches a distilled module to its visual encoder while keeping the vision-language model frozen. Predictions from the vision-language model and the distillation module are balanced by a gate module, which then generates the final outcome. Extensive experiments show that EmoVLM-KD achieves state-of-the-art performance on multiple visual emotion analysis benchmark datasets, outperforming existing methods while maintaining computational efficiency. The code is available at https://github.com/sange1104/EmoVLM-KD.
Chinese: EmoVLM-KD是一种创新框架,通过知识蒸馏将指令调优的视觉语言模型与传统视觉模型的优势相结合,在保持计算效率的同时实现了视觉情感分析的最先进性能。
English: EmoVLM-KD is a novel framework that enhances visual emotion analysis by integrating the strengths of instruction-tuned vision-language models and conventional vision models through knowledge distillation, achieving state-of-the-art performance while maintaining computational efficiency.

Authors:Zihang Liu, Zhenyu Zhang, Hao Tang
Title: Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution
Abstract:
Diffusion-based image super-resolution (SR) methods have demonstrated remarkable performance. Recent advancements have introduced deterministic sampling processes that reduce inference from 15 iterative steps to a single step, thereby significantly improving the inference speed of existing diffusion models. However, their efficiency remains limited when handling complex semantic regions due to the single-step inference. To address this limitation, we propose SAMSR, a semantic-guided diffusion framework that incorporates semantic segmentation masks into the sampling process. Specifically, we introduce the SAM-Noise Module, which refines Gaussian noise using segmentation masks to preserve spatial and semantic features. Furthermore, we develop a pixel-wise sampling strategy that dynamically adjusts the residual transfer rate and noise strength based on pixel-level semantic weights, prioritizing semantically rich regions during the diffusion process. To enhance model training, we also propose a semantic consistency loss, which aligns pixel-wise semantic weights between predictions and ground truth. Extensive experiments on both real-world and synthetic datasets demonstrate that SAMSR significantly improves perceptual quality and detail recovery, particularly in semantically complex images. Our code is released at https://github.com/Liu-Zihang/SAMSR.
中文: SAMSR是一种语义引导的扩散框架,通过引入分割掩码优化噪声并优先处理语义丰富区域,显著提升了复杂图像的超分辨率感知质量和细节恢复能力。
English: SAMSR is a semantic-guided diffusion framework that enhances image super-resolution by integrating segmentation masks to refine noise and prioritize semantically rich regions, significantly improving perceptual quality and detail recovery in complex images.

Authors:Zhengye Zhang, Sirui Zhao, Shifeng Liu, Shukang Yin, Xinglong Mao, Tong Xu, Enhong Chen
Title: MELLM: Exploring LLM-Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception
Abstract:
Micro-expressions (MEs) are crucial psychological responses with significant potential for affective computing. However, current automatic micro-expression recognition (MER) research primarily focuses on discrete emotion classification, neglecting a convincing analysis of the subtle dynamic movements and inherent emotional cues. The rapid progress in multimodal large language models (MLLMs), known for their strong multimodal comprehension and language generation abilities, offers new possibilities. MLLMs have shown success in various vision-language tasks, indicating their potential to understand MEs comprehensively, including both fine-grained motion patterns and underlying emotional semantics. Nevertheless, challenges remain due to the subtle intensity and short duration of MEs, as existing MLLMs are not designed to capture such delicate frame-level facial dynamics. In this paper, we propose a novel Micro-Expression Large Language Model (MELLM), which incorporates a subtle facial motion perception strategy with the strong inference capabilities of MLLMs, representing the first exploration of MLLMs in the domain of ME analysis. Specifically, to explicitly guide the MLLM toward motion-sensitive regions, we construct an interpretable motion-enhanced color map by fusing onset-apex optical flow dynamics with the corresponding grayscale onset frame as the model input. Additionally, specialized fine-tuning strategies are incorporated to further enhance the model's visual perception of MEs. Furthermore, we construct an instruction-description dataset based on Facial Action Coding System (FACS) annotations and emotion labels to train our MELLM. Comprehensive evaluations across multiple benchmark datasets demonstrate that our model exhibits superior robustness and generalization capabilities in ME understanding (MEU). Code is available at https://github.com/zyzhangUstc/MELLM.
中文摘要:本文提出MELLM模型,通过融合细微面部动作感知与多模态大语言模型,首次实现了对微表情动态特征和情感语义的全面解析,在多个基准测试中展现出卓越的鲁棒性和泛化能力。
English Summary: This paper introduces MELLM, a novel micro-expression analysis model that integrates subtle facial motion perception with multimodal large language models to enhance the recognition of fine-grained dynamic movements and emotional cues, demonstrating superior robustness and generalization in micro-expression understanding.

Authors:Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens JS Schaefer, Deokjae Lee, Yeonhong Park, Jae W. Lee, Hyun Oh Song
Title: GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance
Abstract:
Post-training quantization is a key technique for reducing the memory and inference latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account for the varying importance of hidden features to the end loss or, when incorporating end loss, (2) neglect the critical interactions between model weights. To address these limitations, we propose GuidedQuant, a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. GuidedQuant consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, we introduce a novel non-uniform scalar quantization algorithm, which is guaranteed to monotonically decrease the quantization objective value, and outperforms existing methods in this category. We release the code at https://github.com/snu-mllab/GuidedQuant.
Chinese: GuidedQuant是一种新颖的量化方法,通过整合末端损失的梯度信息并保持权重间的依赖关系,在各种量化类型中均优于现有技术,提升了模型性能。
English: GuidedQuant is a novel quantization method that enhances model performance by incorporating gradient information from the end loss and maintaining cross-weight dependencies, outperforming existing techniques across various quantization types.

Authors:Bidur Khanal, Sandesh Pokhrel, Sanjay Bhandari, Ramesh Rana, Nikesh Shrestha, Ram Bahadur Gurung, Cristian Linte, Angus Watson, Yash Raj Shrestha, Binod Bhattarai
Title: Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models
Abstract:
Vision-Language Models (VLMs) are becoming increasingly popular in the medical domain, bridging the gap between medical images and clinical language. Existing VLMs demonstrate an impressive ability to comprehend medical images and text queries to generate detailed, descriptive diagnostic medical reports. However, hallucination--the tendency to generate descriptions that are inconsistent with the visual content--remains a significant issue in VLMs, with particularly severe implications in the medical field. To facilitate VLM research on gastrointestinal (GI) image analysis and study hallucination, we curate a multimodal image-text GI dataset: Gut-VLM. This dataset is created using a two-stage pipeline: first, descriptive medical reports of Kvasir-v2 images are generated using ChatGPT, which introduces some hallucinated or incorrect texts. In the second stage, medical experts systematically review these reports, and identify and correct potential inaccuracies to ensure high-quality, clinically reliable annotations. Unlike traditional datasets that contain only descriptive texts, our dataset also features tags identifying hallucinated sentences and their corresponding corrections. A common approach to reducing hallucination in VLM is to finetune the model on a small-scale, problem-specific dataset. However, we take a different strategy using our dataset. Instead of finetuning the VLM solely for generating textual reports, we finetune it to detect and correct hallucinations, an approach we call hallucination-aware finetuning. Our results show that this approach is better than simply finetuning for descriptive report generation. Additionally, we conduct an extensive evaluation of state-of-the-art VLMs across several metrics, establishing a benchmark. GitHub Repo: https://github.com/bhattarailab/Hallucination-Aware-VLM.
中文: 视觉语言模型在医学影像应用中潜力巨大,但存在幻觉生成问题;本研究通过构建胃肠道数据集并采用幻觉感知微调方法,有效提升了诊断报告的准确性。
English: Vision-Language Models (VLMs) show promise in medical imaging but suffer from hallucination issues, which are addressed in this study through a curated gastrointestinal dataset and a novel hallucination-aware fine-tuning approach that improves diagnostic report accuracy.

Authors:Wei Shang, Dongwei Ren, Wanying Zhang, Pengfei Zhu, Qinghua Hu, Wangmeng Zuo
Title: High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution
Abstract:
The primary challenge in accelerating image super-resolution lies in reducing computation while maintaining performance and adaptability. Motivated by the observation that high-frequency regions (e.g., edges and textures) are most critical for reconstruction, we propose a training-free adaptive masking module for acceleration that dynamically focuses computation on these challenging areas. Specifically, our method first extracts high-frequency components via Gaussian blur subtraction and adaptively generates binary masks using K-means clustering to identify regions requiring intensive processing. Our method can be easily integrated with both CNNs and Transformers. For CNN-based architectures, we replace standard $3 \times 3$ convolutions with an unfold operation followed by $1 \times 1$ convolutions, enabling pixel-wise sparse computation guided by the mask. For Transformer-based models, we partition the mask into non-overlapping windows and selectively process tokens based on their average values. During inference, unnecessary pixels or windows are pruned, significantly reducing computation. Moreover, our method supports dilation-based mask adjustment to control the processing scope without retraining, and is robust to unseen degradations (e.g., noise, compression). Extensive experiments on benchmarks demonstrate that our method reduces FLOPs by 24--43% for state-of-the-art models (e.g., CARN, SwinIR) while achieving comparable or better quantitative metrics. The source code is available at https://github.com/shangwei5/AMSR
中文: 本文提出了一种无需训练的自适应掩码模块,通过动态聚焦高频区域的计算来加速图像超分辨率,在先进模型中减少24–43%的计算量,同时保持性能。
English: This paper introduces a training-free adaptive masking module that accelerates image super-resolution by dynamically focusing computation on high-frequency regions, reducing FLOPs by 24–43% for state-of-the-art models while maintaining performance.
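
The mask-generation step described in the abstract (blur subtraction followed by two-cluster K-means) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the 3x3 binomial kernel stands in for a Gaussian blur, the K-means is a plain 1-D two-centroid loop, and the function names `gaussian_blur3`, `two_means_threshold`, and `high_frequency_mask` are hypothetical.

```python
def gaussian_blur3(img):
    """Blur a 2-D list with a 3x3 binomial kernel (assumed stand-in for a Gaussian)."""
    h, w = len(img), len(img[0])
    k = (1, 2, 1)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                yy = min(max(y + dy, 0), h - 1)  # clamp at the border
                for dx in (-1, 0, 1):
                    xx = min(max(x + dx, 0), w - 1)
                    acc += k[dy + 1] * k[dx + 1] * img[yy][xx]
            out[y][x] = acc / 16.0
    return out


def two_means_threshold(values, iters=20):
    """1-D K-means with K=2; returns the midpoint between the two centroids."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        low = [v for v in values if v <= mid]
        high = [v for v in values if v > mid]
        if not low or not high:  # all values fell into one cluster
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return (lo + hi) / 2.0


def high_frequency_mask(img):
    """Binary mask: 1 where |img - blur(img)| falls in the higher cluster."""
    blur = gaussian_blur3(img)
    hf = [[abs(a - b) for a, b in zip(row, brow)] for row, brow in zip(img, blur)]
    t = two_means_threshold([v for row in hf for v in row])
    return [[1 if v > t else 0 for v in row] for row in hf]


# A vertical step edge: only the columns next to the edge should be selected.
step = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
mask = high_frequency_mask(step)
```

On a flat image the high-frequency residual is zero everywhere, so the mask is empty and no region would be marked for intensive processing.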

Authors:Fei Zhou, Yi Li, Mingqing Zhu
Title: Transformer-Based Dual-Optical Attention Fusion Crowd Head Point Counting and Localization Network
Abstract:
In this paper, we propose TAPNet, a dual-optical attention fusion crowd head point counting model, to address the difficulty of accurate crowd counting from a UAV view in complex scenes such as dense occlusion and low light. The model introduces a dual-optical attention fusion module (DAFP) that draws on complementary information from infrared images to improve the accuracy and robustness of all-day crowd counting. To fully exploit the information in both modalities and to resolve the inaccurate localization caused by systematic misalignment between image pairs, we also propose an adaptive dual-optical feature decomposition fusion module (AFDF). In addition, we optimize the training strategy with spatial random offset data augmentation to improve model robustness. Experiments on two challenging public datasets, DroneRGBT and GAIIC2, show that the proposed method outperforms existing techniques, especially in challenging dense low-light scenes. Code is available at https://github.com/zz-zik/TAPNet
中文: 本文提出TAPNet双光注意力融合模型,通过结合红外图像数据和优化训练策略,显著提升了无人机视角下密集遮挡和低光等复杂场景的人群计数精度。
English: This paper introduces TAPNet, a dual-optical attention fusion model that enhances UAV crowd counting accuracy in complex conditions like occlusion and low light by integrating infrared data and optimized training strategies.

Authors:Lishan Yang, Wei Emma Zhang, Quan Z. Sheng, Lina Yao, Weitong Chen, Ali Shakeri
Title: MMiC: Mitigating Modality Incompleteness in Clustered Federated Learning
Abstract:
In the era of big data, data mining has become indispensable for uncovering hidden patterns and insights from vast and complex datasets. The integration of multimodal data sources further enhances its potential. Multimodal Federated Learning (MFL) is a distributed approach that enhances the efficiency and quality of multimodal learning, ensuring collaborative work and privacy protection. However, missing modalities pose a significant challenge in MFL, often due to data quality issues or privacy policies across the clients. In this work, we present MMiC, a framework for Mitigating Modality incompleteness in MFL within the Clusters. MMiC replaces partial parameters within client models inside clusters to mitigate the impact of missing modalities. Furthermore, it leverages the Banzhaf Power Index to optimize client selection under these conditions. Finally, MMiC employs an innovative approach to dynamically control global aggregation by utilizing Markovitz Portfolio Optimization. Extensive experiments demonstrate that MMiC consistently outperforms existing federated learning architectures in both global and personalized performance on multimodal datasets with missing modalities, confirming the effectiveness of our proposed solution. Our code is available at https://github.com/gotobcn8/MMiC.
中文摘要:MMiC框架通过参数替换、优化客户端选择和动态全局聚合控制,有效解决了多模态联邦学习中的模态缺失问题,其性能显著优于现有方法。
English Summary: The MMiC framework effectively addresses missing modality challenges in Multimodal Federated Learning by implementing parameter replacement, optimized client selection, and dynamic global aggregation control, demonstrating superior performance over existing methods.
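
For readers unfamiliar with the Banzhaf Power Index mentioned in the abstract, the sketch below computes the normalized index for a classical weighted voting game. The abstract does not say how MMiC defines the underlying game or how the index feeds into client selection, so the weights, the quota, and the function name `banzhaf_index` are illustrative assumptions only.

```python
from itertools import combinations


def banzhaf_index(weights, quota):
    """Normalized Banzhaf index: each player's share of all 'swing' positions.

    A player is a swing in a winning coalition if removing that player
    drops the coalition's total weight below the quota.
    """
    n = len(weights)
    swings = [0] * n
    for r in range(n + 1):
        for coalition in combinations(range(n), r):
            total = sum(weights[i] for i in coalition)
            if total < quota:
                continue  # losing coalition, nobody can be a swing
            for i in coalition:
                if total - weights[i] < quota:
                    swings[i] += 1
    s = sum(swings)
    return [c / s for c in swings] if s else [0.0] * n


# Hypothetical example: four clients with weights 4, 3, 2, 1 and quota 6.
index = banzhaf_index([4, 3, 2, 1], quota=6)
```

The enumeration is exponential in the number of players, so this form is only practical for small groups; larger settings usually rely on sampling-based estimates.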

Authors:Lhuqita Fazry
Title: A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting
Abstract:
The $\texttt{BIGBIRD-PEGASUS}$ model achieves $\textit{state-of-the-art}$ results on abstractive text summarization for long documents. However, its capacity is still limited to a maximum of $4,096$ tokens, which degrades performance when summarizing very long documents. A common way to deal with this issue is to truncate the documents. In this research, we take a different approach: we fine-tune the pretrained $\texttt{BIGBIRD-PEGASUS}$ model on a dataset from another domain. First, we filter out all documents shorter than $20,000$ tokens to focus on very long documents. To prevent domain shift and overfitting during transfer learning on a small dataset, we augment the dataset by splitting each document-summary training pair into parts so that each document part fits within $4,096$ tokens. Source code is available at $\href{https://github.com/lhfazry/SPIN-summ}{https://github.com/lhfazry/SPIN-summ}$.
Chinese: BIGBIRD-PEGASUS模型在长文档摘要任务中表现卓越,但受限于4096个词元,本研究通过筛选超长文档并分割数据以适配模型长度,进行领域微调,有效解决了性能下降问题。
English: The BIGBIRD-PEGASUS model achieves state-of-the-art performance in abstractive text summarization for long documents but is limited to 4,096 tokens, so this research fine-tunes it on a domain-specific dataset augmented by splitting documents to handle very long texts effectively.
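
The splitting idea in the abstract can be sketched as below. The abstract only states that each document-summary pair is split into parts so the document fits into 4,096 tokens; the rule used here (contiguous, near-equal parts, with the summary cut into the same number of parts) and the names `split_tokens`, `split_pair`, and `join_summaries` are assumptions made for illustration, not the paper's procedure.

```python
def split_tokens(tokens, n_parts):
    """Split a token list into n_parts contiguous, near-equal parts."""
    base, extra = divmod(len(tokens), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        parts.append(tokens[start:end])
        start = end
    return parts


def split_pair(doc_tokens, summary_tokens, max_len=4096):
    """Turn one (document, summary) pair into several shorter training pairs.

    The number of parts is the smallest count that keeps every document
    part within max_len tokens.
    """
    n_parts = max(1, -(-len(doc_tokens) // max_len))  # ceiling division
    doc_parts = split_tokens(doc_tokens, n_parts)
    sum_parts = split_tokens(summary_tokens, n_parts)
    return list(zip(doc_parts, sum_parts))


def join_summaries(partial_summaries):
    """Join per-part summaries back into one summary, in document order."""
    return [tok for part in partial_summaries for tok in part]


# Toy example with a tiny limit so the split is easy to inspect.
pairs = split_pair(list(range(10)), list("abcdef"), max_len=4)
```

A real pipeline would split on sentence or section boundaries rather than raw token offsets, since a positional cut of the summary need not align with the content of the matching document part.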

Authors:Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, Anil Vullikanti
Title: Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Abstract:
Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red teaming study takes this threat one step further by developing a more effective attack. Specifically, we analyze and identify samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier detection perspective and propose Self-Inf-N, to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on 100 outlier samples selected by Self-Inf-N in the benign datasets severely compromises LLM safety alignment. Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards. Codes are available at https://github.com/GuanZihan/Benign-Samples-Matter.
中文摘要:研究发现,即使在良性数据集上微调大型语言模型也会显著增加其输出的危害性,而一种利用异常样本的新攻击方法严重破坏了多种模型的安全对齐,且现有防御措施大多无效。
English Summary: Fine-tuning large language models on even benign datasets can dangerously increase their harmfulness, and a new attack method using outlier samples severely compromises safety alignment across various models, with most existing defenses proving ineffective.

Authors:Ye Zhu, Yunan Wang, Zitong Yu
Title: Multimodal Fake News Detection: MFND Dataset and Shallow-Deep Multitask Learning
Abstract:
Multimodal news contains a wealth of information and is easily affected by deepfake modeling attacks. To combat the latest image and text generation methods, we present a new Multimodal Fake News Detection dataset (MFND) containing 11 manipulated types, designed to detect and localize highly authentic fake news. Furthermore, we propose a Shallow-Deep Multitask Learning (SDML) model for fake news, which fully uses unimodal and mutual modal features to mine the intrinsic semantics of news. Under shallow inference, we propose the momentum distillation-based light punishment contrastive learning for fine-grained uniform spatial image and text semantic alignment, and an adaptive cross-modal fusion module to enhance mutual modal features. Under deep inference, we design a two-branch framework to augment the image and text unimodal features, respectively merging with mutual modalities features, for four predictions via dedicated detection and localization projections. Experiments on both mainstream and our proposed datasets demonstrate the superiority of the model. Codes and dataset are released at https://github.com/yunan-wang33/sdml.
中文摘要:本文提出了新型多模态假新闻检测数据集(MFND)和浅层-深层多任务学习模型(SDML),通过创新的对比学习与融合机制,充分利用单模态及跨模态特征,有效实现了对高仿真假新闻的检测与定位。
English Summary: This paper introduces a novel Multimodal Fake News Detection dataset (MFND) and proposes a Shallow-Deep Multitask Learning (SDML) model that effectively utilizes unimodal and cross-modal features to detect and localize sophisticated fake news through innovative contrastive learning and fusion techniques.

Authors:Ammar Daskin
Title: Quantum RNNs and LSTMs Through Entangling and Disentangling Power of Unitary Transformations
Abstract:
In this paper, we discuss how quantum recurrent neural networks (RNNs) and their enhanced version, long short-term memory (LSTM) networks, can be modeled using the core ideas presented in Ref.[1], where the entangling and disentangling power of unitary transformations is investigated. In particular, we interpret entangling and disentangling power as information retention and forgetting mechanisms in LSTMs. Therefore, entanglement becomes a key component of the optimization (training) process. We believe that, by leveraging prior knowledge of the entangling power of unitaries, the proposed quantum-classical framework can guide and help to design better-parameterized quantum circuits for various real-world applications.
中文: 本文通过将幺正变换的纠缠与解纠缠能力解释为信息保留与遗忘机制,将量子循环神经网络及其长短期记忆网络建模,使纠缠成为优化训练过程的核心,以指导设计更优参数化量子电路。
English: This paper models quantum RNNs and LSTMs by interpreting the entangling and disentangling power of unitaries as information retention and forgetting mechanisms, making entanglement central to the training process for designing optimized quantum circuits.

Authors:Shalin Anand Jain, Jiazhen Liu, Siva Kailas, Harish Ravichandar
Title: JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes
Abstract:
Multi-agent reinforcement learning (MARL) has emerged as a promising solution for learning complex and scalable coordination behaviors in multi-robot systems. However, established MARL platforms (e.g., SMAC and MPE) lack robotics relevance and hardware deployment, leaving multi-robot learning researchers to develop bespoke environments and hardware testbeds dedicated to the development and evaluation of their individual contributions. The Multi-Agent RL Benchmark and Learning Environment for the Robotarium (MARBLER) is an exciting recent step in providing a standardized robotics-relevant platform for MARL, by bridging the Robotarium testbed with existing MARL software infrastructure. However, MARBLER lacks support for parallelization and GPU/TPU execution, making the platform prohibitively slow compared to modern MARL environments and hindering adoption. We contribute JaxRobotarium, a Jax-powered end-to-end simulation, learning, deployment, and benchmarking platform for the Robotarium. JaxRobotarium enables rapid training and deployment of multi-robot RL (MRRL) policies with realistic robot dynamics and safety constraints, supporting parallelization and hardware acceleration. Our generalizable learning interface integrates easily with SOTA MARL libraries (e.g., JaxMARL). In addition, JaxRobotarium includes eight standardized coordination scenarios, including four novel scenarios that bring established MARL benchmark tasks (e.g., RWARE and Level-Based Foraging) to a robotics setting. We demonstrate that JaxRobotarium retains high simulation fidelity while achieving dramatic speedups over baseline (20x in training and 150x in simulation), and provides an open-access sim-to-real evaluation pipeline through the Robotarium testbed, accelerating and democratizing access to multi-robot learning research and evaluation. Our code is available at https://github.com/GT-STAR-Lab/JaxRobotarium.
中文: JaxRobotarium 是一个基于 Jax 的平台,支持多机器人强化学习的并行化训练与部署,具备真实动力学和安全约束,实现了显著加速,并提供标准化场景和仿真到现实的评估流程。
English: JaxRobotarium is a Jax-powered platform that enables fast, parallelized training and deployment of multi-robot reinforcement learning policies with realistic dynamics and safety constraints, achieving significant speedups and providing standardized scenarios and sim-to-real evaluation.

Authors:Youcef Djenouri, Nassim Belmecheri, Tomasz Michalak, Jan Dubiński, Ahmed Nabil Belbachir, Anis Yazidi
Title: Learning Graph Representation of Agent Diffusers
Abstract:
Diffusion-based generative models have significantly advanced text-to-image synthesis, demonstrating impressive text comprehension and zero-shot generalization. These models refine images from random noise based on textual prompts, with initial reliance on text input shifting towards enhanced visual fidelity over time. This transition suggests that static model parameters might not optimally address the distinct phases of generation. We introduce LGR-AD (Learning Graph Representation of Agent Diffusers), a novel multi-agent system designed to improve adaptability in dynamic computer vision tasks. LGR-AD models the generation process as a distributed system of interacting agents, each representing an expert sub-model. These agents dynamically adapt to varying conditions and collaborate through a graph neural network that encodes their relationships and performance metrics. Our approach employs a coordination mechanism based on top-$k$ maximum spanning trees, optimizing the generation process. Each agent's decision-making is guided by a meta-model that minimizes a novel loss function, balancing accuracy and diversity. Theoretical analysis and extensive empirical evaluations show that LGR-AD outperforms traditional diffusion models across various benchmarks, highlighting its potential for scalable and flexible solutions in complex image generation tasks. Code is available at: https://github.com/YousIA/LGR_AD
中文摘要:LGR-AD提出了一种基于图神经网络的多智能体系统,通过动态协调专家子模型优化扩散模型的图像生成过程,在保持准确性与多样性的同时显著超越传统方法。
English Summary: LGR-AD introduces a multi-agent system using graph neural networks to dynamically optimize diffusion-based image generation, outperforming traditional models by balancing accuracy and diversity through collaborative agent coordination.

Authors:Morui Zhu, Yongqi Zhu, Yihao Zhu, Qi Chen, Deyuan Qu, Song Fu, Qing Yang
Title: M3CAD: Towards Generic Cooperative Autonomous Driving Benchmark
Abstract:
We introduce M$^3$CAD, a novel benchmark designed to advance research in generic cooperative autonomous driving. M$^3$CAD comprises 204 sequences with 30k frames, spanning a diverse range of cooperative driving scenarios. Each sequence includes multiple vehicles and sensing modalities, e.g., LiDAR point clouds, RGB images, and GPS/IMU, supporting a variety of autonomous driving tasks, including object detection and tracking, mapping, motion forecasting, occupancy prediction, and path planning. This rich multimodal setup enables M$^3$CAD to support both single-vehicle and multi-vehicle autonomous driving research, significantly broadening the scope of research in the field. To our knowledge, M$^3$CAD is the most comprehensive benchmark specifically tailored for cooperative multi-task autonomous driving research. We evaluate the state-of-the-art end-to-end solution on M$^3$CAD to establish baseline performance. To foster cooperative autonomous driving research, we also propose E2EC, a simple yet effective framework for cooperative driving that leverages inter-vehicle shared information for improved path planning. We release M$^3$CAD, along with our baseline models and evaluation results, to support the development of robust cooperative autonomous driving systems. All resources will be made publicly available at https://github.com/zhumorui/M3CAD
中文: M³CAD是一个专为协作式自动驾驶研究设计的综合性基准,包含204个序列和3万帧多模态数据,支持多任务研究并公开资源以推动该领域发展。
English: M³CAD is a comprehensive benchmark featuring 204 sequences with 30k frames of multimodal data to advance cooperative autonomous driving research, supporting tasks like detection, tracking, and planning while establishing baseline performance and releasing resources publicly.

Authors:Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin
Title: Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Abstract:
Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find that this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance. We also release related $\href{https://github.com/qiuzh20/gated_attention}{codes}$ and $\href{https://huggingface.co/QwQZh/gated_attention}{models}$ to facilitate future research.
中文: 本研究表明,在缩放点积注意力后添加头部特定的Sigmoid门控,通过引入非线性和查询相关的稀疏门控机制,能持续提升模型性能、训练稳定性及长文本处理能力。
English: This study demonstrates that adding a head-specific sigmoid gate after Scaled Dot-Product Attention consistently enhances model performance, training stability, and long-context capabilities by introducing non-linearity and query-dependent sparse gating.
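
To make the central modification concrete, here is a minimal single-head sketch of a sigmoid gate applied to the SDPA output. It is written in plain Python for readability and is not the released implementation: the element-wise gate computed from the same input as the query, the absence of a causal mask, and the names `gated_head` and `wg` are assumptions chosen for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def sdpa(q, k, v):
    """Scaled dot-product attention for one head (no mask in this sketch)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)


def gated_head(x, wq, wk, wv, wg):
    """One attention head whose output is modulated by a sigmoid gate.

    The gate is computed per position from the input x (so it depends on
    the query token) and multiplies the SDPA output element-wise.
    """
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    attn = sdpa(q, k, v)
    gate = [[1.0 / (1.0 + math.exp(-g)) for g in row] for row in matmul(x, wg)]
    return [[a * g for a, g in zip(arow, grow)] for arow, grow in zip(attn, gate)]


# Toy check: with zero gate weights the gate is 0.5 everywhere.
x = [[1.0, 0.0], [0.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
out = gated_head(x, eye, eye, eye, zeros)
```

Because the sigmoid can push gate values toward zero, a head can suppress its own output for a given query without routing attention mass to a placeholder token, which is consistent with the abstract's remark on attention sinks.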

Authors:Dominik Koterwa, Maciej Świtała
Title: Enhancing BERTopic with Intermediate Layer Representations
Abstract:
BERTopic is a topic modeling algorithm that leverages transformer-based embeddings to create dense clusters, enabling the estimation of topic structures and the extraction of valuable insights from a corpus of documents. This approach allows users to efficiently process large-scale text data and gain meaningful insights into its structure. While BERTopic is a powerful tool, embedding preparation can vary, including extracting representations from intermediate model layers and applying transformations to these embeddings. In this study, we evaluate 18 different embedding representations and present findings based on experiments conducted on three diverse datasets. To assess the algorithm's performance, we report topic coherence and topic diversity metrics across all experiments. Our results demonstrate that, for each dataset, it is possible to find an embedding configuration that performs better than the default setting of BERTopic. Additionally, we investigate the influence of stop words on different embedding configurations.
中文: 本研究评估了BERTopic在三个数据集上的18种嵌入表示,发现优化配置在主题连贯性和多样性上优于默认设置,同时探讨了停用词对不同嵌入配置的影响。
English: This study evaluates 18 embedding representations for BERTopic across three datasets, finding that optimized configurations outperform default settings in topic coherence and diversity while also examining stop words' impact.
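
As a rough illustration of one of the reported metrics, topic diversity is commonly computed as the fraction of unique words among the top words of all topics. The abstract does not state which definition the authors use, so the formula below and the function name `topic_diversity` should be read as an assumption.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    topics: list of word lists, each ordered from most to least relevant.
    Returns a value in [0, 1]; 1.0 means no word is shared between topics.
    """
    words = [w for topic in topics for w in topic[:top_k]]
    return len(set(words)) / len(words) if words else 0.0


# Two toy topics that share one word out of four top words.
score = topic_diversity([["cat", "dog"], ["dog", "fish"]], top_k=2)
```

A low score indicates redundant topics, which is why diversity is usually reported alongside coherence rather than on its own.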

Authors:Xuefeng Jiang, Jia Li, Nannan Wu, Zhiyuan Wu, Xujing Li, Sheng Sun, Gang Xu, Yuwei Wang, Qi Li, Min Liu
Title: FNBench: Benchmarking Robust Federated Learning against Noisy Labels
Abstract:
Robustness to label noise within data is a significant challenge in federated learning (FL). From the data-centric perspective, the data quality of distributed datasets cannot be guaranteed, since the annotations of different clients contain complicated label noise of varying degrees, which degrades performance. There have been some early attempts to tackle noisy labels in FL. However, benchmark studies that comprehensively evaluate their practical performance under unified settings are lacking. To this end, we propose FNBench, the first benchmark study to provide an experimental investigation that considers three diverse label noise patterns covering synthetic label noise, imperfect human-annotation errors and systematic errors. Our evaluation incorporates eighteen state-of-the-art methods over five image recognition datasets and one text classification dataset. Meanwhile, we provide observations to understand why noisy labels impair FL, and additionally exploit a representation-aware regularization method to enhance the robustness of existing methods against noisy labels based on our observations. Finally, we discuss the limitations of this work and propose three-fold future directions. To facilitate related communities, our source code is open-sourced at https://github.com/Sprinter1999/FNBench.
中文: 联邦学习面临客户端标签噪声的挑战,为此我们推出了首个基准研究FNBench,在多种噪声模式下评估了18种方法,并提出了一种基于表示的规则化方法以提高抗噪性。
English: Federated learning faces challenges from varying label noise across clients, prompting the creation of FNBench, the first benchmark that evaluates 18 methods under diverse noise patterns and introduces a regularization technique to enhance robustness.

Authors:Zijun Zhan, Yaxian Dong, Daniel Mawunyo Doe, Yuqing Hu, Shuai Li, Shaohua Cao, Lei Fan, Zhu Han
Title: Distributionally Robust Contract Theory for Edge AIGC Services in Teleoperation
Abstract:
Advanced AI-Generated Content (AIGC) technologies have injected new impetus into teleoperation, further enhancing its security and efficiency. Edge AIGC networks have been introduced to meet the stringent low-latency requirements of teleoperation. However, the inherent uncertainty of AIGC service quality and the need to incentivize AIGC service providers (ASPs) make the design of a robust incentive mechanism essential. This design is particularly challenging due to both uncertainty and information asymmetry, as teleoperators have limited knowledge of the remaining resource capacities of ASPs. To this end, we propose a distributionally robust optimization (DRO)-based contract theory to design robust reward schemes for AIGC task offloading. Notably, our work extends the contract theory by integrating DRO, addressing the fundamental challenge of contract design under uncertainty. In this paper, contract theory is employed to model the information asymmetry, while DRO is utilized to capture the uncertainty in AIGC service quality. Given the inherent complexity of the original DRO-based contract theory problem, we reformulate it into an equivalent, tractable bi-level optimization problem. To efficiently solve this problem, we develop a Block Coordinate Descent (BCD)-based algorithm to derive robust reward schemes. Simulation results on our unity-based teleoperation platform demonstrate that the proposed method improves teleoperator utility by 2.7\% to 10.74\% under varying degrees of AIGC service quality shifts and increases ASP utility by 60.02\% compared to the SOTA method, i.e., Deep Reinforcement Learning (DRL)-based contract theory. The code and data are publicly available at https://github.com/Zijun0819/DRO-Contract-Theory.
中文: 本文提出了一种基于分布鲁棒优化的合约理论,用于设计AIGC任务卸载的鲁棒奖励方案,通过解决不确定性和信息不对称问题,有效提升了遥操作效用和服务提供商的效率。
English: This paper presents a distributionally robust optimization-based contract theory to design robust reward schemes for AIGC task offloading, addressing uncertainty and information asymmetry to enhance teleoperation utility and service provider efficiency.

Authors:Lei Hu, Zhiyong Gan, Ling Deng, Jinglin Liang, Lingyu Liang, Shuangping Huang, Tianshui Chen
Title: ReplayCAD: Generative Diffusion Replay for Continual Anomaly Detection
Abstract:
Continual Anomaly Detection (CAD) enables anomaly detection models to learn new classes while preserving knowledge of historical classes. CAD faces two key challenges: catastrophic forgetting and segmentation of small anomalous regions. Existing CAD methods store image distributions or patch features to mitigate catastrophic forgetting, but they fail to preserve pixel-level detailed features for accurate segmentation. To overcome this limitation, we propose ReplayCAD, a novel diffusion-driven generative replay framework that replays high-quality historical data, thus effectively preserving pixel-level detailed features. Specifically, we compress historical data by searching for a class semantic embedding in the conditional space of the pre-trained diffusion model, which can guide the model to replay data with fine-grained pixel details, thus improving the segmentation performance. However, relying solely on semantic features results in limited spatial diversity. Hence, we further use spatial features to guide data compression, achieving precise control of sample space, thereby generating more diverse data. Our method achieves state-of-the-art performance in both classification and segmentation, with notable improvements in segmentation: 11.5% on VisA and 8.1% on MVTec. Our source code is available at https://github.com/HULEI7/ReplayCAD.
Chinese: ReplayCAD提出了一种基于扩散模型的生成式回放框架,通过保留像素级细节来缓解持续异常检测中的灾难性遗忘并提升分割精度,在多个数据集上实现了显著的性能提升。
English: ReplayCAD introduces a diffusion-driven generative replay framework that preserves pixel-level details to mitigate catastrophic forgetting and enhance segmentation in continual anomaly detection, achieving state-of-the-art performance with significant improvements.

Authors:Yangguang Shao, Xinjie Lin, Haozheng Luo, Chengshang Hou, Gang Xiong, Jiahao Yu, Junzheng Shi
Title: POISONCRAFT: Practical Poisoning of Retrieval-Augmented Generation for Large Language Models
Abstract:
Large language models (LLMs) have achieved remarkable success in various domains, primarily due to their strong capabilities in reasoning and generating human-like text. Despite their impressive performance, LLMs are susceptible to hallucinations, which can lead to incorrect or misleading outputs. This is primarily due to the lack of up-to-date knowledge or domain-specific information. Retrieval-augmented generation (RAG) is a promising approach to mitigate hallucinations by leveraging external knowledge sources. However, the security of RAG systems has not been thoroughly studied. In this paper, we study a poisoning attack on RAG systems named POISONCRAFT, which can mislead the model to refer to fraudulent websites. Compared to existing poisoning attacks on RAG systems, our attack is more practical as it requires neither access to the target user's query nor any modification of that query. It not only ensures that injected texts can be retrieved by the model, but also ensures that the LLM will be misled to refer to the injected texts in its response. We demonstrate the effectiveness of POISONCRAFT across different datasets, retrievers, and language models in RAG pipelines, and show that it remains effective when transferred across retrievers, including black-box systems. Moreover, we present a case study revealing how the attack influences both the retrieval behavior and the step-by-step reasoning trace within the generation model, and further evaluate the robustness of POISONCRAFT under multiple defense mechanisms. These results validate the practicality of our threat model and highlight a critical security risk for RAG systems deployed in real-world applications. We release our code at https://github.com/AndyShaw01/PoisonCraft to support future research on the security and robustness of RAG systems in real-world settings.
中文: 本文提出POISONCRAFT攻击方法,能在无需获取用户查询的情况下,通过污染检索增强生成系统的外部知识库来误导大语言模型引用欺诈网站,实验证明该攻击在不同数据集和模型中都有效,揭示了RAG系统在实际应用中的重大安全风险。
English: This paper introduces POISONCRAFT, a practical poisoning attack on retrieval-augmented generation (RAG) systems that can mislead large language models to reference fraudulent websites without requiring access to user queries, demonstrating its effectiveness across various datasets and models while highlighting critical security risks.

Authors:Maxim Vashkevich, Egor Krivalcevich
Title: Compact and Efficient Neural Networks for Image Recognition Based on Learned 2D Separable Transform
Abstract:
The paper presents a learned two-dimensional separable transform (LST) that can be considered as a new type of computational layer for constructing neural network (NN) architectures for image recognition tasks. The LST is based on the idea of sharing the weights of one fully connected (FC) layer to process all rows of an image. After that, a second shared FC layer is used to process all columns of the image representation obtained from the first layer. The use of LST layers in an NN architecture significantly reduces the number of model parameters compared to models that use stacked FC layers. We show that an NN classifier based on a single LST layer followed by an FC layer achieves 98.02% accuracy on the MNIST dataset, while having only 9.5k parameters. We also implemented an LST-based classifier for handwritten digit recognition on the FPGA platform to demonstrate the efficiency of the suggested approach for designing a compact and high-performance implementation of NN models. Git repository with supplementary materials: https://github.com/Mak-Sim/LST-2d
中文: 该论文提出一种学习型二维可分离变换(LST)作为神经网络层,通过共享全连接层权重处理图像行列,在MNIST数据集上以仅9.5k参数实现98.02%准确率,并展示了在FPGA平台的高效部署。
English: This paper introduces a learned two-dimensional separable transform (LST) as a neural network layer that reduces parameters while maintaining high accuracy, achieving 98.02% on MNIST with only 9.5k parameters and demonstrating efficient FPGA implementation.
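
The row-then-column weight sharing described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 28x28 output size, the `tanh` activation, the random initialisation, and all names (`lst_layer`, `count_params`, `w_row`, `w_col`) are hypothetical choices made for the example.

```python
import numpy as np

def lst_layer(x, w_row, b_row, w_col, b_col):
    """Apply a learned 2D separable transform to one image x of shape (H, W)."""
    # First shared FC layer: the same weights process every row of the image.
    rows = np.tanh(x @ w_row.T + b_row)             # shape (H, W_out)
    # Second shared FC layer: the same weights process every column of that result.
    return np.tanh(w_col @ rows + b_col[:, None])   # shape (H_out, W_out)

def count_params(*arrays):
    """Total number of scalar parameters in the given arrays."""
    return sum(a.size for a in arrays)

rng = np.random.default_rng(0)
H = W = 28  # MNIST-sized input (assumed layer sizes, not taken from the paper)
w_row, b_row = rng.normal(scale=0.1, size=(W, W)), np.zeros(W)
w_col, b_col = rng.normal(scale=0.1, size=(H, H)), np.zeros(H)
w_fc, b_fc = rng.normal(scale=0.1, size=(10, H * W)), np.zeros(10)

image = rng.random((H, W))
features = lst_layer(image, w_row, b_row, w_col, b_col)
logits = w_fc @ features.reshape(-1) + b_fc         # final FC classifier head

lst_params = count_params(w_row, b_row, w_col, b_col)
total_params = lst_params + count_params(w_fc, b_fc)
dense_params = (H * W) * (H * W) + H * W            # one unshared FC layer, 784 -> 784
print(lst_params, total_params, dense_params)
```

Under these assumed sizes the LST layer holds 1,624 parameters against 615,440 for a single dense 784-to-784 layer, and the whole classifier holds 9,474, which is in line with the roughly 9.5k figure quoted in the abstract, although the authors' exact configuration may differ.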

Authors:Woosang Lim, Zekun Li, Gyuwan Kim, Sungyoung Ji, HyeonJung Kim, Kyuri Choi, Jin Hyuk Lim, Kyungpyo Park, William Yang Wang
Title: MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG
Abstract:
Long-context large language models (LC LLMs) combined with retrieval-augmented generation (RAG) hold strong potential for complex multi-hop and large-document tasks. However, existing RAG systems often suffer from imprecise retrieval, incomplete context coverage under constrained windows, and fragmented information from suboptimal context construction. We introduce Multi-scale Adaptive Context RAG (MacRAG), a hierarchical RAG framework that compresses and partitions documents into coarse-to-fine granularities, then adaptively merges relevant contexts through real-time chunk- and document-level expansions. By initiating with finest-level retrieval and progressively incorporating broader, higher-level context, MacRAG constructs effective query-specific long contexts, optimizing both precision and coverage. Evaluations on challenging LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm MacRAG consistently surpasses baseline RAG pipelines in single- and multi-step generation using Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. Our results establish MacRAG as an efficient, scalable solution for real-world long-context, multi-hop reasoning. Our code is available at https://github.com/Leezekun/MacRAG.
中文: MacRAG提出了一种分层检索增强生成框架,通过自适应融合从粗到细的文档粒度来优化精度和覆盖范围,在多种大语言模型的多跳推理任务中持续超越基线模型。
English: MacRAG introduces a hierarchical RAG framework that adaptively merges coarse-to-fine document granularities to optimize precision and coverage, consistently outperforming baseline models in multi-hop reasoning tasks across various LLMs.

Authors:Danil Belov, Artem Erkhov, Elizaveta Pestova, Ilya Osokin, Dzmitry Tsetserukou, Pavel Osinenko
Title: Quadrupedal Robot Skateboard Mounting via Reverse Curriculum Learning
Abstract:
The aim of this work is to enable quadrupedal robots to mount skateboards using Reverse Curriculum Reinforcement Learning. Although prior work has demonstrated skateboarding for quadrupeds that are already positioned on the board, the initial mounting phase still poses a significant challenge. A goal-oriented methodology was adopted, beginning with the terminal phases of the task and progressively increasing the complexity of the problem definition to approximate the desired objective. The learning process was initiated with the skateboard rigidly fixed within the global coordinate frame and the robot positioned directly above it. Through gradual relaxation of these initial conditions, the learned policy demonstrated robustness to variations in skateboard position and orientation, ultimately exhibiting a successful transfer to scenarios involving a mobile skateboard. The code, trained models, and reproducible examples are available at the following link: https://github.com/dancher00/quadruped-skateboard-mounting
中文: 本研究通过逆向课程强化学习,使四足机器人能够完成滑板上板动作,从简化条件开始训练,最终成功实现移动滑板场景下的稳健操作。
English: This study enables quadrupedal robots to mount skateboards through Reverse Curriculum Reinforcement Learning, starting with simplified conditions and progressively achieving robust performance on mobile boards.
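
The gradual relaxation of initial conditions can be illustrated with a small scheduling sketch. Everything here is a hypothetical stand-in rather than the released code: the function names, the 0.5 m offset bound, the yaw range, and the 0.8 success threshold are assumptions, and the actual training loop and simulator are omitted.

```python
import math
import random

def sample_start(level, max_level, max_offset_m=0.5, max_yaw_rad=math.pi):
    """Sample an initial skateboard pose whose spread grows with the curriculum level."""
    scale = level / max_level  # 0.0 means the board is fixed directly under the robot
    dx = random.uniform(-max_offset_m, max_offset_m) * scale
    dy = random.uniform(-max_offset_m, max_offset_m) * scale
    yaw = random.uniform(-max_yaw_rad, max_yaw_rad) * scale
    return dx, dy, yaw

def update_level(level, max_level, success_rate, threshold=0.8):
    """Relax the initial conditions one step once the policy succeeds often enough."""
    if success_rate >= threshold and level < max_level:
        return level + 1
    return level

# Usage with made-up success rates standing in for evaluation results.
level, max_level = 0, 5
for success_rate in [0.9, 0.6, 0.85, 0.95]:
    level = update_level(level, max_level, success_rate)
print(level, sample_start(level, max_level))
```

Starting at level 0 reproduces the easiest setting from the abstract (board fixed, robot directly above it); each promotion widens the range of board positions and orientations the policy must handle.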

Authors:Xinyue Lou, You Li, Jinan Xu, Xiangyu Shi, Chi Chen, Kaiyu Huang
Title: Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model
Abstract:
The rapid development of Multimodal Large Reasoning Models (MLRMs) has demonstrated broad application potential, yet their safety and reliability remain critical concerns that require systematic exploration. To address this gap, we conduct a comprehensive and systematic safety evaluation of 11 MLRMs across 5 benchmarks and unveil prevalent safety degradation phenomena in most advanced models. Moreover, our analysis reveals distinct safety patterns across different benchmarks: significant safety degradation is observed across jailbreak robustness benchmarks, whereas safety-awareness benchmarks demonstrate less pronounced degradation. In particular, the long thought process in some scenarios even enhances safety performance. Leveraging the model's intrinsic reasoning capabilities to detect unsafe intent is therefore a potential approach to addressing safety issues in MLRMs. To operationalize this insight, we construct a multimodal tuning dataset that incorporates a safety-oriented thought process. Experimental results show that fine-tuning existing MLRMs with this dataset effectively enhances safety on both jailbreak robustness and safety-awareness benchmarks. This study provides a new perspective for developing safe MLRMs. Our dataset is available at https://github.com/xinyuelou/Think-in-Safety.
中文: 本研究系统评估了11种多模态大推理模型的安全性,发现普遍存在安全性能下降现象,并提出通过构建安全导向思维链数据集进行微调的新方法,有效利用模型内在推理能力提升其安全防护水平。
English: This study systematically evaluates the safety of 11 Multimodal Large Reasoning Models, revealing prevalent safety degradation and proposing a novel approach that enhances model safety by integrating safety-oriented reasoning processes through fine-tuning with a specially constructed dataset.

Authors:Feng Liu, Ziwang Fu, Yunlong Wang, Qijian Zheng
Title: TACFN: Transformer-based Adaptive Cross-modal Fusion Network for Multimodal Emotion Recognition
Abstract:
The fusion technique is the key to the multimodal emotion recognition task. Recently, cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. However, cross-modal attention suffers from redundant features and does not capture complementary features well. We find that it is not necessary to use the entire information of one modality to reinforce the other during cross-modal interaction, and the features that can reinforce a modality may contain only a part of it. To this end, we design an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). Specifically, for the redundant features, we make one modality perform intra-modal feature selection through a self-attention mechanism, so that the selected features can adaptively and efficiently interact with another modality. To better capture the complementary information between the modalities, we obtain the fused weight vector by splicing and use the weight vector to achieve feature reinforcement of the modalities. We apply TACFN to the RAVDESS and IEMOCAP datasets. For fair comparison, we use the same unimodal representations to validate the effectiveness of the proposed fusion method. The experimental results show that TACFN brings a significant performance improvement compared to other methods and reaches the state-of-the-art. All code and models can be accessed from https://github.com/shuzihuaiyu/TACFN.
中文: 提出的基于Transformer的自适应跨模态融合网络(TACFN)通过自注意力机制进行模态内特征选择和权重融合,有效解决了跨模态情感识别中的冗余特征问题并增强了互补性,在标准数据集上取得了最优性能。
English: The proposed Transformer-based Adaptive Cross-modal Fusion Network (TACFN) addresses redundancy and enhances complementary feature capture in multimodal emotion recognition by using self-attention for intra-modal selection and weight-based reinforcement, achieving state-of-the-art results on benchmark datasets.

Authors:Ummay Maria Muna, Fahim Hafiz, Shanta Biswas, Riasat Azim
Title: GBDTSVM: Combined Support Vector Machine and Gradient Boosting Decision Tree Framework for efficient snoRNA-disease association prediction
Abstract:
Small nucleolar RNAs (snoRNAs) are increasingly recognized for their critical role in the pathogenesis and characterization of various human diseases. Consequently, the precise identification of snoRNA-disease associations (SDAs) is essential for understanding disease progression and advancing treatment strategies. However, conventional biological experimental approaches are costly, time-consuming, and resource-intensive; therefore, machine learning-based computational methods offer a promising solution to mitigate these limitations. This paper proposes a model called 'GBDTSVM', representing a novel and efficient machine learning approach for predicting snoRNA-disease associations by leveraging a Gradient Boosting Decision Tree (GBDT) and Support Vector Machine (SVM). 'GBDTSVM' extracts integrated snoRNA-disease feature representations using GBDT, and an SVM is subsequently used to classify and identify potential associations. Furthermore, the method enhances the accuracy of these predictions by incorporating Gaussian kernel profile similarity for both snoRNAs and diseases. Experimental evaluation of the GBDTSVM model demonstrated superior performance compared to state-of-the-art methods in the field, achieving an area under the receiver operating characteristic curve (AUROC) of 0.96 and an area under the precision-recall curve (AUPRC) of 0.95 on the MDRF dataset. Moreover, our model shows superior performance on two more datasets named LSGT and PsnoD. Additionally, a case study on the predicted snoRNA-disease associations verified the top 10 predicted snoRNAs across nine prevalent diseases, further validating the efficacy of the GBDTSVM approach. These results underscore the model's potential as a robust tool for advancing snoRNA-related disease research. Source code and datasets for our proposed framework can be obtained from: https://github.com/mariamuna04/gbdtsvm
中文: 本文提出GBDTSVM模型,通过结合梯度提升决策树与支持向量机来精准预测snoRNA与疾病关联,在多个数据集上展现出卓越性能(AUROC达0.96,AUPRC达0.95),显著优于现有方法。
English: This paper introduces GBDTSVM, a novel machine learning model that combines Gradient Boosting Decision Tree and Support Vector Machine to accurately predict snoRNA-disease associations, demonstrating superior performance with AUROC of 0.96 and AUPRC of 0.95 compared to existing methods.

Authors:Jing Hu, Kaiwei Yu, Hongjiang Xian, Shu Hu, Xin Wang
Title: Improving Generalization of Medical Image Registration Foundation Model
Abstract:
Deformable registration is a fundamental task in medical image processing, aiming to achieve precise alignment by establishing nonlinear correspondences between images. Traditional methods offer good adaptability and interpretability but are limited by computational efficiency. Although deep learning approaches have significantly improved registration speed and accuracy, they often lack flexibility and generalizability across different datasets and tasks. In recent years, foundation models have emerged as a promising direction, leveraging large and diverse datasets to learn universal features and transformation patterns for image registration, thus demonstrating strong cross-task transferability. However, these models still face challenges in generalization and robustness when encountering novel anatomical structures, varying imaging conditions, or unseen modalities. To address these limitations, this paper incorporates Sharpness-Aware Minimization (SAM) into foundation models to enhance their generalization and robustness in medical image registration. By optimizing the flatness of the loss landscape, SAM improves model stability across diverse data distributions and strengthens its ability to handle complex clinical scenarios. Experimental results show that foundation models integrated with SAM achieve significant improvements in cross-dataset registration performance, offering new insights for the advancement of medical image registration technology. Our code is available at https://github.com/Promise13/fm_sam.
中文: 本文通过将锐度感知最小化(SAM)融入基础模型,提升了医学图像配准的泛化能力和鲁棒性,使其能更好地适应不同数据集和复杂临床情况。
English: This paper enhances medical image registration by integrating Sharpness-Aware Minimization (SAM) into foundation models, improving generalization and robustness across diverse datasets and clinical scenarios.
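
For readers unfamiliar with Sharpness-Aware Minimization, the generic two-gradient update can be sketched on a toy problem. This is not the registration model from the paper: the least-squares loss, the synthetic data, and the values of `lr` and `rho` are assumptions chosen only to keep the example self-contained.

```python
import numpy as np

def loss_and_grad(w, A, b):
    """Toy least-squares loss 0.5 * ||A w - b||^2 and its gradient."""
    r = A @ w - b
    return 0.5 * float(r @ r), A.T @ r

def sam_step(w, A, b, lr=0.01, rho=0.05):
    # 1) Gradient at the current weights.
    _, g = loss_and_grad(w, A, b)
    # 2) Step to the approximate worst-case point inside an L2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # 3) Gradient at the perturbed weights.
    _, g_sharp = loss_and_grad(w + eps, A, b)
    # 4) Update the original (unperturbed) weights with the perturbed gradient.
    return w - lr * g_sharp

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])       # synthetic target weights (assumed)
A = rng.normal(size=(20, 3))
b = A @ w_true + 0.01 * rng.normal(size=20)

w = np.zeros(3)
initial_loss, _ = loss_and_grad(w, A, b)
for _ in range(200):
    w = sam_step(w, A, b)
final_loss, _ = loss_and_grad(w, A, b)
print(initial_loss, final_loss, w)
```

Each step costs two gradient evaluations instead of one; in a deep learning framework the same pattern is usually wrapped around an existing base optimizer.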

Authors:Larry Preuett, Qiuyi Zhang, Muhammad Aurangzeb Ahmad
Title: Reinforcement Learning under State and Outcome Uncertainty: A Foundational Distributional Perspective
Abstract:
In many real-world planning tasks, agents must tackle uncertainty about the environment's state and variability in the outcomes of any chosen policy. We address both forms of uncertainty as a first step toward safer algorithms in partially observable settings. Specifically, we extend Distributional Reinforcement Learning (DistRL), which models the entire return distribution for fully observable domains, to Partially Observable Markov Decision Processes (POMDPs), allowing an agent to learn the distribution of returns for each conditional plan. Concretely, we introduce new distributional Bellman operators for partial observability and prove their convergence under the supremum p-Wasserstein metric. We also propose a finite representation of these return distributions via psi-vectors, generalizing the classical alpha-vectors in POMDP solvers. Building on this, we develop Distributional Point-Based Value Iteration (DPBVI), which integrates psi-vectors into a standard point-based backup procedure, bridging DistRL and POMDP planning. By tracking return distributions, DPBVI naturally enables risk-sensitive control in domains where rare, high-impact events must be carefully managed. We provide source code to foster further research in robust decision-making under partial observability.
中文: 本研究将分布强化学习扩展至部分可观测环境,通过引入分布贝尔曼算子和psi向量表示法,提出了DPBVI算法以实现风险敏感控制。
English: This study extends Distributional Reinforcement Learning to partially observable environments by introducing distributional Bellman operators and a psi-vector representation, enabling risk-sensitive control through the proposed DPBVI algorithm.

Authors:Hang Wang, Zhi-Qi Cheng, Chenhao Lin, Chao Shen, Lei Zhang
Title: HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation
Abstract:
Text-to-image synthesis has progressed to the point where models can generate visually compelling images from natural language prompts. Yet, existing methods often fail to reconcile high-level semantic fidelity with explicit spatial control, particularly in scenes involving multiple objects, nuanced relations, or complex layouts. To bridge this gap, we propose a Hierarchical Cross-Modal Alignment (HCMA) framework for grounded text-to-image generation. HCMA integrates two alignment modules into each diffusion sampling step: a global module that continuously aligns latent representations with textual descriptions to ensure scene-level coherence, and a local module that employs bounding-box layouts to anchor objects at specified locations, enabling fine-grained spatial control. Extensive experiments on the MS-COCO 2014 validation set show that HCMA surpasses state-of-the-art baselines, achieving a 0.69 improvement in Frechet Inception Distance (FID) and a 0.0295 gain in CLIP Score. These results demonstrate HCMA's effectiveness in faithfully capturing intricate textual semantics while adhering to user-defined spatial constraints, offering a robust solution for semantically grounded image generation. Our code is available at https://github.com/hwang-cs-ime/HCMA.
中文: 提出的分层跨模态对齐(HCMA)框架通过结合全局语义对齐和基于边界框的局部空间控制,在文本到图像生成中实现了对复杂语义的精确捕捉与空间约束的严格遵循,在基准测试中以显著提升的FID和CLIP分数刷新了最优性能。
English: The proposed Hierarchical Cross-Modal Alignment (HCMA) framework enhances text-to-image generation by integrating global semantic alignment and local spatial control through bounding-box layouts, achieving state-of-the-art performance on benchmark datasets with significant improvements in FID and CLIP Score metrics.

Authors:Haoyang Xie, Feng Ju
Title: Text-to-CadQuery: A New Paradigm for CAD Generation with Scalable Large Model Capabilities
Abstract:
Computer-aided design (CAD) is fundamental to modern engineering and manufacturing, but creating CAD models still requires expert knowledge and specialized software. Recent advances in large language models (LLMs) open up the possibility of generative CAD, where natural language is directly translated into parametric 3D models. However, most existing methods generate task-specific command sequences that pretrained models cannot directly handle. These sequences must be converted into CAD representations such as CAD vectors before a 3D model can be produced, which requires training models from scratch and adds unnecessary complexity. To tackle this issue, we propose generating CadQuery code directly from text, leveraging the strengths of pretrained LLMs to produce 3D models without intermediate representations, using this Python-based scripting language. Since LLMs already excel at Python generation and spatial reasoning, fine-tuning them on Text-to-CadQuery data proves highly effective. Given that these capabilities typically improve with scale, we hypothesize that larger models will perform better after fine-tuning. To enable this, we augment the Text2CAD dataset with 170,000 CadQuery annotations. We fine-tune six open-source LLMs of varying sizes and observe consistent improvements. Our best model achieves a top-1 exact match of 69.3%, up from 58.8%, and reduces Chamfer Distance by 48.6%. Project page: https://github.com/Text-to-CadQuery/Text-to-CadQuery.
Chinese: 本研究提出了一种利用预训练大语言模型直接从文本生成CadQuery代码的方法,省去中间表示步骤,在创建参数化3D模型方面实现了准确性和效率的显著提升。
English: This research introduces a method to generate CadQuery code directly from text using pretrained large language models, eliminating intermediate representations and achieving significant improvements in accuracy and efficiency for creating parametric 3D models.

Authors:Md Rakibul Hasan, Pouria Behnoudfar, Dan MacKinlay, Thomas Poulet
Title: PC-SRGAN: Physically Consistent Super-Resolution Generative Adversarial Network for General Transient Simulations
Abstract:
Machine Learning, particularly Generative Adversarial Networks (GANs), has revolutionised Super-Resolution (SR). However, generated images often lack physical meaningfulness, which is essential for scientific applications. Our approach, PC-SRGAN, enhances image resolution while ensuring physical consistency for interpretable simulations. PC-SRGAN significantly improves both the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure compared to conventional SR methods, even with limited training data (e.g., only 13% of training data is required to achieve performance similar to SRGAN). Beyond SR, PC-SRGAN augments physically meaningful machine learning, incorporating numerically justified time integrators and advanced quality metrics. These advancements promise reliable and causal machine-learning models in scientific domains. A significant advantage of PC-SRGAN over conventional SR techniques is its physical consistency, which makes it a viable surrogate model for time-dependent problems. PC-SRGAN advances scientific machine learning by improving accuracy and efficiency, enhancing process understanding, and broadening applications to scientific research. We publicly release the complete source code of PC-SRGAN and all experiments at https://github.com/hasan-rakibul/PC-SRGAN.
中文:PC-SRGAN通过生成物理一致的图像革新了超分辨率技术,在极少训练数据下显著提升精度和效率,并推动科学机器学习应用的发展。
English: PC-SRGAN revolutionizes Super-Resolution by generating physically consistent images, significantly improving accuracy and efficiency with minimal training data while advancing scientific machine learning applications.

Authors:Yanjun Lin, Kai Zhang, Zhenying He, Yinan Jing, X. Sean Wang
Title: Survey of Filtered Approximate Nearest Neighbor Search over the Vector-Scalar Hybrid Data
Abstract:
Filtered approximate nearest neighbor search (FANNS), an extension of approximate nearest neighbor search (ANNS) that incorporates scalar filters, has been widely applied to constrained retrieval of vector data. Despite its growing importance, no dedicated survey on FANNS over the vector-scalar hybrid data currently exists, and the field has several problems, including inconsistent definitions of the search problem, insufficient framework for algorithm classification, and incomplete analysis of query difficulty. This survey paper formally defines the concepts of hybrid dataset and hybrid query, as well as the corresponding evaluation metrics. Based on these, a pruning-focused framework is proposed to classify and summarize existing algorithms, providing a broader and finer-grained classification framework compared to the existing ones. In addition, a review is conducted on representative hybrid datasets, followed by an analysis on the difficulty of hybrid queries from the perspective of distribution relationships between data and queries. This paper aims to establish a structured foundation for FANNS over the vector-scalar hybrid data, facilitate more meaningful comparisons between FANNS algorithms, and offer practical recommendations for practitioners. The code used for downloading hybrid datasets and analyzing query difficulty is available at https://github.com/lyj-fdu/FANNS
中文: 本综述正式定义了向量-标量混合数据的过滤近似最近邻搜索(FANNS),提出了基于剪枝的算法分类框架,并通过分析查询难度为该领域建立基础标准与实践指导。
English: This survey formally defines filtered approximate nearest neighbor search (FANNS) for vector-scalar hybrid data, proposes a pruning-based classification framework for existing algorithms, and analyzes query difficulty to establish foundational standards and practical guidelines for the field.
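
A hybrid query of the kind the survey formalizes pairs a query vector with a scalar predicate. The exact (brute-force) sketch below contrasts filtering before and after the nearest-neighbor ranking; the toy dataset, the `price` attribute, and both function names are invented for illustration and do not come from the survey or its repository.

```python
import math

def pre_filtered_knn(dataset, query_vec, predicate, k):
    """Keep only rows passing the scalar filter, then rank the survivors by distance."""
    candidates = [
        (math.dist(vec, query_vec), idx)
        for idx, (vec, attrs) in enumerate(dataset)
        if predicate(attrs)
    ]
    candidates.sort()
    return [idx for _, idx in candidates[:k]]

def post_filtered_knn(dataset, query_vec, predicate, k, fetch):
    """Take the `fetch` nearest vectors first, then drop those failing the filter."""
    ranked = sorted(range(len(dataset)),
                    key=lambda i: math.dist(dataset[i][0], query_vec))
    return [i for i in ranked[:fetch] if predicate(dataset[i][1])][:k]

# Each row is (vector, scalar attributes).
dataset = [
    ((0.0, 0.0), {"price": 5}),
    ((0.1, 0.0), {"price": 50}),
    ((0.2, 0.1), {"price": 60}),
    ((1.0, 1.0), {"price": 8}),
    ((2.0, 2.0), {"price": 3}),
]
query = (0.0, 0.0)

def cheap(attrs):
    return attrs["price"] < 10

pre = pre_filtered_knn(dataset, query, cheap, k=2)
post = post_filtered_knn(dataset, query, cheap, k=2, fetch=3)
print(pre, post)
```

On this data the pre-filtered search returns both matching neighbors, while the post-filtered search returns only one because two of the three fetched vectors fail the filter. This is the kind of dependence on how the filter and the vector distribution interact that makes hybrid queries vary in difficulty.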

Authors:Chathurangi Shyalika, Renjith Prasad, Alaa Al Ghazo, Darssan Eswaramoorthi, Harleen Kaur, Sara Shree Muthuselvam, Amit Sheth
Title: SmartPilot: A Multiagent CoPilot for Adaptive and Intelligent Manufacturing
Abstract:
In the dynamic landscape of Industry 4.0, achieving efficiency, precision, and adaptability is essential to optimize manufacturing operations. Industries suffer from supply chain disruptions caused by anomalies; current AI models detect these anomalies but leave domain experts uncertain, without deeper insights into them. Additionally, operational inefficiencies persist due to inaccurate production forecasts and the limited effectiveness of traditional AI models for processing complex sensor data. Despite these advancements, existing systems lack the seamless integration of these capabilities needed to create a truly unified solution for enhancing production and decision-making. We propose SmartPilot, a neurosymbolic, multiagent CoPilot designed for advanced reasoning and contextual decision-making to address these challenges. SmartPilot processes multimodal sensor data and is compact enough to deploy on edge devices. It focuses on three key tasks: anomaly prediction, production forecasting, and domain-specific question answering. By bridging the gap between AI capabilities and real-world industrial needs, SmartPilot empowers industries with intelligent decision-making and drives transformative innovation in manufacturing. The demonstration video, datasets, and supplementary materials are available at https://github.com/ChathurangiShyalika/SmartPilot.
中文摘要:SmartPilot是一种神经符号多智能体协导系统,通过处理边缘设备上的多模态传感器数据,实现异常预测、生产预测和领域特定问答,从而提升制造业的智能化决策能力。
English Summary: SmartPilot is a neurosymbolic multiagent CoPilot that enhances manufacturing by predicting anomalies, forecasting production, and answering domain-specific questions through processing multimodal sensor data on edge devices.

Authors:Shehryar Khattak, Timon Homberger, Lukas Bernreiter, Julian Nubert, Olov Andersson, Roland Siegwart, Kostas Alexis, Marco Hutter
Title: CompSLAM: Complementary Hierarchical Multi-Modal Localization and Mapping for Robot Autonomy in Underground Environments
Abstract:
Robot autonomy in unknown, GPS-denied, and complex underground environments requires real-time, robust, and accurate onboard pose estimation and mapping for reliable operations. This becomes particularly challenging in perception-degraded subterranean conditions under harsh environmental factors, including darkness, dust, and geometrically self-similar structures. This paper details CompSLAM, a highly resilient and hierarchical multi-modal localization and mapping framework designed to address these challenges. Its flexible architecture achieves resilience through redundancy by leveraging the complementary nature of pose estimates derived from diverse sensor modalities. Developed during the DARPA Subterranean Challenge, CompSLAM was successfully deployed on all aerial, legged, and wheeled robots of Team Cerberus during their competition-winning final run. Furthermore, it has proven to be a reliable odometry and mapping solution in various subsequent projects, with extensions enabling multi-robot map sharing for marsupial robotic deployments and collaborative mapping. This paper also introduces a comprehensive dataset acquired by a manually teleoperated quadrupedal robot, covering a significant portion of the DARPA Subterranean Challenge finals course. This dataset is used to evaluate CompSLAM's robustness to sensor degradations as the robot traverses 740 meters in an environment characterized by highly variable geometries and demanding lighting conditions. The CompSLAM code and the DARPA SubT Finals dataset are made publicly available for the benefit of the robotics community.
Chinese: CompSLAM是一种在DARPA地下挑战赛中开发的多模态定位与绘图框架,能在恶劣的地下环境中实现稳健的实时操作,并已成功应用于多种机器人平台。
English: CompSLAM is a resilient, multi-modal framework for real-time localization and mapping in challenging underground environments, developed during the DARPA Subterranean Challenge and successfully deployed on various robots.

Authors:Ruijian Zha, Bojun Liu
Title: A New DAPO Algorithm for Stock Trading
Abstract:
Recent advances in reinforcement learning, such as Dynamic Sampling Policy Optimization (DAPO), show strong performance when paired with large language models (LLMs). Motivated by this success, we ask whether similar gains can be realized in financial trading. We design a trading agent that combines an improved Group Relative Policy Optimization (GRPO) algorithm, augmented with ideas from DAPO, with LLM-based risk and sentiment signals extracted from financial news. On the NASDAQ-100 index (FNSPID dataset), our agent attains a cumulative return of 230.49 percent and an information ratio of 0.37, outperforming the CPPO-DeepSeek baseline. It also cuts training time from about 8 hours to 2.5 hours over 100 epochs while markedly reducing RAM usage. The proposed RL-LLM framework offers a scalable path toward data-efficient trading agents. Code: https://github.com/Ruijian-Zha/FinRL-DAPO-SR/
Chinese: 该RL-LLM框架通过融合改进的GRPO算法与基于LLM的风险情绪信号,在纳斯达克100指数上实现了230.49%的累计收益,同时显著缩短了训练时间并降低了内存占用。
English: The proposed RL-LLM framework, which integrates an enhanced GRPO algorithm with LLM-derived risk and sentiment signals, achieves a 230.49% cumulative return on the NASDAQ-100 while reducing training time and RAM usage.
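
The two headline metrics in this abstract can be illustrated with a minimal sketch. It assumes the textbook definitions: cumulative return as the compounded product of per-period returns, and information ratio as the mean active return divided by its sample standard deviation. The abstract does not state the paper's exact conventions (benchmark choice, annualization), so the function names and definitions below are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def cumulative_return(returns):
    """Compounded return over all periods: prod(1 + r_t) - 1."""
    return float(np.prod(1.0 + np.asarray(returns, dtype=float)) - 1.0)

def information_ratio(returns, benchmark):
    """Mean active return divided by its sample standard deviation.

    Assumption: `benchmark` holds per-period benchmark returns aligned
    with `returns`; no annualization factor is applied.
    """
    active = np.asarray(returns, dtype=float) - np.asarray(benchmark, dtype=float)
    return float(active.mean() / active.std(ddof=1))
```

A cumulative return of 230.49 percent corresponds to `cumulative_return(...)` returning roughly 2.3049 under this definition.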

Authors:Everest Yang, Ria Vasishtha, Luqman K. Dad, Lisa A. Kachnic, Andrew Hope, Eric Wang, Xiao Wu, Yading Yuan, David J. Brenner, Igor Shuryak
Title: CAST: Time-Varying Treatment Effects with Application to Chemotherapy and Radiotherapy on Head and Neck Squamous Cell Carcinoma
Abstract:
Causal machine learning (CML) enables individualized estimation of treatment effects, offering critical advantages over traditional correlation-based methods. However, existing approaches for medical survival data with censoring such as causal survival forests estimate effects at fixed time points, limiting their ability to capture dynamic changes over time. We introduce Causal Analysis for Survival Trajectories (CAST), a novel framework that models treatment effects as continuous functions of time following treatment. By combining parametric and non-parametric methods, CAST overcomes the limitations of discrete time-point analysis to estimate continuous effect trajectories. Using the RADCURE dataset [1] of 2,651 patients with head and neck squamous cell carcinoma (HNSCC) as a clinically relevant example, CAST models how chemotherapy and radiotherapy effects evolve over time at the population and individual levels. By capturing the temporal dynamics of treatment response, CAST reveals how treatment effects rise, peak, and decline over the follow-up period, helping clinicians determine when and for whom treatment benefits are maximized. This framework advances the application of CML to personalized care in HNSCC and other life-threatening medical conditions. Source code/data available at: https://github.com/CAST-FW/HNSCC
中文: CAST框架通过结合参数与非参数方法,将治疗效果建模为时间的连续函数,揭示了头颈癌治疗效应随时间的动态变化,从而克服了传统固定时间点分析的局限,推动了个性化医疗的发展。
English: CAST introduces a continuous time-based framework for estimating dynamic treatment effects in survival data, overcoming the limitations of fixed-time analyses by modeling how effects evolve, peak, and decline, as demonstrated in a head and neck cancer study.

Authors:Chathurangi Shyalika, Renjith Prasad, Fadi El Kalach, Revathy Venkataramanan, Ramtin Zand, Ramy Harik, Amit Sheth
Title: NSF-MAP: Neurosymbolic Multimodal Fusion for Robust and Interpretable Anomaly Prediction in Assembly Pipelines
Abstract:
In modern assembly pipelines, identifying anomalies is crucial in ensuring product quality and operational efficiency. Conventional single-modality methods fail to capture the intricate relationships required for precise anomaly prediction in complex predictive environments with abundant data and multiple modalities. This paper proposes a neurosymbolic AI and fusion-based approach for multimodal anomaly prediction in assembly pipelines. We introduce a time series and image-based fusion model that leverages decision-level fusion techniques. Our research builds upon three primary novel approaches in multimodal learning: time series and image-based decision-level fusion modeling, transfer learning for fusion, and knowledge-infused learning. We evaluate the novel method using our derived and publicly available multimodal dataset and conduct comprehensive ablation studies to assess the impact of our preprocessing techniques and fusion model compared to traditional baselines. The results demonstrate that a neurosymbolic AI-based fusion approach that uses transfer learning can effectively harness the complementary strengths of time series and image data, offering a robust and interpretable approach for anomaly prediction in assembly pipelines with enhanced performance. The datasets, code to reproduce the results, supplementary materials, and demo are available at https://github.com/ChathurangiShyalika/NSF-MAP.
中文摘要:本文提出了一种基于神经符号人工智能和融合的方法,用于装配流水线中的多模态异常预测,通过决策级融合和迁移学习结合时间序列与图像数据,以提升性能与可解释性。
English Summary: This paper introduces a neurosymbolic AI and fusion-based method for multimodal anomaly prediction in assembly pipelines, combining time series and image data through decision-level fusion and transfer learning to improve performance and interpretability.
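
Decision-level fusion, as named in this abstract, combines the outputs of per-modality models rather than their features. The sketch below shows one common variant, a weighted average of class probabilities from a time-series model and an image model. The function name, the weighting scheme, and the `weight_ts` parameter are assumptions for illustration; the abstract does not specify which fusion rule NSF-MAP uses.

```python
import numpy as np

def fuse_decisions(prob_ts, prob_img, weight_ts=0.5):
    """Weighted average of per-modality class probabilities.

    prob_ts, prob_img: arrays of shape (n_samples, n_classes).
    weight_ts: hypothetical weight given to the time-series model.
    Returns the fused probabilities and the predicted class per sample.
    """
    prob_ts = np.asarray(prob_ts, dtype=float)
    prob_img = np.asarray(prob_img, dtype=float)
    fused = weight_ts * prob_ts + (1.0 - weight_ts) * prob_img
    return fused, fused.argmax(axis=1)
```

Because each model is trained separately, either one can be swapped or fine-tuned without retraining the other, which is one reason decision-level fusion pairs naturally with transfer learning.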

Authors:Hang Gao, Chenhao Zhang, Tie Wang, Junsuo Zhao, Fengge Wu, Changwen Zheng, Huaping Liu
Title: Learn to Think: Bootstrapping LLM Reasoning Capability Through Graph Representation Learning
Abstract:
Large Language Models (LLMs) have achieved remarkable success across various domains. However, they still face significant challenges, including high computational costs for training and limitations in solving complex reasoning problems. Although existing methods have extended the reasoning capabilities of LLMs through structured paradigms, these approaches often rely on task-specific prompts and predefined reasoning processes, which constrain their flexibility and generalizability. To address these limitations, we propose a novel framework that leverages graph learning to enable more flexible and adaptive reasoning capabilities for LLMs. Specifically, this approach models the reasoning process of a problem as a graph and employs LLM-based graph learning to guide the adaptive generation of each reasoning step. To further enhance the adaptability of the model, we introduce a Graph Neural Network (GNN) module to perform representation learning on the generated reasoning process, enabling real-time adjustments to both the model and the prompt. Experimental results demonstrate that this method significantly improves reasoning performance across multiple tasks without requiring additional training or task-specific prompt design. Code can be found in https://github.com/zch65458525/L2T.
Chinese: 本文提出了一种新颖框架,将图学习与大语言模型相结合,无需额外训练或任务特定提示即可显著提升模型在多任务中的推理灵活性和性能。
English: This paper introduces a novel framework that integrates graph learning with Large Language Models (LLMs) to enhance their reasoning flexibility and performance across various tasks without requiring extra training or task-specific prompts.

Authors:Tongyu Wen, Chenglong Wang, Xiyuan Yang, Haoyu Tang, Yueqi Xie, Lingjuan Lyu, Zhicheng Dou, Fangzhao Wu
Title: Defending against Indirect Prompt Injection by Instruction Detection
Abstract:
The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration introduces vulnerabilities to Indirect Prompt Injection (IPI) attacks, where hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that IPI attacks fundamentally rely on the presence of instructions embedded within external content, which can alter the behavioral states of LLMs. Can the effective detection of such state changes help us defend against IPI attacks? In this paper, we propose InstructDetector, a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. Specifically, we demonstrate that the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark. The code is publicly available at https://github.com/MYVAE/Instruction-detection.
中文: 大型语言模型与外部数据源的结合易受间接提示注入攻击,本文提出的InstructDetector通过分析模型行为状态,能有效检测此类攻击并显著降低其成功率。
English: The integration of Large Language Models with external sources introduces vulnerabilities to Indirect Prompt Injection attacks, which this paper addresses through InstructDetector, a novel detection method that leverages behavioral states and achieves high accuracy in identifying and mitigating such threats.

Authors:Wei Xiong, Junming Lin, Jiangtong Li, Jie Li, Changjun Jiang
Title: ALFEE: Adaptive Large Foundation Model for EEG Representation
Abstract:
While foundation models excel in text, image, and video domains, critical biological signals, particularly electroencephalography (EEG), remain underexplored. EEG benefits neurological research with its high temporal resolution, operational practicality, and safety profile. However, low signal-to-noise ratio, inter-subject variability, and cross-paradigm differences hinder the generalization of current models. Existing methods often employ simplified strategies, such as a single loss function or a channel-temporal joint representation module, and suffer from a domain gap between pretraining and evaluation tasks that compromises efficiency and adaptability. To address these limitations, we propose the Adaptive Large Foundation model for EEG signal representation (ALFEE) framework, a novel hybrid transformer architecture with two learning stages for robust EEG representation learning. ALFEE employs a hybrid attention that separates channel-wise feature aggregation from temporal dynamics modeling, enabling robust EEG representation with variable channel configurations. A channel encoder adaptively compresses variable channel information, a temporal encoder captures task-guided evolution, and a hybrid decoder reconstructs signals in both temporal and frequency domains. During pretraining, ALFEE optimizes task prediction, channel and temporal mask reconstruction, and temporal forecasting to enhance multi-scale and multi-channel representation. During fine-tuning, a full-model adaptation with a task-specific token dictionary and a cross-attention layer boosts performance across multiple tasks. After 25,000 hours of pretraining, extensive experimental results on six downstream EEG tasks demonstrate the superior performance of ALFEE over existing models. Our ALFEE framework establishes a scalable foundation for biological signal analysis with implementation at https://github.com/xw1216/ALFEE.
Chinese: ALFEE框架采用混合Transformer架构,通过多阶段学习和混合注意力机制解决脑电信号的关键难题,在六项下游任务中经过大规模预训练后展现出卓越性能。
English: The ALFEE framework introduces a hybrid transformer architecture that overcomes EEG signal challenges through multi-stage learning and hybrid attention, achieving superior performance across six tasks after extensive pretraining.

Authors:Zixu Wang, Bingbing Xu, Yige Yuan, Huawei Shen, Xueqi Cheng
Title: InfoNCE is a Free Lunch for Semantically guided Graph Contrastive Learning
Abstract:
As an important graph pre-training method, Graph Contrastive Learning (GCL) continues to play a crucial role in the ongoing surge of research on graph foundation models or LLM as enhancer for graphs. Traditional GCL optimizes InfoNCE by using augmentations to define self-supervised tasks, treating augmented pairs as positive samples and others as negative. However, this leads to semantically similar pairs being classified as negative, causing significant sampling bias and limiting performance. In this paper, we argue that GCL is essentially a Positive-Unlabeled (PU) learning problem, where the definition of self-supervised tasks should be semantically guided, i.e., augmented samples with similar semantics are considered positive, while others, with unknown semantics, are treated as unlabeled. From this perspective, the key lies in how to extract semantic information. To achieve this, we propose IFL-GCL, using InfoNCE as a "free lunch" to extract semantic information. Specifically, we first prove that under InfoNCE, the representation similarity of node pairs aligns with the probability that the corresponding contrastive sample is positive. Then we redefine the maximum likelihood objective based on the corrected samples, leading to a new InfoNCE loss function. Extensive experiments on both the graph pretraining framework and LLM as an enhancer show significant improvements of IFL-GCL in both IID and OOD scenarios, achieving up to a 9.05% improvement, validating the effectiveness of semantic guidance. Code for IFL-GCL is publicly available at: https://github.com/Camel-Prince/IFL-GCL.
中文: 本文提出IFL-GCL方法,将图对比学习重新定义为正例-未标记学习问题,通过利用InfoNCE提取语义信息并修正采样偏差,在独立同分布和非独立同分布场景下均实现了显著性能提升。
English: This paper reinterprets Graph Contrastive Learning (GCL) as a Positive-Unlabeled learning problem and proposes IFL-GCL, a method that leverages InfoNCE to extract semantic information and correct sampling bias, achieving significant performance improvements in both IID and OOD scenarios.
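
For readers unfamiliar with the baseline objective this abstract starts from, the following is a minimal NumPy sketch of the standard InfoNCE loss, in which each augmented pair is the positive and every other sample in the batch is treated as a negative. It is not the corrected IFL-GCL loss, which the abstract does not spell out; the function name and the `temperature` default are illustrative assumptions.

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.5):
    """Mean InfoNCE loss; row i of `positive` is the positive for row i of `anchor`."""
    # L2-normalise so that dot products are cosine similarities.
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                        # (n, n) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the augmented pairs; all off-diagonal entries act as negatives.
    return float(-np.mean(np.diag(log_prob)))
```

The off-diagonal entries are where the sampling bias described above arises: a semantically similar node still lands in the denominator as a negative.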

Authors:Gabriele Rosi, Fabio Cermelli
Title: Show or Tell? A Benchmark To Evaluate Visual and Textual Prompts in Semantic Segmentation
Abstract:
Prompt engineering has shown remarkable success with large language models, yet its systematic exploration in computer vision remains limited. In semantic segmentation, both textual and visual prompts offer distinct advantages: textual prompts through open-vocabulary methods allow segmentation of arbitrary categories, while visual reference prompts provide intuitive reference examples. However, existing benchmarks evaluate these modalities in isolation, without direct comparison under identical conditions. We present Show or Tell (SoT), a novel benchmark specifically designed to evaluate both visual and textual prompts for semantic segmentation across 14 datasets spanning 7 diverse domains (common scenes, urban, food, waste, parts, tools, and land-cover). We evaluate 5 open-vocabulary methods and 4 visual reference prompt approaches, adapting the latter to handle multi-class segmentation through a confidence-based mask merging strategy. Our extensive experiments reveal that open-vocabulary methods excel with common concepts easily described by text but struggle with complex domains like tools, while visual reference prompt methods achieve good average results but exhibit high variability depending on the input prompt. Through comprehensive quantitative and qualitative analysis, we identify the strengths and weaknesses of both prompting modalities, providing valuable insights to guide future research in vision foundation models for segmentation tasks.
中文: Show or Tell (SoT)基准测试在14个数据集中评估了语义分割的视觉与文本提示,发现开放词汇方法擅长处理文本易描述概念而视觉提示结果波动较大,为视觉基础模型的未来发展提供了重要参考。
English: The Show or Tell (SoT) benchmark evaluates both visual and textual prompts for semantic segmentation across 14 datasets, revealing that open-vocabulary methods excel with text-friendly concepts while visual prompts show high variability, offering insights for future vision model development.

Authors:Baijiong Lin, Weisen Jiang, Yuancheng Xu, Hao Chen, Ying-Cong Chen
Title: PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model
Abstract:
Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen. Recently, GenARM (Xu et al., 2025) first independently trains Autoregressive Reward Models (ARMs) for each preference dimension without awareness of each other, then combines their outputs based on user-specific preference vectors during inference to achieve multi-objective test-time alignment, leading to two key limitations: the need for multiple ARMs increases the inference cost, and the separate training of ARMs causes the misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a single unified ARM trained across all preference dimensions. PARM uses our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on preference vectors, enabling it to achieve precise control over preference trade-offs during inference. Experiments demonstrate that PARM reduces inference costs and achieves better alignment with preference vectors compared with existing methods. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, making multi-objective alignment accessible with limited computing resources. The code is available at https://github.com/Baijiong-Lin/PARM.
中文: 本文提出PARM,一种统一的自回归奖励模型,通过偏好感知双线性适配实现精确的多目标对齐,相比现有方法降低了推理成本并提升了性能。
English: This paper introduces PARM, a unified autoregressive reward model that uses preference-aware bilinear adaptation to achieve precise multi-objective alignment with reduced inference costs and improved performance compared to existing methods.

Authors:Junzhou Xu, Boyu Diao
Title: A Sensitivity-Driven Expert Allocation Method in LoRA-MoE for Efficient Fine-Tuning
Abstract:
As deep learning models expand, the pre-training-fine-tuning paradigm has become the standard approach for handling various downstream tasks. However, shared parameters can lead to diminished performance when dealing with complex datasets involving multiple tasks. While introducing Mixture-of-Experts (MoE) methods has alleviated this issue to some extent, it also significantly increases the number of parameters required for fine-tuning and training time, introducing greater parameter redundancy. To address these challenges, we propose LoRA-SMoE (A Sensitivity-Driven Expert Allocation Method in LoRA-MoE for Efficient Fine-Tuning), a method for allocating expert numbers based on parameter sensitivity. This method rapidly assesses the sensitivity of different tasks to parameters by sampling a small amount of data and using gradient information. It then adaptively allocates expert numbers within a given budget. The process maintains comparable memory consumption to LoRA (Low-Rank Adaptation) while ensuring an efficient and resource-friendly fine-tuning procedure. Experimental results demonstrate that compared to SOTA fine-tuning methods, our LoRA-SMoE approach can enhance model performance while reducing the number of trainable parameters. This significantly improves model performance in resource-constrained environments. Additionally, due to its efficient parameter sensitivity evaluation mechanism, LoRA-SMoE requires minimal computational overhead to optimize expert allocation, making it particularly suitable for scenarios with limited computational resources. All the code in this study will be made publicly available following the acceptance of the paper for publication. Source code is at https://github.com/EMLS-ICTCAS/LoRA-SMoE
中文: 提出的LoRA-SMoE方法基于参数敏感度自适应分配专家数量,在提升模型性能的同时减少可训练参数,特别适合资源受限的环境。
English: The proposed LoRA-SMoE method adaptively allocates expert numbers based on parameter sensitivity to enhance model performance while reducing trainable parameters, making it particularly efficient for resource-constrained environments.

Authors:Zhiyu Zhu, Jiayu Zhang, Zhibo Jin, Fang Chen, Jianlong Zhou
Title: ABE: A Unified Framework for Robust and Faithful Attribution-Based Explainability
Abstract:
Attribution algorithms are essential for enhancing the interpretability and trustworthiness of deep learning models by identifying key features driving model decisions. Existing frameworks, such as InterpretDL and OmniXAI, integrate multiple attribution methods but suffer from scalability limitations, high coupling, theoretical constraints, and lack of user-friendly implementations, hindering neural network transparency and interoperability. To address these challenges, we propose Attribution-Based Explainability (ABE), a unified framework that formalizes Fundamental Attribution Methods and integrates state-of-the-art attribution algorithms while ensuring compliance with attribution axioms. ABE enables researchers to develop novel attribution techniques and enhances interpretability through four customizable modules: Robustness, Interpretability, Validation, and Data & Model. This framework provides a scalable, extensible foundation for advancing attribution-based explainability and fostering transparent AI systems. Our code is available at: https://github.com/LMBTough/ABE-XAI.
中文: 提出的基于归因的可解释性(ABE)框架统一了核心归因方法,通过可定制模块解决现有工具的可扩展性和易用性不足,推动透明人工智能系统的发展。
English: The proposed Attribution-Based Explainability (ABE) framework unifies fundamental attribution methods to overcome scalability and usability limitations in existing tools, offering modular components to advance transparent AI systems.

Authors:Wenqi Zeng, Yuqi Sun, Chenxi Ma, Weimin Tan, Bo Yan
Title: MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks
Abstract:
Medical vision-language models (VLMs) have shown promise as clinical assistants across various medical fields. However, specialized dermatology VLMs capable of delivering professional and detailed diagnostic analysis remain underdeveloped, primarily due to the less specialized text descriptions in current dermatology multimodal datasets. To address this issue, we propose MM-Skin, the first large-scale multimodal dermatology dataset that encompasses 3 imaging modalities (clinical, dermoscopic, and pathological) and nearly 10k high-quality image-text pairs collected from professional textbooks. In addition, we generate over 27k diverse, instruction-following vision question answering (VQA) samples (9 times the size of the current largest dermatology VQA dataset). Leveraging public datasets and MM-Skin, we developed SkinVL, a dermatology-specific VLM designed for precise and nuanced skin disease interpretation. Comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT), and zero-shot classification tasks across 8 datasets reveal its exceptional performance for skin diseases in comparison to both general and medical VLMs. The introduction of MM-Skin and SkinVL offers a meaningful contribution to advancing the development of clinical dermatology VLM assistants. MM-Skin is available at https://github.com/ZwQ803/MM-Skin
中文: 该研究提出了MM-Skin,一个包含多种成像模式和大规模高质量图文对的多模态皮肤病数据集,并开发了SkinVL这一专业视觉语言模型,在皮肤病解读任务中展现出优于现有模型的卓越性能。
English: The study introduces MM-Skin, a large-scale multimodal dermatology dataset with diverse imaging modalities and high-quality image-text pairs, and develops SkinVL, a specialized vision-language model that demonstrates superior performance in skin disease interpretation compared to existing models.

Authors:Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li
Title: UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Abstract:
A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
中文: UniVLA提出了一种新框架,通过从视频中提取任务中心化的动作表示来学习跨具身的视觉-语言-动作策略,能以极少的计算资源和数据实现最先进的性能,并可部署到多种机器人平台。
English: UniVLA introduces a novel framework that learns cross-embodiment vision-language-action policies by deriving task-centric action representations from videos, enabling deployment across various robots with state-of-the-art results using significantly less computational resources and data.

Authors:Wendy Carvalho, Meriem Elkoudi, Brendan Hertel, Reza Azadeh
Title: Parameter-Free Segmentation of Robot Movements with Cross-Correlation Using Different Similarity Metrics
Abstract:
Often, robots are asked to execute primitive movements, whether as a single action or in a series of actions representing a larger, more complex task. These movements can be learned in many ways, but a common one is from demonstrations presented to the robot by a teacher. However, these demonstrations are not always simple movements themselves, and complex demonstrations must be broken down, or segmented, into primitive movements. In this work, we present a parameter-free approach to segmentation using techniques inspired by autocorrelation and cross-correlation from signal processing. In cross-correlation, a representative signal is found in some larger, more complex signal by correlating the representative signal with the larger signal. This same idea can be applied to segmenting robot motion and demonstrations, given a representative motion primitive. This results in a fast and accurate segmentation that does not take any parameters. One of the main contributions of this paper is the modification of the cross-correlation process by employing similarity metrics that can capture features specific to robot movements. To validate our framework, we conduct several experiments on complex tasks both in simulation and in the real world. We also evaluate the effectiveness of our segmentation framework by comparing various similarity metrics.
中文总结:本文提出了一种基于自相关和互相关技术的无参数分割方法,通过采用针对机器人运动特性的相似性度量,能够将复杂的动作示范准确分解为基本运动单元。
English Summary: This paper introduces a parameter-free segmentation method for robot motion using autocorrelation and cross-correlation techniques, enhanced with specialized similarity metrics to accurately break down complex demonstrations into primitive movements.
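
The cross-correlation idea in this abstract, sliding a representative primitive along a longer demonstration and scoring each alignment with a pluggable similarity metric, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the function names are hypothetical, negative Euclidean distance stands in for the paper's similarity metrics (which the abstract does not list), and only the single best match is returned.

```python
import numpy as np

def neg_euclidean(window, primitive):
    """Similarity as negative Euclidean distance (higher is more similar)."""
    return -float(np.linalg.norm(window - primitive))

def sliding_similarity(signal, primitive, similarity=neg_euclidean):
    """Score every alignment of `primitive` against `signal`."""
    n, m = len(signal), len(primitive)
    return np.array([similarity(signal[i:i + m], primitive)
                     for i in range(n - m + 1)])

def best_match(signal, primitive, similarity=neg_euclidean):
    """Return the (start, end) indices of the best-scoring alignment."""
    scores = sliding_similarity(signal, primitive, similarity)
    start = int(np.argmax(scores))
    return start, start + len(primitive)
```

Both inputs may be 1-D signals or `(timesteps, dims)` trajectories; swapping `similarity` for a different callable changes which motion features drive the match.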

Authors:Brendan Hertel, Reza Azadeh
Title: Robot Learning Using Multi-Coordinate Elastic Maps
Abstract:
To learn manipulation skills, robots need to understand the features of those skills. An easy way for robots to learn is through Learning from Demonstration (LfD), where the robot learns a skill from an expert demonstrator. While the main features of a skill might be captured in one differential coordinate (i.e., Cartesian), they could have meaning in other coordinates. For example, an important feature of a skill may be its shape or velocity profile, which are difficult to discover in the Cartesian differential coordinate. In this work, we present a method that enables robots to learn skills from human demonstrations by encoding these skills into various differential coordinates and then determining the importance of each coordinate for reproducing the skill. We also introduce a modified form of Elastic Maps that includes multiple differential coordinates, combining statistical modeling of skills in these differential coordinate spaces. Elastic Maps, which are flexible and fast to compute, allow for the incorporation of several different types of constraints and the use of any number of demonstrations. Additionally, we propose methods for auto-tuning several parameters associated with the modified Elastic Map formulation. We validate our approach in several simulated experiments and a real-world writing task with a UR5e manipulator arm.
中文摘要:本研究提出一种方法,使机器人能够通过将人类演示技能编码到多种微分坐标系中,确定各坐标系重要性,并采用改进的弹性地图公式,从而实现从演示中高效学习操作技能。
English Summary: This study introduces a method for robots to learn manipulation skills from human demonstrations by encoding them into multiple differential coordinates, determining each coordinate's importance, and utilizing an enhanced Elastic Map formulation for efficient skill reproduction.

Authors:Gabriel Gagné, Anisha Azad, Thomas Labbé, Evan Campbell, Xavier Isabel, Erik Scheme, Ulysse Côté-Allard, Benoit Gosselin
Title: Context Informed Incremental Learning Improves Myoelectric Control Performance in Virtual Reality Object Manipulation Tasks
Abstract:
Electromyography (EMG)-based gesture recognition is a promising approach for designing intuitive human-computer interfaces. However, while these systems typically perform well in controlled laboratory settings, their usability in real-world applications is compromised by declining performance during real-time control. This decline is largely due to goal-directed behaviors that are not captured in static, offline scenarios. To address this issue, we use Context Informed Incremental Learning (CIIL), marking its first deployment in an object-manipulation scenario, to continuously adapt the classifier using contextual cues. Nine participants without upper limb differences completed a functional task in a virtual reality (VR) environment involving transporting objects with life-like grips. We compared two scenarios: one where the classifier was adapted in real-time using contextual information, and the other using a traditional open-loop approach without adaptation. The CIIL-based approach not only enhanced task success rates and efficiency, but also reduced the perceived workload by 7.1%, despite causing a 5.8% reduction in offline classification accuracy. This study highlights the potential of real-time contextualized adaptation to enhance user experience and usability of EMG-based systems for practical, goal-oriented applications, crucial elements towards their long-term adoption. The source code for this study is available at: https://github.com/BiomedicalITS/ciil-emg-vr.
中文: 该研究在虚拟现实物体操控任务中首次应用了上下文感知增量学习(CIIL),证明实时调整肌电手势识别系统能提升任务完成效率并降低用户负荷,尽管离线分类准确率略有下降。
English: The study introduces Context Informed Incremental Learning (CIIL) in a VR object-manipulation task, showing that real-time adaptation of EMG-based gesture recognition enhances task success and reduces user workload, despite a slight drop in offline accuracy.

Authors:Congqi Cao, Peiheng Han, Yueran Zhang, Yating Yu, Qinyi Lv, Lingtong Min, Yanning Zhang
Title: Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition
Abstract:
Large-scale pre-trained models have achieved remarkable success in language and image tasks, leading an increasing number of studies to explore the application of pre-trained image models, such as CLIP, in the domain of few-shot action recognition (FSAR). However, current methods generally suffer from several problems: 1) Direct fine-tuning often undermines the generalization capability of the pre-trained model; 2) The exploration of task-specific information is insufficient in the visual tasks; 3) The semantic order information is typically overlooked during text modeling; 4) Existing cross-modal alignment techniques ignore the temporal coupling of multimodal information. To address these, we propose Task-Adapter++, a parameter-efficient dual adaptation method for both image and text encoders. Specifically, to make full use of the variations across different few-shot learning tasks, we design a task-specific adaptation for the image encoder so that the most discriminative information can be well noticed during feature extraction. Furthermore, we leverage large language models (LLMs) to generate detailed sequential sub-action descriptions for each action class, and introduce semantic order adapters into the text encoder to effectively model the sequential relationships between these sub-actions. Finally, we develop an innovative fine-grained cross-modal alignment strategy that actively maps visual features to reside in the same temporal stage as semantic descriptions. Extensive experiments fully demonstrate the effectiveness and superiority of the proposed method, which achieves state-of-the-art performance on 5 benchmarks consistently. The code is open-sourced at https://github.com/Jaulin-Bage/Task-Adapter-pp.
中文: 本文提出Task-Adapter++方法,通过图像编码器的任务适配、大语言模型生成的序列化子动作描述以及细粒度跨模态对齐策略,在少样本动作识别任务中实现了最优性能,并在五个基准测试中取得领先成果。
English: This paper introduces Task-Adapter++, a parameter-efficient dual adaptation method that enhances few-shot action recognition by incorporating task-specific image adaptation, sequential sub-action modeling with LLMs, and a fine-grained cross-modal alignment strategy, achieving state-of-the-art results on multiple benchmarks.

Authors:Vytenis Šliogeris, Povilas Daniušis, Artūras Nakvosas
Title: Full-Parameter Continual Pretraining of Gemma2: Insights into Fluency and Domain Knowledge
Abstract:
In this technical report, we empirically investigate the relationship between linguistic fluency and domain knowledge in the context of continual learning with large language models (LLMs). Specifically, we enhance the linguistic fluency of the Gemma2 LLM for the Lithuanian language by autoregressively pretraining its full parameter set on the first 10% of the Lithuanian language component of the CulturaX dataset. To prevent catastrophic forgetting of the model's existing domain knowledge, we apply Elastic Weight Consolidation (EWC), leveraging Fisher information estimated using data from the Massive Multitask Language Understanding (MMLU) benchmark. In the post-training evaluations, we assess linguistic fluency through perplexity and evaluate domain knowledge using accuracy on a suite of language understanding benchmarks, including ARC-Easy, Belebele, GSM8K, HellaSwag, MMLU, TruthfulQA, and Winogrande, in both English and Lithuanian. The empirical results demonstrate that EWC not only mitigates catastrophic forgetting by preserving the model's performance in terms of both linguistic fluency and domain knowledge but also improves or maintains these capabilities for the newly added Lithuanian language. These findings highlight the potential for more efficient adaptation of general-purpose LLMs to under-represented languages without requiring access to the original training data. The accompanying codebase is openly accessible at https://github.com/Neurotechnology/LLM_EWC.
中文摘要:本研究表明,通过全参数预训练结合弹性权重巩固(EWC)方法,在提升Gemma2模型立陶宛语流畅度的同时有效防止了灾难性遗忘,实现了无需原始训练数据即可使通用大语言模型高效适应小语种的能力。
English Summary: This study demonstrates that Elastic Weight Consolidation (EWC) effectively prevents catastrophic forgetting while enhancing the Gemma2 model's Lithuanian language fluency through full-parameter pretraining, enabling efficient adaptation to underrepresented languages without original training data.
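
The EWC regularizer mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the standard diagonal-Fisher form of the penalty, (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2, applied to flat lists of parameters. The function names and the toy numbers are hypothetical and are not taken from the authors' codebase.

```python
from typing import Sequence


def ewc_penalty(params: Sequence[float],
                anchor_params: Sequence[float],
                fisher_diag: Sequence[float],
                lam: float) -> float:
    """Diagonal EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params        -- current parameter values (theta)
    anchor_params -- parameter values before continual pretraining (theta*)
    fisher_diag   -- diagonal Fisher information estimates, one per parameter
    lam           -- regularization strength (hypothetical hyperparameter name)
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    weighted_sq = sum(f * (p - a) ** 2
                      for p, a, f in zip(params, anchor_params, fisher_diag))
    return 0.5 * lam * weighted_sq


def total_loss(task_loss: float,
               params: Sequence[float],
               anchor_params: Sequence[float],
               fisher_diag: Sequence[float],
               lam: float) -> float:
    """New-task loss plus the EWC penalty that anchors important parameters."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)
```

A parameter with a large Fisher value is pulled strongly back toward its anchor, while a parameter with a Fisher value near zero is free to move. This is how the penalty trades off new-language fluency against retained knowledge.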

Authors:Weihong Li, Xiaoqiong Liu, Heng Fan, Libo Zhang
Title: CGTrack: Cascade Gating Network with Hierarchical Feature Aggregation for UAV Tracking
Abstract:
Recent advancements in visual object tracking have markedly improved the capabilities of unmanned aerial vehicle (UAV) tracking, which is a critical component in real-world robotics applications. While the integration of hierarchical lightweight networks has become a prevalent strategy for enhancing efficiency in UAV tracking, it often results in a significant drop in network capacity, which further exacerbates challenges in UAV scenarios, such as frequent occlusions and extreme changes in viewing angles. To address these issues, we introduce a novel family of UAV trackers, termed CGTrack, which combines explicit and implicit techniques to expand network capacity within a coarse-to-fine framework. Specifically, we first introduce a Hierarchical Feature Cascade (HFC) module that leverages the spirit of feature reuse to increase network capacity by integrating the deep semantic cues with the rich spatial information, incurring minimal computational costs while enhancing feature representation. Based on this, we design a novel Lightweight Gated Center Head (LGCH) that utilizes gating mechanisms to decouple target-oriented coordinates from previously expanded features, which contain dense local discriminative information. Extensive experiments on three challenging UAV tracking benchmarks demonstrate that CGTrack achieves state-of-the-art performance while running fast. Code will be available at https://github.com/Nightwatch-Fox11/CGTrack.
Chinese: CGTrack模型通过引入分层特征级联模块和轻量门控中心头,在提升无人机跟踪网络容量和效率的同时,在多个基准测试中实现了最先进的性能且运行迅速。
English: The CGTrack model introduces a Hierarchical Feature Cascade module and a Lightweight Gated Center Head to enhance UAV tracking by boosting network capacity and efficiency, achieving top performance on benchmarks with high speed.

Authors:Jianjian Yin, Yi Chen, Chengyu Li, Zhichao Zheng, Yanhui Gu, Junsheng Zhou
Title: DFEN: Dual Feature Equalization Network for Medical Image Segmentation
Abstract:
Current methods for medical image segmentation primarily focus on extracting contextual feature information from the perspective of the whole image. While these methods have shown effective performance, none of them takes into account the fact that pixels at the boundary and in regions with a low number of class pixels capture more contextual feature information from other classes, leading to misclassification of pixels due to unequal contextual feature information. In this paper, we propose a dual feature equalization network based on the hybrid architecture of Swin Transformer and Convolutional Neural Network, aiming to augment the pixel feature representations with image-level equalization feature information and class-level equalization feature information. Firstly, the image-level feature equalization module is designed to equalize the contextual information of pixels within the image. Secondly, we aggregate regions of the same class to equalize the pixel feature representations of the corresponding class with a class-level feature equalization module. Finally, the pixel feature representations are enhanced by learning weights for image-level equalization feature information and class-level equalization feature information. In addition, Swin Transformer is utilized as both the encoder and decoder, thereby bolstering the ability of the model to capture long-range dependencies and spatial correlations. We conducted extensive experiments on Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC2017), Automated Cardiac Diagnosis Challenge (ACDC) and PH$^2$ datasets. The experimental results demonstrate that our method has achieved state-of-the-art performance. Our code is publicly available at https://github.com/JianJianYin/DFEN.
中文摘要:本文提出一种基于Swin Transformer与CNN混合架构的双重特征均衡网络,通过图像级和类别级特征均衡增强像素特征表示,在多个医学影像数据集上实现了最先进的性能。
English Summary: This paper introduces a dual feature equalization network combining Swin Transformer and CNN to enhance pixel feature representation through image-level and class-level feature equalization, achieving state-of-the-art performance on multiple medical image datasets.

Authors:Hanzhe Liang, Aoran Wang, Jie Zhou, Xin Jin, Can Gao, Jinbao Wang
Title: Examining the Source of Defects from a Mechanical Perspective for 3D Anomaly Detection
Abstract:
In this paper, we explore a novel approach to 3D anomaly detection (AD) that goes beyond merely identifying anomalies based on structural characteristics. Our primary perspective is that most anomalies arise from unpredictable defective forces originating from both internal and external sources. To address these anomalies, we seek out opposing forces that can help correct them. Therefore, we introduce the Mechanics Complementary Model-based Framework for the 3D-AD task (MC4AD), which generates internal and external corrective forces for each point. We first propose a Diverse Anomaly-Generation (DA-Gen) module designed to simulate various types of anomalies. Next, we present the Corrective Force Prediction Network (CFP-Net), which uses complementary representations for point-level analysis to simulate the different contributions from internal and external corrective forces. To ensure the corrective forces are constrained effectively, we have developed a combined loss function that includes a new symmetric loss and an overall loss. Notably, we implement a Hierarchical Quality Control (HQC) strategy based on a three-way decision process and contribute a dataset titled Anomaly-IntraVariance, which incorporates intraclass variance to evaluate our model. As a result, the proposed MC4AD has been proven effective through theory and experimentation. The experimental results demonstrate that our approach yields nine state-of-the-art performances, achieving optimal results with minimal parameters and the fastest inference speed across five existing datasets, in addition to the proposed Anomaly-IntraVariance dataset. The source code is available at https://github.com/hzzzzzhappy/MC4AD
Chinese: 本文提出MC4AD框架,通过生成内外矫正力的力学互补模型解决3D异常检测问题,在多个数据集上以最少参数和最快推理速度实现了最优性能。
English: This paper introduces MC4AD, a novel 3D anomaly detection framework that generates corrective forces to address anomalies through a complementary mechanics model, achieving state-of-the-art performance with minimal parameters and fast inference across multiple datasets.

Authors:Changkun Ye, Russell Tsuchida, Lars Petersson, Nick Barnes
Title: Open Set Label Shift with Test Time Out-of-Distribution Reference
Abstract:
Open set label shift (OSLS) occurs when label distributions change from a source to a target distribution, and the target distribution has an additional out-of-distribution (OOD) class. In this work, we build estimators for both source and target open set label distributions using a source domain in-distribution (ID) classifier and an ID/OOD classifier. With reasonable assumptions on the ID/OOD classifier, the estimators are assembled into a sequence of three stages: 1) an estimate of the source label distribution of the OOD class, 2) an EM algorithm for Maximum Likelihood estimates (MLE) of the target label distribution, and 3) an estimate of the target label distribution of OOD class under relaxed assumptions on the OOD classifier. The sampling errors of estimates in 1) and 3) are quantified with a concentration inequality. The estimation result allows us to correct the ID classifier trained on the source distribution to the target distribution without retraining. Experiments on a variety of open set label shift settings demonstrate the effectiveness of our model. Our code is available at https://github.com/ChangkunYe/OpenSetLabelShift.
中文: 本研究提出了一种三阶段估计方法,通过利用分类器在不重新训练的情况下调整源分布训练的模型以适应目标分布,有效解决了开放集标签偏移问题,并在多种实验场景中得到验证。
English: This work develops a three-stage estimation method to address open set label shift by leveraging classifiers to adjust source-trained models for target distributions without retraining, validated through diverse experimental settings.
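
Stage 2 of the abstract above relies on an EM algorithm for maximum likelihood estimation of the target label distribution. As a rough illustration, the sketch below implements the classical closed-set EM prior-re-estimation procedure. This is a swapped-in, simpler technique: it is not the open-set estimator of the paper, it has no OOD class, and the function and argument names are hypothetical.

```python
from typing import List, Sequence


def em_label_shift(posteriors: Sequence[Sequence[float]],
                   source_prior: Sequence[float],
                   n_iter: int = 100,
                   tol: float = 1e-10) -> List[float]:
    """Estimate target class priors from source-classifier posteriors.

    posteriors   -- p_source(y | x_i) for each unlabeled target sample x_i
    source_prior -- class priors of the source (training) distribution
    """
    k = len(source_prior)
    if any(p <= 0 for p in source_prior):
        raise ValueError("source priors must be strictly positive")
    target_prior = list(source_prior)  # initialize at the source prior
    for _ in range(n_iter):
        totals = [0.0] * k
        for post in posteriors:
            # E-step: re-weight each posterior by the prior ratio, then normalize.
            weighted = [post[c] * target_prior[c] / source_prior[c]
                        for c in range(k)]
            norm = sum(weighted)
            if norm <= 0:
                raise ValueError("posterior has no mass on any supported class")
            for c in range(k):
                totals[c] += weighted[c] / norm
        # M-step: the new prior is the average of the corrected posteriors.
        new_prior = [t / len(posteriors) for t in totals]
        change = max(abs(a - b) for a, b in zip(new_prior, target_prior))
        target_prior = new_prior
        if change < tol:
            break
    return target_prior
```

The estimated prior ratio can then be used to re-weight the source classifier's outputs on target data, which matches the abstract's remark that the classifier is corrected without retraining.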

Authors:Chunlai Dong, Haochao Ying, Qibo Qiu, Jinhong Wang, Danny Chen, Jian Wu
Title: Dual-level Fuzzy Learning with Patch Guidance for Image Ordinal Regression
Abstract:
Ordinal regression bridges regression and classification by assigning objects to ordered classes. While human experts rely on discriminative patch-level features for decisions, current approaches are limited by the availability of only image-level ordinal labels, overlooking fine-grained patch-level characteristics. In this paper, we propose a Dual-level Fuzzy Learning with Patch Guidance framework, named DFPG, that learns precise feature-based grading boundaries from ambiguous ordinal labels, with patch-level supervision. Specifically, we propose patch-labeling and filtering strategies to enable the model to focus on patch-level features exclusively with only image-level ordinal labels available. We further design a dual-level fuzzy learning module, which leverages fuzzy logic to quantitatively capture and handle label ambiguity from both patch-wise and channel-wise perspectives. Extensive experiments on various image ordinal regression datasets demonstrate the superiority of our proposed method, further confirming its ability to distinguish samples from difficult-to-classify categories. The code is available at https://github.com/ZJUMAI/DFPG-ord.
中文摘要:提出的DFPG框架通过双重模糊学习与局部引导,利用图像级标签提取局部特征,有效处理标记模糊性并提升困难类别的分类性能。
English Summary: The proposed DFPG framework introduces dual-level fuzzy learning with patch guidance to address ordinal regression by leveraging patch-level features from image-level labels, effectively handling label ambiguity and improving classification of challenging categories.

Authors:Zhiyuan Chen, Keyi Li, Yifan Jia, Le Ye, Yufei Ma
Title: Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition
Abstract:
Diffusion transformer (DiT) models have achieved remarkable success in image generation, thanks to their exceptional generative capabilities and scalability. Nonetheless, the iterative nature of diffusion models (DMs) results in high computational complexity, posing challenges for deployment. Although existing cache-based acceleration methods try to utilize the inherent temporal similarity to skip redundant computations of DiT, the lack of correction may induce potential quality degradation. In this paper, we propose increment-calibrated caching, a training-free method for DiT acceleration, where the calibration parameters are generated from the pre-trained model itself with low-rank approximation. To deal with the possible correction failure arising from outlier activations, we introduce channel-aware Singular Value Decomposition (SVD), which further strengthens the calibration effect. Experimental results show that our method consistently achieves better performance than existing naive caching methods with a similar computation resource budget. When compared with 35-step DDIM, our method eliminates more than 45% of the computation and improves IS by 12 at the cost of less than 0.06 FID increase. Code is available at https://github.com/ccccczzy/icc.
中文: 本文提出了一种无需训练的扩散变换器加速方法——增量校准缓存,通过低秩近似和通道感知SVD在减少45%以上计算量的同时保持图像生成质量。
English: This paper introduces a training-free acceleration method for diffusion transformers called increment-calibrated caching, which uses low-rank approximation and channel-aware SVD to reduce computations by over 45% while maintaining image quality.

Authors:Yizhuo Yang, Jiulin Zhao, Xinhang Xu, Kun Cao, Shenghai Yuan, Lihua Xie
Title: Unsupervised Anomaly Detection for Autonomous Robots via Mahalanobis SVDD with Audio-IMU Fusion
Abstract:
Reliable anomaly detection is essential for ensuring the safety of autonomous robots, particularly when conventional detection systems based on vision or LiDAR become unreliable in adverse or unpredictable conditions. In such scenarios, alternative sensing modalities are needed to provide timely and robust feedback. To this end, we explore the use of audio and inertial measurement unit (IMU) sensors to detect underlying anomalies in autonomous mobile robots, such as collisions and internal mechanical faults. Furthermore, to address the challenge of limited labeled anomaly data, we propose an unsupervised anomaly detection framework based on Mahalanobis Support Vector Data Description (M-SVDD). In contrast to conventional SVDD methods that rely on Euclidean distance and assume isotropic feature distributions, our approach employs the Mahalanobis distance to adaptively scale feature dimensions and capture inter-feature correlations, enabling more expressive decision boundaries. In addition, a reconstruction-based auxiliary branch is introduced to preserve feature diversity and prevent representation collapse, further enhancing the robustness of anomaly detection. Extensive experiments on a collected mobile robot dataset and four public datasets demonstrate the effectiveness of the proposed method, as shown in the video https://youtu.be/yh1tn6DDD4A. Code and dataset are available at https://github.com/jamesyang7/M-SVDD.
中文摘要:本研究提出了一种基于音频和IMU传感器的无监督异常检测框架,通过马氏距离优化特征尺度并引入重构分支,有效提升了自主机器人在碰撞和机械故障检测中的鲁棒性。
English Summary: This study introduces an unsupervised anomaly detection framework using audio and IMU sensors for autonomous robots, employing Mahalanobis distance to improve feature scaling and incorporating a reconstruction branch to enhance robustness against collisions and mechanical faults.
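
The contrast the abstract above draws between Euclidean and Mahalanobis distance can be made concrete with a small sketch. It only illustrates the distance itself on 2-D features with a fixed, empirically estimated covariance. It is not the M-SVDD training procedure, and all function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Cov = Tuple[float, float, float]  # (var_x, cov_xy, var_y)


def fit_gaussian_2d(points: Sequence[Point]) -> Tuple[Point, Cov]:
    """Sample mean and unbiased covariance of 2-D feature vectors."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(x: Point, mean: Point, cov: Cov) -> float:
    """sqrt((x - mean)^T Sigma^{-1} (x - mean)) via the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quad)


def euclidean_2d(x: Point, mean: Point) -> float:
    """Plain Euclidean distance, which treats every direction identically."""
    return math.hypot(x[0] - mean[0], x[1] - mean[1])
```

With variance 4 along x and 1 along y, the points (2, 0) and (0, 2) are equally far from the origin in Euclidean terms, but the second has twice the Mahalanobis distance of the first because it deviates along the low-variance direction. An anomaly score based on this distance therefore adapts to the scale and correlation of each feature.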

Authors:Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
Title: MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
Abstract:
Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision GroupGEMM kernels, enabling parallel execution of GEMMs with different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4x speedup over full precision, as well as up to 29.4% speedup over uniform quantization at equivalent accuracy with 5-bit weight-activation quantization. Our code is available at https://github.com/cat538/MxMoE.
Chinese: MxMoE是一种混合精度优化框架,通过综合考虑算法敏感性和系统动态,有效解决了专家混合模型的部署难题,并在性能与效率上超越了现有方法。
English: MxMoE is a mixed-precision optimization framework that addresses deployment challenges in Mixture-of-Experts models by considering algorithmic sensitivity and system dynamics, achieving superior performance and efficiency over existing methods.

Authors:Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh
Title: APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning
Abstract:
Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check whether a formal proof is correct or not almost instantaneously, but generating a completely correct formal proof with LLMs remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousands) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic pipeline that combines the strengths of the Lean compiler with an LLM's reasoning abilities to achieve better proof-generation results at a low sampling budget. Apollo directs a fully automated process in which the LLM generates proofs for theorems, a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sublemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low budget. The repaired subproofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 84.9% among sub-8B-parameter models while keeping the sampling budget below one hundred. Moreover, Apollo raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General-purpose models (o3-mini, o4-mini) jump from 3-7% to over 40% accuracy. Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving. The codebase is available at https://github.com/aziksh-ospanov/APOLLO
中文: APOLLO是一种自动化证明修复流程,通过结合大型语言模型与Lean编译器,以低采样成本高效生成和修正数学证明,实现了顶尖的准确性。
English: APOLLO is an automated proof repair pipeline that integrates LLMs with the Lean compiler to efficiently generate and correct mathematical proofs, achieving state-of-the-art accuracy with minimal sampling.

Authors:Amin Ghafourian, Andrew Lee, Dechen Gao, Tyler Beer, Kin Yen, Iman Soltani
Title: Automating Infrastructure Surveying: A Framework for Geometric Measurements and Compliance Assessment Using Point Cloud Data
Abstract:
Automation can play a prominent role in improving efficiency, accuracy, and scalability in infrastructure surveying and assessing construction and compliance standards. This paper presents a framework for automation of geometric measurements and compliance assessment using point cloud data. The proposed approach integrates deep learning-based detection and segmentation, in conjunction with geometric and signal processing techniques, to automate surveying tasks. As a proof of concept, we apply this framework to automatically evaluate the compliance of curb ramps with the Americans with Disabilities Act (ADA), demonstrating the utility of point cloud data in survey automation. The method leverages a newly collected, large annotated dataset of curb ramps, made publicly available as part of this work, to facilitate robust model training and evaluation. Experimental results, including comparison with manual field measurements of several ramps, validate the accuracy and reliability of the proposed method, highlighting its potential to significantly reduce manual effort and improve consistency in infrastructure assessment. Beyond ADA compliance, the proposed framework lays the groundwork for broader applications in infrastructure surveying and automated construction evaluation, promoting wider adoption of point cloud data in these domains. The annotated database, manual ramp survey data, and developed algorithms are publicly available on the project's GitHub page: https://github.com/Soltanilara/SurveyAutomation.
中文: 本文提出了一种利用点云数据和深度学习的自动化框架,通过评估ADA路缘坡道合规性展示了其在基础设施检测中的高效性,显著减少了人工工作量并提高了评估一致性。
English: This paper introduces an automated framework using point cloud data and deep learning to efficiently assess infrastructure compliance, demonstrated through ADA curb ramp evaluations, which reduces manual effort and enhances assessment consistency.

Authors:Zhangchi Hu, Peixi Wu, Jie Chen, Huyue Zhu, Yijun Wang, Yansong Peng, Hebei Li, Xiaoyan Sun
Title: Dome-DETR: DETR with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection
Abstract:
Tiny object detection plays a vital role in drone surveillance, remote sensing, and autonomous systems, enabling the identification of small targets across vast landscapes. However, existing methods suffer from inefficient feature leverage and high computational costs due to redundant feature processing and rigid query allocation. To address these challenges, we propose Dome-DETR, a novel framework with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection. To reduce feature redundancies, we introduce a lightweight Density-Focal Extractor (DeFE) to produce clustered compact foreground masks. Leveraging these masks, we incorporate Masked Window Attention Sparsification (MWAS) to focus computational resources on the most informative regions via sparse attention. Besides, we propose Progressive Adaptive Query Initialization (PAQI), which adaptively modulates query density across spatial areas for better query allocation. Extensive experiments demonstrate that Dome-DETR achieves state-of-the-art performance (+3.3 AP on AI-TOD-V2 and +2.5 AP on VisDrone) while maintaining low computational complexity and a compact model size. Code is available at https://github.com/RicePasteM/Dome-DETR.
Chinese: Dome-DETR提出了一种密度导向的特征查询操作框架,通过轻量级提取器和自适应查询初始化,在保持计算效率的同时实现了最先进的微小目标检测性能。
English: Dome-DETR introduces a density-oriented feature-query manipulation framework with a lightweight extractor and adaptive query initialization, achieving state-of-the-art tiny object detection performance while maintaining computational efficiency.

Authors:Jinze Lv, Jian Chen, Zi Long, Xianghua Fu, Yin Chen
Title: TopicVD: A Topic-Based Dataset of Video-Guided Multimodal Machine Translation for Documentaries
Abstract:
Most existing multimodal machine translation (MMT) datasets are predominantly composed of static images or short video clips, lacking extensive video data across diverse domains and topics. As a result, they fail to meet the demands of real-world MMT tasks, such as documentary translation. In this study, we developed TopicVD, a topic-based dataset for video-supported multimodal machine translation of documentaries, aiming to advance research in this field. We collected video-subtitle pairs from documentaries and categorized them into eight topics, such as economy and nature, to facilitate research on domain adaptation in video-guided MMT. Additionally, we preserved their contextual information to support research on leveraging the global context of documentaries in video-guided MMT. To better capture the shared semantics between text and video, we propose an MMT model based on a cross-modal bidirectional attention module. Extensive experiments on the TopicVD dataset demonstrate that visual information consistently improves the performance of the NMT model in documentary translation. However, the MMT model's performance significantly declines in out-of-domain scenarios, highlighting the need for effective domain adaptation methods. Additionally, experiments demonstrate that global context can effectively improve translation performance. Dataset and our implementations are available at https://github.com/JinzeLv/TopicVD
中文: 现有数据集缺乏多样化视频内容,因此我们构建了TopicVD纪录片数据集,通过主题分类和跨模态注意力模型验证了视觉信息提升翻译效果,但跨领域适应性仍需改进。
English: Current multimodal machine translation datasets lack diverse video content, so we created TopicVD, a documentary-focused dataset with categorized topics and a cross-modal attention model, which shows visual data enhances translation but struggles with domain shifts.

Authors:Ho-Joong Kim, Yearang Lee, Jung-Ho Hong, Seong-Whan Lee
Title: DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer
Abstract:
In this paper, we examine a key limitation in query-based detectors for temporal action detection (TAD), which arises from their direct adaptation of architectures originally designed for object detection. Despite the effectiveness of the existing models, they struggle to fully address the unique challenges of TAD, such as the redundancy in multi-scale features and the limited ability to capture sufficient temporal context. To address these issues, we propose a multi-dilated gated encoder and central-adjacent region integrated decoder for temporal action detection transformer (DiGIT). Our approach replaces the existing encoder, which consists of multi-scale deformable attention and a feedforward network, with our multi-dilated gated encoder. Our proposed encoder reduces the redundant information caused by multi-level features while maintaining the ability to capture fine-grained and long-range temporal information. Furthermore, we introduce a central-adjacent region integrated decoder that leverages a more comprehensive sampling strategy for deformable cross-attention to capture the essential information. Extensive experiments demonstrate that DiGIT achieves state-of-the-art performance on THUMOS14, ActivityNet v1.3, and HACS-Segment. Code is available at: https://github.com/Dotori-HJ/DiGIT
中文: 本文提出DiGIT模型,通过多扩张门控编码器减少多尺度特征冗余,并结合中心-邻近区域集成解码器增强时序上下文捕捉能力,在多个基准测试中实现了最先进的时序动作检测性能。
English: This paper introduces DiGIT, a novel transformer-based model for temporal action detection that addresses limitations in existing query-based detectors by replacing standard components with a multi-dilated gated encoder to reduce feature redundancy and a central-adjacent region integrated decoder for enhanced temporal context capture, achieving state-of-the-art performance on multiple benchmarks.

Authors:Guilherme Vieira Neto, Marcos Eduardo Valle
Title: V-EfficientNets: Vector-Valued Efficiently Scaled Convolutional Neural Network Models
Abstract:
EfficientNet models are convolutional neural networks optimized for parameter allocation by jointly balancing network width, depth, and resolution. Renowned for their exceptional accuracy, these models have become a standard for image classification tasks across diverse computer vision benchmarks. While traditional neural networks learn correlations between feature channels during training, vector-valued neural networks inherently treat multidimensional data as coherent entities, taking for granted the inter-channel relationships. This paper introduces vector-valued EfficientNets (V-EfficientNets), a novel extension of EfficientNet designed to process arbitrary vector-valued data. The proposed models are evaluated on a medical image classification task, achieving an average accuracy of 99.46% on the ALL-IDB2 dataset for detecting acute lymphoblastic leukemia. V-EfficientNets demonstrate remarkable efficiency, significantly reducing parameters while outperforming state-of-the-art models, including the original EfficientNet. The source code is available at https://github.com/mevalle/v-nets.
中文摘要:V-EfficientNets将EfficientNet的参数优化扩展至向量值数据,在白血病检测中以更少参数实现99.46%准确率,超越现有最优模型。
English Summary: V-EfficientNets extend EfficientNet's parameter optimization to vector-valued data, achieving 99.46% accuracy in leukemia detection with fewer parameters than state-of-the-art models.

Authors:Zhongweiyang Xu, Xulin Fan, Zhong-Qiu Wang, Xilin Jiang, Romit Roy Choudhury
Title: ArrayDPS: Unsupervised Blind Speech Separation with a Diffusion Prior
Abstract:
Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem, i.e., the microphone array geometry, the room impulse response (RIR), and the speech sources are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS, where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to the optimization approximates room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield separated voice sources. We only need a simple single-speaker speech diffusion model as a prior along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos are provided at: https://arraydps.github.io/ArrayDPSDemo/.
Chinese: ArrayDPS是一种无监督、阵列无关的生成方法,通过扩散后验采样近似房间声学来解决盲语音分离问题,在分离质量上超越了所有无监督基准方法,并可与监督方法相媲美。
English: ArrayDPS is an unsupervised, array-agnostic generative method that solves blind speech separation by approximating room acoustics through diffusion posterior sampling, outperforming unsupervised baselines and matching supervised methods in separation quality.

Authors:Tien Dang, Truong-Son Hy
Title: EquiHGNN: Scalable Rotationally Equivariant Hypergraph Neural Networks
Abstract:
Molecular interactions often involve high-order relationships that cannot be fully captured by traditional graph-based models limited to pairwise connections. Hypergraphs naturally extend graphs by enabling multi-way interactions, making them well-suited for modeling complex molecular systems. In this work, we introduce EquiHGNN, an Equivariant HyperGraph Neural Network framework that integrates symmetry-aware representations to improve molecular modeling. By enforcing the equivariance under relevant transformation groups, our approach preserves geometric and topological properties, leading to more robust and physically meaningful representations. We examine a range of equivariant architectures and demonstrate that integrating symmetry constraints leads to notable performance gains on large-scale molecular datasets. Experiments on both small and large molecules show that high-order interactions offer limited benefits for small molecules but consistently outperform 2D graphs on larger ones. Adding geometric features to these high-order structures further improves the performance, emphasizing the value of spatial information in molecular learning. Our source code is available at https://github.com/HySonLab/EquiHGNN/
中文摘要:EquiHGNN是一种等变超图神经网络,通过引入对称性约束和高阶相互作用来增强分子建模,在结合几何特征后对较大分子的处理表现尤为突出。
English Summary: EquiHGNN is an equivariant hypergraph neural network that leverages symmetry constraints and high-order interactions to enhance molecular modeling, showing significant improvements especially for larger molecules when geometric features are incorporated.

Authors:Mohamed-Khalil Bouzidi, Christian Schlauch, Nicole Scheuerer, Yue Yao, Nadja Klein, Daniel Göhring, Jörg Reichardt
Title: Closing the Loop: Motion Prediction Models beyond Open-Loop Benchmarks
Abstract:
Fueled by motion prediction competitions and benchmarks, recent years have seen the emergence of increasingly large learning based prediction models, many with millions of parameters, focused on improving open-loop prediction accuracy by mere centimeters. However, these benchmarks fail to assess whether such improvements translate to better performance when integrated into an autonomous driving stack. In this work, we systematically evaluate the interplay between state-of-the-art motion predictors and motion planners. Our results show that higher open-loop accuracy does not always correlate with better closed-loop driving behavior and that other factors, such as temporal consistency of predictions and planner compatibility, also play a critical role. Furthermore, we investigate downsized variants of these models, and, surprisingly, find that in some cases models with up to 86% fewer parameters yield comparable or even superior closed-loop driving performance. Our code is available at https://github.com/continental/pred2plan.
中文: 近期大型运动预测模型在开环精度上仅有微小提升,但无法转化为自动驾驶性能的改善,因为时间一致性和规划器兼容性等因素更为关键,且精简版模型有时反而表现更优。
English: Recent large motion prediction models pursue small open-loop accuracy gains, but these gains do not always translate into better closed-loop driving performance; factors such as temporal consistency and planner compatibility also play a critical role, and downsized models sometimes match or outperform larger ones.

Authors:Weichen Zhang, Chen Gao, Shiquan Yu, Ruiying Peng, Baining Zhao, Qian Zhang, Jinqiang Cui, Xinlei Chen, Yong Li
Title: CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory
Abstract:
Aerial vision-and-language navigation (VLN), requiring drones to interpret natural language instructions and navigate complex urban environments, emerges as a critical embodied AI challenge that bridges human-robot interaction, 3D spatial reasoning, and real-world deployment. Although existing ground VLN agents achieved notable results in indoor and outdoor settings, they struggle in aerial VLN due to the absence of predefined navigation graphs and the exponentially expanding action space in long-horizon exploration. In this work, we propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity for urban aerial VLN. Specifically, we design a hierarchical semantic planning module (HSPM) that decomposes the long-horizon task into sub-goals with different semantic levels. The agent reaches the target progressively by achieving sub-goals with different capacities of the LLM. Additionally, a global memory module storing historical trajectories into a topological graph is developed to simplify navigation for visited targets. Extensive benchmark experiments show that our method achieves state-of-the-art performance with significant improvement. Further experiments demonstrate the effectiveness of different modules of CityNavAgent for aerial VLN in continuous city environments. The code is available at https://github.com/VinceOuti/CityNavAgent.
中文摘要:CityNavAgent是一种基于大语言模型的空中导航系统,通过分层任务分解和记忆模块简化无人机在复杂城市环境中的导航,实现了最先进的性能。
English Summary: CityNavAgent is an LLM-powered aerial navigation system that simplifies drone navigation in complex urban settings by hierarchically decomposing tasks and utilizing a memory module, achieving state-of-the-art performance.

Authors:Seraj Al Mahmud Mostafa, Chenxi Wang, Jia Yue, Yuta Hozumi, Jianwu Wang
Title: Enhancing Satellite Object Localization with Dilated Convolutions and Attention-aided Spatial Pooling
Abstract:
Object localization in satellite imagery is particularly challenging due to the high variability of objects, low spatial resolution, and interference from noise and dominant features such as clouds and city lights. In this research, we focus on three satellite datasets: upper atmospheric Gravity Waves (GW), mesospheric Bores (Bore), and Ocean Eddies (OE), each presenting its own unique challenges. These challenges include the variability in the scale and appearance of the main object patterns, where the size, shape, and feature extent of objects of interest can differ significantly. To address these challenges, we introduce YOLO-DCAP, a novel enhanced version of YOLOv5 designed to improve object localization in these complex scenarios. YOLO-DCAP incorporates a Multi-scale Dilated Residual Convolution (MDRC) block to capture multi-scale features with varying dilation rates, and an Attention-aided Spatial Pooling (AaSP) module to focus on globally relevant spatial regions, enhancing feature selection. These structural improvements help to better localize objects in satellite imagery. Experimental results demonstrate that YOLO-DCAP significantly outperforms both the YOLO base model and state-of-the-art approaches, achieving an average improvement of 20.95% in mAP50 and 32.23% in IoU over the base model, and 7.35% and 9.84% respectively over state-of-the-art alternatives. These consistent gains across all three satellite datasets highlight the robustness and generalizability of the proposed approach. Our code is open sourced at https://github.com/AI-4-atmosphere-remote-sensing/satellite-object-localization.
中文: 本研究提出的YOLO-DCAP改进模型通过多尺度空洞残差卷积和注意力空间池化模块,在三种卫星数据集上实现显著性能提升,相比基准模型mAP50提高20.95%、IoU提升32.23%,展现了卓越的物体定位能力。
English: This research introduces YOLO-DCAP, an enhanced YOLOv5 model with multi-scale dilated convolution and attention mechanisms, which significantly improves object localization in satellite imagery by achieving over 20% higher mAP50 and 32% better IoU than base models across three challenging datasets.
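
To make the role of "varying dilation rates" concrete, the following is a minimal NumPy sketch of a 1D dilated convolution. It is an illustration only, not the MDRC block from the paper: the function name `dilated_conv1d`, the kernel, and the toy signal are all hypothetical. It shows how the same three-tap kernel covers a wider span of the input as the dilation grows.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D cross-correlation with gaps of `dilation` between taps."""
    k = len(kernel)
    span = dilation * (k - 1) + 1  # receptive field of one output element
    out_len = len(signal) - span + 1
    out = np.zeros(out_len, dtype=float)
    for i in range(out_len):
        # Pick every `dilation`-th sample inside the current window.
        taps = signal[i : i + span : dilation]
        out[i] = float(np.dot(taps, kernel))
    return out

signal = np.arange(10, dtype=float)      # toy input: 0, 1, ..., 9
kernel = np.array([1.0, 0.0, -1.0])      # hypothetical difference kernel
narrow = dilated_conv1d(signal, kernel, dilation=1)  # each output sees 3 samples
wide = dilated_conv1d(signal, kernel, dilation=3)    # each output sees 7 samples
```

With the same number of weights, the dilated variant compares samples six positions apart instead of two, which is the sense in which stacking several dilation rates yields multi-scale features.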

Authors:Qianbo Zang, Christophe Zgrzendek, Igor Tchappi, Afshin Khadangi, Johannes Sedlmeir
Title: KG-HTC: Integrating Knowledge Graphs into LLMs for Effective Zero-shot Hierarchical Text Classification
Abstract:
Hierarchical Text Classification (HTC) involves assigning documents to labels organized within a taxonomy. Most previous research on HTC has focused on supervised methods. However, in real-world scenarios, employing supervised HTC can be challenging due to a lack of annotated data. Moreover, HTC often faces issues with large label spaces and long-tail distributions. In this work, we present Knowledge Graphs for zero-shot Hierarchical Text Classification (KG-HTC), which aims to address these challenges of HTC in applications by integrating knowledge graphs with Large Language Models (LLMs) to provide structured semantic context during classification. Our method retrieves relevant subgraphs from knowledge graphs related to the input text using a Retrieval-Augmented Generation (RAG) approach. Our KG-HTC can enhance LLMs to understand label semantics at various hierarchy levels. We evaluate KG-HTC on three open-source HTC datasets: WoS, DBpedia, and Amazon. Our experimental results show that KG-HTC significantly outperforms three baselines in the strict zero-shot setting, particularly achieving substantial improvements at deeper levels of the hierarchy. This evaluation demonstrates the effectiveness of incorporating structured knowledge into LLMs to address HTC's challenges in large label spaces and long-tailed label distributions. Our code is available at: https://github.com/QianboZang/KG-HTC.
中文: 本文提出KG-HTC方法,通过将知识图谱与大语言模型结合,为零样本分层文本分类提供结构化语义上下文,有效解决了大规模标签空间和长尾分布问题,实验表明其性能显著优于基线模型。
English: This paper introduces KG-HTC, a zero-shot hierarchical text classification method that integrates knowledge graphs with large language models to address challenges like large label spaces and long-tail distributions by providing structured semantic context, significantly outperforming baselines in experiments.

Authors:Mikhail Chaichuk, Sushant Gautam, Steven Hicks, Elena Tutubalina
Title: Prompt to Polyp: Medical Text-Conditioned Image Synthesis with Diffusion Models
Abstract:
The generation of realistic medical images from text descriptions has significant potential to address data scarcity challenges in healthcare AI while preserving patient privacy. This paper presents a comprehensive study of text-to-image synthesis in the medical domain, comparing two distinct approaches: (1) fine-tuning large pre-trained latent diffusion models and (2) training small, domain-specific models. We introduce a novel model named MSDM, an optimized architecture based on Stable Diffusion that integrates a clinical text encoder, variational autoencoder, and cross-attention mechanisms to better align medical text prompts with generated images. Our study compares two approaches: fine-tuning large pre-trained models (FLUX, Kandinsky) versus training compact domain-specific models (MSDM). Evaluation across colonoscopy (MedVQA-GI) and radiology (ROCOv2) datasets reveals that while large models achieve higher fidelity, our optimized MSDM delivers comparable quality with lower computational costs. Quantitative metrics and qualitative evaluations by medical experts reveal strengths and limitations of each approach.
中文: 本研究比较了医学影像领域的文本到图像生成方法,发现微调大型模型可获得更高保真度,而新型MSDM模型能以更低计算成本实现相当质量。
English: This study compares text-to-image synthesis methods in medical imaging, showing that fine-tuning large models achieves higher fidelity while the novel MSDM model offers comparable quality with reduced computational costs.

Authors:Yanbo Wang, Xiyuan Wang, Quan Gan, Minjie Wang, Qibin Yang, David Wipf, Muhan Zhang
Title: Griffin: Towards a Graph-Centric Relational Database Foundation Model
Abstract:
We introduce Griffin, the first attempt at a foundation model designed specifically for Relational Databases (RDBs). Unlike previous smaller models focused on single RDB tasks, Griffin unifies the data encoder and task decoder to handle diverse tasks. Additionally, we enhance the architecture by incorporating a cross-attention module and a novel aggregator. Griffin utilizes pretraining on both single-table and RDB datasets, employing advanced encoders for categorical, numerical, and metadata features, along with innovative components such as cross-attention modules and enhanced message-passing neural networks (MPNNs) to capture the complexities of relational data. Evaluated on large-scale, heterogeneous, and temporal graphs extracted from RDBs across various domains (spanning over 150 million nodes), Griffin demonstrates superior or comparable performance to individually trained models, excels in low-data scenarios, and shows strong transferability with similarity and diversity in pretraining across new datasets and tasks, highlighting its potential as a universally applicable foundation model for RDBs. Code available at https://github.com/yanxwb/Griffin.
Chinese: Griffin是首个专为关系数据库设计的基础模型,通过统一数据编码和任务解码机制及增强架构,在多样化任务中展现出卓越性能,并在跨数据集迁移学习中表现出强大的适应性。
English: Griffin is the first foundation model for relational databases, unifying data encoding and task decoding to handle diverse tasks with enhanced architecture and pretraining, demonstrating superior performance and strong transferability across various datasets.

Authors:Md Kamrujjaman Mobin, Md Saiful Islam, Sadik Al Barid, Md Masum
Title: Cardioformer: Advancing AI in ECG Analysis with Multi-Granularity Patching and ResNet
Abstract:
Electrocardiogram (ECG) classification is crucial for automated cardiac disease diagnosis, yet existing methods often struggle to capture local morphological details and long-range temporal dependencies simultaneously. To address these challenges, we propose Cardioformer, a novel multi-granularity hybrid model that integrates cross-channel patching, hierarchical residual learning, and a two-stage self-attention mechanism. Cardioformer first encodes multi-scale token embeddings to capture fine-grained local features and global contextual information and then selectively fuses these representations through intra- and inter-granularity self-attention. Extensive evaluations on three benchmark ECG datasets under subject-independent settings demonstrate that the model consistently outperforms four state-of-the-art baselines. Cardioformer achieves AUROCs of 96.34 ± 0.11, 89.99 ± 0.12, and 95.59 ± 1.66 on the MIMIC-IV, PTB-XL, and PTB datasets, respectively, outperforming the PatchTST, Reformer, Transformer, and Medformer models. It also demonstrates strong cross-dataset generalization, achieving 49.18% AUROC on PTB and 68.41% on PTB-XL when trained on MIMIC-IV. These findings underscore the potential of Cardioformer to advance automated ECG analysis, paving the way for more accurate and robust cardiovascular disease diagnosis. We release the source code at https://github.com/KMobin555/Cardioformer.
Chinese: Cardioformer是一种创新的混合模型,能有效捕捉心电图信号的局部形态特征和长程时间依赖性,在多个数据集上持续超越现有最优基准模型,并展现出强大的跨数据集泛化能力。
English: Cardioformer is a novel hybrid model that effectively captures both local morphological details and long-range temporal dependencies in ECG signals, consistently outperforming state-of-the-art baselines across multiple datasets and demonstrating strong cross-dataset generalization capabilities.

Authors:Kai Liu, Qian Zheng, Kaiwen Tao, Zhiteng Li, Haotong Qin, Wenbo Li, Yong Guo, Xianglong Liu, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Title: Low-bit Model Quantization for Deep Neural Networks: A Survey
Abstract:
With unprecedented rapid development, deep neural networks (DNNs) have deeply influenced almost all fields. However, their heavy computation costs and model sizes are usually unacceptable in real-world deployment. Model quantization, an effective lightweighting technique, has become an indispensable procedure in the whole deployment pipeline. The essence of quantization acceleration is the conversion from continuous floating-point numbers to discrete integer ones, which significantly speeds up the memory I/O and calculation, i.e., addition and multiplication. However, performance degradation also comes with the conversion because of the loss of precision. Therefore, it has become increasingly popular and critical to investigate how to perform the conversion and how to compensate for the information loss. This article surveys the recent five-year progress towards low-bit quantization on DNNs. We discuss and compare the state-of-the-art quantization methods and classify them into 8 main categories and 24 sub-categories according to their core techniques. Furthermore, we shed light on the potential research opportunities in the field of model quantization. A curated list of model quantization is provided at https://github.com/Kai-Liu001/Awesome-Model-Quantization.
中文: 深度神经网络因高计算成本和大模型尺寸面临部署挑战,模型量化通过将浮点数转换为整数来加速性能,成为关键技术,但需在速度提升与精度损失间取得平衡。
English: Deep neural networks face deployment challenges due to high computational costs and large model sizes, making model quantization a crucial technique that accelerates performance by converting floating-point numbers to integers, though it requires balancing speed gains with precision loss.
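
The float-to-integer conversion described in the abstract can be illustrated with a small sketch of asymmetric uniform quantization. This is a generic textbook scheme chosen for illustration, not a method taken from the survey; the helper names `quantize` and `dequantize`, the 4-bit setting, and the toy weights are assumptions.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Asymmetric uniform quantization of a float array to unsigned integer codes."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    # Guard against a constant tensor, which would otherwise give a zero scale.
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))
    # Codes are kept in int32 here for simplicity; a real kernel would pack them.
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.array([-1.0, -0.25, 0.0, 0.5, 1.0], dtype=np.float32)  # toy tensor
q, scale, zp = quantize(weights, num_bits=4)
recovered = dequantize(q, scale, zp)
max_error = float(np.abs(weights - recovered).max())
```

The gap between `weights` and `recovered` is the precision loss the abstract refers to; it is bounded by the step size `scale`, which grows as the bit width shrinks.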

Authors:Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey
Title: X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Abstract:
As Contrastive Language-Image Pre-training (CLIP) models are increasingly adopted for diverse downstream tasks and integrated into large vision-language models (VLMs), their susceptibility to adversarial perturbations has emerged as a critical concern. In this work, we introduce X-Transfer, a novel attack method that exposes a universal adversarial vulnerability in CLIP. X-Transfer generates a Universal Adversarial Perturbation (UAP) capable of deceiving various CLIP encoders and downstream VLMs across different samples, tasks, and domains. We refer to this property as super transferability: a single perturbation achieving cross-data, cross-domain, cross-model, and cross-task adversarial transferability simultaneously. This is achieved through surrogate scaling, a key innovation of our approach. Unlike existing methods that rely on fixed surrogate models, which are computationally intensive to scale, X-Transfer employs an efficient surrogate scaling strategy that dynamically selects a small subset of suitable surrogates from a large search space. Extensive evaluations demonstrate that X-Transfer significantly outperforms previous state-of-the-art UAP methods, establishing a new benchmark for adversarial transferability across CLIP models. The code is publicly available at https://github.com/HanxunH/XTransferBench.
中文: 本文提出X-Transfer攻击方法,通过动态代理缩放生成具有超级迁移性的通用对抗扰动,能高效地跨任务、跨领域破坏CLIP模型。
English: The paper introduces X-Transfer, a novel attack method that generates a universal adversarial perturbation with super transferability, efficiently compromising CLIP models across various tasks and domains through dynamic surrogate scaling.

Authors:Zinan Liu, Haoran Li, Jingyi Lu, Gaoyuan Ma, Xu Hong, Giovanni Iacca, Arvind Kumar, Shaojun Tang, Lin Wang
Title: Nature's Insight: A Novel Framework and Comprehensive Analysis of Agentic Reasoning Through the Lens of Neuroscience
Abstract:
Autonomous AI is no longer a hard-to-reach concept: it enables agents to move beyond executing tasks to independently addressing complex problems and adapting to change while handling the uncertainty of the environment. But what makes agents truly autonomous? It is agentic reasoning, which is crucial for foundation models to develop symbolic logic, statistical correlations, or large-scale pattern recognition to process information, draw inferences, and make decisions. However, it remains unclear why and how existing agentic reasoning approaches work, in comparison to biological reasoning, which instead is deeply rooted in neural mechanisms involving hierarchical cognition, multimodal integration, and dynamic interactions. In this work, we propose a novel neuroscience-inspired framework for agentic reasoning. Grounded in three neuroscience-based definitions and supported by mathematical and biological foundations, we propose a unified framework modeling reasoning from perception to action, encompassing four core types (perceptual, dimensional, logical, and interactive) inspired by distinct functional roles observed in the human brain. We apply this framework to systematically classify and analyze existing AI reasoning methods, evaluating their theoretical foundations, computational designs, and practical limitations. We also explore its implications for building more generalizable, cognitively aligned agents in physical and virtual environments. Finally, building on our framework, we outline future directions and propose new neural-inspired reasoning methods, analogous to chain-of-thought prompting. By bridging cognitive neuroscience and AI, this work offers a theoretical foundation and practical roadmap for advancing agentic reasoning in intelligent systems. The associated project can be found at: https://github.com/BioRAILab/Awesome-Neuroscience-Agent-Reasoning .
中文: 自主人工智能通过代理推理实现真正独立,本研究提出受神经科学启发的框架,模拟从感知到行动的推理过程,连接认知神经科学与人工智能,以提升智能系统的能力。
English: Autonomous AI agents achieve true independence through agentic reasoning, which this study advances by proposing a neuroscience-inspired framework that models reasoning from perception to action, bridging cognitive neuroscience and AI to enhance intelligent systems.

Authors:Thomas Sommariva, Simone Calderara, Angelo Porrello
Title: How to Train Your Metamorphic Deep Neural Network
Abstract:
Neural Metamorphosis (NeuMeta) is a recent paradigm for generating neural networks of varying width and depth. Based on Implicit Neural Representation (INR), NeuMeta learns a continuous weight manifold, enabling the direct generation of compressed models, including those with configurations not seen during training. While promising, the original formulation of NeuMeta proves effective only for the final layers of the underlying model, limiting its broader applicability. In this work, we propose a training algorithm that extends the capabilities of NeuMeta to enable full-network metamorphosis with minimal accuracy degradation. Our approach follows a structured recipe comprising block-wise incremental training, INR initialization, and strategies for replacing batch normalization. The resulting metamorphic networks maintain competitive accuracy across a wide range of compression ratios, offering a scalable solution for adaptable and efficient deployment of deep models. The code is available at: https://github.com/TSommariva/HTTY_NeuMeta.
中文: 本研究提出了一种训练算法,扩展了神经变形能力,实现全网络变形且精度损失最小,从而能够跨多种配置可扩展地生成压缩模型。
English: This work introduces a training algorithm that extends Neural Metamorphosis to enable full-network metamorphosis with minimal accuracy loss, allowing scalable generation of compressed models across various configurations.

Authors:Yiming Qin, Zhu Xu, Yang Liu
Title: Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation
Abstract:
Recent text-to-3D models can render high-quality assets, yet they still stumble on objects with complex attributes. The key obstacles are: (1) existing text-to-3D approaches typically lift text-to-image models to extract semantics via text encoders, while the text encoder exhibits limited comprehension ability for long descriptions, leading to deviated cross-attention focus and, subsequently, wrong attribute binding in generated results. (2) Occluded object parts demand a disciplined generation order and explicit part disentanglement. Though some works introduce manual efforts to alleviate the above issues, their quality is unstable and highly reliant on manual information. To tackle the above problems, we propose an automated method, Hierarchical-Chain-of-Generation (HCoG). It leverages a large language model to decompose the long description into blocks representing different object parts, and orders them from inside out according to occlusions, forming a hierarchical chain. Within each block we first coarsely create components, then precisely bind attributes via target-region localization and corresponding 3D Gaussian kernel optimization. Between blocks, we introduce Gaussian Extension and Label Elimination to seamlessly generate new parts by extending new Gaussian kernels, re-assigning semantic labels, and eliminating unnecessary kernels, ensuring that only relevant parts are added without disrupting previously optimized parts. Experiments confirm that HCoG yields structurally coherent, attribute-faithful 3D objects with complex attributes. The code is available at https://github.com/Wakals/GASCOL .
中文: 现有文本到3D模型因文本编码器对长描述理解不足及遮挡部分处理困难,在复杂属性对象生成上存在局限;为此提出的HCoG方法通过大语言模型分解描述并优化3D高斯核,实现了结构连贯且属性准确的自动生成。
English: Current text-to-3D models struggle with complex attributes due to text encoder limitations in understanding long descriptions and handling occluded parts, prompting the development of HCoG, an automated method that uses a large language model to decompose descriptions and optimize 3D Gaussian kernels for coherent and faithful object generation.

Authors:Xingyu Jiang, Ning Gao, Xiuhui Zhang, Hongkun Dou, Shaowen Fu, Xiaoqing Zhong, Hongjue Li, Yue Deng
Title: Image Restoration via Multi-domain Learning
Abstract:
Due to adverse atmospheric and imaging conditions, natural images suffer from various degradation phenomena. Consequently, image restoration has emerged as a key solution and garnered substantial attention. Although recent Transformer architectures have demonstrated impressive success across various restoration tasks, their considerable model complexity poses significant challenges for both training and real-time deployment. Furthermore, instead of investigating the commonalities among different degradations, most existing restoration methods focus on modifying the Transformer under limited restoration priors. In this work, we first review various degradation phenomena from a multi-domain perspective, identifying common priors. Then, we introduce a novel restoration framework, which integrates multi-domain learning into the Transformer. Specifically, in the Token Mixer, we propose a Spatial-Wavelet-Fourier multi-domain structure that facilitates local-region-global multi-receptive field modeling to replace vanilla self-attention. Additionally, in the Feed-Forward Network, we incorporate multi-scale learning to fuse multi-domain features at different resolutions. Comprehensive experimental results across ten restoration tasks, such as dehazing, desnowing, motion deblurring, defocus deblurring, rain streak/raindrop removal, cloud removal, shadow removal, underwater enhancement and low-light enhancement, demonstrate that our proposed model outperforms state-of-the-art methods and achieves a favorable trade-off among restoration performance, parameter size, computational cost and inference latency. The code is available at: https://github.com/deng-ai-lab/SWFormer.
中文: 本文提出了一种新颖的图像恢复框架,将多领域学习融入Transformer中,通过多感受野结构替代自注意力机制并结合多尺度学习,在十项恢复任务中实现卓越性能与高效平衡。
English: This paper introduces a novel image restoration framework that integrates multi-domain learning into Transformer, replacing self-attention with a multi-receptive field structure and incorporating multi-scale learning to achieve superior performance across ten restoration tasks with balanced efficiency.

Authors:Yunfan Lu, Xiaogang Xu, Pengteng Li, Yusheng Wang, Yi Cui, Huizai Yao, Hui Xiong
Title: From Events to Enhancement: A Survey on Event-Based Imaging Technologies
Abstract:
Event cameras offering high dynamic range and low latency have emerged as disruptive technologies in imaging. Despite growing research on leveraging these benefits for different imaging tasks, a comprehensive study of recent advances and challenges is still lacking. This limits the broader understanding of how to utilize events in universal imaging applications. In this survey, we first introduce a physical model and the characteristics of different event sensors as the foundation. Following this, we highlight the advancement and interaction of image/video enhancement tasks with events. Additionally, we explore advanced tasks, which capture richer light information with events, e.g., light field estimation, multi-view generation, and photometric. Finally, we discuss new challenges and open questions offering a perspective for this rapidly evolving field. More continuously updated resources are at this link: https://github.com/yunfanLu/Awesome-Event-Imaging
Chinese: 本综述全面探讨了事件相机的物理模型、成像任务的进展及新挑战,强调了其在通用成像应用中的潜力。
English: This survey provides a comprehensive overview of event cameras' physical models, advancements in imaging tasks, and emerging challenges, highlighting their potential in universal imaging applications.

Authors:Beichen Wen, Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu
Title: 3D Scene Generation: A Survey
Abstract:
3D scene generation seeks to synthesize spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. Early methods based on procedural rules offered scalability but limited diversity. Recent advances in deep generative models (e.g., GANs, diffusion models) and 3D representations (e.g., NeRF, 3D Gaussians) have enabled the learning of real-world scene distributions, improving fidelity, diversity, and view consistency. Recent advances like diffusion models bridge 3D scene synthesis and photorealism by reframing generation as image or video synthesis problems. This survey provides a systematic overview of state-of-the-art approaches, organizing them into four paradigms: procedural generation, neural 3D-based generation, image-based generation, and video-based generation. We analyze their technical foundations, trade-offs, and representative results, and review commonly used datasets, evaluation protocols, and downstream applications. We conclude by discussing key challenges in generation capacity, 3D representation, data and annotations, and evaluation, and outline promising directions including higher fidelity, physics-aware and interactive generation, and unified perception-generation models. This review organizes recent advances in 3D scene generation and highlights promising directions at the intersection of generative AI, 3D vision, and embodied intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/hzxie/Awesome-3D-Scene-Generation.
中文: 本综述系统梳理了三维场景生成的前沿方法,将其归纳为四种技术范式并分析其理论基础、性能权衡及应用场景,同时展望了高保真度、物理感知生成等未来方向。
English: This survey systematically reviews state-of-the-art 3D scene generation methods, categorizing them into four paradigms and analyzing their technical foundations, trade-offs, and applications, while highlighting future directions like higher fidelity and physics-aware generation.

Authors:Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang
Title: Flow-GRPO: Training Flow Matching Models via Online RL
Abstract:
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, very little reward hacking occurred, meaning rewards did not increase at the cost of appreciable image quality or diversity degradation.
Chinese: Flow-GRPO首次将在线强化学习融入流匹配模型,通过ODE到SDE转换和降噪减少策略提高采样效率,在文本到图像任务中显著提升性能,同时未牺牲图像质量或多样性。
English: Flow-GRPO introduces the first online reinforcement learning integration into flow matching models, employing ODE-to-SDE conversion and Denoising Reduction strategies to enhance sampling efficiency and achieve significant performance improvements in text-to-image tasks without compromising image quality or diversity.
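
As background for the ODE-to-SDE conversion mentioned in the abstract, the following is the standard marginal-preserving identity, written as a sketch in forward time. The symbols $v_t$, $p_t$, $\sigma_t$, and $w_t$ are notation assumed here; the paper's own parameterization and time direction may differ.

```latex
% Deterministic flow and its continuity equation
\mathrm{d}x_t = v_t(x_t)\,\mathrm{d}t,
\qquad
\partial_t p_t = -\nabla\!\cdot\!\big(p_t\, v_t\big).

% A stochastic process with the same marginals p_t, for any noise level \sigma_t \ge 0
\mathrm{d}x_t = \Big[\, v_t(x_t) + \tfrac{\sigma_t^{2}}{2}\,\nabla \log p_t(x_t) \Big]\mathrm{d}t
              + \sigma_t\,\mathrm{d}w_t .

% Check via the Fokker-Planck equation, using p_t \nabla \log p_t = \nabla p_t
\partial_t p_t
 = -\nabla\!\cdot\!\Big(p_t\big[v_t + \tfrac{\sigma_t^{2}}{2}\nabla\log p_t\big]\Big)
   + \tfrac{\sigma_t^{2}}{2}\,\Delta p_t
 = -\nabla\!\cdot\!\big(p_t\, v_t\big)
   - \tfrac{\sigma_t^{2}}{2}\,\Delta p_t + \tfrac{\sigma_t^{2}}{2}\,\Delta p_t
 = -\nabla\!\cdot\!\big(p_t\, v_t\big).
```

The score term in the drift cancels the diffusion introduced by the noise, so the marginals match those of the ODE at every timestep while individual trajectories become random, which is what makes sampling-based RL exploration possible.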

Authors:Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
Title: Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
Abstract:
Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging that connects parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
中文摘要:本研究通过模型融合实现了大型语言模型推理能力向视觉语言模型的无需训练迁移,并揭示了感知功能主要分布于模型早期层,而融合后推理能力则扩展至所有层的工作机制。
English Summary: This study demonstrates that model merging effectively transfers reasoning capabilities from Large Language Models to Vision-Language Models without requiring training, while revealing that perception functions are concentrated in early layers and reasoning emerges across all layers after integration.
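
The abstract describes merging as connecting the parameters of different models without training. One simple way to picture this, offered as an assumption since the abstract does not state the merging rule, is linear interpolation between the VLM's language-model weights and a reasoning LLM's weights. The parameter names, the `alpha` coefficient, and the toy weight lists below are hypothetical.

```python
def merge_weights(vlm_llm, reasoning_llm, alpha=0.9):
    """Interpolate parameters present in both models.

    Assumption: merged = alpha * vlm + (1 - alpha) * reasoning.
    Parameters that exist only in the VLM are copied unchanged.
    """
    merged = {}
    for name, w_vlm in vlm_llm.items():
        w_reason = reasoning_llm.get(name)
        if w_reason is None or len(w_reason) != len(w_vlm):
            merged[name] = list(w_vlm)  # VLM-only parameter, keep as-is
        else:
            merged[name] = [alpha * a + (1 - alpha) * b
                            for a, b in zip(w_vlm, w_reason)]
    return merged


# Toy weights; real checkpoints would hold tensors rather than short lists.
vlm = {"layer0.weight": [1.0, 2.0], "vision_proj.weight": [0.5]}
reasoner = {"layer0.weight": [3.0, 4.0]}
merged = merge_weights(vlm, reasoner, alpha=0.5)
```

No gradient step is involved, which is what makes this kind of transfer training-free.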

Authors:Ran Zhang, Wei Zhao, Lieve Macken, Steffen Eger
Title: LiTransProQA: an LLM-based Literary Translation evaluation metric with Professional Question Answering
Abstract:
The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics prioritize mechanical accuracy over artistic expression and tend to overrate machine translation as being superior to human translation from experienced professionals. In the long run, this bias could result in an irreversible decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce LiTransProQA, a novel, reference-free, LLM-based question-answering framework designed for literary translation evaluation. LiTransProQA uniquely integrates insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, LiTransProQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation and surpassing the best state-of-the-art metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, LiTransProQA reaches human-level evaluation performance comparable to trained student evaluators. It shows broad applicability to open-source models like LLaMa3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free tool for evaluating literary translations that require local processing due to copyright or ethical considerations. The code and datasets are available under: https://github.com/zhangr2021/TransProQA.
中文摘要:LiTransProQA是一种基于大语言模型的无参考评估框架,通过整合专业译者见解来评估文学翻译质量,其表现超越现有指标并达到人类评估水平。
English Summary: LiTransProQA is a novel, reference-free LLM-based framework that integrates professional translator insights to evaluate literary translations, outperforming existing metrics and achieving human-level assessment performance.

Authors:Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan
Title: TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Abstract:
Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP.
中文:TokLIP是一种创新的视觉分词器,通过将高层语义融入矢量量化标记并支持高效的端到端训练,显著提升了多模态理解能力,在理解和生成任务中均实现了卓越的数据效率和性能表现。
English: TokLIP is a novel visual tokenizer that enhances multimodal comprehension by integrating high-level semantics into vector-quantized tokens while enabling efficient end-to-end training, achieving superior data efficiency and performance in both understanding and generation tasks.

Authors:Sooyoung Park, Arda Senocak, Joon Son Chung
Title: Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization
Abstract:
Large-scale vision-language models demonstrate strong multimodal alignment and generalization across diverse tasks. Among them, CLIP stands out as one of the most successful approaches. In this work, we extend the application of CLIP to sound source localization, proposing a self-supervised method that operates without explicit text input. We introduce a framework that maps audio into tokens compatible with CLIP's text encoder, producing audio-driven embeddings. These embeddings are used to generate sounding region masks, from which visual features are extracted and aligned with the audio embeddings through a contrastive audio-visual correspondence objective. Our findings show that the alignment knowledge of a pre-trained multimodal foundation model enables our method to generate more complete and compact localization for sounding objects. We further propose an LLM-guided extension that distills object-aware audio-visual scene understanding into the model during training to enhance alignment. Extensive experiments across five diverse tasks demonstrate that our method, in all variants, outperforms state-of-the-art approaches and achieves strong generalization in zero-shot settings.
中文: 本研究将CLIP扩展至声源定位,通过自监督框架将音频转换为CLIP兼容的嵌入表示,生成发声区域掩码并通过对比学习对齐视听特征,结合LLM引导的扩展增强对象感知的场景理解能力,在多项任务中超越现有最优方法并展现零样本泛化优势。
English: This study extends CLIP to sound source localization through a self-supervised framework that converts audio into CLIP-compatible embeddings, generating sounding object masks and aligning visual features with audio via contrastive learning, enhanced by an LLM-guided extension for improved object-aware scene understanding, outperforming state-of-the-art methods across multiple tasks.

Authors:Thevathayarajh Thayananthan, Xin Zhang, Yanbo Huang, Jingdao Chen, Nuwan K. Wijewardane, Vitor S. Martins, Gary D. Chesser, Christopher T. Goodin
Title: CottonSim: Development of an autonomous visual-guided robotic cotton-picking system in the Gazebo
Abstract:
In this study, an autonomous visual-guided robotic cotton-picking system, built on Clearpath's Husky robot platform and the Cotton-Eye perception system, was developed in the Gazebo robotic simulator. Furthermore, a virtual cotton farm was designed and developed as a Robot Operating System (ROS 1) package to deploy the robotic cotton picker in the Gazebo environment for simulating autonomous field navigation. The navigation was assisted by the map coordinates and an RGB-depth camera, while the ROS navigation algorithm utilized a trained YOLOv8n-seg model for instance segmentation. The model achieved a desired mean Average Precision (mAP) of 85.2%, a recall of 88.9%, and a precision of 93.0% for scene segmentation. The developed ROS navigation packages enabled our robotic cotton-picking system to autonomously navigate through the cotton field using map-based and GPS-based approaches, visually aided by a deep learning-based perception system. The GPS-based navigation approach achieved a 100% completion rate (CR) with a threshold of 5 x 10^-6 degrees, while the map-based navigation approach attained a 96.7% CR with a threshold of 0.25 m. This study establishes a fundamental baseline of simulation for future agricultural robotics and autonomous vehicles in cotton farming and beyond. CottonSim code and data are released to the research community via GitHub: https://github.com/imtheva/CottonSim
Chinese: 本研究提出了一种轻量级视觉引导自主采棉机器人,以解决美国棉花采收中的可持续性问题,在仿真环境中通过GPS和地图导航系统实现了高精度作业。
English: This study introduces a lightweight, vision-guided autonomous robotic cotton picker to address sustainability challenges in U.S. cotton harvesting, achieving high precision in simulation with GPS and map-based navigation systems.

Authors:Thevathayarajh Thayananthan, Xin Zhang, Yanbo Huang, Jingdao Chen, Nuwan K. Wijewardane, Vitor S. Martins, Gary D. Chesser, Christopher T. Goodin
Title: CottonSim: A vision-guided autonomous robotic system for cotton harvesting in Gazebo simulation
Abstract:
Cotton is a major cash crop in the United States, with the country being a leading global producer and exporter. Nearly all U.S. cotton is grown in the Cotton Belt, spanning 17 states in the southern region. Harvesting remains a critical yet challenging stage, impacted by the use of costly, environmentally harmful defoliants and heavy, expensive cotton pickers. These factors contribute to yield loss, reduced fiber quality, and soil compaction, which collectively threaten long-term sustainability. To address these issues, this study proposes a lightweight, small-scale, vision-guided autonomous robotic cotton picker as an alternative. An autonomous system, built on Clearpath's Husky platform and integrated with the CottonEye perception system, was developed and tested in the Gazebo simulation environment. A virtual cotton field was designed to facilitate autonomous navigation testing. The navigation system used Global Positioning System (GPS) and map-based guidance, assisted by an RGB-depth camera and a YOLOv8n-seg instance segmentation model. The model achieved a mean Average Precision (mAP) of 85.2%, a recall of 88.9%, and a precision of 93.0%. The GPS-based approach reached a 100% completion rate (CR) within a $(5e-6)^{\circ}$ threshold, while the map-based method achieved a 96.7% CR within a 0.25 m threshold. The developed Robot Operating System (ROS) packages enable robust simulation of autonomous cotton picking, offering a scalable baseline for future agricultural robotics. CottonSim code and datasets are publicly available on GitHub: https://github.com/imtheva/CottonSim
Chinese: 本研究提出了一种轻量级视觉引导自主采棉机器人,以解决美国棉花采收中的可持续性问题,在仿真环境中通过GPS和地图导航系统实现了高精度作业。
English: This study introduces a lightweight, vision-guided autonomous robotic cotton picker to address sustainability challenges in U.S. cotton harvesting, achieving high precision in simulation with GPS and map-based navigation systems.

Authors:Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong
Title: Scalable Chain of Thoughts via Elastic Reasoning
Abstract:
Large reasoning models (LRMs) have achieved remarkable progress on complex tasks by generating extended chains of thought (CoT). However, their uncontrolled output lengths pose significant challenges for real-world deployment, where inference-time budgets on tokens, latency, or compute are strictly constrained. We propose Elastic Reasoning, a novel framework for scalable chain of thoughts that explicitly separates reasoning into two phases--thinking and solution--with independently allocated budgets. At test time, Elastic Reasoning prioritizes the completeness of solution segments, significantly improving reliability under tight resource constraints. To train models that are robust to truncated thinking, we introduce a lightweight budget-constrained rollout strategy, integrated into GRPO, which teaches the model to reason adaptively when the thinking process is cut short and generalizes effectively to unseen budget constraints without additional training. Empirical results on mathematical (AIME, MATH500) and programming (LiveCodeBench, Codeforces) benchmarks demonstrate that Elastic Reasoning performs robustly under strict budget constraints, while incurring significantly lower training cost than baseline methods. Remarkably, our approach also produces more concise and efficient reasoning even in unconstrained settings. Our code has been made available at https://github.com/SalesforceAIResearch/Elastic-Reasoning.
中文摘要:弹性推理是一种新颖框架,将推理过程划分为思维和解答两个独立预算阶段,使大型推理模型能够在严格计算限制下保持稳健性能,同时提高效率和简洁性。
English Summary: Elastic Reasoning is a novel framework that divides reasoning into thinking and solution phases with independent budgets, enabling large reasoning models to perform robustly under strict computational constraints while improving efficiency and conciseness.
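
The two-phase budgeting described in the abstract can be pictured with a small sketch. It assumes that the thinking phase is cut off at its own budget, closed with an end-of-thinking marker, and followed by a solution phase with a separate budget. The `</think>` marker, the function name, and the pre-generated token lists are illustrative assumptions; in a real system the solution would be generated conditioned on the truncated thinking rather than supplied up front.

```python
def elastic_truncate(thinking_tokens, solution_tokens,
                     think_budget, solution_budget, end_think="</think>"):
    """Apply independent budgets to the thinking and solution phases.

    Assumption: truncated thinking is force-closed with `end_think`,
    so the solution budget is never consumed by the thinking phase.
    """
    thinking = thinking_tokens[:think_budget]
    solution = solution_tokens[:solution_budget]
    return thinking + [end_think] + solution


# Toy token streams standing in for model output.
thinking = ["t1", "t2", "t3", "t4"]
solution = ["s1", "s2"]
output = elastic_truncate(thinking, solution, think_budget=2, solution_budget=5)
```

Because the budgets are separate, a long chain of thought can be shortened without truncating the final answer.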

Authors:Yifan Bian, Chuanbo Tang, Li Li, Dong Liu
Title: Augmented Deep Contexts for Spatially Embedded Video Coding
Abstract:
Most Neural Video Codecs (NVCs) only employ temporal references to generate temporal-only contexts and latent prior. These temporal-only NVCs fail to handle large motions or emerging objects due to limited contexts and misaligned latent prior. To relieve these limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. First, our SEVC leverages both spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts. Second, to address the misalignment issue in latent prior and enrich the prior information, we introduce a spatial-guided latent prior augmented by multiple temporal latent representations. Finally, we design a joint spatial-temporal optimization to learn quality-adaptive bit allocation for spatial references, further boosting rate-distortion performance. Experimental results show that our SEVC effectively alleviates the limitations in handling large motions or emerging objects, and also saves 11.9% more bitrate than the previous state-of-the-art NVC while providing an additional low-resolution bitstream. Our code and model are available at https://github.com/EsakaK/SEVC.
中文摘要:本研究提出的空间嵌入视频编解码器(SEVC)通过整合空间参考和混合上下文,克服了纯时序神经视频编解码器的局限性,在提供额外低分辨率码流的同时实现了11.9%的码率降低和更优的率失真性能。
English Summary: The proposed Spatially Embedded Video Codec (SEVC) overcomes limitations of temporal-only neural video codecs by integrating spatial references and hybrid contexts, achieving superior rate-distortion performance with 11.9% bitrate reduction while providing an additional low-resolution stream.

Authors:You Peng, Youhe Jiang, Chen Wang, Binhang Yuan
Title: HEXGEN-TEXT2SQL: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL Workflow
Abstract:
Recent advances in the agentic use of large language models (LLMs) have significantly enhanced Text-to-SQL capabilities, enabling users without specialized database expertise to query data intuitively. However, deploying these agentic LLM-based Text-to-SQL systems in production poses substantial challenges due to their inherently multi-stage workflows, stringent latency constraints, and potentially heterogeneous GPU infrastructure in enterprise environments. Current LLM serving frameworks lack effective mechanisms for handling interdependent inference tasks, dynamic latency variability, and resource heterogeneity, leading to suboptimal performance and frequent service-level objective (SLO) violations. In this paper, we introduce HEXGEN-TEXT2SQL, a novel framework designed explicitly to schedule and execute agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters that handle multi-tenant end-to-end queries. HEXGEN-TEXT2SQL introduces a hierarchical scheduling approach combining global workload-balanced task dispatching and local adaptive urgency-guided prioritization, guided by a systematic analysis of agentic Text-to-SQL workflows. Additionally, we propose a lightweight simulation-based method for tuning critical scheduling hyperparameters, further enhancing robustness and adaptability. Our extensive evaluation on realistic Text-to-SQL benchmarks demonstrates that HEXGEN-TEXT2SQL significantly outperforms state-of-the-art LLM serving frameworks. Specifically, HEXGEN-TEXT2SQL reduces latency deadlines by up to 1.67$\times$ (average: 1.41$\times$) and improves system throughput by up to 1.75$\times$ (average: 1.65$\times$) compared to vLLM under diverse, realistic workload conditions. Our code is available at https://github.com/Relaxed-System-Lab/Hexgen-Flow.
Chinese: 基于智能体范式的大语言模型文本转SQL系统面临多阶段工作流程和异构基础设施的部署挑战,HEXGEN-TEXT2SQL通过分层调度和自适应优先级机制,显著降低了延迟并提高了系统吞吐量。
English: Recent advances in agentic LLM-based Text-to-SQL systems face deployment challenges due to multi-stage workflows and heterogeneous infrastructure, which HEXGEN-TEXT2SQL addresses through hierarchical scheduling and adaptive prioritization to significantly reduce latency and improve throughput.

Authors:Mengze Hong, Wailing Ng, Chen Jason Zhang, Di Jiang
Title: QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation
Abstract:
The rapid advancement of Chinese LLMs underscores the need for vertical-domain evaluations to ensure reliable applications. However, existing benchmarks often lack domain coverage and provide limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, drawn from 24 Chinese qualifications to align with national policies and professional standards. Results reveal an interesting pattern of Chinese LLMs consistently surpassing non-Chinese models, with the Qwen2.5 model outperforming the more advanced GPT-4o, emphasizing the value of localized domain knowledge in meeting qualification requirements. The average accuracy of 53.98% reveals the current gaps in domain coverage within model capabilities. Furthermore, we identify performance degradation caused by LLM crowdsourcing, assess data contamination, and illustrate the effectiveness of prompt engineering and model fine-tuning, suggesting opportunities for future improvements through multi-domain RAG and Federated Learning.
中文摘要:QualBench是首个基于中国资格考试的多领域中文问答基准,通过评估发现中文大模型在本地化知识方面优于非中文模型,并揭示了通过提示工程等技术提升性能的改进空间。
English Summary: QualBench is the first multi-domain Chinese QA benchmark using qualification exams to evaluate localized knowledge of Chinese LLMs, revealing their superiority over non-Chinese models and identifying key areas for improvement through techniques like prompt engineering.

Authors:Qian Zeng, Chenggong Hu, Mingli Song, Jie Song
Title: Diffusion Model Quantization: A Review
Abstract:
Recent success of large text-to-image models has empirically underscored the exceptional performance of diffusion models in generative tasks. To facilitate their efficient deployment on resource-constrained edge devices, model quantization has emerged as a pivotal technique for both compression and acceleration. This survey offers a thorough review of the latest advancements in diffusion model quantization, encapsulating and analyzing the current state of the art in this rapidly advancing domain. First, we provide an overview of the key challenges encountered in the quantization of diffusion models, including those based on U-Net architectures and Diffusion Transformers (DiT). We then present a comprehensive taxonomy of prevalent quantization techniques, engaging in an in-depth discussion of their underlying principles. Subsequently, we perform a meticulous analysis of representative diffusion model quantization schemes from both qualitative and quantitative perspectives. From a quantitative standpoint, we rigorously benchmark a variety of methods using widely recognized datasets, delivering an extensive evaluation of the most recent and impactful research in the field. From a qualitative standpoint, we categorize and synthesize the effects of quantization errors, elucidating these impacts through both visual analysis and trajectory examination. In conclusion, we outline prospective avenues for future research, proposing novel directions for the quantization of generative models in practical applications. The list of related papers, corresponding codes, pre-trained models and comparison results are publicly available at the survey project homepage https://github.com/TaylorJocelyn/Diffusion-Model-Quantization.
中文摘要:本综述全面探讨了扩散模型量化技术的最新进展,系统分析了量化方法的性能表现与视觉影响,并指出了该领域未来的研究方向。
English Summary: This survey comprehensively reviews recent advances in diffusion model quantization for efficient deployment on edge devices, analyzing both quantitative performance and qualitative effects while outlining future research directions.

Authors:Wangkun Xu, Zhongda Chu, Fei Teng
Title: LAPSO: A Unified Optimization View for Learning-Augmented Power System Operations
Abstract:
With the high penetration of renewables, traditional model-based power system operation is challenged to deliver economic, stable, and robust decisions. Machine learning has emerged as a powerful modeling tool for capturing complex dynamics to address these challenges. However, its separate design often lacks systematic integration with existing methods. To fill the gap, this paper proposes a holistic framework of Learning-Augmented Power System Operations (LAPSO, pronounced as Lap-So). Adopting a native optimization perspective, LAPSO is centered on the operation stage and aims to break the boundary between temporally siloed power system tasks, such as forecast, operation and control, while unifying the objectives of machine learning and model-based optimizations at both training and inference stages. Systematic analysis and simulations demonstrate the effectiveness of applying LAPSO in designing new integrated algorithms, such as stability-constrained optimization (SCO) and objective-based forecasting (OBF), while enabling end-to-end tracing of different sources of uncertainties. In addition, a dedicated Python package, lapso, is introduced to automatically augment existing power system optimization models with learnable components. All code and data are available at https://github.com/xuwkk/lapso_exp.
Chinese: 本文提出了LAPSO框架,通过将机器学习与传统电力系统运行相结合,统一预测、运行和控制任务,旨在提高决策的经济性和鲁棒性。
English: The paper introduces LAPSO, a holistic framework that integrates machine learning with traditional power system operations to enhance economic and robust decision-making by unifying forecasting, operation, and control tasks.

Authors:Wei Peng, Kang Liu, Jianchen Hu, Meng Zhang
Title: Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models
Abstract:
Prompt learning is one of the most effective paradigms for adapting pre-trained vision-language models (VLMs) to biomedical image classification tasks in few-shot scenarios. However, most current prompt learning methods use only text prompts and ignore the particular structures (such as complex anatomical structures and subtle pathological features) in biomedical images. In this work, we propose Biomed-DPT, a knowledge-enhanced dual modality prompt tuning technique. In designing the text prompt, Biomed-DPT constructs a dual prompt including template-driven clinical prompts and large language model (LLM)-driven domain-adapted prompts, then extracts clinical knowledge from the domain-adapted prompts through knowledge distillation. In designing the vision prompt, Biomed-DPT introduces a zero vector as a soft prompt to leverage attention re-weighting, so that focusing on non-diagnostic regions and recognizing non-critical pathological features are avoided. Biomed-DPT achieves an average classification accuracy of 66.14% across 11 biomedical image datasets covering 9 modalities and 10 organs, with performance reaching 78.06% in base classes and 75.97% in novel classes, surpassing the Context Optimization (CoOp) method by 6.20%, 3.78%, and 8.04%, respectively. Our code is available at https://github.com/Kanyooo/Biomed-DPT.
Chinese: Biomed-DPT提出了一种知识增强的双模态提示调优技术,通过结合临床提示和领域适配的文本提示以及视觉软提示,在多个生物医学图像数据集和模态上显著提升了分类准确率。
English: Biomed-DPT introduces a knowledge-enhanced dual modality prompt tuning technique that combines clinical and domain-adapted text prompts with vision soft prompts to significantly improve biomedical image classification accuracy across multiple datasets and modalities.

Authors:Cong Hua, Qianqian Xu, Zhiyong Yang, Zitai Wang, Shilong Bao, Qingming Huang
Title: OpenworldAUC: Towards Unified Evaluation and Optimization for Open-world Prompt Tuning
Abstract:
Prompt tuning adapts Vision-Language Models like CLIP to open-world tasks with minimal training costs. In this direction, one typical paradigm evaluates model performance separately on known classes (i.e., base domain) and unseen classes (i.e., new domain). However, real-world scenarios require models to handle inputs without prior domain knowledge. This practical challenge has spurred the development of open-world prompt tuning, which demands a unified evaluation of two stages: 1) detecting whether an input belongs to the base or new domain (P1), and 2) classifying the sample into its correct class (P2). What's more, as domain distributions are generally unknown, a proper metric should be insensitive to varying base/new sample ratios (P3). However, we find that current metrics, including HM, overall accuracy, and AUROC, fail to satisfy these three properties simultaneously. To bridge this gap, we propose OpenworldAUC, a unified metric that jointly assesses detection and classification through pairwise instance comparisons. To optimize OpenworldAUC effectively, we introduce Gated Mixture-of-Prompts (GMoP), which employs domain-specific prompts and a gating mechanism to dynamically balance detection and classification. Theoretical guarantees ensure generalization of GMoP under practical conditions. Experiments on 15 benchmarks in open-world scenarios show GMoP achieves SOTA performance on OpenworldAUC and other metrics. We release the code at https://github.com/huacong/OpenworldAUC
中文: 本文提出了OpenworldAUC这一统一评估开放世界提示调优的指标,同时引入GMoP门控提示方法,在多个基准测试中实现了最优性能。
English: This paper introduces OpenworldAUC, a unified metric for evaluating open-world prompt tuning that jointly assesses domain detection and classification, and proposes GMoP, a gated prompt method that achieves state-of-the-art performance across benchmarks.
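
The abstract says the metric jointly assesses detection (P1) and classification (P2) through pairwise instance comparisons, without giving the formula. The sketch below is one plausible reading, stated as an assumption rather than the paper's definition: a (base, new) pair counts only if the detector scores the base sample above the new sample and both samples are classified correctly. The function name and the toy data are hypothetical.

```python
def openworld_auc_sketch(base, new):
    """Pairwise score over (base, new) sample pairs.

    Each sample is a tuple (base_domain_score, correctly_classified).
    Assumption: a pair is a hit when the base sample is ranked above the
    new sample AND both samples are classified correctly.
    """
    if not base or not new:
        raise ValueError("need at least one base and one new sample")
    hits = 0
    for score_b, ok_b in base:
        for score_n, ok_n in new:
            if score_b > score_n and ok_b and ok_n:
                hits += 1
    # Dividing by the number of pairs keeps the value in [0, 1]
    # regardless of how many base or new samples there are.
    return hits / (len(base) * len(new))


# Toy data: (detector score for "base", whether the class prediction is right).
base_samples = [(0.9, True), (0.6, False)]
new_samples = [(0.3, True), (0.7, True)]
score = openworld_auc_sketch(base_samples, new_samples)
```

Normalizing by the pair count rather than the sample count is what makes a pairwise metric of this kind insensitive to the base/new ratio (P3).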

Authors:Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng
Title: Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
Abstract:
The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into a sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features significantly reduces the abilities of LLMs in only one language, leaving others almost unaffected. Interestingly, we find some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs. The code is publicly available at https://github.com/Aatrox103/multilingual-llm-features.
中文摘要:本研究提出了一种新指标来评估稀疏自编码器(SAE)特征的单语性,发现某些特征与特定语言密切相关,通过针对性消除或增强这些特征,可精确控制大语言模型的多语言生成能力。
English Summary: This study introduces a novel metric to evaluate the monolinguality of features from Sparse Autoencoders (SAEs), revealing that certain features are language-specific and their targeted ablation or enhancement can precisely control the multilingual output of Large Language Models.
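
The abstract describes SAEs as decomposing an activation into a sparse linear combination of features, and ablation as removing selected features. A minimal sketch of that arithmetic, assuming the usual decoder form (activation ≈ bias + sum of feature activation times decoder direction), is shown below. The dimensions, values, and function names are toy assumptions and are not taken from the released code.

```python
def reconstruct(feature_acts, decoder, bias):
    """Rebuild an activation as bias + sum_i feature_acts[i] * decoder[i]."""
    out = list(bias)
    for act, direction in zip(feature_acts, decoder):
        for k, d in enumerate(direction):
            out[k] += act * d
    return out


def ablate(activation, feature_acts, decoder, idx):
    """Remove one feature's contribution from the activation."""
    return [x - feature_acts[idx] * d
            for x, d in zip(activation, decoder[idx])]


# Toy SAE with three features in a 2-dimensional activation space.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feature_acts = [2.0, 0.0, 0.5]  # sparse: the second feature is inactive
bias = [0.0, 0.0]

activation = reconstruct(feature_acts, decoder, bias)
without_feature_0 = ablate(activation, feature_acts, decoder, idx=0)
```

Steering works in the opposite direction in this picture: instead of subtracting a feature's direction, a scaled copy of it is added to the activation.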

Authors:Shashank Agnihotri, Amaan Ansari, Annika Dackermann, Fabian Rösch, Margret Keuper
Title: DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions
Abstract:
Deep learning (DL) has surpassed human performance on standard benchmarks, driving its widespread adoption in computer vision tasks. One such task is disparity estimation, estimating the disparity between matching pixels in stereo image pairs, which is crucial for safety-critical applications like medical surgeries and autonomous navigation. However, DL-based disparity estimation methods are highly susceptible to distribution shifts and adversarial attacks, raising concerns about their reliability and generalization. Despite these concerns, a standardized benchmark for evaluating the robustness of disparity estimation methods remains absent, hindering progress in the field. To address this gap, we introduce DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods. DispBench evaluates robustness against synthetic image corruptions such as adversarial attacks and out-of-distribution shifts caused by 2D Common Corruptions across multiple datasets and diverse corruption scenarios. We conduct the most extensive performance and robustness analysis of disparity estimation methods to date, uncovering key correlations between accuracy, reliability, and generalization. Open-source code for DispBench: https://github.com/shashankskagnihotri/benchmarking_robustness/tree/disparity_estimation/final/disparity_estimation
中文: 深度学习在立体视觉的视差估计中表现出色,但易受分布偏移和对抗攻击影响,为此推出了DispBench这一综合基准工具,用于评估多种数据集和干扰场景下的方法鲁棒性。
English: Deep learning excels in disparity estimation for stereo vision but faces reliability issues from distribution shifts and adversarial attacks, prompting the creation of DispBench, a comprehensive tool to benchmark robustness across various corruptions and datasets.

Authors:Wenyang Liu, Jianjun Gao, Kim-Hui Yap
Title: SSH-Net: A Self-Supervised and Hybrid Network for Noisy Image Watermark Removal
Abstract:
Visible watermark removal is challenging due to its inherent complexities and the noise carried within images. Existing methods primarily rely on supervised learning approaches that require paired datasets of watermarked and watermark-free images, which are often impractical to obtain in real-world scenarios. To address this challenge, we propose SSH-Net, a Self-Supervised and Hybrid Network specifically designed for noisy image watermark removal. SSH-Net synthesizes reference watermark-free images using the watermark distribution in a self-supervised manner and adopts a dual-network design to address the task. The upper network, focused on the simpler task of noise removal, employs a lightweight CNN-based architecture, while the lower network, designed to handle the more complex task of simultaneously removing watermarks and noise, incorporates Transformer blocks to model long-range dependencies and capture intricate image features. To enhance the model's effectiveness, a shared CNN-based feature encoder is introduced before the dual networks to extract common features that both networks can leverage. Our code will be available at https://github.com/wenyang001/SSH-Net.
中文: SSH-Net是一种自监督混合网络,通过合成参考图像并采用具有共享特征编码的双网络架构来解决带噪图像水印去除问题,无需配对数据集即可实现高效处理。
English: SSH-Net is a self-supervised hybrid network that addresses noisy image watermark removal by synthesizing reference images and employing a dual-network architecture with shared feature encoding, eliminating the need for paired datasets.

Authors:Hyunho Song, Dongjae Lee, Seunghun Oh, Minwoo Jung, Ayoung Kim
Title: The City that Never Settles: Simulation-based LiDAR Dataset for Long-Term Place Recognition Under Extreme Structural Changes
Abstract:
Large-scale construction and demolition significantly challenge long-term place recognition (PR) by drastically reshaping urban and suburban environments. Existing datasets predominantly reflect limited or indoor-focused changes, failing to adequately represent extensive outdoor transformations. To bridge this gap, we introduce the City that Never Settles (CNS) dataset, a simulation-based dataset created using the CARLA simulator, capturing major structural changes (such as building construction and demolition) across diverse maps and sequences. Additionally, we propose TCR_sym, a symmetric version of the original TCR metric, enabling consistent measurement of structural changes irrespective of source-target ordering. Quantitative comparisons demonstrate that CNS encompasses more extensive transformations than current real-world benchmarks. Evaluations of state-of-the-art LiDAR-based PR methods on CNS reveal substantial performance degradation, underscoring the need for robust algorithms capable of handling significant environmental changes. Our dataset is available at https://github.com/Hyunho111/CNS_dataset.
中文摘要:本研究提出的“永动之城”(CNS)数据集通过模拟建筑建造与拆除等大规模户外结构变化,弥补了现有数据集的不足,并证明当前先进的地点识别方法在应对剧烈环境变化时存在显著性能缺陷。
English Summary: The City that Never Settles (CNS) dataset addresses the limitations of existing datasets by simulating large-scale outdoor structural changes through building construction and demolition, revealing significant performance drops in current place recognition methods when faced with such transformations.

Authors:Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, Yuhui Yin
Title: FG-CLIP: Fine-Grained Visual and Textual Alignment
Abstract:
Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. We construct a comprehensive dataset, termed FineHARD, by integrating high-quality region-specific annotations with hard fine-grained negative samples. Corresponding training methods are meticulously designed for these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.
中文:FG-CLIP通过大规模长文本描述、区域级标注和困难负样本增强CLIP的细粒度理解能力,在多项多模态任务中实现卓越性能。
English: FG-CLIP enhances CLIP's fine-grained understanding through large-scale long captions, region-specific annotations, and hard negative samples, achieving superior performance across multiple multimodal tasks.

Authors:Ruihuai Liang, Bo Yang, Pengyu Chen, Xuelin Cao, Zhiwen Yu, H. Vincent Poor, Chau Yuen
Title: Cross-Problem Solving for Network Optimization: Is Problem-Aware Learning the Key?
Abstract:
As intelligent network services continue to diversify, ensuring efficient and adaptive resource allocation in edge networks has become increasingly critical. Yet the wide functional variations across services often give rise to new and unforeseen optimization problems, rendering traditional manual modeling and solver design both time-consuming and inflexible. This limitation reveals a key gap between current methods and human solving: the inability to recognize and understand problem characteristics. It raises the question of whether problem-aware learning can bridge this gap and support effective cross-problem generalization. To answer this question, we propose a problem-aware diffusion (PAD) model, which leverages a problem-aware learning framework to enable cross-problem generalization. By explicitly encoding the mathematical formulations of optimization problems into token-level embeddings, PAD empowers the model to understand and adapt to problem structures. Extensive experiments across six diverse network optimization problems show that PAD generalizes well to unseen problems while significantly improving solution quality and feasibility. Meanwhile, an auxiliary constraint-aware module is designed to enforce solution validity further. The experiments reveal that problem-aware learning is promising for building general-purpose solvers for intelligent network operation and resource management. Our code is open source at https://github.com/qiyu3816/PAD.
中文: 提出的问题感知扩散(PAD)模型通过将优化问题的数学公式编码为令牌级嵌入,实现了跨问题泛化,在多种网络优化任务中显著提升了解决方案的质量和可行性。
English: The proposed problem-aware diffusion (PAD) model uses token-level embeddings of mathematical problem formulations to enable cross-problem generalization, significantly improving solution quality and feasibility across diverse network optimization tasks.

Authors:Xinyang Lu, Xinyuan Niu, Gregory Kang Ruey Lau, Bui Thi Cam Nhung, Rachael Hwee Ling Sim, Fanyu Wen, Chuan-Sheng Foo, See-Kiong Ng, Bryan Kian Hsiang Low
Title: WaterDrum: Watermarking for Data-centric Unlearning Metric
Abstract:
Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. However, existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings such as when (a) the forget and retain set have semantically similar content, (b) retraining the model from scratch on the retain set is impractical, and/or (c) the model owner can improve the unlearning metric without directly performing unlearning on the LLM. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking for overcoming these limitations. We also introduce new benchmark datasets for LLM unlearning that contain varying levels of similar data points and can be used to rigorously evaluate unlearning algorithms using WaterDrum. Our code is available at https://github.com/lululu008/WaterDrum and our new benchmark datasets are released at https://huggingface.co/datasets/Glow-AI/WaterDrum-Ax.
Chinese: 本文提出了WaterDrum,一种基于数据的大语言模型遗忘度量方法,通过稳健文本水印技术克服现有基于效用的评估局限,并发布了包含不同相似度数据的新基准数据集用于严格评估。
English: This paper introduces WaterDrum, a data-centric unlearning metric for large language models that utilizes robust text watermarking to address limitations in existing utility-based evaluation methods, alongside new benchmark datasets for rigorous assessment.

Authors:Ao Jin, Weijian Zhao, Yifeng Ma, Panfeng Huang, Fan Zhang
Title: Enhanced Robust Tracking Control: An Online Learning Approach
Abstract:
This work focuses on the tracking control problem for nonlinear systems subjected to unknown external disturbances. Inspired by contraction theory, a neural network-driven CCM synthesis is adopted to obtain a feedback controller that can track any feasible trajectory. Based on the observation that the system states under continuous control input inherently contain embedded information about unknown external disturbances, we propose an online learning scheme that captures the disturbance dynamics from online historical data and embeds the compensation within the CCM controller. The proposed scheme operates as a plug-and-play module that intrinsically enhances the tracking performance of CCM synthesis. Numerical simulations on a tethered space robot and a PVTOL demonstrate the effectiveness of the proposed scheme. The source code of the proposed online learning scheme can be found at https://github.com/NPU-RCIR/Online_CCM.git.
中文: 本研究针对受未知外部干扰的非线性系统跟踪控制问题,提出了一种结合神经网络收缩度量的控制器与在线学习模块的方案,通过历史数据实时捕捉并补偿干扰动态,在空间绳系机器人和PVTOL系统的仿真中验证了其提升跟踪性能的有效性。
English: This study addresses tracking control for nonlinear systems under unknown disturbances by integrating a neural network-driven contraction metric controller with an online learning module that captures and compensates for disturbance dynamics from historical data, enhancing performance as demonstrated in simulations of tethered space robots and PVTOL systems.

Authors:Yingyi Zhang, Pengyue Jia, Xianneng Li, Derong Xu, Maolin Wang, Yichao Wang, Zhaocheng Du, Huifeng Guo, Yong Liu, Ruiming Tang, Xiangyu Zhao
Title: LSRP: A Leader-Subordinate Retrieval Framework for Privacy-Preserving Cloud-Device Collaboration
Abstract:
Cloud-device collaboration leverages on-cloud Large Language Models (LLMs) for handling public user queries and on-device Small Language Models (SLMs) for processing private user data, collectively forming a powerful and privacy-preserving solution. However, existing approaches often fail to fully leverage the scalable problem-solving capabilities of on-cloud LLMs while underutilizing the advantage of on-device SLMs in accessing and processing personalized data. This leads to two interconnected issues: 1) Limited utilization of the problem-solving capabilities of on-cloud LLMs, which fail to align with personalized user-task needs, and 2) Inadequate integration of user data into on-device SLM responses, resulting in mismatches in contextual user information. In this paper, we propose a Leader-Subordinate Retrieval framework for Privacy-preserving cloud-device collaboration (LSRP), a novel solution that bridges these gaps by: 1) enhancing on-cloud LLM guidance to on-device SLM through a dynamic selection of task-specific leader strategies named as user-to-user retrieval-augmented generation (U-U-RAG), and 2) integrating the data advantages of on-device SLMs through small model feedback Direct Preference Optimization (SMFB-DPO) for aligning the on-cloud LLM with the on-device SLM. Experiments on two datasets demonstrate that LSRP consistently outperforms state-of-the-art baselines, significantly improving question-answer relevance and personalization, while preserving user privacy through efficient on-device retrieval. Our code is available at: https://github.com/Applied-Machine-Learning-Lab/LSRP.
中文: LSRP框架通过动态选择云端大模型的引导策略并整合端侧小模型的数据优势,在保护隐私的同时显著提升了问答相关性和个性化水平。
English: The proposed LSRP framework enhances cloud-device collaboration by dynamically selecting leader strategies for on-cloud LLMs and integrating on-device SLM data advantages, achieving superior question-answer relevance and personalization while preserving privacy.

Authors:Jaehyun Jeon, Min Soo Kim, Jang Han Yoon, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu
Title: Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding
Abstract:
User interface (UI) design goes beyond visuals, guiding user behavior and overall user experience (UX). Strategically crafted interfaces, for example, can boost sign-ups and drive business sales, underscoring the shift toward UI/UX as a unified design concept. While recent studies have explored UI quality evaluation using Multimodal Large Language Models (MLLMs), they largely focus on surface-level features, overlooking behavior-oriented aspects. To fill this gap, we introduce WiserUI-Bench, a novel benchmark for assessing models' multimodal understanding of UI/UX design. It includes 300 diverse real-world UI image pairs, each consisting of two design variants A/B-tested at scale by actual companies, where one was empirically validated to steer more user actions than the other. Each pair is accompanied by one or more of 684 expert-curated rationales that capture key factors behind each winning design's effectiveness, spanning diverse cognitive dimensions of UX. Our benchmark supports two core tasks: (1) selecting the more effective UI/UX design by predicting the A/B test verified winner and (2) assessing how well a model, given the winner, can explain its effectiveness in alignment with expert reasoning. Experiments across several MLLMs show that current models exhibit limited nuanced reasoning about UI/UX design and its behavioral impact. We believe our work will foster research in UI/UX understanding and enable broader applications such as behavior-aware interface optimization.
中文摘要:本文提出WiserUI-Bench这一新型基准,通过300组经A/B测试验证的真实界面设计对和专家解析,填补了当前界面设计评估中行为导向分析的空白,实验表明现有模型对界面用户体验设计的细微推理能力仍显不足。
English Summary: This paper introduces WiserUI-Bench, a novel benchmark addressing the gap in evaluating UI/UX design by focusing on behavior-oriented aspects through 300 real-world UI image pairs with expert rationales, revealing current models' limited nuanced reasoning about design effectiveness.

Authors:Tingting Liao, Yujian Zheng, Adilbek Karmanov, Liwen Hu, Leyang Jin, Yuliang Xiu, Hao Li
Title: SOAP: Style-Omniscient Animatable Portraits
Abstract:
Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate with eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based generation of Image-to-3D. Our code and data are publicly available for research purposes at https://github.com/TingtingLiao/soap.
中文: SOAP是一种风格普适的框架,能够从任意肖像生成具有完整骨骼绑定和一致拓扑的3D虚拟形象,通过自适应优化和可微分渲染技术,不仅支持基于FACS的面部动画,还能精细保留发饰与编发等复杂细节。
English: SOAP is a style-omniscient framework that generates fully rigged, topology-consistent 3D avatars from any portrait, supporting FACS-based animation and preserving intricate details like accessories and hairstyles through adaptive optimization and differentiable rendering.

Authors:Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Jiang Zong, Hao Peng, Jianwei Yin
Title: Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization
Abstract:
Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches fail to compute "multi-stage" influence and lack scalability to billion-scale LLMs. In this paper, we propose the multi-stage influence function to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation. Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates. Our code is public at https://github.com/colored-dye/multi_stage_influence_function.
中文:本文提出多阶段影响函数,用于在完整参数微调下将精调后大语言模型的预测溯源至预训练数据,并通过EK-FAC近似提升效率,实证结果验证了其优越可扩展性与案例解释力。
English: This paper introduces a multi-stage influence function to trace fine-tuned LLM predictions back to pre-training data, using EK-FAC approximation for scalability and demonstrating its effectiveness through empirical validation and case studies.

Authors:Xin Bi, Zhichao Li, Yuxuan Xia, Panpan Tong, Lijuan Zhang, Yang Chen, Junsheng Fu
Title: Driving with Context: Online Map Matching for Complex Roads Using Lane Markings and Scenario Recognition
Abstract:
Accurate online map matching is fundamental to vehicle navigation and the activation of intelligent driving functions. Current online map matching methods are prone to errors in complex road networks, especially in multilevel road areas. To address this challenge, we propose an online Standard Definition (SD) map matching method by constructing a Hidden Markov Model (HMM) with multiple probability factors. Our proposed method can achieve accurate map matching even in complex road networks by carefully leveraging lane markings and scenario recognition in the design of the probability factors. First, the lane markings are generated by a multi-lane tracking method and associated with the SD map using HMM to build an enriched SD map. In areas covered by the enriched SD map, the vehicle can re-localize itself by performing Iterative Closest Point (ICP) registration for the lane markings. Then, the probability factor accounting for the lane marking detection can be obtained using the association probability between adjacent lanes and roads. Second, the driving scenario recognition model is applied to generate the emission probability factor of scenario recognition, which improves the performance of map matching on elevated roads and ordinary urban roads underneath them. We validate our method through extensive road tests in Europe and China, and the experimental results show that our proposed method effectively improves the online map matching accuracy as compared to other existing methods, especially in multilevel road areas. Specifically, the experiments show that our proposed method achieves $F_1$ scores of 98.04% and 94.60% on the Zenseact Open Dataset and test data of multilevel road areas in Shanghai respectively, significantly outperforming benchmark methods. The implementation is available at https://github.com/TRV-Lab/LMSR-OMM.
Chinese: 本研究提出了一种基于隐马尔可夫模型和多概率因子的在线SD地图匹配方法,通过整合车道线关联和场景识别,在复杂多层道路区域显著提高了匹配精度,欧洲和中国的道路测试已验证其有效性。
English: This study introduces an online SD map matching method using a Hidden Markov Model with multiple probability factors, including lane marking association and scenario recognition, which significantly enhances accuracy in complex multilevel road networks as validated by tests in Europe and China.

Authors:Lang Nie, Chunyu Lin, Kang Liao, Yun Zhang, Shuaicheng Liu, Yao Zhao
Title: StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional Warps
Abstract:
We retarget video stitching to an emerging issue, named warping shake, which unveils the temporal content shakes induced by sequentially unsmooth warps when extending image stitching to video stitching. Even if the input videos are stable, the stitched video can inevitably cause undesired warping shakes and affect the visual experience. To address this issue, we propose StabStitch++, a novel video stitching framework to realize spatial stitching and temporal stabilization with unsupervised learning simultaneously. First, different from existing learning-based image stitching solutions that typically warp one image to align with another, we suppose a virtual midplane between original image planes and project them onto it. Concretely, we design a differentiable bidirectional decomposition module to disentangle the homography transformation and incorporate it into our spatial warp, evenly spreading alignment burdens and projective distortions across two views. Then, inspired by camera paths in video stabilization, we derive the mathematical expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Finally, a warp smoothing model is presented to produce stable stitched videos with a hybrid loss to simultaneously encourage content alignment, trajectory smoothness, and online collaboration. Compared with StabStitch that sacrifices alignment for stabilization, StabStitch++ makes no compromise and optimizes both of them simultaneously, especially in the online mode. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Experiments exhibit that StabStitch++ surpasses current solutions in stitching performance, robustness, and efficiency, offering compelling advancements in this field by building a real-time online video stitching system.
Chinese: 本研究提出了StabStitch++视频拼接框架,通过无监督学习同时实现空间对齐和时间稳定,有效消除拼接视频中的扭曲抖动,并在保持实时性的同时显著提升视觉体验。
English: This research introduces StabStitch++, a novel video stitching framework that simultaneously achieves spatial alignment and temporal stabilization through unsupervised learning, effectively eliminating warping shakes in stitched videos while maintaining real-time performance.

Authors:Ling Yue, Nithin Somasekharan, Yadi Cao, Shaowu Pan
Title: Foam-Agent: Towards Automated Intelligent CFD Workflows
Abstract:
Computational Fluid Dynamics (CFD) is an essential simulation tool in various engineering disciplines, but it often requires substantial domain expertise and manual configuration, creating barriers to entry. We present Foam-Agent, a multi-agent framework that automates complex OpenFOAM-based CFD simulation workflows from natural language inputs. Our innovation includes (1) a hierarchical multi-index retrieval system with specialized indices for different simulation aspects, (2) a dependency-aware file generation system that provides consistency management across configuration files, and (3) an iterative error correction mechanism that diagnoses and resolves simulation failures without human intervention. Through comprehensive evaluation on the dataset of 110 simulation tasks, Foam-Agent achieves an 83.6% success rate with Claude 3.5 Sonnet, significantly outperforming existing frameworks (55.5% for MetaOpenFOAM and 37.3% for OpenFOAM-GPT). Ablation studies demonstrate the critical contribution of each system component, with the specialized error correction mechanism providing a 36.4% performance improvement. Foam-Agent substantially lowers the CFD expertise threshold while maintaining modeling accuracy, demonstrating the potential of specialized multi-agent systems to democratize access to complex scientific simulation tools. The code is public at https://github.com/csml-rpi/Foam-Agent
中文: Foam-Agent是一个多智能体框架,通过自然语言输入自动化OpenFOAM计算流体动力学仿真流程,凭借专业检索、文件生成和纠错系统实现了83.6%的成功率,大幅降低了专业门槛。
English: Foam-Agent is a multi-agent framework that automates OpenFOAM-based CFD simulations from natural language inputs, achieving an 83.6% success rate and significantly lowering the expertise barrier through specialized retrieval, file generation, and error correction systems.

Authors:Lizhe Fang, Yifei Wang, Khashayar Gatmiry, Lei Fang, Yisen Wang
Title: Rethinking Invariance in In-context Learning
Abstract:
In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples regardless of their mutual independence. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit comparable performance with the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring the two properties. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, in most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at https://github.com/PKU-ML/InvICL.
Chinese: 上下文学习(ICL)存在对示例顺序敏感的问题,而提出的不变上下文学习(InvICL)方法通过确保信息不泄露和上下文相互依赖,在多数基准数据集上超越了现有模型,展现出卓越的泛化能力。
English: In-Context Learning (ICL) faces sensitivity to example order, but the proposed Invariant ICL (InvICL) method overcomes this by ensuring information non-leakage and context interdependence, achieving superior performance and generalization across benchmarks.
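To make the notion of permutation invariance concrete, the following is a minimal sketch and not the InvICL method itself. It checks whether a function of a list of context examples returns the same value under every reordering. The names `is_permutation_invariant`, `order_sensitive`, and `order_free` are hypothetical, and the integer "examples" are stand-ins for real demonstrations.

```python
from itertools import permutations

def is_permutation_invariant(fn, examples):
    """Return True if fn gives the same output for every ordering of examples."""
    reference = fn(list(examples))
    return all(fn(list(p)) == reference for p in permutations(examples))

def order_sensitive(examples):
    # Weights later examples more heavily; a toy stand-in for a reader
    # whose output depends on where each example sits in the context.
    return sum((i + 1) * x for i, x in enumerate(examples))

def order_free(examples):
    # A plain sum ignores position, so any reordering gives the same value.
    return sum(examples)

print(is_permutation_invariant(order_sensitive, [1, 2, 3]))
print(is_permutation_invariant(order_free, [1, 2, 3]))
```

The check enumerates all orderings, so it is only practical for a handful of examples; it is meant to illustrate the property, not to test a real model.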

Authors:Xin Zhou, Xiaoxiong Zhang, Dusit Niyato, Zhiqi Shen
Title: Learning Item Representations Directly from Multimodal Features for Effective Recommendation
Abstract:
Conventional multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations by amalgamating item identity (ID) embeddings with multimodal features. Nevertheless, our empirical and theoretical findings unequivocally demonstrate a pronounced optimization gradient bias in favor of acquiring representations from multimodal features over item ID embeddings. As a consequence, item ID embeddings frequently exhibit suboptimal characteristics despite the convergence of multimodal feature parameters. Given the rich informational content inherent in multimodal features, in this paper, we propose a novel model (i.e., LIRDRec) that learns item representations directly from these features to augment recommendation performance. Recognizing that features derived from each modality may capture disparate yet correlated aspects of items, we propose a multimodal transformation mechanism, integrated with modality-specific encoders, to effectively fuse features from all modalities. Moreover, to differentiate the influence of diverse modality types, we devise a progressive weight copying fusion module within LIRDRec. This module incrementally learns the weight assigned to each modality in synthesizing the final user or item representations. Finally, we utilize the powerful visual understanding of Multimodal Large Language Models (MLLMs) to convert the item images into texts and extract semantics embeddings upon the texts via LLMs. Empirical evaluations conducted on five real-world datasets validate the superiority of our approach relative to competing baselines. It is worth noting the proposed model, equipped with embeddings extracted from MLLMs and LLMs, can further improve the recommendation accuracy of NDCG@20 by an average of 4.21% compared to the original embeddings.
中文: 传统多模态推荐系统采用BPR优化时存在对多模态特征的梯度偏好,导致商品ID嵌入学习不足,因此我们提出LIRDRec模型直接通过多模态特征学习表示,结合融合机制与MLLM增强嵌入,显著提升了推荐性能。
English: Traditional multimodal recommender systems using BPR optimization often prioritize multimodal features over item ID embeddings, leading to suboptimal performance, so we propose LIRDRec which directly learns from multimodal features with a fusion mechanism and MLLM-enhanced embeddings to significantly boost recommendation accuracy.
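For readers unfamiliar with the NDCG@20 metric cited in the abstract, here is a minimal sketch of the standard definition under binary relevance. The function name `ndcg_at_k` and the sample item IDs are hypothetical and are not taken from the paper's code.

```python
import math

def ndcg_at_k(ranked_items, relevant_items, k=20):
    """NDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_items)
    # Each relevant item found at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items (up to k of them) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# The single relevant item sits at position 2 instead of position 1.
print(ndcg_at_k(["item_b", "item_a", "item_c"], ["item_a"]))
```

A score of 1.0 means every relevant item is ranked as high as possible; pushing relevant items further down the list lowers the score logarithmically.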

Authors:Fangwei Zhu, Peiyi Wang, Zhifang Sui
Title: Chain-of-Thought Tokens are Computer Program Variables
Abstract:
Chain-of-thoughts (CoT) requires large language models (LLMs) to generate intermediate steps before reaching the final answer, and has been proven effective in helping LLMs solve complex reasoning tasks. However, the inner mechanism of CoT still remains largely unclear. In this paper, we empirically study the role of CoT tokens in LLMs on two compositional tasks: multi-digit multiplication and dynamic programming. While CoT is essential for solving these problems, we find that preserving only tokens that store intermediate results would achieve comparable performance. Furthermore, we observe that storing intermediate results in an alternative latent form will not affect model performance. We also randomly intervene on some values in CoT, and notice that subsequent CoT tokens and the final answer would change correspondingly. These findings suggest that CoT tokens may function like variables in computer programs but with potential drawbacks like unintended shortcuts and computational complexity limits between tokens. The code and data are available at https://github.com/solitaryzero/CoTs_are_Variables.
中文:思维链(CoT)通过生成中间步骤帮助大语言模型解决复杂推理任务,本研究发现CoT标记类似程序中的变量,即使仅保留中间结果或以潜在形式存储,其性能仍可保持。
English: Chain-of-thoughts (CoT) enables large language models to solve complex reasoning tasks by generating intermediate steps, and this study reveals that CoT tokens function like variables in programs, with their performance maintained even when only intermediate results are preserved or altered in latent forms.
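The variable analogy can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's experimental setup. It performs long multiplication while recording each partial product and running total, and it allows one intermediate value to be overwritten, so the effect on later values and on the final answer is visible. The function `multiply_with_trace` and its `intervene` parameter are hypothetical; both inputs are assumed to be non-negative integers.

```python
def multiply_with_trace(a, b, intervene=None):
    """Long multiplication that records every intermediate value.

    `intervene` optionally maps a step index to a replacement partial
    product, mimicking an intervention on one intermediate result.
    """
    trace = []
    total = 0
    # Walk the digits of b from least to most significant.
    for step, digit_char in enumerate(reversed(str(b))):
        partial = a * int(digit_char) * 10 ** step
        if intervene and step in intervene:
            partial = intervene[step]  # overwrite the stored "variable"
        total += partial
        trace.append((step, partial, total))
    return total, trace

clean_answer, clean_trace = multiply_with_trace(123, 45)
changed_answer, changed_trace = multiply_with_trace(123, 45, intervene={0: 600})
print(clean_answer, clean_trace)
print(changed_answer, changed_trace)
```

Because every later running total is computed from the stored partial products, changing one of them propagates to the final answer, which is the behavior the abstract describes for intervened CoT values.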

Authors:Md Aminul Islam, Ahmed Sayeed Faruk
Title: Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations
Abstract:
Recommender systems are essential for delivering personalized content across digital platforms by modeling user preferences and behaviors. Recently, large language models (LLMs) have been adopted for prompt-based recommendation due to their ability to generate personalized outputs without task-specific training. However, LLM-based methods face limitations such as limited context window size, inefficient pointwise and pairwise prompting, and difficulty handling listwise ranking due to token constraints. LLMs can also be sensitive to position bias, as they may overemphasize earlier items in the prompt regardless of their true relevance. To address and investigate these issues, we propose a hybrid framework that combines a traditional recommendation model with an LLM for reranking top-k items using structured prompts. We evaluate the effects of user history reordering and instructional prompts for mitigating position bias. Experiments on MovieLens-100K show that randomizing user history improves ranking quality, but LLM-based reranking does not outperform the base model. Explicit instructions to reduce position bias are also ineffective. Our evaluations reveal limitations in LLMs' ability to model ranking context and mitigate bias. Our code is publicly available at https://github.com/aminul7506/LLMForReRanking.
Chinese: 本研究提出了一种结合传统推荐模型与大语言模型的混合重排序框架,发现尽管随机化用户历史可提升排序质量,但大语言模型在缓解位置偏差方面存在局限,且重排序效果未超越基础模型。
English: This study introduces a hybrid framework combining traditional recommendation models with large language models (LLMs) for reranking, revealing that LLMs struggle with mitigating position bias and fail to outperform base models in ranking tasks despite user history randomization.
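The history-randomization idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function `build_rerank_prompt`, the prompt wording, and the sample titles are all hypothetical. The only point shown is that the user history is shuffled before it is written into the prompt, so no item systematically occupies the earliest positions.

```python
import random

def build_rerank_prompt(history, candidates, seed=None):
    """Build a listwise reranking prompt with the user history in random order."""
    rng = random.Random(seed)   # a seeded generator keeps runs reproducible
    shuffled = list(history)    # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    lines = ["The user has watched (in no particular order):"]
    lines += [f"- {title}" for title in shuffled]
    lines.append("Rank these candidates from most to least relevant:")
    lines += [f"{i}. {title}" for i, title in enumerate(candidates, start=1)]
    return "\n".join(lines)

print(build_rerank_prompt(["Movie A", "Movie B", "Movie C"], ["Movie X", "Movie Y"], seed=0))
```

The candidate list is left in its original order here; whether candidates should also be shuffled is a separate design choice that this sketch does not address.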

Authors:Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
Title: Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Abstract:
Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.
中文: 推理作为智能的核心,在开放多模态环境中对人工智能至关重要;大型多模态推理模型从模块化流程发展为统一框架,以应对泛化与自主行为等挑战,本文通过发展路线图对此进行了系统综述。
English: Reasoning is central to intelligence and crucial for robust AI in open, multimodal environments, with Large Multimodal Reasoning Models evolving from modular pipelines to unified frameworks to address challenges like generalization and agentic behavior, as surveyed through a developmental roadmap.

Authors:Jiaqi Zheng, Qing Ling, Yerong Feng
Title: Physics-Assisted and Topology-Informed Deep Learning for Weather Prediction
Abstract:
Although deep learning models have demonstrated remarkable potential in weather prediction, most of them overlook either the physics of the underlying weather evolution or the topology of the Earth's surface. In light of these limitations, we develop PASSAT, a novel Physics-ASSisted And Topology-informed deep learning model for weather prediction. PASSAT attributes the weather evolution to two key factors: (i) the advection process that can be characterized by the advection equation and the Navier-Stokes equation; (ii) the Earth-atmosphere interaction that is difficult to both model and calculate. PASSAT also takes the topology of the Earth's surface into consideration, rather than simply treating it as a plane. With these considerations, PASSAT numerically solves the advection equation and the Navier-Stokes equation on the spherical manifold, utilizes a spherical graph neural network to capture the Earth-atmosphere interaction, and generates the initial velocity fields that are critical to solving the advection equation from the same spherical graph neural network. On the $5.625^\circ$-resolution ERA5 data set, PASSAT outperforms both the state-of-the-art deep learning-based weather prediction models and the operational numerical weather prediction model IFS T42. Code and checkpoint are available at https://github.com/Yumenomae/PASSAT_5p625.
Chinese: PASSAT是一种创新的深度学习模型,通过结合物理方程和地球表面拓扑结构进行天气预报,在球面流形上求解平流和纳维-斯托克斯方程,并利用球面图神经网络捕捉地球-大气相互作用,从而超越了现有模型的性能。
English: PASSAT is a novel deep learning model that integrates physics equations and Earth's surface topology for weather prediction, outperforming existing models by solving advection and Navier-Stokes equations on a spherical manifold and capturing Earth-atmosphere interactions through a spherical graph neural network.
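
As a rough illustration of what numerically solving an advection equation involves, the sketch below takes one first-order upwind step of the 1D equation du/dt + c du/dx = 0 on a flat periodic grid. This is a toy under stated assumptions: PASSAT works on a spherical manifold with learned velocity fields, and the function name `upwind_advection_step`, the 1D grid, and the constant speed `c` are illustrative choices, not taken from the paper.

```python
def upwind_advection_step(u, c, dx, dt):
    """One explicit upwind step of du/dt + c * du/dx = 0 (c > 0), periodic in x.

    Hypothetical helper for illustration; not PASSAT's spherical solver.
    """
    cfl = c * dt / dx
    if not 0.0 < cfl <= 1.0:
        # The explicit upwind scheme is only stable for 0 < CFL <= 1.
        raise ValueError("need 0 < c*dt/dx <= 1 for stability")
    # For i = 0, u[i - 1] is u[-1], which gives the periodic boundary.
    return [u[i] - cfl * (u[i] - u[i - 1]) for i in range(len(u))]


field = [0.0, 1.0, 0.0, 0.0]
# With CFL = 1 the scheme moves the field by exactly one cell.
shifted = upwind_advection_step(field, c=1.0, dx=1.0, dt=1.0)
# With CFL = 0.5 the pulse is smeared, but the total is conserved.
smeared = upwind_advection_step(field, c=1.0, dx=1.0, dt=0.5)
```

The upwind difference is chosen here only because it is the simplest stable scheme for a positive transport speed; higher-order or spherical discretizations would replace it in practice.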

Authors:Shashank Agnihotri, David Schader, Nico Sharei, Mehmet Ege Kaçar, Margret Keuper
Title: Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?
Abstract:
Deep learning (DL) models are widely used in real-world applications but remain vulnerable to distribution shifts, especially due to weather and lighting changes. Collecting diverse real-world data for testing the robustness of DL models is resource-intensive, making synthetic corruptions an attractive alternative for robustness testing. However, are synthetic corruptions a reliable proxy for real-world corruptions? To answer this, we conduct the largest benchmarking study on semantic segmentation models, comparing performance on real-world corruptions and synthetic corruptions datasets. Our results reveal a strong correlation in mean performance, supporting the use of synthetic corruptions for robustness evaluation. We further analyze corruption-specific correlations, providing key insights to understand when synthetic corruptions succeed in representing real-world corruptions. Open-source Code: https://github.com/shashankskagnihotri/benchmarking_robustness/tree/segmentation_david/semantic_segmentation
中文: 深度学习模型易受现实世界分布变化影响,本研究证实合成扰动可作为语义分割稳健性测试的有效替代,与真实数据表现出强性能相关性。
English: Deep learning models are vulnerable to real-world distribution shifts, and this study confirms that synthetic corruptions serve as a reliable proxy for robustness testing in semantic segmentation, showing strong performance correlation with real-world data.

Authors:Bangyan Liao, Zhenjun Zhao, Haoang Li, Yi Zhou, Yingping Zeng, Hao Li, Peidong Liu
Title: Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World
Abstract:
Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods, however, are either sub-optimal solvers or pursue global optimality at a significant cost in computing time. In contrast to prior works, we introduce convex relaxation techniques to solve this task for the first time. Specifically, we employ a "soft" association scheme, realized via a truncated multi-selection error, that allows for joint estimation of VPs' locations and line-VP associations. This approach leads to a primal problem that can be reformulated into a quadratically constrained quadratic programming (QCQP) problem, which is then relaxed into a convex semidefinite programming (SDP) problem. To solve this SDP problem efficiently, we present a globally optimal outlier-robust iterative solver (called GlobustVP), which independently searches for one VP and its associated lines in each iteration, treating other lines as outliers. After each independent update of all VPs, the mutual orthogonality between the three VPs in a Manhattan world is reinforced via local refinement. Extensive experiments on both synthetic and real-world data demonstrate that GlobustVP achieves a favorable balance between efficiency, robustness, and global optimality compared to previous works. The code is publicly available at https://github.com/WU-CVGL/GlobustVP.
中文: 本文提出GlobustVP方法,通过凸松弛技术和软关联方案,在曼哈顿世界中高效估计消失点,实现了计算效率、鲁棒性和全局最优性之间的良好平衡。
English: This paper introduces GlobustVP, a novel method that uses convex relaxation and a soft association scheme to efficiently estimate vanishing points in Manhattan worlds, achieving a balance between computational efficiency, robustness, and global optimality.

Authors:Shuoyan Wei, Feng Li, Shengeng Tang, Yao Zhao, Huihui Bai
Title: EvEnhancer: Empowering Effectiveness, Efficiency and Generalizability for Continuous Space-Time Video Super-Resolution with Events
Abstract:
Continuous space-time video super-resolution (C-STVSR) endeavors to upscale videos simultaneously at arbitrary spatial and temporal scales, which has recently garnered increasing interest. However, prevailing methods struggle to yield satisfactory videos at out-of-distribution spatial and temporal scales. On the other hand, event streams, characterized by high temporal resolution and high dynamic range, exhibit compelling promise in vision tasks. This paper presents EvEnhancer, an innovative approach that marries the unique advantages of event streams to elevate effectiveness, efficiency, and generalizability for C-STVSR. Our approach hinges on two pivotal components: 1) Event-adapted synthesis capitalizes on the spatiotemporal correlations between frames and events to discern and learn long-term motion trajectories, enabling the adaptive interpolation and fusion of informative spatiotemporal features; 2) Local implicit video transformer integrates local implicit video neural function with cross-scale spatiotemporal attention to learn continuous video representations utilized to generate plausible videos at arbitrary resolutions and frame rates. Experiments show that EvEnhancer achieves superiority on synthetic and real-world datasets and preferable generalizability on out-of-distribution scales against state-of-the-art methods. Code is available at https://github.com/W-Shuoyan/EvEnhancer.
中文: EvEnhancer是一种创新方法,利用事件流的优势提升连续时空视频超分辨率的效果,通过事件自适应合成和局部隐式视频变换器,在任意尺度上实现了卓越性能和泛化能力。
English: EvEnhancer is a novel method that leverages event streams to enhance continuous space-time video super-resolution, achieving superior performance and generalizability across arbitrary scales through event-adapted synthesis and a local implicit video transformer.

Authors:Zilong Chen, Yikai Wang, Wenqiang Sun, Feng Wang, Yiwen Chen, Huaping Liu
Title: MeshGen: Generating PBR Textured Mesh with Render-Enhanced Auto-Encoder and Generative Data Augmentation
Abstract:
In this paper, we introduce MeshGen, an advanced image-to-3D pipeline that generates high-quality 3D meshes with detailed geometry and physically based rendering (PBR) textures. Addressing the challenges faced by existing 3D native diffusion models, such as suboptimal auto-encoder performance, limited controllability, poor generalization, and inconsistent image-based PBR texturing, MeshGen employs several key innovations to overcome these limitations. We pioneer a render-enhanced point-to-shape auto-encoder that compresses meshes into a compact latent space by designing perceptual optimization with ray-based regularization. This ensures that the 3D shapes are accurately represented and reconstructed to preserve geometric details within the latent space. To address data scarcity and image-shape misalignment, we further propose geometric augmentation and generative rendering augmentation techniques, which enhance the model's controllability and generalization ability, allowing it to perform well even with limited public datasets. For the texture generation, MeshGen employs a reference attention-based multi-view ControlNet for consistent appearance synthesis. This is further complemented by our multi-view PBR decomposer that estimates PBR components and a UV inpainter that fills invisible areas, ensuring a seamless and consistent texture across the 3D mesh. Our extensive experiments demonstrate that MeshGen largely outperforms previous methods in both shape and texture generation, setting a new standard for the quality of 3D meshes generated with PBR textures. See our code at https://github.com/heheyas/MeshGen, project page https://heheyas.github.io/MeshGen
中文: MeshGen提出了一种先进的图像到3D生成流程,通过渲染增强自编码器和多视角纹理合成等创新技术,在几何细节和PBR材质生成质量上显著超越了现有方法。
English: MeshGen introduces an advanced image-to-3D pipeline that overcomes limitations of existing models through innovations like a render-enhanced auto-encoder and multi-view texture synthesis, achieving superior quality in both geometry and PBR textures.

Authors:Yi Lin, Dong Zhang, Xiao Fang, Yufan Chen, Kwang-Ting Cheng, Hao Chen
Title: Rethinking Boundary Detection in Deep Learning-Based Medical Image Segmentation
Abstract:
Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators to tackle this challenge. CTO surpasses existing methods in terms of segmentation accuracy and strikes a better balance between accuracy and efficiency, without the need for additional data inputs or label injections. Specifically, CTO adheres to the canonical encoder-decoder network paradigm, with a dual-stream encoder network comprising a mainstream CNN stream for capturing local features and an auxiliary StitchViT stream for integrating long-range dependencies. Furthermore, to enhance the model's ability to learn boundary areas, we introduce a boundary-guided decoder network that employs binary boundary masks generated by dedicated edge detection operators to provide explicit guidance during the decoding process. We validate the performance of CTO through extensive experiments conducted on seven challenging medical image segmentation datasets, namely ISIC 2016, PH2, ISIC 2018, CoNIC, LiTS17, and BTCV. Our experimental results unequivocally demonstrate that CTO achieves state-of-the-art accuracy on these datasets while maintaining competitive model complexity. The codes have been released at: https://github.com/xiaofang007/CTO.
中文: CTO网络架构融合了CNN、视觉变换器和边缘检测技术,显著提升了医学图像分割的精度,尤其在边界区域表现出色,无需额外数据即在多个数据集上达到领先水平。
English: The CTO network architecture integrates CNNs, Vision Transformers, and edge detection to enhance medical image segmentation accuracy, particularly in boundary areas, achieving state-of-the-art results across multiple datasets without extra data requirements.

Authors:Hicham Assoudi
Title: A Comparative Benchmark of a Moroccan Darija Toxicity Detection Model (Typica.ai) and Major LLM-Based Moderation APIs (OpenAI, Mistral, Anthropic)
Abstract:
This paper presents a comparative benchmark evaluating the performance of Typica.ai's custom Moroccan Darija toxicity detection model against major LLM-based moderation APIs: OpenAI (omni-moderation-latest), Mistral (mistral-moderation-latest), and Anthropic Claude (claude-3-haiku-20240307). We focus on culturally grounded toxic content, including implicit insults, sarcasm, and culturally specific aggression often overlooked by general-purpose systems. Using a balanced test set derived from the OMCD_Typica.ai_Mix dataset, we report precision, recall, F1-score, and accuracy, offering insights into challenges and opportunities for moderation in underrepresented languages. Our results highlight Typica.ai's superior performance, underlining the importance of culturally adapted models for reliable content moderation.
中文摘要:本研究对Typica.ai的摩洛哥方言毒性检测模型与主流LLM审核API进行对比评估,结果表明该模型在识别文化特异性有害内容方面表现更优,凸显了文化适配模型的重要性。
English Summary: This study benchmarks Typica.ai's Moroccan Darija toxicity detection model against leading LLM moderation APIs, demonstrating its superior performance in identifying culturally specific toxic content through comprehensive metrics.

Authors:Yuning Du, Jingshuai Liu, Rohan Dharmakumar, Sotirios A. Tsaftaris
Title: Active Sampling for MRI-based Sequential Decision Making
Abstract:
Despite the superior diagnostic capability of Magnetic Resonance Imaging (MRI), its use as a Point-of-Care (PoC) device remains limited by high cost and complexity. To enable such a future by reducing the magnetic field strength, one key approach will be to improve sampling strategies. Previous work has shown that it is possible to make diagnostic decisions directly from k-space with fewer samples. Such work shows that single diagnostic decisions can be made, but if we aspire to see MRI as a true PoC device, multiple and sequential decisions are necessary while minimizing the number of samples acquired. We present a novel multi-objective reinforcement learning framework enabling comprehensive, sequential, diagnostic evaluation from undersampled k-space data. During inference, our approach actively adapts to sequential decisions to sample optimally. To achieve this, we introduce a training methodology that identifies the samples that contribute most to each diagnostic objective using a step-wise weighting reward function. We evaluate our approach in two sequential knee pathology assessment tasks: ACL sprain detection and cartilage thickness loss assessment. Our framework achieves diagnostic performance competitive with various policy-based benchmarks on disease detection, severity quantification, and overall sequential diagnosis, while substantially saving k-space samples. Our approach paves the way for the future of MRI as a comprehensive and affordable PoC device. Our code is publicly available at https://github.com/vios-s/MRI_Sequential_Active_Sampling
中文摘要:本研究提出了一种多目标强化学习框架,通过自适应优化采样策略,能够从欠采样的MRI k空间数据中实现连续诊断评估,在保持膝关节病变诊断准确性的同时大幅减少采样需求。
English Summary: This study introduces a multi-objective reinforcement learning framework that enables sequential diagnostic evaluations from undersampled MRI k-space data, significantly reducing sampling requirements while maintaining competitive diagnostic accuracy for knee pathology assessments.

Authors:Kunlun Xu, Xu Zou, Gang Hua, Jiahuan Zhou
Title: Componential Prompt-Knowledge Alignment for Domain Incremental Learning
Abstract:
Domain Incremental Learning (DIL) aims to learn from non-stationary data streams across domains while retaining and utilizing past knowledge. Although prompt-based methods effectively store multi-domain knowledge in prompt parameters and obtain advanced performance through cross-domain prompt fusion, we reveal an intrinsic limitation: component-wise misalignment between domain-specific prompts leads to conflicting knowledge integration and degraded predictions. This arises from the random positioning of knowledge components within prompts, where irrelevant component fusion introduces interference. To address this, we propose Componential Prompt-Knowledge Alignment (KA-Prompt), a novel prompt-based DIL method that introduces component-aware prompt-knowledge alignment during training, significantly improving both the learning and inference capacity of the model. KA-Prompt operates in two phases: (1) Initial Componential Structure Configuring, where a set of old prompts containing knowledge relevant to the new domain is mined via greedy search and then exploited to initialize new prompts, achieving reusable knowledge transfer and establishing intrinsic alignment between new and old prompts. (2) Online Alignment Preservation, which dynamically identifies the target old prompts and applies adaptive componential consistency constraints as new prompts evolve. Extensive experiments on DIL benchmarks demonstrate the effectiveness of our KA-Prompt. Our source code is available at https://github.com/zhoujiahuan1991/ICML2025-KA-Prompt
中文摘要:本文针对基于提示的领域增量学习中领域特定提示组件错位导致知识冲突的问题,提出KA-Prompt方法,通过组件感知的提示-知识对齐机制显著提升模型的学习和推理能力。
English Summary: This paper identifies a limitation in prompt-based Domain Incremental Learning where misaligned domain-specific prompts cause conflicting knowledge integration, and proposes KA-Prompt with component-aware alignment to enhance model learning and inference capabilities.

Authors:Guanghui Wang, Zhiyong Yang, Zitai Wang, Shi Wang, Qianqian Xu, Qingming Huang
Title: ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $α$-$β$-Divergence
Abstract:
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model by minimizing the divergence between their output distributions, typically using forward Kullback-Leibler divergence (FKLD) or reverse KLD (RKLD). It has become an effective training paradigm due to the broader supervision information provided by the teacher distribution compared to one-hot labels. We identify that the core challenge in KD lies in balancing two mode-concentration effects: the Hardness-Concentration effect, which refers to focusing on modes with large errors, and the Confidence-Concentration effect, which refers to focusing on modes with high student confidence. Through an analysis of how probabilities are reassigned during gradient updates, we observe that these two effects are entangled in FKLD and RKLD, but in extreme forms. Specifically, both are too weak in FKLD, causing the student to fail to concentrate on the target class. In contrast, both are too strong in RKLD, causing the student to overly emphasize the target class while ignoring the broader distributional information from the teacher. To address this imbalance, we propose ABKD, a generic framework with $α$-$β$-divergence. Our theoretical results show that ABKD offers a smooth interpolation between FKLD and RKLD, achieving an effective trade-off between these effects. Extensive experiments on 17 language/vision datasets with 12 teacher-student settings confirm its efficacy. The code is available at https://github.com/ghwang-s/abkd.
中文: 知识蒸馏的核心挑战在于平衡硬度集中和置信度集中两种效应,ABKD通过α-β散度实现了前向KL散度与反向KL散度之间的平滑插值,在多项实验中验证了其有效性。
English: Knowledge Distillation faces a challenge in balancing the Hardness-Concentration and Confidence-Concentration effects, which ABKD addresses using α-β-divergence to effectively trade off between FKLD and RKLD, as validated by extensive experiments.
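
To make the FKLD/RKLD interpolation concrete, here is a minimal sketch that computes both KL directions for a toy teacher/student pair, together with a two-parameter α-β-divergence. The divergence is written in the general form found in the divergence literature, and treating it as the exact parameterization used by ABKD is an assumption. The function names and the example distributions are also illustrative.

```python
import math


def forward_kl(p, q):
    """KL(p || q): teacher p weights the log-ratio (FKLD)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reverse_kl(p, q):
    """KL(q || p): student q weights the log-ratio (RKLD)."""
    return forward_kl(q, p)


def ab_divergence(p, q, alpha, beta):
    """alpha-beta-divergence for alpha, beta, alpha + beta all nonzero.

    Assumed general form (not confirmed against the ABKD code):
    -1/(alpha*beta) * sum(p^alpha q^beta - alpha/(alpha+beta) p^(alpha+beta)
                          - beta/(alpha+beta) q^(alpha+beta)).
    """
    if alpha == 0 or beta == 0 or alpha + beta == 0:
        raise ValueError("alpha, beta and alpha + beta must be nonzero")
    s = alpha + beta
    total = sum(
        pi ** alpha * qi ** beta - (alpha / s) * pi ** s - (beta / s) * qi ** s
        for pi, qi in zip(p, q)
    )
    return -total / (alpha * beta)


teacher = [0.7, 0.2, 0.1]   # hypothetical teacher probabilities
student = [0.4, 0.4, 0.2]   # hypothetical student probabilities

fkld = forward_kl(teacher, student)
rkld = reverse_kl(teacher, student)
near_fkld = ab_divergence(teacher, student, alpha=1.0, beta=1e-6)
near_rkld = ab_divergence(teacher, student, alpha=1e-6, beta=1.0)
```

Under this form, (α, β) near (1, 0) approaches the forward KL and (α, β) near (0, 1) approaches the reverse KL for normalized distributions, so intermediate settings move between the two regimes.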

Authors:Ashutosh Singandhupe, Sanket Lokhande, Hung Manh La
Title: Registration of 3D Point Sets Using Exponential-based Similarity Matrix
Abstract:
Point cloud registration is a fundamental problem in computer vision and robotics, involving the alignment of 3D point sets captured from varying viewpoints using depth sensors such as LiDAR or structured light. In modern robotic systems, especially those focused on mapping, it is essential to merge multiple views of the same environment accurately. However, state-of-the-art registration techniques often struggle when large rotational differences exist between point sets or when the data is significantly corrupted by sensor noise. These challenges can lead to misalignments and, consequently, to inaccurate or distorted 3D reconstructions. In this work, we address both these limitations by proposing a robust modification to the classic Iterative Closest Point (ICP) algorithm. Our method, termed Exponential Similarity Matrix ICP (ESM-ICP), integrates a Gaussian-inspired exponential weighting scheme to construct a similarity matrix that dynamically adapts across iterations. This matrix facilitates improved estimation of both rotational and translational components during alignment. We demonstrate the robustness of ESM-ICP in two challenging scenarios: (i) large rotational discrepancies between the source and target point clouds, and (ii) data corrupted by non-Gaussian noise. Our results show that ESM-ICP outperforms traditional geometric registration techniques as well as several recent learning-based methods. To encourage reproducibility and community engagement, our full implementation is made publicly available on GitHub. https://github.com/aralab-unr/ESM_ICP
中文: 本文提出ESM-ICP算法,通过引入指数相似性矩阵改进传统迭代最近点方法,有效解决了点云在大旋转差异和非高斯噪声下的配准难题,其性能优于现有主流方法。
English: This paper introduces ESM-ICP, a robust variant of the Iterative Closest Point algorithm that uses an exponential similarity matrix to improve registration accuracy under large rotational differences and non-Gaussian noise, outperforming existing methods.
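
The abstract describes a Gaussian-inspired exponential weighting without giving its formula, so the following is only a sketch of the general idea: each matched point pair receives a weight exp(-d^2 / (2 sigma^2)), and far-apart (likely outlier) pairs contribute almost nothing to the alignment statistics. The per-pair weighting, the bandwidth `sigma`, and both function names are assumptions, not the paper's similarity-matrix construction.

```python
import math


def exponential_weights(source, target, sigma):
    """Weight each matched pair (source[i], target[i]) by exp(-d^2 / (2 sigma^2)).

    Hypothetical helper; sigma is an assumed bandwidth parameter.
    """
    weights = []
    for s, t in zip(source, target):
        squared_distance = sum((a - b) ** 2 for a, b in zip(s, t))
        weights.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    return weights


def weighted_centroid(points, weights):
    """Centroid in which low-weight points have little influence."""
    total = sum(weights)
    dim = len(points[0])
    return tuple(
        sum(w * p[k] for p, w in zip(points, weights)) / total
        for k in range(dim)
    )


source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (0.0, 0.0, 0.0)]  # last pair is an outlier

weights = exponential_weights(source, target, sigma=0.5)
centroid = weighted_centroid(source, weights)
```

In a weighted ICP variant, centroids and cross-covariances computed with such weights would feed the usual rotation and translation estimate. The weights would be recomputed each iteration as correspondences change.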

Authors:Qi Zhou, Yukai Shi, Xiaojun Yang, Xiaoyu Xian, Lunjia Liao, Ruimao Zhang, Liang Lin
Title: DFVO: Learning Darkness-free Visible and Infrared Image Disentanglement and Fusion All at Once
Abstract:
Visible and infrared image fusion is one of the most crucial tasks in the field of image fusion, aiming to generate fused images with clear structural information and high-quality texture features for high-level vision tasks. However, when faced with severe illumination degradation in visible images, the fusion results of existing image fusion methods often exhibit blurry and dim visual effects, posing major challenges for autonomous driving. To this end, a Darkness-Free network is proposed to handle Visible and infrared image disentanglement and fusion all at Once (DFVO), which employs a cascaded multi-task approach to replace the traditional two-stage cascaded training (enhancement and fusion), addressing the issue of information entropy loss caused by hierarchical data transmission. Specifically, we construct a latent-common feature extractor (LCFE) to obtain latent features for the cascaded tasks strategy. Firstly, a details-extraction module (DEM) is devised to acquire high-frequency semantic information. Secondly, we design a hyper cross-attention module (HCAM) to extract low-frequency information and preserve texture features from source images. Finally, a relevant loss function is designed to guide the holistic network learning, thereby achieving better image fusion. Extensive experiments demonstrate that our proposed approach outperforms state-of-the-art alternatives in terms of qualitative and quantitative evaluations. Particularly, DFVO can generate clearer, more informative, and more evenly illuminated fusion results in dark environments, achieving the best performance on the LLVIP dataset with 63.258 dB PSNR and 0.724 CC, providing more effective information for high-level vision tasks. Our code is publicly accessible at https://github.com/DaVin-Qi530/DFVO.
中文摘要:提出的Darkness-Free网络(DFVO)通过单级级联框架同时处理可见光与红外图像的分离与融合,解决了弱光条件下传统方法融合结果模糊的问题,在黑暗环境中能生成更清晰、信息更丰富的融合图像。
English Summary: The proposed Darkness-Free network (DFVO) addresses blurry fusion results under poor lighting by integrating visible-infrared image disentanglement and fusion in a single cascaded framework, achieving superior performance with clearer, more informative outputs in dark environments.

Authors:Zhikai Zhao, Chuanbo Hua, Federico Berto, Kanghoon Lee, Zihan Ma, Jiachen Li, Jinkyoo Park
Title: TrajEvo: Designing Trajectory Prediction Heuristics via LLM-driven Evolution
Abstract:
Trajectory prediction is a crucial task in modeling human behavior, especially in fields such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy, while recently proposed deep learning approaches suffer from computational cost, lack of explainability, and generalization issues that limit their practical adoption. In this paper, we introduce TrajEvo, a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We introduce a Cross-Generation Elite Sampling to promote population diversity and a Statistics Feedback Loop allowing the LLM to analyze alternative predictions. Our evaluations show TrajEvo outperforms previous heuristic methods on the ETH-UCY datasets, and remarkably outperforms both heuristics and deep learning methods when generalizing to the unseen SDD dataset. TrajEvo represents a first step toward automated design of fast, explainable, and generalizable trajectory prediction heuristics. We make our source code publicly available to foster future research at https://github.com/ai4co/trajevo.
中文: TrajEvo是一种创新框架,利用大型语言模型和进化算法自动设计轨迹预测启发式方法,在多个数据集上超越了传统方法和深度学习方法,具备高准确性、可解释性和泛化能力。
English: TrajEvo is a novel framework that uses Large Language Models and evolutionary algorithms to automatically design accurate, explainable, and generalizable trajectory prediction heuristics, outperforming traditional and deep learning methods across datasets.

Authors:Jing Hu, Chengming Feng, Shu Hu, Ming-Ching Chang, Xin Li, Xi Wu, Xin Wang
Title: RLMiniStyler: Light-weight RL Style Agent for Arbitrary Sequential Neural Style Generation
Abstract:
Arbitrary style transfer aims to apply the style of any given artistic image to another content image. Still, existing deep learning-based methods often require significant computational costs to generate diverse stylized results. Motivated by this, we propose RLMiniStyler, a novel reinforcement learning-based framework for arbitrary style transfer. This framework leverages a unified reinforcement learning policy to iteratively guide the style transfer process by exploring and exploiting stylization feedback, generating smooth sequences of stylized results while keeping the model lightweight. Furthermore, we introduce an uncertainty-aware multi-task learning strategy that automatically adjusts loss weights to adapt to the content and style balance requirements at different training stages, thereby accelerating model convergence. Through a series of experiments across images of various resolutions, we have validated the advantages of RLMiniStyler over other state-of-the-art methods in generating high-quality, diverse artistic image sequences at a lower cost. Codes are available at https://github.com/fengxiaoming520/RLMiniStyler.
中文: 本文提出RLMiniStyler强化学习框架,通过迭代策略引导和不确定性感知的多任务学习策略,以更低成本生成多样化高质量风格化图像序列,在多个分辨率图像上验证了其优越性。
English: This paper introduces RLMiniStyler, a reinforcement learning-based framework that efficiently generates diverse stylized image sequences through iterative policy guidance and an uncertainty-aware multi-task learning strategy, achieving high-quality results with reduced computational costs.

Authors:Ren Wang, Pengcheng Zhou
Title: Latent Manifold Reconstruction and Representation with Topological and Geometrical Regularization
Abstract:
Manifold learning aims to discover and represent low-dimensional structures underlying high-dimensional data while preserving critical topological and geometric properties. Existing methods often fail to capture local details with global topological integrity from noisy data or construct a balanced dimensionality reduction, resulting in distorted or fractured embeddings. We present an AutoEncoder-based method that integrates a manifold reconstruction layer, which uncovers latent manifold structures from noisy point clouds, and further provides regularizations on topological and geometric properties during dimensionality reduction, with the two components reinforcing each other during training. Experiments on point cloud datasets demonstrate that our method outperforms baselines like t-SNE, UMAP, and Topological AutoEncoders in discovering manifold structures from noisy data and preserving them through dimensionality reduction, as validated by visualization and quantitative metrics. This work demonstrates the significance of combining manifold reconstruction with manifold learning to achieve reliable representation of the latent manifold, particularly when dealing with noisy real-world data. Code repository: https://github.com/Thanatorika/mrtg.
Chinese: 本文提出了一种基于自动编码器的方法,将流形重建与拓扑和几何正则化相结合,能够有效从噪声数据中发现并保持潜在结构,在可视化和定量指标上均优于现有技术。
English: This paper introduces an AutoEncoder-based method that integrates manifold reconstruction with topological and geometric regularizations to effectively discover and preserve latent structures from noisy data, outperforming existing techniques in both visualization and quantitative metrics.

Authors:Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian
Title: DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Abstract:
Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The "content" features are aligned with image crop representations to improve local discriminability, while "context" features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code is available at https://github.com/xiaomoguhz/DeCLIP.
Chinese: DeCLIP通过解耦自注意力机制为“内容”和“上下文”特征来增强CLIP,有效提升局部判别能力和空间一致性,在开放词汇密集预测任务中表现卓越。
English: DeCLIP enhances CLIP by decoupling self-attention into "content" and "context" features to improve local discriminability and spatial consistency, achieving superior performance in open-vocabulary dense prediction tasks.

Authors:Mohammad Elayan, Wissam Kontar
Title: Consensus-Aware AV Behavior: Trade-offs Between Safety, Interaction, and Performance in Mixed Urban Traffic
Abstract:
Transportation systems have long been shaped by complexity and heterogeneity, driven by the interdependency of agent actions and traffic outcomes. The deployment of automated vehicles (AVs) in such systems introduces a new challenge: achieving consensus across safety, interaction quality, and traffic performance. In this work, we position consensus as a fundamental property of the traffic system and aim to quantify it. We use high-resolution trajectory data from the Third Generation Simulation (TGSIM) dataset to empirically analyze AV and human-driven vehicle (HDV) behavior at a signalized urban intersection and around vulnerable road users (VRUs). Key metrics, including Time-to-Collision (TTC), Post-Encroachment Time (PET), deceleration patterns, headways, and string stability, are evaluated across the three performance dimensions. Results show that full consensus across safety, interaction, and performance is rare, with only 1.63% of AV-VRU interaction frames meeting all three conditions. These findings highlight the need for AV models that explicitly balance multi-dimensional performance in mixed-traffic environments. Full reproducibility is supported via our open-source codebase on https://github.com/wissamkontar/Consensus-AV-Analysis.
中文: 本研究通过分析自动驾驶与人类驾驶车辆的交互,量化交通系统中的共识,发现安全、交互质量和性能三者完全一致的情况极少,强调需开发能平衡这些维度的自动驾驶模型。
English: This study quantifies consensus in traffic systems by analyzing automated and human-driven vehicle interactions, finding that full agreement across safety, interaction quality, and performance is rare, highlighting the need for AV models that balance these dimensions.
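
As a minimal sketch of the surrogate safety measures named in the abstract, the snippet below computes Time-to-Collision (TTC) and Post-Encroachment Time (PET) from their standard textbook definitions, plus the share of frames that satisfy all three conditions. The function names and the boolean-flag frame format are hypothetical and are not taken from the authors' codebase; the thresholds that would produce those flags are left to the caller.

```python
from typing import Iterable, Optional, Tuple


def time_to_collision(gap_m: float, follower_speed: float, leader_speed: float) -> Optional[float]:
    """TTC in seconds for a follower approaching a leader; None if it is not closing in."""
    closing_speed = follower_speed - leader_speed  # m/s
    if closing_speed <= 0:
        return None  # no collision course under constant speeds
    return gap_m / closing_speed


def post_encroachment_time(first_exit_s: float, second_entry_s: float) -> float:
    """PET: time between the first user leaving the conflict area and the second entering it."""
    return second_entry_s - first_exit_s


def consensus_share(frames: Iterable[Tuple[bool, bool, bool]]) -> float:
    """Fraction of frames whose (safety, interaction, performance) flags are all True."""
    frames = list(frames)
    if not frames:
        return 0.0
    return sum(1 for flags in frames if all(flags)) / len(frames)


ttc = time_to_collision(gap_m=20.0, follower_speed=10.0, leader_speed=5.0)
pet = post_encroachment_time(first_exit_s=3.0, second_entry_s=4.5)
share = consensus_share([(True, True, True), (True, False, True)])
```

In this toy input the follower closes a 20 m gap at 5 m/s, and one of the two frames meets all three conditions.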

Authors:Jie Sun, Heng Liu, Yongzhen Wang, Xiao-Ping Zhang, Mingqiang Wei
Title: WDMamba: When Wavelet Degradation Prior Meets Vision Mamba for Image Dehazing
Abstract:
In this paper, we reveal a novel haze-specific wavelet degradation prior observed through wavelet transform analysis, which shows that haze-related information predominantly resides in low-frequency components. Exploiting this insight, we propose a novel dehazing framework, WDMamba, which decomposes the image dehazing task into two sequential stages: low-frequency restoration followed by detail enhancement. This coarse-to-fine strategy enables WDMamba to effectively capture features specific to each stage of the dehazing process, resulting in high-quality restored images. Specifically, in the low-frequency restoration stage, we integrate Mamba blocks to reconstruct global structures with linear complexity, efficiently removing overall haze and producing a coarse restored image. Thereafter, the detail enhancement stage reinstates fine-grained information that may have been overlooked during the previous phase, culminating in the final dehazed output. Furthermore, to enhance detail retention and achieve more natural dehazing, we introduce a self-guided contrastive regularization during network training. By utilizing the coarse restored output as a hard negative example, our model learns more discriminative representations, substantially boosting the overall dehazing performance. Extensive evaluations on public dehazing benchmarks demonstrate that our method surpasses state-of-the-art approaches both qualitatively and quantitatively. Code is available at https://github.com/SunJ000/WDMamba.
中文摘要:本文提出WDMamba新型去雾框架,利用雾霾相关的小波退化先验将去雾过程分解为低频恢复和细节增强两阶段,通过Mamba模块重构全局结构并结合自引导对比正则化恢复精细信息,在公共基准测试中实现了最优的去雾效果。
English Summary: This paper introduces WDMamba, a novel dehazing framework that leverages a haze-specific wavelet degradation prior to decompose image dehazing into low-frequency restoration using Mamba blocks for global structure recovery and detail enhancement for fine-grained information, achieving state-of-the-art performance through a self-guided contrastive regularization during training.
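
To illustrate the idea that a smooth degradation lands mostly in low-frequency wavelet coefficients, here is a toy one-level, unnormalized Haar transform on a 1D signal. Modeling "haze" as a constant offset is a deliberate oversimplification made for this sketch, and `haar_level` is a hypothetical helper, not part of WDMamba.

```python
from typing import List, Tuple


def haar_level(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One level of an unnormalized Haar transform on an even-length signal.

    Returns (low, high): pairwise averages and pairwise half-differences.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]
    high = [(a - b) / 2 for a, b in pairs]
    return low, high


clean = [1.0, 3.0, 2.0, 6.0]
hazy = [v + 5.0 for v in clean]  # toy "haze": a constant offset (assumption)

clean_low, clean_high = haar_level(clean)
hazy_low, hazy_high = haar_level(hazy)
```

Under this toy model the high-frequency coefficients of `clean` and `hazy` are identical and only the low-frequency ones shift, which is the intuition behind restoring the low-frequency band first.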

Authors:Kai Ruan, Mowen Huang, Ji-Rong Wen, Hao Sun
Title: Benchmarking LLMs' Swarm intelligence
Abstract:
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict swarm-like constraints (limited local perception and communication) remains largely unexplored. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks (Pursuit, Synchronization, Foraging, Flocking, Transport) within a configurable 2D grid environment, forcing agents to rely solely on local sensory input ($k\times k$ view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Zero-shot evaluations of leading LLMs (e.g., deepseek-v3, o4-mini) reveal significant task-dependent performance variations. While some rudimentary coordination is observed, our results indicate that current LLMs significantly struggle with robust long-range planning and adaptive strategy formation under the uncertainty inherent in these decentralized scenarios. Assessing LLMs under such swarm-like constraints is crucial for understanding their utility in future decentralized intelligent systems. We release SwarmBench as an open, extensible toolkit, built on a customizable physical system, providing environments, prompts, evaluation scripts, and comprehensive datasets. This aims to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of emergent collective behavior under severe informational decentralization. Our code repository is available at https://github.com/x66ccff/swarmbench.
中文: 大型语言模型在严格群体约束下的分散式多智能体系统中表现出有限的涌现协调能力,新型SwarmBench基准测试揭示了其在长程规划和适应性策略形成方面的显著不足。
English: Large Language Models exhibit limited emergent coordination in decentralized multi-agent systems under strict swarm constraints, as revealed by the novel SwarmBench benchmark which highlights their struggles with long-range planning and adaptive strategies.

Authors:Yi Li, Zhiyuan Zhang, Jiangnan Xia, Jianghan Cheng, Qilong Wu, Junwei Li, Yibin Tian, Hui Kong
Title: TS-Diff: Two-Stage Diffusion Model for Low-Light RAW Image Enhancement
Abstract:
This paper presents a novel Two-Stage Diffusion Model (TS-Diff) for enhancing extremely low-light RAW images. In the pre-training stage, TS-Diff synthesizes noisy images by constructing multiple virtual cameras based on a noise space. Camera Feature Integration (CFI) modules are then designed to enable the model to learn generalizable features across diverse virtual cameras. During the aligning stage, CFIs are averaged to create a target-specific CFI$^T$, which is fine-tuned using a small amount of real RAW data to adapt to the noise characteristics of specific cameras. A structural reparameterization technique further simplifies CFI$^T$ for efficient deployment. To address color shifts during the diffusion process, a color corrector is introduced to ensure color consistency by dynamically adjusting global color distributions. Additionally, a novel dataset, QID, is constructed, featuring quantifiable illumination levels and a wide dynamic range, providing a comprehensive benchmark for training and evaluation under extreme low-light conditions. Experimental results demonstrate that TS-Diff achieves state-of-the-art performance on multiple datasets, including QID, SID, and ELD, excelling in denoising, generalization, and color consistency across various cameras and illumination levels. These findings highlight the robustness and versatility of TS-Diff, making it a practical solution for low-light imaging applications. Source codes and models are available at https://github.com/CircccleK/TS-Diff
中文: 本文提出TS-Diff模型,通过虚拟相机预训练和针对特定设备的微调两阶段方法增强极暗RAW图像,在多个数据集上实现了领先的去噪效果与色彩一致性。
English: This paper introduces TS-Diff, a two-stage diffusion model that enhances extremely low-light RAW images through pre-training with virtual cameras and fine-tuning for specific devices, achieving top performance in denoising and color consistency across multiple datasets.

Authors:Weiwei Ye, Zhuopeng Xu, Ning Gui
Title: Non-stationary Diffusion For Probabilistic Time Series Forecasting
Abstract:
Due to the dynamics of underlying physics and external influences, the uncertainty of time series often varies over time. However, existing Denoising Diffusion Probabilistic Models (DDPMs) often fail to capture this non-stationary nature, constrained by their constant variance assumption from the additive noise model (ANM). In this paper, we innovatively utilize the Location-Scale Noise Model (LSNM) to relax the fixed uncertainty assumption of ANM. A diffusion-based probabilistic forecasting framework, termed Non-stationary Diffusion (NsDiff), is designed based on LSNM that is capable of modeling the changing pattern of uncertainty. Specifically, NsDiff combines a denoising diffusion-based conditional generative model with a pre-trained conditional mean and variance estimator, enabling adaptive endpoint distribution modeling. Furthermore, we propose an uncertainty-aware noise schedule, which dynamically adjusts the noise levels to accurately reflect the data uncertainty at each step and integrates the time-varying variances into the diffusion process. Extensive experiments conducted on nine real-world and synthetic datasets demonstrate the superior performance of NsDiff compared to existing approaches. Code is available at https://github.com/wwy155/NsDiff.
中文: 本文提出NsDiff,一种基于扩散的概率预测框架,采用位置尺度噪声模型来捕捉时间序列中的时变不确定性,克服了传统模型固定方差假设的局限,并在多个数据集上展现出优越性能。
English: This paper introduces NsDiff, a diffusion-based probabilistic forecasting framework that employs the Location-Scale Noise Model to capture time-varying uncertainty in time series, overcoming the limitations of traditional models with fixed variance assumptions and demonstrating superior performance across multiple datasets.
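
The difference between the two noise models can be sketched as follows, using their usual definitions: an additive noise model draws Y = f(X) + sigma * eps with a fixed sigma, while a location-scale noise model draws Y = f(X) + sqrt(g(X)) * eps so the spread depends on the input. The function names and the `mean`/`variance` arguments are hypothetical stand-ins for NsDiff's pre-trained estimators; this is not the diffusion process itself.

```python
import math
import random


def sample_anm(mean: float, sigma: float, rng: random.Random) -> float:
    """Additive noise model: the noise scale sigma is the same for every input."""
    return mean + sigma * rng.gauss(0.0, 1.0)


def sample_lsnm(mean: float, variance: float, rng: random.Random) -> float:
    """Location-scale noise model: the noise scale follows the input-dependent variance."""
    return mean + math.sqrt(variance) * rng.gauss(0.0, 1.0)


def empirical_std(samples: list) -> float:
    """Population standard deviation of a list of floats."""
    mu = sum(samples) / len(samples)
    return math.sqrt(sum((s - mu) ** 2 for s in samples) / len(samples))


rng = random.Random(0)
# Same conditional mean, two different conditional variances (e.g. a calm and a volatile period).
calm = [sample_lsnm(mean=1.0, variance=0.25, rng=rng) for _ in range(20000)]
volatile = [sample_lsnm(mean=1.0, variance=4.0, rng=rng) for _ in range(20000)]
calm_std, volatile_std = empirical_std(calm), empirical_std(volatile)
```

With a fixed `sigma`, `sample_anm` would give both periods the same spread; the location-scale version lets the spread track the estimated variance.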

Authors:Yajie Fu, Chaorui Huang, Junwei Li, Hui Kong, Yibin Tian, Huakang Li, Zhiyuan Zhang
Title: HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation
Abstract:
We propose HDiffTG, a novel 3D Human Pose Estimation (3DHPE) method that integrates Transformer, Graph Convolutional Network (GCN), and diffusion model into a unified framework. HDiffTG leverages the strengths of these techniques to significantly improve pose estimation accuracy and robustness while maintaining a lightweight design. The Transformer captures global spatiotemporal dependencies, the GCN models local skeletal structures, and the diffusion model provides step-by-step optimization for fine-tuning, achieving a complementary balance between global and local features. This integration enhances the model's ability to handle pose estimation under occlusions and in complex scenarios. Furthermore, we introduce lightweight optimizations to the integrated model and refine the objective function design to reduce computational overhead without compromising performance. Evaluation results on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HDiffTG achieves state-of-the-art (SOTA) performance on the MPI-INF-3DHP dataset while excelling in both accuracy and computational efficiency. Additionally, the model exhibits exceptional robustness in noisy and occluded environments. Source codes and models are available at https://github.com/CirceJie/HDiffTG
中文: HDiffTG是一种创新的三维人体姿态估计方法,将Transformer、图卷积网络和扩散模型整合到统一框架中,在保持计算效率的同时实现了顶尖的精度与鲁棒性。
English: HDiffTG is a novel 3D human pose estimation method that integrates Transformer, GCN, and diffusion models into a unified framework, achieving state-of-the-art accuracy and robustness while maintaining computational efficiency.

Authors:Yisen Feng, Haoyu Zhang, Meng Liu, Weili Guan, Liqiang Nie
Title: Object-Shot Enhanced Grounding Network for Egocentric Video
Abstract:
Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation, particularly for objects highlighted in the textual query but not directly captured in the video features. Additionally, we analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information, which enhances the model's ability to perform modality alignment. Experiments conducted on three datasets demonstrate that OSGNet achieves state-of-the-art performance, validating the effectiveness of our approach. Our code can be found at https://github.com/Yisen-Feng/OSGNet.
中文:OSGNet提出了一种对象-镜头增强的定位网络,通过提取对象信息和利用自我中心视频的镜头运动来增强模态对齐能力,在三个数据集上实现了最先进的性能。
English: OSGNet introduces an object-shot enhanced grounding network that enriches video representation with object information and leverages egocentric shot movements to improve modality alignment, achieving state-of-the-art performance on three datasets.

Authors:Trinh T. L. Vuong, Jin Tae Kwak
Title: VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning
Abstract:
We present VideoPath-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios, single patch images, automatically keyframe-extracted clips, and manually segmented video pathology images, to mimic the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the VideoPath-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. VideoPath-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at https://github.com/trinhvg/VideoPath-LLaVA.
中文:VideoPath-LLaVA是计算病理学中首个集成多种图像场景的大型多模态模型,通过模拟病理学家的自然诊断过程生成详细组织学描述和最终诊断,并借助创新的数据处理和训练方法设立了病理视频分析新基准。
English: VideoPath-LLaVA is the first large multimodal model in computational pathology that integrates multiple image scenarios to mimic pathologists' diagnostic process, generating detailed descriptions and definitive diagnoses while setting a new benchmark through innovative data handling and training methods.

Authors:Hail Song, Wonsik Shin, Naeun Lee, Soomin Chung, Nojun Kwak, Woontack Woo
Title: S3D: Sketch-Driven 3D Model Generation
Abstract:
Generating high-quality 3D models from 2D sketches is a challenging task due to the inherent ambiguity and sparsity of sketch data. In this paper, we present S3D, a novel framework that converts simple hand-drawn sketches into detailed 3D models. Our method utilizes a U-Net-based encoder-decoder architecture to convert sketches into face segmentation masks, which are then used to generate a 3D representation that can be rendered from novel views. To ensure robust consistency between the sketch domain and the 3D output, we introduce a novel style-alignment loss that aligns the U-Net bottleneck features with the initial encoder outputs of the 3D generation module, significantly enhancing reconstruction fidelity. To further enhance the network's robustness, we apply augmentation techniques to the sketch dataset. This streamlined framework demonstrates the effectiveness of S3D in generating high-quality 3D models from sketch inputs. The source code for this project is publicly available at https://github.com/hailsong/S3D.
Chinese: S3D框架通过U-Net架构和创新的风格对齐损失函数,成功将简单的二维草图转化为精细的三维模型,确保了高质量的重建效果。
English: The S3D framework effectively converts simple 2D sketches into detailed 3D models using a U-Net architecture and a novel style-alignment loss to ensure high reconstruction fidelity.

Authors:Hongyi Li, Jun Xu, William Ward Armstrong
Title: LHT: Statistically-Driven Oblique Decision Trees for Interpretable Classification
Abstract:
We introduce the Learning Hyperplane Tree (LHT), a novel oblique decision tree model designed for expressive and interpretable classification. LHT fundamentally distinguishes itself through a non-iterative, statistically-driven approach to constructing splitting hyperplanes. Unlike methods that rely on iterative optimization or heuristics, LHT directly computes the hyperplane parameters, which are derived from feature weights based on the differences in feature expectations between classes within each node. This deterministic mechanism enables a direct and well-defined hyperplane construction process. Predictions leverage a unique piecewise linear membership function within leaf nodes, obtained via local least-squares fitting. We formally analyze the convergence of the LHT splitting process, ensuring that each split yields meaningful, non-empty partitions. Furthermore, we establish that the time complexity for building an LHT up to depth $d$ is $O(mnd)$, demonstrating the practical feasibility of constructing trees with powerful oblique splits using this methodology. The explicit feature weighting at each split provides inherent interpretability. Experimental results on benchmark datasets demonstrate LHT's competitive accuracy, positioning it as a practical, theoretically grounded, and interpretable alternative in the landscape of tree-based models. The implementation of the proposed method is available at https://github.com/Hongyi-Li-sz/LHT_model.
Chinese: 学习超平面树(LHT)是一种新型斜决策树,通过非迭代的统计驱动方法构建分割超平面,凭借明确的特征权重提供内在可解释性,并在基准数据集上展现出有竞争力的准确率。
English: The Learning Hyperplane Tree (LHT) is a novel oblique decision tree that uses a non-iterative, statistically-driven method to construct splitting hyperplanes, offering competitive accuracy and inherent interpretability through explicit feature weighting.
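
One plausible reading of "feature weights based on the differences in feature expectations between classes" can be sketched for a single binary split: take the per-class feature means inside a node, use their difference as the hyperplane normal, and place the hyperplane through the midpoint of the two means. This is an illustrative assumption, not the paper's exact formula, and `fit_hyperplane` and `side` are hypothetical names. The sketch requires both classes to be present in the node.

```python
from typing import List, Sequence, Tuple


def class_mean(rows: Sequence[Sequence[float]]) -> List[float]:
    """Per-feature mean over a non-empty list of samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_hyperplane(X: Sequence[Sequence[float]], y: Sequence[int]) -> Tuple[List[float], float]:
    """Non-iterative split: weights = mean(class 1) - mean(class 0), bias from the midpoint."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    mu_pos, mu_neg = class_mean(pos), class_mean(neg)
    weights = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def side(x: Sequence[float], weights: Sequence[float], bias: float) -> int:
    """Return 1 if x falls on the class-1 side of the hyperplane, else 0."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0


X = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 4.0]]
y = [0, 0, 1, 1]
weights, bias = fit_hyperplane(X, y)
```

Because the weights are computed directly from class statistics, each split costs one pass over the node's samples, and the weight magnitudes indicate which features drive the split.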

Authors:Zixiang Ai, Zichen Liu, Jiahuan Zhou
Title: Vision Graph Prompting via Semantic Low-Rank Decomposition
Abstract:
Vision GNN (ViG) demonstrates superior performance by representing images as graph structures, providing a more natural way to capture irregular semantic patterns beyond traditional grid or sequence-based representations. To efficiently adapt ViG to downstream tasks, parameter-efficient fine-tuning techniques like visual prompting become increasingly essential. However, existing prompting methods are primarily designed for Transformer-based models, neglecting the rich topological relationships among nodes and edges in graph-based representations, limiting their capacity to model complex semantics. In this paper, we propose Vision Graph Prompting (VGP), a novel framework tailored for vision graph structures. Our core insight reveals that semantically connected components in the graph exhibit low-rank properties. Building on this observation, we introduce a semantic low-rank prompting method that decomposes low-rank semantic features and integrates them with prompts on vision graph topologies, capturing both global structural patterns and fine-grained semantic dependencies. Extensive experiments demonstrate our method significantly improves ViG's transfer performance on diverse downstream tasks, achieving results comparable to full fine-tuning while maintaining parameter efficiency. Our code is available at https://github.com/zhoujiahuan1991/ICML2025-VGP.
中文: 提出的视觉图提示(VGP)框架利用低秩语义特征增强视觉图神经网络的高效微调,在多项任务中达到与全参数微调相当的性能。
English: The proposed Vision Graph Prompting (VGP) framework leverages low-rank semantic features to enhance parameter-efficient fine-tuning of Vision GNNs, achieving performance comparable to full fine-tuning across diverse tasks.

Authors:Zixiang Ai, Zichen Liu, Yuanhang Lei, Zhenyu Cui, Xu Zou, Jiahuan Zhou
Title: GAPrompt: Geometry-Aware Point Cloud Prompt for 3D Vision Model
Abstract:
Pre-trained 3D vision models have gained significant attention for their promising performance on point cloud data. However, fully fine-tuning these models for downstream tasks is computationally expensive and storage-intensive. Existing parameter-efficient fine-tuning (PEFT) approaches, which focus primarily on input token prompting, struggle to achieve competitive performance due to their limited ability to capture the geometric information inherent in point clouds. To address this challenge, we propose a novel Geometry-Aware Point Cloud Prompt (GAPrompt) that leverages geometric cues to enhance the adaptability of 3D vision models. First, we introduce a Point Prompt that serves as an auxiliary input alongside the original point cloud, explicitly guiding the model to capture fine-grained geometric details. Additionally, we present a Point Shift Prompter designed to extract global shape information from the point cloud, enabling instance-specific geometric adjustments at the input level. Moreover, our proposed Prompt Propagation mechanism incorporates the shape information into the model's feature extraction process, further strengthening its ability to capture essential geometric characteristics. Extensive experiments demonstrate that GAPrompt significantly outperforms state-of-the-art PEFT methods and achieves competitive results compared to full fine-tuning on various benchmarks, while utilizing only 2.19% of trainable parameters. Our code is available at https://github.com/zhoujiahuan1991/ICML2025-GAPrompt.
中文: 提出的几何感知点云提示(GAPrompt)通过辅助输入和传播机制整合几何线索,以仅2.19%的可训练参数实现与全微调相媲美的性能,显著提升了3D视觉模型的适应性。
English: The proposed Geometry-Aware Point Cloud Prompt (GAPrompt) enhances 3D vision models by integrating geometric cues through auxiliary inputs and propagation mechanisms, achieving competitive performance with full fine-tuning while using only 2.19% of trainable parameters.

Authors:Xu Huang, Yuefeng Huang, Weiwen Liu, Xingshan Zeng, Yasheng Wang, Ruiming Tang, Hong Xie, Defu Lian
Title: Advancing and Benchmarking Personalized Tool Invocation for LLMs
Abstract:
Tool invocation is a crucial mechanism for extending the capabilities of Large Language Models (LLMs) and has recently garnered significant attention. It enables LLMs to solve complex problems through tool calls while accessing up-to-date world knowledge. However, existing work primarily focuses on the fundamental ability of LLMs to invoke tools for problem-solving, without considering personalized constraints in tool invocation. In this work, we introduce the concept of Personalized Tool Invocation and define two key tasks: Tool Preference and Profile-dependent Query. Tool Preference addresses user preferences when selecting among functionally similar tools, while Profile-dependent Query considers cases where a user query lacks certain tool parameters, requiring the model to infer them from the user profile. To tackle these challenges, we propose PTool, a data synthesis framework designed for personalized tool invocation. Additionally, we construct PTBench, the first benchmark for evaluating personalized tool invocation. We then fine-tune various open-source models, demonstrating the effectiveness of our framework and providing valuable insights. Our benchmark is public at https://github.com/hyfshadow/PTBench.
中文: 本文提出个性化工具调用概念,通过PTool框架和PTBench基准测试解决工具选择偏好和用户档案相关查询问题,有效提升了大型语言模型的实际应用能力。
English: This paper introduces Personalized Tool Invocation, addressing user-specific constraints in tool selection and parameter inference through the proposed PTool framework and PTBench benchmark, demonstrating their effectiveness in enhancing LLM capabilities.

Authors:Xuyang Wang, Siyuan Duan, Qizhi Li, Guiduo Duan, Yuan Sun, Dezhong Peng
Title: Reliable Disentanglement Multi-view Learning Against View Adversarial Attacks
Abstract:
Trustworthy multi-view learning has attracted extensive attention because evidence learning can provide reliable uncertainty estimation to enhance the credibility of multi-view predictions. Existing trusted multi-view learning methods implicitly assume that multi-view data is secure. However, in safety-sensitive applications such as autonomous driving and security monitoring, multi-view data often faces threats from adversarial perturbations, which deceive or disrupt multi-view models. This inevitably leads to the adversarial unreliability problem (AUP) in trusted multi-view learning. To overcome this problem, we propose a novel multi-view learning framework, namely Reliable Disentanglement Multi-view Learning (RDML). Specifically, we first propose evidential disentanglement learning to decompose each view into clean and adversarial parts under the guidance of the corresponding evidence, which is extracted by a pretrained evidence extractor. Then, we employ the feature recalibration module to mitigate the negative impact of adversarial perturbations and extract potential informative features from them. Finally, to further ignore the irreparable adversarial interferences, a view-level evidential attention mechanism is designed. Extensive experiments on multi-view classification tasks with adversarial attacks show that RDML outperforms the state-of-the-art methods by a relatively large margin. Our code is available at https://github.com/Willy1005/2025-IJCAI-RDML.
中文摘要:本文提出RDML框架,通过证据引导的解耦学习分离多视图数据中的正常特征与对抗扰动,采用特征重校准和视图级证据注意力机制,显著提升了多视图分类任务在对抗攻击下的可靠性。
English Summary: This paper introduces RDML, a reliable multi-view learning framework that addresses adversarial unreliability by disentangling clean and adversarial features using evidence-guided mechanisms to enhance model robustness against attacks.

Authors:Feng Gao, Sheng Liu, Chuanzheng Gong, Xiaowei Zhou, Jiayi Wang, Junyu Dong, Qian Du
Title: Prototype-Based Information Compensation Network for Multi-Source Remote Sensing Data Classification
Abstract:
Multi-source remote sensing data joint classification aims to improve the accuracy and reliability of land cover classification by leveraging the complementary information from multiple data sources. Existing methods confront two challenges: inter-frequency multi-source feature coupling and inconsistency of complementary information exploration. To solve these issues, we present a Prototype-based Information Compensation Network (PICNet) for land cover classification based on HSI and SAR/LiDAR data. Specifically, we first design a frequency interaction module to enhance the inter-frequency coupling in multi-source feature extraction. The multi-source features are first decoupled into high- and low-frequency components. Then, these features are recoupled to achieve efficient inter-frequency communication. Afterward, we design a prototype-based information compensation module to model the global multi-source complementary information. Two sets of learnable modality prototypes are introduced to represent the global modality information of multi-source data. Subsequently, cross-modal feature integration and alignment are achieved through cross-attention computation between the modality-specific prototype vectors and the raw feature representations. Extensive experiments on three public datasets demonstrate the significant superiority of our PICNet over state-of-the-art methods. The codes are available at https://github.com/oucailab/PICNet.
中文:提出的基于原型的信息补偿网络(PICNet)通过频率交互模块和原型补偿机制,解决了多源遥感分类中的频间特征耦合与互补信息不一致问题,在三个公开数据集上验证了其显著优越性。
English: The proposed Prototype-based Information Compensation Network (PICNet) addresses inter-frequency feature coupling and complementary information inconsistency in multi-source remote sensing classification by introducing frequency interaction and prototype-based compensation modules, demonstrating superior performance on three public datasets.

Authors:Gerrit Großmann, Larisa Ivanova, Sai Leela Poduru, Mohaddeseh Tabrizian, Islam Mesabah, David A. Selby, Sebastian J. Vollmer
Title: The Power of Stories: Narrative Priming Shapes How LLM Agents Collaborate and Compete
Abstract:
According to Yuval Noah Harari, large-scale human cooperation is driven by shared narratives that encode common beliefs and values. This study explores whether such narratives can similarly nudge LLM agents toward collaboration. We use a finitely repeated public goods game in which LLM agents choose either cooperative or egoistic spending strategies. We prime agents with stories highlighting teamwork to different degrees and test how this influences negotiation outcomes. Our experiments explore four questions: (1) How do narratives influence negotiation behavior? (2) What differs when agents share the same story versus different ones? (3) What happens when the agent numbers grow? (4) Are agents resilient against self-serving negotiators? We find that story-based priming significantly affects negotiation strategies and success rates. Common stories improve collaboration, benefiting each agent. By contrast, priming agents with different stories reverses this effect, and those agents primed toward self-interest prevail. We hypothesize that these results carry implications for multi-agent system design and AI alignment.
中文摘要:研究表明,通过共享叙事引导LLM智能体可显著提升其在谈判中的协作表现,而相异的故事设定则会削弱合作效果,使利己策略占据上风。
English Summary: This study demonstrates that priming LLM agents with shared narratives significantly enhances their collaborative behavior in negotiations, whereas conflicting narratives undermine cooperation and favor self-interested strategies.
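
For readers unfamiliar with the setup, one round of a standard public goods game can be sketched as below: each agent keeps whatever it does not contribute, and the pooled contributions are multiplied and split equally. The endowment, multiplier, and function name are illustrative assumptions, not the parameters used in the study.

```python
from typing import List


def public_goods_round(contributions: List[float], endowment: float, multiplier: float) -> List[float]:
    """Payoff per agent: endowment - own contribution + equal share of the multiplied pot."""
    n = len(contributions)
    if n == 0:
        return []
    if any(c < 0 or c > endowment for c in contributions):
        raise ValueError("each contribution must lie between 0 and the endowment")
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Two cooperative agents contribute everything; two egoistic agents contribute nothing.
mixed = public_goods_round([10.0, 10.0, 0.0, 0.0], endowment=10.0, multiplier=2.0)
# Everyone cooperates.
all_in = public_goods_round([10.0] * 4, endowment=10.0, multiplier=2.0)
```

In the mixed round the free riders earn more than the contributors, yet every agent earns at least as much when all cooperate. That tension is what makes the game a test of collaboration.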

Authors:Xiang Li, Yiyang Hao, Doug Fulop
Title: Frog Soup: Zero-Shot, In-Context, and Sample-Efficient Frogger Agents
Abstract:
One of the primary aspirations in reinforcement learning research is developing general-purpose agents capable of rapidly adapting to and mastering novel tasks. While RL gaming agents have mastered many Atari games, they remain slow and costly to train for each game. In this work, we demonstrate that the latest reasoning LLMs with out-of-domain RL post-training can play a challenging Atari game called Frogger under a zero-shot setting. We then investigate the effect of in-context learning and the amount of reasoning effort on LLM performance. Lastly, we demonstrate a way to bootstrap traditional RL methods with LLM demonstrations, which significantly improves their performance and sample efficiency. Our implementation is open sourced at https://github.com/AlienKevin/frogger.
中文摘要:本研究展示了大型语言模型结合强化学习无需预先训练即可成功玩转Atari游戏《青蛙过河》,并能通过示范显著提升传统强化学习方法的性能与样本效率。
English Summary: This study shows that reasoning LLMs with out-of-domain RL post-training can play the Atari game Frogger in a zero-shot setting, and that LLM demonstrations can bootstrap traditional RL methods, improving both their performance and sample efficiency.

Authors:Shuang Zeng, Chee Hong Lee, Micky C Nnamdi, Wenqi Shi, J Ben Tamo, Lei Zhu, Hangzhou He, Xinliang Zhang, Qian Chen, May D. Wang, Yanye Lu, Qiushi Ren
Title: Novel Extraction of Discriminative Fine-Grained Feature to Improve Retinal Vessel Segmentation
Abstract:
Retinal vessel segmentation is a vital early detection method for several severe ocular diseases. Despite significant progress in retinal vessel segmentation with the advancement of Neural Networks, there are still challenges to overcome. Specifically, retinal vessel segmentation aims to predict the class label for every pixel within a fundus image, with a primary focus on intra-image discrimination, making it vital for models to extract more discriminative features. Nevertheless, existing methods primarily focus on minimizing the difference between the output from the decoder and the label, but ignore fully using feature-level fine-grained representations from the encoder. To address these issues, we propose a novel Attention U-shaped Kolmogorov-Arnold Network named AttUKAN along with a novel Label-guided Pixel-wise Contrastive Loss for retinal vessel segmentation. Specifically, we implement Attention Gates into Kolmogorov-Arnold Networks to enhance model sensitivity by suppressing irrelevant feature activations and model interpretability by non-linear modeling of KAN blocks. Additionally, we also design a novel Label-guided Pixel-wise Contrastive Loss to supervise our proposed AttUKAN to extract more discriminative features by distinguishing between foreground vessel-pixel pairs and background pairs. Experiments are conducted across four public datasets including DRIVE, STARE, CHASE_DB1, HRF and our private dataset. AttUKAN achieves F1 scores of 82.50%, 81.14%, 81.34%, 80.21% and 80.09%, along with MIoU scores of 70.24%, 68.64%, 68.59%, 67.21% and 66.94% in the above datasets, which are the highest compared to 11 networks for retinal vessel segmentation. Quantitative and qualitative results show that our AttUKAN achieves state-of-the-art performance and outperforms existing retinal vessel segmentation methods. Our code will be available at https://github.com/stevezs315/AttUKAN.
Chinese: 研究者提出了一种新型注意力U形Kolmogorov-Arnold网络AttUKAN及标签引导的像素级对比损失,通过注意力机制和对比学习增强特征判别能力,在视网膜血管分割任务中实现了最优性能。
English: The authors propose AttUKAN, a novel Attention U-shaped Kolmogorov-Arnold Network with Label-guided Pixel-wise Contrastive Loss, which achieves state-of-the-art retinal vessel segmentation performance by enhancing feature discrimination through attention mechanisms and contrastive learning.

Authors:Tin Mišić, Karlo Koledić, Fabio Bonsignorio, Ivan Petrović, Ivan Marković
Title: An Active Inference Model of Covert and Overt Visual Attention
Abstract:
The ability to selectively attend to relevant stimuli while filtering out distractions is essential for agents that process complex, high-dimensional sensory input. This paper introduces a model of covert and overt visual attention through the framework of active inference, utilizing dynamic optimization of sensory precisions to minimize free energy. The model determines visual sensory precisions based on both current environmental beliefs and sensory input, influencing attentional allocation in both covert and overt modalities. To test the effectiveness of the model, we analyze its behavior in the Posner cueing task and a simple target focus task using two-dimensional (2D) visual data. Reaction times are measured to investigate the interplay between exogenous and endogenous attention, as well as valid and invalid cueing. The results show that exogenous and valid cues generally lead to faster reaction times compared to endogenous and invalid cues. Furthermore, the model exhibits behavior similar to inhibition of return, where previously attended locations become suppressed after a specific cue-target onset asynchrony interval. Lastly, we investigate different aspects of overt attention and show that involuntary, reflexive saccades occur faster than intentional ones, but at the expense of adaptability.
中文: 本文提出了一种基于主动推理的视觉注意力模型,通过动态优化感官精度来管理注意分配,实验表明外源性及有效线索能引发更快反应,并揭示了反射性与意向性眼动在速度与适应性上的差异。
English: This paper presents an active inference model that optimizes visual attention by dynamically adjusting sensory precisions, demonstrating through experiments that exogenous and valid cues yield faster responses and revealing distinct temporal characteristics between reflexive and intentional eye movements.

Authors:Chongsheng Zhang, Shuwen Wu, Yingqi Chen, Yi Men, Gaojuan Fan, Matthias Aßenmacher, Christian Heumann, João Gama
Title: Explainable Coarse-to-Fine Ancient Manuscript Duplicates Discovery
Abstract:
Ancient manuscripts are the primary source of ancient linguistic corpora. However, many ancient manuscripts exhibit duplications due to unintentional repeated publication or deliberate forgery. The Dead Sea Scrolls, for example, include counterfeit fragments, whereas Oracle Bones (OB) contain both republished materials and fabricated specimens. Identifying ancient manuscript duplicates is of great significance for both archaeological curation and ancient history study. In this work, we design a progressive OB duplicate discovery framework that combines unsupervised low-level keypoint matching with high-level text-centric content-based matching to refine and rank the candidate OB duplicates with semantic awareness and interpretability. We compare our model with state-of-the-art content-based image retrieval and image matching methods, showing that our model yields comparable recall performance and the highest simplified mean reciprocal rank scores for both Top-5 and Top-15 retrieval results, with significantly faster computation. We have discovered over 60 pairs of new OB duplicates in real-world deployment, which were missed by domain experts for decades. Code, model and real-world results are available at: https://github.com/cszhangLMU/OBD-Finder/.
中文: 本研究提出了一种渐进式甲骨文重复发现框架,结合无监督关键点匹配与文本内容分析,在实现高效检索的同时,发现了60多对曾被专家遗漏的甲骨文重复样本。
English: This study introduces a progressive framework for detecting duplicate Oracle Bones by integrating unsupervised keypoint matching with text-based content analysis, achieving high retrieval accuracy and efficiency while uncovering over 60 previously missed duplicates.
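Note on the ranking metric: the abstract reports a "simplified mean reciprocal rank" for Top-5 and Top-15 results but does not define the simplification. As a point of reference only, the sketch below computes the standard mean reciprocal rank truncated at K. The function name `mrr_at_k` and the toy identifiers are hypothetical, and this is not claimed to be the authors' exact metric.

```python
def mrr_at_k(ranked_lists, relevant_sets, k):
    """Standard mean reciprocal rank, truncated at the top-k results.

    ranked_lists:  one ranked list of candidate ids per query
    relevant_sets: one set of true duplicate ids per query
    """
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        # Reciprocal rank of the first relevant hit within the top k, else 0.
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Toy example: query 1 finds its duplicate at rank 2, query 2 finds none.
score = mrr_at_k([["a", "b", "c"], ["x", "y", "z"]], [{"b"}, {"q"}], k=5)
```

A higher score means true duplicates tend to appear nearer the top of each candidate list.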

Authors:Xuechao Wang, Sven Nomm, Junqing Huang, Kadri Medijainen, Aaro Toomela, Michael Ruzhansky
Title: PointExplainer: Towards Transparent Parkinson's Disease Diagnosis
Abstract:
Deep neural networks have shown potential in analyzing digitized hand-drawn signals for early diagnosis of Parkinson's disease. However, the lack of clear interpretability in existing diagnostic methods presents a challenge to clinical trust. In this paper, we propose PointExplainer, an explainable diagnostic strategy to identify hand-drawn regions that drive model diagnosis. Specifically, PointExplainer assigns discrete attribution values to hand-drawn segments, explicitly quantifying their relative contributions to the model's decision. Its key components include: (i) a diagnosis module, which encodes hand-drawn signals into 3D point clouds to represent hand-drawn trajectories, and (ii) an explanation module, which trains an interpretable surrogate model to approximate the local behavior of the black-box diagnostic model. We also introduce consistency measures to further address the issue of faithfulness in explanations. Extensive experiments on two benchmark datasets and a newly constructed dataset show that PointExplainer can provide intuitive explanations with no diagnostic performance degradation. The source code is available at https://github.com/chaoxuewang/PointExplainer.
中文摘要:PointExplainer是一种可解释的诊断策略,通过为手绘片段分配归因值来识别推动帕金森病诊断的关键区域,在保持诊断性能的同时提供直观解释。
English Summary: PointExplainer is an interpretable diagnostic method that identifies key hand-drawn segments for Parkinson's disease detection while maintaining diagnostic accuracy through attribution values and consistency measures.

Authors:Ioannis Nasios
Title: AI-driven multi-source data fusion for algal bloom severity classification in small inland water bodies: Leveraging Sentinel-2, DEM, and NOAA climate data
Abstract:
Harmful algal blooms are a growing threat to inland water quality and public health worldwide, creating an urgent need for efficient, accurate, and cost-effective detection methods. This research introduces a high-performing methodology that integrates multiple open-source remote sensing data with advanced artificial intelligence models. Key data sources include Copernicus Sentinel-2 optical imagery, the Copernicus Digital Elevation Model (DEM), and NOAA's High-Resolution Rapid Refresh (HRRR) climate data, all efficiently retrieved using platforms like Google Earth Engine (GEE) and Microsoft Planetary Computer (MPC). The NIR and two SWIR bands from Sentinel-2, the altitude from the elevation model, the temperature and wind from NOAA as well as the longitude and latitude were the most important features. The approach combines two types of machine learning models, tree-based models and a neural network, into an ensemble for classifying algal bloom severity. While the tree models performed strongly on their own, incorporating a neural network added robustness and demonstrated how deep learning models can effectively use diverse remote sensing inputs. The method leverages high-resolution satellite imagery and AI-driven analysis to monitor algal blooms dynamically, and although initially developed for a NASA competition in the U.S., it shows potential for global application. The complete code is available for further adaptation and practical implementation, illustrating the convergence of remote sensing data and AI to address critical environmental challenges (https://github.com/IoannisNasios/HarmfulAlgalBloomDetection).
中文: 本研究提出一种集成树模型与神经网络的AI方法,结合多源遥感数据高效检测有害藻华,虽最初针对美国水域开发,但具备全球应用潜力,为应对水环境挑战提供了可扩展方案。
English: This study presents an AI-driven ensemble method combining tree-based models and neural networks with multi-source remote sensing data to efficiently detect harmful algal blooms, offering a scalable solution initially developed for U.S. waters but with global applicability.

Authors:Lang Feng, Weihao Tan, Zhiyi Lyu, Longtao Zheng, Haiyang Xu, Ming Yan, Fei Huang, Bo An
Title: Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning
Abstract:
Online fine-tuning vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, e.g., explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo's convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo's effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains. The code is available at https://github.com/langfengQ/CoSo.
Chinese: CoSo提出了一种新颖的在线微调方法,通过反事实推理动态评估单个令牌对处理后动作的因果影响,优先探索关键令牌,从而在多种任务中显著提升探索效率和性能表现。
English: CoSo introduces a novel online fine-tuning method for vision-language model agents that uses counterfactual reasoning to dynamically prioritize exploration of action-critical tokens, significantly improving exploration efficiency and performance across diverse tasks.

Authors:Md Fahim Anjum
Title: When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator
Abstract:
Large Language Models (LLMs) with reasoning capabilities offer a promising path for improving candidate evaluation in planning frameworks, but their relative performance against traditional non-reasoning models remains largely underexplored. In this study, we benchmark a distilled 1.5B parameter reasoning model (DeepSeek-R1) against several state-of-the-art non-reasoning LLMs within a generator-discriminator LLM planning framework for the text-to-SQL task. For this, we introduce a novel method for extracting soft scores from the chain-of-thought (CoT) outputs of reasoning models, enabling fine-grained ranking of candidates. Our central hypothesis is that reasoning models are more effective discriminators than non-reasoning LLMs. Our results show that distilled DeepSeek-R1-1.5B achieves up to $87\%$ higher F1 and $3.7\%$ better discrimination accuracy than CodeLlama-7B, as well as $3.7\%$ higher execution accuracy than CodeLlama-13B, despite having significantly fewer parameters. Furthermore, we find that there is a limit to the logical capabilities of reasoning models, and only providing more context or allowing more compute budget for reasoning is not enough to improve their discrimination performance. Finally, we demonstrate that, unlike non-reasoning LLMs, reasoning models find generation more challenging than discrimination and may underperform as generators compared to smaller non-reasoning LLMs. Our work highlights the potential of reasoning models as discriminators in agentic frameworks, far outweighing their capabilities as generators, offering insights into their optimal role within LLM planning infrastructures.
中文: 推理模型在规划框架中作为判别器表现卓越,其文本转SQL能力远超参数更多的非推理模型,但在生成任务中存在局限,揭示了其在智能体框架中的最优角色定位。
English: Reasoning models like DeepSeek-R1 demonstrate superior discrimination capabilities in LLM planning frameworks, significantly outperforming larger non-reasoning models in text-to-SQL tasks despite having fewer parameters, while revealing limitations in generation tasks.

Authors:Eleftherios Tzanis, Michail E. Klontzas
Title: mAIstro: an open-source multi-agentic system for automated end-to-end development of radiomics and deep learning models for medical imaging
Abstract:
Agentic systems built on large language models (LLMs) offer promising capabilities for automating complex workflows in healthcare AI. We introduce mAIstro, an open-source, autonomous multi-agentic framework for end-to-end development and deployment of medical AI models. The system orchestrates exploratory data analysis, radiomic feature extraction, image segmentation, classification, and regression through a natural language interface, requiring no coding from the user. Built on a modular architecture, mAIstro supports both open- and closed-source LLMs, and was evaluated using a large and diverse set of prompts across 16 open-source datasets, covering a wide range of imaging modalities, anatomical regions, and data types. The agents successfully executed all tasks, producing interpretable outputs and validated models. This work presents the first agentic framework capable of unifying data analysis, AI model development, and inference across varied healthcare applications, offering a reproducible and extensible foundation for clinical and research AI integration. The code is available at: https://github.com/eltzanis/mAIstro
中文: mAIstro是一个开源的自主多智能体框架,通过自然语言界面实现医疗AI模型的端到端开发和部署,无需编码即可统一跨多种医疗应用的数据分析、模型构建和推理过程。
English: mAIstro is an open-source, autonomous multi-agent framework that enables end-to-end development and deployment of medical AI models through a natural language interface, eliminating the need for coding and unifying data analysis, model creation, and inference across diverse healthcare applications.

Authors:Asad Aali, Adney Cardoza, Melissa Capo
Title: Splitwiser: Efficient LM inference with constrained resources
Abstract:
Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail to fully utilize compute resources, especially when compared to prompt computation phases. To address these challenges, we propose Splitwiser, a methodology that splits the two phases of an LLM inference request onto the same GPU, thereby reducing overhead and improving memory access and cache utilization. By eliminating the need to transfer data across devices, Splitwiser aims to minimize network-related overheads. In this report, we describe the basic structure of our proposed pipeline while sharing preliminary results and analysis. We implement our proposed multiprocessing design on two widely-used and independent LLM architectures: Huggingface and vLLM. We open-source our code for the respective implementations: 1) Huggingface (https://github.com/asad-aali/splitwiser), and 2) vLLM (https://github.com/adney11/vllm-sysml).
中文摘要:Splitwiser是一种创新方法,通过将LLM推理的提示计算和令牌生成两阶段整合到同一GPU上,有效降低开销并提升资源利用率,已针对Huggingface和vLLM架构开源实现。
English Summary: Splitwiser is a novel method that enhances LLM inference efficiency by co-locating prompt computation and token generation phases on a single GPU, reducing overhead and improving resource utilization, with implementations available for Huggingface and vLLM architectures.

Authors:Yonghao Tan, Pingcheng Dong, Yongkun Wu, Yu Liu, Xuejiao Liu, Peng Luo, Shih-Yang Liu, Xijie Huang, Dong Zhang, Luhong Liang, Kwang-Ting Cheng
Title: APSQ: Additive Partial Sum Quantization with Algorithm-Hardware Co-Design
Abstract:
DNN accelerators have advanced considerably through model compression and specialized dataflow techniques. However, the frequent access of high-precision partial sums (PSUMs) leads to excessive memory demands in architectures utilizing input/weight stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, even though PSUMs may account for 69% of power consumption. This study introduces a novel Additive Partial Sum Quantization (APSQ) method, seamlessly integrating PSUM accumulation into the quantization framework. We further propose a grouping strategy that combines APSQ with PSUM quantization, enhanced by a reconfigurable architecture. APSQ is nearly lossless on NLP and CV tasks across BERT, Segformer, and EfficientViT models while compressing PSUMs to INT8, which reduces energy costs by 28-87%. Extended experiments on LLaMA2-7B demonstrate the potential of APSQ for large language models. Code is available at https://github.com/Yonghao-Tan/APSQ.
中文: 本研究提出了一种新颖的加法部分和量化(APSQ)方法,将部分和累加融入量化框架,在多种模型上实现近乎无损的INT8精度压缩,显著降低能耗28-87%,并在LLaMA2-7B上展示了其在大语言模型中的应用潜力。
English: This study introduces Additive Partial Sum Quantization (APSQ), a novel method that integrates partial sum accumulation into quantization to compress high-precision PSUMs to INT8 with nearly lossless performance across various models, achieving significant energy savings of 28-87%.
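Note on PSUM quantization: to make concrete what "compressing PSUMs to INT8" can mean, the sketch below tiles a matrix product along the reduction dimension and re-quantizes the running partial sum to INT8 after each accumulation step. This is a generic illustration under stated assumptions, not the APSQ algorithm itself: the function names, the single fixed `scale`, and the tiling scheme are all hypothetical, and the abstract does not say how the real method chooses its quantization parameters or groups.

```python
import numpy as np


def quantize_int8(x, scale):
    """Symmetric uniform quantization to INT8 with a fixed scale (assumed)."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)


def matmul_with_int8_psum(x, w, tile, scale):
    """Compute x @ w tile by tile, storing the running partial sum as INT8."""
    psum_q = np.zeros((x.shape[0], w.shape[1]), dtype=np.int8)
    for start in range(0, x.shape[1], tile):
        # Partial product for this slice of the reduction dimension.
        partial = x[:, start:start + tile] @ w[start:start + tile, :]
        # Dequantize the stored PSUM, accumulate, and re-quantize to INT8.
        psum = psum_q.astype(np.float64) * scale + partial
        psum_q = quantize_int8(psum, scale)
    return psum_q.astype(np.float64) * scale


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(4, 64))
w = rng.uniform(0.0, 1.0, size=(64, 8))
reference = x @ w
scale = reference.max() / 127.0          # illustrative choice of scale
approx = matmul_with_int8_psum(x, w, tile=16, scale=scale)
max_error = np.max(np.abs(approx - reference))
```

With non-negative inputs, as in this toy setup, each re-quantization adds at most half a quantization step of rounding error, so the total error is bounded by the number of tiles times `scale / 2`.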

Authors:Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun
Title: VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
Abstract:
With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3~5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
中文: 针对流式场景中首个音频令牌生成延迟高的问题,VITA-Audio采用轻量级多模态令牌预测模块和渐进式训练策略,实现了3~5倍的推理加速,并在语音识别、文本转语音和口语问答任务中表现卓越。
English: To overcome the high latency in generating the first audio token during streaming, VITA-Audio introduces a lightweight MCTP module and a progressive training strategy, achieving a 3~5x inference speedup and superior performance in ASR, TTS, and SQA tasks.

Authors:Shiqi Li, Jihua Zhu, Yifan Xie, Naiwen Hu, Di Wang
Title: Matching Distance and Geometric Distribution Aided Learning Multiview Point Cloud Registration
Abstract:
Multiview point cloud registration plays a crucial role in robotics, automation, and computer vision fields. This paper concentrates on pose graph construction and motion synchronization within multiview registration. Previous methods for pose graph construction often pruned fully connected graphs or constructed sparse graphs using global features aggregated from local descriptors, which may not consistently yield reliable results. To identify dependable pairs for pose graph construction, we design a network model that extracts information from the matching distance between point cloud pairs. For motion synchronization, we propose another neural network model to calculate the absolute pose in a data-driven manner, rather than optimizing inaccurate handcrafted loss functions. Our model takes into account geometric distribution information and employs a modified attention mechanism to facilitate flexible and reliable feature interaction. Experimental results on diverse indoor and outdoor datasets confirm the effectiveness and generalizability of our approach. The source code is available at https://github.com/Shi-Qi-Li/MDGD.
中文: 本文提出了一种新的多视角点云配准方法,通过神经网络分析点云对间的匹配距离构建可靠姿态图,并采用数据驱动方式和改进注意力机制计算绝对姿态,在多个数据集上验证了方法的有效性和泛化能力。
English: This paper introduces a novel multiview point cloud registration method that uses neural networks to construct a reliable pose graph by analyzing matching distances and to compute absolute poses through a data-driven approach with an enhanced attention mechanism, demonstrating strong performance across various datasets.

Authors:Sharvi Endait, Ruturaj Ghatage, Aditya Kulkarni, Rajlaxmi Patil, Raviraj Joshi
Title: IndicSQuAD: A Comprehensive Multilingual Question Answering Dataset for Indic Languages
Abstract:
The rapid progress in question-answering (QA) systems has predominantly benefited high-resource languages, leaving Indic languages largely underrepresented despite their vast native speaker base. In this paper, we present IndicSQuAD, a comprehensive multi-lingual extractive QA dataset covering nine major Indic languages, systematically derived from the SQuAD dataset. Building on previous work with MahaSQuAD for Marathi, our approach adapts and extends translation techniques to maintain high linguistic fidelity and accurate answer-span alignment across diverse languages. IndicSQuAD comprises extensive training, validation, and test sets for each language, providing a robust foundation for model development. We evaluate baseline performances using language-specific monolingual BERT models and the multilingual MuRIL-BERT. The results indicate some challenges inherent in low-resource settings. Moreover, our experiments suggest potential directions for future work, including expanding to additional languages, developing domain-specific datasets, and incorporating multimodal data. The dataset and models are publicly shared at https://github.com/l3cube-pune/indic-nlp
中文: 本文介绍了IndicSQuAD,一个基于SQuAD构建的涵盖九种印度语言的多语言问答数据集,旨在解决这些语言在问答系统中的代表性不足问题,并通过基线模型评估揭示了当前面临的挑战和未来研究方向。
English: The paper introduces IndicSQuAD, a multilingual QA dataset for nine Indic languages derived from SQuAD, addressing the underrepresentation of these languages in QA systems and evaluating baseline models to highlight challenges and future research directions.

Authors:Arthur Satouf, Gabriel Ben Zenou, Benjamin Piwowarski, Habiboulaye Amadou Boubacar, Pablo Piantanida
Title: Rational Retrieval Acts: Leveraging Pragmatic Reasoning to Improve Sparse Retrieval
Abstract:
Current sparse neural information retrieval (IR) methods, and to a lesser extent more traditional models such as BM25, do not take into account the document collection and the complex interplay between different term weights when representing a single document. In this paper, we show how the Rational Speech Acts (RSA), a linguistics framework used to minimize the number of features to be communicated when identifying an object in a set, can be adapted to the IR case -- and in particular to the high number of potential features (here, tokens). RSA dynamically modulates token-document interactions by considering the influence of other documents in the dataset, better contrasting document representations. Experiments show that incorporating RSA consistently improves multiple sparse retrieval models and achieves state-of-the-art performance on out-of-domain datasets from the BEIR benchmark. https://github.com/arthur-75/Rational-Retrieval-Acts
中文摘要:本文采用理性言语行为框架,通过基于整个文档集合动态调整词项与文档的交互作用来改进稀疏神经信息检索模型,从而在跨领域数据集上取得了性能提升和领先成果。
English Summary: The paper adapts the Rational Speech Acts framework to enhance sparse neural information retrieval by dynamically adjusting token-document interactions based on the entire document collection, leading to improved performance and state-of-the-art results on out-of-domain datasets.
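Note on the RSA reweighting idea: the abstract describes modulating token-document weights by taking the other documents in the collection into account. The sketch below applies the textbook Rational Speech Acts recursion (a literal listener followed by a pragmatic speaker, with zero utterance cost) to a small document-by-token weight matrix. It is an illustration under assumptions, not the authors' implementation: the function name `rsa_reweight`, the rationality parameter `alpha`, and the toy matrix are all hypothetical.

```python
import numpy as np


def rsa_reweight(weights, alpha=1.0, eps=1e-12):
    """One RSA step on a non-negative (num_docs, num_tokens) weight matrix.

    Literal listener:  L0(d | t) is the weight normalized over documents.
    Pragmatic speaker: S1(t | d) is L0(d | t)**alpha normalized over tokens.
    """
    listener = weights / (weights.sum(axis=0, keepdims=True) + eps)
    speaker = listener ** alpha
    return speaker / (speaker.sum(axis=1, keepdims=True) + eps)


# Toy collection: token 0 occurs only in document 0, token 1 in both documents.
toy_weights = np.array([[1.0, 1.0],
                        [0.0, 1.0]])
reweighted = rsa_reweight(toy_weights)
```

In this toy case the two tokens start with equal weight in document 0, but after reweighting the token unique to document 0 receives the larger share, which is the contrastive effect the abstract describes.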

Authors:Huajie Tan, Xiaoshuai Hao, Cheng Chi, Minglan Lin, Yaoxu Lyu, Mingyu Cao, Dong Liang, Zhuo Chen, Mengsi Lyu, Cheng Peng, Chenrui He, Yulong Ao, Yonghua Lin, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Title: RoboOS: A Hierarchical Embodied Framework for Cross-Embodiment and Multi-Agent Collaboration
Abstract:
The dawn of embodied intelligence has ushered in an unprecedented imperative for resilient, cognition-enabled multi-agent collaboration across next-generation ecosystems, revolutionizing paradigms in autonomous manufacturing, adaptive service robotics, and cyber-physical production architectures. However, current robotic systems face significant limitations, such as limited cross-embodiment adaptability, inefficient task scheduling, and insufficient dynamic error correction. While end-to-end VLA models demonstrate inadequate long-horizon planning and task generalization, hierarchical VLA models suffer from a lack of cross-embodiment and multi-agent coordination capabilities. To address these challenges, we introduce RoboOS, the first open-source embodied system built on a Brain-Cerebellum hierarchical architecture, enabling a paradigm shift from single-agent to multi-agent intelligence. Specifically, RoboOS consists of three key components: (1) Embodied Brain Model (RoboBrain), an MLLM designed for global perception and high-level decision-making; (2) Cerebellum Skill Library, a modular, plug-and-play toolkit that facilitates seamless execution of multiple skills; and (3) Real-Time Shared Memory, a spatiotemporal synchronization mechanism for coordinating multi-agent states. By integrating hierarchical information flow, RoboOS bridges Embodied Brain and Cerebellum Skill Library, facilitating robust planning, scheduling, and error correction for long-horizon tasks, while ensuring efficient multi-agent collaboration through Real-Time Shared Memory. Furthermore, we enhance edge-cloud communication and cloud-based distributed inference to facilitate high-frequency interactions and enable scalable deployment. Extensive real-world experiments across various scenarios demonstrate RoboOS's versatility in supporting heterogeneous embodiments. Project website: https://github.com/FlagOpen/RoboOS
中文: RoboOS 采用开创性的脑-小脑分层架构,解决了多智能体协作中的关键瓶颈,实现了异构系统的鲁棒规划和实时同步,推动具身智能的广泛应用。
English: RoboOS introduces a pioneering open-source system with a Brain-Cerebellum architecture to overcome limitations in multi-agent collaboration, enabling robust planning and real-time coordination for diverse applications.

Authors:Yifan Xiang, Zhenxi Zhang, Bin Li, Yixuan Weng, Shoujun Zhou, Yangfan He, Keqin Li
Title: ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant
Abstract:
Recent advances in personalized MLLMs enable effective capture of user-specific concepts, supporting both recognition of personalized concepts and contextual captioning. However, humans typically explore and reason over relations among objects and individuals, transcending surface-level information to achieve more personalized and contextual understanding. In this respect, existing methods may face three main limitations: (1) their training data lacks multi-object sets in which relations among objects are learnable; (2) building on this limited training data, their models overlook the relations between different personalized concepts and fail to reason over them; and (3) their experiments mainly focus on a single personalized concept, with evaluations limited to recognition and captioning tasks. To address these limitations, we present a new dataset named ReGraP, consisting of 120 sets of personalized knowledge. Each set includes images, KGs, and CoT QA pairs derived from the KGs, enabling more structured and sophisticated reasoning pathways. We propose ReGraP-LLaVA, an MLLM trained with the corresponding KGs and CoT QA pairs, where soft and hard graph prompting methods are designed to align KGs within the model's semantic space. We establish the ReGraP Benchmark, which contains diverse task types: multiple-choice, fill-in-the-blank, True/False, and descriptive questions in both open- and closed-ended settings. The proposed benchmark is designed to evaluate the relational reasoning and knowledge-connection capability of personalized MLLMs. We conduct experiments on the proposed ReGraP-LLaVA and other competitive MLLMs. Results show that the proposed model not only learns personalized knowledge but also performs relational reasoning in responses, achieving state-of-the-art (SoTA) performance compared with competing methods. All code and datasets are released at: https://github.com/xyfyyds/ReGraP.
中文: 针对现有个性化多模态大模型在关系推理上的不足,本研究提出ReGraP数据集和ReGraP-LLaVA模型,通过知识图谱和思维链技术实现了最优的关系推理能力。
English: Recent personalized MLLMs struggle with relational reasoning due to limited training data, prompting the introduction of the ReGraP dataset and ReGraP-LLaVA model, which achieve state-of-the-art performance by incorporating structured knowledge graphs and chain-of-thought reasoning.

Authors:Bernardo Marenco, Paola Bermolen, Marcelo Fiori, Federico Larroca, Gonzalo Mateos
Title: Weighted Random Dot Product Graphs
Abstract:
Modeling of intricate relational patterns has become a cornerstone of contemporary statistical research and related data science fields. Networks, represented as graphs, offer a natural framework for this analysis. This paper extends the Random Dot Product Graph (RDPG) model to accommodate weighted graphs, markedly broadening the model's scope to scenarios where edges exhibit heterogeneous weight distributions. We propose a nonparametric weighted (W)RDPG model that assigns a sequence of latent positions to each node. Inner products of these nodal vectors specify the moments of their incident edge weights' distribution via moment-generating functions. In this way, and unlike prior art, the WRDPG can discriminate between weight distributions that share the same mean but differ in other higher-order moments. We derive statistical guarantees for an estimator of the nodes' latent positions adapted from the workhorse adjacency spectral embedding, establishing its consistency and asymptotic normality. We also contribute a generative framework that enables sampling of graphs that adhere to a (prescribed or data-fitted) WRDPG, facilitating, e.g., the analysis and testing of observed graph metrics using judicious reference distributions. The paper is organized to formalize the model's definition, the estimation (or nodal embedding) process and its guarantees, as well as the methodologies for generating weighted graphs, all complemented by illustrative and reproducible examples showcasing the WRDPG's effectiveness in various network analytic applications.
中文: 本文提出了一种非参数加权随机点积图(WRDPG)模型,通过为节点分配潜位置扩展了RDPG以处理加权图,能够区分均值相同但高阶矩不同的权重分布,并为估计量提供了统计保证及生成图结构的框架。
English: This paper introduces a nonparametric weighted Random Dot Product Graph (WRDPG) model that extends the RDPG to handle weighted graphs by assigning latent positions to nodes, enabling discrimination between weight distributions with identical means but differing higher-order moments, and provides statistical guarantees for the estimator and a generative framework for sampling graphs.
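
The estimator mentioned in the abstract is adapted from the adjacency spectral embedding (ASE). The sketch below shows plain ASE and one way it might be applied per moment, by embedding the entrywise k-th power of the weight matrix. The per-moment construction, the function names, and the exponential toy weights are assumptions for illustration only; the abstract does not spell out the estimator.

```python
import numpy as np

def spectral_embedding(matrix, d):
    """Embed a symmetric matrix using its d largest eigenpairs (plain ASE)."""
    eigvals, eigvecs = np.linalg.eigh(matrix)       # eigenvalues in ascending order
    top_vals = np.clip(eigvals[-d:], 0.0, None)     # guard against tiny negative values
    return eigvecs[:, -d:] * np.sqrt(top_vals)      # n x d matrix of estimated positions

def moment_embeddings(weights, d, max_moment):
    """Assumed adaptation: one embedding per moment, from entrywise powers of the weights."""
    return [spectral_embedding(weights ** k, d) for k in range(1, max_moment + 1)]

# Toy data (hypothetical): exponential edge weights whose mean is an inner product.
rng = np.random.default_rng(0)
n, d = 60, 2
latent = rng.uniform(0.2, 0.7, size=(n, d))
mean_weights = latent @ latent.T
upper = np.triu(rng.exponential(mean_weights), k=1)
weights = upper + upper.T                           # symmetric, zero diagonal

embeddings = moment_embeddings(weights, d, max_moment=2)
```

Latent positions are identifiable only up to an orthogonal transformation, so estimates are usually compared through the inner products they induce rather than coordinate by coordinate.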

Authors:Alessandro Simoni, Francesco Pelosin
Title: Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map
Abstract:
Synthetic dataset generation in Computer Vision, particularly for industrial applications, is still underexplored. Industrial defect segmentation, for instance, requires highly accurate labels, yet acquiring such data is costly and time-consuming. To address this challenge, we propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision. Our approach conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks, ensuring realistic and accurately localized defect synthesis. Compared to existing layout-conditioned generative methods, our approach improves defect consistency and spatial accuracy. We introduce two quantitative metrics to evaluate the effectiveness of our method and assess its impact on a downstream segmentation task trained on real and synthetic data. Our results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data, fostering more reliable and cost-efficient segmentation models. The code is publicly available at https://github.com/covisionlab/diffusion_labeling.
Chinese: 本文提出了一种基于扩散模型的工业缺陷数据集生成方法,通过利用增强的边界框表示来生成精确的分割掩码,有效提升了缺陷一致性和空间精度,并验证了该方法在缩小合成与真实数据差距方面的有效性。
English: This paper introduces a diffusion-based pipeline that generates high-fidelity industrial defect datasets with minimal supervision, using enriched bounding box representations to ensure accurate segmentation masks and improve defect consistency and spatial accuracy.

Authors:Zhiyu Pan, Xiongjun Guan, Yongjie Duan, Jianjiang Feng, Jie Zhou
Title: Fixed-Length Dense Fingerprint Representation
Abstract:
Fixed-length fingerprint representations, which map each fingerprint to a compact and fixed-size feature vector, are computationally efficient and well-suited for large-scale matching. However, designing a robust representation that effectively handles diverse fingerprint modalities, pose variations, and noise interference remains a significant challenge. In this work, we propose a fixed-length dense descriptor of fingerprints, and introduce FLARE, a fingerprint matching framework that integrates the Fixed-Length dense descriptor with pose-based Alignment and Robust Enhancement. This fixed-length representation employs a three-dimensional dense descriptor to effectively capture spatial relationships among fingerprint ridge structures, enabling robust and locally discriminative representations. To ensure consistency within this dense feature space, FLARE incorporates pose-based alignment using complementary estimation methods, along with dual enhancement strategies that refine ridge clarity while preserving the original fingerprint modality. The proposed dense descriptor supports fixed-length representation while maintaining spatial correspondence, enabling fast and accurate similarity computation. Extensive experiments demonstrate that FLARE achieves superior performance across rolled, plain, latent, and contactless fingerprints, significantly outperforming existing methods in cross-modality and low-quality scenarios. Further analysis validates the effectiveness of the dense descriptor design, as well as the impact of alignment and enhancement modules on the accuracy of dense descriptor matching. Experimental results highlight the effectiveness and generalizability of FLARE as a unified and scalable solution for robust fingerprint representation and matching. The implementation and code will be publicly available at https://github.com/Yu-Yy/FLARE.
中文: FLARE是一个指纹匹配框架,通过结合固定长度密集描述符、姿态对齐和增强策略,能够在多种指纹模态及低质量场景下实现鲁棒且高效的匹配性能。
English: FLARE is a fingerprint matching framework that uses a fixed-length dense descriptor with pose alignment and enhancement to achieve robust, high-performance matching across various fingerprint types and challenging conditions.

Authors:Songchen Fu, Siang Chen, Shaojing Zhao, Letian Bai, Ta Li, Yonghong Yan
Title: Rainbow Delay Compensation: A Multi-Agent Reinforcement Learning Framework for Mitigating Delayed Observation
Abstract:
In real-world multi-agent systems (MASs), observation delays are ubiquitous, preventing agents from making decisions based on the environment's true state. An individual agent's local observation often consists of multiple components from other agents or dynamic entities in the environment. These discrete observation components with varying delay characteristics pose significant challenges for multi-agent reinforcement learning (MARL). In this paper, we first formulate the decentralized stochastic individual delay partially observable Markov decision process (DSID-POMDP) by extending the standard Dec-POMDP. We then propose the Rainbow Delay Compensation (RDC), a MARL training framework for addressing stochastic individual delays, along with recommended implementations for its constituent modules. We implement the DSID-POMDP's observation generation pattern using standard MARL benchmarks, including MPE and SMAC. Experiments demonstrate that baseline MARL methods suffer severe performance degradation under fixed and unfixed delays. The RDC-enhanced approach mitigates this issue, remarkably achieving ideal delay-free performance in certain delay scenarios while maintaining generalizability. Our work provides a novel perspective on multi-agent delayed observation problems and offers an effective solution framework. The source code is available at https://anonymous.4open.science/r/RDC-pymarl-4512/.
中文摘要:本文提出DSID-POMDP框架和彩虹延迟补偿方法,有效解决多智能体系统中随机观测延迟问题,在多种延迟场景下保持性能稳定并展现良好泛化能力。
English Summary: This paper introduces the DSID-POMDP framework and Rainbow Delay Compensation (RDC) method to address stochastic observation delays in multi-agent systems, demonstrating improved performance in delayed scenarios while maintaining generalization across MARL benchmarks.

Authors:Mariya Davydova, Daniel Jeffries, Patrick Barker, Arturo Márquez Flores, Sinéad Ryan
Title: OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
Abstract:
In this paper, we introduce OSUniverse: a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks in increasing levels of complexity, from basic precision clicking to multistep, multiapplication tests requiring dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the benchmark test cases to ensure that the SOTA (State of the Art) agents (at the time of publication) do not achieve results higher than 50%, while the average white collar worker can perform all these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism that has an average error rate less than 2%. Therefore, this benchmark presents solid ground for fully automated measuring of progress, capabilities and the effectiveness of GUI-navigation AI agents over the short and medium-term horizon. The source code of the benchmark is available at https://github.com/agentsea/osuniverse.
Chinese: 本文介绍了OSUniverse基准测试,通过复杂的多模态桌面任务和自动化验证机制来评估高级GUI导航AI代理,旨在挑战当前最先进的人工智能系统,同时确保普通办公人员能够完美完成所有任务。
English: This paper presents OSUniverse, a benchmark for evaluating advanced GUI-navigation AI agents through complex multimodal desktop tasks with automated validation, designed to challenge current state-of-the-art agents while remaining fully achievable for human users.

Authors:Mishal Fatima, Steffen Jung, Margret Keuper
Title: Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models
Abstract:
Backgrounds in images play a major role in contributing to spurious correlations among different data points. Owing to aesthetic preferences of humans capturing the images, datasets can exhibit positional (location of the object within a given frame) and size (region-of-interest to image ratio) biases for different classes. In this paper, we show that these biases can impact how much a model relies on spurious features in the background to make its predictions. To better illustrate our findings, we propose a synthetic dataset derived from ImageNet-1k, Hard-Spurious-ImageNet, which contains images with various backgrounds, object positions, and object sizes. By evaluating the dataset on different pretrained models, we find that most models rely heavily on spurious features in the background when the region-of-interest (ROI) to image ratio is small and the object is far from the center of the image. Moreover, we also show that current methods that aim to mitigate harmful spurious features do not take these factors into account, and hence fail to achieve considerable performance gains for worst-group accuracies when the size and location of core features in an image change. The dataset and implementation code are available at https://github.com/Mishalfatima/Corner_Cases.
中文: 图像背景因人类审美偏好会产生虚假关联,导致模型尤其在小尺寸或偏离中心的物体上依赖误导性特征,而现有缓解方法未能有效应对这些偏差。
English: Image backgrounds can cause spurious correlations due to human aesthetic preferences, leading models to rely on these misleading features especially when objects are small or off-center, and current mitigation methods fail to address these biases effectively.

Authors:João Alves, Pia Haubro Andersen, Rikke Gade
Title: Read My Ears! Horse Ear Movement Detection for Equine Affective State Assessment
Abstract:
The Equine Facial Action Coding System (EquiFACS) enables the systematic annotation of facial movements through distinct Action Units (AUs). It serves as a crucial tool for assessing affective states in horses by identifying subtle facial expressions associated with discomfort. However, the field of horse affective state assessment is constrained by the scarcity of annotated data, as manually labelling facial AUs is both time-consuming and costly. To address this challenge, automated annotation systems are essential for leveraging existing datasets and improving affective states detection tools. In this work, we study different methods for specific ear AU detection and localization from horse videos. We leverage past works on deep learning-based video feature extraction combined with recurrent neural networks for the video classification task, as well as a classic optical flow based approach. We achieve 87.5% classification accuracy of ear movement presence on a public horse video dataset, demonstrating the potential of our approach. We discuss future directions to develop these systems, with the aim of bridging the gap between automated AU detection and practical applications in equine welfare and veterinary diagnostics. Our code will be made publicly available at https://github.com/jmalves5/read-my-ears.
中文: 本研究开发了基于视频分析的自动化马耳动作检测方法,达到87.5%的准确率,旨在提升马匹情感状态评估水平并推动动物福利实践应用。
English: The study develops automated methods for detecting horse ear movements using video analysis, achieving 87.5% accuracy to improve affective state assessment and advance equine welfare applications.

Authors:Mengfei Duan, Kailun Yang, Yuheng Zhang, Yihong Cao, Fei Teng, Kai Luo, Jiaming Zhang, Zhiyong Li, Shutao Li
Title: Panoramic Out-of-Distribution Segmentation
Abstract:
Panoramic imaging enables capturing 360° images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), achieving OoS for panoramas. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. Besides, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.
Chinese: 本文提出了全景分布外分割(PanOoS)这一新任务,通过文本引导的提示分布学习解决方案POS有效解决了全景图像中异常检测的难题,并建立了两个基准数据集,实验证明该方法在性能指标上显著优于现有技术。
English: This paper introduces a novel task called Panoramic Out-of-distribution Segmentation (PanOoS) to address the limitations of existing methods in identifying outliers in panoramic images, proposing a solution named POS that utilizes text-guided prompt distribution learning to enhance segmentation performance and establishing two new benchmarks for evaluation.

Authors:Chuyu Zhao, Hao Huang, Jiashuo Guo, Ziyu Shen, Zhongwei Zhou, Jie Liu, Zekuan Yu
Title: RAIL: Region-Aware Instructive Learning for Semi-Supervised Tooth Segmentation in CBCT
Abstract:
Semi-supervised learning has become a compelling approach for 3D tooth segmentation from CBCT scans, where labeled data is minimal. However, existing methods still face two persistent challenges: limited corrective supervision in structurally ambiguous or mislabeled regions during supervised training and performance degradation caused by unreliable pseudo-labels on unlabeled data. To address these problems, we propose Region-Aware Instructive Learning (RAIL), a dual-group dual-student, semi-supervised framework. Each group contains two student models guided by a shared teacher network. By alternating training between the two groups, RAIL promotes intergroup knowledge transfer and collaborative region-aware instruction while reducing overfitting to the characteristics of any single model. Specifically, RAIL introduces two instructive mechanisms. Disagreement-Focused Supervision (DFS) Controller improves supervised learning by instructing predictions only within areas where student outputs diverge from both ground truth and the best student, thereby concentrating supervision on structurally ambiguous or mislabeled areas. In the unsupervised phase, Confidence-Aware Learning (CAL) Modulator reinforces agreement in regions with high model certainty while reducing the effect of low-confidence predictions during training. This helps prevent our model from learning unstable patterns and improves the overall reliability of pseudo-labels. Extensive experiments on four CBCT tooth segmentation datasets show that RAIL surpasses state-of-the-art methods under limited annotation. Our code will be available at https://github.com/Tournesol-Saturday/RAIL.
中文: 提出的RAIL框架通过双组训练和区域感知指导机制,有效解决了半监督3D牙齿分割中的结构模糊区域和不可靠伪标签问题,在有限标注条件下实现了最优性能。
English: The proposed RAIL framework enhances semi-supervised 3D tooth segmentation by introducing dual-group training and region-aware instructive mechanisms, effectively addressing ambiguous regions and unreliable pseudo-labels to achieve state-of-the-art performance with limited annotations.

Authors:Shenglan Li, Rui Yao, Yong Zhou, Hancheng Zhu, Kunyang Sun, Bing Liu, Zhiwen Shao, Jiaqi Zhao
Title: Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking
Abstract:
To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, the omission of the object region by erroneous pseudo-labels or the introduction of background noise affects the efficiency of modality fusion, while pseudo-label noise triggered by similar-object noise can further affect the tracking performance. In this paper, we propose GDSTrack, a novel approach that introduces dynamic graph fusion and temporal diffusion to address the above challenges in self-supervised RGB-T tracking. GDSTrack dynamically fuses the modalities of neighboring frames, treats them as distractor noise, and leverages the denoising capability of a generative model. Specifically, by constructing an adjacency matrix via an Adjacency Matrix Generator (AMG), the proposed Modality-guided Dynamic Graph Fusion (MDGF) module uses a dynamic adjacency matrix to guide graph attention, focusing on and fusing the object's coherent regions. Temporal Graph-Informed Diffusion (TGID) models MDGF features from neighboring frames as interference, thereby improving robustness against similar-object noise. Extensive experiments conducted on four public RGB-T tracking datasets demonstrate that GDSTrack outperforms the existing state-of-the-art methods. The source code is available at https://github.com/LiShenglana/GDSTrack.
中文: GDSTrack通过动态图融合和时间扩散技术,有效解决了自监督RGB-T跟踪中的伪标签噪声和模态融合问题,在多个数据集上表现优于现有先进方法。
English: GDSTrack introduces dynamic graph fusion and temporal diffusion to enhance self-supervised RGB-T tracking by mitigating pseudo-label noise and improving modality integration, outperforming current methods on multiple datasets.

Authors:Zhanyuan Jia, Ni Yao, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Fubao Zhu, Chen Zhao, Weihua Zhou
Title: UPMAD-Net: A Brain Tumor Segmentation Network with Uncertainty Guidance and Adaptive Multimodal Feature Fusion
Abstract:
Background: Brain tumor segmentation has a significant impact on the diagnosis and treatment of brain tumors. Accurate brain tumor segmentation remains challenging due to their irregular shapes, vague boundaries, and high variability. Objective: We propose a brain tumor segmentation method that combines deep learning with prior knowledge derived from a region-growing algorithm. Methods: The proposed method utilizes a multi-scale feature fusion (MSFF) module and adaptive attention mechanisms (AAM) to extract multi-scale features and capture global contextual information. To enhance the model's robustness in low-confidence regions, the Monte Carlo Dropout (MC Dropout) strategy is employed for uncertainty estimation. Results: Extensive experiments demonstrate that the proposed method achieves superior performance on Brain Tumor Segmentation (BraTS) datasets, significantly outperforming various state-of-the-art methods. On the BraTS2021 dataset, the test Dice scores are 89.18% for Enhancing Tumor (ET) segmentation, 93.67% for Whole Tumor (WT) segmentation, and 91.23% for Tumor Core (TC) segmentation. On the BraTS2019 validation set, the validation Dice scores are 87.43%, 90.92%, and 90.40% for ET, WT, and TC segmentation, respectively. Ablation studies further confirmed the contribution of each module to segmentation accuracy, indicating that each component played a vital role in overall performance improvement. Conclusion: This study proposed a novel 3D brain tumor segmentation network based on the U-Net architecture. By incorporating the prior knowledge and employing the uncertainty estimation method, the robustness and performance were improved. The code for the proposed method is available at https://github.com/chenzhao2023/UPMAD_Net_BrainSeg.
Chinese: 本研究提出了一种融合深度学习和先验知识的新型3D脑肿瘤分割方法,通过多尺度特征融合和不确定性估计,在BraTS数据集上实现了优于现有技术的分割精度。
English: This study introduces a novel 3D brain tumor segmentation method combining deep learning with prior knowledge, achieving superior performance on BraTS datasets through multi-scale feature fusion and uncertainty estimation.

Authors:Qi Gan, Sao Mai Nguyen, Eric Fenaux, Stephan Clémençon, Mounîm El Yacoubi
Title: Polar Coordinate-Based 2D Pose Prior with Neural Distance Field
Abstract:
Human pose capture is essential for sports analysis, enabling precise evaluation of athletes' movements. While deep learning-based human pose estimation (HPE) models from RGB videos have achieved impressive performance on public datasets, their effectiveness in real-world sports scenarios is often hindered by motion blur, occlusions, and domain shifts across different pose representations. Fine-tuning these models can partially alleviate such challenges but typically requires large-scale annotated data and still struggles to generalize across diverse sports environments. To address these limitations, we propose a 2D pose prior-guided refinement approach based on Neural Distance Fields (NDF). Unlike existing approaches that rely solely on angular representations of human poses, we introduce a polar coordinate-based representation that explicitly incorporates joint connection lengths, enabling a more accurate correction of erroneous pose estimations. Additionally, we define a novel non-geodesic distance metric that separates angular and radial discrepancies, which we demonstrate is better suited for polar representations than traditional geodesic distances. To mitigate data scarcity, we develop a gradient-based batch-projection augmentation strategy, which synthesizes realistic pose samples through iterative refinement. Our method is evaluated on a long jump dataset, demonstrating its ability to improve 2D pose estimation across multiple pose representations, making it robust across different domains. Experimental results show that our approach enhances pose plausibility while requiring only limited training data. Code is available at: https://github.com/QGAN2019/polar-NDF.
中文摘要:本文提出了一种基于神经距离场的二维姿态优化方法,通过极坐标表示和新型非测地距离度量,有效提升运动场景中姿态估计的精度,仅需少量训练数据即可增强跨域鲁棒性。
English Summary: This paper introduces a 2D pose refinement method using Neural Distance Fields with polar coordinate representations and a novel non-geodesic distance metric to improve pose estimation accuracy in sports scenarios, requiring minimal training data while enhancing robustness across domains.
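
To make the polar representation concrete, the sketch below converts 2D joints into per-bone (angle, length) pairs and measures angular and radial discrepancies separately. The bone list, the function names, and the simple weighted sum are hypothetical choices for illustration; the abstract does not give the exact form of the paper's non-geodesic metric.

```python
import math

# Hypothetical 4-joint chain: each bone is a (parent, child) pair of joint indices.
BONES = [(0, 1), (1, 2), (2, 3)]

def to_polar(joints, bones=BONES):
    """Represent each bone by its direction angle and its length."""
    polar = []
    for parent, child in bones:
        dx = joints[child][0] - joints[parent][0]
        dy = joints[child][1] - joints[parent][1]
        polar.append((math.atan2(dy, dx), math.hypot(dx, dy)))
    return polar

def wrapped_angle_diff(a, b):
    """Absolute angular difference, wrapped so it never exceeds pi."""
    return abs((a - b + math.pi) % (2 * math.pi) - math.pi)

def pose_distance(p, q, radial_weight=1.0):
    """Return (angular, radial, combined) discrepancies between two polar poses."""
    angular = sum(wrapped_angle_diff(a1, a2) for (a1, _), (a2, _) in zip(p, q))
    radial = sum(abs(r1 - r2) for (_, r1), (_, r2) in zip(p, q))
    return angular, radial, angular + radial_weight * radial

pose = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
stretched = [(2 * x, 2 * y) for x, y in pose]   # same bone directions, doubled bone lengths
angular, radial, combined = pose_distance(to_polar(pose), to_polar(stretched))
```

In this toy case the stretched pose differs from the original only in the radial term, which is the kind of error a purely angular representation cannot see.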

Authors:Kirill Lukyanov, Mikhail Drobyshevskiy, Georgii Sazonov, Mikhail Soloviov, Ilya Makarov
Title: Framework GNN-AID: Graph Neural Network Analysis Interpretation and Defense
Abstract:
The growing need for Trusted AI (TAI) highlights the importance of interpretability and robustness in machine learning models. However, many existing tools overlook graph data and rarely combine these two aspects into a single solution. Graph Neural Networks (GNNs) have become a popular approach, achieving top results across various tasks. We introduce GNN-AID (Graph Neural Network Analysis, Interpretation, and Defense), an open-source framework designed for graph data to address this gap. Built as a Python library, GNN-AID supports advanced trust methods and architectural layers, allowing users to analyze graph datasets and GNN behavior using attacks, defenses, and interpretability methods. GNN-AID is built on PyTorch-Geometric, offering preloaded datasets, models, and support for any GNNs through customizable interfaces. It also includes a web interface with tools for graph visualization and no-code features like an interactive model builder, simplifying the exploration and analysis of GNNs. The framework also supports MLOps techniques, ensuring reproducibility and result versioning to track and revisit analyses efficiently. GNN-AID is a flexible tool for developers and researchers. It helps developers create, analyze, and customize graph models, while also providing access to prebuilt datasets and models for quick experimentation. Researchers can use the framework to explore advanced topics on the relationship between interpretability and robustness, test defense strategies, and combine methods to protect against different types of attacks. We also show how defenses against evasion and poisoning attacks can conflict when applied to graph data, highlighting the complex connections between defense strategies. GNN-AID is available at https://github.com/ispras/GNN-AID.
中文: GNN-AID是一个基于PyTorch-Geometric的开源框架,专门针对图数据填补可信AI的空白,集成了可解释性、鲁棒性和防御机制的工具,为开发者和研究人员提供代码与无代码结合的解决方案。
English: GNN-AID is an open-source Python framework built on PyTorch-Geometric that addresses the gap in Trusted AI by providing integrated tools for interpretability, robustness, and defense mechanisms specifically for graph data, featuring both code and no-code interfaces for developers and researchers.

Authors:Yepeng Liu, Wenpeng Lai, Zhou Zhao, Yuxuan Xiong, Jinchi Zhu, Jun Cheng, Yongchao Xu
Title: LiftFeat: 3D Geometry-Aware Local Feature Matching
Abstract:
Robust and efficient local feature matching plays a crucial role in applications such as SLAM and visual localization for robotics. Despite great progress, it is still very challenging to extract robust and discriminative visual features in scenarios with drastic lighting changes, low texture areas, or repetitive patterns. In this paper, we propose a new lightweight network called LiftFeat, which lifts the robustness of raw descriptors by aggregating 3D geometric features. Specifically, we first adopt a pre-trained monocular depth estimation model to generate pseudo surface normal labels, supervising the extraction of 3D geometric features in terms of predicted surface normals. We then design a 3D geometry-aware feature lifting module to fuse surface normal features with raw 2D descriptor features. Integrating such 3D geometric features enhances the discriminative ability of 2D feature description in extreme conditions. Extensive experimental results on relative pose estimation, homography estimation, and visual localization tasks demonstrate that our LiftFeat outperforms some lightweight state-of-the-art methods. Code will be released at: https://github.com/lyp-deeplearning/LiftFeat.
Chinese: 本文提出了一种名为LiftFeat的轻量级网络,通过融合3D几何特征与2D描述符来提升特征匹配的鲁棒性,在多项视觉任务中展现出优越性能。
English: This paper introduces LiftFeat, a lightweight network that enhances feature matching robustness by integrating 3D geometric features with 2D descriptors, demonstrating superior performance in challenging visual tasks.

Authors:Shanshan Song, Hui Tang, Honglong Yang, Xiaomeng Li
Title: DDaTR: Dynamic Difference-aware Temporal Residual Network for Longitudinal Radiology Report Generation
Abstract:
Radiology Report Generation (RRG) automates the creation of radiology reports from medical imaging, enhancing the efficiency of the reporting process. Longitudinal Radiology Report Generation (LRRG) extends RRG by incorporating the ability to compare current and prior exams, facilitating the tracking of temporal changes in clinical findings. Existing LRRG approaches only extract features from prior and current images using a visual pre-trained encoder, which are then concatenated to generate the final report. However, these methods struggle to effectively capture both spatial and temporal correlations during the feature extraction process. Consequently, the extracted features inadequately capture the information of difference across exams and thus underrepresent the expected progressions, leading to sub-optimal performance in LRRG. To address this, we develop a novel dynamic difference-aware temporal residual network (DDaTR). In DDaTR, we introduce two modules at each stage of the visual encoder to capture multi-level spatial correlations. The Dynamic Feature Alignment Module (DFAM) is designed to align prior features across modalities for the integrity of prior clinical information. Prompted by the enriched prior features, the dynamic difference-aware module (DDAM) captures favorable difference information by identifying relationships across exams. Furthermore, our DDaTR employs the dynamic residual network to unidirectionally transmit longitudinal information, effectively modelling temporal correlations. Extensive experiments demonstrated superior performance over existing methods on three benchmarks, proving its efficacy in both RRG and LRRG tasks.
中文: 本文提出了一种动态差异感知时序残差网络(DDaTR),通过专门模块有效捕捉空间和时间相关性,在多个基准测试中显著优于现有方法,提升了纵向放射学报告生成的性能。
English: The paper introduces a dynamic difference-aware temporal residual network (DDaTR) that enhances longitudinal radiology report generation by effectively capturing spatial and temporal correlations through specialized modules, significantly outperforming existing methods on multiple benchmarks.

Authors:Saleh Zare Zade, Yao Qiang, Xiangyu Zhou, Hui Zhu, Mohammad Amin Roshani, Prashant Khanduri, Dongxiao Zhu
Title: Automatic Calibration for Membership Inference Attack on Large Language Models
Abstract:
Membership Inference Attacks (MIAs) have recently been employed to determine whether a specific text was part of the pre-training data of Large Language Models (LLMs). However, existing methods often misinfer non-members as members, leading to a high false positive rate, or depend on additional reference models for probability calibration, which limits their practicality. To overcome these challenges, we introduce a novel framework called Automatic Calibration Membership Inference Attack (ACMIA), which utilizes a tunable temperature to calibrate output probabilities effectively. This approach is inspired by our theoretical insights into maximum likelihood estimation during the pre-training of LLMs. We introduce ACMIA in three configurations designed to accommodate different levels of model access and increase the probability gap between members and non-members, improving the reliability and robustness of membership inference. Extensive experiments on various open-source LLMs demonstrate that our proposed attack is highly effective, robust, and generalizable, surpassing state-of-the-art baselines across three widely used benchmarks. Our code is available at: https://github.com/Salehzz/ACMIA.
中文: 提出的自动校准成员推理攻击(ACMIA)框架通过可调温度参数有效校准输出概率,增强了大型语言模型中成员推理的可靠性与鲁棒性,在多个基准测试中优于现有方法。
English: The proposed Automatic Calibration Membership Inference Attack (ACMIA) framework effectively calibrates output probabilities using a tunable temperature to enhance the reliability and robustness of membership inference in Large Language Models, outperforming existing methods across multiple benchmarks.
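To make the idea of temperature calibration concrete, the following is a minimal sketch of a temperature-scaled token log-probability averaged over a text. It is not the authors' scoring rule: the abstract does not state the exact ACMIA formulas, so the function names `temperature_log_prob` and `membership_score`, and the choice to average calibrated log-probabilities, are assumptions made purely for illustration.

```python
import math


def temperature_log_prob(logits, token_id, temperature):
    """Log-probability of token_id under softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return scaled[token_id] - log_norm


def membership_score(step_logits, token_ids, temperature):
    """Hypothetical score: mean temperature-calibrated token log-probability."""
    values = [
        temperature_log_prob(logits, tok, temperature)
        for logits, tok in zip(step_logits, token_ids)
    ]
    return sum(values) / len(values)
```

In such a setup, a text would be flagged as a likely member when its score exceeds a threshold chosen on held-out data; the temperature is the tunable knob that reshapes the probabilities before scoring.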

Authors:Hao Liao, Wensheng Lu, Jianxun Lian, Mingqi Wu, Shuo Wang, Yong Zhang, Yitian Huang, Mingyang Zhou, Xing Xie
Title: Avoid Recommending Out-of-Domain Items: Constrained Generative Recommendation with LLMs
Abstract:
Large Language Models (LLMs) have shown promise for generative recommender systems due to their transformative capabilities in user interaction. However, ensuring they do not recommend out-of-domain (OOD) items remains a challenge. We study two distinct methods to address this issue: RecLM-ret, a retrieval-based method, and RecLM-cgen, a constrained generation method. Both methods integrate seamlessly with existing LLMs to ensure in-domain recommendations. Comprehensive experiments on three recommendation datasets demonstrate that RecLM-cgen consistently outperforms RecLM-ret and existing LLM-based recommender models in accuracy while eliminating OOD recommendations, making it the preferred method for adoption. Additionally, RecLM-cgen maintains strong generalist capabilities and is a lightweight plug-and-play module for easy integration into LLMs, offering valuable practical benefits for the community. Source code is available at https://github.com/microsoft/RecAI
中文摘要:大型语言模型在生成式推荐系统中展现出潜力,其中RecLM-cgen方法通过超越其他方法的准确性、消除领域外推荐以及作为轻量级即插即用模块的便捷集成优势,成为优选方案。
English Summary: Large Language Models show promise for generative recommender systems, with RecLM-cgen emerging as the superior method by outperforming other approaches in accuracy while eliminating out-of-domain recommendations and offering easy integration as a lightweight plug-and-play module.
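One common way to realize constrained generation is to restrict each decoding step to tokens that extend a valid catalog title, using a prefix trie. The sketch below illustrates that general technique only; the abstract does not describe how RecLM-cgen implements its constraint, and the whitespace tokenization, the `score` callback (standing in for model scores), and all function names are hypothetical.

```python
END = "<eos>"  # hypothetical end-of-title marker


def build_trie(titles):
    """Build a nested-dict prefix trie over whitespace-tokenized item titles."""
    root = {}
    for title in titles:
        node = root
        for tok in title.split():
            node = node.setdefault(tok, {})
        node[END] = {}  # mark that a complete title ends here
    return root


def allowed_next(trie, prefix):
    """Return the set of tokens that may follow prefix; empty if prefix is invalid."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()
        node = node[tok]
    return set(node)


def constrained_greedy(trie, score):
    """Greedy decoding that can only emit titles stored in the trie.

    score(prefix, token) stands in for the model's next-token score.
    """
    if not trie:
        raise ValueError("the item catalog is empty")
    prefix = []
    while True:
        options = sorted(allowed_next(trie, prefix))
        best = max(options, key=lambda tok: score(prefix, tok))
        if best == END:
            return " ".join(prefix)
        prefix.append(best)
```

Because every step chooses among trie children only, the decoded string is always an in-domain title, whatever the scores are.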

Authors:Guoting Wei, Yu Liu, Xia Yuan, Xizhe Xue, Linlin Guo, Yifan Yang, Chunxia Zhao, Zongwen Bai, Haokui Zhang, Rong Xiao
Title: OS-W2S: An Automatic Labeling Engine for Language-Guided Open-Set Aerial Object Detection
Abstract:
In recent years, language-guided open-set aerial object detection has gained significant attention due to its better alignment with real-world application needs. However, due to limited datasets, most existing language-guided methods primarily focus on vocabulary-level descriptions, which fail to meet the demands of fine-grained open-world detection. To address this limitation, we propose constructing a large-scale language-guided open-set aerial detection dataset, encompassing three levels of language guidance: from words to phrases, and ultimately to sentences. Centered around an open-source large vision-language model and integrating image-operation-based preprocessing with BERT-based postprocessing, we present the OS-W2S Label Engine, an automatic annotation pipeline capable of handling diverse scene annotations for aerial images. Using this label engine, we expand existing aerial detection datasets with rich textual annotations and construct a novel benchmark dataset, called MI-OAD, addressing the limitations of current remote sensing grounding data and enabling effective language-guided open-set aerial detection. Specifically, MI-OAD contains 163,023 images and 2 million image-caption pairs, approximately 40 times larger than comparable datasets. To demonstrate the effectiveness and quality of MI-OAD, we evaluate three representative tasks. On language-guided open-set aerial detection, training on MI-OAD lifts Grounding DINO by +31.1 AP$_{50}$ and +34.7 Recall@10 with sentence-level inputs under zero-shot transfer. Moreover, using MI-OAD for pre-training yields state-of-the-art performance on multiple existing open-vocabulary aerial detection and remote sensing visual grounding benchmarks, validating both the effectiveness of the dataset and the high quality of its OS-W2S annotations. More details are available at https://github.com/GT-Wei/MI-OAD.
中文摘要:本文提出MI-OAD大规模数据集,包含200万图文对用于语言引导的航空检测,并通过在开放集检测任务中实现显著性能提升验证了其有效性。
English Summary: This paper introduces MI-OAD, a large-scale dataset with 2 million image-caption pairs for language-guided aerial detection, and demonstrates its effectiveness through significant performance improvements in open-set detection tasks.

Authors:Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Dongyang Jin, Ryan Xu, Lei Sun, Xiangxiang Chu
Title: FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing
Abstract:
Scene text editing aims to modify or add texts on images while ensuring text fidelity and overall visual quality consistent with the background. Recent methods are primarily built on UNet-based diffusion models, which have improved scene text editing results, but still struggle with complex glyph structures, especially for non-Latin ones (e.g., Chinese, Korean, Japanese). To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing DiT method. Specifically, our FLUX-Text enhances glyph understanding and generation through lightweight Visual and Text Embedding Modules, while preserving the original generative capability of FLUX. We further propose a Regional Text Perceptual Loss tailored for text regions, along with a matching two-stage training strategy to better balance text editing and overall image quality. Benefiting from the DiT-based architecture and lightweight feature injection modules, FLUX-Text can be trained with only 0.1M training examples, a 97% reduction compared to the 2.9M required by popular methods. Extensive experiments on multiple public datasets, including English and Chinese benchmarks, demonstrate that our method surpasses other methods in visual quality and text fidelity. All the code is available at https://github.com/AMAP-ML/FluxText.
中文:FLUX-Text是一种基于DiT架构的多语言场景文本编辑方法,通过轻量级模块增强字形生成能力,在仅需0.1M训练样本的情况下(比主流方法减少97%),在视觉质量和文本保真度上均优于现有方法。
English: FLUX-Text is a multilingual scene text editing method using a DiT architecture that enhances glyph generation through lightweight modules and achieves superior visual quality and text fidelity with 97% fewer training examples than existing methods.

Authors:Arthur Corrêa, Alexandre Jesus, Cristóvão Silva, Samuel Moniz
Title: Unraveling the Rainbow: can value-based methods schedule?
Abstract:
Recently, deep reinforcement learning has emerged as a promising approach for solving complex combinatorial optimization problems. Broadly, deep reinforcement learning methods fall into two categories: policy-based and value-based. While value-based approaches have achieved notable success in domains such as the Arcade Learning Environment, the combinatorial optimization community has predominantly favored policy-based methods, often overlooking the potential of value-based algorithms. In this work, we conduct a comprehensive empirical evaluation of value-based algorithms, including the deep Q-network and several of its advanced extensions, within the context of two complex combinatorial problems: the job-shop and the flexible job-shop scheduling problems, two fundamental challenges with multiple industrial applications. Our results challenge the assumption that policy-based methods are inherently superior for combinatorial optimization. We show that several value-based approaches can match or even outperform the widely adopted proximal policy optimization algorithm, suggesting that value-based strategies deserve greater attention from the combinatorial optimization community. Our code is openly available at: https://github.com/AJ-Correa/Unraveling-the-Rainbow.
中文摘要:本研究通过实证评估表明,在组合优化问题中,基于值的深度强化学习算法能够匹敌甚至超越基于策略的方法,挑战了该领域对后者的主流偏好。
English Summary: This study demonstrates that value-based deep reinforcement learning algorithms can compete with or surpass policy-based methods in complex combinatorial optimization problems, challenging the field's prevailing preference for the latter.

Authors:Jincheng Zhang, György Fazekas, Charalampos Saitis
Title: Mamba-Diffusion Model with Learnable Wavelet for Controllable Symbolic Music Generation
Abstract:
The recent surge in the popularity of diffusion models for image synthesis has attracted new attention to their potential for generation tasks in other domains. However, their applications to symbolic music generation remain largely under-explored because symbolic music is typically represented as sequences of discrete events and standard diffusion models are not well-suited for discrete data. We represent symbolic music as image-like pianorolls, facilitating the use of diffusion models for the generation of symbolic music. Moreover, this study introduces a novel diffusion model that incorporates our proposed Transformer-Mamba block and learnable wavelet transform. Classifier-free guidance is utilised to generate symbolic music with target chords. Our evaluation shows that our method achieves compelling results in terms of music quality and controllability, outperforming the strong baseline in pianoroll generation. Our code is available at https://github.com/jinchengzhanggg/proffusion.
中文摘要:本研究提出了一种新颖的扩散模型,通过Transformer-Mamba模块和可学习小波变换,将符号音乐以钢琴卷帘形式生成,在音乐质量和可控性方面均优于现有基线方法。
English Summary: This study introduces a novel diffusion model using Transformer-Mamba blocks and learnable wavelet transforms to generate high-quality symbolic music represented as pianorolls, outperforming existing baselines in both musicality and controllability.

Authors:Pierre Adorni, Minh-Tan Pham, Stéphane May, Sébastien Lefèvre
Title: Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach
Abstract:
Foundation models constitute a significant advancement in computer vision: after a single, albeit costly, training phase, they can address a wide array of tasks. In the field of Earth observation, over 75 remote sensing vision foundation models have been developed in the past four years. However, none has consistently outperformed the others across all available downstream tasks. To facilitate their comparison, we propose a cost-effective method for predicting a model's performance on multiple downstream tasks without the need for fine-tuning on each one. This method is based on what we call "capabilities encoding." The utility of this novel approach is twofold: we demonstrate its potential to simplify the selection of a foundation model for a given new task, and we employ it to offer a fresh perspective on the existing literature, suggesting avenues for future research. Codes are available at https://github.com/pierreadorni/capabilities-encoding.
中文: 基础模型在计算机视觉领域实现了重大突破,一次训练即可应对多种任务,但遥感领域尚无模型能在所有下游任务中全面领先,因此提出基于“能力编码”的经济高效方法,无需逐项微调即可预测模型性能,简化模型选择并为未来研究指明方向。
English: Foundation models in computer vision enable versatile task handling post-training, yet no single remote sensing model excels across all tasks, prompting a cost-effective performance prediction method using "capabilities encoding" to simplify model selection and guide future research.

Authors:Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Yaoqi Huang, Hongyu Lyu, Nguyen Hoang Khoi Tran, Tzu-Yun Tseng, Stewart Worrall
Title: OccCylindrical: Multi-Modal Fusion with Cylindrical Representation for 3D Semantic Occupancy Prediction
Abstract:
The safe operation of autonomous vehicles (AVs) is highly dependent on their understanding of the surroundings. For this, the task of 3D semantic occupancy prediction divides the space around the sensors into voxels, and labels each voxel with both occupancy and semantic information. Recent perception models have used multisensor fusion to perform this task. However, existing multisensor fusion-based approaches focus mainly on using sensor information in the Cartesian coordinate system. This ignores the distribution of the sensor readings, leading to a loss of fine-grained details and performance degradation. In this paper, we propose OccCylindrical that merges and refines the different modality features under cylindrical coordinates. Our method preserves more fine-grained geometry detail that leads to better performance. Extensive experiments conducted on the nuScenes dataset, including challenging rainy and nighttime scenarios, confirm our approach's effectiveness and state-of-the-art performance. The code will be available at: https://github.com/DanielMing123/OccCylindrical
Chinese: 本文提出OccCylindrical方法,通过在柱坐标系下优化多传感器特征,提升自动驾驶车辆的三维语义占据预测能力,保留更精细的几何细节,并在nuScenes数据集上实现了领先性能。
English: This paper introduces OccCylindrical, a method that enhances 3D semantic occupancy prediction for autonomous vehicles by refining multisensor features in cylindrical coordinates, preserving fine-grained details and achieving state-of-the-art performance on the nuScenes dataset.

Authors:Yutong Xie, Fuchao Yang, Yuheng Jia
Title: Partial Label Clustering
Abstract:
Partial label learning (PLL) is a significant weakly supervised learning framework, where each training example corresponds to a set of candidate labels and only one label is the ground-truth label. For the first time, this paper investigates the partial label clustering problem, which takes advantage of the limited available partial labels to improve the clustering performance. Specifically, we first construct a weight matrix of examples based on their relationships in the feature space and disambiguate the candidate labels to estimate the ground-truth label based on the weight matrix. Then, we construct a set of must-link and cannot-link constraints based on the disambiguation results. Moreover, we propagate the initial must-link and cannot-link constraints based on an adversarial prior promoted dual-graph learning approach. Finally, we integrate weight matrix construction, label disambiguation, and pairwise constraint propagation into a joint model to achieve mutual enhancement. We also theoretically prove that a better disambiguated label matrix can help improve clustering performance. Comprehensive experiments demonstrate that our method achieves superior performance compared with state-of-the-art constrained clustering methods, and outperforms PLL and semi-supervised PLL methods when only limited samples are annotated. The code is publicly available at https://github.com/xyt-ml/PLC.
中文摘要:本文首次提出部分标签聚类方法,通过特征空间权重矩阵构建、标签消歧和约束传播的联合优化模型,在有限标注样本下显著超越了现有约束聚类和部分标签学习方法。
English Summary: This paper introduces a novel partial label clustering method that integrates feature-based weight matrix construction, label disambiguation, and constraint propagation into a unified model, demonstrating superior performance over existing constrained clustering and partial label learning approaches.

Authors:Kien Tran Duc Tuan, Tam Nguyen Trong, Son Nguyen Hoang, Khoat Than, Anh Nguyen Duc
Title: Weighted Integrated Gradients for Feature Attribution
Abstract:
In explainable AI, Integrated Gradients (IG) is a widely adopted technique for assessing the significance of feature attributes of the input on model outputs by evaluating contributions from a baseline input to the current input. The choice of the baseline input significantly influences the resulting explanation. While the traditional Expected Gradients (EG) method assumes baselines can be uniformly sampled and averaged with equal weights, this study argues that baselines should not be treated equivalently. We introduce Weighted Integrated Gradients (WG), a novel approach that evaluates baseline suitability in an unsupervised manner and incorporates a strategy for selecting effective baselines. Theoretical analysis demonstrates that WG satisfies essential explanation method criteria and offers greater stability than prior approaches. Experimental results further confirm that WG outperforms EG across diverse scenarios, achieving an improvement of 10-35% on main metrics. Moreover, by evaluating baselines, our method can filter a subset of effective baselines for each input to calculate explanations, maintaining high accuracy while reducing computational cost. The code is available at: https://github.com/tamnt240904/weighted_ig.
中文: 本研究提出加权积分梯度(WG)方法,通过无监督评估基线适用性并筛选有效基线,在保持高精度的同时降低计算成本,相比期望梯度法在主要指标上提升10-35%且具有更优稳定性。
English: This study introduces Weighted Integrated Gradients (WG), an unsupervised method that evaluates baseline suitability and selects effective baselines, demonstrating superior performance and stability over Expected Gradients with 10-35% metric improvements while reducing computational costs.
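The contrast between equal-weight and weighted baseline averaging can be illustrated with a small sketch. It approximates Integrated Gradients with a midpoint Riemann sum and then combines per-baseline attributions using caller-supplied weights. How WG actually scores baseline suitability is not given in the abstract, so the `weights` argument is a placeholder and the function names are hypothetical; with equal weights the result reduces to an EG-style uniform average.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients.

    grad_fn(point) must return the gradient of the model output at point.
    """
    diff = [xi - bi for xi, bi in zip(x, baseline)]
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * di for bi, di in zip(baseline, diff)]
        grad = grad_fn(point)
        total = [t + g for t, g in zip(total, grad)]
    return [di * t / steps for di, t in zip(diff, total)]


def weighted_attribution(grad_fn, x, baselines, weights, steps=50):
    """Combine per-baseline attributions with normalized (assumed) weights."""
    norm = sum(weights)
    result = [0.0] * len(x)
    for baseline, weight in zip(baselines, weights):
        attribution = integrated_gradients(grad_fn, x, baseline, steps)
        result = [r + (weight / norm) * a for r, a in zip(result, attribution)]
    return result
```

For f(x) = sum of x_i squared, whose gradient is 2x, the attribution of feature i against baseline b is x_i squared minus b_i squared, which gives a quick sanity check on the implementation.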

Authors:Junqi Liu, Xiaohan Lin, Jonas Bayer, Yael Dillies, Weijie Jiang, Xiaodan Liang, Roman Soletskyi, Haiming Wang, Yunzhou Xie, Beibei Xiong, Zhengfeng Yang, Jujian Zhang, Lihong Zhi, Jia Li, Zhengying Liu
Title: CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics
Abstract:
Neurosymbolic approaches integrating large language models with formal reasoning have recently achieved human-level performance on mathematics competition problems in algebra, geometry and number theory. In comparison, combinatorics remains a challenging domain, characterized by a lack of appropriate benchmarks and theorem libraries. To address this gap, we introduce CombiBench, a comprehensive benchmark comprising 100 combinatorial problems, each formalized in Lean 4 and paired with its corresponding informal statement. The problem set covers a wide spectrum of difficulty levels, ranging from middle school to IMO and university level, and spans over ten combinatorial topics. CombiBench is suitable for testing IMO solving capabilities since it includes all IMO combinatorial problems since 2000 (except IMO 2004 P3, as its statement contains an image). Furthermore, we provide a comprehensive and standardized evaluation framework, dubbed Fine-Eval (for Fill-in-the-blank in Lean Evaluation), for formal mathematics. It accommodates not only proof-based problems but also, for the first time, the evaluation of fill-in-the-blank questions. Using Fine-Eval as the evaluation method and Kimina Lean Server as the backend, we benchmark several LLMs on CombiBench and observe that their capabilities for formally solving combinatorial problems remain limited. Among all models tested (none of which has been trained for this particular task), Kimina-Prover attains the best results, solving 7 problems (out of 100) under both "with solution" and "without solution" scenarios. We open source the benchmark dataset along with the code of the proposed evaluation method at https://github.com/MoonshotAI/CombiBench/.
Chinese: 研究人员推出了包含100个形式化组合问题的CombiBench基准及其评估框架Fine-Eval,发现当前大型语言模型在此类任务上表现欠佳,最优模型仅解决了7个问题。
English: Researchers introduce CombiBench, a benchmark of 100 formalized combinatorial problems with an evaluation framework called Fine-Eval, and find that current LLMs perform poorly on these tasks, with the best model solving only 7 problems.

Authors:Teng Zhou, Jax Luo, Yuping Sun, Yiheng Tan, Shun Yao, Nazim Haouchine, Scott Raymond
Title: Path and Bone-Contour Regularized Unpaired MRI-to-CT Translation
Abstract:
Accurate MRI-to-CT translation promises the integration of complementary imaging information without the need for additional imaging sessions. Given the practical challenges associated with acquiring paired MRI and CT scans, the development of robust methods capable of leveraging unpaired datasets is essential for advancing the MRI-to-CT translation. Current unpaired MRI-to-CT translation methods, which predominantly rely on cycle consistency and contrastive learning frameworks, frequently encounter challenges in accurately translating anatomical features that are highly discernible on CT but less distinguishable on MRI, such as bone structures. This limitation renders these approaches less suitable for applications in radiation therapy, where precise bone representation is essential for accurate treatment planning. To address this challenge, we propose a path- and bone-contour regularized approach for unpaired MRI-to-CT translation. In our method, MRI and CT images are projected to a shared latent space, where the MRI-to-CT mapping is modeled as a continuous flow governed by neural ordinary differential equations. The optimal mapping is obtained by minimizing the transition path length of the flow. To enhance the accuracy of translated bone structures, we introduce a trainable neural network to generate bone contours from MRI and implement mechanisms to directly and indirectly encourage the model to focus on bone contours and their adjacent regions. Evaluations conducted on three datasets demonstrate that our method outperforms existing unpaired MRI-to-CT translation approaches, achieving lower overall error rates. Moreover, in a downstream bone segmentation task, our approach exhibits superior performance in preserving the fidelity of bone structures. Our code is available at: https://github.com/kennysyp/PaBoT.
中文摘要:本研究提出了一种新颖的无配对MRI到CT转换方法,通过神经常微分方程和骨骼轮廓正则化技术,显著提升了骨骼等关键解剖结构的转换精度,特别适用于放射治疗规划。
English Summary: This study introduces a novel unpaired MRI-to-CT translation method that uses neural ODEs and bone-contour regularization to improve anatomical accuracy, particularly for bone structures critical in radiation therapy.

Authors:Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longxiang Gao, Wenbo Xu, Shibiao Xu
Title: Image Recognition with Online Lightweight Vision Transformer: A Survey
Abstract:
The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet they lack inductive biases and efficiency benefits and face significant computational and memory challenges that limit their real-world applicability. This paper surveys various online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Network, and Knowledge Distillation. We evaluate the relevant exploration for each topic on the ImageNet-1K benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future research directions and potential challenges in the lightweighting of vision transformers with the aim of inspiring further exploration and providing practical guidance for the community. Project Page: https://github.com/ajxklo/Lightweight-VIT
中文摘要:本文综述了为图像识别设计轻量级视觉Transformer的策略,重点研究高效组件设计、动态网络和知识蒸馏,以解决计算难题并分析性能权衡,同时展望了未来研究方向。
English Summary: This paper reviews strategies for creating lightweight vision transformers for image recognition, focusing on efficient design, dynamic networks, and knowledge distillation to address computational challenges while analyzing performance trade-offs.

Authors:Mohammad Rostami, Atik Faysal, Reihaneh Gh. Roshan, Huaxia Wang, Nikhil Muralidhar, Yu-Dong Yao
Title: Plug-and-Play AMC: Context Is King in Training-Free, Open-Set Modulation with LLMs
Abstract:
Automatic Modulation Classification (AMC) is critical for efficient spectrum management and robust wireless communications. However, AMC remains challenging due to the complex interplay of signal interference and noise. In this work, we propose an innovative framework that integrates traditional signal processing techniques with Large-Language Models (LLMs) to address AMC. Our approach leverages higher-order statistics and cumulant estimation to convert quantitative signal features into structured natural language prompts. By incorporating exemplar contexts into these prompts, our method exploits the LLM's inherent familiarity with classical signal processing, enabling effective one-shot classification without additional training or preprocessing (e.g., denoising). Experimental evaluations on synthetically generated datasets, spanning both noiseless and noisy conditions, demonstrate that our framework achieves competitive performance across diverse modulation schemes and Signal-to-Noise Ratios (SNRs). Moreover, our approach paves the way for robust foundation models in wireless communications across varying channel conditions, significantly reducing the expense associated with developing channel-specific models. This work lays the foundation for scalable, interpretable, and versatile signal classification systems in next-generation wireless networks. The source code is available at https://github.com/RU-SIT/context-is-king
中文摘要:本研究提出了一种将传统信号处理技术与大语言模型相结合的新框架,用于自动调制分类,无需额外训练或预处理即可在不同噪声条件下实现稳健性能。
English Summary: This study introduces a novel framework combining traditional signal processing with Large-Language Models (LLMs) for Automatic Modulation Classification, achieving robust performance across various noise conditions without requiring additional training or preprocessing.
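As an illustration of turning higher-order statistics into a prompt, the sketch below computes the standard second- and fourth-order cumulants of a zero-mean complex baseband signal and formats them, together with labeled exemplars, as text. The cumulant formulas are the usual textbook ones; the prompt wording, the choice of features, and the function names are assumptions, since the abstract does not specify the authors' template.

```python
def cumulants(samples):
    """Second- and fourth-order cumulants of a zero-mean complex signal."""
    n = len(samples)
    m20 = sum(x * x for x in samples) / n              # E[x^2]
    m21 = sum(abs(x) ** 2 for x in samples) / n        # E[|x|^2]
    m40 = sum((x * x) * (x * x) for x in samples) / n  # E[x^4]
    m42 = sum(abs(x) ** 4 for x in samples) / n        # E[|x|^4]
    return {
        "C20": m20,
        "C21": m21,
        "C40": m40 - 3 * m20 * m20,
        "C42": m42 - abs(m20) ** 2 - 2 * m21 ** 2,
    }


def describe(stats):
    """Render cumulants as text, printing real and imaginary parts."""
    parts = []
    for name in ("C20", "C21", "C40", "C42"):
        value = complex(stats[name])
        parts.append(f"{name}={value.real:+.3f}{value.imag:+.3f}j")
    return ", ".join(parts)


def build_prompt(query_stats, exemplars):
    """Hypothetical prompt: labeled exemplars followed by the query signal."""
    lines = ["Classify the modulation from its cumulants."]
    for label, stats in exemplars:
        lines.append(f"Features: {describe(stats)}")
        lines.append(f"Modulation: {label}")
    lines.append(f"Features: {describe(query_stats)}")
    lines.append("Modulation:")
    return "\n".join(lines)
```

For ideal unit-power constellations the cumulants separate the classes cleanly: BPSK symbols give C40 = C42 = -2, while QPSK symbols on the axes give C20 = 0 and C42 = -1, which is the kind of contrast an exemplar in the prompt can expose.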

Authors:Pau Amargant, Peter Hönig, Markus Vincze
Title: Sim2Real Transfer for Vision-Based Grasp Verification
Abstract:
The verification of successful grasps is a crucial aspect of robot manipulation, particularly when handling deformable objects. Traditional methods relying on force and tactile sensors often struggle with deformable and non-rigid objects. In this work, we present a vision-based approach for grasp verification to determine whether the robotic gripper has successfully grasped an object. Our method employs a two-stage architecture: first, a YOLO-based object detection model detects and locates the robot's gripper, and then a ResNet-based classifier determines the presence of an object. To address the limitations of real-world data capture, we introduce HSR-GraspSynth, a synthetic dataset designed to simulate diverse grasping scenarios. Furthermore, we explore the use of Visual Question Answering capabilities as a zero-shot baseline against which we compare our model. Experimental results demonstrate that our approach achieves high accuracy in real-world environments, with potential for integration into grasping pipelines. Code and datasets are publicly available at https://github.com/pauamargant/HSR-GraspSynth.
中文: 本研究提出了一种基于视觉的抓取验证方法,采用YOLO和ResNet双阶段架构并辅以合成数据集HSR-GraspSynth,在现实环境中对可变形物体的机器人抓取实现了高精度识别。
English: This study introduces a vision-based grasp verification method using a two-stage YOLO and ResNet architecture, supplemented by a synthetic dataset HSR-GraspSynth, achieving high real-world accuracy for robotic manipulation of deformable objects.

Authors:Saeed Ebrahimi, Sahar Rahimi, Ali Dabouei, Srinjoy Das, Jeremy M. Dawson, Nasser M. Nasrabadi
Title: GIF: Generative Inspiration for Face Recognition at Scale
Abstract:
Aiming to reduce the computational cost of Softmax in the massive label space of Face Recognition (FR) benchmarks, recent studies estimate the output using a subset of identities. Although promising, the association between the computational cost and the number of identities in the dataset remains linear, only with a reduced ratio. A shared characteristic among available FR methods is the employment of atomic scalar labels during training. Consequently, the input-to-label matching is performed through a dot product between the feature vector of the input and the Softmax centroids. Inspired by generative modeling, we present a simple yet effective method that substitutes scalar labels with a structured identity code, i.e., a sequence of integers. Specifically, we propose a tokenization scheme that transforms atomic scalar labels into structured identity codes. Then, we train an FR backbone to predict the code for each input instead of its scalar label. As a result, the associated computational cost becomes logarithmic w.r.t. the number of identities. We demonstrate the benefits of the proposed method by conducting experiments. In particular, our method outperforms its competitors by 1.52% and 0.6% at TAR@FAR=1e-4 on IJB-B and IJB-C, respectively, while transforming the association between computational cost and the number of identities from linear to logarithmic. See code at https://github.com/msed-Ebrahimi/GIF
中文摘要:本研究提出一种受生成模型启发的创新方法,通过将标量标签替换为结构化身份代码,将人脸识别的计算成本从与身份数量的线性关系转变为对数关系,并在基准测试中实现了更优的性能表现。
English Summary: This study introduces a generative modeling-inspired method that replaces scalar labels with structured identity codes to transform the computational cost of face recognition from linear to logarithmic relative to the number of identities, achieving superior performance on benchmark datasets.
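
The idea of replacing a scalar label with a sequence of integers can be illustrated with a minimal sketch. The abstract does not specify the tokenization scheme, so the base-k positional encoding below, the function names (`code_length`, `label_to_code`, `code_to_label`), and the vocabulary size are all assumptions made for illustration. The sketch only shows why the code length, and hence the number of per-token predictions, grows logarithmically with the number of identities.

```python
def code_length(num_identities, vocab_size):
    """Smallest L such that vocab_size ** L >= num_identities."""
    if num_identities < 1 or vocab_size < 2:
        raise ValueError("need num_identities >= 1 and vocab_size >= 2")
    length, capacity = 1, vocab_size
    while capacity < num_identities:
        capacity *= vocab_size
        length += 1
    return length


def label_to_code(label, num_identities, vocab_size):
    """Hypothetical tokenizer: write the scalar label in base vocab_size."""
    if not 0 <= label < num_identities:
        raise ValueError("label out of range")
    digits = []
    for _ in range(code_length(num_identities, vocab_size)):
        digits.append(label % vocab_size)   # least significant token first
        label //= vocab_size
    return digits[::-1]                      # most significant token first


def code_to_label(code, vocab_size):
    """Inverse mapping: recover the scalar label from its token sequence."""
    label = 0
    for token in code:
        label = label * vocab_size + token
    return label


if __name__ == "__main__":
    # One million identities need only 3 tokens from a 256-way vocabulary.
    print(code_length(1_000_000, 256))
    print(label_to_code(123_456, 1_000_000, 256))
```

Under this assumed encoding, a classifier head predicts each token over `vocab_size` classes instead of scoring every identity, which is where the logarithmic dependence comes from.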

Authors:Nikolay Safonov, Alexey Bryncev, Andrey Moskalenko, Dmitry Kulikov, Dmitry Vatolin, Radu Timofte, Haibo Lei, Qifan Gao, Qing Luo, Yaqing Li, Jie Song, Shaozhe Hao, Meisong Zheng, Jingyi Xu, Chengbin Wu, Jiahui Liu, Ying Chen, Xin Deng, Mai Xu, Peipei Liang, Jie Ma, Junjie Jin, Yingxue Pang, Fangzhou Luo, Kai Chen, Shijie Zhao, Mingyang Wu, Renjie Li, Yushen Zuo, Shengyun Zhong, Zhengzhong Tu
Title: NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results
Abstract:
This paper presents an overview of the NTIRE 2025 Challenge on UGC Video Enhancement. The challenge constructed a set of 150 user-generated content videos without reference ground truth, which suffer from real-world degradations such as noise, blur, faded colors, compression artifacts, etc. The goal of the participants was to develop an algorithm capable of improving the visual quality of such videos. Given the widespread use of UGC on short-form video platforms, this task holds substantial practical importance. The evaluation was based on subjective quality assessment in crowdsourcing, obtaining votes from over 8000 assessors. The challenge attracted more than 25 teams submitting solutions, 7 of which passed the final phase with source code verification. The outcomes may provide insights into the state-of-the-art in UGC video enhancement and highlight emerging trends and effective strategies in this evolving research area. All data, including the processed videos and subjective comparison votes and scores, is made publicly available at https://github.com/msu-video-group/NTIRE25_UGC_Video_Enhancement.
中文: 本文介绍了NTIRE 2025用户生成内容视频增强挑战赛,该赛事吸引了25支以上团队开发算法以改善现实世界中质量退化的视频,并通过大规模主观评估进行测试,推动了用户生成内容质量提升的实用解决方案。
English: This paper introduces the NTIRE 2025 UGC Video Enhancement Challenge, which engaged over 25 teams in developing algorithms to improve real-world degraded videos and evaluated them through large-scale subjective assessments, advancing practical solutions for user-generated content quality.

Authors:Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah
Title: RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
Abstract:
We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models, which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper
中文:RADLADS协议能以极少的训练数据和成本将softmax注意力变换器高效转换为线性注意力解码器,在保持接近原模型质量的同时实现领先性能,并以开放许可证发布。
English: RADLADS enables efficient conversion of softmax attention transformers into linear attention decoders using minimal training tokens and cost, achieving near-original quality and state-of-the-art performance while being released under open licenses.

Authors:Anjila Budathoki, Manish Dhakal
Title: Adversarial Robustness Analysis of Vision-Language Models in Medical Image Segmentation
Abstract:
Adversarial attacks have been fairly well explored for computer vision and vision-language models. However, adversarial attacks on vision-language segmentation models (VLSMs) are still under-explored, especially for medical image analysis. Thus, we have investigated the robustness of VLSMs against adversarial attacks for 2D medical images across different modalities, including radiology, photography, and endoscopy. The main idea of this project was to assess the robustness of fine-tuned VLSMs, especially in the medical domain setting, to address high-risk scenarios. First, we have fine-tuned pre-trained VLSMs for medical image segmentation with adapters. Then, we have employed adversarial attacks -- projected gradient descent (PGD) and fast gradient sign method (FGSM) -- on that fine-tuned model to determine its robustness against adversaries. We have reported the models' performance decline to analyze the adversaries' impact. The results exhibit significant drops in the DSC and IoU scores after the introduction of these adversaries. Furthermore, we also explored universal perturbations but were not able to find one for the medical images. Repository: https://github.com/anjilab/secure-private-ai
中文: 本研究探讨了在医学影像中微调后的视觉语言分割模型对抗性攻击的鲁棒性,发现尽管未能成功生成通用扰动,但在遭受PGD和FGSM攻击时模型性能出现显著下降。
English: This study investigates the robustness of fine-tuned vision-language segmentation models against adversarial attacks in medical imaging, revealing significant performance declines when subjected to PGD and FGSM attacks despite unsuccessful attempts to generate universal perturbations.
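
For readers unfamiliar with the two attacks named in this abstract, the following is a minimal sketch of FGSM and PGD. It is not the authors' setup: the victim here is a toy logistic-regression classifier in pure Python rather than a fine-tuned VLSM, and the weights, inputs, and step sizes are made-up values chosen only to show the update rules.

```python
import math


def predict(w, b, x):
    """Probability of class 1 under a toy logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps that increases the loss.

    For binary cross-entropy with a logistic model, the gradient of the
    loss with respect to the input is (p - y) * w.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def pgd(w, b, x, y, eps, alpha, steps):
    """Iterated FGSM steps of size alpha, projected back into the
    L-infinity ball of radius eps around the clean input x."""
    x_adv = list(x)
    for _ in range(steps):
        x_adv = fgsm(w, b, x_adv, y, alpha)
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


if __name__ == "__main__":
    w, b = [2.0, -1.0], 0.0          # hypothetical model parameters
    x, y = [1.0, 0.5], 1             # clean input with true label 1
    print(predict(w, b, x))
    print(predict(w, b, fgsm(w, b, x, y, eps=0.25)))
    print(predict(w, b, pgd(w, b, x, y, eps=0.25, alpha=0.1, steps=5)))
```

In a segmentation setting the same update would be applied per pixel, with the gradient taken from the segmentation loss instead of this closed-form expression.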

Authors:Franklin Zhang, Sonya Zhang, Alon Halevy
Title: Leveraging LLMs to Create Content Corpora for Niche Domains
Abstract:
Constructing specialized content corpora from vast, unstructured web sources for domain-specific applications poses substantial data curation challenges. In this paper, we introduce a streamlined approach for generating high-quality, domain-specific corpora by efficiently acquiring, filtering, structuring, and cleaning web-based data. We showcase how Large Language Models (LLMs) can be leveraged to address complex data curation at scale, and propose a strategic framework incorporating LLM-enhanced techniques for structured content extraction and semantic deduplication. We validate our approach in the behavior education domain through its integration into 30 Day Me, a habit formation application. Our data pipeline, named 30DayGen, enabled the extraction and synthesis of 3,531 unique 30-day challenges from over 15K webpages. A user survey reports a satisfaction score of 4.3 out of 5, with 91% of respondents indicating willingness to use the curated content for their habit-formation goals.
中文: 本文提出了一种利用大型语言模型从网络数据中高效构建高质量领域专用语料库的简化方法,通过在习惯养成应用中的验证,成功提取了数千条挑战内容并获得了用户的高度满意度。
English: This paper introduces a streamlined approach using Large Language Models to efficiently create high-quality, domain-specific corpora from web data, validated through a habit-formation application where it successfully extracted thousands of challenges and achieved high user satisfaction.

Authors:Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong Wang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue Wang, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
Title: Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
Abstract:
Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge the gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
中文摘要:SAGE框架通过在多轮对话中模拟类人情感变化与内心活动,评估大语言模型的社会认知能力,揭示了前沿模型与早期基线之间的显著差距,这些差距是传统评测体系未能体现的。
English Summary: The SAGE framework evaluates large language models' social cognition by simulating human-like emotional responses and inner thoughts during conversations, revealing significant performance gaps between advanced and baseline models that traditional benchmarks miss.

Authors:Zhikai Wang, Yanyan Shen, Zibin Zhang, Kangyi Lin
Title: Feature Staleness Aware Incremental Learning for CTR Prediction
Abstract:
Click-through Rate (CTR) prediction in real-world recommender systems often deals with billions of user interactions every day. To improve the training efficiency, it is common to update the CTR prediction model incrementally using the new incremental data and a subset of historical data. However, the feature embeddings of a CTR prediction model often get stale when the corresponding features do not appear in current incremental data. In the next period, the model would have a performance degradation on samples containing stale features, which we call the feature staleness problem. To mitigate this problem, we propose a Feature Staleness Aware Incremental Learning method for CTR prediction (FeSAIL) which adaptively replays samples containing stale features. We first introduce a staleness aware sampling algorithm (SAS) to sample a fixed number of stale samples with high sampling efficiency. We then introduce a staleness aware regularization mechanism (SAR) for a fine-grained control of the feature embedding updating. We instantiate FeSAIL with a general deep learning-based CTR prediction model and the experimental results demonstrate FeSAIL outperforms various state-of-the-art methods on four benchmark datasets.
Chinese: FeSAIL方法通过自适应回放陈旧特征样本和精细化嵌入正则化,有效缓解了增量CTR模型训练中的特征陈旧问题,提升了模型性能。
English: FeSAIL addresses the feature staleness issue in incremental CTR model training by adaptively replaying stale feature samples and applying fine-grained embedding regularization to enhance performance.

Authors:Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang
Title: R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
Abstract:
Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.
中文: 本文提出StableReinforce算法,通过优化训练损失、优势估计和奖励设计来稳定强化学习训练,在构建的多模态偏好数据集上训练的奖励模型在各项基准测试中取得了显著性能提升。
English: This paper introduces StableReinforce, a refined reinforcement learning algorithm that stabilizes training and enhances multimodal reward models, achieving significant performance improvements on benchmarks with collected preference data.

Authors:Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, Jingdong Wang
Title: No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
Abstract:
Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches need to either introduce an external and complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation guidance during the original generative training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We therefore propose Self-Representation Alignment (SRA), a simple and straightforward method that obtains representation guidance in a self-distillation manner. Specifically, SRA aligns the output latent representation of the diffusion transformer in an earlier layer with higher noise to that in a later layer with lower noise, progressively enhancing the overall representation learning during the generative training process alone. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA not only significantly outperforms approaches relying on auxiliary, complex representation training frameworks but also achieves performance comparable to methods that are heavily dependent on powerful external representation priors.
中文摘要:最新研究表明,学习有意义的内部表征可提升扩散变换器的训练效率与生成质量,由此提出的自表征对齐方法无需外部组件,仅通过跨噪声层级的潜在表征对齐即实现显著性能提升,效果媲美依赖强外部先验的方法。
English Summary: Recent research shows that learning meaningful internal representations improves diffusion transformer training and generation quality, leading to the Self-Representation Alignment method that aligns latent representations across noise levels without external components, achieving significant performance gains.

Authors:Zinan Guo, Pengze Zhang, Yanze Wu, Chong Mou, Songtao Zhao, Qian He
Title: MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing
Abstract:
Current multi-subject customization approaches encounter two critical challenges: the difficulty in acquiring diverse multi-subject training data, and attribute entanglement across different subjects. To bridge these gaps, we propose MUSAR - a simple yet effective framework to achieve robust multi-subject customization while requiring only single-subject training data. Firstly, to break the data limitation, we introduce debiased diptych learning. It constructs diptych training pairs from single-subject images to facilitate multi-subject learning, while actively correcting the distribution bias introduced by diptych construction via static attention routing and dual-branch LoRA. Secondly, to eliminate cross-subject entanglement, we introduce a dynamic attention routing mechanism, which adaptively establishes bijective mappings between generated images and conditional subjects. This design not only achieves decoupling of multi-subject representations but also maintains scalable generalization performance with increasing reference subjects. Comprehensive experiments demonstrate that our MUSAR outperforms existing methods - even those trained on multi-subject datasets - in image quality, subject consistency, and interaction naturalness, despite requiring only a single-subject dataset.
中文:提出的MUSAR框架通过去偏双联学习和动态注意力路由机制,仅需单主体训练数据即可解决多主体定制中的数据获取困难和属性纠缠问题,实现卓越的多主体生成效果。
English: The proposed MUSAR framework effectively addresses multi-subject customization challenges by using single-subject training data through debiased diptych learning and dynamic attention routing to eliminate data limitations and attribute entanglement.

Authors:Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Title: ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Abstract:
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig.1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe.
Chinese: ReplaceMe 是一种无需训练的深度剪枝方法,通过线性操作替换 transformer 模块,在无需重新训练的情况下实现高达 25% 的剪枝率,同时保持约 90% 的原始性能。
English: ReplaceMe is a training-free depth pruning method that replaces transformer blocks with a linear operation, achieving up to 25% pruning while maintaining approximately 90% of original performance without retraining.

Authors:Jai Prakash Veerla, Partha Sai Guttikonda, Helen H. Shang, Mohammad Sadegh Nasr, Cesar Torres, Jacob M. Luber
Title: Beyond the Monitor: Mixed Reality Visualization and AI for Enhanced Digital Pathology Workflow
Abstract:
Pathologists rely on gigapixel whole-slide images (WSIs) to diagnose diseases like cancer, yet current digital pathology tools hinder diagnosis. The immense scale of WSIs, often exceeding 100,000 X 100,000 pixels, clashes with the limited views traditional monitors offer. This mismatch forces constant panning and zooming, increasing pathologist cognitive load, causing diagnostic fatigue, and slowing pathologists' adoption of digital methods. PathVis, our mixed-reality visualization platform for Apple Vision Pro, addresses these challenges. It transforms the pathologist's interaction with data, replacing cumbersome mouse-and-monitor navigation with intuitive exploration using natural hand gestures, eye gaze, and voice commands in an immersive workspace. PathVis integrates AI to enhance diagnosis. An AI-driven search function instantly retrieves and displays the top five similar patient cases side-by-side, improving diagnostic precision and efficiency through rapid comparison. Additionally, a multimodal conversational AI assistant offers real-time image interpretation support and aids collaboration among pathologists across multiple Apple devices. By merging the directness of traditional pathology with advanced mixed-reality visualization and AI, PathVis improves diagnostic workflows, reduces cognitive strain, and makes pathology practice more effective and engaging. The PathVis source code and a demo video are publicly available at: https://github.com/jaiprakash1824/Path_Vis
中文: PathVis是一个针对Apple Vision Pro的混合现实平台,通过手势导航和集成AI进行病例比对与实时辅助,优化了数字病理学工作流程,减轻了认知负担并提升了诊断效率。
English: PathVis is a mixed-reality platform for Apple Vision Pro that enhances digital pathology by enabling intuitive navigation through hand gestures and integrating AI for case comparison and real-time assistance, reducing cognitive load and improving diagnostic efficiency.

Authors:Yankai Jiang, Peng Zhang, Donglin Yang, Yuan Tian, Hai Lin, Xiaosong Wang
Title: Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models
Abstract:
We explore Generalizable Tumor Segmentation, aiming to train a single model for zero-shot tumor segmentation across diverse anatomical regions. Existing methods face limitations related to segmentation quality, scalability, and the range of applicable imaging modalities. In this paper, we uncover the potential of the internal representations within frozen medical foundation diffusion models as highly efficient zero-shot learners for tumor segmentation by introducing a novel framework named DiffuGTS. DiffuGTS creates anomaly-aware open-vocabulary attention maps based on text prompts to enable generalizable anomaly segmentation without being restricted by a predefined training category list. To further improve and refine anomaly segmentation masks, DiffuGTS leverages the diffusion model, transforming pathological regions into high-quality pseudo-healthy counterparts through latent space inpainting, and applies a novel pixel-level and feature-level residual learning approach, resulting in segmentation masks with significantly enhanced quality and generalization. Comprehensive experiments on four datasets and seven tumor categories demonstrate the superior performance of our method, surpassing current state-of-the-art models across multiple zero-shot settings. Codes are available at https://github.com/Yankai96/DiffuGTS.
Chinese: 本文提出DiffuGTS框架,通过利用冻结的医学基础扩散模型生成异常感知注意力图,并采用残差学习优化分割掩码,实现了跨多种数据集和肿瘤类别的零样本泛化肿瘤分割,性能显著优于现有先进方法。
English: This paper introduces DiffuGTS, a novel framework that utilizes frozen medical foundation diffusion models for zero-shot generalizable tumor segmentation by generating anomaly-aware attention maps and refining masks through residual learning, achieving superior performance across multiple datasets and tumor categories.

Authors:Binghong Chen, Tingting Chai, Wei Jiang, Yuanrong Xu, Guanglu Zhou, Xiangqian Wu
Title: Multi-View Learning with Context-Guided Receptance for Image Denoising
Abstract:
Image denoising is essential in low-level vision applications such as photography and automated driving. Existing methods struggle with distinguishing complex noise patterns in real-world scenes and consume significant computational resources due to reliance on Transformer-based models. In this work, the Context-guided Receptance Weighted Key-Value model is proposed, combining enhanced multi-view feature integration with efficient sequence modeling. Our approach introduces the Context-guided Token Shift (CTS) paradigm, which effectively captures local spatial dependencies and enhances the model's ability to model real-world noise distributions. Additionally, the Frequency Mix (FMix) module, which extracts frequency-domain features, is designed to isolate noise in high-frequency spectra, and is integrated with spatial representations through a multi-view learning process. To improve computational efficiency, the Bidirectional WKV (BiWKV) mechanism is adopted, enabling full pixel-sequence interaction with linear complexity while overcoming the causal selection constraints. The model is validated on multiple real-world image denoising datasets, outperforming the existing state-of-the-art methods quantitatively and reducing inference time by up to 40%. Qualitative results further demonstrate the ability of our model to restore fine details in various scenes.
Chinese: 本文提出用于图像去噪的上下文引导接收加权键值模型,通过创新机制整合空间和频率特征,有效处理真实世界噪声模式,同时相比现有方法将推理时间减少高达40%。
English: This paper introduces the Context-guided Receptance Weighted Key-Value model for image denoising, which integrates spatial and frequency features through novel mechanisms to effectively handle real-world noise patterns while reducing inference time by up to 40% compared to existing methods.

Authors:Maxime Poli, Emmanuel Chemla, Emmanuel Dupoux
Title: fastabx: A library for efficient computation of ABX discriminability
Abstract:
We introduce fastabx, a high-performance Python library for building ABX discrimination tasks. ABX is a measure of the separation between generic categories of interest. It has been used extensively to evaluate phonetic discriminability in self-supervised speech representations. However, its broader adoption has been limited by the absence of adequate tools. fastabx addresses this gap by providing a framework capable of constructing any type of ABX task while delivering the efficiency necessary for rapid development cycles, both in task creation and in calculating distances between representations. We believe that fastabx will serve as a valuable resource for the broader representation learning community, enabling researchers to systematically investigate what information can be directly extracted from learned representations across several domains beyond speech processing. The source code is available at https://github.com/bootphon/fastabx.
中文:Fastabx 是一个高性能的 Python 库,旨在高效构建和计算 ABX 判别任务,填补了在语音处理之外多个领域中评估学习表示类别区分能力的工具空白。
English: Fastabx is a high-performance Python library designed to efficiently build and compute ABX discrimination tasks, addressing the lack of tools for evaluating category separation in learned representations across various domains beyond speech processing.

Authors:Xiaobao Wu
Title: Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards
Abstract:
Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unified paradigm has arisen: Learning from Rewards, where reward signals act as the guiding stars to steer LLM behavior. It has underpinned a wide range of prevalent techniques, such as reinforcement learning (RLHF, RLAIF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, this paradigm enables the transition from passive learning from static data to active learning from dynamic feedback. This endows LLMs with aligned preferences and deep reasoning capabilities for diverse tasks. In this survey, we present a comprehensive overview of learning from rewards, from the perspective of reward models and learning strategies across training, inference, and post-inference stages. We further discuss the benchmarks for reward models and the primary applications. Finally we highlight the challenges and future directions. We maintain a paper collection at https://github.com/bobxwu/learning-from-rewards-llm-papers.
Chinese Summary: 近期大语言模型的发展重点转向基于奖励的学习范式,通过强化学习、奖励引导解码等技术,利用奖励信号指导模型行为,实现从静态数据被动学习到动态反馈主动学习的转变,从而增强模型的对齐能力和深度推理能力。
English Summary: Recent advancements in large language models are increasingly centered on learning from rewards, a paradigm that uses reward signals to guide model behavior through techniques like reinforcement learning and reward-guided decoding, enabling active learning from dynamic feedback for improved alignment and reasoning.

Authors:Shiwei Guo, Ziang Chen, Yupeng Ma, Yunfei Han, Yi Wang
Title: SCFormer: Structured Channel-wise Transformer with Cumulative Historical State for Multivariate Time Series Forecasting
Abstract:
The Transformer model has shown strong performance in multivariate time series forecasting by leveraging channel-wise self-attention. However, this approach lacks temporal constraints when computing temporal features and does not utilize cumulative historical series effectively. To address these limitations, we propose the Structured Channel-wise Transformer with Cumulative Historical state (SCFormer). SCFormer introduces temporal constraints to all linear transformations, including the query, key, and value matrices, as well as the fully connected layers within the Transformer. Additionally, SCFormer employs High-order Polynomial Projection Operators (HiPPO) to deal with cumulative historical time series, allowing the model to incorporate information beyond the look-back window during prediction. Extensive experiments on multiple real-world datasets demonstrate that SCFormer significantly outperforms mainstream baselines, highlighting its effectiveness in enhancing time series forecasting. The code is publicly available at https://github.com/ShiweiGuo1995/SCFormer
Chinese: SCFormer模型通过在线性变换中引入时间约束并采用高阶多项式投影算子处理累积历史序列,有效提升了多元时间序列预测性能,在多个真实数据集上的实验表明其显著优于主流基线方法。
English: The SCFormer model enhances multivariate time series forecasting by incorporating temporal constraints into all linear transformations and using High-order Polynomial Projection Operators to effectively utilize cumulative historical data, significantly outperforming existing methods in experiments.

Authors:Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng
Title: LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
Abstract:
Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.
中文: 本文介绍了LLaMA-Omni 2系列语音语言模型,通过集成语音编码器和流式解码器实现高质量实时语音交互,在仅使用少量训练数据的情况下,其性能超越了基于海量语音数据训练的现有最优模型。
English: This paper presents LLaMA-Omni 2, a series of speech language models that enable high-quality real-time speech interaction by integrating a speech encoder and streaming decoder, achieving superior performance on benchmarks despite minimal training data.

Authors:Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Jiakui Hu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Title: Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Abstract:
Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).
中文摘要:本综述探讨了多模态理解与图像生成模型的统一路径,通过分析三种主流架构范式并指出关键挑战,为这一新兴领域的未来研究提供指导。
English Summary: This survey explores the unification of multimodal understanding and image generation models, analyzing three architectural paradigms and addressing key challenges to guide future research in this emerging field.

Authors:Hongze Li, Zesheng Zhou, Zhenbiao Cao, Xinhui Li, Wei Chen, Xiaojin Zhang
Title: FedSDAF: Leveraging Source Domain Awareness for Enhanced Federated Domain Generalization
Abstract:
Traditional Federated Domain Generalization (FedDG) methods focus on learning domain-invariant features or adapting to unseen target domains, often overlooking the unique knowledge embedded within the source domain, especially in strictly isolated federated learning environments. Through experimentation, we discovered a counterintuitive phenomenon: features learned from a complete source domain have superior generalization capabilities compared to those learned directly from the target domain. This insight leads us to propose the Federated Source Domain Awareness Framework (FedSDAF), the first systematic approach to enhance FedDG by leveraging source domain-aware features. FedSDAF employs a dual-adapter architecture that decouples "local expertise" from "global generalization consensus". A Domain-Aware Adapter, retained locally, extracts and protects the unique discriminative knowledge of each source domain, while a Domain-Invariant Adapter, shared across clients, builds a robust global consensus. To enable knowledge exchange, we introduce a Bidirectional Knowledge Distillation mechanism that facilitates efficient dialogue between the adapters. Extensive experiments on four benchmark datasets (OfficeHome, PACS, VLCS, DomainNet) show that FedSDAF significantly outperforms existing FedDG methods. The source code is available at https://github.com/pizzareapers/FedSDAF.
中文摘要:传统联邦域泛化方法常忽视源域知识,但本研究发现源域特征比目标域特征具有更强的泛化能力,据此提出采用双向知识蒸馏的双适配器FedSDAF框架,在多个基准测试中显著优于现有方法。
English Summary: Traditional Federated Domain Generalization methods often neglect source domain knowledge, but this research reveals that source domain features actually generalize better than target domain features, leading to the innovative FedSDAF framework with dual adapters and bidirectional knowledge distillation that significantly outperforms existing methods.

Authors:Xiongjun Guan, Zhiyu Pan, Jianjiang Feng, Jie Zhou
Title: Finger Pose Estimation for Under-screen Fingerprint Sensor
Abstract:
Two-dimensional pose estimation plays a crucial role in fingerprint recognition by facilitating global alignment and reducing pose-induced variations. However, existing methods remain unsatisfactory when handling large-angle or small-area inputs. These limitations are particularly pronounced on fingerprints captured by under-screen fingerprint sensors in smartphones. In this paper, we present a novel dual-modal input based network for under-screen fingerprint pose estimation. Our approach effectively integrates two distinct yet complementary modalities: texture details extracted from ridge patches through the under-screen fingerprint sensor, and rough contours derived from capacitive images obtained via the touch screen. This collaborative integration endows our network with more comprehensive and discriminative information, substantially improving the accuracy and stability of pose estimation. A decoupled probability distribution prediction task is designed, instead of the traditional supervised forms of numerical regression or heatmap voting, to facilitate the training process. Additionally, we incorporate a Mixture of Experts (MoE) based feature fusion mechanism and a relationship-driven cross-domain knowledge transfer strategy to further strengthen feature extraction and fusion capabilities. Extensive experiments are conducted on several public datasets and two private datasets. The results indicate that our method is significantly superior to previous state-of-the-art (SOTA) methods and remarkably boosts the recognition ability of fingerprint recognition algorithms. Our code is available at https://github.com/XiongjunGuan/DRACO.
中文摘要:本文提出了一种新颖的双模态输入网络,通过整合来自屏下指纹传感器的脊线纹理细节和触摸屏电容图像的粗略轮廓,显著提高了指纹姿态估计的准确性和稳定性,在多个数据集上均优于现有最优方法。
English Summary: This paper introduces a novel dual-modal network that integrates texture details from ridge patches and rough contours from capacitive images to significantly enhance the accuracy and stability of under-screen fingerprint pose estimation, outperforming previous state-of-the-art methods.

Authors:Inclusion AI, Biao Gong, Cheng Zou, Dandan Zheng, Hu Yu, Jingdong Chen, Jianxin Sun, Junbo Zhao, Jun Zhou, Kaixiang Ji, Lixiang Ru, Libin Wang, Qingpei Guo, Rui Liu, Weilong Chai, Xinyu Xiao, Ziyuan Huang
Title: Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
Abstract:
We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing novel multi-scale learnable tokens and a multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing tasks, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and illustrate the impressive fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones, such as ChatGPT-4o with native image generation updated on March 25, 2025, underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is in alpha stage and will soon be further refined.
中文:Ming-Lite-Uni是一个开源多模态框架,通过原生自回归模型和视觉生成器统一视觉与语言,能够执行文本到图像生成和指令式图像编辑任务,实验证明其具有卓越性能。
English: Ming-Lite-Uni is an open-source multimodal framework that unifies vision and language through a native autoregressive model and visual generator, enabling text-to-image generation and image editing while demonstrating strong performance in experiments.

Authors:Jiaqi Zhang, Zhuodong Liu, Kejian Yu
Title: MSFNet-CPD: Multi-Scale Cross-Modal Fusion Network for Crop Pest Detection
Abstract:
Accurate identification of agricultural pests is essential for crop protection but remains challenging due to the large intra-class variance and fine-grained differences among pest species. While deep learning has advanced pest detection, most existing approaches rely solely on low-level visual features and lack effective multi-modal integration, leading to limited accuracy and poor interpretability. Moreover, the scarcity of high-quality multi-modal agricultural datasets further restricts progress in this field. To address these issues, we construct two novel multi-modal benchmarks, CTIP102 and STIP102, based on the widely used IP102 dataset, and introduce a Multi-scale Cross-Modal Fusion Network (MSFNet-CPD) for robust pest detection. Our approach enhances visual quality via a super-resolution reconstruction module, and feeds both the original and reconstructed images into the network to improve clarity and detection performance. To better exploit semantic cues, we propose an Image-Text Fusion (ITF) module for joint modeling of visual and textual features, and an Image-Text Converter (ITC) that reconstructs fine-grained details across multiple scales to handle challenging backgrounds. Furthermore, we introduce an Arbitrary Combination Image Enhancement (ACIE) strategy to generate a more complex and diverse pest detection dataset, MTIP102, improving the model's generalization to real-world scenarios. Extensive experiments demonstrate that MSFNet-CPD consistently outperforms state-of-the-art methods on multiple pest detection benchmarks. All code and datasets will be made publicly available at: https://github.com/Healer-ML/MSFNet-CPD.
中文: 本研究提出MSFNet-CPD多模态害虫检测网络,通过超分辨率重建提升图像清晰度并融合视觉-文本特征,有效解决害虫细粒度识别难题,在新建基准上实现了最优性能。
English: This study introduces MSFNet-CPD, a multi-modal pest detection network that enhances image clarity through super-resolution reconstruction and integrates visual-textual features to address challenges in fine-grained pest identification, achieving state-of-the-art performance on new benchmarks.

Authors:Zichen Liu, Xu Zou, Gang Hua, Jiahuan Zhou
Title: Token Coordinated Prompt Attention is Needed for Visual Prompting
Abstract:
Visual prompting techniques are widely used to efficiently fine-tune pretrained Vision Transformers (ViT) by learning a small set of shared prompts for all tokens. However, existing methods overlook the unique roles of different tokens in conveying discriminative information and interact with all tokens using the same prompts, thereby limiting the representational capacity of ViT. This often leads to indistinguishable and biased prompt-extracted features, hindering performance. To address this issue, we propose a plug-and-play Token Coordinated Prompt Attention (TCPA) module, which assigns specific coordinated prompts to different tokens for attention-based interactions. Firstly, recognizing the distinct functions of CLS and image tokens (global information aggregation and local feature extraction, respectively), we disentangle the prompts into CLS Prompts and Image Prompts, which interact exclusively with CLS tokens and image tokens through attention mechanisms. This enhances their respective discriminative abilities. Furthermore, as different image tokens correspond to distinct image patches and contain diverse information, we employ a matching function to automatically assign coordinated prompts to individual tokens. This enables more precise attention interactions, improving the diversity and representational capacity of the extracted features. Extensive experiments across various benchmarks demonstrate that TCPA significantly enhances the diversity and discriminative power of the extracted features. The code is available at https://github.com/zhoujiahuan1991/ICML2025-TCPA.
中文: 提出的令牌协调提示注意力(TCPA)模块通过为不同令牌分配专用提示——CLS提示用于全局信息,图像提示用于局部特征,解决了现有视觉提示方法的局限性,通过基于注意力的交互提升了特征的多样性和判别力。
English: The proposed Token Coordinated Prompt Attention (TCPA) module addresses limitations in existing visual prompting methods by assigning specialized prompts to different tokens—CLS Prompts for global information and Image Prompts for local features—enhancing feature diversity and discriminative power through attention-based interactions.

Authors:Sungheon Jeong, Jihong Park, Mohsen Imani
Title: Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
Abstract:
Most existing video anomaly detectors rely solely on RGB frames, which lack the temporal resolution needed to capture abrupt or transient motion cues, key indicators of anomalous events. To address this limitation, we propose Image-Event Fusion for Video Anomaly Detection (IEF-VAD), a framework that synthesizes event representations directly from RGB videos and fuses them with image features through a principled, uncertainty-aware process. The system (i) models heavy-tailed sensor noise with a Student's-t likelihood, deriving value-level inverse-variance weights via a Laplace approximation; (ii) applies Kalman-style frame-wise updates to balance modalities over time; and (iii) iteratively refines the fused latent state to erase residual cross-modal noise. Without any dedicated event sensor or frame-level labels, IEF-VAD sets a new state of the art across multiple real-world anomaly detection benchmarks. These findings highlight the utility of synthetic event representations in emphasizing motion cues that are often underrepresented in RGB frames, enabling accurate and robust video understanding across diverse applications without requiring dedicated event sensors. Code and models are available at https://github.com/EavnJeong/IEF-VAD.
中文: 提出的IEF-VAD框架通过从RGB视频合成事件表征,并采用不确定性感知方法将其与图像特征融合,无需专用事件传感器即可实现最先进的视频异常检测性能。
English: The proposed IEF-VAD framework enhances video anomaly detection by synthesizing event representations from RGB videos and fusing them with image features through an uncertainty-aware process, achieving state-of-the-art performance without requiring dedicated event sensors.
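
The inverse-variance weighting mentioned in step (i) of the IEF-VAD abstract can be illustrated with a minimal sketch. This is only the textbook Gaussian special case, in which each modality's value is weighted by the reciprocal of its variance; the Student's-t likelihood, the Laplace approximation, and the Kalman-style updates described in the abstract are not reproduced here. The function name and the per-value variance inputs are hypothetical.

```python
def inverse_variance_fuse(x_img, var_img, x_evt, var_evt):
    """Fuse two equally long lists of values, weighting each by 1 / variance.

    Hypothetical helper for illustration only; it is not taken from the
    IEF-VAD code base. All variances are assumed to be strictly positive.
    """
    fused, fused_var = [], []
    for a, va, b, vb in zip(x_img, var_img, x_evt, var_evt):
        wa, wb = 1.0 / va, 1.0 / vb          # precision (inverse variance) of each modality
        fused.append((wa * a + wb * b) / (wa + wb))
        fused_var.append(1.0 / (wa + wb))    # fused variance is never larger than either input
    return fused, fused_var


# The less noisy modality (variance 1.0) pulls the result toward its value.
values, variances = inverse_variance_fuse([1.0], [1.0], [3.0], [3.0])
print(values, variances)
```

Under this weighting, a modality whose estimated variance grows contributes proportionally less to the fused value, which is the intuition behind balancing image and event features by uncertainty.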

Authors:Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang
Title: Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
Abstract:
Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty and convergence behavior. This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation due to static sampling strategies. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy designed to minimize stochastic gradient variance under a computational budget constraint. The method dynamically allocates computational resources by monitoring prompt acceptance rates and stochastic gradient norms, ensuring that the resulting gradient variance is minimized. Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions. Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy. Our code is available at https://github.com/RLHFlow/GVM.
中文摘要:GVM-RAFT提出动态采样策略,通过监控接受率和梯度范数优化计算资源分配,在数学推理任务中相比传统RAFT实现2-4倍加速收敛与显著精度提升。
English Summary: GVM-RAFT introduces a dynamic sampling strategy that optimizes computational resource allocation by minimizing gradient variance, achieving 2-4x faster convergence and higher accuracy in mathematical reasoning tasks compared to standard RAFT.

Authors:Enbo Zhao, Yi Shen, Shuming Shi, Jieyun Huang, Zhihao Chen, Ning Wang, Siqi Xiao, Jian Zhang, Kai Wang, Shiguo Lian
Title: Quantitative Analysis of Performance Drop in DeepSeek Model Quantization
Abstract:
Recently, there has been high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service is often busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used technique that helps reduce model memory consumption. However, it is unclear how DeepSeek-R1 and V3 will perform after being quantized. This technical report presents the first quantitative evaluation of multi-bitwidth quantization across the complete DeepSeek model spectrum. Key findings reveal that 4-bit quantization incurs little performance degradation versus FP8 while enabling single-machine deployment on standard NVIDIA GPU devices. We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms the traditional Q3_K_M variant on various benchmarks and is comparable with the 4-bit quantization (Q4_K_M) approach in most tasks. Moreover, DQ3_K_M supports single-machine deployment configurations for both NVIDIA H100/A100 and Huawei 910B. Our implementation of DQ3_K_M is released at https://github.com/UnicomAI/DeepSeek-Eval, containing optimized 3-bit quantized variants of both DeepSeek-R1 and DeepSeek-V3.
中文摘要:本技术报告提出了一种名为DQ3_K_M的动态3位量化方法,能够在保持与4位量化相当性能的同时,实现DeepSeek模型在标准NVIDIA GPU设备上的单机部署。
English Summary: This technical report introduces a dynamic 3-bit quantization method called DQ3_K_M that enables single-machine deployment of DeepSeek models while maintaining performance comparable to 4-bit quantization across various benchmarks.
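
The memory argument in the DeepSeek quantization abstract can be checked with back-of-the-envelope arithmetic. The sketch below assumes 80 GB per GPU (an assumption, not a figure from the abstract), treats every parameter as stored at one uniform bit width (real mixed-precision schemes such as Q4_K_M do not use exactly one width for every tensor), and ignores the KV cache and activations, so it gives only a lower bound on the memory actually needed.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 671e9        # parameter count stated in the abstract
GPU_MEMORY_GB = 80        # assumption: 80 GB per accelerator
NUM_GPUS = 8
machine_gb = GPU_MEMORY_GB * NUM_GPUS

for bits in (8, 4, 3):
    needed = weight_memory_gb(NUM_PARAMS, bits)
    verdict = "fits" if needed < machine_gb else "does not fit"
    print(f"{bits}-bit weights: {needed:.1f} GB, {verdict} in {machine_gb} GB")
```

Under these assumptions the 8-bit weights alone exceed the combined memory of the machine, while 4-bit and 3-bit weights leave headroom for the runtime state that the estimate omits.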

Authors:Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
Title: RM-R1: Reward Modeling as Reasoning
Abstract:
Reward modeling is essential for aligning large language models with human preferences through reinforcement learning from human feedback. To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. Inspired by recent advances of long chain-of-thought on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RMs' interpretability and performance. To this end, we introduce a new class of generative reward models, Reasoning Reward Models (ReasRMs), which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism: it self-generates sample-level chat rubrics or math/code solutions and evaluates candidate responses against them. The training of RM-R1 consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. Empirically, our models achieve state-of-the-art performance across three reward model benchmarks on average, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%. Beyond final performance, we perform thorough empirical analyses to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.
中文: 本文提出推理奖励模型(ReasRMs),通过引入推理能力和链式评分机制,结合两阶段训练流程,显著提升了奖励模型的性能和可解释性,在多个基准测试中达到最优水平,并以高达4.9%的优势超越更大规模的模型。
English: This paper introduces Reasoning Reward Models (ReasRMs), which enhance reward modeling by incorporating reasoning capabilities through a chain-of-rubrics mechanism and a two-stage training process, achieving state-of-the-art performance across benchmarks while outperforming larger models by up to 4.9%.

Authors:Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, Sijie Zhu
Title: SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing
Abstract:
Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs. Recent efforts attempt to improve editing models through generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs) but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. This includes rectifying the editing instructions to better align with the original-edited image pairs and using contrastive editing instructions to further enhance their effectiveness. Specifically, we find that editing models exhibit specific generation attributes at different inference steps, independent of the text. Based on these prior attributes, we define a unified guide for VLMs to rectify editing instructions. However, there are some challenging editing scenarios that cannot be resolved solely with rectified instructions. To this end, we further construct contrastive supervision signals with positive and negative instructions and introduce them into the model training using triplet loss, thereby further facilitating supervision effectiveness. Our method does not require the VLM modules or pre-training tasks used in previous work, offering a more direct and efficient way to provide better supervision signals, and providing a novel, simple, and effective solution for instruction-based image editing. Results on multiple benchmarks demonstrate that our method significantly outperforms existing approaches. Compared with previous SOTA SmartEdit, we achieve 9.19% improvements on the Real-Edit benchmark with 30x less training data and 13x smaller model size.
中文: 本文提出了一种新颖的图像编辑指令优化方法,通过校正指令和构建对比监督信号来提升编辑效果,在显著减少训练数据和模型规模的同时,实现了优于现有方法的性能表现。
English: This paper introduces a novel approach to improve instruction-based image editing by constructing more effective editing instructions through rectification and contrastive supervision, achieving superior performance with significantly reduced training data and model size compared to existing methods.
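
The SuperEdit abstract says that positive and negative instructions enter training through a triplet loss. As a reminder of what that objective computes, here is a minimal sketch of the generic triplet margin loss on plain vectors. Which quantities act as anchor, positive, and negative in SuperEdit, and which distance and margin it uses, are not stated in the abstract; the Euclidean distance and the default margin below are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin` (an assumed hyperparameter).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Positive at distance 5, negative at distance 10: the margin is already satisfied.
print(triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0]))
# Swapping the roles violates the margin and yields a positive loss.
print(triplet_margin_loss([0.0, 0.0], [6.0, 8.0], [3.0, 4.0]))
```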

Authors:Vincent-Daniel Yun
Title: Sharpness-Aware Minimization with Z-Score Gradient Filtering
Abstract:
Deep neural networks achieve high performance across many domains but can still face challenges in generalization when optimization is influenced by small or noisy gradient components. Sharpness-Aware Minimization improves generalization by perturbing parameters toward directions of high curvature, but it uses the entire gradient vector, which means that small or noisy components may affect the ascent step and cause the optimizer to miss optimal solutions. We propose Z-Score Filtered Sharpness-Aware Minimization, which applies Z-score based filtering to gradients in each layer. Instead of using all gradient components, a mask is constructed to retain only the top percentile with the largest absolute Z-scores. The percentile threshold $Q_p$ determines how many components are kept, so that the ascent step focuses on directions that stand out most compared to the average of the layer. This selective perturbation refines the search toward flatter minima while reducing the influence of less significant gradients. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with architectures including ResNet, VGG, and Vision Transformers show that the proposed method consistently improves test accuracy compared to Sharpness-Aware Minimization and its variants. The code repository is available at: https://github.com/YUNBLAK/Sharpness-Aware-Minimization-with-Z-Score-Gradient-Filtering
中文: Z-Score过滤的锐度感知最小化通过基于Z分数的梯度筛选,选择性地扰动显著梯度分量以寻找更平坦的极小值,在多个数据集和架构上均优于现有方法。
English: Z-Score Filtered Sharpness-Aware Minimization enhances generalization by selectively perturbing gradients based on Z-score filtering, focusing on significant components to find flatter minima and outperforming existing methods across multiple datasets and architectures.
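
The layer-wise filtering described in the Z-score SAM abstract can be sketched as follows. The sketch standardizes the gradient components of one layer, keeps the fraction with the largest absolute Z-scores, and builds a SAM-style ascent step from the surviving components only. The `keep_fraction` argument stands in for the percentile threshold $Q_p$, and normalizing by the norm of the masked gradient is an assumption; the authors' implementation may differ on both points.

```python
import math
import statistics


def zscore_mask(grads, keep_fraction):
    """Return a 0/1 mask keeping the components with the largest |Z-score|."""
    mean = statistics.fmean(grads)
    std = statistics.pstdev(grads)
    if std == 0.0:
        return [1] * len(grads)  # all components equal: nothing stands out, keep all
    z_abs = [abs((g - mean) / std) for g in grads]
    k = max(1, round(keep_fraction * len(grads)))
    top = sorted(range(len(grads)), key=lambda i: z_abs[i], reverse=True)[:k]
    mask = [0] * len(grads)
    for i in top:
        mask[i] = 1
    return mask


def filtered_ascent_step(grads, keep_fraction, rho):
    """Ascent perturbation of length rho along the masked gradient (assumed form)."""
    mask = zscore_mask(grads, keep_fraction)
    masked = [g * m for g, m in zip(grads, mask)]
    norm = math.sqrt(sum(v * v for v in masked))
    if norm == 0.0:
        return [0.0] * len(grads)
    return [rho * v / norm for v in masked]


layer_grads = [0.01, -0.02, 5.0, 0.03, -4.0, 0.0]
print(zscore_mask(layer_grads, keep_fraction=0.34))
print(filtered_ascent_step(layer_grads, keep_fraction=0.34, rho=0.05))
```

In this toy layer only the two components that deviate strongly from the layer mean survive the mask, so the perturbation ignores the four near-zero entries.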

Authors:Bobo Lian, Dandan Wang, Chenjian Wu, Minxin Chen
Title: Sparse Ellipsoidal Radial Basis Function Network for Point Cloud Surface Representation
Abstract:
Point cloud surface representation is a fundamental problem in computer graphics and vision. This paper presents a machine learning approach for approximating the signed distance function (SDF) of a point cloud using a sparse ellipsoidal radial basis function network, enabling a compact and accurate surface representation. Given the SDF values defined on the grid points constructed from the point cloud, our method approximates the SDF accurately with as few ellipsoidal radial basis functions (ERBFs) as possible, i.e., represents the SDF of a point cloud by sparse ERBFs. To balance sparsity and approximation precision, a dynamic multi-objective optimization strategy is introduced, which adaptively adds the regularization terms and jointly optimizes the weights, centers, shapes, and orientations of ERBFs. To improve computational efficiency, a nearest-neighbor-based data structure is employed, restricting function calculations to points near each Gaussian kernel center. The computations for each kernel are further parallelized on CUDA, which significantly improves the optimization speed. Additionally, a hierarchical octree-based refinement strategy is designed for training. Specifically, the initialization and optimization of network parameters are conducted using coarse grid points in the octree lattice structure. Subsequently, fine lattice points are progressively incorporated to accelerate model convergence and enhance training efficiency. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms previous sparse representation approaches in terms of accuracy, robustness, and computational efficiency. The corresponding executable program is publicly available at https://github.com/lianbobo/SE-RBFNet.git.
Chinese: 本文提出一种机器学习方法,利用稀疏椭球径向基函数网络精确高效地逼近点云的符号距离函数,在精度、鲁棒性和计算效率方面均优于现有稀疏表示方法。
English: This paper introduces a machine learning method that uses a sparse ellipsoidal radial basis function network to accurately and efficiently approximate the signed distance function of point clouds, achieving superior performance in accuracy, robustness, and computational efficiency compared to previous approaches.
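
To make the role of the weights, centers, shapes, and orientations in the SE-RBFNet abstract concrete, the sketch below evaluates a sum of ellipsoidal Gaussian kernels in 2D. The parameterization (one rotation angle and two axis lengths per kernel) and all names are assumptions chosen for brevity; the paper works on 3D point clouds and its exact kernel form is not given in the abstract.

```python
import math


def erbf_2d(x, y, weight, cx, cy, ax, ay, theta):
    """One ellipsoidal Gaussian kernel: weight * exp(-((u/ax)^2 + (v/ay)^2)).

    (u, v) are the coordinates of (x, y) relative to the center (cx, cy),
    expressed in the kernel's own axes, which are rotated by theta.
    """
    dx, dy = x - cx, y - cy
    c, s = math.cos(theta), math.sin(theta)
    u = c * dx + s * dy
    v = -s * dx + c * dy
    return weight * math.exp(-((u / ax) ** 2 + (v / ay) ** 2))


def rbf_sum(x, y, kernels):
    """Approximate field value as the sum of all kernels at (x, y)."""
    return sum(erbf_2d(x, y, *k) for k in kernels)


# Two hypothetical kernels: (weight, cx, cy, ax, ay, theta)
kernels = [(1.0, 0.0, 0.0, 1.0, 2.0, 0.0), (-0.5, 3.0, 0.0, 0.5, 0.5, 0.0)]
print(rbf_sum(0.0, 0.0, kernels))
```

A sparse representation in this sense is one where few such kernels, each free to stretch and rotate, reproduce the target field within the desired accuracy.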

Authors:Kahim Wong, Jicheng Zhou, Jiantao Zhou, Yain-Whar Si
Title: An End-to-End Model For Logits Based Large Language Models Watermarking
Abstract:
The rise of LLMs has increased concerns over source tracing and copyright protection for AIGC, highlighting the need for advanced detection technologies. Passive detection methods usually face high false positives, while active watermarking techniques using logits or sampling manipulation offer more effective protection. Existing LLM watermarking methods, though effective on unaltered content, suffer significant performance drops when the text is modified and could introduce biases that degrade LLM performance in downstream tasks. These methods fail to achieve an optimal tradeoff between text quality and robustness, particularly due to the lack of end-to-end optimization of the encoder and decoder. In this paper, we introduce a novel end-to-end logits perturbation method for watermarking LLM-generated text. Through joint optimization, our approach achieves a better balance between quality and robustness. To address non-differentiable operations in the end-to-end training pipeline, we introduce an online prompting technique that leverages the on-the-fly LLM as a differentiable surrogate. Our method achieves superior robustness, outperforming distortion-free methods by 37-39% under paraphrasing and 17.2% on average, while maintaining text quality on par with these distortion-free methods in terms of text perplexity and downstream tasks. Our method can be easily generalized to different LLMs. Code is available at https://github.com/KAHIMWONG/E2E_LLM_WM.
Chinese Summary: 本文提出了一种端到端的LLM水印方法,通过联合优化和可微分替代技术,在保持文本质量的同时显著提升了抗篡改鲁棒性。
English Summary: This paper introduces an end-to-end logits perturbation watermarking method for LLMs that achieves superior robustness against text modifications while maintaining text quality through joint optimization and a differentiable surrogate technique.

Authors:Hao Cheng, Zhiwei Zhao, Yichao He, Zhenzhen Hu, Jia Li, Meng Wang, Richang Hong
Title: VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
Abstract:
Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and contrastive objectives, mitigating the modality gap and learning expressive, complementary representations without emotion labels. In Stage 2, multimodal large language models automatically generate detailed affective descriptions according to our well-designed chain-of-thought prompting for only a small subset of VA samples; these rich textual semantics are then injected by aligning their corresponding embeddings with VA representations through dual-path contrastive learning, further bridging the emotion gap. Extensive experiments on multiple downstream AVER benchmarks show that VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance for efficient, generalizable VA emotion representations.
中文摘要:VAEmo是一个两阶段框架,首先通过无情感标签的预训练学习统一视听表征,随后利用生成的情感描述注入语义信息,以紧凑设计在多个人机交互基准上实现了最先进的视听情感识别性能。
English Summary: VAEmo is a two-stage framework that first learns unified audiovisual representations without emotion labels and then injects emotion-aware semantics through generated textual descriptions, achieving state-of-the-art performance in audiovisual emotion recognition with efficient cross-modal encoding.

Authors:Zhichuan Wang, Yang Zhou, Jinhai Xiang, Yulong Wang, Xinwei He
Title: TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment
Abstract:
Learning discriminative 3D representations that generalize well to unknown testing categories is an emerging requirement for many real-world 3D applications. Existing well-established methods often struggle to attain this goal due to insufficient 3D training data from broader concepts. Meanwhile, pre-trained large vision-language models (e.g., CLIP) have shown remarkable zero-shot generalization capabilities. Yet, they are limited in extracting suitable 3D representations due to substantial gaps between their 2D training and 3D testing distributions. To address these challenges, we propose Testing-time Distribution Alignment (TeDA), a novel framework that adapts a pretrained 2D vision-language model CLIP for unknown 3D object retrieval at test time. To our knowledge, it is the first work that studies the test-time adaptation of a vision-language model for 3D feature learning. TeDA projects 3D objects into multi-view images, extracts features using CLIP, and refines 3D query embeddings with an iterative optimization strategy by confident query-target sample pairs in a self-boosting manner. Additionally, TeDA integrates textual descriptions generated by a multimodal language model (InternVL) to enhance 3D object understanding, leveraging CLIP's aligned feature space to fuse visual and textual cues. Extensive experiments on four open-set 3D object retrieval benchmarks demonstrate that TeDA greatly outperforms state-of-the-art methods, even those requiring extensive training. We also experimented with depth maps on Objaverse-LVIS, further validating its effectiveness. Code is available at https://github.com/wangzhichuan123/TeDA.
中文: 本文提出测试时分布对齐(TeDA)框架,通过多视角投影和迭代优化将预训练的CLIP模型适配于零样本3D物体检索,无需大量训练即可超越现有最优方法。
English: This paper introduces Testing-time Distribution Alignment (TeDA), a novel framework that adapts the pre-trained CLIP model for zero-shot 3D object retrieval by aligning 2D and 3D distributions through multi-view projections and iterative optimization, outperforming existing methods without requiring extensive training.

Authors:Michael F. Herbst, Bonan Sun
Title: Efficient Krylov methods for linear response in plane-wave electronic structure calculations
Abstract:
We propose a novel algorithm based on inexact GMRES methods for linear response calculations in density functional theory. Such calculations require iteratively solving a nested linear problem $\mathcal{E} \delta\rho = b$ to obtain the variation of the electron density $\delta\rho$. Notably, each application of the dielectric operator $\mathcal{E}$ in turn requires the iterative solution of multiple linear systems, the Sternheimer equations. We develop computable bounds to estimate the accuracy of the density variation given the tolerances to which the Sternheimer equations have been solved. Based on this result we suggest reliable strategies for adaptively selecting the convergence tolerances of the Sternheimer equations, such that each application of $\mathcal{E}$ is no more accurate than needed. Experiments on challenging materials systems of practical relevance demonstrate that our strategies achieve superlinear convergence as well as a reduction of computational time by about 40% while preserving the accuracy of the returned response solution. Our algorithm seamlessly combines with standard preconditioning approaches known from the context of self-consistent field problems, making it a promising framework for efficient response solvers based on Krylov subspace techniques.
Chinese: 本研究提出了一种基于非精确GMRES的线性响应计算新算法,通过自适应调整收敛容差,在保持精度的同时实现了超线性收敛,并将计算时间减少了约40%。
English: This study introduces an inexact GMRES-based algorithm for linear response calculations in density functional theory, which adaptively adjusts convergence tolerances to reduce computational time by 40% while maintaining accuracy and achieving superlinear convergence.

Authors:James Read, Ming-Yen Lee, Wei-Hsing Huang, Yuan-Chun Luo, Anni Lu, Shimeng Yu
Title: NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities
Abstract:
The exponential growth of artificial intelligence (AI) applications has exposed the inefficiency of conventional von Neumann architectures, where frequent data transfers between compute units and memory create significant energy and latency bottlenecks. Analog Computing-in-Memory (ACIM) addresses this challenge by performing multiply-accumulate (MAC) operations directly in the memory arrays, substantially reducing data movement. However, designing robust ACIM accelerators requires accurate modeling of device- and circuit-level non-idealities. In this work, we present NeuroSim V1.5, introducing several key advances: (1) seamless integration with TensorRT's post-training quantization flow enabling support for more neural networks including transformers, (2) a flexible noise injection methodology built on pre-characterized statistical models, making it straightforward to incorporate data from SPICE simulations or silicon measurements, (3) expanded device support including emerging non-volatile capacitive memories, and (4) up to 6.5x faster runtime than NeuroSim V1.4 through optimized behavioral simulation. The combination of these capabilities uniquely enables systematic design space exploration across both accuracy and hardware efficiency metrics. Through multiple case studies, we demonstrate optimization of critical design parameters while maintaining network accuracy. By bridging high-fidelity noise modeling with efficient simulation, NeuroSim V1.5 advances the design and validation of next-generation ACIM accelerators. All NeuroSim versions are available open-source at https://github.com/neurosim/NeuroSim.
中文: NeuroSim V1.5 增强了模拟存内计算加速器的建模与仿真功能,支持高效的设计空间探索与优化,同时保持神经网络精度。
English: NeuroSim V1.5 introduces enhanced modeling and simulation capabilities for analog computing-in-memory accelerators, enabling efficient design space exploration and optimization while maintaining neural network accuracy.

Authors:Madhukar Reddy Vongala, Saurabh Srivastava, Jana Košecká
Title: Compositional Image-Text Matching and Retrieval by Grounding Entities
Abstract:
Vision-language pretraining on large datasets of image-text pairs is one of the main building blocks of current Vision-Language Models. With additional training, these models excel in various downstream tasks, including visual question answering, image captioning, and visual commonsense reasoning. However, a notable weakness of pretrained models like CLIP is their inability to perform entity grounding and compositional image and text matching~\cite{Jiang2024ComCLIP, yang2023amc, Rajabi2023GroundedVSR, learninglocalizeCVPR24}. In this work we propose a novel learning-free zero-shot augmentation of CLIP embeddings that has favorable compositional properties. We compute separate embeddings of sub-images of object entities and relations that are localized by state-of-the-art open-vocabulary detectors and dynamically adjust the baseline global image embedding. The final embedding is obtained by computing a weighted combination of the sub-image embeddings. The resulting embedding is then utilized for similarity computation with the text embedding, resulting in an average 1.5% improvement in image-text matching accuracy on the Visual Genome and SVO Probes datasets~\cite{krishna2017visualgenome, svo}. Notably, the enhanced embeddings demonstrate superior retrieval performance, thus achieving significant gains on the Flickr30K and MS-COCO retrieval benchmarks~\cite{flickr30ke, mscoco}, improving the state-of-the-art Recall@1 by 12% and 0.4%, respectively. Our code is available at https://github.com/madhukarreddyvongala/GroundingCLIP.
中文: 本研究提出一种无需学习的CLIP嵌入增强方法,通过结合开放词汇检测器定位的物体和关系子图像嵌入来改进组合式图文匹配,在多个基准测试中显著提升了准确率。
English: This study introduces a learning-free enhancement to CLIP embeddings that improves compositional image-text matching by incorporating localized object and relation embeddings from open-vocabulary detectors, achieving notable accuracy gains on multiple benchmarks.
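
The idea of adjusting a global image embedding with embeddings of localized sub-images can be sketched as follows. This is an illustration under stated assumptions, not the released GroundingCLIP code: the plain lists stand in for CLIP embeddings, and the function names, the per-region weights, and the mixing factor `alpha` are hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def compose_embedding(global_emb, sub_embs, weights, alpha=0.5):
    """Blend the global image embedding with a weighted mean of sub-image embeddings.

    alpha is an assumed mixing factor; weights are assumed per-region scores
    (for example detector confidences).
    """
    total = sum(weights)
    dim = len(global_emb)
    local = [sum(w * e[i] for w, e in zip(weights, sub_embs)) / total
             for i in range(dim)]
    g, l = normalize(global_emb), normalize(local)
    return normalize([alpha * gi + (1 - alpha) * li for gi, li in zip(g, l)])

# Toy 3-d vectors standing in for CLIP outputs (hypothetical values).
global_emb = [1.0, 0.0, 0.0]                    # whole-image embedding
sub_embs = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]   # crops of detected entities
text_emb = [0.0, 1.0, 0.0]                      # caption embedding

composed = compose_embedding(global_emb, sub_embs, weights=[1.0, 1.0])
score_before = cosine(global_emb, text_emb)
score_after = cosine(composed, text_emb)
```

In this toy case the caption matches the detected entities but not the global embedding, so blending in the sub-image embeddings raises the image-text similarity. No training is involved, which mirrors the learning-free setting described in the abstract.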

Authors:Henry Ndubuaku, Mouad Talhi
Title: Parameter-Efficient Transformer Embeddings
Abstract:
Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs using a Fourier expansion of their normalized values, followed by a lightweight multilayer perceptron (MLP) that captures higher-order interactions. We train standard transformers and our architecture on natural language inference tasks (SNLI and MNLI), and evaluate zero-shot performance on sentence textual similarity (STS-B). Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout. This proof-of-concept study highlights the potential for scalable, memory-efficient language models and motivates further large-scale experimentation based on our findings.
中文: 我们提出的方法用基于傅里叶变换的确定性标记生成和轻量级多层感知机替代传统嵌入层,以更少参数和更快训练实现同等性能,且无需使用dropout技术。
English: Our proposed method replaces traditional embedding layers with a deterministic Fourier-based token generation followed by a lightweight MLP, achieving competitive performance with fewer parameters and faster training while eliminating dropout requirements.
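
A deterministic Fourier embedding followed by a small MLP can be sketched in a few lines. The code below is a minimal illustration, not the authors' architecture: the choice of integer frequencies, the ReLU activation, the layer sizes, and all function names are assumptions, and `dim` is assumed to be even.

```python
import math
import random

def fourier_features(token_id, vocab_size, dim):
    """Deterministic features from a token ID: sin/cos of its normalized value."""
    x = token_id / (vocab_size - 1)        # normalize the ID to [0, 1]
    feats = []
    for k in range(1, dim // 2 + 1):       # frequencies 1..dim/2 (assumed choice)
        feats.append(math.sin(2 * math.pi * k * x))
        feats.append(math.cos(2 * math.pi * k * x))
    return feats

def make_mlp(dim, hidden, seed=0):
    """Randomly initialized two-layer MLP weights (these would be trained in practice)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]
    return w1, w2

def mlp(feats, params):
    w1, w2 = params
    h = [max(0.0, sum(w * f for w, f in zip(row, feats))) for row in w1]  # ReLU
    return [sum(w * v for w, v in zip(row, h)) for row in w2]

def embed(token_id, vocab_size, dim, params):
    return mlp(fourier_features(token_id, vocab_size, dim), params)

params = make_mlp(dim=8, hidden=16)
vec = embed(token_id=42, vocab_size=30000, dim=8, params=params)
num_params = sum(len(row) for layer in params for row in layer)
```

The only learnable parameters are the MLP weights (`2 * dim * hidden` here), so the parameter count does not grow with `vocab_size`, unlike a lookup table of size `vocab_size * dim`.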

Authors:Xingyu Zheng, Yuye Li, Haoran Chu, Yue Feng, Xudong Ma, Jie Luo, Jinyang Guo, Haotong Qin, Michele Magno, Xianglong Liu
Title: An Empirical Study of Qwen3 Quantization
Abstract:
The Qwen series has emerged as a leading family of open-source Large Language Models (LLMs), demonstrating remarkable capabilities in natural language understanding tasks. With the recent release of Qwen3, which exhibits superior performance across diverse benchmarks, there is growing interest in deploying these models efficiently in resource-constrained environments. Low-bit quantization presents a promising solution, yet its impact on Qwen3's performance remains underexplored. This study conducts a systematic evaluation of Qwen3's robustness under various quantization settings, aiming to uncover both opportunities and challenges in compressing this state-of-the-art model. We rigorously assess 5 existing classic post-training quantization techniques applied to Qwen3, spanning bit-widths from 1 to 8 bits, and evaluate their effectiveness across multiple datasets. Our findings reveal that while Qwen3 maintains competitive performance at moderate bit-widths, it experiences notable degradation in linguistic tasks under ultra-low precision, underscoring the persistent hurdles in LLM compression. These results emphasize the need for further research to mitigate performance loss in extreme quantization scenarios. We anticipate that this empirical analysis will provide actionable insights for advancing quantization methods tailored to Qwen3 and future LLMs, ultimately enhancing their practicality without compromising accuracy. Our project is released on https://github.com/Efficient-ML/Qwen3-Quantization and https://huggingface.co/collections/Efficient-ML/qwen3-quantization-68164450decb1c868788cb2b.
Chinese: 本研究系统评估了Qwen3模型在不同低比特量化设置下的性能,发现在中等比特宽度下模型保持竞争力,但在超低精度场景中性能显著下降,这凸显了改进压缩技术的必要性。
English: This study systematically evaluates the Qwen3 model's performance under various low-bit quantization settings, revealing that while it maintains competitiveness at moderate bit-widths, it suffers significant degradation in ultra-low precision scenarios, highlighting the need for improved compression techniques.
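
The effect of bit-width on reconstruction error can be illustrated with a generic round-to-nearest min-max quantizer. This is a sketch for intuition only: it is not claimed to be one of the five post-training quantization methods evaluated in the study, and the random Gaussian weights are a stand-in for real model weights.

```python
import random

def quantize_dequantize(weights, bits):
    """Uniform min-max quantization to 2**bits levels, then back to floats."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1                 # number of steps between lo and hi
    scale = (hi - lo) / levels
    return [round((w - lo) / scale) * scale + lo for w in weights]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(256)]   # hypothetical weight tensor

errors = {bits: mse(weights, quantize_dequantize(weights, bits))
          for bits in (1, 2, 4, 8)}
for bits, err in errors.items():
    print(f"{bits}-bit quantization MSE: {err:.6f}")
```

The error grows quickly as the bit-width shrinks, which is consistent in spirit with the abstract's observation that ultra-low precision is where degradation becomes notable.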

Authors:Yamini Sri Krubha, Aryana Hou, Braden Vester, Web Walker, Xin Wang, Li Lin, Shu Hu
Title: Robust AI-Generated Face Detection with Imbalanced Data
Abstract:
Deepfakes, created using advanced AI techniques such as Variational Autoencoders and Generative Adversarial Networks, have evolved from research and entertainment applications into tools for malicious activities, posing significant threats to digital trust. Current deepfake detection techniques have evolved from CNN-based methods focused on local artifacts to more advanced approaches using vision transformers and multimodal models like CLIP, which capture global anomalies and improve cross-domain generalization. Despite recent progress, state-of-the-art deepfake detectors still face major challenges in handling distribution shifts from emerging generative models and addressing severe class imbalance between authentic and fake samples in deepfake datasets, which limits their robustness and detection accuracy. To address these challenges, we propose a framework that combines dynamic loss reweighting and ranking-based optimization, which achieves superior generalization and performance under imbalanced dataset conditions. The code is available at https://github.com/Purdue-M2/SP_CUP.
中文: 基于先进AI的深度伪造技术严重威胁数字信任,尽管检测方法已从局部特征分析发展到全局异常捕捉,但仍面临分布偏移和类别不平衡的挑战,因此提出结合动态损失重加权和排序优化的新框架,以提升泛化能力和检测性能。
English: Deepfakes created with advanced AI pose serious threats to digital trust, and while detection methods have improved, they still struggle with distribution shifts and class imbalance, leading to the proposal of a new framework combining dynamic loss reweighting and ranking-based optimization for better generalization and performance.

Authors:Tao Zhu, Qi Yu, Xinru Dong, Shiyu Li, Yue Liu, Jinlong Jiang, Lei Shu
Title: ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications
Abstract:
Weakly-supervised video anomaly detection (WS-VAD) using Multiple Instance Learning (MIL) suffers from label ambiguity, hindering discriminative feature learning. We propose ProDisc-VAD, an efficient framework tackling this via two synergistic components. The Prototype Interaction Layer (PIL) provides controlled normality modeling using a small set of learnable prototypes, establishing a robust baseline without being overwhelmed by dominant normal data. The Pseudo-Instance Discriminative Enhancement (PIDE) loss boosts separability by applying targeted contrastive learning exclusively to the most reliable extreme-scoring instances (highest/lowest scores). ProDisc-VAD achieves strong AUCs (97.98% ShanghaiTech, 87.12% UCF-Crime) using only 0.4M parameters, over 800x fewer than recent ViT-based methods like VadCLIP. Code is available at https://github.com/modadundun/ProDisc-VAD.
中文:ProDisc-VAD是一种高效框架,通过结合基于原型的正常性建模和伪实例对比学习,解决了弱监督视频异常检测中的标签模糊问题,并以极少的参数量实现了领先性能。
English: ProDisc-VAD is an efficient framework that addresses label ambiguity in weakly-supervised video anomaly detection by combining prototype-based normality modeling and pseudo-instance contrastive learning, achieving state-of-the-art performance with minimal parameters.
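
The selection step behind applying a loss only to the most reliable extreme-scoring instances can be sketched as below. This covers only the choice of highest- and lowest-scoring instances in a bag; the contrastive loss itself is omitted, and the function name, the parameter `k`, and the example scores are hypothetical.

```python
def select_extreme_instances(scores, k):
    """Return indices of the k highest- and k lowest-scoring instances in a bag."""
    if 2 * k > len(scores):
        raise ValueError("k is too large for this bag")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    highest = order[-k:][::-1]   # most anomalous-looking first
    lowest = order[:k]           # most normal-looking first
    return highest, lowest

# Anomaly scores for the snippets of one video (hypothetical values).
scores = [0.10, 0.90, 0.50, 0.05, 0.80]
pseudo_anomalous, pseudo_normal = select_extreme_instances(scores, k=2)
```

Here `pseudo_anomalous` is `[1, 4]` and `pseudo_normal` is `[3, 0]`; the ambiguous middle-scoring snippet (index 2) is left out, which is the point of restricting the loss to extreme instances under weak labels.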

Authors:Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, Wenji Mao
Title: Adaptive Thinking via Mode Policy Optimization for Social Language Agents
Abstract:
Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current studies. Existing methods either lack this kind of reasoning capability or enforce Long Chain-of-Thought reasoning uniformly across all scenarios, resulting in excessive token usage and inflexible social simulation. To address this, we propose an Adaptive Mode Learning (AML) framework in this paper, aiming to improve the adaptive thinking ability of language agents in dynamic social interactions. To this end, we first identify hierarchical thinking modes ranging from intuitive response to deep deliberation based on the cognitive control theory. We then develop the Adaptive Mode Policy Optimization (AMPO) algorithm to optimize the context-aware mode switching and reasoning. Our framework advances existing research in three key aspects: (1) Multi-granular thinking mode design, (2) Context-aware mode switching across social interaction, and (3) Token-efficient reasoning via depth-adaptive processing. Extensive experiments on social intelligence benchmarks verify that AML achieves 15.6% higher task performance than GPT-4o. Notably, our AMPO outperforms GRPO by 7.0% with 32.8% shorter reasoning chains, demonstrating the advantage of adaptive thinking mode selection and optimization mechanism in AMPO over GRPO's fixed-depth solution.
中文: 本文提出自适应模式学习(AML)框架及自适应模式策略优化(AMPO)算法,通过多粒度思维模式设计和情境感知的模式切换机制,显著提升了语言代理在社交互动中的动态推理能力,实验证明其以更短推理链获得优于现有方法的性能表现。
English: This paper introduces an Adaptive Mode Learning (AML) framework with an Adaptive Mode Policy Optimization (AMPO) algorithm to enhance language agents' dynamic reasoning depth in social interactions, achieving superior performance over existing methods through context-aware mode switching and token-efficient processing.

Authors:Oliver Savolainen, Dur e Najaf Amjad, Roxana Petcu
Title: Interpreting Multilingual and Document-Length Sensitive Relevance Computations in Neural Retrieval Models through Axiomatic Causal Interventions
Abstract:
This reproducibility study analyzes and extends the paper "Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models," which investigates how neural retrieval models encode task-relevant properties such as term frequency. We reproduce key experiments from the original paper, confirming that information on query terms is captured in the model encoding. We extend this work by applying activation patching to Spanish and Chinese datasets and by exploring whether document-length information is encoded in the model as well. Our results confirm that the designed activation patching method can isolate the behavior to specific components and tokens in neural retrieval models. Moreover, our findings indicate that the location of term frequency generalizes across languages and that in later layers, the information for sequence-level tasks is represented in the CLS token. The results highlight the need for further research into interpretability in information retrieval and reproducibility in machine learning research. Our code is available at https://github.com/OliverSavolainen/axiomatic-ir-reproduce.
Chinese: 这项可复现性研究证实了神经检索模型编码词频信息,并将研究扩展至西班牙语和中文数据集,揭示了跨语言泛化特性以及CLS标记在序列级任务中的作用。
English: This reproducibility study confirms that neural retrieval models encode term frequency information and extends the findings to Spanish and Chinese datasets, revealing cross-linguistic generalization and the role of the CLS token in sequence-level tasks.

Authors:Muyao Zhong, Yushi Lin, Peng Yang
Title: Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking
Abstract:
The Limit Order Book (LOB), the most fundamental data of the financial market, provides a fine-grained view of market dynamics while posing significant challenges for established deep models due to its strong autocorrelation, cross-feature constraints, and feature scale disparity. Existing approaches often tightly couple representation learning with specific downstream tasks in an end-to-end manner, failing to analyze the learned representations individually and explicitly, which limits their reusability and generalization. This paper conducts the first systematic comparative study of LOB representation learning, aiming to identify an effective way of extracting transferable, compact features that capture essential LOB properties. We introduce LOBench, a standardized benchmark with real China A-share market data, offering curated datasets, unified preprocessing, consistent evaluation metrics, and strong baselines. Extensive experiments validate the sufficiency and necessity of LOB representations for various downstream tasks and highlight their advantages over both the traditional task-specific end-to-end models and the advanced representation learning models for general time series. Our work establishes a reproducible framework and provides clear guidelines for future research. Datasets and code will be publicly available at https://github.com/financial-simulation-lab/LOBench.
中文摘要:本文提出了首个用于系统比较限价订单簿表示学习的标准化基准LOBench,验证了其提取可迁移特征的有效性,这些特征在多种金融任务中优于现有模型。
English Summary: This paper introduces LOBench, the first standardized benchmark for systematic comparison of Limit Order Book representation learning, demonstrating its effectiveness in extracting transferable features that outperform existing models across various financial tasks.

Authors:Xiaorui Zhao, Xinyue Zhou, Peibei Cao, Junyu Lou, Shuhang Gu
Title: HiLLIE: Human-in-the-Loop Training for Low-Light Image Enhancement
Abstract:
Developing effective approaches to generate enhanced results that align well with human visual preferences for high-quality well-lit images remains a challenge in low-light image enhancement (LLIE). In this paper, we propose a human-in-the-loop LLIE training framework that improves the visual quality of unsupervised LLIE model outputs through iterative training stages, named HiLLIE. At each stage, we introduce human guidance into the training process through efficient visual quality annotations of enhanced outputs. Subsequently, we employ a tailored image quality assessment (IQA) model to learn human visual preferences encoded in the acquired labels, which is then utilized to guide the training process of an enhancement model. With only a small amount of pairwise ranking annotations required at each stage, our approach continually improves the IQA model's capability to simulate human visual assessment of enhanced outputs, thus leading to visually appealing LLIE results. Extensive experiments demonstrate that our approach significantly improves unsupervised LLIE model performance in terms of both quantitative and qualitative performance. The code and collected ranking dataset will be available at https://github.com/LabShuHangGU/HiLLIE.
中文摘要:本文提出HiLLIE人机交互训练框架,通过少量排序标注和定制化质量评估模型迭代学习人类视觉偏好,显著提升无监督低光照图像增强的视觉质量。
English Summary: This paper introduces HiLLIE, a human-in-the-loop training framework that iteratively improves low-light image enhancement by incorporating human visual preferences through minimal ranking annotations and a tailored quality assessment model.

Authors:Zhong Guan, Likang Wu, Hongke Zhao, Ming He, Jianpin Fan
Title: Attention Mechanisms Perspective: Exploring LLM Processing of Graph-Structured Data
Abstract:
Attention mechanisms are critical to the success of large language models (LLMs), driving significant advancements in multiple fields. However, for graph-structured data, which requires emphasis on topological connections, they fall short compared to message-passing mechanisms on fixed links, such as those employed by Graph Neural Networks (GNNs). This raises a question: "Does attention fail for graphs in natural language settings?" Motivated by these observations, we embarked on an empirical study from the perspective of attention mechanisms to explore how LLMs process graph-structured data. The goal is to gain deeper insights into the attention behavior of LLMs over graph structures. We uncovered unique phenomena regarding how LLMs apply attention to graph-structured data and analyzed these findings to improve the modeling of such data by LLMs. The primary findings of our research are: 1) While LLMs can recognize graph data and capture text-node interactions, they struggle to model inter-node relationships within graph structures due to inherent architectural constraints. 2) The attention distribution of LLMs across graph nodes does not align with ideal structural patterns, indicating a failure to adapt to graph topology nuances. 3) Neither fully connected attention nor fixed connectivity is optimal; each has specific limitations in its application scenarios. Instead, intermediate-state attention windows improve LLM training performance and seamlessly transition to fully connected windows during inference. Source code (LLM4Exploration): https://github.com/millioniron/LLM_exploration
中文: 大语言模型的注意力机制因架构限制难以处理图结构数据,无法有效建模节点间关系和适应拓扑特征,但中间态注意力窗口能提升训练效果并平滑过渡到推理阶段。
English: Attention mechanisms in LLMs struggle with graph-structured data due to architectural limitations, failing to model inter-node relationships and adapt to topological nuances, but intermediate attention windows offer improved training and inference performance.

Authors:Xiao Zhou, Zhongxiang Zhao, Hanze Guo
Title: Tricolore: Multi-Behavior User Profiling for Enhanced Candidate Generation in Recommender Systems
Abstract:
Online platforms aggregate extensive user feedback across diverse behaviors, providing a rich source for enhancing user engagement. Traditional recommender systems, however, typically optimize for a single target behavior and represent user preferences with a single vector, limiting their ability to handle multiple important behaviors or optimization objectives. This conventional approach also struggles to capture the full spectrum of user interests, resulting in a narrow item pool during candidate generation. To address these limitations, we present Tricolore, a versatile multi-vector learning framework that uncovers connections between different behavior types for more robust candidate generation. Tricolore's adaptive multi-task structure is also customizable to specific platform needs. To manage the variability in sparsity across behavior types, we incorporate a behavior-wise multi-view fusion module that dynamically enhances learning. Moreover, a popularity-balanced strategy ensures the recommendation list balances accuracy with item popularity, fostering diversity and improving overall performance. Extensive experiments on public datasets demonstrate Tricolore's effectiveness across various recommendation scenarios, from short video platforms to e-commerce. By leveraging a shared base embedding strategy, Tricolore also significantly improves the performance for cold-start users. The source code is publicly available at: https://github.com/abnering/Tricolore.
中文: Tricolore是一种多向量学习框架,通过关联不同用户行为、动态处理稀疏性和平衡流行度,有效提升推荐系统的性能,适用于多种平台场景。
English: Tricolore is a versatile multi-vector learning framework that enhances recommendation systems by connecting diverse user behaviors, dynamically managing sparsity, and balancing popularity for improved performance across various platforms.

Authors:Zeyu Zhang, Quanyu Dai, Xu Chen, Rui Li, Zhongyang Li, Zhenhua Dong
Title: MemEngine: A Unified and Modular Library for Developing Advanced Memory of LLM-based Agents
Abstract:
Recently, large language model based (LLM-based) agents have been widely applied across various fields. As a critical component, their memory capabilities have attracted significant interest from both industrial and academic communities. Although many advanced memory models have been proposed in recent research, there remains a lack of unified implementations under a general framework. To address this issue, we develop a unified and modular library for developing advanced memory models of LLM-based agents, called MemEngine. Based on our framework, we implement a wide range of memory models from recent research works. Additionally, our library facilitates convenient and extensible memory development, and offers user-friendly and pluggable memory usage. To benefit the community, we have made our project publicly available at https://github.com/nuster1128/MemEngine.
Chinese: MemEngine 是一个为解决基于大语言模型的智能体高级记忆模型缺乏统一框架而开发的模块化库,它支持可扩展开发和即插即用功能,并已公开共享。
English: MemEngine is a unified and modular library developed to address the lack of a general framework for implementing advanced memory models in LLM-based agents, offering extensible development and pluggable usage while being publicly available.

Authors:Joy Lim Jia Yin, Daniel Zhang-Li, Jifan Yu, Haoxuan Li, Shangqing Tu, Yuanchun Wang, Zhiyuan Liu, Huiqin Liu, Lei Hou, Juanzi Li, Bin Xu
Title: LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning
Abstract:
Evaluating the quality of slide-based multimedia instruction is challenging. Existing methods like manual assessment, reference-based metrics, and large language model evaluators face limitations in scalability, context capture, or bias. In this paper, we introduce LecEval, an automated metric grounded in Mayer's Cognitive Theory of Multimedia Learning, to evaluate multimodal knowledge acquisition in slide-based learning. LecEval assesses effectiveness using four rubrics: Content Relevance (CR), Expressive Clarity (EC), Logical Structure (LS), and Audience Engagement (AE). We curate a large-scale dataset of over 2,000 slides from more than 50 online course videos, annotated with fine-grained human ratings across these rubrics. A model trained on this dataset demonstrates superior accuracy and adaptability compared to existing metrics, bridging the gap between automated and human assessments. We release our dataset and toolkits at https://github.com/JoylimJY/LecEval.
中文: LecEval基于梅耶多媒体学习认知理论,通过四个评估维度自动评价幻灯片教学效果,并在大规模标注数据集上展现出优于现有方法的准确性和适应性。
English: LecEval introduces an automated evaluation metric based on Mayer's Cognitive Theory to assess slide-based multimedia instruction through four rubrics, demonstrating superior accuracy and adaptability on a large-scale annotated dataset compared to existing methods.

Authors:Volodymyr Havrylov, Haiwen Huang, Dan Zhang, Andreas Geiger
Title: Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation
Abstract:
Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks. As VFMs' popularity grows, there is an increasing interest in understanding their effectiveness for dense prediction tasks. However, VFMs typically produce low-resolution features, limiting their direct applicability in this context. One way to tackle this limitation is by employing a task-agnostic feature upsampling module that refines the resolution of VFM features. To assess the effectiveness of this approach, we investigate Interactive Segmentation (IS) as a novel benchmark for evaluating feature upsampling methods on VFMs. Due to its inherent multimodal input, consisting of an image and a set of user-defined clicks, as well as its dense mask output, IS creates a challenging environment that demands comprehensive visual scene understanding. Our benchmarking experiments show that selecting appropriate upsampling strategies significantly improves VFM feature quality. The code is released at https://github.com/havrylovv/iSegProbe.
中文: 视觉基础模型通过任务无关的特征上采样模块提升低分辨率特征,并以交互式分割为基准验证了合适的上采样策略能显著改善特征质量。
English: Vision Foundation Models (VFMs) are enhanced for dense prediction tasks by employing task-agnostic feature upsampling, with interactive segmentation serving as a benchmark to demonstrate that proper upsampling strategies significantly improve feature quality.

Authors:Yi Han
Title: Lightweight Defense Against Adversarial Attacks in Time Series Classification
Abstract:
As time series classification (TSC) gains prominence, ensuring robust TSC models against adversarial attacks is crucial. While adversarial defense is well-studied in Computer Vision (CV), the TSC field has primarily relied on adversarial training (AT), which is computationally expensive. In this paper, five data augmentation-based defense methods tailored for time series are developed, with the most computationally intensive method among them increasing the computational resources by only 14.07% compared to the original TSC model. Moreover, the deployment process for these methods is straightforward. By leveraging these advantages of our methods, we create two combined methods. One of these methods is an ensemble of all the proposed techniques, which not only provides better defense performance than PGD-based AT but also enhances the generalization ability of TSC models. Moreover, the computational resources required for our ensemble are less than one-third of those required for PGD-based AT. These methods advance robust TSC in data mining. Furthermore, as foundation models are increasingly explored for time series feature learning, our work provides insights into integrating data augmentation-based adversarial defense with large-scale pre-trained models in future research.
Chinese: 本文针对时间序列分类提出了五种基于数据增强的高效防御方法,其中集成方法不仅防御性能优于基于PGD的对抗训练,且计算资源需求减少三分之二以上,同时增强了模型的泛化能力。
English: This paper introduces five computationally efficient data augmentation-based defense methods for time series classification, including an ensemble approach that outperforms PGD-based adversarial training with less than one-third of the computational cost while improving model generalization.

Authors:Shuhang Xun, Sicheng Tao, Jungang Li, Yibo Shi, Zhixin Lin, Zhanhui Zhu, Yibo Yan, Hanqian Li, Linghao Zhang, Shikang Wang, Yixin Liu, Hanbo Zhang, Ying Ma, Xuming Hu
Title: RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
Abstract:
Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experiment results show open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: https://github.com/LJungang/RTV-Bench.
中文: RTV-Bench作为评估多模态大语言模型实时视频分析能力的新基准,通过动态问题设计和多维评估发现现有模型需优化架构以适应动态场景,开源实时模型虽优于离线版本但仍落后于顶尖专有模型。
English: RTV-Bench is a new benchmark designed to evaluate Multimodal Large Language Models' real-time video analysis capabilities through evolving questions and multi-dimensional assessment, revealing that current models need architectural improvements for better performance in dynamic environments.

Authors:Branko Brkljač, Vladimir Kalušev, Branislav Popović, Milan Sečujski
Title: Transforming faces into video stories -- VideoFace2.0
Abstract:
Face detection and face recognition have been a focus of the vision community since its very beginnings. Inspired by the success of the original Videoface digitizer, a pioneering device that allowed users to capture video signals from any source, we have designed an advanced video analytics tool to efficiently create structured video stories, i.e., identity-based information catalogs. VideoFace2.0 is the developed system for spatial and temporal localization of each unique face in the input video, i.e., face re-identification (ReID), which also allows cataloging and characterization of the faces and creation of structured video outputs for downstream tasks. The developed near real-time solution is primarily designed for application scenarios involving TV production and media analysis, and as an efficient tool for creating the large video datasets necessary for training machine learning (ML) models in challenging vision tasks such as lip reading and multimodal speech recognition. The conducted experiments confirm the applicability of the proposed face ReID algorithm, which combines the concepts of face detection, face recognition and passive tracking-by-detection to achieve robust and efficient face ReID. The system is envisioned as a compact and modular extension of existing video production equipment. The presented results are based on a test implementation that achieves 18-25 fps on a consumer-type notebook. Ablation experiments also confirmed that the proposed algorithm brings a relative reduction in the number of false identities in the range of 73%-93%. We hope that the presented work and shared code implementation will stimulate further interest in the development of similar, application-specific video analysis tools, and lower the entry barrier for the production of high-quality multi-modal datasets in the future.
Chinese: VideoFace2.0系统是一种先进的视频分析工具,通过结合人脸检测、识别和追踪技术实现实时人脸重识别,能够高效创建结构化视频输出,适用于电视制作、媒体分析和机器学习数据集构建等场景。
English: The VideoFace2.0 system is an advanced video analytics tool that performs near real-time face re-identification by combining face detection, recognition, and tracking, enabling efficient cataloging and structured video output for applications in TV production, media analysis, and machine learning dataset creation.

Authors:Rui Lv, Zaixi Zhang, Kai Zhang, Qi Liu, Weibo Gao, Jiawei Liu, Jiaxia Yan, Linan Yue, Fangzhou Yao
Title: GraphPrompter: Multi-stage Adaptive Prompt Optimization for Graph In-Context Learning
Abstract:
Graph In-Context Learning, with the ability to adapt pre-trained graph models to novel and diverse downstream graphs without updating any parameters, has gained much attention in the community. The key to graph in-context learning is to perform downstream graphs conditioned on chosen prompt examples. Existing methods randomly select subgraphs or edges as prompts, leading to noisy graph prompts and inferior model performance. Additionally, due to the gap between pre-training and testing graphs, when the number of classes in the testing graphs is much greater than that in the training, the in-context learning ability will also significantly deteriorate. To tackle the aforementioned challenges, we develop a multi-stage adaptive prompt optimization method GraphPrompter, which optimizes the entire process of generating, selecting, and using graph prompts for better in-context learning capabilities. Firstly, Prompt Generator introduces a reconstruction layer to highlight the most informative edges and reduce irrelevant noise for graph prompt construction. Furthermore, in the selection stage, Prompt Selector employs the $k$-nearest neighbors algorithm and pre-trained selection layers to dynamically choose appropriate samples and minimize the influence of irrelevant prompts. Finally, we leverage a Prompt Augmenter with a cache replacement strategy to enhance the generalization capability of the pre-trained model on new datasets. Extensive experiments show that GraphPrompter effectively enhances the in-context learning ability of graph models. On average across all the settings, our approach surpasses the state-of-the-art baselines by over 8%. Our code is released at https://github.com/karin0018/GraphPrompter.
中文摘要:GraphPrompter提出了一种多阶段自适应提示优化方法,通过生成、选择和增强图提示来提升图上下文学习能力,相比现有方法性能平均提升超过8%。
English Summary: GraphPrompter introduces a multi-stage adaptive prompt optimization method to enhance graph in-context learning by generating, selecting, and augmenting prompts, achieving over 8% performance improvement over existing methods.

Authors:Yancheng Chen, Wenguo Yang, Zhipeng Jiang
Title: Wide & Deep Learning for Node Classification
Abstract:
Wide & Deep, a simple yet effective learning architecture for recommendation systems developed by Google, has had a significant impact in both academia and industry due to its combination of the memorization ability of generalized linear models and the generalization ability of deep models. Graph convolutional networks (GCNs) remain dominant in node classification tasks; however, recent studies have highlighted issues such as heterophily and expressiveness, which focus on graph structure while seemingly neglecting the potential role of node features. In this paper, we propose a flexible framework GCNIII, which leverages the Wide & Deep architecture and incorporates three techniques: Intersect memory, Initial residual and Identity mapping. We provide comprehensive empirical evidence showing that GCNIII can more effectively balance the trade-off between over-fitting and over-generalization on various semi- and full- supervised tasks. Additionally, we explore the use of large language models (LLMs) for node feature engineering to enhance the performance of GCNIII in cross-domain node classification tasks. Our implementation is available at https://github.com/CYCUCAS/GCNIII.
中文: 本文提出GCNIII框架,结合Wide & Deep架构与三项技术,有效平衡节点分类任务中的过拟合与泛化问题,并探索使用大语言模型优化特征工程。
English: This paper introduces GCNIII, a flexible framework that integrates the Wide & Deep architecture with three techniques to balance over-fitting and over-generalization in node classification tasks, while also exploring LLMs for feature enhancement.

Authors:Zeyuan Ma, Zhiguang Cao, Zhou Jiang, Hongshu Guo, Yue-Jiao Gong
Title: Meta-Black-Box-Optimization through Offline Q-function Learning
Abstract:
Recent progress in Meta-Black-Box-Optimization (MetaBBO) has demonstrated that using RL to learn a meta-level policy for dynamic algorithm configuration (DAC) over an optimization task distribution could significantly enhance the performance of the low-level BBO algorithm. However, the online learning paradigms in existing works make the efficiency of MetaBBO problematic. To address this, we propose an offline learning-based MetaBBO framework, termed Q-Mamba, to attain both effectiveness and efficiency in MetaBBO. Specifically, we first transform the DAC task into a long-sequence decision process. This allows us to further introduce an effective Q-function decomposition mechanism to reduce the learning difficulty within the intricate algorithm configuration space. Under this setting, we propose three novel designs to meta-learn the DAC policy from offline data: we first propose a novel collection strategy for constructing an offline DAC experience dataset with balanced exploration and exploitation. We then establish a decomposition-based Q-loss that incorporates conservative Q-learning to promote stable offline learning from the offline dataset. To further improve the offline learning efficiency, we equip our work with a Mamba architecture, which improves the effectiveness and efficiency of long-sequence learning through a selective state model and a hardware-aware parallel scan, respectively. Through extensive benchmarking, we observe that Q-Mamba achieves competitive or even superior performance to prior online/offline baselines, while significantly improving the training efficiency of existing online baselines. We provide the source code of Q-Mamba at https://github.com/MetaEvo/Q-Mamba.
中文摘要:本文提出Q-Mamba离线元黑盒优化框架,通过将动态算法配置转化为长序列决策过程,并采用平衡探索的数据收集策略、分解Q学习机制及Mamba架构,在保证性能的同时显著提升了训练效率。
English Summary: The paper introduces Q-Mamba, an offline MetaBBO framework that enhances both effectiveness and efficiency by transforming dynamic algorithm configuration into a long-sequence decision process and incorporating novel designs including a balanced dataset collection strategy, decomposition-based Q-learning, and Mamba architecture for improved learning.

Authors:Zhenxing Mi, Ping Yin, Xue Xiao, Dan Xu
Title: Learning Heterogeneous Mixture of Scene Experts for Large-scale Neural Radiance Fields
Abstract:
Recent NeRF methods on large-scale scenes have underlined the importance of scene decomposition for scalable NeRFs. Although achieving reasonable scalability, there are several critical problems remaining unexplored, i.e., learnable decomposition, modeling scene heterogeneity, and modeling efficiency. In this paper, we introduce Switch-NeRF++, a Heterogeneous Mixture of Hash Experts (HMoHE) network that addresses these challenges within a unified framework. It is a highly scalable NeRF that learns heterogeneous decomposition and heterogeneous NeRFs efficiently for large-scale scenes in an end-to-end manner. In our framework, a gating network learns to decompose scenes and allocates 3D points to specialized NeRF experts. This gating network is co-optimized with the experts by our proposed Sparsely Gated Mixture of Experts (MoE) NeRF framework. We incorporate a hash-based gating network and distinct heterogeneous hash experts. The hash-based gating efficiently learns the decomposition of the large-scale scene. The distinct heterogeneous hash experts consist of hash grids of different resolution ranges, enabling effective learning of the heterogeneous representation of different scene parts. These design choices make our framework an end-to-end and highly scalable NeRF solution for real-world large-scale scene modeling to achieve both quality and efficiency. We evaluate our accuracy and scalability on existing large-scale NeRF datasets and a new dataset with very large-scale scenes ($>6.5km^2$) from UrbanBIS. Extensive experiments demonstrate that our approach can be easily scaled to various large-scale scenes and achieve state-of-the-art scene rendering accuracy. Furthermore, our method exhibits significant efficiency, with an 8x acceleration in training and a 16x acceleration in rendering compared to Switch-NeRF. Codes will be released at https://github.com/MiZhenxing/Switch-NeRF.
中文: Switch-NeRF++提出了一种异构哈希专家混合框架,能高效学习大规模场景的分解和异构表示,在实现顶尖渲染精度的同时显著提升了训练和渲染速度。
English: Switch-NeRF++ introduces a Heterogeneous Mixture of Hash Experts framework that efficiently learns scene decomposition and heterogeneous representations for large-scale scenes, achieving state-of-the-art rendering accuracy with significant training and rendering speed improvements.
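The routing idea described in this abstract, a gating network that allocates each 3D point to one specialized expert, can be illustrated with a toy top-1 router. This is only a minimal sketch under stated assumptions: the function names (`softmax`, `route_top1`, `dispatch`) are hypothetical, the gate logits are given as plain lists instead of being produced by a hash-based gating network, and nothing here reflects the authors' actual implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(gate_logits):
    """For each point's gate logits, return (expert index, gate probability).

    Top-1 routing is an assumption made for this sketch.
    """
    routes = []
    for logits in gate_logits:
        probs = softmax(logits)
        idx = max(range(len(probs)), key=probs.__getitem__)
        routes.append((idx, probs[idx]))
    return routes

def dispatch(points, routes, num_experts):
    """Group points by the expert they were routed to."""
    buckets = [[] for _ in range(num_experts)]
    for point, (idx, _) in zip(points, routes):
        buckets[idx].append(point)
    return buckets

# Toy usage: three points, two experts (logits are made-up numbers).
routes = route_top1([[2.0, 0.0], [0.0, 3.0], [1.0, 0.5]])
buckets = dispatch(["p0", "p1", "p2"], routes, num_experts=2)
```

In a sparsely gated setup, each expert would then process only its own bucket, and the gate probability would typically scale the expert output so that the gate can be trained jointly with the experts.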

Authors:Jiayi Cheng, Can Gao, Jie Zhou, Jiajun Wen, Tao Dai, Jinbao Wang
Title: MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection
Abstract:
3D Anomaly Detection (AD) is a promising means of controlling the quality of manufactured products. However, existing methods typically require carefully training a task-specific model for each category independently, leading to high cost, low efficiency, and weak generalization. Therefore, this paper presents a novel unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) that aims to utilize both local and global geometry-aware information to reconstruct normal representations of all categories. First, to learn robust and generalized features of different categories, we propose an adaptive geometry-aware masked attention module that extracts geometry variation information to guide mask attention. Then, we introduce a local geometry-aware encoder reinforced by the improved mask attention to encode group-level feature tokens. Finally, we design a global query decoder that utilizes point cloud position embeddings to improve the decoding process and reconstruction ability. This leads to local and global geometry-aware reconstructed feature tokens for the AD task. MC3D-AD is evaluated on two publicly available datasets, Real3D-AD and Anomaly-ShapeNet, and exhibits significant superiority over current state-of-the-art single-category methods, achieving 3.1% and 9.3% improvement in object-level AUROC over Real3D-AD and Anomaly-ShapeNet, respectively. The code is available at https://github.com/iCAN-SZU/MC3D-AD.
中文: 本文提出了一种统一的多类别三维异常检测模型,通过结合局部和全局几何感知信息重构各类别的正常表示,在基准数据集上相比现有单类别方法展现出显著优越性能。
English: This paper introduces a unified multi-category 3D anomaly detection model that leverages local and global geometry-aware information to reconstruct normal representations across categories, demonstrating superior performance over existing single-category methods on benchmark datasets.

Authors:Leyi Yan, Linda Wang, Sihang Liu, Yi Ding
Title: EnsembleCI: Ensemble Learning for Carbon Intensity Forecasting
Abstract:
Carbon intensity (CI) measures the average carbon emissions generated per unit of electricity, making it a crucial metric for quantifying and managing the environmental impact. Accurate CI predictions are vital for minimizing carbon footprints, yet the state-of-the-art method (CarbonCast) falls short due to its inability to address regional variability and lack of adaptability. To address these limitations, we introduce EnsembleCI, an adaptive, end-to-end ensemble learning-based approach for CI forecasting. EnsembleCI combines weighted predictions from multiple sublearners, offering enhanced flexibility and regional adaptability. In evaluations across 11 regional grids, EnsembleCI consistently surpasses CarbonCast, achieving the lowest mean absolute percentage error (MAPE) in almost all grids and improving prediction accuracy by an average of 19.58%. While performance still varies across grids due to inherent regional diversity, EnsembleCI reduces variability and exhibits greater robustness in long-term forecasting compared to CarbonCast and identifies region-specific key features, underscoring its interpretability and practical relevance. These findings position EnsembleCI as a more accurate and reliable solution for CI forecasting. EnsembleCI source code and data used in this paper are available at https://github.com/emmayly/EnsembleCI.
中文: EnsembleCI是一种自适应集成学习方法,相比CarbonCast将碳强度预测准确率平均提高19.58%,在区域适应性和鲁棒性方面表现更优。
English: EnsembleCI is an adaptive ensemble learning method that improves carbon intensity forecasting accuracy by 19.58% on average and demonstrates superior regional adaptability and robustness compared to CarbonCast.
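The two quantities this abstract relies on, the mean absolute percentage error (MAPE) and a weighted combination of sublearner predictions, can be written out in a few lines. This is a minimal sketch: the function names are hypothetical, the numbers are made up, and the abstract does not say how EnsembleCI chooses its weights, so the weights here are simply passed in by the caller.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def weighted_ensemble(predictions, weights):
    """Combine per-sublearner forecasts with normalized weights.

    predictions: one forecast list per sublearner, all of the same length.
    weights: one non-negative weight per sublearner (hypothetical values).
    """
    total = sum(weights)
    horizon = len(predictions[0])
    return [
        sum(w * p[t] for w, p in zip(weights, predictions)) / total
        for t in range(horizon)
    ]

# Toy usage with made-up carbon intensity values.
actual = [100.0, 200.0]
sublearner_a = [110.0, 180.0]
sublearner_b = [95.0, 210.0]
combined = weighted_ensemble([sublearner_a, sublearner_b], weights=[1.0, 3.0])
error = mape(actual, combined)
```

One common heuristic, not confirmed by the abstract, is to set each weight inversely proportional to the sublearner's validation MAPE, so that more accurate sublearners contribute more to the combined forecast.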

Authors:Qi Yang, Le Yang, Geert Van Der Auwera, Zhu Li
Title: HybridGS: High-Efficiency Gaussian Splatting Data Compression using Dual-Channel Sparse Representation and Point Cloud Encoder
Abstract:
Most existing 3D Gaussian Splatting (3DGS) compression schemes focus on producing compact 3DGS representation via implicit data embedding. They have long coding times and highly customized data format, making it difficult for widespread deployment. This paper presents a new 3DGS compression framework called HybridGS, which takes advantage of both compact generation and standardized point cloud data encoding. HybridGS first generates compact and explicit 3DGS data. A dual-channel sparse representation is introduced to supervise the primitive position and feature bit depth. It then utilizes a canonical point cloud encoder to perform further data compression and form standard output bitstreams. A simple and effective rate control scheme is proposed to pivot the interpretable data compression scheme. At the current stage, HybridGS does not include any modules aimed at improving 3DGS quality during generation. But experiment results show that it still provides comparable reconstruction performance against state-of-the-art methods, with evidently higher encoding and decoding speed. The code is publicly available at https://github.com/Qi-Yangsjtu/HybridGS.
Chinese: HybridGS提出了一种新的3DGS压缩框架,通过结合紧凑生成和标准化点云编码,在保持重建质量的同时显著提升了编解码速度。
English: HybridGS introduces a novel 3DGS compression framework that combines compact generation with standardized point cloud encoding, achieving comparable reconstruction quality and faster processing speeds than current methods.

Authors:Xingyu Miao, Haoran Duan, Yang Long, Jungong Han
Title: Rethinking Score Distilling Sampling for 3D Editing and Generation
Abstract:
Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often cannot generate new 3D assets effectively. In this work, we observe that the processes of generation and editing within SDS and its variants have unified underlying gradient terms. Building on this insight, we propose Unified Distillation Sampling (UDS), a method that seamlessly integrates both the generation and editing of 3D assets. Essentially, UDS refines the gradient terms used in vanilla SDS methods, unifying them to support both tasks. Extensive experiments demonstrate that UDS not only outperforms baseline methods in generating 3D assets with richer details but also excels in editing tasks, thereby bridging the gap between 3D generation and editing. The code is available at https://github.com/xingy038/UDS.
中文: UDS通过优化SDS中的梯度项,统一了3D生成与编辑功能,在两项任务中均超越基线方法,呈现更丰富的细节。
English: UDS unifies 3D generation and editing by refining gradient terms in SDS, outperforming baselines in both tasks with richer details.

Authors:Siddharth Kothari, Srinivasan Murali, Sankalp Kothari, Ujjwal Verma, Jaya Sreevalsan-Nair
Title: Adversarial Robustness of Deep Learning Models for Inland Water Body Segmentation from SAR Images
Abstract:
Inland water body segmentation from Synthetic Aperture Radar (SAR) images is an important task needed for several applications, such as flood mapping. While SAR sensors capture data in all-weather conditions as high-resolution images, differentiating water and water-like surfaces from SAR images is not straightforward. Inland water bodies, such as large river basins, have complex geometry, which adds to the challenge of segmentation. U-Net is a widely used deep learning model for land-water segmentation of SAR images. In practice, manual annotation is often used to generate the corresponding water masks as ground truth. Manual annotation of the images is prone to label noise owing to data poisoning attacks, especially due to complex geometry. In this work, we simulate manual errors in the form of adversarial attacks on the U-Net model and study the robustness of the model to human errors in annotation. Our results indicate that U-Net can tolerate a certain level of corruption before its performance drops significantly. This finding highlights the crucial role that the quality of manual annotations plays in determining the effectiveness of the segmentation model. The code and the new dataset, along with adversarial examples for robust training, are publicly available. (GitHub link - https://github.com/GVCL/IWSeg-SAR-Poison.git)
中文摘要:本研究通过对抗性攻击模拟人工标注错误,探讨了U-Net模型在SAR图像内陆水域分割中的鲁棒性,发现模型虽能承受一定程度的数据污染,但标注质量对分割效果具有决定性影响。
English Summary: This study investigates the robustness of the U-Net model for inland water segmentation in SAR images against simulated human annotation errors introduced through adversarial attacks, finding that while the model can tolerate some corruption, annotation quality critically impacts segmentation performance.

Authors:Jiakun Yan, Marc Snir
Title: LCI: a Lightweight Communication Interface for Efficient Asynchronous Multithreaded Communication
Abstract:
The evolution of architectures, programming models, and algorithms is driving communication towards greater asynchrony and concurrency, usually in multithreaded environments. We present LCI, a communication library designed for efficient asynchronous multithreaded communication. LCI provides a concise interface that supports common point-to-point primitives and diverse completion mechanisms, along with flexible controls for incrementally fine-tuning communication resources and runtime behavior. It features a threading-efficient runtime built on atomic data structures, fine-grained non-blocking locks, and low-level network insights. We evaluate LCI on both Infiniband and Slingshot-11 clusters with microbenchmarks and two application-level benchmarks. Experimental results show that LCI significantly outperforms existing communication libraries in various multithreaded scenarios, achieving performance that exceeds the traditional multi-process execution mode and unlocking new possibilities for emerging programming models and applications. LCI is open-source and available at https://github.com/uiuc-hpc/lci.
中文: LCI是一个高效的异步多线程通信库,在多线程环境中显著优于现有解决方案,并为新兴编程模型开辟了新可能。
English: LCI is an efficient asynchronous multithreaded communication library that outperforms existing solutions in multithreaded environments and enables new programming possibilities.

Authors:Anthony Nguyen, Wenjun Lin
Title: Intra-Layer Recurrence in Transformers for Language Modeling
Abstract:
Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by reprocessing layers multiple times, they often apply recurrence indiscriminately across entire blocks of layers. In this work, we investigate Intra-Layer Recurrence (ILR), a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results. These findings suggest that ILR offers a promising direction for optimizing recurrent structures in transformer architectures.
中文摘要:层内循环(ILR)通过在单次前向传播中选择性地对单个层应用循环机制,实验表明优先对早期层进行迭代可获得最佳性能。
English Summary: Intra-Layer Recurrence (ILR) selectively applies recurrence to individual transformer layers within a single forward pass, with experiments showing optimal performance when prioritizing earlier layers for iteration.
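The control flow behind selective per-layer recurrence can be sketched without any deep learning framework. In this minimal illustration, each "layer" is an ordinary Python callable, and `reuse_map` is a hypothetical name for a list giving how many times each layer is applied within a single forward pass. The toy layers are stand-ins for transformer blocks, not the authors' code.

```python
def forward_with_ilr(layers, reuse_map, x):
    """Apply layers[i] reuse_map[i] times in a row, then move to the next layer.

    Reusing a layer adds compute but no new parameters, which is the
    trade-off the abstract describes.
    """
    if len(layers) != len(reuse_map):
        raise ValueError("reuse_map needs one entry per layer")
    for layer, repeats in zip(layers, reuse_map):
        for _ in range(repeats):
            x = layer(x)
    return x

# Toy usage: two stand-in "layers" operating on a number.
toy_layers = [lambda v: v + 1, lambda v: v * 2]
early_heavy = forward_with_ilr(toy_layers, [2, 1], 1)  # extra iteration on the first layer
late_heavy = forward_with_ilr(toy_layers, [1, 2], 1)   # extra iteration on the second layer
```

Both calls use the same two layers and the same total of three layer applications. Only the allocation of iterations differs, which is the variable the abstract reports on when it says that giving more iterations to earlier layers worked best.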

Authors:Janak Kapuriya, Manit Kaushik, Debasis Ganguly, Sumit Bhatia
Title: Exploring the Role of Diversity in Example Selection for In-Context Learning
Abstract:
In-Context Learning (ICL) has gained prominence due to its ability to perform tasks without requiring extensive training data and its robustness to noisy labels. A typical ICL workflow involves selecting localized examples relevant to a given input using sparse or dense embedding-based similarity functions. However, relying solely on similarity-based selection may introduce topical biases in the retrieved contexts, potentially leading to suboptimal downstream performance. We posit that reranking the retrieved context to enhance topical diversity can improve downstream task performance. To achieve this, we leverage maximum marginal relevance (MMR) which balances topical similarity with inter-example diversity. Our experimental results demonstrate that diversifying the selected examples leads to consistent improvements in downstream performance across various context sizes and similarity functions. The implementation of our approach is made available at https://github.com/janak11111/Diverse-ICL.
中文: 通过最大边际相关性重排来多样化上下文学习示例,在平衡主题相似性与多样性的基础上,能有效提升下游任务性能,多种实验设置均验证了该方法的有效性。
English: Diversifying in-context learning examples through maximum marginal relevance reranking improves downstream task performance by balancing topical similarity and diversity, as demonstrated across various settings.
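Maximum marginal relevance (MMR), as used above, greedily picks the candidate that maximizes a weighted difference between similarity to the query and similarity to the examples already selected. The following is a minimal sketch assuming cosine similarity over small dense vectors and a trade-off parameter `lam`. Both the function names and the toy vectors are illustrative assumptions and do not come from the paper's repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(query, candidates, k, lam=0.5):
    """Return the indices of up to k candidates chosen greedily by MMR.

    score(i) = lam * sim(query, c_i) - (1 - lam) * max_j sim(c_i, c_j),
    where j ranges over the candidates selected so far.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: candidate 1 is nearly a duplicate of candidate 0.
query = [1.0, 0.0]
pool = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
relevance_only = mmr_select(query, pool, k=2, lam=1.0)
diversified = mmr_select(query, pool, k=2, lam=0.3)
```

With `lam=1.0` the procedure reduces to plain similarity ranking, while smaller values penalize near-duplicates of examples already chosen, so the second pick moves away from the first.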

Authors:Jiesong Bai, Yuhao Yin, Yihang Dong, Xiaofeng Zhang, Chi-Man Pun, Xuhang Chen
Title: LensNet: An End-to-End Learning Framework for Empirical Point Spread Function Modeling and Lensless Imaging Reconstruction
Abstract:
Lensless imaging stands out as a promising alternative to conventional lens-based systems, particularly in scenarios demanding ultracompact form factors and cost-effective architectures. However, such systems are fundamentally governed by the Point Spread Function (PSF), which dictates how a point source contributes to the final captured signal. Traditional lensless techniques often require explicit calibrations and extensive pre-processing, relying on static or approximate PSF models. These rigid strategies can result in limited adaptability to real-world challenges, including noise, system imperfections, and dynamic scene variations, thus impeding high-fidelity reconstruction. In this paper, we propose LensNet, an end-to-end deep learning framework that integrates spatial-domain and frequency-domain representations in a unified pipeline. Central to our approach is a learnable Coded Mask Simulator (CMS) that enables dynamic, data-driven estimation of the PSF during training, effectively mitigating the shortcomings of fixed or sparsely calibrated kernels. By embedding a Wiener filtering component, LensNet refines global structure and restores fine-scale details, thus alleviating the dependency on multiple handcrafted pre-processing steps. Extensive experiments demonstrate LensNet's robust performance and superior reconstruction quality compared to state-of-the-art methods, particularly in preserving high-frequency details and attenuating noise. The proposed framework establishes a novel convergence between physics-based modeling and data-driven learning, paving the way for more accurate, flexible, and practical lensless imaging solutions for applications ranging from miniature sensors to medical diagnostics. The code is available at https://github.com/baijiesong/Lensnet.
中文: LensNet是一种端到端的深度学习框架,通过可学习的编码掩模模拟器动态估计点扩散函数,结合空间和频域表示,在无透镜成像中实现了卓越的重建质量和降噪效果。
English: LensNet is an end-to-end deep learning framework that integrates spatial and frequency domains with a learnable coded mask simulator for dynamic PSF estimation, achieving superior reconstruction quality and noise reduction in lensless imaging.

Authors:Wolfgang Gritz, Hewi Salih, Anett Hoppe, Ralph Ewerth
Title: From Formulas to Figures: How Visual Elements Impact User Interactions in Educational Videos
Abstract:
Educational videos have become increasingly relevant in today's learning environments. While prior research in laboratory studies has provided valuable insights, analyzing real-world interaction data can enhance our understanding of authentic user behavior. Previous studies have investigated technical aspects, such as the influence of cuts on pausing behavior, but the impact of visual complexity remains understudied. In this paper, we address this gap and propose a novel approach centered on visual complexity, defined as the number of visually distinguishable and meaningful elements in a video frame, such as mathematical equations, chemical formulas, or graphical representations. Our study introduces a fine-grained taxonomy of visual objects in educational videos, expanding on previous classifications. Applying this taxonomy to 25 videos from physics and chemistry, we examine the relationship between visual complexity and user behavior, including pauses, in-video navigation, and session dropouts. The results indicate that increased visual complexity, especially of textual elements, correlates with more frequent pauses, rewinds, and dropouts. The results offer a deeper understanding of how video design affects user behavior in real-world scenarios. Our work has implications for optimizing educational videos, particularly in STEM fields. We make our code publicly available (https://github.com/TIBHannover/from_formulas_to_figures).
中文摘要:本研究探讨了教育视频中视觉复杂度对用户行为的影响,发现复杂度增加,尤其是文本元素,会导致更多暂停、回放和退出行为,为优化视频设计提供了参考。
English Summary: This study investigates how visual complexity in educational videos, particularly in STEM subjects, influences user behavior by correlating increased complexity with more frequent pauses, rewinds, and dropouts, offering insights for optimizing video design.

Authors:Yize Jiang, Xinze Li, Yuanyuan Zhang, Jin Han, Youjun Xu, Ayush Pandit, Zaixi Zhang, Mengdi Wang, Mengyang Wang, Chong Liu, Guang Yang, Yejin Choi, Wu-Jun Li, Tianfan Fu, Fang Wu, Junhong Liu
Title: PoseX: AI Defeats Physics Approaches on Protein-Ligand Cross Docking
Abstract:
Existing protein-ligand docking studies typically focus on the self-docking scenario, which is less practical in real applications. Moreover, some studies involve heavy frameworks requiring extensive training, posing challenges for convenient and efficient assessment of docking methods. To fill these gaps, we design PoseX, an open-source benchmark to evaluate both self-docking and cross-docking, enabling a practical and comprehensive assessment of algorithmic advances. Specifically, first, we curated a novel dataset comprising 718 entries for self-docking and 1,312 entries for cross-docking; second, we incorporated 23 docking methods in three methodological categories, including physics-based methods (e.g., Schrödinger Glide), AI docking methods (e.g., DiffDock) and AI co-folding methods (e.g., AlphaFold3); third, we developed a relaxation method for post-processing to minimize conformational energy and refine binding poses; fourth, we built a leaderboard to rank submitted models in real time. We derived some key insights and conclusions from extensive experiments: (1) AI approaches have consistently outperformed physics-based methods in overall docking success rate. (2) Most intra- and intermolecular clashes of AI approaches can be greatly alleviated with relaxation, which means combining AI modeling with physics-based post-processing could achieve excellent performance. (3) AI co-folding methods exhibit ligand chirality issues, except for Boltz-1x, which introduced physics-inspired potentials to fix hallucinations, suggesting that modeling stereochemistry markedly improves structural plausibility. (4) Specifying binding pockets significantly promotes docking performance, indicating that pocket information can be leveraged adequately, particularly for AI co-folding methods, in future modeling efforts. The code, dataset, and leaderboard are released at https://github.com/CataAI/PoseX.
中文: PoseX是一个开源基准测试,旨在评估自对接和交叉对接方法,整合了新颖数据集和23种不同对接方法,关键发现显示AI方法优于物理方法,并通过物理后处理显著提升性能。
English: PoseX is an open-source benchmark designed to evaluate both self-docking and cross-docking methods, incorporating a novel dataset and 23 diverse docking approaches, with key findings showing AI methods outperform physics-based ones and benefit from physics-based post-processing.
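
The abstract ranks methods by overall docking success rate without defining the metric here. A minimal sketch follows, assuming the common convention that a pose counts as a success when its RMSD to the reference ligand is at most 2 Å. The threshold, the function names, and the absence of symmetry correction are illustrative assumptions, not details taken from PoseX.

```python
import math

def rmsd(pred, ref):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z) atoms."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and the same length")
    squared = sum((p - r) ** 2 for a, b in zip(pred, ref) for p, r in zip(a, b))
    return math.sqrt(squared / len(pred))

def success_rate(pairs, threshold=2.0):
    """Fraction of (predicted, reference) pose pairs with RMSD <= threshold.

    The 2.0 Angstrom default is an assumed convention, not a value from the abstract.
    """
    hits = sum(1 for pred, ref in pairs if rmsd(pred, ref) <= threshold)
    return hits / len(pairs)
```

A real evaluation would usually also account for ligand symmetry before computing RMSD, which this sketch omits.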

Authors:Yifan Liu, Ruichen Yao, Yaokun Liu, Ruohan Zong, Zelin Li, Yang Zhang, Dong Wang
Title: Component-Based Fairness in Face Attribute Classification with Bayesian Network-informed Meta Learning
Abstract:
The widespread integration of face recognition technologies into various applications (e.g., access control and personalized advertising) necessitates a critical emphasis on fairness. While previous efforts have focused on demographic fairness, the fairness of individual biological face components remains unexplored. In this paper, we focus on face component fairness, a fairness notion defined by biological face features. To the best of our knowledge, this is the first work to mitigate bias in face attribute prediction at the biological feature level. We identify two key challenges in optimizing face component fairness: attribute label scarcity and attribute inter-dependencies, both of which limit the effectiveness of bias mitigation in previous approaches. To address these issues, we propose Bayesian Network-informed Meta Reweighting (BNMR), which incorporates a Bayesian Network calibrator to guide an adaptive meta-learning-based sample reweighting process. During training, the Bayesian Network calibrator dynamically tracks model bias and encodes prior probabilities for face component attributes to overcome the above challenges. To demonstrate the efficacy of our approach, we conduct extensive experiments on a large-scale real-world human face dataset. Our results show that BNMR consistently outperforms recent face bias mitigation baselines. Moreover, our results suggest a positive impact of face component fairness on the commonly considered demographic fairness (e.g., gender). Our findings pave the way for new research avenues on face component fairness, suggesting that face component fairness could serve as a potential surrogate objective for demographic fairness. The code for our work is publicly available at https://github.com/yliuaa/BNMR-FairCompFace.git.
中文: 本文提出BNMR方法,首次从生物特征层面解决人脸组件公平性问题,通过贝叶斯网络校准和元学习加权机制有效提升人脸属性预测的公平性,并证实其对人口统计公平性的积极影响。
English: This paper introduces BNMR, a novel method that addresses fairness in face attribute classification at the level of biological face components, overcoming attribute label scarcity and attribute inter-dependencies to improve component fairness, with a positive effect on demographic fairness.

Authors:Yuying Zhao, Yu Wang, Xueqi Cheng, Anne Marie Tumlin, Yunchao Liu, Damin Xia, Meng Jiang, Tyler Derr
Title: Amplifying Your Social Media Presence: Personalized Influential Content Generation with LLMs
Abstract:
The remarkable advancements in Large Language Models (LLMs) have revolutionized the content generation process in social media, offering significant convenience in writing tasks. However, existing applications, such as sentence completion and fluency enhancement, do not fully address the complex challenges in real-world social media contexts. A prevalent goal among social media users is to increase the visibility and influence of their posts. This paper, therefore, delves into the compelling question: Can LLMs generate personalized influential content to amplify a user's presence on social media? We begin by examining prevalent techniques in content generation to assess their impact on post influence. Acknowledging the critical impact of underlying network structures in social media, which are instrumental in initiating content cascades and highly related to the influence/popularity of a post, we then inject network information into prompts for content generation to boost the post's influence. We design multiple content-centric and structure-aware prompts. Empirical experiments across LLMs validate their ability to improve post influence and yield insights into which strategies are more effective. Our code is available at https://github.com/YuyingZhao/LLM-influence-amplifier.
中文摘要:本研究探讨大型语言模型能否通过融入网络信息生成个性化有影响力内容以增强社交媒体存在感,实验验证了其在提升帖子影响力方面的有效性。
English Summary: This study explores whether Large Language Models can generate personalized influential content to enhance social media presence by incorporating network information into prompts, with experiments validating their effectiveness in boosting post influence.

Authors:Yuying Zhao, Xiaodong Yang, Huiyuan Chen, Xiran Fan, Yu Wang, Yiwei Cai, Tyler Derr
Title: SimAug: Enhancing Recommendation with Pretrained Language Models for Dense and Balanced Data Augmentation
Abstract:
Deep Neural Networks (DNNs) are extensively used in collaborative filtering due to their impressive effectiveness. These systems depend on interaction data to learn user and item embeddings that are crucial for recommendations. However, the data often suffers from sparsity and imbalance issues: limited observations of user-item interactions can result in sub-optimal performance, and a predominance of interactions with popular items may introduce recommendation bias. To address these challenges, we employ Pretrained Language Models (PLMs) to enhance the interaction data with textual information, leading to a denser and more balanced dataset. Specifically, we propose a simple yet effective data augmentation method (SimAug) based on the textual similarity from PLMs, which can be seamlessly integrated to any systems as a lightweight, plug-and-play component in the pre-processing stage. Our experiments across nine datasets consistently demonstrate improvements in both utility and fairness when training with the augmented data generated by SimAug. The code is available at https://github.com/YuyingZhao/SimAug.
中文: 针对协同过滤中深度神经网络因数据稀疏和不平衡导致的性能问题,我们利用预训练语言模型提出了一种基于文本相似度的数据增强方法SimAug,该方法作为轻量级插件有效提升了多数据集的推荐效果与公平性。
English: Deep Neural Networks in collaborative filtering face challenges from sparse and imbalanced data, which are addressed by using Pretrained Language Models for text-based data augmentation through a method called SimAug, leading to improved performance and fairness across multiple datasets.
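
The augmentation idea described above, densifying interaction data with textually similar items, can be sketched as follows. This is a minimal illustration under stated assumptions: `item_vecs` stands in for text embeddings already produced by some pretrained language model, and adding the top-`k` most similar items per observed interaction is a hypothetical rule chosen for clarity. The actual SimAug procedure may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def augment(interactions, item_vecs, k=1):
    """Return the observed (user, item) pairs plus, for each pair,
    the k items whose (assumed) text embeddings are most similar to the item."""
    augmented = set(interactions)
    for user, item in interactions:
        ranked = sorted(
            (other for other in item_vecs if other != item),
            key=lambda other: cosine(item_vecs[item], item_vecs[other]),
            reverse=True,
        )
        augmented.update((user, other) for other in ranked[:k])
    return augmented
```

Because the step only rewrites the interaction set, it runs before training and leaves the downstream recommender unchanged, which matches the plug-and-play framing in the abstract.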

Authors:Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee
Title: A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
Abstract:
Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open-source inference engines and examine the performance and cost policies of commercial solutions. We outline future research directions that include support for complex LLM-based services, support for various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/sihyeong/Awesome-LLM-Inference-Engine
Chinese: 本文对25个大语言模型推理引擎进行全面评估,分析其性能、设计目标和生态成熟度,为研究人员和开发者选择及优化适用于多样化服务需求的系统提供实用指导。
English: This paper conducts a comprehensive evaluation of 25 LLM inference engines, assessing their performance, design goals, and ecosystem maturity to guide researchers and developers in selecting and optimizing these systems for diverse service requirements.

Authors:Jun Li, Yijue Zhang, Haibo Shi, Minhong Li, Qiwei Li, Xiaohua Qian
Title: A Dual-Task Synergy-Driven Generalization Framework for Pancreatic Cancer Segmentation in CT Scans
Abstract:
Pancreatic cancer, characterized by its notable prevalence and mortality rates, demands accurate lesion delineation for effective diagnosis and therapeutic interventions. The generalizability of extant methods is frequently compromised due to the pronounced variability in imaging and the heterogeneous characteristics of pancreatic lesions, which may mimic normal tissues and exhibit significant inter-patient variability. Thus, we propose a generalization framework that synergizes pixel-level classification and regression tasks to accurately delineate lesions and improve model stability. This framework not only seeks to align segmentation contours with actual lesions but also uses regression to elucidate spatial relationships between diseased and normal tissues, thereby improving tumor localization and morphological characterization. Enhanced by the reciprocal transformation of task outputs, our approach integrates additional regression supervision within the segmentation context, bolstering the model's generalization ability from a dual-task perspective. In addition, dual self-supervised learning in feature spaces and output spaces augments the model's representational capability and stability across different imaging views. Experiments on 594 samples drawn from three datasets with significant imaging differences demonstrate that our method achieves generalized pancreas segmentation results comparable to mainstream in-domain validation performance (Dice: 84.07%). More importantly, it improves the results of the highly challenging cross-lesion generalized pancreatic cancer segmentation task by 9.51%. Thus, our model constitutes a resilient and efficient foundational technological support for pancreatic disease management and wider medical applications. The code will be released at https://github.com/SJTUBME-QianLab/Dual-Task-Seg.
中文摘要:本研究提出一种结合像素级分类与回归的双任务泛化框架,通过协同优化提升胰腺病灶分割精度,在跨数据集验证中显著提升模型稳定性与分割性能。
English Summary: This study introduces a dual-task generalization framework combining pixel-level classification and regression to enhance pancreatic lesion segmentation, improving model stability and achieving significant performance gains in cross-dataset validation.
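
The abstract reports segmentation quality as a Dice score. For reference, the standard Dice coefficient between a predicted and a ground-truth binary mask is 2|P ∩ T| / (|P| + |T|). The sketch below uses flat Python lists as masks and returns 1.0 when both masks are empty; that empty-mask convention and the function name are assumptions for illustration, not the authors' evaluation code.

```python
def dice(pred, target):
    """Dice coefficient for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```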

Authors:Jianxing Qin, Jingrong Chen, Xinhao Kong, Yongji Wu, Tianjun Yuan, Liang Luo, Zhaodong Wang, Ying Zhang, Tingjun Chen, Alvin R. Lebeck, Danyang Zhuo
Title: Phantora: Maximizing Code Reuse in Simulation-based Machine Learning System Performance Estimation
Abstract:
Modern machine learning (ML) training workloads place substantial demands on both computational and communication resources. Consequently, accurate performance estimation has become increasingly critical for guiding system design decisions, such as the selection of parallelization strategies, cluster configurations, and hardware provisioning. Existing simulation-based performance estimation requires reimplementing the ML framework in a simulator, which demands significant manual effort and is hard to maintain as ML frameworks evolve rapidly. This paper introduces Phantora, a hybrid GPU cluster simulator designed for performance estimation of ML training workloads. Phantora executes unmodified ML frameworks as is within a distributed, containerized environment. Each container emulates the behavior of a GPU server in a large-scale cluster, while Phantora intercepts and simulates GPU- and communication-related operations to provide high-fidelity performance estimation. We call this approach hybrid simulation of ML systems, in contrast to traditional methods that simulate static workloads. The primary advantage of hybrid simulation is that it allows direct reuse of ML framework source code in simulation, avoiding the need for reimplementation. Our evaluation shows that Phantora provides accuracy comparable to static workload simulation while supporting three state-of-the-art LLM training frameworks out-of-the-box. In addition, Phantora operates on a single GPU, eliminating the need for the resource-intensive trace collection and workload extraction steps required by traditional trace-based simulators. Phantora is open-sourced at https://github.com/QDelta/Phantora.
Chinese: Phantora是一种混合GPU集群模拟器,通过在容器化环境中直接运行未经修改的机器学习框架,为训练工作负载提供高精度性能评估,无需重新实现框架代码,且能在单GPU上支持最新的大语言模型训练框架。
English: Phantora is a hybrid GPU cluster simulator that enables high-fidelity performance estimation for ML training workloads by executing unmodified ML frameworks in a containerized environment, eliminating the need for reimplementation and supporting state-of-the-art frameworks with minimal resource requirements.

Authors:Abdalwahab Almajed, Maryam Tabar, Peyman Najafirad
Title: Machine Learning Fairness in House Price Prediction: A Case Study of America's Expanding Metropolises
Abstract:
As a basic human need, housing plays a key role in enhancing health, well-being, and educational outcomes in society, and the housing market is a major factor in promoting quality of life and ensuring social equity. To improve housing conditions, there has been extensive research on building Machine Learning (ML)-driven house price prediction solutions to accurately forecast future conditions and help inform actions and policies in the field. Despite their success in developing high-accuracy models, there is a gap in our understanding of the extent to which various ML-driven house price prediction approaches show ethnic and/or racial bias, an understanding that is essential for the responsible use of ML and for ensuring that ML-driven solutions do not exacerbate inequity. To fill this gap, this paper develops several ML models from a combination of structural and neighborhood-level attributes, and conducts comprehensive assessments of the fairness of ML models under various definitions of privileged groups. It finds that the ML-driven house price prediction models show various levels of bias with respect to protected attributes (i.e., race and ethnicity in this study). It then investigates the performance of different bias mitigation solutions, and the experimental results show their various levels of effectiveness on different ML-driven methods. In general, however, the in-processing bias mitigation approach tends to be more effective than the pre-processing one in this problem domain. Our code is available at https://github.com/wahab1412/housing_fairness.
Chinese: 本研究通过开发揭示种族和民族相关偏见的机器学习房价预测模型,评估了不同偏见缓解方法的有效性,发现在该问题领域中,处理中方法通常比预处理方法更为有效。
English: This study addresses the fairness gap in machine learning-driven house price prediction by developing models that reveal biases related to race and ethnicity, and evaluating the effectiveness of different bias mitigation approaches, with in-processing methods proving generally more effective than pre-processing ones.

Authors:Stefanos Gkikas, Raul Fernandez Rojas, Manolis Tsiknakis
Title: PainFormer: a Vision Foundation Model for Automatic Pain Assessment
Abstract:
Pain is a manifold condition that impacts a significant percentage of the population. Accurate and reliable pain evaluation for the people suffering is crucial to developing effective and advanced pain management protocols. Automatic pain assessment systems provide continuous monitoring and support decision-making processes, ultimately aiming to alleviate distress and prevent functionality decline. This study introduces PainFormer, a vision foundation model based on multi-task learning principles trained simultaneously on 14 tasks/datasets with a total of 10.9 million samples. Functioning as an embedding extractor for various input modalities, the foundation model provides feature representations to the Embedding-Mixer, a transformer-based module that performs the final pain assessment. Extensive experiments employing behavioral modalities - including RGB, synthetic thermal, and estimated depth videos - and physiological modalities such as ECG, EMG, GSR, and fNIRS revealed that PainFormer effectively extracts high-quality embeddings from diverse input modalities. The proposed framework is evaluated on two pain datasets, BioVid and AI4Pain, and directly compared to 75 different methodologies documented in the literature. Experiments conducted in unimodal and multimodal settings demonstrate state-of-the-art performances across modalities and pave the way toward general-purpose models for automatic pain assessment. The foundation model's architecture (code) and weights are available at: https://github.com/GkikasStefanos/PainFormer.
中文: 本研究提出PainFormer多任务视觉基础模型,通过14个数据集训练,能从行为和生理输入中提取高质量嵌入表征,在单模态与多模态场景下均实现了自动疼痛评估的最先进性能。
English: This study introduces PainFormer, a multi-task vision foundation model trained on 14 datasets to extract high-quality embeddings from behavioral and physiological inputs, achieving state-of-the-art performance in automatic pain assessment across unimodal and multimodal settings.

Authors:Zhen Yao, Xiaowen Ying, Zhiyu Zhu, Mooi Choo Chuah
Title: Learning Flow-Guided Registration for RGB-Event Semantic Segmentation
Abstract:
Event cameras capture microsecond-level motion cues that complement RGB sensors. However, the prevailing paradigm of treating RGB-Event perception as a fusion problem is ill-posed, as it ignores the intrinsic (i) Spatiotemporal and (ii) Modal Misalignment, unlike other RGB-X sensing domains. To tackle these limitations, we recast RGB-Event segmentation from fusion to registration. We propose BRENet, a novel flow-guided bidirectional framework that adaptively matches correspondence between the asymmetric modalities. Specifically, it leverages temporally aligned optical flows as a coarse-grained guide, along with fine-grained event temporal features, to generate precise forward and backward pixel pairings for registration. This pairing mechanism converts the inherent motion lag into terms governed by flow estimation error, bridging modality gaps. Moreover, we introduce Motion-Enhanced Event Tensor (MET), a new representation that transforms sparse event streams into a dense, temporally coherent form. Extensive experiments on four large-scale datasets validate our approach, establishing flow-guided registration as a promising direction for RGB-Event segmentation. Our code is available at: https://github.com/zyaocoder/BRENet.
中文: 本研究将RGB-事件分割重新定义为配准问题而非融合,提出BRENet框架通过流引导的双向匹配和运动增强事件张量,有效弥合了模态间的差异。
English: The study reframes RGB-Event segmentation as a registration problem instead of fusion, introducing BRENet with flow-guided bidirectional matching and a Motion-Enhanced Event Tensor to bridge modality gaps effectively.

Authors:Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Tianyi Zhou, Dinesh Manocha, Jordan Lee Boyd-Graber
Title: VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
Abstract:
Synthetic video generation has gained significant attention for its realism and broad applications, but remains prone to violations of common sense and physical laws. This highlights the need for reliable abnormality detectors that understand such principles and are robust to hallucinations. To address this, we introduce VideoHallu, a benchmark of over 3,000 video QA pairs built from synthetic videos generated by models like Veo2, Sora, and Kling, paired with expert-crafted counterintuitive QA to evaluate the critical thinking abilities of Multi-modal Large Language Models (MLLMs) on abnormalities that are perceptually obvious to humans but often hallucinated due to language priors. VideoHallu evaluates MLLMs' abnormality detection abilities with examples across alignment, consistency, commonsense, and physics. We benchmark SOTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen2.5-VL, Video-R1, and VideoChat-R1. We observe that these models perform well on many real-world benchmarks like MVBench and MovieChat, but still struggle with basic physics-based and commonsense reasoning in synthetic videos. We further show that post-training with Group Relative Policy Optimization (GRPO), using curriculum learning on datasets combining video QA with counterintuitive commonsense and physics reasoning over real and synthetic videos, improves MLLMs' abnormality detection and critical thinking, demonstrating the value of targeted training for improving their understanding of commonsense and physical laws. Our code is available at https://github.com/zli12321/VideoHallu.git.
中文: VideoHallu基准通过合成视频异常评估多模态大模型的批判性思维能力,发现尽管在真实场景表现优异,模型仍难以处理常识和物理推理,但采用GRPO的针对性训练能显著提升这方面的理解能力。
English: The VideoHallu benchmark evaluates MLLMs' critical thinking on synthetic video abnormalities, revealing their struggles with commonsense and physics reasoning despite strong real-world performance, while demonstrating that targeted training with GRPO significantly enhances these capabilities.
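
The post-training step relies on Group Relative Policy Optimization (GRPO). Its core idea is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage is given below. The use of population standard deviation and the `eps` stabilizer are assumptions made for illustration, and the policy update itself (clipped ratio, KL term) is not shown.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary correctness rewards, responses that beat the group average receive positive advantages and the rest negative ones; if every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.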

Authors:Zhe Zhang, Mingxiu Cai, Hanxiao Wang, Gaochang Wu, Tianyou Chai, Xiatian Zhu
Title: CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering
Abstract:
Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach CostFilter-AD. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
Chinese: 本文提出CostFilter-AD方法,通过构建并过滤输入图像与正常样本间的匹配成本,有效抑制噪声同时保留边缘结构和捕捉细微异常,可作为通用后处理模块提升各类无监督异常检测方法的性能。
English: The paper introduces CostFilter-AD, a novel unsupervised anomaly detection method that refines anomaly localization by filtering matching costs between input and normal samples, effectively reducing noise while enhancing edge and subtle anomaly detection across various UAD approaches.
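
The matching cost volume described in the abstract has two spatial dimensions and one matching dimension. A minimal sketch of such a volume is shown below, assuming the matching dimension indexes a small set of feature vectors taken from normal samples and that cost is one minus cosine similarity. Both choices, and all names, are illustrative assumptions. The learned filtering network that is the paper's actual contribution is not reproduced; `anomaly_map` is only the unfiltered baseline that such a network would refine.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cost_volume(feat_map, normal_feats):
    """cost[y][x][m] = 1 - cos(input feature at (y, x), m-th normal feature)."""
    return [[[1.0 - cosine(f, n) for n in normal_feats] for f in row]
            for row in feat_map]

def anomaly_map(volume):
    """Unfiltered baseline: the lowest matching cost at each spatial location."""
    return [[min(costs) for costs in row] for row in volume]
```

A location whose feature matches no normal feature well keeps a high minimum cost, so it scores as anomalous; noisy matches inflate or deflate these costs, which is the motivation the abstract gives for filtering the volume before reducing it.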

Authors:Vaidehi Patil, Yi-Lin Sung, Peter Hase, Jie Peng, Tianlong Chen, Mohit Bansal
Title: Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Abstract:
LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs as they integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates varying-proximity samples for testing generalization and specificity, followed by manual filtering for maintaining high quality. We then evaluate six defense objectives against seven attacks (four whitebox, three blackbox), including a novel whitebox method leveraging interpretability of hidden states. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing unlearning in MLLMs.
中文: 大型多模态语言模型存在记忆敏感信息的风险,可通过多模态提示提取,为此开发了UnLOK-VQA基准来评估定向遗忘方法,结果显示多模态攻击更有效,而更大模型具有更强的安全鲁棒性。
English: Large multimodal language models risk memorizing sensitive data, which can be extracted via multimodal prompts, prompting the development of UnLOK-VQA as a benchmark to evaluate targeted unlearning methods and revealing that multimodal attacks are more effective while larger models offer better safety.

Authors:Chaoyi Wang, Junjie Zheng, Zihao Chen, Shiyu Xia, Chaofan Ding, Xiaohao Zhang, Xi Tao, Xiaoming He, Xinhan Di
Title: Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks
Abstract:
Movie dubbing has advanced significantly, yet assessing the real-world effectiveness of these models remains challenging. A comprehensive evaluation benchmark is crucial for two key reasons: 1) Existing metrics fail to fully capture the complexities of dialogue, narration, monologue, and actor adaptability in movie dubbing. 2) A practical evaluation system should offer valuable insights to improve movie dubbing quality and advance film production. To this end, we introduce Talking Adaptive Dubbing Benchmarks (TA-Dubbing), designed to improve film production by adapting to dialogue, narration, monologue, and actors in movie dubbing. TA-Dubbing offers several key advantages: 1) Comprehensive Dimensions: TA-Dubbing covers a variety of dimensions of movie dubbing, incorporating metric evaluations for both movie understanding and speech generation. 2) Versatile Benchmarking: TA-Dubbing is designed to evaluate state-of-the-art movie dubbing models and advanced multi-modal large language models. 3) Full Open-Sourcing: We fully open-source TA-Dubbing at https://github.com/woka-0a/DeepDubber-V1, including all video suites, evaluation methods, and annotations. We also continuously integrate new movie dubbing models into the TA-Dubbing leaderboard at https://github.com/woka-0a/DeepDubber-V1 to drive forward the field of movie dubbing.
Chinese: TA-Dubbing基准通过全面评估电影配音中的对话、旁白、独白和演员适应性,弥补现有指标的不足,并借助开源工具和持续模型集成推动电影制作质量的提升。
English: The TA-Dubbing benchmark is introduced to address the limitations of existing metrics by comprehensively evaluating movie dubbing across dialogue, narration, monologue, and actor adaptability, aiming to enhance film production through open-source tools and continuous model integration.

Authors:Carlo Siebenschuh, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Arham Khan, Khalid Hossain, Yadu Babuji, Nicholas Chia, Venkatram Vishwanath, Rick Stevens, Arvind Ramanathan, Ian Foster, Robert Underwood
Title: AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Abstract:
Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by $17\times$ while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at https://github.com/7shoe/AdaParse/
中文摘要:AdaParse是一种自适应PDF解析引擎,通过整合人类偏好和优化资源分配,为每份科学文献智能匹配最适合的解析器,在大规模解析任务中实现17倍吞吐量提升并保持相当的准确率。
English Summary: AdaParse is an adaptive engine that efficiently assigns the most suitable PDF parser to each scientific document by incorporating human preferences and optimizing resource allocation, achieving 17 times higher throughput with comparable accuracy for large-scale parsing.

Authors:Wenqi Guo, Mohamed Shehata, Shan Du
Title: ZS-VCOS: Zero-Shot Video Camouflaged Object Segmentation By Optical Flow and Open Vocabulary Object Detection
Abstract:
Camouflaged object segmentation presents unique challenges compared to traditional segmentation tasks, primarily due to the high similarity in patterns and colors between camouflaged objects and their backgrounds. Effective solutions to this problem have significant implications in critical areas such as pest control, defect detection, and lesion segmentation in medical imaging. Prior research has predominantly emphasized supervised or unsupervised pre-training methods, leaving zero-shot approaches significantly underdeveloped. Existing zero-shot techniques commonly utilize the Segment Anything Model (SAM) in automatic mode or rely on vision-language models to generate cues for segmentation; however, their performance remains unsatisfactory, due to the similarity of the camouflaged object and the background. This work studies how to avoid training by integrating large pre-trained models like SAM-2 and Owl-v2 with temporal information into a modular pipeline. Evaluated on the MoCA-Mask dataset, our approach achieves outstanding performance improvements, significantly outperforming existing zero-shot methods by raising the F-measure ($F_\beta^w$) from 0.296 to 0.628. Our approach also surpasses supervised methods, increasing the F-measure from 0.476 to 0.628. Additionally, evaluation on the MoCA-Filter dataset demonstrates an increase in the success rate from 0.628 to 0.697 when compared with FlowSAM, a supervised transfer method. A thorough ablation study further validates the individual contributions of each component. Besides our main contributions, we also highlight inconsistencies in previous work regarding metrics and settings. Code can be found at https://github.com/weathon/vcos.
中文: 本研究提出了一种新颖的零样本伪装物体分割方法,通过将大型预训练模型与时间信息相结合,在基准数据集上相比现有方法取得了显著的性能提升。
English: This study introduces a novel zero-shot approach for camouflaged object segmentation by integrating large pre-trained models with temporal information, achieving significant performance improvements over existing methods on benchmark datasets.

Authors:Rahuul Rangaraj, Jimeng Shi, Azam Shirali, Rajendra Paudel, Yanzhao Wu, Giri Narasimhan
Title: How Effective are Large Time Series Models in Hydrology? A Study on Water Level Forecasting in Everglades
Abstract:
The Everglades play a crucial role in flood and drought regulation, water resource planning, and ecosystem management in the surrounding regions. However, traditional physics-based and statistical methods for predicting water levels often face significant challenges, including high computational costs and limited adaptability to diverse or unforeseen conditions. Recent advancements in large time series models have demonstrated the potential to address these limitations, with state-of-the-art deep learning and foundation models achieving remarkable success in time series forecasting across various domains. Despite this progress, their application to critical environmental systems, such as the Everglades, remains underexplored. In this study, we fill the gap by investigating twelve task-specific models and five time series foundation models across six categories for a real-world application focused on water level prediction in the Everglades. Our primary results show that the foundation model Chronos significantly outperforms all other models while the remaining foundation models exhibit relatively poor performance. We also observe that the performance of task-specific models varies with the model architecture, and we discuss possible reasons. We hope our study and findings will inspire the community to explore the applicability of large time series models in hydrological applications. The code and data are available at https://github.com/rahuul2992000/Everglades-Benchmark.
中文: 本研究评估了用于大沼泽地水位预测的先进时间序列模型,发现Chronos基础模型表现显著优于其他模型,同时强调了在水利学中进一步探索大型模型的必要性。
English: This study evaluates advanced time series models for water level prediction in the Everglades, finding that the Chronos foundation model significantly outperforms others, while highlighting the need for further exploration of large models in hydrology.

Authors:Mohammadreza Teymoorianfard, Shiqing Ma, Amir Houmansadr
Title: VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models
Abstract:
The rapid rise of video diffusion models has enabled the generation of highly realistic and temporally coherent videos, raising critical concerns about content authenticity, provenance, and misuse. Existing watermarking approaches, whether passive, post-hoc, or adapted from image-based techniques, often struggle to withstand video-specific manipulations such as frame insertion, dropping, or reordering, and typically degrade visual quality. In this work, we introduce VIDSTAMP, a watermarking framework that embeds per-frame or per-segment messages directly into the latent space of temporally-aware video diffusion models. By fine-tuning the model's decoder through a two-stage pipeline, first on static image datasets to promote spatial message separation, and then on synthesized video sequences to restore temporal consistency, VIDSTAMP learns to embed high-capacity, flexible watermarks with minimal perceptual impact. Leveraging architectural components such as 3D convolutions and temporal attention, our method imposes no additional inference cost and offers better perceptual quality than prior methods, while maintaining comparable robustness against common distortions and tampering. VIDSTAMP embeds 768 bits per video (48 bits per frame) with a bit accuracy of 95.0%, achieves a log P-value of -166.65 (lower is better), and maintains a video quality score of 0.836, comparable to unwatermarked outputs (0.838) and surpassing prior methods in capacity-quality tradeoffs. Code: https://github.com/SPIN-UMass/VidStamp
中文摘要:VIDSTAMP是一种新型水印框架,通过在视频扩散模型的潜在空间中直接嵌入高容量水印,在保持与未加水印视频相当的视觉质量的同时,实现了与现有方法相当的抗篡改鲁棒性。
English Summary: VIDSTAMP is a novel watermarking framework that embeds high-capacity watermarks directly into video diffusion models' latent space, keeping robustness against tampering comparable to prior methods while maintaining visual quality comparable to unwatermarked videos.
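
The capacity and detection figures quoted in the abstract can be made concrete with a small calculation. The sketch below is illustrative only: `bit_accuracy` and `log10_p_value` are hypothetical helper names, and the one-sided binomial test against random guessing is a common way to turn bit matches into a detection p-value, assumed here rather than taken from the paper. At 48 bits per frame, 768 bits per video corresponds to 16 message-carrying frames if every frame carries a message.

```python
import math

def bit_accuracy(embedded_bits, decoded_bits):
    """Fraction of decoded bits that match the embedded message."""
    if len(embedded_bits) != len(decoded_bits) or not embedded_bits:
        raise ValueError("messages must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(embedded_bits, decoded_bits) if a == b)
    return matches / len(embedded_bits)

def log10_p_value(num_bits, num_matches):
    """log10 of P(X >= num_matches) for X ~ Binomial(num_bits, 0.5).

    Assumption: under the null hypothesis (no watermark) each decoded bit
    matches the message independently with probability 0.5.
    """
    tail = sum(math.comb(num_bits, k) for k in range(num_matches, num_bits + 1))
    return math.log10(tail) - num_bits * math.log10(2)

# Capacity arithmetic from the figures quoted in the abstract.
BITS_PER_VIDEO = 768
BITS_PER_FRAME = 48
FRAMES_WITH_MESSAGE = BITS_PER_VIDEO // BITS_PER_FRAME
```

A more negative `log10_p_value` means the observed number of matching bits is less likely to occur by chance, which is consistent with the abstract's "lower is better" note.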

Authors:Fahong Zhang, Yilei Shi, Xiao Xiang Zhu
Title: Global Collinearity-aware Polygonizer for Polygonal Building Mapping in Remote Sensing
Abstract:
This paper addresses the challenge of mapping polygonal buildings from remote sensing images and introduces a novel algorithm, the Global Collinearity-aware Polygonizer (GCP). GCP, built upon an instance segmentation framework, processes binary masks produced by any instance segmentation model. The algorithm begins by collecting polylines sampled along the contours of the binary masks. These polylines undergo a refinement process using a transformer-based regression module to ensure they accurately fit the contours of the targeted building instances. Subsequently, a collinearity-aware polygon simplification module simplifies these refined polylines and generates the final polygon representation. This module employs a dynamic programming technique to optimize an objective function that balances the simplicity and fidelity of the polygons, achieving globally optimal solutions. Furthermore, the optimized collinearity-aware objective is seamlessly integrated into network training, enhancing the cohesiveness of the entire pipeline. The effectiveness of GCP has been validated on two public benchmarks for polygonal building mapping. Further experiments reveal that applying the collinearity-aware polygon simplification module to arbitrary polylines, without prior knowledge, enhances accuracy over traditional methods such as the Douglas-Peucker algorithm. This finding underscores the broad applicability of GCP. The code for the proposed method will be made available at https://github.com/zhu-xlab.
中文: 本文提出了一种新颖的全局共线性感知多边形化算法(GCP),通过基于变换器的回归模块和动态规划技术优化遥感图像中的建筑物轮廓,最终生成精确的多边形表示。
English: This paper introduces the Global Collinearity-aware Polygonizer (GCP), a novel algorithm that refines and simplifies building contours from remote sensing images using transformer-based regression and dynamic programming to produce accurate polygonal representations.
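
The idea of using dynamic programming to trade polygon simplicity against fidelity can be sketched on an open polyline. This is a minimal illustration under stated assumptions, not the paper's collinearity-aware objective: the cost used here (squared point-to-segment distance of dropped vertices plus a fixed `vertex_penalty` per kept segment) and the function names are hypothetical choices made for the example.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def simplify_polyline(points, vertex_penalty):
    """Keep the vertex subset minimizing fit error + penalty per segment.

    The first and last points are always kept. Because every prefix is
    solved optimally, the result is globally optimal for this toy cost.
    """
    n = len(points)
    if n <= 2:
        return list(points)

    def segment_error(i, j):
        # Squared deviation of the vertices skipped between i and j.
        return sum(
            point_segment_distance(points[k], points[i], points[j]) ** 2
            for k in range(i + 1, j)
        )

    best = [math.inf] * n   # best[j]: minimal cost of a chain ending at j
    prev = [-1] * n         # prev[j]: previous kept vertex on that chain
    best[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            cost = best[i] + segment_error(i, j) + vertex_penalty
            if cost < best[j]:
                best[j] = cost
                prev[j] = i

    kept = []
    j = n - 1
    while j != -1:
        kept.append(j)
        j = prev[j]
    kept.reverse()
    return [points[k] for k in kept]
```

With a small penalty, collinear vertices are dropped at no fidelity cost while true corners survive; raising the penalty trades fidelity for fewer vertices.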

Authors:Dan Barry, Davoud Shariat Panah, Alessandro Ragano, Jan Skoglund, Andrew Hines
Title: Binamix -- A Python Library for Generating Binaural Audio Datasets
Abstract:
The increasing demand for spatial audio in applications such as virtual reality, immersive media, and spatial audio research necessitates robust solutions to generate binaural audio data sets for use in testing and validation. Binamix is an open-source Python library designed to facilitate programmatic binaural mixing using the extensive SADIE II Database, which provides Head Related Impulse Response (HRIR) and Binaural Room Impulse Response (BRIR) data for 20 subjects. The Binamix library provides a flexible and repeatable framework for creating large-scale spatial audio datasets, making it an invaluable resource for codec evaluation, audio quality metric development, and machine learning model training. A range of pre-built example scripts, utility functions, and visualization plots further streamline the process of custom pipeline creation. This paper presents an overview of the library's capabilities, including binaural rendering, impulse response interpolation, and multi-track mixing for various speaker layouts. The tools utilize a modified Delaunay triangulation technique to achieve accurate HRIR/BRIR interpolation where desired angles are not present in the data. By supporting a wide range of parameters such as azimuth, elevation, subject Impulse Responses (IRs), speaker layouts, mixing controls, and more, the library enables researchers to create large binaural datasets for any downstream purpose. Binamix empowers researchers and developers to advance spatial audio applications with reproducible methodologies by offering an open-source solution for binaural rendering and dataset generation. We release the library under the Apache 2.0 License at https://github.com/QxLabIreland/Binamix/
中文: Binamix 是一个开源 Python 库,利用 SADIE II 数据库实现程序化双耳混音,为沉浸式音频应用的研究开发提供了可生成空间音频数据集的灵活框架。
English: Binamix is an open-source Python library that enables programmatic binaural mixing using the SADIE II Database, providing a flexible framework for generating spatial audio datasets to support research and development in immersive audio applications.

Authors:Dongliang Guo, Mengxuan Hu, Zihan Guan, Thomas Hartvigsen, Sheng Li
Title: BalancEdit: Dynamically Balancing the Generality-Locality Trade-off in Multi-modal Model Editing
Abstract:
Large multi-modal models inevitably decay over time as facts update and previously learned information becomes outdated. Traditional approaches such as fine-tuning are often impractical for updating these models due to their size and complexity. Instead, direct knowledge editing within the models presents a more viable solution. Current model editing techniques, however, typically overlook the unique influence ranges of different facts, leading to compromised model performance in terms of both generality and locality. To address this issue, we introduce the concept of the generality-locality trade-off in multi-modal model editing. We develop a new model editing dataset named OKEDIT, specifically designed to effectively evaluate this trade-off. Building on this foundation, we propose BalancEdit, a novel method for balanced model editing that dynamically achieves an optimal balance between generality and locality. BalancEdit utilizes a unique mechanism that generates both positive and negative samples for each fact to accurately determine its influence scope and incorporates these insights into the model's latent space using a discrete, localized codebook of edits, without modifying the underlying model weights. To our knowledge, this is the first approach explicitly addressing the generality-locality trade-off in multi-modal model editing. Our comprehensive results confirm the effectiveness of BalancEdit, demonstrating minimal trade-offs while maintaining robust editing capabilities. Our code and dataset are available at https://github.com/donglgcn/BalancEdit/tree/MMOKVQA.
中文摘要: 大型多模态模型随时间推移会出现性能衰退,BalancEdit通过离散码本方法在不改变模型权重的情况下,动态平衡泛化性与局部性,有效解决了这一问题。
English Summary: Large multi-modal models face performance decay over time, which BalancEdit addresses by dynamically balancing generality and locality through a discrete codebook approach without altering model weights.

Authors:Yajuan Zhang, Jiahai Jiang, Yule Yan, Liang Yang, Ping Zhang
Title: 2DXformer: Dual Transformers for Wind Power Forecasting with Dual Exogenous Variables
Abstract:
Accurate wind power forecasting can help formulate scientific dispatch plans, which is of great significance for maintaining the safety, stability, and efficient operation of the power system. In recent years, wind power forecasting methods based on deep learning have focused on extracting the spatiotemporal correlations among data, achieving significant improvements in forecasting accuracy. However, they exhibit two limitations. First, there is a lack of modeling for the inter-variable relationships, which limits the accuracy of the forecasts. Second, treating endogenous and exogenous variables equally leads to unnecessary interactions between them, increasing the complexity of the model. In this paper, we propose the 2DXformer, which, building upon the previous work's focus on spatiotemporal correlations, addresses the aforementioned two limitations. Specifically, we classify the inputs of the model into three types: exogenous static variables, exogenous dynamic variables, and endogenous variables. First, we embed these variables as variable tokens in a channel-independent manner. Then, we use the attention mechanism to capture the correlations among exogenous variables. Finally, we employ a multi-layer perceptron with residual connections to model the impact of exogenous variables on endogenous variables. Experimental results on two real-world large-scale datasets indicate that our proposed 2DXformer can further improve the performance of wind power forecasting. The code is available in this repository: https://github.com/jseaj/2DXformer.
中文: 本文提出的2DXformer模型通过将输入变量分类并采用注意力机制,解决了现有方法在变量间关系建模方面的不足,有效提升了风电功率预测的准确性。
English: The proposed 2DXformer model enhances wind power forecasting by addressing limitations in inter-variable relationship modeling and unnecessary interactions between endogenous and exogenous variables, achieving improved performance through specialized variable classification and attention mechanisms.

Authors:Vladimir Somers, Baptiste Standaert, Victor Joos, Alexandre Alahi, Christophe De Vleeschouwer
Title: CAMELTrack: Context-Aware Multi-cue ExpLoitation for Online Multi-Object Tracking
Abstract:
Online multi-object tracking has been recently dominated by tracking-by-detection (TbD) methods, where recent advances rely on increasingly sophisticated heuristics for tracklet representation, feature fusion, and multi-stage matching. The key strength of TbD lies in its modular design, enabling the integration of specialized off-the-shelf models like motion predictors and re-identification. However, the extensive usage of human-crafted rules for temporal associations makes these methods inherently limited in their ability to capture the complex interplay between various tracking cues. In this work, we introduce CAMEL, a novel association module for Context-Aware Multi-Cue ExpLoitation, that learns resilient association strategies directly from data, breaking free from hand-crafted heuristics while maintaining TbD's valuable modularity. At its core, CAMEL employs two transformer-based modules and relies on a novel association-centric training scheme to effectively model the complex interactions between tracked targets and their various association cues. Unlike end-to-end detection-by-tracking approaches, our method remains lightweight and fast to train while being able to leverage external off-the-shelf models. Our proposed online tracking pipeline, CAMELTrack, achieves state-of-the-art performance on multiple tracking benchmarks. Our code is available at https://github.com/TrackingLaboratory/CAMELTrack.
中文: CAMEL提出了一种数据驱动的关联模块,通过直接学习数据中的关联策略摆脱人工启发式规则,同时保持检测跟踪的模块化优势,在多目标在线跟踪中实现了最先进的性能。
English: CAMEL introduces a data-driven association module that learns resilient tracking strategies without hand-crafted heuristics while preserving the modularity of tracking-by-detection, achieving state-of-the-art performance in online multi-object tracking.

Authors:Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne
Title: CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Abstract:
Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when trying to jointly learn reconstruction and cross-modal alignment. In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges: First, we tackle the granularity mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated global tokens. Third, we improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens. We evaluate the proposed approach on AudioSet, VGG Sound, and the ADE20K Sound dataset on zero-shot retrieval, classification and localization tasks demonstrating state-of-the-art performance and outperforming more complex architectures.
Chinese Summary: 提出的CAV-MAE Sync框架通过将音频序列与视频帧进行时间对齐、分离对比与重建目标、并增强空间定位能力,解决了视听学习中的关键挑战,在多个基准测试中实现了最先进的性能。
English Summary: The proposed CAV-MAE Sync framework addresses key audio-visual learning challenges by aligning audio sequences with video frames temporally, separating contrastive and reconstruction objectives, and enhancing spatial localization, achieving state-of-the-art results across multiple benchmarks.

Authors:Keiller Nogueira, Akram Zaytar, Wanli Ma, Ribana Roscher, Ronny Hänsch, Caleb Robinson, Anthony Ortiz, Simone Nsutezo, Rahul Dodhia, Juan M. Lavista Ferres, Oktay Karakuş, Paul L. Rosin
Title: Core-Set Selection for Data-efficient Land Cover Segmentation
Abstract:
The increasing accessibility of remotely sensed data and the potential of such data to inform large-scale decision-making has driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models must be trained on large datasets. However, the common assumption that broadly larger datasets lead to better outcomes tends to overlook the complexities of the data distribution, the potential for introducing biases and noise, and the computational resources required for processing and storing vast datasets. Therefore, effective solutions should consider both the quantity and quality of data. In this paper, we propose six novel core-set selection methods for selecting important subsets of samples from remote sensing image segmentation datasets that rely on imagery only, labels only, and a combination of each. We benchmark these approaches against a random-selection baseline on three commonly used land cover classification datasets: DFC2022, Vaihingen, and Potsdam. In each of the datasets, we demonstrate that training on a subset of samples outperforms the random baseline, and some approaches outperform training on all available data. This result shows the importance and potential of data-centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data-centric-rs-classification/.
中文: 针对地球观测任务,通过采用仅依赖图像、标签或其组合的新型核心集选择方法,可提升数据质量,使深度学习模型在遥感图像分割中的表现优于随机选择,有时甚至超过使用全部数据的效果。
English: Deep learning models for Earth Observation can achieve better performance by focusing on data quality through novel core-set selection methods, which outperform random selection and sometimes even full dataset training in remote sensing image segmentation tasks.

Authors:Kui Jiang, Yan Luo, Junjun Jiang, Xin Xu, Fei Ma, Fei Yu
Title: RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement
Abstract:
Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications, where wavelength-dependent attenuation causes severe content degradation and color distortion. While recent state space models like Mamba show potential for long-range dependency modeling, their unfolding operations and fixed scan paths on 1D sequences fail to adapt to local object semantics and global relation modeling, limiting their efficacy in complex underwater environments. To address this, we enhance conventional Mamba with a sorting-based scanning mechanism that dynamically reorders scanning sequences based on the statistical distribution of the spatial correlation of all pixels. In this way, it encourages the network to prioritize the most informative components: structural and semantic features. Building on this mechanism, we devise a Visually Self-adaptive State Block (VSSB) that harmonizes dynamic sorting of Mamba with input-dependent dynamic convolution, enabling coherent integration of global context and local relational cues. This design helps eliminate global focus bias, especially for widely distributed contents, which greatly weakens the statistical frequency. For robust feature extraction and refinement, we design a cross-feature bridge (CFB) to adaptively fuse multi-scale representations. These efforts compose the novel relation-driven Mamba framework for effective UIE (RD-UIE). Extensive experiments on underwater enhancement benchmarks demonstrate RD-UIE outperforms the state-of-the-art approach WMamba in both quantitative metrics and visual fidelity, achieving an average gain of 0.55 dB on the three benchmarks. Our code is available at https://github.com/kkoucy/RD-UIE/tree/main
中文: 本研究提出了一种关系驱动的Mamba框架(RD-UIE),通过动态排序像素序列并整合全局上下文与局部特征,显著提升了水下图像增强效果,在性能指标上超越现有最佳方法。
English: The study introduces a relation-driven Mamba framework (RD-UIE) that enhances underwater image processing by dynamically sorting pixel sequences and integrating global context with local features, achieving superior performance over existing methods.

Authors:Jiangtong Tan, Hu Yu, Jie Huang, Jie Xiao, Feng Zhao
Title: FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis
Abstract:
Long video generation involves generating extended videos using models trained on short videos, and it suffers from distribution shifts due to varying frame counts. It necessitates the use of local information from the original short frames to enhance visual and motion quality, and global information from the entire long frames to ensure appearance consistency. Existing training-free methods struggle to effectively integrate the benefits of both, as appearance and motion in videos are closely coupled, leading to motion inconsistency and degraded visual quality. In this paper, we reveal that global and local information can be precisely decoupled into consistent appearance and motion intensity information by applying Principal Component Analysis (PCA), allowing for refined complementary integration of global consistency and local quality. With this insight, we propose FreePCA, a training-free long video generation paradigm based on PCA that simultaneously achieves high consistency and quality. Concretely, we decouple consistent appearance and motion intensity features by measuring cosine similarity in the principal component space. Critically, we progressively integrate these features to preserve original quality and ensure smooth transitions, while further enhancing consistency by reusing the mean statistics of the initial noise. Experiments demonstrate that FreePCA can be applied to various video diffusion models without requiring training, leading to substantial improvements. Code is available at https://github.com/JosephTiTan/FreePCA.
中文:FreePCA提出了一种无需训练的方法,通过主成分分析解耦并整合长视频生成中的全局外观一致性与局部运动质量,无需额外训练即可显著提升性能。
English: FreePCA introduces a training-free method using Principal Component Analysis to decouple and integrate global appearance consistency with local motion quality in long video generation, significantly improving performance without additional training.

Authors:Murtadha Ahmed, Wenbo, Liu yunfeng
Title: MateICL: Mitigating Attention Dispersion in Large-Scale In-Context Learning
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in In-Context Learning (ICL). However, the fixed position length constraints in pre-trained models limit the number of demonstration examples. Recent efforts to extend context suffer from attention dispersion as the number of demonstrations increases. In this paper, we introduce Mitigating Attention Dispersion in large-scale ICL (MateICL) that enables LLMs to maintain effective self-attention as the context size grows. We first split the context into multiple windows, each filled to the model's context capacity, which are processed separately. Then, we introduce an additional layer to recalibrate the attention weights, prioritizing the query tokens as the number of demonstrations increases. Our empirical results show that MateICL can effectively leverage larger contexts to improve ICL performance. Compared to retrieval-based baselines, MateICL consistently achieves better performance without requiring an externally trained retrieval model. Despite recent advances in inference strategies (e.g., 32k token contexts), our results demonstrate that MateICL remains beneficial in computationally resource-constrained settings. The code is publicly available at https://github.com/amurtadha/MateICL.
中文: 本文提出MateICL方法,通过分割上下文窗口并重新校准注意力权重来缓解大规模上下文学习中的注意力分散问题,使大语言模型能有效利用更长上下文,无需外部检索模型即可超越基于检索的基线方法。
English: This paper introduces MateICL, a method that splits context into windows and recalibrates attention weights to mitigate attention dispersion in large-scale in-context learning, enabling LLMs to effectively utilize larger contexts and outperform retrieval-based approaches without external models.
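
The first step described in the abstract, splitting demonstrations into windows that each fit the model's context capacity, can be sketched as a simple greedy packing. This is only an illustration of the windowing idea: `split_into_windows`, the token counts, and the greedy order-preserving strategy are assumptions for the example, and the attention recalibration layer is not modeled here.

```python
def split_into_windows(demo_token_counts, window_capacity):
    """Greedily pack demonstrations, in order, into capacity-bounded windows.

    Returns a list of windows, each a list of demonstration indices whose
    token counts sum to at most window_capacity.
    """
    windows = []
    current = []
    used = 0
    for index, tokens in enumerate(demo_token_counts):
        if tokens > window_capacity:
            raise ValueError(f"demonstration {index} exceeds the window capacity")
        # Start a new window when the next demonstration would overflow.
        if used + tokens > window_capacity:
            windows.append(current)
            current = []
            used = 0
        current.append(index)
        used += tokens
    if current:
        windows.append(current)
    return windows
```

Each window would then be processed separately by the model, with the query appended to every window; how the per-window outputs are combined is left to the method described in the paper.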

Authors:Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Guoxia Wang, Dianhai Yu, Yonggang Wen, Dacheng Tao
Title: Low-Precision Training of Large Language Models: Methods, Challenges, and Opportunities
Abstract:
Large language models (LLMs) have achieved impressive performance across various domains. However, the substantial hardware resources required for their training present a significant barrier to efficiency and scalability. To mitigate this challenge, low-precision training techniques have been widely adopted, leading to notable advancements in training efficiency. Despite these gains, low-precision training involves several components, such as weights, activations, and gradients, each of which can be represented in different numerical formats. The resulting diversity has created a fragmented landscape in low-precision training research, making it difficult for researchers to gain a unified overview of the field. This survey provides a comprehensive review of existing low-precision training methods. To systematically organize these approaches, we categorize them into three primary groups based on their underlying numerical formats, which is a key factor influencing hardware compatibility, computational efficiency, and ease of reference for readers. The categories are: (1) fixed-point and integer-based methods, (2) floating-point-based methods, and (3) customized format-based methods. Additionally, we discuss quantization-aware training approaches, which share key similarities with low-precision training during forward propagation. Finally, we highlight several promising research directions to advance this field. A collection of papers discussed in this survey is provided in https://github.com/Hao840/Awesome-Low-Precision-Training.
中文: 大语言模型面临硬件效率挑战,促使采用低精度训练方法,按定点、浮点和自定义数值格式分类,以提升可扩展性并统一研究进展。
English: Large language models face hardware efficiency challenges, leading to the adoption of low-precision training methods categorized by numerical formats—fixed-point, floating-point, and customized—to enhance scalability and unify research efforts.
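
As a rough illustration of the first category (integer-based formats), the sketch below shows symmetric per-tensor int8 quantization and dequantization. It is a generic textbook scheme, not a method taken from the survey; the function names `quantize_int8` and `dequantize` and the max-abs scaling rule are assumptions made for this example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale (assumed scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Largest magnitude maps to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.3, -1.2, 0.75]
    codes, scale = quantize_int8(weights)
    print(codes, scale)
    print(dequantize(codes, scale))
```

The round-trip error per element is bounded by about half the scale, which is why the choice of scale (per tensor, per channel, per block) matters so much in practice.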

Authors:Lokesh Nagalapatti, Ashutosh Srivastava, Sunita Sarawagi, Amit Sharma
Title: Robust Root Cause Diagnosis using In-Distribution Interventions
Abstract:
Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today's cloud services and industrial operations. We propose In-Distribution Interventions (IDI), a novel algorithm that predicts root cause as nodes that meet two criteria: 1) **Anomaly:** root cause nodes should take on anomalous values; 2) **Fix:** had the root cause nodes assumed usual values, the target node would not have been anomalous. Prior methods of assessing the fix condition rely on counterfactuals inferred from a Structural Causal Model (SCM) trained on historical data. But since anomalies are rare and fall outside the training distribution, the fitted SCMs yield unreliable counterfactual estimates. IDI overcomes this by relying on interventional estimates obtained by solely probing the fitted SCM at in-distribution inputs. We present a theoretical analysis comparing and bounding the errors in assessing the fix condition using interventional and counterfactual estimates. We then conduct experiments by systematically varying the SCM's complexity to demonstrate the cases where IDI's interventional approach outperforms the counterfactual approach and vice versa. Experiments on both synthetic and PetShop RCD benchmark datasets demonstrate that IDI consistently identifies true root causes more accurately and robustly than nine existing state-of-the-art RCD baselines. Code is released at https://github.com/nlokeshiisc/IDI_release.
中文摘要:IDI算法通过采用分布内干预方法评估异常和修复条件,有效识别复杂系统中的根本原因,在准确性和鲁棒性方面优于现有方法。
English Summary: The IDI algorithm effectively identifies root causes in complex systems by evaluating anomaly and fix conditions using reliable in-distribution interventions, outperforming existing methods in accuracy and robustness.

Authors:Ihab Tabbara, Hussein Sibai
Title: Learning Conservative Neural Control Barrier Functions from Offline Data
Abstract:
Safety filters, particularly those based on control barrier functions, have gained increased interest as effective tools for safe control of dynamical systems. Existing correct-by-construction synthesis algorithms for such filters, however, suffer from the curse-of-dimensionality. Deep learning approaches have been proposed in recent years to address this challenge. In this paper, we add to this set of approaches an algorithm for training neural control barrier functions from offline datasets. Such functions can be used to design constraints for quadratic programs that are then used as safety filters. Our algorithm trains these functions so that the system is not only prevented from reaching unsafe states but is also disincentivized from reaching out-of-distribution ones, at which they would be less reliable. It is inspired by Conservative Q-learning, an offline reinforcement learning algorithm. We call its outputs Conservative Control Barrier Functions (CCBFs). Our empirical results demonstrate that CCBFs outperform existing methods in maintaining safety while minimally affecting task performance. Source code is available at https://github.com/tabz23/CCBF.
中文: 本文提出保守控制屏障函数(CCBFs),一种基于离线数据训练的深度学习方法,通过阻止系统进入不安全及分布外状态来增强安全过滤器,在安全性和任务性能上优于现有方法。
English: This paper introduces Conservative Control Barrier Functions (CCBFs), a deep learning approach trained from offline data to enhance safety filters by preventing systems from reaching both unsafe and out-of-distribution states, outperforming existing methods in safety and task performance.
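
To make the "constraints for quadratic programs used as safety filters" idea concrete, here is a minimal sketch of a generic control-barrier-function filter. It assumes single-integrator dynamics (state derivative equals the control input) and a hand-supplied barrier value and gradient; it is not the CCBF training algorithm, and `safety_filter` and `alpha` are hypothetical names. With a single affine constraint, the QP has the closed-form projection used below.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def safety_filter(u_ref, grad_h, h_val, alpha=1.0):
    """Solve min ||u - u_ref||^2  s.t.  grad_h . u >= -alpha * h_val.

    Assumes dynamics x_dot = u, so the barrier condition h_dot >= -alpha * h
    reduces to one affine constraint on u.
    """
    bound = -alpha * h_val
    violation = bound - dot(grad_h, u_ref)
    if violation <= 0:
        return list(u_ref)  # reference control already satisfies the constraint
    norm_sq = dot(grad_h, grad_h)
    if norm_sq == 0:
        raise ValueError("constraint is infeasible: zero gradient and h < 0")
    # Project u_ref onto the constraint boundary along grad_h.
    return [u + (violation / norm_sq) * g for u, g in zip(u_ref, grad_h)]


if __name__ == "__main__":
    # Barrier value 0.5 with gradient pointing in -x: limits motion in +x.
    print(safety_filter([1.0, 0.0], [-1.0, 0.0], 0.5))
```

In a learned setting, `h_val` and `grad_h` would come from a neural barrier function evaluated at the current state rather than being passed in by hand.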

Authors:Viktor Kocur, Charalambos Tzamos, Yaqing Ding, Zuzana Berger Haladova, Torsten Sattler, Zuzana Kukelova
Title: Are Minimal Radial Distortion Solvers Really Necessary for Relative Pose Estimation?
Abstract:
Estimating the relative pose between two cameras is a fundamental step in many applications such as Structure-from-Motion. The common approach to relative pose estimation is to apply a minimal solver inside a RANSAC loop. Highly efficient solvers exist for pinhole cameras. Yet, (nearly) all cameras exhibit radial distortion. Not modeling radial distortion leads to (significantly) worse results. However, minimal radial distortion solvers are significantly more complex than pinhole solvers, both in terms of run-time and implementation efforts. This paper compares radial distortion solvers with two simple-to-implement approaches that do not use minimal radial distortion solvers: The first approach combines an efficient pinhole solver with sampled radial undistortion parameters, where the sampled parameters are used for undistortion prior to applying the pinhole solver. The second approach uses a state-of-the-art neural network to estimate the distortion parameters rather than sampling them from a set of potential values. Extensive experiments on multiple datasets, and different camera setups, show that complex minimal radial distortion solvers are not necessary in practice. We discuss under which conditions a simple sampling of radial undistortion parameters is preferable over calibrating cameras using a learning-based prior approach. Code and newly created benchmark for relative pose estimation under radial distortion are available at https://github.com/kocurvik/rdnet.
Chinese: 本文证明在相对位姿估计中无需使用复杂的径向畸变最小化解算器,通过简单的畸变参数采样或基于神经网络的校准方法即可在实际应用中达到相似效果。
English: This paper demonstrates that complex minimal radial distortion solvers are unnecessary for relative pose estimation, showing that simple sampling of undistortion parameters or neural network-based calibration can achieve comparable results in practice.
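
The first approach (undistort with sampled parameters, then run a pinhole solver) can be sketched as follows. The abstract does not name a distortion model, so the one-parameter division model used here is an assumption, as are the function names and the candidate values; the pinhole solver and RANSAC scoring are left out.

```python
def undistort_division(point, lam):
    """Undistort a point under the (assumed) one-parameter division model.

    `point` is given relative to the distortion center in normalized units:
    x_u = x_d / (1 + lam * r_d^2).
    """
    x, y = point
    denom = 1.0 + lam * (x * x + y * y)
    if denom == 0:
        raise ValueError("point lies on the singular radius for this lambda")
    return (x / denom, y / denom)


def undistort_candidates(points, lambdas):
    """Return one undistorted copy of the point set per sampled parameter."""
    return {lam: [undistort_division(p, lam) for p in points] for lam in lambdas}


if __name__ == "__main__":
    pts = [(1.0, 0.0), (0.2, -0.1)]
    for lam, undistorted in undistort_candidates(pts, [0.0, -0.25, -0.5]).items():
        # Each candidate set would be handed to a pinhole solver inside RANSAC.
        print(lam, undistorted)
```

Each candidate set would then be scored by the inlier count of the resulting pinhole model, and the best-scoring parameter kept.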

Authors:Quang P. M. Pham, Khoi T. N. Nguyen, Nhi H. Doan, Cuong A. Pham, Qinbo Sun, Weimin Qi, Kentaro Inui, Dezhen Song
Title: SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered, LLM-Guided Distillation
Abstract:
Efficient path planning in robotics, particularly within large-scale, complex environments, remains a significant hurdle. While Large Language Models (LLMs) offer strong reasoning capabilities, their high computational cost and limited adaptability hinder real-time deployment on edge devices. We present SmallPlan, a novel framework leveraging LLMs as teacher models to train lightweight Small Language Models (SLMs) for high-level path planning tasks. In SmallPlan, the SLMs provide optimal action sequences to navigate across scene graphs that compactly represent full-scale 3D scenes. The SLMs are trained in a simulation-powered, interleaved manner with LLM-guided supervised fine-tuning (SFT) and reinforcement learning (RL). This strategy not only enables SLMs to successfully complete navigation tasks but also makes them aware of important factors like travel distance, providing more efficient path planning. Through experiments, we demonstrate that the fine-tuned SLMs perform competitively with larger models like GPT-4o on sequential path planning, without suffering from hallucination and overfitting. SmallPlan is resource-efficient, making it well-suited for edge-device deployment and advancing practical autonomous robotics. Our source code is available here: https://github.com/quangpham2006/SmallPlan
中文: SmallPlan提出了一种创新框架,利用大型语言模型作为教师模型来训练轻量级小语言模型,实现机器人高效路径规划,在保持与大型模型竞争性能的同时具备资源效率,适用于边缘设备部署。
English: SmallPlan introduces a novel framework that uses Large Language Models as teachers to train lightweight Small Language Models for efficient path planning in robotics, enabling competitive performance with larger models while being resource-efficient for edge-device deployment.

Authors:Fadi Abdeladhim Zidi, Abdelkrim Ouafi, Fares Bougourzi, Cosimo Distante, Abdelmalik Taleb-Ahmed
Title: Advancing Wheat Crop Analysis: A Survey of Deep Learning Approaches Using Hyperspectral Imaging
Abstract:
As one of the most widely cultivated and consumed crops, wheat is essential to global food security. However, wheat production is increasingly challenged by pests, diseases, climate change, and water scarcity, threatening yields. Traditional crop monitoring methods are labor-intensive and often ineffective for early issue detection. Hyperspectral imaging (HSI) has emerged as a non-destructive and efficient technology for remote crop health assessment. However, the high dimensionality of HSI data and limited availability of labeled samples present notable challenges. In recent years, deep learning has shown great promise in addressing these challenges due to its ability to extract and analyze complex structures. Despite advancements in applying deep learning methods to HSI data for wheat crop analysis, no comprehensive survey currently exists in this field. This review addresses this gap by summarizing benchmark datasets, tracking advancements in deep learning methods, and analyzing key applications such as variety classification, disease detection, and yield estimation. It also highlights the strengths, limitations, and future opportunities in leveraging deep learning methods for HSI-based wheat crop analysis. We have listed the current state-of-the-art papers and will continue tracking and updating them at https://github.com/fadi-07/Awesome-Wheat-HSI-DeepLearning.
Chinese: 本综述填补了深度学习在小麦作物高光谱成像分析领域缺乏全面调研的空白,通过总结数据集、追踪方法进展并分析品种分类、病害检测和产量估算等关键应用,同时指出了局限性和未来机遇。
English: This review addresses the lack of a comprehensive survey on deep learning applications for hyperspectral imaging in wheat crop analysis by summarizing datasets, tracking method advancements, and examining key uses like disease detection and yield estimation, while highlighting limitations and future opportunities.

Authors:Branko Brkljač, Milan Brkljač
Title: Person detection and re-identification in open-world settings of retail stores and public spaces
Abstract:
Practical applications of computer vision in smart cities usually assume system integration and operation in challenging open-world environments. In the person re-identification task, the main goal is to determine whether a specific person has appeared in another place at a different time instance of the same video, or across multiple camera feeds. This typically involves collecting raw data from video surveillance cameras in different places and under varying illumination conditions. In the considered open-world setting, it also requires detection and localization of the person inside the analyzed video frame before the main re-identification step. With multi-person and multi-camera setups, system complexity increases, requiring sophisticated tracking solutions and re-identification models. In this work, we discuss existing challenges in system design architectures, consider possible solutions based on different computer vision techniques, and describe applications of such systems in retail stores and public spaces for improved marketing analytics. To analyse the sensitivity of the person re-identification task under different open-world environments, the performance of one close to real-time solution is demonstrated over several video captures and live camera feeds. Finally, based on the conducted experiments, we indicate further research directions and possible system improvements.
中文: 本文探讨了智能城市中行人重识别系统在开放环境下面临的挑战与解决方案,重点分析了系统架构设计、实时性能评估以及在零售和公共空间中的实际应用与优化方向。
English: This abstract discusses the challenges and solutions for person re-identification in smart city applications, focusing on system design, performance evaluation in real-world settings, and potential improvements for retail and public space analytics.

Authors:Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Dongyuan Li, Renhe Jiang, Xue Liu, Philip S. Yu
Title: LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey
Abstract:
Recent advances in large language models (LLMs) have sparked growing interest in building fully autonomous agents. However, fully autonomous LLM-based agents still face significant challenges, including limited reliability due to hallucinations, difficulty in handling complex tasks, and substantial safety and ethical risks, all of which limit their feasibility and trustworthiness in real-world applications. To overcome these limitations, LLM-based human-agent systems (LLM-HAS) incorporate human-provided information, feedback, or control into the agent system to enhance system performance, reliability and safety. These human-agent collaboration systems enable humans and LLM-based agents to collaborate effectively by leveraging their complementary strengths. This paper provides the first comprehensive and structured survey of LLM-HAS. It clarifies fundamental concepts, systematically presents core components shaping these systems, including environment & profiling, human feedback, interaction types, orchestration and communication, explores emerging applications, and discusses unique challenges and opportunities arising from human-AI collaboration. By consolidating current knowledge and offering a structured overview, we aim to foster further research and innovation in this rapidly evolving interdisciplinary field. Paper lists and resources are available at https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems.
Chinese: 本文首次系统综述了基于大语言模型的人机协作系统,通过整合人类输入来提升系统性能、可靠性和安全性,同时应对自主智能体存在的幻觉与伦理风险等挑战。
English: This paper presents the first comprehensive survey of LLM-based human-agent systems (LLM-HAS), which integrate human input to enhance performance, reliability, and safety while addressing challenges like hallucinations and ethical risks in autonomous agents.

Authors:Zhengbin Zhang, Yan Wu, Hongkun Zhang
Title: Fast2comm: Collaborative perception combined with prior knowledge
Abstract:
Collaborative perception has the potential to significantly enhance perceptual accuracy through the sharing of complementary information among agents. However, real-world collaborative perception faces persistent challenges, particularly in balancing perception performance and bandwidth limitations, as well as coping with localization errors. To address these challenges, we propose Fast2comm, a prior knowledge-based collaborative perception framework. Specifically, (1) we propose a prior-supervised confidence feature generation method that effectively distinguishes foreground from background by producing highly discriminative confidence features; (2) we propose a GT bounding box-based spatial prior feature selection strategy to ensure that only the most informative prior-knowledge features are selected and shared, thereby minimizing background noise and optimizing bandwidth efficiency while enhancing adaptability to localization inaccuracies; (3) we decouple the feature fusion strategies between model training and testing phases, enabling dynamic bandwidth adaptation. To comprehensively validate our framework, we conduct extensive experiments on both real-world and simulated datasets. The results demonstrate the superior performance of our model and highlight the necessity of the proposed methods. Our code is available at https://github.com/Zhangzhengbin-TJ/Fast2comm.
中文:提出的Fast2comm框架通过生成区分性置信度特征、选择信息丰富的空间先验特征以优化带宽和适应性,并解耦融合策略实现动态带宽使用,在实验中展现出卓越性能。
English: The proposed Fast2comm framework enhances collaborative perception by generating discriminative confidence features, selecting informative spatial prior features to optimize bandwidth and adaptability, and decoupling fusion strategies for dynamic bandwidth use, achieving superior performance in experiments.

Authors:Jiajia Li, Xinda Qi, Seyed Hamidreza Nabaei, Meiqi Liu, Dong Chen, Xin Zhang, Xunyuan Yin, Zhaojian Li
Title: A Survey on 3D Reconstruction Techniques in Plant Phenotyping: From Classical Methods to Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and Beyond
Abstract:
Plant phenotyping plays a pivotal role in understanding plant traits and their interactions with the environment, making it crucial for advancing precision agriculture and crop improvement. 3D reconstruction technologies have emerged as powerful tools for capturing detailed plant morphology and structure, offering significant potential for accurate and automated phenotyping. This paper provides a comprehensive review of the 3D reconstruction techniques for plant phenotyping, covering classical reconstruction methods, emerging Neural Radiance Fields (NeRF), and the novel 3D Gaussian Splatting (3DGS) approach. Classical methods, which often rely on high-resolution sensors, are widely adopted due to their simplicity and flexibility in representing plant structures. However, they face challenges such as data density, noise, and scalability. NeRF, a recent advancement, enables high-quality, photorealistic 3D reconstructions from sparse viewpoints, but its computational cost and applicability in outdoor environments remain areas of active research. The emerging 3DGS technique introduces a new paradigm in reconstructing plant structures by representing geometry through Gaussian primitives, offering potential benefits in both efficiency and scalability. We review the methodologies, applications, and performance of these approaches in plant phenotyping and discuss their respective strengths, limitations, and future prospects (https://github.com/JiajiaLi04/3D-Reconstruction-Plants). Through this review, we aim to provide insights into how these diverse 3D reconstruction techniques can be effectively leveraged for automated and high-throughput plant phenotyping, contributing to the next generation of agricultural technology.
中文: 本文综述了植物表型分析中的三维重建技术,比较了传统方法、神经辐射场(NeRF)和三维高斯泼溅(3DGS),重点分析了它们的优势、局限性和在推动自动化农业技术发展方面的潜力。
English: This paper reviews 3D reconstruction techniques for plant phenotyping, comparing classical methods, Neural Radiance Fields (NeRF), and 3D Gaussian Splatting (3DGS), highlighting their strengths, limitations, and potential for advancing automated agricultural technology.

Authors:Marius-Constantin Dinu
Title: Primality Testing via Circulant Matrix Eigenvalue Structure: A Novel Approach Using Cyclotomic Field Theory
Abstract:
This paper presents a novel primality test based on the eigenvalue structure of circulant matrices constructed from roots of unity. We prove that an integer $n > 2$ is prime if and only if the minimal polynomial of the circulant matrix $C_n = W_n + W_n^2$ has exactly two irreducible factors over $\mathbb{Q}$. This characterization connects cyclotomic field theory with matrix algebra, providing both theoretical insights and practical applications. We demonstrate that the eigenvalue patterns of these matrices reveal fundamental distinctions between prime and composite numbers, leading to a deterministic primality test. Our approach leverages the relationship between primitive roots of unity, Galois theory, and the factorization of cyclotomic polynomials. We provide comprehensive experimental validation across various ranges of integers, discuss practical implementation considerations, and analyze the computational complexity of our method in comparison with established primality tests. The visual interpretation of our mathematical framework provides intuitive understanding of the algebraic structures that distinguish prime numbers. Our experimental validation demonstrates that our approach offers a deterministic alternative to existing methods, with performance characteristics reflecting its algebraic foundations.
中文: 本文提出了一种基于单位根构造的循环矩阵特征值结构的素数判定方法,证明了整数为素数的充要条件是该矩阵的最小多项式在有理数域上恰好有两个不可约因子。
English: This paper introduces a deterministic primality test by analyzing the eigenvalue patterns of circulant matrices derived from roots of unity, establishing that an integer is prime precisely when the matrix's minimal polynomial has exactly two irreducible factors over the rationals.
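
The criterion can be explored numerically for small $n$. The sketch below assumes $W_n$ is the $n \times n$ cyclic shift matrix (the abstract does not define it), so the eigenvalues of $C_n$ are $\omega^k + \omega^{2k}$ with $\omega = e^{2\pi i/n}$. Because a circulant matrix is diagonalizable, the irreducible factors of its minimal polynomial over $\mathbb{Q}$ correspond to the distinct Galois orbits of its eigenvalues. The code groups eigenvalues by the order of $\omega^k$ and merges groups that share a value. Note that this grouping uses a gcd, so it is an illustration of the statement rather than the paper's algorithm or a practical test.

```python
import cmath
from math import gcd


def eigenvalues(n):
    """Eigenvalues of C_n = W_n + W_n^2, assuming W_n is the cyclic shift matrix."""
    w = cmath.exp(2j * cmath.pi / n)
    return [w**k + w ** (2 * k) for k in range(n)]


def count_irreducible_factors(n, tol=1e-9):
    """Count distinct Galois orbits of eigenvalues (one per irreducible factor)."""
    groups = {}
    for k, lam in enumerate(eigenvalues(n)):
        order = n // gcd(k, n)  # order of w**k; eigenvalues of equal order are conjugate
        groups.setdefault(order, []).append(lam)
    distinct = []
    for values in groups.values():
        # Two orbits are either identical or disjoint, so one shared value suffices.
        shared = any(abs(values[0] - v) < tol for orbit in distinct for v in orbit)
        if not shared:
            distinct.append(values)
    return len(distinct)


def satisfies_criterion(n):
    return n > 2 and count_irreducible_factors(n) == 2


if __name__ == "__main__":
    for n in range(3, 13):
        print(n, count_irreducible_factors(n), satisfies_criterion(n))
```

For example, $n = 4$ gives the eigenvalue orbits $\{2\}$, $\{0\}$, and $\{-1 \pm i\}$, i.e. three factors, while $n = 5$ gives only $\{2\}$ and the orbit of $\omega + \omega^2$.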

Authors:Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
Title: T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Abstract:
Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1
中文: 本文提出T2I-R1模型,通过结合双层思维链推理与强化学习,在语义规划和像素处理层面优化文本到图像生成,显著提升了生成性能并超越了现有先进模型。
English: This paper introduces T2I-R1, a reasoning-enhanced text-to-image model that integrates a bi-level chain-of-thought process with reinforcement learning to optimize both semantic planning and pixel processing, achieving significant performance improvements over baseline and state-of-the-art models.
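
For readers unfamiliar with GRPO-style training, the group-relative advantage at its core can be sketched as below. This shows only the standard normalization of rewards within a group of samples for one prompt; it is not BiCoT-GRPO itself, and averaging the reward models' scores in `ensemble_reward` is an assumption about how an ensemble might be combined.

```python
from statistics import mean, pstdev


def ensemble_reward(scores):
    """Combine several reward-model scores for one sample (simple average, assumed)."""
    return mean(scores)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four generations for one prompt, each scored by two hypothetical reward models.
    per_sample_scores = [[1.0, 1.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    rewards = [ensemble_reward(s) for s in per_sample_scores]
    print(group_relative_advantages(rewards))
```

Samples that score above their group's mean receive positive advantages and are reinforced; those below are discouraged, with no separate value network required.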

Authors:Tiange Luo, Lajanugen Logeswaran, Justin Johnson, Honglak Lee
Title: Visual Test-time Scaling for GUI Agent Grounding
Abstract:
We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. To support this process, we propose an image-as-map mechanism that visualizes key landmarks at each step, providing a transparent action record and enabling the agent to effectively choose among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 28+% on ScreenSpot-Pro and 24+% on WebVoyager benchmarks on top of two state-of-the-art open vision language model agents, UI-TARS and Qwen2.5-VL, highlighting the effectiveness of visual test-time scaling in interactive settings. We achieve a new state-of-the-art grounding performance of 61.6% on the ScreenSpot-Pro benchmark by applying RegionFocus to a Qwen2.5-VL-72B model. Our code will be released publicly at https://github.com/tiangeluo/RegionFocus.
中文摘要:RegionFocus是一种视觉测试时缩放方法,通过动态聚焦相关区域并采用图像作为地图的机制,显著提升了视觉语言模型代理的性能,在基准测试中实现了超过24%的性能提升,并以61.6%的准确率创下了新的最先进接地性能记录。
English Summary: RegionFocus is a visual test-time scaling method that enhances Vision Language Model Agents by dynamically zooming into relevant regions and using an image-as-map mechanism, achieving significant performance improvements of over 24% on benchmarks and setting a new state-of-the-art grounding accuracy of 61.6%.
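
The zoom-in step implies some bookkeeping: a click predicted on the enlarged crop has to be mapped back to full-screenshot coordinates. The sketch below shows one plausible way to do that. The fixed zoom factor, the clamping rule, and the names `zoom_box` and `to_screen` are assumptions for illustration, not details from the RegionFocus code.

```python
def zoom_box(center, screen_size, zoom=2.0):
    """Return (left, top, width, height) of a crop around `center`, kept on screen."""
    cx, cy = center
    width, height = screen_size
    box_w, box_h = width / zoom, height / zoom
    left = min(max(cx - box_w / 2, 0), width - box_w)
    top = min(max(cy - box_h / 2, 0), height - box_h)
    return (left, top, box_w, box_h)


def to_screen(point_in_crop, box, crop_size):
    """Map a point predicted on the resized crop back to screenshot coordinates."""
    px, py = point_in_crop
    left, top, box_w, box_h = box
    crop_w, crop_h = crop_size
    return (left + px * box_w / crop_w, top + py * box_h / crop_h)


if __name__ == "__main__":
    box = zoom_box((100, 100), (1000, 800), zoom=2.0)
    # The crop is shown to the model resized to the full 1000x800 resolution.
    print(box, to_screen((500, 400), box, (1000, 800)))
```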

Authors:Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, Cordelia Schmid, Tobias Weyand
Title: MINERVA: Evaluating Complex Video Reasoning
Abstract:
Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess if models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.
Chinese: MINERVA数据集通过提供带有详细推理路径的复杂问题,弥补了现有视频基准仅关注结果监督的不足,旨在检验多模态模型是否真正具备结合感知与时间信息进行视频推理的能力。
English: The MINERVA dataset addresses the limitations of current video benchmarks by providing detailed reasoning traces alongside complex questions, challenging multimodal models to demonstrate genuine perceptual and temporal reasoning rather than relying on superficial cues.

Authors:Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan
Title: Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
Abstract:
Memory is a fundamental component of AI systems, underpinning large language model (LLM)-based agents. While prior surveys have focused on memory applications with LLMs (e.g., enabling personalized memory in conversational agents), they often overlook the atomic operations that underlie memory dynamics. In this survey, we first categorize memory representations into parametric and contextual forms, and then introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression. We map these operations to the most relevant research topics across long-term, long-context, parametric modification, and multi-source memory. By reframing memory systems through the lens of atomic operations and representation types, this survey provides a structured and dynamic perspective on research, benchmark datasets, and tools related to memory in AI, clarifying the functional interplay in LLM-based agents while outlining promising directions for future research. The paper list, datasets, methods and tools are available at https://github.com/Elvin-Yiming-Du/Survey_Memory_in_AI.
中文: 本调查通过将记忆表征分类为参数化与情境化形式并引入六种基本记忆操作,重构了人工智能记忆系统,为基于大语言模型的智能体研究提供了结构化视角并指明了未来方向。
English: This survey reframes AI memory systems by categorizing representations into parametric and contextual forms and introducing six fundamental memory operations, offering a structured perspective on research and future directions in LLM-based agents.

Authors:Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen
Title: DeepCritic: Deliberate Critique with Large Language Models
Abstract:
As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and insufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.
Chinese: 本研究提出一个两阶段框架,通过先使用精细化分步评论进行监督微调,再结合强化学习,显著提升大语言模型的数学批判能力,最终模型在错误识别和纠错反馈方面均优于现有评论模型。
English: This study introduces a two-stage framework to enhance LLMs' math critique ability by first fine-tuning with deliberate step-wise critiques and then applying reinforcement learning, resulting in a model that outperforms existing critics and improves error correction.

Authors:Atahan Karagoz
Title: OmicsCL: Unsupervised Contrastive Learning for Cancer Subtype Discovery and Survival Stratification
Abstract:
Unsupervised learning of disease subtypes from multi-omics data presents a significant opportunity for advancing personalized medicine. We introduce OmicsCL, a modular contrastive learning framework that jointly embeds heterogeneous omics modalities, such as gene expression, DNA methylation, and miRNA expression, into a unified latent space. Our method incorporates a survival-aware contrastive loss that encourages the model to learn representations aligned with survival-related patterns, without relying on labeled outcomes. Evaluated on the TCGA BRCA dataset, OmicsCL uncovers clinically meaningful clusters and achieves strong unsupervised concordance with patient survival. The framework demonstrates robustness across hyperparameter configurations and can be tuned to prioritize either subtype coherence or survival stratification. Ablation studies confirm that integrating survival-aware loss significantly enhances the predictive power of learned embeddings. These results highlight the promise of contrastive objectives for biological insight discovery in high-dimensional, heterogeneous omics data.
中文: OmicsCL是一种对比学习框架,通过融合多组学数据并引入生存感知损失,无需标记结果即可有效识别具有临床意义的疾病亚型。
English: OmicsCL is a contrastive learning framework that integrates multi-omics data into a unified latent space with survival-aware loss, effectively identifying clinically relevant disease subtypes without labeled outcomes.

Authors:Marco Braga, Pranav Kasela, Alessandro Raganato, Gabriella Pasi
Title: Investigating Task Arithmetic for Zero-Shot Information Retrieval
Abstract:
Large Language Models (LLMs) have shown impressive zero-shot performance across a variety of Natural Language Processing tasks, including document re-ranking. However, their effectiveness degrades on unseen tasks and domains, largely due to shifts in vocabulary and word distributions. In this paper, we investigate Task Arithmetic, a technique that combines the weights of LLMs pre-trained on different tasks or domains via simple mathematical operations, such as addition or subtraction, to adapt retrieval models without requiring additional fine-tuning. Our method is able to synthesize diverse tasks and domain knowledge into a single model, enabling effective zero-shot adaptation in different retrieval contexts. Extensive experiments on publicly available scientific, biomedical, and multilingual datasets show that our method improves state-of-the-art re-ranking performance by up to 18% in NDCG@10 and 15% in P@10. In addition to these empirical gains, our analysis provides insights into the strengths and limitations of Task Arithmetic as a practical strategy for zero-shot learning and model adaptation. We make our code publicly available at https://github.com/DetectiveMB/Task-Arithmetic-for-ZS-IR.
中文摘要:本文提出的任务算术方法通过简单数学运算整合不同任务和领域的预训练模型权重,使大语言模型无需额外微调即可实现有效的零样本文档重排,在多项测试中显著提升了检索性能。
English Summary: This paper introduces Task Arithmetic, a technique that adapts Large Language Models for zero-shot document re-ranking by combining pre-trained weights from different tasks and domains through simple mathematical operations, achieving significant performance improvements without additional fine-tuning.
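Illustrative sketch (not the authors' code): the abstract describes combining model weights by addition or subtraction. A common way to express this is through "task vectors", i.e. the difference between fine-tuned and base weights, which are scaled and added back to a base model. The parameter name `layer.weight`, the scaling factor `alpha`, and the toy values below are assumptions made purely for illustration; the paper's actual models and scaling choices are not stated in the abstract.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector = fine-tuned weights minus base weights, per parameter."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_task_vectors(base: Weights, vectors: List[Weights], alpha: float = 1.0) -> Weights:
    """Add the alpha-scaled task vectors to a copy of the base weights."""
    merged = {k: list(v) for k, v in base.items()}
    for vec in vectors:
        for k, delta in vec.items():
            merged[k] = [w + alpha * d for w, d in zip(merged[k], delta)]
    return merged


# Hypothetical toy weights: one base model and one domain-adapted model.
base = {"layer.weight": [0.0, 1.0, 2.0]}
domain_model = {"layer.weight": [0.5, 1.0, 1.5]}

tau = task_vector(domain_model, base)                 # domain knowledge as a weight delta
adapted = apply_task_vectors(base, [tau], alpha=0.5)  # add half of the delta back
print(adapted["layer.weight"])
```

Subtracting a task vector (a negative `alpha`) would correspond to the "subtraction" operation mentioned in the abstract; no gradient step is involved in either case.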

Authors:Muyi Bao, Shuchang Lyu, Zhaoyang Xu, Huiyu Zhou, Jinchang Ren, Shiming Xiang, Xiangtai Li, Guangliang Cheng
Title: Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook
Abstract:
Deep learning has profoundly transformed remote sensing, yet prevailing architectures like Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) remain constrained by critical trade-offs: CNNs suffer from limited receptive fields, while ViTs grapple with quadratic computational complexity, hindering their scalability for high-resolution remote sensing data. State Space Models (SSMs), particularly the recently proposed Mamba architecture, have emerged as a paradigm-shifting solution, combining linear computational scaling with global context modeling. This survey presents a comprehensive review of Mamba-based methodologies in remote sensing, systematically analyzing about 120 Mamba-based remote sensing studies to construct a holistic taxonomy of innovations and applications. Our contributions are structured across five dimensions: (i) foundational principles of vision Mamba architectures, (ii) micro-architectural advancements such as adaptive scan strategies and hybrid SSM formulations, (iii) macro-architectural integrations, including CNN-Transformer-Mamba hybrids and frequency-domain adaptations, (iv) rigorous benchmarking against state-of-the-art methods in multiple application tasks, such as object detection, semantic segmentation, and change detection, and (v) critical analysis of unresolved challenges with actionable future directions. By bridging the gap between SSM theory and remote sensing practice, this survey establishes Mamba as a transformative framework for remote sensing analysis. To our knowledge, this paper is the first systematic review of Mamba architectures in remote sensing. Our work provides a structured foundation for advancing research in remote sensing systems through SSM-based methods. We curate an open-source repository (https://github.com/BaoBao0926/Awesome-Mamba-in-Remote-Sensing) to foster community-driven advancements.
中文: 本综述全面评述了基于Mamba的状态空间模型在遥感领域的变革性应用,通过线性计算扩展和全局上下文建模克服了CNN与ViT的局限,并系统分析了其创新架构、实际应用与未来发展方向。
English: This survey comprehensively reviews Mamba-based State Space Models as a transformative solution for remote sensing, overcoming limitations of CNNs and ViTs through linear computational scaling and global context modeling, while systematically analyzing innovations, applications, and future directions.

Authors:Haozheng Luo, Chenghao Qiu, Maojiang Su, Zhihan Zhou, Zoe Mehta, Guo Ye, Jerry Yao-Chieh Hu, Han Liu
Title: Fast and Low-Cost Genomic Foundation Models via Outlier Removal
Abstract:
To address the challenge of scarce computational resources in genomic modeling, we introduce GERM, a genomic foundation model with strong compression performance and fast adaptability. GERM improves upon models like DNABERT-2 by eliminating outliers that hinder low-rank adaptation and post-training quantization, enhancing both efficiency and robustness. We replace the vanilla attention layer with an outlier-free mechanism inspired by associative memory models. By removing outliers during both pre-training and fine-tuning, this approach accelerates adaptation, reduces computational costs, and enhances quantization robustness within acceptable loss margins. Additionally, we propose GERM-T, a strategy that employs small-step continual learning within the outlier-free framework, leveraging original checkpoints to avoid retraining from scratch. Empirically, GERM improves fine-tuning performance by 37.98% and quantization by 64.34% over the baseline model. It also reduces average kurtosis by 92.14% and maximum infinity norm by 82.77%. Compared to leading methods, GERM consistently delivers superior performance, offering a practical solution for genomic modeling in resource-constrained settings. Code is available at https://github.com/MAGICS-LAB/GERM.
中文: GERM是一种基因组基础模型,通过消除注意力机制中的异常值来提高效率、适应性和量化鲁棒性,在微调和量化方面实现了显著的性能提升,为资源受限环境提供了实用解决方案。
English: GERM is a genomic foundation model designed to overcome computational limitations by eliminating outliers in attention mechanisms, which enhances efficiency, adaptability, and quantization robustness, achieving significant performance improvements in fine-tuning and quantization.

Authors:Alex Schutz, Yang You, Matias Mattamala, Ipek Caliskanelli, Bruno Lacerda, Nick Hawes
Title: A Finite-State Controller Based Offline Solver for Deterministic POMDPs
Abstract:
Deterministic partially observable Markov decision processes (DetPOMDPs) often arise in planning problems where the agent is uncertain about its environmental state but can act and observe deterministically. In this paper, we propose DetMCVI, an adaptation of the Monte Carlo Value Iteration (MCVI) algorithm for DetPOMDPs, which builds policies in the form of finite-state controllers (FSCs). DetMCVI solves large problems with a high success rate, outperforming existing baselines for DetPOMDPs. We also verify the performance of the algorithm in a real-world mobile robot forest mapping scenario.
中文摘要:DetMCVI算法将蒙特卡洛值迭代应用于确定性部分可观测马尔可夫决策过程,通过构建有限状态控制器策略,在解决大规模问题时表现优异,并在真实机器人森林测绘场景中得到验证。
English Summary: DetMCVI adapts Monte Carlo Value Iteration for deterministic partially observable Markov decision processes, effectively solving large problems and outperforming existing methods, as validated in a real-world robot mapping application.

Authors:Lucas Robinet, Ahmad Berjaoui, Elizabeth Cohen-Jonathan Moyal
Title: Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities
Abstract:
Multimodal magnetic resonance imaging (MRI) constitutes the first line of investigation for clinicians in the care of brain tumors, providing crucial insights for surgery planning, treatment monitoring, and biomarker identification. Pre-training on large datasets has been shown to help models learn transferable representations and adapt with minimal labeled data. This behavior is especially valuable in medical imaging, where annotations are often scarce. However, applying this paradigm to multimodal medical data introduces a challenge: most existing approaches assume that all imaging modalities are available during both pre-training and fine-tuning. In practice, missing modalities often occur due to acquisition issues, specialist unavailability, or specific experimental designs on small in-house datasets. Consequently, a common approach involves training a separate model for each desired modality combination, making the process both resource-intensive and impractical for clinical use. Therefore, we introduce BM-MAE, a masked image modeling pre-training strategy tailored for multimodal MRI data. The same pre-trained model seamlessly adapts to any combination of available modalities, extracting rich representations that capture both intra- and inter-modal information. This allows fine-tuning on any subset of modalities without requiring architectural changes, while still benefiting from a model pre-trained on the full set of modalities. Extensive experiments show that the proposed pre-training strategy outperforms or remains competitive with baselines that require separate pre-training for each modality subset, while substantially surpassing training from scratch on several downstream tasks. Additionally, it can quickly and efficiently reconstruct missing modalities, highlighting its practical value. Code and trained models are available at: https://github.com/Lucas-rbnt/BM-MAE
中文: BM-MAE是一种针对多模态MRI的预训练策略,能适应任意可用模态组合,无需调整架构即可实现高效微调和缺失模态重建。
English: BM-MAE is a pre-training strategy for multimodal MRI that adapts to any available modality combination, enabling efficient fine-tuning and missing modality reconstruction without architectural changes.

Authors:Jorgen Cani, Christos Diou, Spyridon Evangelatos, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Title: X-ray illicit object detection using hybrid CNN-transformer neural network architectures
Abstract:
In the field of X-ray security applications, even the smallest details can significantly impact outcomes. Objects that are heavily occluded or intentionally concealed pose a great challenge for detection, whether by human observation or through advanced technological applications. While certain Deep Learning (DL) architectures demonstrate strong performance in processing local information, such as Convolutional Neural Networks (CNNs), others excel in handling distant information, e.g., transformers. In X-ray security imaging the literature has been dominated by the use of CNN-based methods, while the integration of the two aforementioned leading architectures has not been sufficiently explored. In this paper, various hybrid CNN-transformer architectures are evaluated against a common CNN object detection baseline, namely YOLOv8. In particular, a CNN (HGNetV2) and a hybrid CNN-transformer (Next-ViT-S) backbone are combined with different CNN/transformer detection heads (YOLOv8 and RT-DETR). The resulting architectures are comparatively evaluated on three challenging public X-ray inspection datasets, namely EDS, HiXray, and PIDray. Interestingly, while the YOLOv8 detector with its default backbone (CSP-DarkNet53) is generally shown to be advantageous on the HiXray and PIDray datasets, when a domain distribution shift is incorporated in the X-ray images (as happens in the EDS datasets), hybrid CNN-transformer architectures exhibit increased robustness. Detailed comparative evaluation results, including object-level detection performance and object-size error analysis, demonstrate the strengths and weaknesses of each architectural combination and suggest guidelines for future research. The source code and network weights of the models employed in this study are available at https://github.com/jgenc/xray-comparative-evaluation.
Chinese: 本研究评估了用于X射线安检物体检测的混合CNN-Transformer架构,发现相较于YOLOv8等传统卷积网络,该架构在应对领域偏移时展现出更强鲁棒性,基于公开数据集的全面测试揭示了不同架构组合的性能特点。
English: This study evaluates hybrid CNN-transformer architectures for X-ray security object detection, finding they offer enhanced robustness against domain shifts compared to conventional CNNs like YOLOv8, with comprehensive testing on public datasets revealing distinct performance trade-offs.

Authors:Yue Meng, Chuchu Fan
Title: TeLoGraF: Temporal Logic Planning via Graph-encoded Flow Matching
Abstract:
Learning to solve complex tasks with signal temporal logic (STL) specifications is crucial to many real-world applications. However, most previous works only consider fixed or parametrized STL specifications due to the lack of a diverse STL dataset and encoders to effectively extract temporal logic information for downstream tasks. In this paper, we propose TeLoGraF, Temporal Logic Graph-encoded Flow, which utilizes a Graph Neural Network (GNN) encoder and flow-matching to learn solutions for general STL specifications. We identify four commonly used STL templates and collect a total of 200K specifications with paired demonstrations. We conduct extensive experiments in five simulation environments ranging from simple dynamical models in the 2D space to high-dimensional 7DoF Franka Panda robot arm and Ant quadruped navigation. Results show that our method outperforms other baselines in the STL satisfaction rate. Compared to classical STL planning algorithms, our approach is 10-100X faster in inference and can work on any system dynamics. Besides, we show our graph-encoding method's capability to solve complex STLs and robustness to out-of-distribution STL specifications. Code is available at https://github.com/mengyuest/TeLoGraF
中文: 本文提出TeLoGraF方法,利用图神经网络和流匹配技术高效解决通用信号时序逻辑规范,在多种仿真环境中展现出优越的性能和速度。
English: This paper introduces TeLoGraF, a method using Graph Neural Networks and flow-matching to efficiently solve general signal temporal logic specifications, demonstrating superior performance and speed in various simulations.

Authors:Chanwoo Kim, Jinkyu Sung, Yebonn Han, Joonseok Lee
Title: Graph Spectral Filtering with Chebyshev Interpolation for Recommendation
Abstract:
Graph convolutional networks have recently gained prominence in collaborative filtering (CF) for recommendations. However, we identify potential bottlenecks in two foundational components. First, the embedding layer leads to a latent space with limited capacity, overlooking locally observed but potentially valuable preference patterns. Also, the widely-used neighborhood aggregation is limited in its ability to leverage diverse preference patterns in a fine-grained manner. Building on spectral graph theory, we reveal that these limitations stem from graph filtering with a cut-off in the frequency spectrum and a restricted linear form. To address these issues, we introduce ChebyCF, a CF framework based on graph spectral filtering. Instead of a learned embedding, it takes a user's raw interaction history to utilize the full spectrum of signals contained in it. Also, it adopts Chebyshev interpolation to effectively approximate a flexible non-linear graph filter, and further enhances it by using an additional ideal pass filter and degree-based normalization. Through extensive experiments, we verify that ChebyCF overcomes the aforementioned bottlenecks and achieves state-of-the-art performance across multiple benchmarks and reasonably fast inference. Our code is available at https://github.com/chanwoo0806/ChebyCF.
中文:ChebyCF提出了一种基于图谱滤波和切比雪夫插值的协同过滤框架,有效解决了嵌入层容量限制和邻域聚合的不足,在多个基准测试中实现了领先性能。
English: ChebyCF introduces a collaborative filtering framework using graph spectral filtering and Chebyshev interpolation to overcome limitations in embedding capacity and neighborhood aggregation, achieving state-of-the-art performance across benchmarks.
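Illustrative sketch (not the ChebyCF implementation): the abstract relies on Chebyshev interpolation to approximate a flexible non-linear graph filter. The scalar version below shows the underlying idea, namely sampling a target frequency response at Chebyshev nodes on [-1, 1] and recovering the expansion coefficients. The response `exp(-2(x + 1))` is a hypothetical low-pass-like example chosen here, not the filter used in the paper. In a graph setting, the same recurrence would be applied to a rescaled graph operator acting on an interaction signal instead of to a scalar `x`.

```python
import math
from typing import Callable, List


def chebyshev_coefficients(f: Callable[[float], float], degree: int) -> List[float]:
    """Coefficients c_k such that f(x) ~ sum_k c_k * T_k(x) on [-1, 1],
    obtained by interpolating f at the degree+1 Chebyshev nodes."""
    n = degree + 1
    nodes = [math.cos(math.pi * (j + 0.5) / n) for j in range(n)]
    values = [f(x) for x in nodes]
    coeffs = []
    for k in range(n):
        s = sum(v * math.cos(k * math.pi * (j + 0.5) / n) for j, v in enumerate(values))
        coeffs.append((1.0 if k == 0 else 2.0) * s / n)
    return coeffs


def evaluate(coeffs: List[float], x: float) -> float:
    """Evaluate the expansion with the recurrence T_{k+1} = 2x T_k - T_{k-1}."""
    t_prev, t_curr = 1.0, x  # T_0(x), T_1(x)
    total = coeffs[0]
    for k in range(1, len(coeffs)):
        total += coeffs[k] * t_curr
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return total


def response(x: float) -> float:
    """Hypothetical smooth filter response on the rescaled spectrum [-1, 1]."""
    return math.exp(-2.0 * (x + 1.0))


coeffs = chebyshev_coefficients(response, degree=10)
max_error = max(abs(evaluate(coeffs, x / 50.0) - response(x / 50.0)) for x in range(-50, 51))
print(max_error)  # small for a smooth response; grows for sharper filters
```

The recurrence needs only repeated multiplications by `x`, which is why the graph analogue can be computed with sparse matrix-vector products and no eigendecomposition.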

Authors:Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yixuan Wang, Chung-Wei Lin, Chen Lv, Qi Zhu, Jürgen Schmidhuber, Chao Huang
Title: Directly Forecasting Belief for Reinforcement Learning with Delays
Abstract:
Reinforcement learning (RL) with delays is challenging as sensory perceptions lag behind the actual events: the RL agent needs to estimate the real state of its environment based on past observations. State-of-the-art (SOTA) methods typically employ recursive, step-by-step forecasting of states. This can cause the accumulation of compounding errors. To tackle this problem, our novel belief estimation method, named Directly Forecasting Belief Transformer (DFBT), directly forecasts states from observations without incrementally estimating intermediate states step-by-step. We theoretically demonstrate that DFBT greatly reduces compounding errors of existing recursively forecasting methods, yielding stronger performance guarantees. In experiments with D4RL offline datasets, DFBT reduces compounding errors with remarkable prediction accuracy. DFBT's capability to forecast state sequences also facilitates multi-step bootstrapping, thus greatly improving learning efficiency. On the MuJoCo benchmark, our DFBT-based method substantially outperforms SOTA baselines. Code is available at https://github.com/QingyuanWuNothing/DFBT.
中文摘要:提出的DFBT方法通过直接从观测值预测状态来减少延迟强化学习中的累积误差,在理论保证和实验基准上均显著优于现有方法。
English Summary: The proposed DFBT method directly forecasts states from observations to minimize compounding errors in reinforcement learning with delays, significantly outperforming existing approaches in both theoretical guarantees and experimental benchmarks.

Authors:Jeremias Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero Antolin
Title: Self-Ablating Transformers: More Interpretability, Less Sparsity
Abstract:
A growing intuition in machine learning suggests a link between sparsity and interpretability. We introduce a novel self-ablation mechanism to investigate this connection ante-hoc in the context of language transformers. Our approach dynamically enforces a k-winner-takes-all constraint, forcing the model to demonstrate selective activation across neuron and attention units. Unlike post-hoc methods that analyze already-trained models, our approach integrates interpretability directly into model training, promoting feature localization from inception. Training small models on the TinyStories dataset and employing interpretability tests, we find that self-ablation leads to more localized circuits, concentrated feature representations, and increased neuron specialization without compromising language modelling performance. Surprisingly, our method also decreased overall sparsity, indicating that self-ablation promotes specialization rather than widespread inactivity. This reveals a complex interplay between sparsity and interpretability, where decreased global sparsity can coexist with increased local specialization, leading to enhanced interpretability. To facilitate reproducibility, we make our code available at https://github.com/keenanpepper/self-ablating-transformers.
中文: 本研究提出了一种自消融机制,强制语言变换器进行选择性激活,发现尽管该方法降低了整体稀疏性,但通过促进局部专业化和特征集中,在不牺牲性能的情况下增强了可解释性。
English: This study introduces a self-ablation mechanism that enforces selective activation in language transformers, revealing that while it reduces overall sparsity, it enhances interpretability by promoting local specialization and feature concentration without sacrificing performance.
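Illustrative sketch (not the authors' mechanism): the abstract mentions a k-winner-takes-all constraint over neuron and attention units. The snippet below shows only the basic operation on a single activation vector, keeping the k largest values and zeroing the rest. How the paper selects units dynamically during training is not described in the abstract, so the hard top-k mask and the tie-breaking rule here are assumptions.

```python
from typing import List


def k_winner_takes_all(activations: List[float], k: int) -> List[float]:
    """Keep the k largest activations and zero the others.
    Ties are broken in favour of the lower index (an arbitrary choice)."""
    if k <= 0:
        return [0.0] * len(activations)
    ranked = sorted(range(len(activations)), key=lambda i: (-activations[i], i))
    winners = set(ranked[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


masked = k_winner_takes_all([0.1, 0.9, -0.3, 0.5], k=2)
print(masked)  # [0.0, 0.9, 0.0, 0.5]
```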

Authors:Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, Tae-Hyun Oh
Title: JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
Abstract:
We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose, namely, adaptive scheduling weights, which depend on the noise levels of each modality, and the unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation by simply controlling the timesteps of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a viable alternative to conditional generation. The project page is available at https://byungki-k.github.io/JointDiT/.
中文: JointDiT是一种扩散变换器,通过自适应调度权重和不平衡时间步采样技术,实现了RGB与深度信息的联合分布建模,能生成高质量图像和精确深度图。
English: JointDiT is a diffusion transformer that models the joint distribution of RGB and depth, generating high-fidelity images and accurate depth maps through adaptive scheduling weights and unbalanced timestep sampling.

Authors:Thomas Flinkow, Marco Casadio, Colin Kessler, Rosemary Monahan, Ekaterina Komendantskaya
Title: A General Framework for Property-Driven Machine Learning
Abstract:
Neural networks have been shown to frequently fail to learn critical safety and correctness properties purely from data, highlighting the need for training methods that directly integrate logical specifications. While adversarial training can be used to improve robustness to small perturbations within $ε$-cubes, domains other than computer vision -- such as control systems and natural language processing -- may require more flexible input region specifications via generalised hyper-rectangles. Differentiable logics offer a way to encode arbitrary logical constraints as additional loss terms that guide the learning process towards satisfying these constraints. In this paper, we investigate how these two complementary approaches can be unified within a single framework for property-driven machine learning, as a step toward effective formal verification of neural networks. We show that well-known properties from the literature are subcases of this general approach, and we demonstrate its practical effectiveness on a case study involving a neural network controller for a drone system. Our framework is made publicly available at https://github.com/tflinkow/property-driven-ml.
中文: 神经网络常无法仅从数据中学习安全属性,需要通过可微逻辑和对抗训练将逻辑规范融入统一框架,以提升网络在控制等领域的可靠性与验证效果。
English: Neural networks often fail to learn safety properties from data alone, necessitating training methods that incorporate logical specifications through differentiable logics and adversarial training within a unified framework.

Authors:Fabian Woller, Lis Arend, Christian Fuchsberger, Markus List, David B. Blumenthal
Title: NApy: Efficient Statistics in Python for Large-Scale Heterogeneous Data with Enhanced Support for Missing Data
Abstract:
Existing Python libraries and tools lack the ability to efficiently compute statistical test results for large datasets in the presence of missing values. This presents an issue as soon as constraints on runtime and memory availability become essential considerations for a particular use case. Relevant research areas where such limitations arise include interactive tools and databases for exploratory analysis of biomedical data. To address this problem, we present the Python package NApy, which relies on a Numba and C++ backend with OpenMP parallelization to enable scalable statistical testing for mixed-type datasets in the presence of missing values. Both with respect to runtime and memory consumption, NApy outperforms competitor tools and baseline implementations with naive Python-based parallelization by orders of magnitude, thereby enabling on-the-fly analyses in interactive applications. NApy is publicly available at https://github.com/DyHealthNet/NApy.
Chinese: NApy Python 包通过基于 Numba 和 C++ 的 OpenMP 并行化后端,能够高效处理含缺失值的大规模数据集统计检验,在运行速度和内存消耗上大幅优于现有工具。
English: The NApy Python package efficiently computes statistical tests for large datasets with missing values, significantly outperforming existing tools in speed and memory usage through its Numba and C++ backend with OpenMP parallelization.

Authors:Chenhao Xu, Longxiang Gao, Yuan Miao, Xi Zheng
Title: Distributed Retrieval-Augmented Generation
Abstract:
As large language models (LLMs) become increasingly adopted on edge devices, Retrieval-Augmented Generation (RAG) is gaining prominence as a solution to address factual deficiencies and hallucinations by integrating external knowledge. However, centralized RAG architectures face significant challenges in data privacy and scalability. For instance, smart healthcare services often rely on collecting sensitive patient data and building a centralized knowledge base to provide better diagnosis and treatment advice, while privacy concerns significantly impede this process. Besides, maintaining a comprehensive and continuously updated knowledge base is costly, particularly in response to regional epidemics and rapidly mutating viruses. To address these challenges, this paper introduces Distributed Retrieval-Augmented Generation (DRAG), a novel framework that improves data privacy by eliminating the need for a centralized knowledge base and restoring data control to owners. DRAG incorporates a Topic-Aware Random Walk (TARW) algorithm that leverages LLMs to extract query topics and facilitate targeted peer discovery within a peer-to-peer network, enabling efficient knowledge retrieval in decentralized environments. Extensive experiments across three diverse datasets and LLMs demonstrate that DRAG with TARW achieves near-centralized RAG performance by using half as many messages as flooding. The code is available at https://github.com/xuchenhao001/DRAG.
中文: 针对集中式RAG的隐私和扩展性问题,本文提出分布式检索增强生成框架DRAG,通过主题感知算法在点对点网络中实现高效知识检索,同时保持接近集中式的性能。
English: To address privacy and scalability issues in centralized RAG, this paper proposes DRAG, a decentralized framework that uses a topic-aware algorithm for efficient peer-to-peer knowledge retrieval while maintaining near-centralized performance.

Authors:Sindre M. Hegre, Welf Rehberg, Mihir Kulkarni, Kostas Alexis
Title: A Neural Network Mode for PX4 on Embedded Flight Controllers
Abstract:
This paper contributes an open-sourced implementation of a neural-network-based controller framework within the PX4 stack. We develop a custom module for inference on the microcontroller while retaining all of the functionality of the PX4 autopilot. Policies trained in the Aerial Gym Simulator are converted to the TensorFlow Lite format and then built together with PX4 and flashed to the flight controller. The policies substitute the control cascade within PX4 to offer an end-to-end position-setpoint tracking controller directly providing normalized motor RPM setpoints. Experiments conducted in simulation and the real world show similar tracking performance. We thus provide a flight-ready pipeline for testing neural control policies in the real world. The pipeline simplifies the deployment of neural networks on embedded flight controller hardware, thereby accelerating research on learning-based control. Both the Aerial Gym Simulator and the PX4 module are open-sourced at https://github.com/ntnu-arl/aerial_gym_simulator and https://github.com/SindreMHegre/PX4-Autopilot-public/tree/for_paper. Video: https://youtu.be/lY1OKz_UOqM?si=VtzL243BAY3lblTJ.
中文: 本文提出了一种在PX4自驾仪中实现神经网络控制器的开源框架,可直接控制电机并在真实环境中部署,其仿真与实验性能表现相当。
English: This paper presents an open-source framework for implementing neural-network controllers in the PX4 autopilot, enabling direct motor control and real-world deployment with comparable simulation and experimental performance.

Authors:Ruiyuan Zhang, Qi Wang, Jiaxiang Liu, Yu Zhang, Yuchi Huo, Chao Wu
Title: Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly
Abstract:
3D part assembly aims to understand part relationships and predict their 6-DoF poses to construct realistic 3D shapes, addressing the growing demand for autonomous assembly, which is crucial for robots. Existing methods mainly estimate the transformation of each part by training neural networks under supervision, which requires a substantial quantity of manually labeled data. However, the high cost of data collection and the immense variability of real-world shapes and parts make traditional methods impractical for large-scale applications. In this paper, we propose the first zero-shot part assembly method that utilizes pre-trained point cloud diffusion models as discriminators in the assembly process, guiding the manipulation of parts to form realistic shapes. Specifically, we theoretically demonstrate that utilizing a diffusion model for zero-shot part assembly can be transformed into an Iterative Closest Point (ICP) process. Then, we propose a novel pushing-away strategy to address overlapping parts, thereby further enhancing the robustness of the method. To verify our work, we conduct extensive experiments and quantitative comparisons to several strong baseline methods, demonstrating the effectiveness of the proposed approach, which even surpasses the supervised learning method. The code has been released at https://github.com/Ruiyuan-Zhang/Zero-Shot-Assembly.
Chinese: 本文提出了一种零样本三维零件装配方法,利用预训练扩散模型作为判别器指导零件组装而无需标注数据,通过创新的推开策略和迭代优化,在实验中超越了监督学习方法的效果。
English: This paper introduces a zero-shot 3D part assembly method using pre-trained diffusion models as discriminators to guide part manipulation without labeled data, achieving superior performance over supervised methods through a novel pushing-away strategy and iterative optimization.

Authors:Wenxuan Liu, Yao Deng, Kang Chen, Xian Zhong, Zhaofei Yu, Tiejun Huang
Title: SOTA: Spike-Navigated Optimal TrAnsport Saliency Region Detection in Composite-bias Videos
Abstract:
Existing saliency detection methods struggle in real-world scenarios due to motion blur and occlusions. In contrast, spike cameras, with their high temporal resolution, significantly enhance visual saliency maps. However, the composite noise inherent to spike camera imaging introduces discontinuities in saliency detection. Low-quality samples further distort model predictions, leading to saliency bias. To address these challenges, we propose Spike-navigated Optimal TrAnsport Saliency Region Detection (SOTA), a framework that leverages the strengths of spike cameras while mitigating biases in both spatial and temporal dimensions. Our method introduces Spike-based Micro-debias (SM) to capture subtle frame-to-frame variations and preserve critical details, even under minimal scene or lighting changes. Additionally, Spike-based Global-debias (SG) refines predictions by reducing inconsistencies across diverse conditions. Extensive experiments on real and synthetic datasets demonstrate that SOTA outperforms existing methods by eliminating composite noise bias. Our code and dataset will be released at https://github.com/lwxfight/sota.
中文摘要:提出的SOTA框架利用脉冲相机的高时间分辨率提升视觉显著性检测,通过微观和全局去偏组件解决复合噪声与样本偏差问题,在多种条件下优于现有方法。
English Summary: The proposed SOTA framework leverages spike cameras' high temporal resolution to enhance saliency detection while addressing composite noise and bias through micro and global debiasing components, outperforming existing methods in diverse conditions.

Authors:Feng Xue, Wenzhuang Xu, Guofeng Zhong, Anlong Ming, Nicu Sebe
Title: Cues3D: Unleashing the Power of Sole NeRF for Consistent and Unique Instances in Open-Vocabulary 3D Panoptic Segmentation
Abstract:
Open-vocabulary 3D panoptic segmentation has recently emerged as a significant trend. Top-performing methods currently integrate 2D segmentation with geometry-aware 3D primitives. However, this advantage is lost without high-fidelity 3D point clouds, as in methods based on Neural Radiance Fields (NeRF). These methods are limited by the insufficient capacity to maintain consistency across partial observations. To address this, recent works have utilized contrastive loss or cross-view association pre-processing for view consensus. In contrast to them, we present Cues3D, a compact approach that relies solely on NeRF instead of pre-associations. The core idea is that NeRF's implicit 3D field inherently establishes a globally consistent geometry, enabling effective object distinction without explicit cross-view supervision. We propose a three-phase training framework for NeRF, initialization-disambiguation-refinement, whereby the instance IDs are corrected using the initially-learned knowledge. Additionally, an instance disambiguation method is proposed to match NeRF-rendered 3D masks and ensure globally unique 3D instance identities. With the aid of Cues3D, we obtain highly consistent and unique 3D instance IDs for each object across views with a balanced version of NeRF. Our experiments are conducted on ScanNet v2, ScanNet200, ScanNet++, and Replica datasets for 3D instance, panoptic, and semantic segmentation tasks. Cues3D outperforms other 2D image-based methods and competes with the latest 2D-3D merging based methods, while even surpassing them when using additional 3D point clouds. The code link can be found in the appendix, and the code will be released at https://github.com/mRobotit/Cues3D.
中文摘要:Cues3D是一种基于神经辐射场隐式三维几何的新型开放词汇全景分割方法,无需跨视图监督即可实现全局一致的目标区分,在多个数据集上超越了现有方法。
English Summary: Cues3D is a novel method for open-vocabulary 3D panoptic segmentation that leverages Neural Radiance Field's implicit 3D geometry to achieve globally consistent object distinction without cross-view supervision, outperforming existing approaches across multiple datasets.

Authors:Usman Muhammad, Jorma Laaksonen, Lyudmila Mihaylova
Title: Towards Lightweight Hyperspectral Image Super-Resolution with Depthwise Separable Dilated Convolutional Network
Abstract:
Deep neural networks have demonstrated highly competitive performance in super-resolution (SR) for natural images by learning mappings from low-resolution (LR) to high-resolution (HR) images. However, hyperspectral super-resolution remains an ill-posed problem due to the high spectral dimensionality of the data and the scarcity of available training samples. Moreover, existing methods often rely on large models with a high number of parameters or require the fusion with panchromatic or RGB images, both of which are often impractical in real-world scenarios. Inspired by the MobileNet architecture, we introduce a lightweight depthwise separable dilated convolutional network (DSDCN) to address the aforementioned challenges. Specifically, our model leverages multiple depthwise separable convolutions, similar to the MobileNet architecture, and further incorporates a dilated convolution fusion block to make the model more flexible for the extraction of both spatial and spectral features. In addition, we propose a custom loss function that combines mean squared error (MSE), an L2 norm regularization-based constraint, and a spectral angle-based loss, ensuring the preservation of both spectral and spatial details. The proposed model achieves very competitive performance on two publicly available hyperspectral datasets, making it well-suited for hyperspectral image super-resolution tasks. The source codes are publicly available at: https://github.com/Usman1021/lightweight.
中文: 作者提出了一种轻量级的深度可分离扩张卷积网络(DSDCN)及定制损失函数,有效解决了高光谱超分辨率难题,在公开数据集上取得优异性能的同时保持了模型的高效性。
English: The authors propose a lightweight depthwise separable dilated convolutional network (DSDCN) with a custom loss function to effectively address hyperspectral super-resolution challenges, achieving competitive performance on public datasets while maintaining model efficiency.
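
The loss described in the abstract above combines a pixel-wise error with a spectral-angle term. The following is a minimal sketch of that combination, not the authors' implementation: the weight `sam_weight`, the function names, and the list-of-spectra data layout are assumptions made for illustration, and the L2 regularization term is left out because the abstract does not say what it is applied to.

```python
import math

def mse(pred, target):
    """Mean squared error over all pixels and bands."""
    n = sum(len(p) for p in pred)
    return sum((a - b) ** 2
               for p, t in zip(pred, target)
               for a, b in zip(p, t)) / n

def spectral_angle(p, t):
    """Angle in radians between two spectra (0 means identical direction)."""
    dot = sum(a * b for a, b in zip(p, t))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
    if norm == 0.0:
        return 0.0  # assumption: treat an all-zero spectrum as contributing no angle
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos)

def combined_loss(pred, target, sam_weight=0.1):
    """MSE plus a weighted mean spectral angle; sam_weight is hypothetical."""
    sam = sum(spectral_angle(p, t) for p, t in zip(pred, target)) / len(pred)
    return mse(pred, target) + sam_weight * sam
```

Here `pred` and `target` are lists of per-pixel spectra. The spectral-angle term penalizes a change in the shape of a spectrum even when the per-band error is small, which is why it is commonly paired with MSE for hyperspectral data.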

Authors:Yu-Hsiang Lan, Eric K. Oermann
Title: Gateformer: Advancing Multivariate Time Series Forecasting through Temporal and Variate-Wise Attention with Gated Representations
Abstract:
There has been a recent surge of interest in time series modeling using the Transformer architecture. However, forecasting multivariate time series with Transformer presents a unique challenge as it requires modeling both temporal (cross-time) and variate (cross-variate) dependencies. While Transformer-based models have gained popularity for their flexibility in capturing both sequential and cross-variate relationships, it is unclear how to best integrate these two sources of information in the context of the Transformer architecture while optimizing for both performance and efficiency. We re-purpose the Transformer architecture to effectively model both cross-time and cross-variate dependencies. Our approach begins by embedding each variate independently into a variate-wise representation that captures its cross-time dynamics, and then models cross-variate dependencies through attention mechanisms on these learned embeddings. Gating operations in both cross-time and cross-variate modeling phases regulate information flow, allowing the model to focus on the most relevant features for accurate predictions. Our method achieves state-of-the-art performance across 13 real-world datasets and can be seamlessly integrated into other Transformer-based and LLM-based forecasters, delivering performance improvements up to 20.7% over original models. Code is available at this repository: https://github.com/nyuolab/Gateformer.
中文摘要:本研究提出了一种基于Transformer的新方法,通过专门的嵌入和门控机制有效建模多元时间序列中的时间和变量间依赖关系,在多个数据集上实现了最先进的预测性能。
English Summary: This study introduces a novel Transformer-based approach that effectively models both temporal and cross-variate dependencies in multivariate time series forecasting through specialized embedding and gating mechanisms, achieving state-of-the-art performance across multiple datasets.

Authors:Zhijie Qiao, Haowei Li, Zhong Cao, Henry X. Liu
Title: LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
Abstract:
Vision-Language Models (VLMs) have demonstrated significant potential for end-to-end autonomous driving. However, the field still lacks a practical platform that enables dynamic model updates, rapid validation, fair comparison, and intuitive performance assessment. To that end, we introduce LightEMMA, a Lightweight End-to-End Multimodal Model for Autonomous driving. LightEMMA provides a unified, VLM-based autonomous driving framework without ad hoc customizations, enabling easy integration with evolving state-of-the-art commercial and open-source models. We construct twelve autonomous driving agents using various VLMs and evaluate their performance on the challenging nuScenes prediction task, comprehensively assessing computational metrics and providing critical insights. Illustrative examples show that, although VLMs exhibit strong scenario interpretation capabilities, their practical performance in autonomous driving tasks remains a concern. Additionally, increased model complexity and extended reasoning do not necessarily lead to better performance, emphasizing the need for further improvements and task-specific designs. The code is available at https://github.com/michigan-traffic-lab/LightEMMA.
中文: LightEMMA是一个轻量级的端到端自动驾驶多模态模型,提供了动态更新与评估的统一框架;尽管当前视觉语言模型展现出强大的场景理解能力,但其实际驾驶性能仍需针对性的改进和优化。
English: LightEMMA is a lightweight, end-to-end multimodal model for autonomous driving that offers a unified framework for dynamic updates and evaluation. Although current VLMs show promising scenario interpretation, their practical driving performance still requires task-specific improvements.

Authors:Shingo Higashiguchi, Yasuko Matsubara, Koki Kawabata, Taichi Murayama, Yasushi Sakurai
Title: D-Tracker: Modeling Interest Diffusion in Social Activity Tensor Data Streams
Abstract:
Large quantities of social activity data, such as weekly web search volumes and the number of new infections with infectious diseases, reflect people's interests and activities. It is important to discover temporal patterns from such data and to forecast future activities accurately. However, modeling and forecasting social activity data streams is difficult because they are high-dimensional and composed of multiple time-varying dynamics such as trends, seasonality, and interest diffusion. In this paper, we propose D-Tracker, a method for continuously capturing time-varying temporal patterns within social activity tensor data streams and forecasting future activities. Our proposed method has the following properties: (a) Interpretable: it incorporates the partial differential equation into a tensor decomposition framework and captures time-varying temporal patterns such as trends, seasonality, and interest diffusion between locations in an interpretable manner; (b) Automatic: it has no hyperparameters and continuously models tensor data streams fully automatically; (c) Scalable: the computation time of D-Tracker is independent of the time series length. Experiments using web search volume data obtained from Google Trends and COVID-19 infection data obtained from the COVID-19 Open Data Repository show that our method can achieve higher forecasting accuracy in less computation time than existing methods while extracting the interest diffusion between locations. Our source code and datasets are available at https://github.com/Higashiguchi-Shingo/D-Tracker.
中文: D-Tracker是一种可解释、自动且可扩展的方法,用于建模和预测高维社会活动数据流,在捕捉趋势和兴趣传播等时变模式方面实现了更高的准确性和效率。
English: D-Tracker is an interpretable, automatic, and scalable method for modeling and forecasting high-dimensional social activity data streams, achieving superior accuracy and efficiency in capturing time-varying patterns like trends and interest diffusion.

Authors:Xiaoxia Xu, Xidong Mu, Zhaolin Wang, Yuanwei Liu, Arumugam Nallanathan
Title: Pinching-Antenna Systems (PASS): Power Radiation Model and Optimal Beamforming Design
Abstract:
Pinching-antenna systems (PASS) improve wireless links by configuring the locations of activated pinching antennas along dielectric waveguides, namely pinching beamforming. In this paper, a novel adjustable power radiation model is proposed for PASS, where power radiation ratios of pinching antennas can be flexibly controlled by tuning the spacing between pinching antennas and waveguides. A closed-form pinching antenna spacing arrangement strategy is derived to achieve the commonly assumed equal-power radiation. Based on this, a practical PASS framework relying on discrete activation is considered, where pinching antennas can only be activated among a set of predefined locations. A transmit power minimization problem is formulated, which jointly optimizes the transmit beamforming, pinching beamforming, and the numbers of activated pinching antennas, subject to each user's minimum rate requirement. (1) To solve the resulting highly coupled mixed-integer nonlinear programming (MINLP) problem, branch-and-bound (BnB)-based algorithms are proposed for both single-user and multi-user scenarios, which are guaranteed to converge to globally optimal solutions. (2) A low-complexity many-to-many matching algorithm is further developed. Combined with the Karush-Kuhn-Tucker (KKT) theory, locally optimal and pairwise-stable solutions are obtained within polynomial-time complexity. Simulation results demonstrate that: (i) PASS significantly outperforms conventional multi-antenna architectures, particularly when the number of users and the spatial range increase; and (ii) the proposed matching-based algorithm achieves near-optimal performance, resulting in only a slight performance loss while significantly reducing computational overheads. Code is available at https://github.com/xiaoxiaxusummer/PASS_Discrete
中文摘要:本文提出了一种新型可调功率辐射模型用于夹持天线系统,通过分支定界算法和低复杂度匹配算法解决优化问题,相比传统多天线架构展现出显著性能优势。
English Summary: This paper introduces a novel adjustable power radiation model for pinching-antenna systems (PASS) and proposes both globally optimal branch-and-bound algorithms and a low-complexity matching algorithm to solve the resulting optimization problem, demonstrating significant performance improvements over conventional multi-antenna systems.

Authors:Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, Qingyun Wu
Title: Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Abstract:
Failure attribution in LLM multi-agent systems, that is, identifying the agent and step responsible for task failures, provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution
中文摘要:本文针对大语言模型多智能体系统提出了自动化故障归因方法,通过Who&When数据集验证发现现有方法效果有限(最佳代理识别准确率53.5%,步骤定位仅14.2%),凸显该领域研究任重道远。
English Summary: This paper introduces automated failure attribution for LLM multi-agent systems, proposing the Who&When dataset and benchmarking methods that reveal significant challenges with current models achieving only 14.2-53.5% accuracy.
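
The two figures quoted above (agent-level and step-level accuracy) can be computed from paired predictions and annotations. The sketch below assumes a hypothetical record layout with `agent` and `step` keys, which is not the actual Who&When schema, and it assumes a step prediction counts as correct when the predicted index equals the annotated decisive error step.

```python
def attribution_accuracy(predictions, labels):
    """Return (agent_accuracy, step_accuracy) over paired failure logs.

    Each record is a dict with hypothetical keys "agent" and "step".
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    pairs = list(zip(predictions, labels))
    agent_hits = sum(p["agent"] == g["agent"] for p, g in pairs)
    step_hits = sum(p["step"] == g["step"] for p, g in pairs)
    return agent_hits / len(pairs), step_hits / len(pairs)


preds = [
    {"agent": "coder", "step": 3},
    {"agent": "planner", "step": 1},
    {"agent": "coder", "step": 5},
    {"agent": "tester", "step": 2},
]
golds = [
    {"agent": "coder", "step": 3},
    {"agent": "planner", "step": 2},
    {"agent": "tester", "step": 5},
    {"agent": "tester", "step": 0},
]
agent_acc, step_acc = attribution_accuracy(preds, golds)
```

Step-level accuracy is the stricter measure because a log usually contains far more steps than agents, which is consistent with the gap between the two reported numbers.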

Authors:Filipp Nikitin, Ian Dunn, David Ryan Koes, Olexandr Isayev
Title: GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation
Abstract:
Deep generative models have shown significant promise in generating valid 3D molecular structures, with the GEOM-Drugs dataset serving as a key benchmark. However, current evaluation protocols suffer from critical flaws, including incorrect valency definitions, bugs in bond order calculations, and reliance on force fields inconsistent with the reference data. In this work, we revisit GEOM-Drugs and propose a corrected evaluation framework: we identify and fix issues in data preprocessing, construct chemically accurate valency tables, and introduce a GFN2-xTB-based geometry and energy benchmark. We retrain and re-evaluate several leading models under this framework, providing updated performance metrics and practical recommendations for future benchmarking. Our results underscore the need for chemically rigorous evaluation practices in 3D molecular generation. Our recommended evaluation methods and GEOM-Drugs processing scripts are available at https://github.com/isayevlab/geom-drugs-3dgen-evaluation.
中文: 本研究通过提出包含精确化合价定义和基于GFN2-xTB基准的校正框架,解决了当前三维分子生成评估方法的缺陷,为化学严谨性评估提供了更新指标和实践建议。
English: This study addresses flaws in current evaluation protocols for 3D molecular generation by proposing a corrected framework with accurate valency definitions and GFN2-xTB-based benchmarks, providing updated metrics and recommendations for chemically rigorous assessments.

Authors:Zheng Zhang, Jinyi Li, Yihuai Lan, Xiang Wang, Hao Wang
Title: An Empirical Study on Prompt Compression for Large Language Models
Abstract:
Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression methods for LLMs, aiming to reduce prompt length while maintaining LLM response quality. In this paper, we present a comprehensive analysis covering aspects such as generation performance, model hallucinations, efficacy in multimodal tasks, word omission analysis, and more. We evaluate these methods across 13 datasets, including news, scientific articles, commonsense QA, math QA, long-context QA, and VQA datasets. Our experiments reveal that prompt compression has a greater impact on LLM performance in long contexts compared to short ones. In the LongBench evaluation, moderate compression even enhances LLM performance. Our code and data are available at https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression.
中文: 本研究评估了六种大型语言模型的提示压缩方法,发现它们能在保持响应质量的同时有效降低计算成本,且适度压缩在长上下文任务中甚至能提升性能。
English: This study evaluates six prompt compression methods for large language models, finding that they effectively reduce computational costs while maintaining response quality, with moderate compression even improving performance in long-context tasks.

Authors:Vinit K. Chavan
Title: Manifold-Constrained Sentence Embeddings via Triplet Loss: Projecting Semantics onto Spheres, Tori, and Möbius Strips
Abstract:
Recent advances in representation learning have emphasized the role of embedding geometry in capturing semantic structure. Traditional sentence embeddings typically reside in unconstrained Euclidean spaces, which may limit their ability to reflect complex relationships in language. In this work, we introduce a novel framework that constrains sentence embeddings to lie on continuous manifolds -- specifically the unit sphere, torus, and Möbius strip -- using triplet loss as the core training objective. By enforcing differential geometric constraints on the output space, our approach encourages the learning of embeddings that are both discriminative and topologically structured. We evaluate our method on benchmark datasets (AG News and MBTI) and compare it to classical baselines including TF-IDF, Word2Vec, and unconstrained Keras-derived embeddings. Our results demonstrate that manifold-constrained embeddings, particularly those projected onto spheres and Möbius strips, significantly outperform traditional approaches in both clustering quality (Silhouette Score) and classification performance (Accuracy). These findings highlight the value of embedding in manifold space -- where topological structure complements semantic separation -- offering a new and mathematically grounded direction for geometric representation learning in NLP.
中文: 本研究提出了一种流形约束的句子嵌入框架,将嵌入投影到球面和莫比乌斯带等几何表面上,在聚类和分类任务中均展现出优于传统欧几里得方法的性能。
English: This study introduces a manifold-constrained sentence embedding framework that projects embeddings onto geometric surfaces like spheres and Möbius strips, demonstrating superior performance over traditional Euclidean methods in both clustering and classification tasks.
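
To make the sphere-constrained case concrete, the sketch below projects raw vectors onto the unit sphere by L2 normalization and then evaluates a standard triplet loss on the projected points. It is an illustration under assumptions: the margin value, the Euclidean distance, and the function names are chosen here and are not taken from the paper, and the torus and Möbius strip parameterizations are not covered.

```python
import math

def project_to_sphere(v):
    """Scale a vector to unit L2 norm so it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector")
    return [x / norm for x in v]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(a, p) - d(a, n) + margin) on sphere-projected embeddings."""
    a, p, n = (project_to_sphere(v) for v in (anchor, positive, negative))
    return max(0.0, euclidean(a, p) - euclidean(a, n) + margin)
```

The loss is zero once the negative sits at least `margin` farther from the anchor than the positive does. After projection only direction matters, so `[1, 0]` and `[2, 0]` are treated as the same point.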

Authors:Hao Chen, Yukun Yan, Sen Mei, Wanxiang Che, Zhenghao Liu, Qi Shi, Xinze Li, Yuchun Fan, Pengcheng Huang, Qiushi Xiong, Zhiyuan Liu, Maosong Sun
Title: ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge to improve factuality. However, existing RAG systems frequently underutilize the retrieved documents, failing to extract and integrate the key clues needed to support faithful and interpretable reasoning, especially in cases where relevant evidence is implicit, scattered, or obscured by noise. To address this issue, we propose ClueAnchor, a novel framework for enhancing RAG via clue-anchored reasoning exploration and optimization. ClueAnchor extracts key clues from retrieved content and generates multiple reasoning paths based on different knowledge configurations, optimizing the model by selecting the most effective one through reward-based preference optimization. Experiments show that ClueAnchor significantly outperforms prior RAG baselines in reasoning completeness and robustness. Further analysis confirms its strong resilience to noisy or partially relevant retrieved content, as well as its capability to identify supporting evidence even in the absence of explicit clue supervision during inference.
中文:ClueAnchor通过从检索内容中提取关键线索并基于奖励优化选择推理路径,显著提升了推理完整性和鲁棒性,优于现有RAG方法。
English: ClueAnchor enhances Retrieval-Augmented Generation by extracting key clues from retrieved documents and optimizing reasoning paths through reward-based selection, significantly improving reasoning completeness and robustness over prior methods.

Authors:Yilong Li, Chen Qian, Yu Xia, Ruijie Shi, Yufan Dang, Zihao Xie, Ziming You, Weize Chen, Cheng Yang, Weichuan Liu, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun
Title: Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration
Abstract:
Large Language Model-based multi-agent systems (MAS) have shown remarkable progress in solving complex tasks through collaborative reasoning and inter-agent critique. However, existing approaches typically treat each task in isolation, resulting in redundant computations and limited generalization across structurally similar tasks. To address this, we introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation. We model the task-solving workflow on a graph-structured multi-agent collaboration network, where agents propagate information and coordinate via explicit connectivity. During the experiential learning phase, we quantify the quality of each step in the task-solving workflow and store the resulting rewards along with the corresponding inputs and outputs into each agent's individual experience pool. During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step, thereby enabling more accurate and efficient multi-agent collaboration. Experimental results on diverse datasets demonstrate that MAEL empowers agents to learn from prior task experiences effectively, achieving faster convergence and producing higher-quality solutions on current tasks.
中文摘要:MAEL框架通过图结构多智能体协作网络实现跨任务经验学习,让智能体在推理时检索高价值历史经验作为示例,从而显著提升多智能体系统的协作效率和解决方案质量。
English Summary: The MAEL framework enables LLM-based multi-agent systems to accumulate and reuse cross-task experiences through graph-structured collaboration, enhancing reasoning efficiency and solution quality by leveraging high-reward historical examples during inference.
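
The retrieval step described above (selecting high-reward, task-relevant experiences as few-shot examples) might look roughly like the following. This is a sketch under stated assumptions rather than the MAEL implementation: the `Experience` record, the reward threshold `min_reward`, and the word-overlap (Jaccard) similarity are all hypothetical stand-ins, since the abstract does not specify how relevance is measured.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    task_input: str   # input the agent saw at this step
    output: str       # what the agent produced
    reward: float     # quality score assigned during experiential learning

def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a stand-in for a real relevance model."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def retrieve(pool, query, k=2, min_reward=0.5):
    """Keep experiences with reward >= min_reward, then rank by similarity to query."""
    candidates = [e for e in pool if e.reward >= min_reward]
    ranked = sorted(candidates,
                    key=lambda e: jaccard(e.task_input, query),
                    reverse=True)
    return ranked[:k]


pool = [
    Experience("sort a list of integers", "use sorted()", 0.9),
    Experience("sort a list of strings", "use sorted(key=str.lower)", 0.2),
    Experience("parse a json file", "use json.load", 0.8),
]
examples = retrieve(pool, "sort a list of floats", k=2)
```

Filtering by reward before ranking by relevance keeps a closely matching but low-quality experience (the second entry here) out of the few-shot prompt.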

Authors:Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan, Shuo Wang, Zhiyuan Liu, Yu Gu, Minghe Yu, Ge Yu, Maosong Sun
Title: Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning
Abstract:
Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge during generation. Existing MRAG methods typically adopt a static retrieval pipeline that fetches relevant information from multiple Knowledge Bases (KBs), followed by a refinement step. However, these approaches overlook the reasoning and planning capabilities of MLLMs to dynamically determine how to interact with different KBs during the reasoning process. To address this limitation, we propose R1-Router, a novel MRAG framework that learns to decide when and where to retrieve knowledge based on the evolving reasoning state. Specifically, R1-Router can generate follow-up queries according to the current reasoning step, routing these intermediate queries to the most suitable KB, and integrating external knowledge into a coherent reasoning trajectory to answer the original query. Furthermore, we introduce Step-wise Group Relative Policy Optimization (Step-GRPO), a tailored reinforcement learning algorithm that assigns step-specific rewards to optimize the reasoning behavior of MLLMs. Experimental results on various open-domain QA benchmarks across multiple modalities demonstrate that R1-Router outperforms baseline models by over 7%. Further analysis shows that R1-Router can adaptively and effectively leverage diverse KBs, reducing unnecessary retrievals and improving both efficiency and accuracy.
中文: 提出的R1-Router框架通过动态决定推理过程中何时及从何处检索知识,增强了多模态检索增强生成,通过自适应利用多样化知识库,性能超越基线模型超过7%。
English: The proposed R1-Router framework enhances multimodal retrieval-augmented generation by dynamically determining when and where to retrieve knowledge during reasoning, outperforming baseline models by over 7% through adaptive use of diverse knowledge bases.

Authors:Rennai Qiu, Chen Qian, Ran Li, Yufan Dang, Weize Chen, Cheng Yang, Yingli Zhang, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun
Title: Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development
Abstract:
Recent advancements in Large Language Models (LLMs) and autonomous agents have demonstrated remarkable capabilities across various domains. However, standalone agents frequently encounter limitations when handling complex tasks that demand extensive interactions and substantial computational resources. Although Multi-Agent Systems (MAS) alleviate some of these limitations through collaborative mechanisms like task decomposition, iterative communication, and role specialization, they typically remain resource-unaware, incurring significant inefficiencies due to high token consumption and excessive execution time. To address these limitations, we propose a resource-aware multi-agent system -- Co-Saving (meaning that multiple agents collaboratively engage in resource-saving activities), which leverages experiential knowledge to enhance operational efficiency and solution quality. Our key innovation is the introduction of "shortcuts" -- instructional transitions learned from historically successful trajectories -- which allow the system to bypass redundant reasoning agents and expedite the collective problem-solving process. Experiments on software development tasks demonstrate significant advantages over existing methods. Specifically, compared to the state-of-the-art MAS ChatDev, our method achieves an average reduction of 50.85% in token usage, and improves the overall code quality by 10.06%.
Chinese: 针对大型语言模型和自主代理在复杂任务中的效率问题,我们提出了资源感知的多代理系统Co-Saving,通过利用经验性“捷径”减少冗余推理,相比ChatDev实现了50.85%的令牌使用量降低和10.06%的代码质量提升。
English: Recent advancements in LLMs and autonomous agents face efficiency challenges in complex tasks, leading to the proposal of Co-Saving, a resource-aware multi-agent system that uses experiential "shortcuts" to reduce token usage by 50.85% and improve code quality by 10.06% compared to ChatDev.

Authors:Xiaorong Wang, Ting Yang, Zhu Zhang, Shuo Wang, Zihan Zhou, Liner Yang, Zhiyuan Liu, Maosong Sun
Title: Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning
Abstract:
Assessing the quality of long-form, model-generated text is challenging, even with advanced LLM-as-a-Judge methods, due to performance degradation as input length increases. To address this issue, we propose a divide-and-conquer approach, which breaks down the comprehensive evaluation task into a series of localized scoring tasks, followed by a final global assessment. This strategy allows for more granular and manageable evaluations, ensuring that each segment of the text is assessed in isolation for both coherence and quality, while also accounting for the overall structure and consistency of the entire piece. Moreover, we introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. By incorporating human-generated feedback directly into the evaluation process, this method allows the model to better align with human judgment. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation, thereby reducing annotation costs in practical scenarios. Experimental results show that the proposed evaluation framework outperforms several representative baselines, highlighting the effectiveness of our approach.
中文: 为解决长文本生成质量评估的难题,我们提出一种分治策略,将评估任务分解为局部评分与全局评估,结合人工反馈和主动学习算法以降低标注成本,实验证明该方法优于现有基准。
English: To overcome the limitations of evaluating long model-generated texts, we introduce a divide-and-conquer strategy that segments the text for localized and global assessments, enhanced by human feedback and an active learning algorithm to reduce annotation costs, achieving superior results over existing methods.

Authors:Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zihao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun
Title: Multi-Agent Collaboration via Evolving Orchestration
Abstract:
Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator ("puppeteer") dynamically directs agents ("puppets") in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator's evolution.
中文摘要:该研究提出了一种提线木偶式多智能体协作范式,通过强化学习训练的中控协调器动态调度语言模型智能体,在提升任务性能的同时显著降低了计算成本。
English Summary: The proposed puppeteer-style paradigm uses a reinforcement learning-trained orchestrator to dynamically direct multiple large language model agents, achieving superior performance and computational efficiency through adaptive collective reasoning.

Authors:Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, Maosong Sun
Title: The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training
Abstract:
Recent large language models (LLMs) exhibit impressive reasoning but often over-think, generating excessively long responses that hinder efficiency. We introduce DIET (DIfficulty-AwarE Training), a framework that systematically cuts these "token calories" by integrating on-the-fly problem difficulty into the reinforcement learning (RL) process. DIET dynamically adapts token compression strategies by modulating token penalty strength and conditioning target lengths on estimated task difficulty, to optimize the performance-efficiency trade-off. We also theoretically analyze the pitfalls of naive reward weighting in group-normalized RL algorithms like GRPO, and propose the Advantage Weighting technique, which enables stable and effective implementation of these difficulty-aware objectives. Experimental results demonstrate that DIET significantly reduces token counts while simultaneously improving reasoning performance. Beyond raw token reduction, we show two crucial benefits largely overlooked by prior work: (1) DIET leads to superior inference scaling. By maintaining high per-sample quality with fewer tokens, it enables better scaling performance via majority voting with more samples under fixed computational budgets, an area where other methods falter. (2) DIET enhances the natural positive correlation between response length and problem difficulty, ensuring verbosity is appropriately allocated, unlike many existing compression methods that disrupt this relationship. Our analyses provide a principled and effective framework for developing more efficient, practical, and high-performing LLMs.
中文: DIET框架通过难度感知训练,根据任务难度动态调整压缩策略,有效减少大型语言模型的令牌使用,在提升推理性能的同时保持响应长度与问题难度的自然关联。
English: The DIET framework introduces difficulty-aware training to reduce token usage in large language models by dynamically adjusting compression strategies based on task difficulty, improving both efficiency and reasoning performance while maintaining a natural length-difficulty relationship.
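The pitfall of naive reward weighting under group normalization can be illustrated with a small sketch. This is one plausible reading of the abstract, not the authors' implementation: the function names, the difficulty estimate (the group's failure rate), and the `alpha_max` parameter are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_normalize(values, eps=1e-8):
    """GRPO-style normalization: zero mean, unit std within one prompt's sample group."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / (s + eps) for v in values]

def difficulty_aware_weight(correct, alpha_max=0.5):
    """Hypothetical rule: treat the group's failure rate as difficulty and
    penalize length more strongly on easier prompts."""
    difficulty = 1.0 - mean(correct)
    return alpha_max * (1.0 - difficulty)

def naive_reward_weighting(correct, lengths, alpha):
    """Weight the length penalty inside the reward, then normalize.
    Normalization rescales the result, so alpha can lose its effect."""
    rewards = [c - alpha * n for c, n in zip(correct, lengths)]
    return group_normalize(rewards)

def advantage_weighting(correct, lengths, alpha):
    """Normalize each signal separately, then weight at the advantage level,
    so alpha directly controls the strength of the length term."""
    adv_correct = group_normalize(correct)
    adv_length = group_normalize([-n for n in lengths])
    return [ac + alpha * al for ac, al in zip(adv_correct, adv_length)]

# An easy prompt: every sampled response is correct, only lengths differ.
correct = [1, 1, 1, 1]
lengths = [100, 200, 300, 400]
alpha = difficulty_aware_weight(correct)

naive_small = naive_reward_weighting(correct, lengths, 0.001)
naive_large = naive_reward_weighting(correct, lengths, 0.1)
weighted = advantage_weighting(correct, lengths, alpha)
```

In this toy group, `naive_small` and `naive_large` come out practically identical because the normalization divides away the penalty scale, while the advantage-level weighting scales linearly with `alpha`.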

Authors:Haotian Chen, Zijun Song, Boye Niu, Ke Zhang, Litu Ou, Yaxi Lu, Zhong Zhang, Xin Cong, Yankai Lin, Zhiyuan Liu, Maosong Sun
Title: ToLeaP: Rethinking Development of Tool Learning with Large Language Models
Abstract:
Tool learning, which enables large language models (LLMs) to utilize external tools effectively, has garnered increasing attention for its potential to revolutionize productivity across industries. Despite rapid development in tool learning, key challenges and opportunities remain understudied, limiting deeper insights and future advancements. In this paper, we investigate the tool learning ability of 41 prevalent LLMs by reproducing 33 benchmarks and enabling one-click evaluation for seven of them, forming a Tool Learning Platform named ToLeaP. We also collect 21 out of 33 potential training datasets to facilitate future exploration. After analyzing over 3,000 bad cases of 41 LLMs based on ToLeaP, we identify four main critical challenges: (1) benchmark limitations induce both the neglect and lack of (2) autonomous learning, (3) generalization, and (4) long-horizon task-solving capabilities of LLMs. To aid future advancements, we take a step further toward exploring potential directions, namely (1) real-world benchmark construction, (2) compatibility-aware autonomous learning, (3) rationale learning by thinking, and (4) identifying and recalling key clues. The preliminary experiments demonstrate their effectiveness, highlighting the need for further research and exploration.
中文摘要:工具学习使大语言模型能有效利用外部工具提升生产力,但基准测试的局限性阻碍了其自主学习与泛化能力,为此开发了ToLeaP平台进行评估并探索未来研究方向。
English Summary: Tool learning enables LLMs to use external tools for productivity gains, but challenges like benchmark limitations hinder autonomous learning and generalization, prompting the development of ToLeaP for evaluation and future research directions.

Authors:Ziheng Jia, Zicheng Zhang, Zeyu Zhang, Yingji Liang, Xiaorong Zhu, Chunyi Li, Jinliang Han, Haoning Wu, Bin Wang, Haoran Zhang, Guanyu Zhu, Qiyong Zhao, Xiaohong Liu, Guangtao Zhai, Xiongkuo Min
Title: Scaling-up Perceptual Video Quality Assessment
Abstract:
The data scaling law has been shown to significantly enhance the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of the scaling law remains largely unexplored due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose OmniVQA, a framework designed to efficiently build high-quality, human-in-the-loop VQA multi-modal instruction databases (MIDBs). We then scale up to create OmniVQA-Chat-400K, the largest MIDB in the VQA field to date. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Additionally, we have built the OmniVQA-MOS-20K dataset to enhance the model's quantitative quality rating capabilities. We then introduce a complementary training strategy that effectively leverages the knowledge from datasets for quality understanding and quality rating tasks. Furthermore, we propose the OmniVQA-FG (fine-grain)-Benchmark to evaluate the fine-grained performance of the models. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks.
中文: 本研究提出了OmniVQA框架,用于构建视频质量评估领域的大规模多模态指令数据库,并通过互补训练策略和专门基准测试证明了其在质量理解与评分任务中的领先性能。
English: The study introduces OmniVQA, a framework for creating large-scale multi-modal instruction databases in video quality assessment, and demonstrates its state-of-the-art performance through complementary training and specialized benchmarks.

Authors:Juntong Wang, Jiarui Wang, Huiyu Duan, Guangtao Zhai, Xiongkuo Min
Title: TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs
Abstract:
Text-driven video editing is rapidly advancing, yet its rigorous evaluation remains challenging due to the absence of dedicated video quality assessment (VQA) models capable of discerning the nuances of editing quality. To address this critical gap, we introduce TDVE-DB, a large-scale benchmark dataset for text-driven video editing. TDVE-DB consists of 3,857 edited videos generated from 12 diverse models across 8 editing categories, and is annotated with 173,565 human subjective ratings along three crucial dimensions, i.e., edited video quality, editing alignment, and structural consistency. Based on TDVE-DB, we first conduct a comprehensive evaluation for the 12 state-of-the-art editing models revealing the strengths and weaknesses of current video techniques, and then benchmark existing VQA methods in the context of text-driven video editing evaluation. Building on these insights, we propose TDVE-Assessor, a novel VQA model specifically designed for text-driven video editing assessment. TDVE-Assessor integrates both spatial and temporal video features into a large language model (LLM) for rich contextual understanding to provide comprehensive quality assessment. Extensive experiments demonstrate that TDVE-Assessor substantially outperforms existing VQA models on TDVE-DB across all three evaluation dimensions, setting a new state-of-the-art. Both TDVE-DB and TDVE-Assessor will be released upon the publication.
中文: 针对文本驱动视频编辑缺乏专用质量评估模型的问题,本研究提出了包含人工标注的大规模基准数据集TDVE-DB,并开发了融合时空特征与大语言模型的新型评估模型TDVE-Assessor,该模型在多个评估维度上均实现了最优性能。
English: To address the lack of dedicated video quality assessment models for text-driven video editing, this study introduces TDVE-DB, a large-scale benchmark dataset with human annotations, and proposes TDVE-Assessor, a novel VQA model that integrates spatial-temporal features with a large language model, achieving state-of-the-art performance across multiple evaluation dimensions.

Authors:Shuhao Han, Haotian Fan, Fangyuan Kong, Wenjie Liao, Chunle Guo, Chongyi Li, Radu Timofte, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Jianhui Sun, Xinli Yue, Tianyi Wang, Huan Hou, Junda Lu, Xinyang Huang, Zitang Zhou, Zijian Zhang, Xuhui Zheng, Xuecheng Wu, Chong Peng, Xuezhi Cao, Trong-Hieu Nguyen-Mau, Minh-Hoang Le, Minh-Khoa Le-Phan, Duy-Nam Ly, Hai-Dang Nguyen, Minh-Triet Tran, Yukang Lin, Yan Hong, Chuanbiao Song, Siyuan Li, Jun Lan, Zhichao Zhang, Xinyue Li, Wei Sun, Zicheng Zhang, Yunhao Li, Xiaohong Liu, Guangtao Zhai, Zitong Xu, Huiyu Duan, Jiarui Wang, Guangji Ma, Liu Yang, Lu Liu, Qiang Hu, Xiongkuo Min, Zichuan Wang, Zhenchen Tang, Bo Peng, Jing Dong, Fengbin Guan, Zihao Yu, Yiting Lu, Wei Luo, Xin Li, Minhao Lin, Haofeng Chen, Xuanxuan He, Kele Xu, Qisheng Xu, Zijian Gao, Tianjiao Wan, Bo-Cheng Qiu, Chih-Chung Hsu, Chia-ming Lee, Yu-Fan Lin, Bo Yu, Zehao Wang, Da Mu, Mingxiu Chen, Junkang Fang, Huamei Sun, Wending Zhao, Zhiyu Wang, Wang Liu, Weikang Yu, Puhong Duan, Bin Sun, Xudong Kang, Shutao Li, Shuai He, Lingzhi Fu, Heng Cong, Rongyu Zhang, Jiarong He, Zhishan Qiao, Yongqing Huang, Zewen Chen, Zhe Pang, Juan Wang, Jian Guo, Zhizhuo Shao, Ziyu Feng, Bing Li, Weiming Hu, Hesong Li, Dehua Liu, Zeming Liu, Qingsong Xie, Ruichen Wang, Zhihao Li, Yuqi Liang, Jianqi Bi, Jun Luo, Junfeng Yang, Can Li, Jing Fu, Hongwei Xu, Mingrui Long, Lulin Tang
Title: NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment
Abstract:
This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structural track. The alignment track uses the EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions are received in the development phase, and 507 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses the EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion masks. A total of 211 participants have registered in the structure track. A total of 1,155 submissions are received in the development phase, and 487 submissions are received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on T2I model quality assessment.
中文: NTIRE 2025挑战赛通过图像对齐和结构检测双赛道评估文本生成图像模型,吸引了大量参赛者,其成果超越基线方法,优胜方案在质量评估方面表现出卓越性能。
English: The NTIRE 2025 challenge evaluates text-to-image generation models through alignment and structural tracks, with numerous participants achieving results surpassing baseline methods and top performers demonstrating superior quality assessment capabilities.

Authors:Yixuan Gao, Xiongkuo Min, Guangtao Zhai
Title: Exploring Image Quality Assessment from a New Perspective: Pupil Size
Abstract:
This paper explores how the image quality assessment (IQA) task affects the cognitive processes of people from the perspective of pupil size and studies the relationship between pupil size and image quality. Specifically, we first invited subjects to participate in a subjective experiment, which includes two tasks: free observation and IQA. In the free observation task, subjects did not need to perform any action, and they only needed to observe images as they usually do with an album. In the IQA task, subjects were required to score images according to their overall impression of image quality. Then, by analyzing the difference in pupil size between the two tasks, we find that people may activate the visual attention mechanism when evaluating image quality. Meanwhile, we also find that the change in pupil size is closely related to image quality in the IQA task. For future research on IQA, this research can not only provide a theoretical basis for the objective IQA method and promote the development of more effective objective IQA methods, but also provide a new subjective IQA method for collecting the authentic subjective impression of image quality.
中文: 本研究通过瞳孔大小分析探讨图像质量评估任务如何影响认知过程,发现此类任务会激活视觉注意力机制,并表明瞳孔变化与感知图像质量密切相关。
English: This study investigates how image quality assessment tasks influence cognitive processes through pupil size analysis, revealing that such tasks activate visual attention mechanisms and showing a strong correlation between pupil size changes and perceived image quality.

Authors:Linhan Cao, Wei Sun, Kaiwei Zhang, Yicong Peng, Guangtao Zhai, Xiongkuo Min
Title: Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision
Abstract:
Video quality assessment (VQA) is essential for quantifying perceptual quality in various video processing workflows, spanning from camera capture systems to over-the-top streaming platforms. While recent supervised VQA models have made substantial progress, the reliance on manually annotated datasets -- a process that is labor-intensive, costly, and difficult to scale up -- has hindered further optimization of their generalization to unseen video content and distortions. To bridge this gap, we introduce a self-supervised learning framework for VQA to learn quality assessment capabilities from large-scale, unlabeled web videos. Our approach leverages a learning-to-rank paradigm to train a large multimodal model (LMM) on video pairs automatically labeled in two ways: quality pseudo-labeling by existing VQA models and relative quality ranking based on synthetic distortion simulations. Furthermore, we introduce a novel iterative self-improvement training strategy, where the trained model acts as an improved annotator to iteratively refine the annotation quality of training data. By training on a dataset 10x larger than the existing VQA benchmarks, our model: (1) achieves zero-shot performance on in-domain VQA benchmarks that matches or surpasses supervised models; (2) demonstrates superior out-of-distribution (OOD) generalization across diverse video content and distortions; and (3) sets a new state-of-the-art when fine-tuned on human-labeled datasets. Extensive experimental results validate the effectiveness of our self-supervised approach in training generalized VQA models. The datasets and code will be publicly released to facilitate future research.
Chinese: 本文提出了一种用于视频质量评估的自监督学习框架,通过排序学习和迭代自改进策略在大规模未标记视频上训练,实现了最先进的性能并具有卓越的泛化能力。
English: This paper presents a self-supervised learning framework for video quality assessment that trains on large-scale unlabeled videos using learning-to-rank and iterative self-improvement, achieving state-of-the-art performance with superior generalization.
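The learning-to-rank idea in this entry can be sketched with a pairwise logistic (Bradley-Terry style) loss. This is a generic illustration rather than the paper's training objective: the abstract does not specify the loss, and the helper names and the "lower distortion level means better quality" labeling rule are assumptions.

```python
import math

def label_from_distortion(level_a, level_b):
    """Hypothetical labeling rule for synthetic pairs: the clip with the
    milder distortion level is treated as the better one."""
    return level_a < level_b

def pairwise_rank_loss(score_a, score_b, a_is_better):
    """Logistic ranking loss on predicted quality scores.
    Small when the preferred clip already scores higher, large otherwise."""
    margin = score_a - score_b if a_is_better else score_b - score_a
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Clip A has distortion level 1, clip B has level 3, so A is labeled better.
a_is_better = label_from_distortion(1, 3)
loss_agree = pairwise_rank_loss(2.0, 0.0, a_is_better)     # model agrees with label
loss_disagree = pairwise_rank_loss(0.0, 2.0, a_is_better)  # model disagrees
```

Only the relative order of the two clips is supervised, which is why such pairs can be produced automatically without human opinion scores.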

Authors:Guangyuan Liu, Yinqiu Liu, Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong, Abbas Jamalipour
Title: Context-Aware Semantic Communication for the Wireless Networks
Abstract:
In next-generation wireless networks, supporting real-time applications such as augmented reality, autonomous driving, and immersive Metaverse services demands stringent constraints on bandwidth, latency, and reliability. Existing semantic communication (SemCom) approaches typically rely on static models, overlooking dynamic conditions and contextual cues vital for efficient transmission. To address these challenges, we propose CaSemCom, a context-aware SemCom framework that leverages a Large Language Model (LLM)-based gating mechanism and a Mixture of Experts (MoE) architecture to adaptively select and encode only high-impact semantic features across multiple data modalities. Our multimodal, multi-user case study demonstrates that CaSemCom significantly improves reconstructed image fidelity while reducing bandwidth usage, outperforming single-agent deep reinforcement learning (DRL) methods and traditional baselines in convergence speed, semantic accuracy, and retransmission overhead.
中文:提出的CaSemCom框架通过基于大语言模型的动态选通与专家混合架构,在多模态场景下实现了比现有方法更优的带宽效率和重建质量,显著提升了语义通信性能。
English: The proposed CaSemCom framework enhances semantic communication by dynamically selecting key features using LLM-based gating and MoE architecture, achieving superior bandwidth efficiency and reconstruction quality in multimodal scenarios compared to existing methods.

Authors:Lingyi Cai, Ruichen Zhang, Changyuan Zhao, Yu Zhang, Jiawen Kang, Dusit Niyato, Tao Jiang, Xuemin Shen
Title: Large Language Model-enhanced Reinforcement Learning for Low-Altitude Economy Networking
Abstract:
Low-Altitude Economic Networking (LAENet) aims to support diverse flying applications below 1,000 meters by deploying various aerial vehicles for flexible and cost-effective aerial networking. However, complex decision-making, resource constraints, and environmental uncertainty pose significant challenges to the development of the LAENet. Reinforcement learning (RL) offers a potential solution in response to these challenges but has limitations in generalization, reward design, and model stability. The emergence of large language models (LLMs) offers new opportunities for RL to mitigate these limitations. In this paper, we first present a tutorial about integrating LLMs into RL by using the capacities of generation, contextual understanding, and structured reasoning of LLMs. We then propose an LLM-enhanced RL framework for the LAENet in which the LLM serves as information processor, reward designer, decision-maker, and generator. Moreover, we conduct a case study by using LLMs to design a reward function to improve the learning performance of RL in the LAENet. Finally, we provide a conclusion and discuss future work.
中文: 本文提出了一种利用大型语言模型(LLMs)增强强化学习(RL)的框架,以解决低空经济网络(LAENet)中决策复杂和资源受限等挑战,通过发挥LLMs在生成、推理和奖励设计方面的能力来优化RL性能。
English: This paper introduces a framework that leverages large language models (LLMs) to enhance reinforcement learning (RL) for addressing challenges in Low-Altitude Economic Networking (LAENet), such as decision-making and resource constraints, by utilizing LLMs' capabilities in generation, reasoning, and reward design.

Authors:Minrui Xu, Jiani Fan, Xinyu Huang, Conghao Zhou, Jiawen Kang, Dusit Niyato, Shiwen Mao, Zhu Han, Xuemin Shen, Kwok-Yan Lam
Title: Forewarned is Forearmed: A Survey on Large Language Model-based Agents in Autonomous Cyberattacks
Abstract:
With the continuous evolution of Large Language Models (LLMs), LLM-based agents have advanced beyond passive chatbots to become autonomous cyber entities capable of performing complex tasks, including web browsing, malicious code and deceptive content generation, and decision-making. By significantly reducing the time, expertise, and resources, AI-assisted cyberattacks orchestrated by LLM-based agents have led to a phenomenon termed Cyber Threat Inflation, characterized by a significant reduction in attack costs and a tremendous increase in attack scale. To provide actionable defensive insights, in this survey, we focus on the potential cyber threats posed by LLM-based agents across diverse network systems. Firstly, we present the capabilities of LLM-based cyberattack agents, which include executing autonomous attack strategies, comprising scouting, memory, reasoning, and action, and facilitating collaborative operations with other agents or human operators. Building on these capabilities, we examine common cyberattacks initiated by LLM-based agents and compare their effectiveness across different types of networks, including static, mobile, and infrastructure-free paradigms. Moreover, we analyze threat bottlenecks of LLM-based agents across different network infrastructures and review their defense methods. Due to operational imbalances, existing defense methods are inadequate against autonomous cyberattacks. Finally, we outline future research directions and potential defensive strategies for legacy network systems.
中文摘要:基于大语言模型的智能体已发展为能执行复杂网络攻击的自主实体,导致攻击成本降低和规模扩大的网络威胁膨胀现象,而现有防御措施对此类自主攻击仍显不足。
English Summary: LLM-based agents have evolved into autonomous entities capable of executing complex cyberattacks, leading to Cyber Threat Inflation through reduced costs and increased scale, while current defenses remain insufficient against these threats.

Authors:Yingkai Kang, Jiawen Kang, Jinbo Wen, Tao Zhang, Zhaohui Yang, Dusit Niyato, Yan Zhang
Title: Confidence-Regulated Generative Diffusion Models for Reliable AI Agent Migration in Vehicular Metaverses
Abstract:
Vehicular metaverses are an emerging paradigm that merges intelligent transportation systems with virtual spaces, leveraging advanced digital twin and Artificial Intelligence (AI) technologies to seamlessly integrate vehicles, users, and digital environments. In this paradigm, vehicular AI agents are endowed with environment perception, decision-making, and action execution capabilities, enabling real-time processing and analysis of multi-modal data to provide users with customized interactive services. Since vehicular AI agents require substantial resources for real-time decision-making, given vehicle mobility and network dynamics conditions, the AI agents are deployed in RoadSide Units (RSUs) with sufficient resources and dynamically migrated among them. However, AI agent migration requires frequent data exchanges, which may expose vehicular metaverses to potential cyber attacks. To this end, we propose a reliable vehicular AI agent migration framework, achieving reliable dynamic migration and efficient resource scheduling through cooperation between vehicles and RSUs. Additionally, we design a trust evaluation model based on the theory of planned behavior to dynamically quantify the reputation of RSUs, thereby better accommodating the personalized trust preferences of users. We then model the vehicular AI agent migration process as a partially observable Markov decision process and develop a Confidence-regulated Generative Diffusion Model (CGDM) to efficiently generate AI agent migration decisions. Numerical results demonstrate that the CGDM algorithm significantly outperforms baseline methods in reducing system latency and enhancing robustness against cyber attacks.
Chinese Summary: 本文提出了一种可靠的车载AI代理迁移框架,通过信任评估模型和生成扩散算法,在车载元宇宙中实现安全高效的动态迁移,显著降低了系统延迟并增强了抵御网络攻击的能力。
English Summary: The paper introduces a reliable vehicular AI agent migration framework that uses a trust evaluation model and a generative diffusion algorithm to enhance security and efficiency in vehicular metaverses, significantly reducing latency and improving resilience against cyber threats.

Authors:Peng Yin, Wentao Liang, Jinbo Wen, Jiawen Kang, Junlong Chen, Dusit Niyato
Title: Multi-Agent DRL for Multi-Objective Twin Migration Routing with Workload Prediction in 6G-enabled IoV
Abstract:
Sixth Generation (6G)-enabled Internet of Vehicles (IoV) facilitates efficient data synchronization through ultra-fast bandwidth and high-density connectivity, enabling the emergence of Vehicle Twins (VTs). As highly accurate replicas of vehicles, VTs can support intelligent vehicular applications for occupants in 6G-enabled IoV. Thanks to the full coverage capability of 6G, resource-constrained vehicles can offload VTs to edge servers, such as roadside units, unmanned aerial vehicles, and satellites, utilizing their computing and storage resources for VT construction and updates. However, communication between vehicles and edge servers with limited coverage is prone to interruptions due to the dynamic mobility of vehicles. Consequently, VTs must be migrated among edge servers to maintain uninterrupted and high-quality services for users. In this paper, we introduce a VT migration framework in 6G-enabled IoV. Specifically, we first propose a Long Short-Term Memory (LSTM)-based Transformer model to accurately predict long-term workloads of edge servers for migration decision-making. Then, we propose a Dynamic Mask Multi-Agent Proximal Policy Optimization (DM-MAPPO) algorithm to identify optimal migration routes in the highly complex environment of 6G-enabled IoV. Finally, we develop a practical platform to validate the effectiveness of the proposed scheme using real datasets. Simulation results demonstrate that the proposed DM-MAPPO algorithm significantly reduces migration latency by 20.82% and packet loss by 75.07% compared with traditional deep reinforcement learning algorithms.
中文: 本文提出了一种6G车联网中的车辆数字孪生迁移框架,采用基于长短期记忆的Transformer模型预测边缘服务器负载,并通过动态掩码多智能体近端策略优化算法确定最优迁移路径,大幅降低了延迟和数据包丢失率。
English: The paper presents a Vehicle Twin migration framework for 6G-enabled Internet of Vehicles, utilizing an LSTM-based Transformer for workload prediction and a DM-MAPPO algorithm to optimize migration routes, achieving significant reductions in latency and packet loss.

Authors:Yuntao Wang, Shaolong Guo, Yanghe Pan, Zhou Su, Fahao Chen, Tom H. Luan, Peng Li, Jiawen Kang, Dusit Niyato
Title: Internet of Agents: Fundamentals, Applications, and Challenges
Abstract:
With the rapid proliferation of large language models and vision-language models, AI agents have evolved from isolated, task-specific systems into autonomous, interactive entities capable of perceiving, reasoning, and acting without human intervention. As these agents proliferate across virtual and physical environments, from virtual assistants to embodied robots, the need for a unified, agent-centric infrastructure becomes paramount. In this survey, we introduce the Internet of Agents (IoA) as a foundational framework that enables seamless interconnection, dynamic discovery, and collaborative orchestration among heterogeneous agents at scale. We begin by presenting a general IoA architecture, highlighting its hierarchical organization, distinguishing features relative to the traditional Internet, and emerging applications. Next, we analyze the key operational enablers of IoA, including capability notification and discovery, adaptive communication protocols, dynamic task matching, consensus and conflict-resolution mechanisms, and incentive models. Finally, we identify open research directions toward building resilient and trustworthy IoA ecosystems.
中文: 本文提出“智能体互联网”(IoA)作为统一框架,通过分层架构、动态任务匹配等运行机制实现异构智能体的互联协作,并探讨了构建可信生态系统的未来研究方向。
English: The Internet of Agents (IoA) is proposed as a unified framework to interconnect diverse AI agents, enabling their collaboration through hierarchical architecture, operational enablers like dynamic task matching, and addressing future challenges for trustworthy ecosystems.

Authors:Yilun Kong, Guozheng Ma, Qi Zhao, Haoyu Wang, Li Shen, Xueqian Wang, Dacheng Tao
Title: Mastering Massive Multi-Task Reinforcement Learning via Mixture-of-Expert Decision Transformer
Abstract:
Although recent advancements in offline multi-task reinforcement learning (MTRL) have harnessed the powerful capabilities of the Transformer architecture, most approaches focus on a limited number of tasks, with scaling to extremely massive tasks remaining a formidable challenge. In this paper, we first revisit the key impact of task numbers on current MTRL methods, and further reveal that naively expanding the parameters proves insufficient to counteract the performance degradation as the number of tasks escalates. Building upon these insights, we propose M3DT, a novel mixture-of-experts (MoE) framework that tackles task scalability by further unlocking the model's parameter scalability. Specifically, we enhance both the architecture and the optimization of the agent, where we strengthen the Decision Transformer (DT) backbone with MoE to reduce task load on parameter subsets, and introduce a three-stage training mechanism to facilitate efficient training with optimal performance. Experimental results show that, by increasing the number of experts, M3DT not only consistently improves as the model expands on a fixed number of tasks, but also exhibits remarkable task scalability, successfully extending to 160 tasks with superior performance.
Chinese: 针对离线多任务强化学习难以扩展至大量任务的挑战,本文提出M3DT专家混合框架,通过增强决策变换器结构和优化训练机制,成功将任务扩展至160项并显著提升性能。
English: Recent offline multi-task reinforcement learning struggles with scaling to numerous tasks, prompting the development of M3DT, a mixture-of-experts framework that enhances model scalability and performance across up to 160 tasks through architectural and optimization improvements.

Authors:Jifeng Hu, Sili Huang, Siyuan Guo, Zhaogeng Liu, Li Shen, Lichao Sun, Hechang Chen, Yi Chang, Dacheng Tao
Title: Decision Flow Policy Optimization
Abstract:
In recent years, generative models have shown remarkable capabilities across diverse fields, including images, videos, language, and decision-making. By applying powerful generative models such as flow-based models to reinforcement learning, we can effectively model complex multi-modal action distributions and achieve superior robotic control in continuous action spaces, surpassing the limitations of single-modal action distributions with traditional Gaussian-based policies. Previous methods usually adopt the generative models as behavior models to fit state-conditioned action distributions from datasets, with policy optimization conducted separately through additional policies using value-based sample weighting or gradient-based updates. However, this separation prevents the simultaneous optimization of multi-modal distribution fitting and policy improvement, ultimately hindering the training of models and degrading the performance. To address this issue, we propose Decision Flow, a unified framework that integrates multi-modal action distribution modeling and policy optimization. Specifically, our method formulates the action generation procedure of flow-based models as a flow decision-making process, where each action generation step corresponds to one flow decision. Consequently, our method seamlessly optimizes the flow policy while capturing multi-modal action distributions. We provide rigorous proofs of Decision Flow and validate the effectiveness through extensive experiments across dozens of offline RL environments. Compared with established offline RL baselines, the results demonstrate that our method achieves or matches the SOTA performance.
中文摘要:本文提出决策流框架,通过将基于流的模型动作生成过程构建为流决策过程,统一优化多模态动作分布建模与策略改进,在离线强化学习环境中实现与最先进方法相当或更优的性能。
English Summary: This paper introduces Decision Flow, a unified framework that integrates multi-modal action distribution modeling with policy optimization using flow-based models, achieving state-of-the-art performance in offline reinforcement learning by simultaneously optimizing distribution fitting and policy improvement.

Authors:Rong-Cheng Tu, Wenhao Sun, Hanzhe You, Yingjie Wang, Jiaxing Huang, Li Shen, Dacheng Tao
Title: Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval
Abstract:
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query, consisting of a reference image and a modifying text, without relying on annotated training data. Existing approaches often generate a synthetic target text using large language models (LLMs) to serve as an intermediate anchor between the compositional query and the target image. Models are then trained to align the compositional query with the generated text, and separately align images with their corresponding texts using contrastive learning. However, this reliance on intermediate text introduces error propagation, as inaccuracies in query-to-text and text-to-image mappings accumulate, ultimately degrading retrieval performance. To address these problems, we propose a novel framework by employing a Multimodal Reasoning Agent (MRA) for ZS-CIR. MRA eliminates the dependence on textual intermediaries by directly constructing triplets using only unlabeled image data. By training on these synthetic triplets, our model learns to capture the relationships between compositional queries and candidate images directly. Extensive experiments on three standard CIR benchmarks demonstrate the effectiveness of our approach. On the FashionIQ dataset, our method improves Average R@10 by at least 7.5% over existing baselines; on CIRR, it boosts R@1 by 9.6%; and on CIRCO, it increases mAP@5 by 9.5%.
Chinese: 本文提出了一种用于零样本组合图像检索的多模态推理代理框架,通过直接合成图像-文本三元组进行训练,避免了依赖文本中介导致的误差传播,在多个基准测试中实现了显著的性能提升。
English: This paper introduces a Multimodal Reasoning Agent (MRA) framework for Zero-Shot Composed Image Retrieval, which bypasses error-prone text intermediaries by directly synthesizing image-text triplets for training, achieving significant performance improvements across multiple benchmarks.

Authors:Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, Dacheng Tao
Title: OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging
Abstract:
Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Recently, Multimodal LLMs (MLLMs) that extend LLMs through large-scale multimodal training have gained traction. However, a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation is still lacking. In this paper, (i) we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, studying both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward the Omni-language model. (ii) We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48%. (iii) We find that model merging offers a promising way for building improved MLLMs without requiring training data. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.
Chinese: 基础模型因训练资源密集更新缓慢,而模型融合通过整合专家模型提升能力并降低成本;本文针对多模态大语言模型提出新基准及去噪融合方法,实现性能平均提升2.48%,证明多模态互补优势。
English: Foundation models update slowly due to high training costs, while model merging combines expert models to enhance capabilities and reduce costs, with this paper introducing a new benchmark for multimodal LLMs and a noise-removing merging method that boosts performance by 2.48%.
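
For readers unfamiliar with task vectors, the following is a minimal sketch of plain task arithmetic, the basic merging scheme that task-vector methods build on. It is not the denoising and optimization procedure proposed in the paper, which the abstract does not spell out. The parameter name `w`, the expert names, and the `scale` value are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def task_vector(base: Weights, expert: Weights) -> Weights:
    """Per-parameter difference between a fine-tuned expert and the shared base."""
    return {
        name: [e - b for e, b in zip(expert[name], base[name])]
        for name in base
    }


def merge_task_arithmetic(base: Weights, experts: List[Weights], scale: float = 1.0) -> Weights:
    """Add the scaled sum of all task vectors back onto the base weights."""
    vectors = [task_vector(base, expert) for expert in experts]
    merged: Weights = {}
    for name, base_values in base.items():
        summed = [sum(v[name][i] for v in vectors) for i in range(len(base_values))]
        merged[name] = [b + scale * s for b, s in zip(base_values, summed)]
    return merged


# Toy weights (hypothetical): one shared base and two experts fine-tuned from it.
base = {"w": [1.0, 2.0]}
ocr_expert = {"w": [1.5, 2.0]}
chart_expert = {"w": [1.0, 3.0]}

merged = merge_task_arithmetic(base, [ocr_expert, chart_expert], scale=0.5)
print(merged)
```

Merging of this kind needs no training data, only the weights of the base model and the experts. This assumes every expert was fine-tuned from the same base checkpoint.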

Authors:Rong-Cheng Tu, Zhao Jin, Jingyi Liao, Xiao Luo, Yingjie Wang, Li Shen, Dacheng Tao
Title: MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval
Abstract:
Existing Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens, which are concatenated with the modifying text and processed by frozen text encoders in pretrained VLMs or LLMs. While this design leverages the strengths of large pretrained models, it only supervises the adapter to produce encoder-compatible tokens that loosely preserve visual semantics. Crucially, it does not directly optimize the composed query representation to capture the full intent of the composition or to align with the target semantics, thereby limiting retrieval performance, particularly in cases involving fine-grained or complex visual transformations. To address this problem, we propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI), a novel approach that leverages a pretrained multimodal large language model (MLLM) to construct two complementary training tasks using only unlabeled images: a target text retrieval task and a text-to-image retrieval task. By jointly optimizing these tasks, our method enables the VLM to inherently acquire robust compositional retrieval capabilities, supported by the provided theoretical justifications and empirical validation. Furthermore, during inference, we prompt the MLLM to generate target texts from composed queries and compute retrieval scores by integrating similarities between (i) the composed query and candidate images, and (ii) the MLLM-generated target text and candidate images. This strategy effectively combines the VLM's semantic alignment strengths with the MLLM's reasoning capabilities.
中文摘要:现有零样本组合图像检索方法通过适配器生成编码器兼容令牌但未直接优化组合意图,而我们提出的MVFT-JI方法利用多模态大语言模型指导的微调和联合推理,通过双训练任务结合视觉语言模型的语义对齐与大语言模型的推理能力来提升检索性能。
English Summary: Existing ZS-CIR methods rely on adapters that produce encoder-compatible tokens without directly optimizing for compositional intent, while our proposed MVFT-JI approach leverages MLLM-guided fine-tuning with dual training tasks and joint inference to enhance retrieval performance by combining VLM alignment with MLLM reasoning.
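
The joint inference step described in the abstract can be illustrated with a small sketch. The abstract only says the two similarities are integrated. The weighted sum with a hypothetical weight `alpha`, the use of cosine similarity, and the toy two-dimensional embeddings below are all assumptions made for illustration, not the paper's exact scoring rule.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def joint_scores(
    query_emb: Sequence[float],
    target_text_emb: Sequence[float],
    candidates: List[Sequence[float]],
    alpha: float = 0.5,  # assumed mixing weight, not taken from the paper
) -> List[float]:
    """Blend (i) composed-query similarity and (ii) generated-text similarity."""
    return [
        alpha * cosine(query_emb, c) + (1.0 - alpha) * cosine(target_text_emb, c)
        for c in candidates
    ]


# Toy embeddings (hypothetical): the composed query and the generated target
# text each capture a different aspect of the desired image.
query_emb = [1.0, 0.0]
target_text_emb = [0.0, 1.0]
candidates = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

scores = joint_scores(query_emb, target_text_emb, candidates)
ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
print(ranking[0])
```

In this toy setup the candidate that agrees with both signals ranks first, which is the intuition behind combining the two similarity terms.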

Authors:Junyu Luo, Yusheng Zhao, Xiao Luo, Zhiping Xiao, Wei Ju, Li Shen, Dacheng Tao, Ming Zhang
Title: Cross-Domain Diffusion with Progressive Alignment for Efficient Adaptive Retrieval
Abstract:
Unsupervised efficient domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, while maintaining low storage cost and high retrieval efficiency. However, existing methods typically fail to address potential noise in the target domain, and directly align high-level features across domains, thus resulting in suboptimal retrieval performance. To address these challenges, we propose a novel Cross-Domain Diffusion with Progressive Alignment method (COUPLE). This approach revisits unsupervised efficient domain adaptive retrieval from a graph diffusion perspective, simulating cross-domain adaptation dynamics to achieve a stable target domain adaptation process. First, we construct a cross-domain relationship graph and leverage noise-robust graph flow diffusion to simulate the transfer dynamics from the source domain to the target domain, identifying lower noise clusters. We then leverage the graph diffusion results for discriminative hash code learning, effectively learning from the target domain while reducing the negative impact of noise. Furthermore, we employ a hierarchical Mixup operation for progressive domain alignment, which is performed along the cross-domain random walk paths. Utilizing target domain discriminative hash learning and progressive domain alignment, COUPLE enables effective domain adaptive hash learning. Extensive experiments demonstrate COUPLE's effectiveness on competitive benchmarks.
中文摘要:提出的COUPLE方法通过图扩散技术降低噪声影响,并采用渐进式对齐实现有效哈希学习,在基准测试中展现出优越的无监督跨域检索性能。
English Summary: The proposed COUPLE method tackles unsupervised domain adaptive retrieval by using graph diffusion to reduce noise impact and progressive alignment for effective hash learning, achieving superior performance on benchmarks.

Authors:Jifeng Hu, Sili Huang, Zhejian Yang, Shengchao Hu, Li Shen, Hechang Chen, Lichao Sun, Yi Chang, Dacheng Tao
Title: Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning
Abstract:
Conditional decision generation with diffusion models has proven highly competitive in reinforcement learning (RL). Recent studies reveal the relation between energy-function-guidance diffusion models and constrained RL problems. The main challenge lies in estimating the intermediate energy, which is intractable due to the log-expectation formulation during the generation process. To address this issue, we propose the Analytic Energy-guided Policy Optimization (AEPO). Specifically, we first provide a theoretical analysis and the closed-form solution of the intermediate guidance when the diffusion model obeys the conditional Gaussian transformation. Then, we analyze the posterior Gaussian distribution in the log-expectation formulation and obtain the target estimation of the log-expectation under mild assumptions. Finally, we train an intermediate energy neural network to approach the target estimation of the log-expectation formulation. We apply our method to 30+ offline RL tasks to demonstrate its effectiveness. Extensive experiments illustrate that our method surpasses numerous representative baselines in D4RL offline reinforcement learning benchmarks.
中文摘要:提出的解析能量引导策略优化(AEPO)方法通过推导闭式解并训练神经网络,解决了强化学习中扩散模型难以估计中间能量的问题,在离线强化学习基准测试中展现出卓越性能。
English Summary: The proposed Analytic Energy-guided Policy Optimization (AEPO) method overcomes the challenge of estimating intractable intermediate energy in diffusion models for reinforcement learning by deriving closed-form solutions and training neural networks, achieving superior performance in offline RL benchmarks.

Authors:Yida Xue, Zhen Bi, Jinnan Yang, Jungang Lou, Huajun Chen, Ningyu Zhang
Title: Spatial Knowledge Graph-Guided Multimodal Synthesis
Abstract:
Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, which is subsequently utilized to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.
Chinese: SKG2Data提出了一种新颖的多模态合成方法,通过空间知识图谱引导生成符合空间常识的数据,有效提升了多模态大语言模型的空间感知与推理能力。
English: SKG2Data introduces a novel multimodal synthesis method guided by spatial knowledge graphs to enhance MLLMs' spatial perception and reasoning by generating data that adheres to spatial common sense.

Authors:Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, Ningyu Zhang
Title: Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms
Abstract:
Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks owing to the nontrivial issue of locating atomic knowledge components. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis reveals that steering exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply the steering strategy to the large reasoning model, confirming its effectiveness in precise reasoning control.
中文: 本文提出的引导目标原子(STA)方法通过分离和操控解耦的知识组件,有效提升了语言模型的安全性和可控性,在对抗场景中展现出卓越的鲁棒性,并实现了精确的推理控制。
English: The proposed Steering Target Atoms (STA) method effectively isolates and manipulates disentangled knowledge components to enhance language model safety and control, demonstrating superior robustness in adversarial scenarios and precise reasoning control.

Authors:Da Zheng, Lun Du, Junwei Su, Yuchen Tian, Yuqi Zhu, Jintian Zhang, Lanning Wei, Ningyu Zhang, Huajun Chen
Title: Knowledge Augmented Complex Problem Solving with Large Language Models: A Survey
Abstract:
Problem-solving has been a fundamental driver of human progress in numerous domains. With advancements in artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools capable of tackling complex problems across diverse domains. Unlike traditional computational systems, LLMs combine raw computational power with an approximation of human reasoning, allowing them to generate solutions, make inferences, and even leverage external computational tools. However, applying LLMs to real-world problem-solving presents significant challenges, including multi-step reasoning, domain knowledge integration, and result verification. This survey explores the capabilities and limitations of LLMs in complex problem-solving, examining techniques including Chain-of-Thought (CoT) reasoning, knowledge augmentation, and various LLM-based and tool-based verification techniques. Additionally, we highlight domain-specific challenges in areas such as software engineering, mathematical reasoning and proving, data analysis and modeling, and scientific research. The paper further discusses the fundamental limitations of current LLM solutions and the future directions of LLM-based complex problem solving from the perspective of multi-step reasoning, domain knowledge integration, and result verification.
中文: 本综述探讨了大语言模型如何通过思维链推理和知识增强等技术解决复杂问题,同时分析了在跨领域多步推理和结果验证方面面临的挑战。
English: This survey examines how Large Language Models (LLMs) tackle complex problem-solving through techniques like Chain-of-Thought reasoning and knowledge augmentation, while addressing challenges in multi-step reasoning and result verification across various domains.

Authors:Akashah Shabbir, Muhammad Akhtar Munir, Akshay Dudhane, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan, Salman Khan
Title: ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks
Abstract:
Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications such as urban planning, disaster assessment and change analysis, environmental monitoring, transportation analysis, aviation monitoring, recreational infrastructure, and industrial site analysis. Each query is grounded in satellite or aerial imagery and requires agents to reason through a diverse toolset. We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs (e.g., GPT-4o, Qwen2.5) on 436 structured agentic tasks. The benchmark reports both step-wise execution metrics and final answer correctness. Our analysis reveals notable disparities in tool accuracy and planning consistency across models. ThinkGeo provides the first extensive testbed for evaluating how tool-enabled LLMs handle spatial reasoning in remote sensing. Our code and dataset are publicly available.
中文: ThinkGeo推出了首个针对遥感领域的工具增强型大语言模型代理评估基准,通过多步骤规划和多样化工具测试其空间推理能力,揭示了不同模型在复杂任务处理中的显著性能差异。
English: ThinkGeo introduces the first comprehensive benchmark for evaluating tool-augmented LLM agents in remote sensing, testing their spatial reasoning through multi-step planning with diverse tools across real-world applications and revealing significant performance gaps among models.

Authors:Akashah Shabbir, Muhammad Akhtar Munir, Akshay Dudhane, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan, Salman Khan
Title: ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks
Abstract:
Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications such as urban planning, disaster assessment and change analysis, environmental monitoring, transportation analysis, aviation monitoring, recreational infrastructure, and industrial site analysis. Queries are grounded in satellite or aerial imagery, including both optical RGB and SAR data, and require agents to reason through a diverse toolset. We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs (e.g., GPT-4o, Qwen2.5) on 486 structured agentic tasks with 1,773 expert-verified reasoning steps. The benchmark reports both step-wise execution metrics and final answer correctness. Our analysis reveals notable disparities in tool accuracy and planning consistency across models. ThinkGeo provides the first extensive testbed for evaluating how tool-enabled LLMs handle spatial reasoning in remote sensing.
中文: ThinkGeo推出了首个针对遥感领域的工具增强型大语言模型代理评估基准,通过多步骤规划和多样化工具测试其空间推理能力,揭示了不同模型在复杂任务处理中的显著性能差异。
English: ThinkGeo introduces the first comprehensive benchmark for evaluating tool-augmented LLM agents in remote sensing, testing their spatial reasoning through multi-step planning with diverse tools across real-world applications and revealing significant performance gaps among models.

Authors:Xiang Li, Haiyang Yu, Xinghua Zhang, Ziyang Huang, Shizhu He, Kang Liu, Jun Zhao, Fei Huang, Yongbin Li
Title: Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns
Abstract:
Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, potentially suffering from errors under various reasoning patterns. Therefore, PRMs are required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly focus on evaluating PRMs with stepwise correctness, ignoring a systematic evaluation of PRMs under various reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns, including Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in conducting evaluations on reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.
中文: 过程奖励模型(PRMs)在复杂推理任务中验证中间步骤至关重要,然而现有基准缺乏对不同推理模式的系统评估,因此引入Socratic-PRMBench以填补这一空白并揭示当前PRMs的关键缺陷。
English: Process Reward Models (PRMs) are essential for verifying intermediate steps in complex reasoning tasks, yet current benchmarks lack systematic evaluation across diverse reasoning patterns, prompting the introduction of Socratic-PRMBench to address this gap and reveal critical PRM deficiencies.

Authors:Guangyuan Liu, Yinqiu Liu, Ruichen Zhang, Hongyang Du, Dusit Niyato, Zehui Xiong, Sumei Sun, Abbas Jamalipour
Title: Wireless Agentic AI with Retrieval-Augmented Multimodal Semantic Perception
Abstract:
The rapid development of multimodal AI and Large Language Models (LLMs) has greatly enhanced real-time interaction, decision-making, and collaborative tasks. However, in wireless multi-agent scenarios, limited bandwidth poses significant challenges to exchanging semantically rich multimodal information efficiently. Traditional semantic communication methods, though effective, struggle with redundancy and loss of crucial details. To overcome these challenges, we propose a Retrieval-Augmented Multimodal Semantic Communication (RAMSemCom) framework. RAMSemCom incorporates iterative, retrieval-driven semantic refinement tailored for distributed multi-agent environments, enabling efficient exchange of critical multimodal elements through local caching and selective transmission. Our approach dynamically optimizes retrieval using deep reinforcement learning (DRL) to balance semantic fidelity with bandwidth constraints. A comprehensive case study on multi-agent autonomous driving demonstrates that our DRL-based retrieval strategy significantly improves task completion efficiency and reduces communication overhead compared to baseline methods.
中文:RAMSemCom框架通过检索增强的多模态语义通信结合深度强化学习,在分布式多智能体系统中优化带宽使用并保持语义保真度,自动驾驶案例研究表明其显著提升了任务效率并降低了通信开销。
English: The RAMSemCom framework introduces retrieval-augmented multimodal semantic communication with deep reinforcement learning to optimize bandwidth usage while preserving semantic fidelity in distributed multi-agent systems, as demonstrated by improved efficiency in autonomous driving scenarios.

Authors:Xudong Wang, Jian Zhu, Ruichen Zhang, Lei Feng, Dusit Niyato, Jiacheng Wang, Hongyang Du, Shiwen Mao, Zhu Han
Title: Chain-of-Thought for Large Language Model-empowered Wireless Communications
Abstract:
Recent advances in large language models (LLMs) have opened new possibilities for automated reasoning and decision-making in wireless networks. However, applying LLMs to wireless communications presents challenges such as limited capability in handling complex logic, generalization, and reasoning. Chain-of-Thought (CoT) prompting, which guides LLMs to generate explicit intermediate reasoning steps, has been shown to significantly improve LLM performance on complex tasks. Inspired by this, this paper explores the application potential of CoT-enhanced LLMs in wireless communications. Specifically, we first review the fundamental theory of CoT and summarize various types of CoT. We then survey key CoT and LLM techniques relevant to wireless communication and networking. Moreover, we introduce a multi-layer intent-driven CoT framework that bridges high-level user intent expressed in natural language with concrete wireless control actions. Our proposed framework sequentially parses and clusters intent, selects appropriate CoT reasoning modules via reinforcement learning, then generates interpretable control policies for system configuration. Using the unmanned aerial vehicle (UAV) network as a case study, we demonstrate that the proposed framework significantly outperforms a non-CoT baseline in both communication performance and quality of generated reasoning.
中文:本文提出了一种多层意图驱动的思维链框架,通过增强大语言模型在无线通信中的推理能力,在无人机网络应用中相比非思维链基线展现出更优越的性能。
English: This paper proposes a multi-layer intent-driven Chain-of-Thought framework that enhances large language models' reasoning capabilities for wireless communications, demonstrating superior performance in UAV network applications compared to non-CoT baselines.

Authors:Pengfei Cao, Tianyi Men, Wencan Liu, Jingwen Zhang, Xuzhao Li, Xixun Lin, Dianbo Sui, Yanan Cao, Kang Liu, Jun Zhao
Title: Large Language Models for Planning: A Comprehensive and Systematic Survey
Abstract:
Planning represents a fundamental capability of intelligent agents, requiring comprehensive environmental understanding, rigorous logical reasoning, and effective sequential decision-making. While Large Language Models (LLMs) have demonstrated remarkable performance on certain planning tasks, their broader application in this domain warrants systematic investigation. This paper presents a comprehensive review of LLM-based planning. Specifically, this survey is structured as follows: First, we establish the theoretical foundations by introducing essential definitions and categories about automated planning. Next, we provide a detailed taxonomy and analysis of contemporary LLM-based planning methodologies, categorizing them into three principal approaches: 1) External Module Augmented Methods that combine LLMs with additional components for planning, 2) Finetuning-based Methods that involve using trajectory data and feedback signals to adjust LLMs in order to improve their planning abilities, and 3) Searching-based Methods that break down complex tasks into simpler components, navigate the planning space, or enhance decoding strategies to find the best solutions. Subsequently, we systematically summarize existing evaluation frameworks, including benchmark datasets, evaluation metrics and performance comparisons between representative planning methods. Finally, we discuss the underlying mechanisms enabling LLM-based planning and outline promising research directions for this rapidly evolving field. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this field.
中文: 本文系统综述了大语言模型在规划任务中的应用,将方法分为外部模块增强、微调与搜索三类,并总结了评估框架及未来研究方向。
English: This paper provides a comprehensive survey of LLM-based planning, categorizing methods into external module augmentation, finetuning, and search-based approaches while summarizing evaluation frameworks and future research directions.

Authors:Jianxiang Zang, Meiling Ning, Yongda Wei, Shihan Dou, Jiazheng Zhang, Nijia Mo, Binhong Li, Tao Gui, Qi Zhang, Xuanjing Huang
Title: Compression Hacking: A Supplementary Perspective on Informatics Properties of Language Models from Geometric Distortion
Abstract:
Recently, the concept of "compression as intelligence" has provided a novel informatics metric perspective for language models (LMs), emphasizing that highly structured representations signify the intelligence level of LMs. However, from a geometric standpoint, the word representation space of highly compressed LMs tends to degenerate into a highly anisotropic state, which hinders the LM's ability to comprehend instructions and directly impacts its performance. We found this compression-anisotropy synchronicity is essentially the "Compression Hacking" in LM representations, where noise-dominated directions tend to create the illusion of high compression rates by sacrificing spatial uniformity. Based on this, we propose three refined compression metrics by incorporating geometric distortion analysis and integrate them into a self-evaluation pipeline. The refined metrics exhibit strong alignment with the LM's comprehensive capabilities, achieving Spearman correlation coefficients above 0.9, significantly outperforming both the original compression and other internal structure-based metrics. This confirms that compression hacking substantially enhances the informatics interpretation of LMs by incorporating geometric distortion of representations.
中文总结:该研究揭示语言模型中的“压缩黑客”现象通过牺牲空间均匀性产生误导性高压缩率,并提出结合几何畸变的改进指标,这些指标与模型综合能力高度相关。
English Summary: The study reveals that "compression hacking" in language models creates misleading high compression rates by sacrificing spatial uniformity, and proposes refined metrics incorporating geometric distortion that strongly correlate with model performance.
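
The abstract reports agreement between the refined metrics and model capability as a Spearman correlation. As a reminder of what that number measures, here is a minimal sketch that computes Spearman's coefficient as the Pearson correlation of average ranks. The metric and capability values are made-up placeholders, not results from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder data: one metric value and one benchmark score per model.
metric_values = [0.2, 0.5, 0.9, 0.4]
capability_scores = [30.0, 55.0, 80.0, 50.0]
rho = spearman(metric_values, capability_scores)
print(round(rho, 3))
```

Because only ranks enter the computation, a coefficient near 1 means the metric orders models the same way the benchmark does, regardless of the scale of either quantity.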

Authors:Yao Xu, Mingyu Xu, Fangyu Lei, Wangtao Sun, Xiangrong Zeng, Bingning Wang, Guang Liu, Shizhu He, Jun Zhao, Kang Liu
Title: Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN
Abstract:
Recently, models such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable performance on complex reasoning tasks through Long Chain-of-Thought (Long-CoT) reasoning. Although distilling this capability into student models significantly enhances their performance, this paper finds that fine-tuning LLMs with full parameters or LoRA with a low rank on long CoT data often leads to Cyclical Reasoning, where models repeatedly reiterate previous inference steps until the maximum length limit. Further analysis reveals that smaller differences in representations between adjacent tokens correlate with a higher tendency toward Cyclical Reasoning. To mitigate this issue, this paper proposes Shift Feedforward Networks (Shift-FFN), a novel approach that edits the current token's representation with the previous one before inputting it to the FFN. This architecture dynamically amplifies the representation differences between adjacent tokens. Extensive experiments on multiple mathematical reasoning tasks demonstrate that LoRA combined with Shift-FFN achieves higher accuracy and a lower rate of Cyclical Reasoning across various data sizes compared to full fine-tuning and standard LoRA. Our data and code are available at https://anonymous.4open.science/r/Shift-FFN
中文:近期模型通过长思维链在复杂推理中表现卓越,但微调易引发循环推理问题,而Shift-FFN通过增强相邻标记表征差异有效缓解此现象,在多项数学推理任务中显著提升准确率。
English: Recent models excel in complex reasoning via Long Chain-of-Thought, but fine-tuning often causes Cyclical Reasoning, which Shift-FFN mitigates by amplifying token representation differences, improving accuracy across mathematical tasks.

Authors:Enze Liu, Bowen Zheng, Xiaolei Wang, Wayne Xin Zhao, Jinpeng Wang, Sheng Chen, Ji-Rong Wen
Title: LARES: Latent Reasoning for Sequential Recommendation
Abstract:
Sequential recommender systems have become increasingly important in real-world applications that model user behavior sequences to predict their preferences. However, existing sequential recommendation methods predominantly rely on non-reasoning paradigms, which may limit the model's computational capacity and result in suboptimal recommendation performance. To address these limitations, we present LARES, a novel and scalable LAtent REasoning framework for Sequential recommendation that enhances the model's representation capabilities by increasing the computation density of parameters through depth-recurrent latent reasoning. Our proposed approach employs a recurrent architecture that allows flexible expansion of reasoning depth without increasing parameter complexity, thereby effectively capturing dynamic and intricate user interest patterns. A key difference of LARES lies in refining all input tokens at each implicit reasoning step to improve the computation utilization. To fully unlock the model's reasoning potential, we design a two-phase training strategy: (1) Self-supervised pre-training (SPT) with dual alignment objectives; (2) Reinforcement post-training (RPT). During the first phase, we introduce trajectory-level alignment and step-level alignment objectives, which enable the model to learn recommendation-oriented latent reasoning patterns without requiring supplementary annotated data. The subsequent phase utilizes reinforcement learning (RL) to harness the model's exploratory ability, further refining its reasoning capabilities. Comprehensive experiments on real-world benchmarks demonstrate our framework's superior performance. Notably, LARES exhibits seamless compatibility with existing advanced models, further improving their recommendation performance. Our code is available at https://anonymous.4open.science/r/LARES-E458/.
中文摘要:LARES是一种新颖的潜在推理序列推荐框架,通过深度循环潜在推理增强计算密度,并采用两阶段训练策略在不增加参数复杂度的前提下提升推荐性能。
English Summary: LARES is a novel latent reasoning framework for sequential recommendation that enhances computational density through depth-recurrent reasoning and employs a two-phase training strategy to improve recommendation performance without increasing parameter complexity.

Authors:Bowen Zheng, Xiaolei Wang, Enze Liu, Xi Wang, Lu Hongyu, Yu Chen, Wayne Xin Zhao, Ji-Rong Wen
Title: DeepRec: Towards a Deep Dive Into the Item Space with Large Language Model Based Recommendation
Abstract:
Recently, large language models (LLMs) have been introduced into recommender systems (RSs), either to enhance traditional recommendation models (TRMs) or serve as recommendation backbones. However, existing LLM-based RSs often do not fully exploit the complementary advantages of LLMs (e.g., world knowledge and reasoning) and TRMs (e.g., recommendation-specific knowledge and efficiency) to fully explore the item space. To address this, we propose DeepRec, a novel LLM-based RS that enables autonomous multi-turn interactions between LLMs and TRMs for deep exploration of the item space. In each interaction turn, LLMs reason over user preferences and interact with TRMs to retrieve candidate items. After multi-turn interactions, LLMs rank the retrieved items to generate the final recommendations. We adopt reinforcement learning (RL)-based optimization and propose novel designs from three aspects: recommendation model based data rollout, recommendation-oriented hierarchical rewards, and a two-stage RL training strategy. For data rollout, we introduce a preference-aware TRM, with which LLMs interact to construct trajectory data. For rewards, we design a hierarchical reward function that involves both process-level and outcome-level rewards to optimize the interaction process and recommendation performance, respectively. For RL training, we develop a two-stage training strategy, where the first stage aims to guide LLMs to interact with TRMs and the second stage focuses on performance improvement. Experiments on public datasets demonstrate that DeepRec significantly outperforms both traditional and LLM-based baselines, offering a new paradigm for deep exploration in recommendation systems.
中文摘要:DeepRec提出了一种新型推荐系统,通过大语言模型与传统推荐模型之间的自主多轮交互,结合强化学习优化策略,实现了对物品空间的深度探索,显著超越了现有推荐方法的性能。
English Summary: DeepRec introduces a novel recommender system that enables autonomous multi-turn interactions between large language models and traditional recommendation models, using reinforcement learning optimization to deeply explore the item space and significantly outperform existing methods.

Authors:Kaiyu He, Tong Zhou, Yubo Chen, Delai Qiu, Shengping Liu, Kang Liu, Jun Zhao
Title: Semantic Pivots Enable Cross-Lingual Transfer in Large Language Models
Abstract:
Large language models (LLMs) demonstrate remarkable ability in cross-lingual tasks. Understanding how LLMs acquire this ability is crucial for their interpretability. To quantify the cross-lingual ability of LLMs accurately, we propose a Word-Level Cross-Lingual Translation Task. To find how LLMs learn cross-lingual ability, we trace the outputs of LLMs' intermediate layers in the word translation task. We identify and distinguish two distinct behaviors in the forward pass of LLMs: co-occurrence behavior and semantic pivot behavior. We attribute LLMs' two distinct behaviors to the co-occurrence frequency of words and find the semantic pivot from the pre-training dataset. Finally, to apply our findings to improve the cross-lingual ability of LLMs, we reconstruct a semantic pivot-aware pre-training dataset using documents with a high proportion of semantic pivots. Our experiments validate the effectiveness of our approach in enhancing cross-lingual ability. Our research contributes insights into the interpretability of LLMs and offers a method for improving LLMs' cross-lingual ability.
中文: 本研究通过词级翻译任务量化大语言模型的跨语言能力,识别出受词频影响的共现和语义枢纽两种行为,并通过重构高语义枢纽比例的预训练数据集来增强该能力。
English: This study quantifies LLMs' cross-lingual ability through word-level translation tasks, identifying co-occurrence and semantic pivot behaviors influenced by word frequency, and enhances this ability by reconstructing training datasets with high semantic pivot proportions.

Authors:Huanxuan Liao, Wen Hu, Yao Xu, Shizhu He, Jun Zhao, Kang Liu
Title: Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention
Abstract:
Large Language Models (LLMs) encounter significant challenges in long-sequence inference due to computational inefficiency and redundant processing, driving interest in context compression techniques. Existing methods often rely on token importance to perform hard local compression or encode context into latent representations for soft global compression. However, the uneven distribution of textual content relevance and the diversity of demands for user instructions mean these approaches frequently lead to the loss of potentially valuable information. To address this, we propose $\textbf{Hy}$brid $\textbf{Co}$ntext $\textbf{Co}$mpression (HyCo$_2$) for LLMs, which integrates both global and local perspectives to guide context compression while retaining both the essential semantics and critical details for task completion. Specifically, we employ a hybrid adapter to refine global semantics with the global view, based on the observation that different adapters excel at different tasks. Then we incorporate a classification layer that assigns a retention probability to each context token based on the local view, determining whether it should be retained or discarded. To foster a balanced integration of global and local compression, we introduce auxiliary paraphrasing and completion pretraining before instruction tuning. This promotes a synergistic integration that emphasizes instruction-relevant information while preserving essential local details, ultimately balancing local and global information retention in context compression. Experiments show that our HyCo$_2$ method significantly enhances long-text reasoning while reducing token usage. It improves the performance of various LLM series by an average of 13.1% across seven knowledge-intensive QA benchmarks. Moreover, HyCo$_2$ matches the performance of uncompressed methods while reducing token consumption by 88.8%.
中文: 提出的HyCo$_2$方法融合全局与局部压缩策略,在减少88.8%令牌消耗的同时,显著提升大语言模型在知识密集型任务中的推理性能。
English: The proposed HyCo$_2$ method integrates global and local compression to enhance long-text reasoning in LLMs, significantly reducing token usage while improving performance on knowledge-intensive benchmarks.
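The local (hard) side of the compression, where each context token receives a retention probability and is kept or discarded, can be illustrated with a short sketch. The threshold rule, the function names, and the example probabilities are assumptions for illustration only; the abstract does not specify how the probabilities are turned into keep/discard decisions.

```python
from typing import List, Sequence


def retain_tokens(
    tokens: Sequence[str],
    keep_prob: Sequence[float],
    threshold: float = 0.5,  # assumed decision rule, not taken from the paper
) -> List[str]:
    """Keep the tokens whose retention probability reaches the threshold."""
    if len(tokens) != len(keep_prob):
        raise ValueError("tokens and keep_prob must have the same length")
    return [tok for tok, p in zip(tokens, keep_prob) if p >= threshold]


def token_reduction(original_len: int, kept_len: int) -> float:
    """Fraction of tokens removed by the compression."""
    return 1.0 - kept_len / original_len


# Hypothetical classifier outputs for a six-token context.
context = ["The", "capital", "of", "France", "is", "Paris"]
probs = [0.10, 0.90, 0.20, 0.95, 0.30, 0.99]
kept = retain_tokens(context, probs)
```

Here `kept` contains only the three high-probability tokens, so `token_reduction(len(context), len(kept))` is 0.5.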

Authors:Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Title: Game-RL: Synthesizing Verifiable Game Tasks at Scale to Boost VLMs General Reasoning
Abstract:
Real-world vision language reasoning scenarios often include diverse and complex tasks. However, vision language reinforcement learning has primarily focused on a narrow set of tasks (e.g. geometry or chart reasoning), limiting the improvement of Vision Language Models' (VLMs) general reasoning. Therefore, we propose a novel Code2Logic approach, using Large Language Models (LLMs) to synthesize verifiable game reasoning tasks at scale via adapting game code. Using Code2Logic, we developed the GameQA dataset to train and evaluate VLMs. GameQA is verifiable and scalable, offers controllable difficulty gradation, and is diverse, with 30 games and 158 tasks. Then we apply Game-RL, which is simple reinforcement learning on GameQA. Surprisingly, despite training solely on game tasks, VLMs demonstrated out-of-domain generalization; specifically, Qwen2.5-VL-7B improved performance by 2.33% across 7 diverse vision-language benchmarks. Our code, dataset and models are available at the GitHub repository.
中文: 视觉语言强化学习以往局限于狭窄领域,而提出的Game-RL框架利用电子游戏丰富的视觉元素和可验证奖励,通过Code2Logic生成的GameQA数据集增强视觉语言模型的通用推理能力,在多个基准测试中实现了性能提升。
English: Vision-language reinforcement learning has been limited to narrow domains, but the proposed Game-RL framework leverages video games' rich visual elements and verifiable rewards to enhance VLMs' general reasoning ability through the GameQA dataset synthesized by Code2Logic, achieving performance gains across multiple benchmarks.

Authors:Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Zhiheng Xi, Changhao Jiang, Zhangyue Yin, Yining Zheng, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Title: Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning
Abstract:
Vision-language reinforcement learning (RL) has primarily focused on narrow domains (e.g. geometry or chart reasoning). This leaves broader training scenarios and resources underexplored, limiting the exploration and learning of Vision Language Models (VLMs) through RL. We find video games inherently provide rich visual elements and mechanics that are easy to verify. To fully use the multimodal and verifiable reward in video games, we propose Game-RL, constructing diverse game tasks for RL training to boost VLMs' general reasoning ability. To obtain training data, we propose Code2Logic, a novel approach that adapts game code to synthesize game reasoning task data, thus obtaining the GameQA dataset of 30 games and 158 tasks with controllable difficulty gradation. Unexpectedly, RL training solely on GameQA enables multiple VLMs to achieve performance improvements across 7 diverse vision-language benchmarks, demonstrating the value of Game-RL for enhancing VLMs' general reasoning. Furthermore, this suggests that video games may serve as valuable scenarios and resources to boost general reasoning abilities. Our code, dataset and models are available at the GitHub repository.
中文: 视觉语言强化学习以往局限于狭窄领域,而提出的Game-RL框架利用电子游戏丰富的视觉元素和可验证奖励,通过Code2Logic生成的GameQA数据集增强视觉语言模型的通用推理能力,在多个基准测试中实现了性能提升。
English: Vision-language reinforcement learning has been limited to narrow domains, but the proposed Game-RL framework leverages video games' rich visual elements and verifiable rewards to enhance VLMs' general reasoning ability through the GameQA dataset synthesized by Code2Logic, achieving performance gains across multiple benchmarks.

Authors:Jiakuan Xie, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Title: Revealing the Deceptiveness of Knowledge Editing: A Mechanistic Analysis of Superficial Editing
Abstract:
Knowledge editing, which aims to update the knowledge encoded in language models, can be deceptive. Despite the fact that many existing knowledge editing algorithms achieve near-perfect performance on conventional metrics, the models edited by them are still prone to generating original knowledge. This paper introduces the concept of "superficial editing" to describe this phenomenon. Our comprehensive evaluation reveals that this issue presents a significant challenge to existing algorithms. Through systematic investigation, we identify and validate two key factors contributing to this issue: (1) the residual stream at the last subject position in earlier layers and (2) specific attention modules in later layers. Notably, certain attention heads in later layers, along with specific left singular vectors in their output matrices, encapsulate the original knowledge and exhibit a causal relationship with superficial editing. Furthermore, we extend our analysis to the task of superficial unlearning, where we observe consistent patterns in the behavior of specific attention heads and their corresponding left singular vectors, thereby demonstrating the robustness and broader applicability of our methodology and conclusions. Our code is available here.
中文摘要:知识编辑常导致语言模型仅进行表面更新,早期层的残差流和后期层的特定注意力机制使模型仍倾向于生成原始知识,这对现有算法构成重大挑战。
English Summary: Knowledge editing in language models often results in superficial updates, where models revert to original knowledge due to residual streams in early layers and specific attention mechanisms in later layers, revealing significant challenges for current algorithms.

Authors:Ziyang Huang, Wangtao Sun, Jun Zhao, Kang Liu
Title: Improve Rule Retrieval and Reasoning with Self-Induction and Relevance ReEstimate
Abstract:
This paper systematically addresses the challenges of rule retrieval, a crucial yet underexplored area. Vanilla retrieval methods, which use sparse or dense retrievers to directly search for relevant rules to support downstream reasoning, often suffer from low accuracy. This is primarily due to a significant semantic gap between the instantiated facts in the queries and the abstract representations of the rules. Such misalignment results in suboptimal retrieval quality, which in turn negatively impacts reasoning performance. To overcome these challenges, we propose Self-Induction Augmented Retrieval (SIAR), a novel approach that utilizes Large Language Models (LLMs) to induce potential inferential rules that might offer benefits for reasoning by abstracting the underlying knowledge and logical structure in queries. These induced rules are then used for query augmentation to improve retrieval effectiveness. Additionally, we introduce Rule Relevance ReEstimate (R$^3$), a method that re-estimates the relevance of retrieved rules by assessing whether the abstract knowledge they contain can be instantiated to align with the facts in the queries and the helpfulness for reasoning. Extensive experiments across various settings demonstrate the effectiveness and versatility of our proposed methods.
中文: 本文提出自归纳增强检索(SIAR)方法,利用大语言模型从查询中归纳推理规则以改进检索效果,并结合规则相关性重估(R³)方法评估规则适用性,在多场景实验中显著提升了推理性能。
English: This paper introduces Self-Induction Augmented Retrieval (SIAR), which uses LLMs to generate inferential rules from queries for enhanced retrieval, and Rule Relevance ReEstimate (R³) to reassess rule relevance, significantly improving reasoning performance across diverse settings.
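The semantic gap between instantiated queries and abstract rules, and the effect of augmenting the query with an induced rule, can be seen in a toy example. This is only a sketch under assumptions: the token-overlap scorer, the rule strings, and `induced_rule` (standing in for an LLM-generated rule) are hypothetical, and the paper's retrievers and prompts are not reproduced here.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphabetic word set; a deliberately crude sparse representation."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(query: str, rule: str) -> float:
    """Fraction of the rule's words that also appear in the query."""
    rule_tokens = tokens(rule)
    return len(tokens(query) & rule_tokens) / len(rule_tokens) if rule_tokens else 0.0


def retrieve(query: str, rules: List[str], k: int = 1) -> List[str]:
    """Return the k rules with the highest overlap score."""
    return sorted(rules, key=lambda r: overlap_score(query, r), reverse=True)[:k]


ABSTRACT_RULE = "If X is the parent of Y and Y is the parent of Z then X is the grandparent of Z"
DISTRACTOR = "If Alice visits Carol then Bob is happy"
RULES = [DISTRACTOR, ABSTRACT_RULE]

query = "Alice is the mother of Bob and Bob is the father of Carol. Who is Alice to Carol?"

# Hypothetical output of the self-induction step (an LLM would produce this).
induced_rule = "If X is a parent of Y and Y is a parent of Z then X is a grandparent of Z"
augmented_query = query + " " + induced_rule

vanilla_hit = retrieve(query, RULES)[0]
augmented_hit = retrieve(augmented_query, RULES)[0]
```

With the raw query, the distractor wins because it shares entity names with the query; after augmentation, the abstract rule scores highest because the induced rule supplies the abstract vocabulary.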

Authors:Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, An Yang, Binyuan Hui, Dayiheng Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, Junyang Lin
Title: WorldPM: Scaling Human Preference Modeling
Abstract:
Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling. We propose World Preference Modeling (WorldPM) to emphasize this scaling potential, where World Preference embodies a unified representation of human preferences. In this paper, we collect preference data from public forums covering diverse user communities, and conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters. We observe distinct patterns across different evaluation metrics: (1) Adversarial metrics (ability to identify deceptive features) consistently scale up with increased training data and base model size; (2) Objective metrics (objective knowledge with well-defined answers) show emergent behavior in larger language models, highlighting WorldPM's scalability potential; (3) Subjective metrics (subjective preferences from a limited number of humans or AI) do not demonstrate scaling trends. Further experiments validate the effectiveness of WorldPM as a foundation for preference fine-tuning. Through evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly improves the generalization performance across human preference datasets of varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5% on many key subtasks. Integrating WorldPM into our internal RLHF pipeline, we observe significant improvements on both in-house and public evaluation sets, with notable gains of 4% to 8% in our in-house evaluations.
中文: 本研究提出世界偏好建模(WorldPM),发现偏好建模与语言建模类似遵循规模定律,对抗性和客观性指标随数据和模型规模扩大持续提升,而主观性指标无此规律,WorldPM能显著提升不同规模偏好数据集的泛化性能并在强化学习人类反馈流程中取得显著改进。
English: The study introduces World Preference Modeling (WorldPM), demonstrating that preference modeling follows scaling laws similar to language modeling, with adversarial and objective metrics showing consistent improvement with increased data and model size, while subjective metrics do not scale, and WorldPM significantly enhances generalization across various preference datasets and RLHF pipelines.

Authors:Ziyang Huang, Xiaowei Yuan, Yiming Ju, Jun Zhao, Kang Liu
Title: Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent
Abstract:
Retrieval-augmented generation (RAG) is a common strategy to reduce hallucinations in Large Language Models (LLMs). While reinforcement learning (RL) can enable LLMs to act as search agents by activating retrieval capabilities, existing ones often underutilize their internal knowledge. This can lead to redundant retrievals, potentially harmful knowledge conflicts, and increased inference latency. To address these limitations, an efficient and adaptive search agent capable of discerning optimal retrieval timing and synergistically integrating parametric (internal) and retrieved (external) knowledge is urgently needed. This paper introduces the Reinforced Internal-External Knowledge Synergistic Reasoning Agent (IKEA), which can identify its own knowledge boundary and prioritize the utilization of internal knowledge, resorting to external search only when internal knowledge is deemed insufficient. This is achieved using a novel knowledge-boundary aware reward function and a knowledge-boundary aware training dataset. These are designed for internal-external knowledge synergy oriented RL, incentivizing the model to deliver accurate answers, minimize unnecessary retrievals, and encourage appropriate external searches when its own knowledge is lacking. Evaluations across multiple knowledge reasoning tasks demonstrate that IKEA significantly outperforms baseline methods, reduces retrieval frequency significantly, and exhibits robust generalization capabilities.
中文摘要:本文提出IKEA强化智能体,通过优先利用内部知识并在知识不足时启动外部检索,结合新型奖励机制显著减少冗余搜索并提升多任务推理性能。
English Summary: The paper introduces IKEA, a reinforced agent that optimizes retrieval-augmented generation by prioritizing internal knowledge and activating external searches only when necessary, using a novel reward function to reduce redundancies and enhance reasoning across tasks.
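The three incentives named in the abstract (accurate answers, few unnecessary retrievals, and searching when internal knowledge falls short) can be combined in a reward of roughly the following shape. The formula, the constants, and the name `knowledge_boundary_reward` are illustrative assumptions; the abstract does not give IKEA's actual reward function.

```python
def knowledge_boundary_reward(
    correct: bool,
    n_retrievals: int,
    max_retrievals: int = 3,  # assumed cap on searches per question
    bonus: float = 0.2,       # assumed size of the shaping term
) -> float:
    """Hypothetical reward: correctness first, then a retrieval-aware adjustment."""
    if n_retrievals < 0:
        raise ValueError("n_retrievals must be non-negative")
    used = min(n_retrievals, max_retrievals) / max_retrievals
    if correct:
        # Correct answers earn more when they needed fewer external searches.
        return 1.0 + bonus * (1.0 - used)
    # A wrong answer given without searching earns nothing; a wrong answer
    # given after at least one search keeps a small credit, so the agent is
    # nudged to search when its internal knowledge is insufficient.
    return bonus * 0.5 if n_retrievals > 0 else 0.0
```

Under these assumptions the ordering is: correct without search, then correct with search, then wrong after searching, then wrong without searching.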

Authors:Baoxia Du, Hongyang Du, Dusit Niyato, Ruidong Li
Title: Task-Oriented Semantic Communication in Large Multimodal Models-based Vehicle Networks
Abstract:
Task-oriented semantic communication has emerged as a fundamental approach for enhancing performance in various communication scenarios. While recent advances in Generative Artificial Intelligence (GenAI), such as Large Language Models (LLMs), have been applied to semantic communication designs, the potential of Large Multimodal Models (LMMs) remains largely unexplored. In this paper, we investigate an LMM-based vehicle AI assistant using a Large Language and Vision Assistant (LLaVA) and propose a task-oriented semantic communication framework to facilitate efficient interaction between users and cloud servers. To reduce computational demands and shorten response time, we optimize LLaVA's image slicing to selectively focus on areas of utmost interest to users. Additionally, we assess the importance of image patches by combining objective and subjective user attention, adjusting energy usage for transmitting semantic information. This strategy optimizes resource utilization, ensuring precise transmission of critical information. We construct a Visual Question Answering (VQA) dataset for traffic scenarios to evaluate effectiveness. Experimental results show that our semantic communication framework significantly increases accuracy in answering questions under the same channel conditions, performing particularly well in environments with poor Signal-to-Noise Ratios (SNR). Accuracy can be improved by 13.4% at an SNR of 12dB and 33.1% at 10dB, respectively.
中文: 本文提出了一种基于大型多模态模型的任务导向语义通信框架,通过优化图像处理和能量分配,在交通场景中显著提升了问答准确性,尤其在低信噪比环境下表现优异。
English: This paper introduces a task-oriented semantic communication framework utilizing a Large Multimodal Model-based vehicle AI assistant, which optimizes image processing and energy allocation to enhance question-answering accuracy in traffic scenarios, achieving significant improvements under low SNR conditions.

Authors:Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie
Title: SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding
Abstract:
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from the other 11 existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.
中文: 本文提出VGBench和SpatialScore基准来评估多模态大语言模型的三维空间理解能力,并开发了SpatialAgent系统,该系统在解决空间推理挑战方面展现出显著效果。
English: This paper introduces the VGBench and SpatialScore benchmarks to evaluate multimodal large language models' 3D spatial understanding, and proposes the SpatialAgent system, which demonstrates effectiveness in addressing spatial reasoning challenges.

Authors:Xiechi Zhang, Zetian Ouyang, Linlin Wang, Gerard de Melo, Zhu Cao, Xiaoling Wang, Ya Zhang, Yanfeng Wang, Liang He
Title: AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation
Abstract:
With the proliferation of large language models (LLMs) in the medical domain, there is increasing demand for improved evaluation techniques to assess their capabilities. However, traditional metrics like F1 and ROUGE, which rely on token overlaps to measure quality, significantly overlook the importance of medical terminology. While human evaluation tends to be more reliable, it can be very costly and may also suffer from inaccuracies due to limits in human expertise and motivation. Although there are some evaluation methods based on LLMs, their usability in the medical field is limited due to their proprietary nature or lack of expertise. To tackle these challenges, we present AutoMedEval, an open-sourced automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation. Specifically, we propose a hierarchical training method involving curriculum instruction tuning and an iterative knowledge introspection mechanism, enabling AutoMedEval to acquire professional medical assessment capabilities with limited instructional data. Human evaluations indicate that AutoMedEval surpasses other baselines in terms of correlation with human judgments.
中文: 该摘要介绍了AutoMedEval,一个开源的130亿参数模型,旨在通过分层训练和迭代知识内省自动评估医学大语言模型,减少对人类评估的依赖,并在与人类判断的相关性上优于其他基线方法。
English: The abstract introduces AutoMedEval, an open-source 13B-parameter model designed to automatically evaluate medical large language models by reducing reliance on human assessment through hierarchical training and iterative knowledge introspection, achieving superior correlation with human judgments.

Authors:Jiazheng Zhang, Wenqing Jing, Zizhuo Zhang, Zhiheng Xi, Shihan Dou, Rongxiang Weng, Jiahuan Li, Jingang Wang, Mingxu Chai, Shibo Hong, Tao Gui, Qi Zhang
Title: Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment
Abstract:
Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human values. However, noisy preferences in human feedback can lead to reward misgeneralization - a phenomenon where reward models learn spurious correlations or overfit to noisy preferences, which poses important challenges to the generalization of RMs. This paper systematically analyzes the characteristics of preference pairs and aims to identify how noisy preferences differ from human-aligned preferences in reward modeling. Our analysis reveals that noisy preferences are difficult for RMs to fit, as they cause sharp training fluctuations and irregular gradient updates. These distinctive dynamics suggest the feasibility of identifying and excluding such noisy preferences. Empirical studies demonstrate that a policy LLM optimized with a reward model trained on the full preference dataset, which includes substantial noise, performs worse than the one trained on a subset of exclusively high quality preferences. To address this challenge, we propose an online Collaborative Reward Modeling (CRM) framework to achieve robust preference learning through peer review and curriculum learning. In particular, CRM maintains two RMs that collaboratively filter potential noisy preferences by peer-reviewing each other's data selections. Curriculum learning synchronizes the capabilities of two models, mitigating excessive disparities to promote the utility of peer review. Extensive experiments demonstrate that CRM significantly enhances RM generalization, with up to 9.94 points improvement on RewardBench under an extreme 40% noise. Moreover, CRM can seamlessly extend to implicit-reward alignment methods, offering a robust and versatile alignment strategy.
中文: 奖励模型常因嘈杂的人类偏好而产生错误泛化,但新的协同奖励建模框架通过同行评审和课程学习有效过滤噪声,显著提升了泛化能力。
English: Reward models often misgeneralize due to noisy human preferences, but a new Collaborative Reward Modeling framework effectively filters out noise through peer review and curriculum learning to significantly improve generalization.
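One way to picture two reward models filtering noisy preference pairs for each other is a small-loss exchange: each model keeps the pairs it fits best and passes that selection to its peer. The sketch below assumes a pairwise logistic loss and a fixed keep count; both choices, and all function names, are assumptions for illustration rather than the CRM procedure itself.

```python
import math
from typing import Dict, List, Sequence


def pairwise_loss(margin: float) -> float:
    """-log(sigmoid(margin)), where margin = r(chosen) - r(rejected)."""
    return math.log1p(math.exp(-margin))


def select_clean(margins: Sequence[float], n_keep: int) -> List[int]:
    """Indices of the n_keep pairs with the smallest loss, in ascending index order."""
    order = sorted(range(len(margins)), key=lambda i: pairwise_loss(margins[i]))
    return sorted(order[:n_keep])


def peer_review(
    margins_a: Sequence[float],
    margins_b: Sequence[float],
    n_keep: int,
) -> Dict[str, List[int]]:
    """Each model trains on the pairs its peer judged to be clean."""
    return {
        "train_a_on": select_clean(margins_b, n_keep),
        "train_b_on": select_clean(margins_a, n_keep),
    }


# Hypothetical reward margins from two models on the same five preference pairs.
margins_a = [2.0, 1.5, -3.0, 0.5, -2.5]
margins_b = [1.8, 2.2, -2.0, -0.1, 0.9]
batch = peer_review(margins_a, margins_b, n_keep=3)
```

Pair 2, which both toy models strongly disagree with, is excluded from both training sets, while the two models still differ on which borderline pair to keep.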

Authors:Jiayuan Rao, Zifeng Li, Haoning Wu, Ya Zhang, Yanfeng Wang, Weidi Xie
Title: Multi-Agent System for Comprehensive Soccer Understanding
Abstract:
Recent advances in soccer understanding have demonstrated rapid progress, yet existing research predominantly focuses on isolated or narrow tasks. To bridge this gap, we propose a comprehensive framework for holistic soccer understanding. Concretely, we make the following contributions in this paper: (i) we construct SoccerWiki, the first large-scale multimodal soccer knowledge base, integrating rich domain knowledge about players, teams, referees, and venues to enable knowledge-driven reasoning; (ii) we present SoccerBench, the largest and most comprehensive soccer-specific benchmark, featuring around 10K multimodal (text, image, video) multi-choice QA pairs across 13 distinct tasks; (iii) we introduce SoccerAgent, a novel multi-agent system that decomposes complex soccer questions via collaborative reasoning, leveraging domain expertise from SoccerWiki and achieving robust performance; (iv) extensive evaluations and comparisons with representative MLLMs on SoccerBench highlight the superiority of our agentic system.
中文: 本文提出了一个全面的足球理解框架,包含多模态知识库SoccerWiki、含1万问答对的综合基准SoccerBench,以及通过协同推理展现卓越性能的多智能体系统SoccerAgent。
English: This paper introduces a holistic soccer understanding framework featuring SoccerWiki, a multimodal knowledge base; SoccerBench, a comprehensive benchmark with 10K QA pairs; and SoccerAgent, a multi-agent system that demonstrates superior performance through collaborative reasoning.

Authors:Masaki Murooka, Kei Okada, Masayuki Inaba
Title: Optimization-based Posture Generation for Whole-body Contact Motion by Contact Point Search on the Body Surface
Abstract:
Whole-body contact is an effective strategy for improving the stability and efficiency of the motion of robots. For robots to automatically perform such motions, we propose a posture generation method that employs all available surfaces of the robot links. By representing the contact point on the body surface by two-dimensional configuration variables, the joint positions and contact points are simultaneously determined through a gradient-based optimization. By generating motions with the proposed method, we present experiments in which robots manipulate objects effectively utilizing whole-body contact.
中文: 该研究提出一种姿态生成方法,使机器人能够利用所有可用的连杆表面进行全身接触,通过基于梯度的优化同时确定关节位置和接触点,从而提升运动稳定性和效率。
English: The study introduces a posture generation method that enables robots to use all available link surfaces for whole-body contact, optimizing joint positions and contact points through gradient-based techniques to enhance motion stability and efficiency.

Authors:Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Yufeng Chen, Jinan Xu, Jie Zhou
Title: Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts
Abstract:
Continually expanding new languages for existing large language models (LLMs) is a promising yet challenging approach to building powerful multilingual LLMs. The biggest challenge is to make the model continuously learn new languages while preserving its proficiency in old languages. To achieve this, recent work utilizes the Mixture-of-Experts (MoE) architecture to expand new languages by adding new experts and avoid catastrophic forgetting of old languages by routing corresponding tokens to the original model backbone (old experts). Although intuitive, this kind of method is parameter-costly when expanding new languages and still inevitably impacts the performance of old languages. To address these limitations, we analyze the language characteristics of different layers in LLMs and propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Specifically, we find different layers in LLMs exhibit different representation similarities between languages and then utilize the similarity as the indicator to allocate experts for each layer, i.e., the higher the similarity, the fewer the experts. Additionally, to further mitigate the forgetting of old languages, we add a classifier in front of the router network on the layers with higher similarity to guide the routing of old language tokens. Experimental results show that our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting and with 33.3% fewer experts in the lifelong-expansion setting, demonstrating the effectiveness of our method.
中文: 本文提出LayerMoE算法,通过分析不同层级的语言表征相似度来分配专家数量——相似度越高专家越少,并添加分类器引导旧语言路由,在单次和持续扩展场景下分别以60%和33.3%更少的专家数量超越了现有最佳方法。
English: This paper introduces LayerMoE, a layer-wise expert allocation algorithm that optimizes the expansion of new languages in multilingual large language models by allocating fewer experts to layers with higher cross-lingual similarity and adding a classifier to mitigate forgetting of old languages, achieving superior performance with significantly fewer parameters.
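The allocation principle (the higher the cross-lingual similarity of a layer, the fewer new experts it receives) can be sketched with a simple proportional rule. The weighting by `1 - similarity`, the largest-remainder rounding, and the name `allocate_experts` are assumptions made for illustration; the abstract does not state the exact allocation formula.

```python
import math
from typing import List, Sequence


def allocate_experts(similarities: Sequence[float], budget: int) -> List[int]:
    """Split a total expert budget across layers, favouring low-similarity layers.

    Assumed rule: each layer's share is proportional to (1 - similarity), and
    leftover experts go to the layers with the largest fractional remainder.
    """
    weights = [1.0 - s for s in similarities]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one layer must have similarity below 1")
    shares = [budget * w / total for w in weights]
    counts = [math.floor(x) for x in shares]
    leftover = budget - sum(counts)
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical per-layer similarities between old and new languages.
layer_similarity = [0.9, 0.5, 0.1]
plan = allocate_experts(layer_similarity, budget=6)
```

For these toy values the plan is `[0, 2, 4]`: the most language-agnostic layer gets no new experts and the least similar layer gets the most.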

Authors:Yunlong Liang, Fandong Meng, Jiaan Wang, Jie Zhou
Title: SlangDIT: Benchmarking LLMs in Interpretative Slang Translation
Abstract:
The challenge of slang translation lies in capturing context-dependent semantic extensions, as slang terms often convey meanings beyond their literal interpretation. While slang detection, explanation, and translation have been studied as isolated tasks in the era of large language models (LLMs), their intrinsic interdependence remains underexplored. The main reason is the lack of a benchmark where the two tasks can be a prerequisite for the third one, which can facilitate idiomatic translation. In this paper, we introduce the interpretative slang translation task (named SlangDIT) consisting of three sub-tasks: slang detection, cross-lingual slang explanation, and slang translation within the current context, aiming to generate more accurate translation with the help of slang detection and slang explanation. To this end, we construct a SlangDIT dataset, containing over 25k English-Chinese sentence pairs. Each source sentence mentions at least one slang term and is labeled with the corresponding cross-lingual slang explanation. Based on the benchmark, we propose a deep thinking model, named SlangOWL. It first identifies whether the sentence contains a slang term, and then judges whether the slang term is polysemous and analyzes its possible meanings. Further, SlangOWL provides the best explanation of the slang term for the current context. Finally, according to the whole thought, SlangOWL offers a suitable translation. Our experiments on LLMs (e.g., Qwen2.5 and Llama-3.1) show that our deep thinking approach indeed enhances the performance of LLMs, where the proposed SlangOWL significantly surpasses the vanilla models and supervised fine-tuned models without thinking.
中文: 本文提出了SlangDIT基准,通过整合俚语检测、解释和翻译任务来提升翻译准确性,并开发了SlangOWL深度思考模型,显著增强了大型语言模型在俚语翻译中的表现。
English: The paper introduces SlangDIT, a benchmark for interpretative slang translation that integrates detection, explanation, and translation to improve accuracy, and proposes SlangOWL, a deep thinking model that significantly enhances LLM performance in this task.

Authors:Yunlong Liang, Fandong Meng, Jie Zhou
Title: THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing for Neural Machine Translation
Abstract:
The sparse Mixture-of-Experts (MoE) has achieved significant progress for neural machine translation (NMT). However, there exist two limitations in current MoE solutions which may lead to sub-optimal performance: 1) they directly inject NMT task knowledge into the MoE (e.g., domain/linguistics-specific knowledge), which is generally unavailable in practical applications, and they neglect the naturally grouped domain/linguistic properties; 2) the expert selection only depends on the localized token representation without considering the context, which captures the state of each token from a global view. To address the above limitations, we propose THOR-MoE via arming the MoE with hierarchical task-guided and context-responsive routing policies. Specifically, it 1) first predicts the domain/language label and then extracts mixed domain/language representation to allocate task-level experts in a hierarchical manner; 2) injects the context information to enhance the token routing from the pre-selected task-level expert set, which can help each token to be accurately routed to more specialized and suitable experts. Extensive experiments on multi-domain translation and multilingual translation benchmarks with different architectures consistently demonstrate the superior performance of THOR-MoE. Additionally, THOR-MoE operates as a plug-and-play module compatible with existing Top-$k$~\cite{shazeer2017} and Top-$p$~\cite{huang-etal-2024-harder} routing schemes, ensuring broad applicability across diverse MoE architectures. For instance, compared with vanilla Top-$p$~\cite{huang-etal-2024-harder} routing, the context-aware manner can achieve an average improvement of 0.75 BLEU with less than 22% activated parameters on multi-domain translation tasks.
中文: THOR-MoE通过引入分层任务引导路由和上下文感知专家选择机制,改进了稀疏专家混合模型在神经机器翻译中的性能,在多领域和多语言翻译任务中均表现出优越性,并能兼容现有主流路由方案。
English: THOR-MoE enhances sparse Mixture-of-Experts for neural machine translation by introducing hierarchical task-guided routing and context-responsive expert selection, achieving superior performance across multi-domain and multilingual benchmarks while maintaining broad compatibility with existing routing schemes.
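The routing ideas named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of plain probability lists, and the choice of Top-k for the task level followed by Top-p for the token level are all assumptions made for illustration. Top-k keeps a fixed number of experts, Top-p keeps the smallest set whose cumulative router probability reaches a threshold, and the hierarchical variant restricts token routing to a pre-selected task-level expert set.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(probs, k):
    """Return the indices of the k experts with the highest probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def top_p_route(probs, p):
    """Return the smallest expert set whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen


def hierarchical_route(task_probs, token_logits, task_k, token_p):
    """Hypothetical two-level routing: choose a task-level expert set first,
    then route the token only among those experts."""
    allowed = top_k_route(task_probs, task_k)
    restricted = softmax([token_logits[i] for i in allowed])
    picked = top_p_route(restricted, token_p)
    return [allowed[j] for j in picked]
```

In this sketch an expert with a high token logit is still excluded when the task-level step did not select it, which is the intended effect of pre-selecting experts before token routing.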

Authors:Jiaan Wang, Fandong Meng, Jie Zhou
Title: ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning
Abstract:
In recent years, the emergence of large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, has shown impressive capabilities in complex problems, e.g., mathematics and coding. Some pioneering studies attempt to bring the success of LRMs to neural machine translation (MT). They try to build LRMs with deep reasoning MT ability via reinforcement learning (RL). Despite some progress, these attempts generally focus on several high-resource languages, e.g., English and Chinese, leaving the performance on other languages unclear. Besides, the reward modeling methods in previous work do not fully unleash the potential of reinforcement learning in MT. In this work, we first design a new reward modeling method that compares the translation results of the policy MT model with those of a strong LRM (i.e., DeepSeek-R1-671B), and quantifies the comparisons to provide rewards. Experimental results demonstrate the superiority of the reward modeling method. Using Qwen2.5-7B-Instruct as the backbone, the trained model achieves new state-of-the-art performance in literary translation, and outperforms strong LRMs including OpenAI-o1 and DeepSeek-R1. Furthermore, we extend our method to multilingual settings with 11 languages. With a carefully designed lightweight reward modeling in RL, we can simply transfer the strong MT ability from a single direction into multiple (i.e., 90) translation directions and achieve impressive multilingual MT performance.
中文: 本研究针对大型推理模型在机器翻译中的应用,提出了一种基于DeepSeek-R1-671B参考对比的新型奖励建模方法,不仅实现了文学翻译的最优性能,还通过轻量级强化学习成功将单语翻译能力扩展到90个多语言翻译方向。
English: Recent advances in large reasoning models (LRMs) have been applied to neural machine translation, with this study introducing a novel reward modeling method using DeepSeek-R1-671B as reference, achieving state-of-the-art literary translation performance and successfully extending to 90 multilingual translation directions through lightweight reinforcement learning.

Authors:Jiaan Wang, Fandong Meng, Zengkui Sun, Yunlong Liang, Yuxuan Cao, Jiarong Xu, Haoxiang Shi, Jie Zhou
Title: An Empirical Study of Many-to-Many Summarization with Large Language Models
Abstract:
Many-to-many summarization (M2MS) aims to process documents in any language and generate the corresponding summaries also in any language. Recently, large language models (LLMs) have shown strong multilingual abilities, giving them the potential to perform M2MS in real applications. This work presents a systematic empirical study on LLMs' M2MS ability. Specifically, we first reorganize M2MS data based on eight previous domain-specific datasets. The reorganized data contains 47.8K samples spanning five domains and six languages, which could be used to train and evaluate LLMs. Then, we benchmark 18 LLMs in a zero-shot manner and an instruction-tuning manner. Fine-tuned traditional models (e.g., mBART) are also evaluated for comparison. Our experiments reveal that zero-shot LLMs achieve competitive results with fine-tuned traditional models. After instruction tuning, open-source LLMs can significantly improve their M2MS ability, and outperform zero-shot LLMs (including GPT-4) in terms of automatic evaluations. In addition, we demonstrate that this task-specific improvement does not sacrifice the LLMs' general task-solving abilities. However, as revealed by our human evaluation, LLMs still face the factuality issue, and instruction tuning might intensify the issue. Thus, how to control factual errors becomes the key when building LLM summarizers in real applications, and is worth noting in future research.
中文: 本研究系统评估了大语言模型的多对多摘要能力,发现指令调优的开源模型在自动评估中优于零样本模型,但在实际应用中需加强事实准确性控制。
English: This study systematically evaluates large language models' many-to-many summarization capabilities, finding that instruction-tuned open-source models outperform zero-shot models in automated metrics but require improved factuality control for practical applications.

Authors:Ziqing Qiao, Yongheng Deng, Jiali Zeng, Dong Wang, Lai Wei, Guanbo Wang, Fandong Meng, Jie Zhou, Ju Ren, Yaoxue Zhang
Title: ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning
Abstract:
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs, increasing computational overhead. Existing fine-tuning-based compression methods either apply post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection, which fails to remove redundant content thoroughly. To address these limitations, this work begins by framing two key patterns of redundant reflection in LRMs--Confidence Deficit, wherein the model reflects on correct intermediate steps, and Termination Delay, where reflection continues after a verified, confident answer--through a confidence-guided perspective. Based on this, we introduce ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence, and Early Stopping to terminate reasoning when confidence is sufficient. Extensive experiments demonstrate that compared to baseline methods, fine-tuning LRMs on ConCISE-generated data yields a better balance between compression and task performance, reducing length by up to approximately 50% under SimPO, while maintaining high task accuracy.
中文:大型推理模型常产生冗长输出,而ConCISE框架通过置信度引导的压缩方法,在保持高精度的同时将推理链长度减少高达50%。
English: Large Reasoning Models often produce verbose outputs, but the ConCISE framework addresses this by using confidence-guided compression to reduce reasoning chain length by up to 50% while maintaining high accuracy.

Authors:Yufei Yin, Lechao Cheng, Wengang Zhou, Jiajun Deng, Zhou Yu, Houqiang Li
Title: Self-Classification Enhancement and Correction for Weakly Supervised Object Detection
Abstract:
In recent years, weakly supervised object detection (WSOD) has attracted much attention due to its low labeling cost. The success of recent WSOD models is often ascribed to the two-stage multi-class classification (MCC) task, i.e., multiple instance learning and online classification refinement. Despite achieving non-trivial progress, these methods overlook potential classification ambiguities between these two MCC tasks and fail to leverage their unique strengths. In this work, we introduce a novel WSOD framework to ameliorate these two issues. For one thing, we propose a self-classification enhancement module that integrates intra-class binary classification (ICBC) to bridge the gap between the two distinct MCC tasks. The ICBC task enhances the network's discrimination between positive and mis-located samples in a class-wise manner and forges a mutually reinforcing relationship with the MCC task. For another, we propose a self-classification correction algorithm during inference, which combines the results of both MCC tasks to effectively reduce misclassified predictions. Extensive experiments on the prevalent VOC 2007 & 2012 datasets demonstrate the superior performance of our framework.
中文: 本文提出了一种新颖的弱监督目标检测框架,通过引入类内二元分类来增强样本区分度,并在推理阶段采用自校正算法减少误判,在VOC数据集上取得了优越性能。
English: This paper introduces a novel weakly supervised object detection framework that addresses classification ambiguities by integrating intra-class binary classification to enhance discrimination and a self-correction algorithm during inference to reduce misclassifications, achieving superior results on VOC datasets.

Authors:Ruopei Sun, Jianfeng Cai, Jinhua Zhu, Kangwen Zhao, Dongyun Xue, Wengang Zhou, Li Li, Houqiang Li
Title: Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks
Abstract:
RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences, demonstrating exceptional and measurable efficacy in instruction following tasks; however, it exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks. Conventional approaches rely heavily on human annotation or more sophisticated large language models, thereby introducing substantial resource expenditure or potential bias concerns. Meanwhile, alternative synthetic methods that augment standard preference datasets often compromise the model's semantic quality. Our research identifies a critical oversight in existing techniques, which predominantly focus on comparing responses while neglecting valuable latent signals embedded within prompt inputs, and which only focus on preference disparities at the intra-sample level, while neglecting to account for the inter-sample level preference differentials that exist among preference data. To leverage these previously neglected indicators, we propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities. Specifically, for any given response in original preference data pairs, we construct varied prompts with a preference relation under different conditions, in order to learn intra-sample level preference disparities. Furthermore, for any given original preference pair, we synthesize multi-instruction preference pairs to capture preference discrepancies at the inter-sample level. Building on the two datasets constructed above, we consequently devise two sophisticated training objective functions. Subsequently, our framework integrates seamlessly into both Reward Modeling and Direct Preference Optimization paradigms. Through rigorous evaluation across multiple benchmarks, we empirically validate the efficacy of our framework.
中文:RLHF是使人工智能与人类偏好对齐的主要方法,但在处理复杂多指令任务时表现不足,因此我们提出了多级感知偏好学习(MAPL)框架,利用被忽视的样本内和样本间偏好信号来提升性能,同时保持语义质量。
English: RLHF is a leading method for aligning AI with human preferences but struggles with complex multi-instruction tasks, prompting the development of the Multi-level Aware Preference Learning (MAPL) framework that leverages overlooked intra- and inter-sample preference signals to enhance performance without sacrificing semantic quality.

Authors:Kangwen Zhao, Jianfeng Cai, Jinhua Zhu, Ruopei Sun, Dongyun Xue, Wengang Zhou, Li Li, Houqiang Li
Title: Bias Fitting to Mitigate Length Bias of Reward Model in RLHF
Abstract:
Reinforcement Learning from Human Feedback (RLHF) relies on reward models to align large language models with human preferences. However, RLHF often suffers from reward hacking, wherein policy learning exploits flaws in the trained reward model to maximize reward scores without genuinely aligning with human preferences. A significant example of such reward hacking is length bias, where reward models usually favor longer responses irrespective of actual response quality. Previous works on length bias have notable limitations: these approaches either mitigate bias without characterizing the bias form, or simply assume a linear length-reward relation. To accurately model the intricate nature of length bias and facilitate more effective bias mitigation, we propose FiMi-RM (Bias Fitting to Mitigate Length Bias of Reward Model in RLHF), a framework that autonomously learns and corrects underlying bias patterns. Our approach consists of three stages: First, we train a standard reward model which inherently contains length bias. Next, we deploy a lightweight fitting model to explicitly capture the non-linear relation between length and reward. Finally, we incorporate this learned relation into the reward model to debias. Experimental results demonstrate that FiMi-RM achieves a more balanced length-reward distribution. Furthermore, when applied to alignment algorithms, our debiased reward model improves length-controlled win rate and reduces verbosity without compromising its performance.
Chinese: 针对人类反馈强化学习中常见的奖励破解问题,如长度偏差,我们提出FiMi-RM框架,通过自主学习并修正非线性偏差模式,在不影响性能的前提下实现更均衡的长度-奖励分布。
English: Reinforcement Learning from Human Feedback often suffers from reward hacking, such as length bias, so we propose FiMi-RM to autonomously learn and correct non-linear bias patterns, achieving a more balanced length-reward distribution without compromising performance.
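The three-stage idea described above (score, fit the length-reward relation, remove it) can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not specify the fitting model or how the learned relation is incorporated, so a piecewise-constant binned-mean curve stands in for the "lightweight fitting model" and plain subtraction stands in for the debiasing step. The names `fit_length_bias`, `debias`, and `bin_width` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_length_bias(lengths, rewards, bin_width):
    """Fit a piecewise-constant curve of reward versus response length.

    Assumption: binned means replace the paper's fitting model; any
    non-linear regressor could be used here instead.
    """
    buckets = defaultdict(list)
    for n, r in zip(lengths, rewards):
        buckets[n // bin_width].append(r)
    return {b: mean(rs) for b, rs in buckets.items()}


def debias(lengths, rewards, bin_width):
    """Subtract the fitted length trend while keeping the overall mean reward."""
    curve = fit_length_bias(lengths, rewards, bin_width)
    overall = mean(rewards)
    return [r - curve[n // bin_width] + overall
            for n, r in zip(lengths, rewards)]
```

After debiasing, responses are compared only against others of similar length, so a long response no longer gains reward from length alone. A caveat of this simple stand-in is that it also removes any legitimate quality signal that happens to correlate with length.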

Authors:Jianfeng Cai, Wengang Zhou, Zongmeng Zhang, Jiale Hong, Nianji Zhan, Houqiang Li
Title: Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering
Abstract:
Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, hallucination, where the model generates plausible yet incorrect outputs, persists as a significant and under-addressed challenge in the video domain. Among existing solutions, activation engineering has proven successful in mitigating hallucinations in LLMs and ImageLLMs, yet its applicability to VideoLLMs remains largely unexplored. In this work, we are the first to systematically investigate the effectiveness and underlying mechanisms of activation engineering for mitigating hallucinations in VideoLLMs. We initially conduct an investigation of the key factors affecting the performance of activation engineering and find that a model's sensitivity to hallucination depends on temporal variation rather than task type. Moreover, selecting appropriate internal modules and datasets for activation engineering is critical for reducing hallucination. Guided by these findings, we propose a temporal-aware activation engineering framework for VideoLLMs, which adaptively identifies and manipulates hallucination-sensitive modules based on the temporal variation characteristic, substantially mitigating hallucinations without additional LLM fine-tuning. Experiments across multiple models and benchmarks demonstrate that our method markedly reduces hallucination in VideoLLMs, thereby validating the robustness of our findings.
Chinese Summary: 本研究提出了一种时序感知的激活工程框架,通过基于时序变化自适应定位幻觉敏感模块,有效减少视频大语言模型中的幻觉现象,且无需额外微调。
English Summary: This study introduces a temporal-aware activation engineering framework that effectively mitigates hallucinations in VideoLLMs by adaptively targeting hallucination-sensitive modules based on temporal variation, eliminating the need for additional fine-tuning.

Authors:Jiaran Ye, Zijun Yao, Zhidian Huang, Liangming Pan, Jinxin Liu, Yushi Bai, Amy Xin, Liu Weichuan, Xiaoyin Che, Lei Hou, Juanzi Li
Title: How does Transformer Learn Implicit Reasoning?
Abstract:
Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly -- producing correct answers without explicitly verbalizing intermediate steps -- but the underlying mechanisms remain poorly understood. In this paper, we study how such implicit reasoning emerges by training transformers from scratch in a controlled symbolic environment. Our analysis reveals a three-stage developmental trajectory: early memorization, followed by in-distribution generalization, and eventually cross-distribution generalization. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures. To interpret these behaviors, we introduce two diagnostic tools: cross-query semantic patching, which identifies semantically reusable intermediate representations, and a cosine-based representational lens, which reveals that successful reasoning correlates with cosine-based clustering in hidden space. This clustering phenomenon in turn provides a coherent explanation for the behavioral dynamics observed across training, linking representational structure to reasoning capability. These findings provide new insights into the interpretability of implicit multi-hop reasoning in LLMs, helping to clarify how complex reasoning processes unfold internally and offering pathways to enhance the transparency of such models.
中文: 最新研究表明,大型语言模型通过三阶段学习过程形成隐式多跳推理能力,其中隐藏表征的聚类模式与推理成功相关,为模型可解释性提供了新视角。
English: Recent research reveals that large language models develop implicit multi-hop reasoning through a three-stage learning process, where hidden representations form clustered patterns that correlate with reasoning success, offering new insights into model interpretability.

Authors:Zijun Yao, Yantao Liu, Yanxu Chen, Jianhui Chen, Junfeng Fang, Lei Hou, Juanzi Li, Tat-Seng Chua
Title: Are Reasoning Models More Prone to Hallucination?
Abstract:
Recently evolved large reasoning models (LRMs) show powerful performance in solving complex tasks with long chain-of-thought (CoT) reasoning capability. As these LRMs are mostly developed by post-training on formal reasoning tasks, whether they generalize the reasoning capability to help reduce hallucination in fact-seeking tasks remains unclear and debated. For instance, DeepSeek-R1 reports increased performance on SimpleQA, a fact-seeking benchmark, while OpenAI-o3 observes even more severe hallucination. This discrepancy naturally raises the following research question: Are reasoning models more prone to hallucination? This paper addresses the question from three perspectives. (1) We first conduct a holistic evaluation of hallucination in LRMs. Our analysis reveals that LRMs that undergo a full post-training pipeline with cold start supervised fine-tuning (SFT) and verifiable reward RL generally alleviate their hallucination. In contrast, both distillation alone and RL training without cold start fine-tuning introduce more nuanced hallucinations. (2) To explore why different post-training pipelines alter the impact on hallucination in LRMs, we conduct behavior analysis. We characterize two critical cognitive behaviors that directly affect the factuality of an LRM: Flaw Repetition, where the surface-level reasoning attempts repeatedly follow the same underlying flawed logic, and Think-Answer Mismatch, where the final answer fails to faithfully match the previous CoT process. (3) Further, we investigate the mechanism behind the hallucination of LRMs from the perspective of model uncertainty. We find that increased hallucination of LRMs is usually associated with the misalignment between model uncertainty and factual accuracy. Our work provides an initial understanding of the hallucination in LRMs.
中文:近期大型推理模型虽展现出强大的思维链能力,但在事实性任务中产生幻觉的情况存在争议,研究表明完整后训练流程可缓解该问题,而思维重复、答案不匹配及不确定性错位是导致幻觉的关键机制。
English: Recent large reasoning models demonstrate strong chain-of-thought capabilities but exhibit inconsistent hallucination patterns in fact-seeking tasks, with our analysis revealing that proper training pipelines mitigate hallucinations while identifying flawed reasoning behaviors and uncertainty misalignment as key factors.

Authors:Kai Sun, Yushi Bai, Zhen Yang, Jiajie Zhang, Ji Qi, Lei Hou, Juanzi Li
Title: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models
Abstract:
Benefiting from contrastively trained visual encoders on large-scale natural scene images, Large Multimodal Models (LMMs) have achieved remarkable performance across various visual perception tasks. However, the inherent limitations of contrastive learning upon summarized descriptions fundamentally restrict the capabilities of models in meticulous reasoning, particularly in crucial scenarios of geometric problem-solving. To enhance geometric understanding, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train CLIP using our hard negative learning method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further conduct ablation studies to analyze three key factors: hard negative types, the efficiency of image-based negatives, and training configurations. These analyses yield important insights into optimizing hard negative strategies for geometric reasoning tasks.
Chinese: 本文提出MMCLIP框架,通过结合图像和文本的困难负样本对比学习增强视觉编码器对几何细节的捕捉能力,最终训练的MMGeoLM模型在几何推理基准上显著优于开源模型,并能与GPT-4o相媲美。
English: This paper introduces MMCLIP, a hard negative contrastive learning framework that enhances vision encoders by combining image-based and text-based negatives to improve fine-grained geometric reasoning, resulting in the MMGeoLM model which outperforms open-source models and rivals GPT-4o on benchmarks.

Authors:Yunjia Qi, Hao Peng, Xiaozhi Wang, Amy Xin, Youfeng Liu, Bin Xu, Lei Hou, Juanzi Li
Title: AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
Abstract:
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.
中文:本文提出了首个系统评估大语言模型在智能体场景中遵循指令能力的基准AgentIF,发现现有模型在处理现实应用中冗长复杂的约束时表现欠佳。
English: This paper introduces AgentIF, the first benchmark designed to systematically evaluate the instruction-following abilities of Large Language Models in agentic scenarios, revealing that current models struggle with lengthy and complex constraints despite their real-world applications.

Authors:Yinqiu Liu, Guangyuan Liu, Jiacheng Wang, Ruichen Zhang, Dusit Niyato, Geng Sun, Zehui Xiong, Zhu Han
Title: LAMeTA: Intent-Aware Agentic Network Optimization via a Large AI Model-Empowered Two-Stage Approach
Abstract:
Nowadays, Generative AI (GenAI) reshapes numerous domains by enabling machines to create content across modalities. As GenAI evolves into autonomous agents capable of reasoning, collaboration, and interaction, they are increasingly deployed on network infrastructures to serve humans automatically. This emerging paradigm, known as the agentic network, presents new optimization challenges due to the demand to incorporate subjective intents of human users expressed in natural language. Traditional generic Deep Reinforcement Learning (DRL) struggles to capture intent semantics and adjust policies dynamically, thus leading to suboptimality. In this paper, we present LAMeTA, a Large AI Model (LAM)-empowered Two-stage Approach for intent-aware agentic network optimization. First, we propose Intent-oriented Knowledge Distillation (IoKD), which efficiently distills intent-understanding capabilities from resource-intensive LAMs to lightweight edge LAMs (E-LAMs) to serve end users. Second, we develop Symbiotic Reinforcement Learning (SRL), integrating E-LAMs with a policy-based DRL framework. In SRL, E-LAMs translate natural language user intents into structured preference vectors that guide both state representation and reward design. The DRL, in turn, optimizes the generative service function chain composition and E-LAM selection based on real-time network conditions, thus optimizing the subjective Quality-of-Experience (QoE). Extensive experiments conducted in an agentic network with 81 agents demonstrate that IoKD reduces mean squared error in intent prediction by up to 22.5%, while SRL outperforms conventional generic DRL by up to 23.5% in maximizing intent-aware QoE.
中文:LAMeTA提出了一种两阶段方法,利用大型AI模型优化代理网络,通过将意图理解能力提炼至轻量级边缘模型,并将其与强化学习相结合实现动态策略调整,在意图预测和用户体验方面取得了显著提升。
English: LAMeTA introduces a two-stage approach using large AI models to optimize agentic networks by distilling intent understanding into lightweight edge models and integrating them with reinforcement learning for dynamic policy adjustments, achieving significant improvements in intent prediction and user experience.

Authors:Rui Li, Zixuan Hu, Wenxi Qu, Jinouwen Zhang, Zhenfei Yin, Sha Zhang, Xuantuo Huang, Hanqing Wang, Tai Wang, Jiangmiao Pang, Wanli Ouyang, Lei Bai, Wangmeng Zuo, Ling-Yu Duan, Dongzhan Zhou, Shixiang Tang
Title: LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents
Abstract:
Scientific embodied agents play a crucial role in modern laboratories by automating complex experimental workflows. Compared to typical household environments, laboratory settings impose significantly higher demands on perception of physical-chemical transformations and long-horizon planning, making them an ideal testbed for advancing embodied intelligence. However, its development has long been hampered by the lack of suitable simulators and benchmarks. In this paper, we address this gap by introducing LabUtopia, a comprehensive simulation and benchmarking suite designed to facilitate the development of generalizable, reasoning-capable embodied agents in laboratory settings. Specifically, it integrates i) LabSim, a high-fidelity simulator supporting multi-physics and chemically meaningful interactions; ii) LabScene, a scalable procedural generator for diverse scientific scenes; and iii) LabBench, a hierarchical benchmark spanning five levels of complexity from atomic actions to long-horizon mobile manipulation. LabUtopia supports 30 distinct tasks and includes more than 200 scene and instrument assets, enabling large-scale training and principled evaluation in high-complexity environments. We demonstrate that LabUtopia offers a powerful platform for advancing the integration of perception, planning, and control in scientific-purpose agents and provides a rigorous testbed for exploring the practical capabilities and generalization limits of embodied intelligence in future research.
中文: LabUtopia作为一套综合性模拟与基准测试平台,旨在推动实验室环境中科学具身智能体的发展,集成了高保真模拟器、可扩展场景生成器和多层次基准测试,以应对复杂实验任务。
English: LabUtopia is introduced as a comprehensive simulation and benchmarking suite to advance scientific embodied agents in laboratories, featuring a high-fidelity simulator, procedural scene generator, and hierarchical benchmark for complex tasks.

Authors:Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding
Title: The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Abstract:
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across a vast number of RL runs without entropy intervention: the policy entropy drops sharply at the early training stage, and this diminished exploratory ability is always accompanied by the saturation of policy performance. In practice, we establish a transformation equation R = -a*e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is traded from policy entropy and is thus bottlenecked by its exhaustion, and that the ceiling is fully predictable (H = 0, R = -a + b). Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage when using Policy Gradient-like algorithms. Our empirical study shows that the values of the covariance term and the entropy differences match exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy decreases monotonically. By understanding the mechanism behind entropy dynamics, we are motivated to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariances. Experiments show that these methods encourage exploration, thus helping the policy escape entropy collapse and achieve better downstream performance.
中文摘要:本研究揭示了策略熵崩溃是强化学习在大语言模型推理应用中规模化的主要障碍,并通过限制高协方差标记更新的Clip-Cov和KL-Cov两种技术有效管理熵值,从而增强探索能力并提升下游任务表现。
English Summary: This study identifies policy entropy collapse as a key barrier in scaling reinforcement learning for reasoning with large language models and proposes two techniques—Clip-Cov and KL-Cov—that effectively manage entropy by restricting updates to high-covariance tokens, thereby enhancing exploration and improving performance.
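The empirical law quoted in the abstract, R = -a*e^H + b, can be worked through numerically. The sketch below is an illustration only: the coefficient values are invented, and the least-squares fit on x = e^H is an assumed way to recover a and b, not a procedure taken from the paper. It shows how the predicted ceiling R = -a + b follows from setting H = 0.

```python
import math


def predict_performance(entropy, a, b):
    """Evaluate the empirical law R = -a * e^H + b from the abstract."""
    return -a * math.exp(entropy) + b


def performance_ceiling(a, b):
    """Performance predicted when entropy is exhausted (H = 0): R = -a + b."""
    return predict_performance(0.0, a, b)


def fit_coefficients(entropies, performances):
    """Recover (a, b) by ordinary least squares of R on x = e^H.

    The law is linear in x, with slope -a and intercept b.
    """
    xs = [math.exp(h) for h in entropies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(performances) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, performances))
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_x
```

Because R is linear in e^H, a handful of (H, R) measurements from early training is in principle enough to estimate a and b and therefore the ceiling, which is the sense in which the abstract calls the ceiling predictable.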

Authors:Wanghan Xu, Wenlong Zhang, Fenghua Ling, Ben Fei, Yusong Hu, Fangxuan Ren, Jintai Lin, Wanli Ouyang, Lei Bai
Title: Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System
Abstract:
Meta-analysis is a systematic research methodology that synthesizes data from multiple existing studies to derive comprehensive conclusions. This approach not only mitigates limitations inherent in individual studies but also facilitates novel discoveries through integrated data analysis. Traditional meta-analysis involves a complex multi-stage pipeline including literature retrieval, paper screening, and data extraction, which demands substantial human effort and time. While LLM-based methods can accelerate certain stages, they still face significant challenges, such as hallucinations in paper screening and data extraction. In this paper, we propose a multi-agent system, Manalyzer, which achieves end-to-end automated meta-analysis through tool calls. The hybrid review, hierarchical extraction, self-proving, and feedback checking strategies implemented in Manalyzer significantly alleviate these two hallucinations. To comprehensively evaluate the performance of meta-analysis, we construct a new benchmark comprising 729 papers across 3 domains, encompassing text, image, and table modalities, with over 10,000 data points. Extensive experiments demonstrate that Manalyzer achieves significant performance improvements over the LLM baseline in multiple meta-analysis tasks. Project page: https://black-yt.github.io/meta-analysis-page/ .
中文摘要:本文提出Manalyzer多智能体系统,通过工具调用实现端到端元分析自动化,其混合审查、分层提取等策略显著缓解了文献筛选与数据提取中的幻觉问题,并在多领域实验中展现出优越性能。
English Summary: This paper introduces Manalyzer, a multi-agent system that automates end-to-end meta-analysis through tool integration, effectively addressing hallucinations in paper screening and data extraction while demonstrating superior performance across diverse research domains.

Authors:Yiqun Zhang, Hao Li, Chenxu Wang, Linyao Chen, Qiaosheng Zhang, Peng Ye, Shi Feng, Daling Wang, Zhen Wang, Xinrun Wang, Jia Xu, Lei Bai, Wanli Ouyang, Shuyue Hu
Title: The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants
Abstract:
Proprietary giants are increasingly dominating the race for ever-larger language models. Can open-source, smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers -- a simple recipe that leverages the collective intelligence of these smaller models. The Avengers builds upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: score each model's performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response with repeated sampling. Remarkably, with 10 open-source models (~7B parameters each), the Avengers surpasses GPT-4o, 4.1, and 4.5 in average performance across 15 diverse datasets spanning mathematics, coding, logical reasoning, general knowledge, and affective tasks. In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization, and remains robust across various embedding models, clustering algorithms, ensemble strategies, and values of its sole parameter -- the number of clusters.
中文摘要:复仇者联盟方法通过聚类和投票机制整合多个小型开源语言模型,在数学、编程等15个任务中超越GPT-4等商业模型,尤其在数学任务上提升达18.21%。
English Summary: The Avengers method combines multiple smaller open-source language models through clustering and voting to outperform GPT-4o, GPT-4.1, and GPT-4.5 on average across 15 datasets, surpassing GPT-4.1 by 18.21% on mathematics tasks.
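
The inference-time routing described in the abstract (assign the query to its nearest cluster, pick the top-scoring model there, then vote over repeated samples) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the `cluster_scores` layout, and the `sample_fn` callback are all assumptions made for the example.

```python
import math
from collections import Counter

def nearest_cluster(query_vec, centroids):
    """Return the index of the centroid closest to the query (Euclidean distance, assumed)."""
    return min(range(len(centroids)), key=lambda i: math.dist(query_vec, centroids[i]))

def route_and_vote(query_vec, centroids, cluster_scores, sample_fn, n_samples=5):
    """Pick the best-scoring model for the query's cluster, then majority-vote its samples.

    cluster_scores: hypothetical list of {model_name: score} dicts, one per cluster.
    sample_fn(model_name, i): hypothetical callback returning the i-th sampled answer.
    """
    cluster = nearest_cluster(query_vec, centroids)
    scores = cluster_scores[cluster]
    best_model = max(scores, key=scores.get)
    answers = [sample_fn(best_model, i) for i in range(n_samples)]
    winner, _ = Counter(answers).most_common(1)[0]
    return best_model, winner

# Toy data: two clusters, two candidate models, and a deterministic fake sampler.
centroids = [(0.0, 0.0), (1.0, 1.0)]
cluster_scores = [
    {"model_a": 0.7, "model_b": 0.5},
    {"model_a": 0.4, "model_b": 0.8},
]

def fake_sampler(model, i):
    # Stand-in for repeated sampling from a language model.
    return "42" if i % 3 else "41"

result = route_and_vote((0.9, 0.8), centroids, cluster_scores, fake_sampler)
```

In this toy run the query lands in the second cluster, so `model_b` is selected and the majority answer among its five samples is returned.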

Authors:Xiaohui Wang, Peng Ye, Chenyu Huang, Shenghe Zheng, Bo Zhang, Lei Bai, Wanli Ouyang, Tao Chen
Title: Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression
Abstract:
With the rise of the fine-tuned--pretrained paradigm, storing numerous fine-tuned models for multi-tasking creates significant storage overhead. Delta compression alleviates this by storing only the pretrained model and the highly compressed delta weights (the differences between fine-tuned and pretrained model weights). However, existing methods fail to maintain both high compression and performance, and often rely on data. To address these challenges, we propose UltraDelta, the first data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions, using three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, giving lower sparsity to high-variance layers to preserve inter-layer information. (2) Distribution-Aware Compression applies uniform quantization and then groups parameters by value, followed by group-wise pruning, to better preserve intra-layer distribution. (3) Trace-Norm-Guided Rescaling uses the trace norm of delta weights to estimate a global rescaling factor, improving model stability under higher compression. Extensive experiments across (a) large language models (fine-tuned on LLaMA-2 7B and 13B) with up to 133x, (b) general NLP models (RoBERTa-base, T5-base) with up to 800x, (c) vision models (ViT-B/32, ViT-L/14) with up to 400x, and (d) multi-modal models (BEiT-3) with 40x compression ratio, demonstrate that UltraDelta consistently outperforms existing methods, especially under ultra-high compression.
中文: UltraDelta提出了一种无需数据的增量压缩流程,通过最小化冗余和最大化层间、层内及全局维度的信息,实现了超高压缩比和强大性能。
English: UltraDelta introduces a data-free delta compression pipeline that achieves ultra-high compression and strong performance by minimizing redundancy and maximizing information across inter-layer, intra-layer, and global dimensions.
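
To make the idea of variance-based sparsity allocation concrete, here is a small sketch that computes delta weights and assigns lower sparsity to higher-variance layers. It is only an illustration under stated assumptions: the linear spacing between `low` and `high` and the rank-based assignment are invented for the example and are not taken from UltraDelta.

```python
from statistics import pvariance

def delta(finetuned, pretrained):
    """Delta weights: element-wise difference between fine-tuned and pretrained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]

def allocate_sparsity(layer_deltas, low=0.5, high=0.9):
    """Give the highest-variance layer sparsity `low` and the lowest-variance layer `high`.

    Intermediate layers are spaced linearly by variance rank (an assumption for this sketch).
    """
    variances = {name: pvariance(d) for name, d in layer_deltas.items()}
    order = sorted(variances, key=variances.get, reverse=True)
    if len(order) == 1:
        return {order[0]: low}
    step = (high - low) / (len(order) - 1)
    return {name: low + i * step for i, name in enumerate(order)}

# Toy example with two "layers" represented as flat lists of weights.
pretrained = {"layer_a": [0.0, 0.0, 0.0, 0.0], "layer_b": [1.0, 1.0, 1.0, 1.0]}
finetuned = {"layer_a": [0.0, 1.0, -1.0, 0.5], "layer_b": [1.01, 0.99, 1.0, 1.01]}
layer_deltas = {k: delta(finetuned[k], pretrained[k]) for k in pretrained}
sparsity = allocate_sparsity(layer_deltas)
```

Here `layer_a` has the larger delta variance, so it is pruned less aggressively than `layer_b`.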

Authors:Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Zihan Wang, Yuan Xie, Shaohui Lin
Title: CompBench: Benchmarking Complex Instruction-guided Image Editing
Abstract:
While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce CompBench, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that incorporate fine-grained instruction following and spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models' precise manipulation capabilities. To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions: location, appearance, dynamics, and objects, ensuring closer alignment between instructions and complex editing requirements. Extensive evaluations reveal that CompBench exposes fundamental limitations of current image editing models and provides critical insights for the development of next-generation instruction-guided image editing systems. The dataset, code, and models are available at https://comp-bench.github.io/.
中文:CompBench是一个专为复杂指令引导图像编辑设计的大规模基准,通过细粒度指令遵循和空间上下文推理的挑战性场景,揭示了当前模型的根本局限性,并为下一代系统发展提供了关键洞见。
English: CompBench is a large-scale benchmark designed to address the limitations of existing image editing evaluations by introducing complex scenarios requiring fine-grained instruction following and spatial-contextual reasoning, which exposes critical weaknesses in current models and provides key insights for future development.

Authors:Chenyu Huang, Peng Ye, Shenghe Zheng, Xiaohui Wang, Lei Bai, Tao Chen, Wanli Ouyang
Title: Dynamic Base model Shift for Delta Compression
Abstract:
Transformer-based models with the pretrain-finetune paradigm bring about significant progress, along with heavy storage and deployment costs for finetuned models on multiple tasks. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights) through pruning or quantization. However, existing methods by default employ the pretrained model as the base model and compress the delta parameters for every task, which may cause significant performance degradation, especially when the compression rate is extremely high. To tackle this issue, we investigate the impact of different base models on the performance of delta compression and find that the pre-trained base model can hardly be optimal. To this end, we propose Dynamic Base Model Shift (DBMS), which dynamically adapts the base model to the target task before performing delta compression. Specifically, we adjust two parameters, which respectively determine the magnitude of the base model shift and the overall scale of delta compression, to boost the compression performance on each task. Through low-cost learning of these two parameters, our DBMS can maintain most of the finetuned model's performance even under an extremely high compression ratio setting, significantly surpassing existing methods. Moreover, our DBMS is orthogonal and can be integrated with a variety of other methods, and it has been evaluated across different types of models including language, vision transformer, and multi-modal models.
中文: 提出的动态基础模型偏移(DBMS)方法通过动态调整基础模型并优化偏移幅度和压缩比例参数,在极高压缩率下仍能保持模型性能,显著优于现有方法。
English: The proposed Dynamic Base Model Shift (DBMS) method dynamically adapts the base model for delta compression, maintaining high performance under extreme compression by optimizing shift magnitude and compression scale parameters.

Authors:Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
Title: On the Scaling of Robustness and Effectiveness in Dense Retrieval
Abstract:
Robustness and effectiveness are critical aspects of developing dense retrieval models for real-world applications. It is known that there is a trade-off between the two. Recent work has addressed scaling laws of effectiveness in dense retrieval, revealing a power-law relationship between effectiveness and the size of models and data. Does robustness follow scaling laws too? If so, can scaling improve both robustness and effectiveness together, or do they remain locked in a trade-off? To answer these questions, we conduct a comprehensive experimental study. We find that: (i) Robustness, including out-of-distribution and adversarial robustness, also follows a scaling law. (ii) Robustness and effectiveness exhibit different scaling patterns, leading to significant resource costs when jointly improving both. Given these findings, we shift to the third factor that affects model performance, namely the optimization strategy, beyond the model size and data size. We find that: (i) By fitting different optimization strategies, the joint performance of robustness and effectiveness traces out a Pareto frontier. (ii) When the optimization strategy strays from Pareto efficiency, the joint performance scales in a sub-optimal direction. (iii) By adjusting the optimization weights to fit the Pareto efficiency, we can achieve Pareto training, where the scaling of joint performance becomes most efficient. Even without requiring additional resources, Pareto training is comparable to the performance of scaling resources several times under optimization strategies that overly prioritize either robustness or effectiveness. Finally, we demonstrate that our findings can help deploy dense retrieval models in real-world applications that scale efficiently and are balanced for robustness and effectiveness.
中文: 稠密检索模型的鲁棒性和有效性均遵循缩放规律,但模式不同,共同提升成本高昂;而通过优化策略实现帕累托训练,可在无需额外资源下高效扩展,平衡两者以适用于实际应用。
English: Robustness and effectiveness in dense retrieval models both follow scaling laws but exhibit different patterns, making joint improvement costly; however, Pareto training through optimized strategies enables efficient scaling without additional resources, balancing both aspects for real-world deployment.
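
As a small illustration of what "tracing out a Pareto frontier" means for joint robustness and effectiveness, the sketch below filters a set of (robustness, effectiveness) scores down to the non-dominated ones. The scores are made-up toy values, and the function is a generic Pareto filter rather than the paper's Pareto training procedure.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point (higher is better on both axes)."""
    frontier = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

# Hypothetical (robustness, effectiveness) scores for five optimization strategies.
strategies = [(0.6, 0.3), (0.5, 0.5), (0.3, 0.6), (0.4, 0.4), (0.2, 0.2)]
frontier = pareto_frontier(strategies)
```

Strategies off the frontier, such as `(0.4, 0.4)` here, could improve on one axis without losing on the other, which is the sense in which they are Pareto-inefficient.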

Authors:Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, Jianming Lv, Maarten de Rijke, Xueqi Cheng
Title: The Silent Saboteur: Imperceptible Adversarial Attacks against Black-Box Retrieval-Augmented Generation Systems
Abstract:
We explore adversarial attacks against retrieval-augmented generation (RAG) systems to identify their vulnerabilities. We focus on generating human-imperceptible adversarial examples and introduce a novel imperceptible retrieve-to-generate attack against RAG. This task aims to find imperceptible perturbations that retrieve a target document, originally excluded from the initial top-$k$ candidate set, in order to influence the final answer generation. To address this task, we propose ReGENT, a reinforcement learning-based framework that tracks interactions between the attacker and the target RAG and continuously refines attack strategies based on relevance-generation-naturalness rewards. Experiments on newly constructed factual and non-factual question-answering benchmarks demonstrate that ReGENT significantly outperforms existing attack methods in misleading RAG systems with small imperceptible text perturbations.
Chinese: 本研究通过提出ReGENT强化学习框架,探索针对检索增强生成系统的对抗性攻击,该框架能生成难以察觉的文本扰动以操纵文档检索并影响答案生成,实验表明其在误导RAG系统方面显著优于现有攻击方法。
English: This study investigates adversarial attacks on retrieval-augmented generation (RAG) systems by introducing ReGENT, a reinforcement learning framework that generates imperceptible perturbations to manipulate document retrieval and influence answer generation, demonstrating superior performance over existing methods in misleading RAG systems.

Authors:Shiyu Ni, Keping Bi, Jiafeng Guo, Xueqi Cheng
Title: How Knowledge Popularity Influences and Enhances LLM Knowledge Boundary Perception
Abstract:
Large language models (LLMs) often fail to recognize their knowledge boundaries, producing confident yet incorrect answers. In this paper, we investigate how knowledge popularity affects LLMs' ability to perceive their knowledge boundaries. Focusing on entity-centric factual question answering (QA), we quantify knowledge popularity from three perspectives: the popularity of entities in the question, the popularity of entities in the answer, and relation popularity, defined as their co-occurrence frequency. Experiments on three representative datasets containing knowledge with varying popularity show that LLMs exhibit better QA performance, higher confidence, and more accurate perception on more popular knowledge, with relation popularity having the strongest correlation. Because knowledge popularity shows a strong correlation with LLMs' QA performance, we propose to leverage these signals for confidence calibration. This improves the accuracy of answer correctness prediction by an average of 5.24% across all models and datasets. Furthermore, we explore prompting LLMs to estimate popularity without external corpora, which yields a viable alternative.
中文摘要:大语言模型在热门知识上表现更佳且更自信,利用知识流行度进行置信度校准可将答案正确性预测准确率平均提升5.24%。
English Summary: Large language models perform better and more confidently on popular knowledge, and leveraging knowledge popularity for confidence calibration improves the accuracy of answer correctness prediction by an average of 5.24%.

Authors:Zhongni Hou, Miao Su, Xiaolong Jin, Zixuan Li, Long Bai, Jiafeng Guo, Xueqi Cheng
Title: Mixture Policy based Multi-Hop Reasoning over N-tuple Temporal Knowledge Graphs
Abstract:
Temporal Knowledge Graphs (TKGs), which utilize quadruples in the form of (subject, predicate, object, timestamp) to describe temporal facts, have attracted extensive attention. N-tuple TKGs (N-TKGs) further extend traditional TKGs by utilizing n-tuples to incorporate auxiliary elements alongside core elements (i.e., subject, predicate, and object) of facts, so as to represent them in a more fine-grained manner. Reasoning over N-TKGs aims to predict potential future facts based on historical ones. However, existing N-TKG reasoning methods often lack explainability due to their black-box nature. Therefore, we introduce a new Reinforcement Learning-based method, named MT-Path, which leverages the temporal information to traverse historical n-tuples and construct a temporal reasoning path. Specifically, in order to integrate the information encapsulated within n-tuples, i.e., the entity-irrelevant information within the predicate, the information about core elements, and the complete information about the entire n-tuples, MT-Path utilizes a mixture policy-driven action selector, which is based on three low-level policies, namely, the predicate-focused policy, the core-element-focused policy, and the whole-fact-focused policy. Further, MT-Path utilizes an auxiliary element-aware GCN to capture the rich semantic dependencies among facts, thereby enabling the agent to gain a deep understanding of each n-tuple. Experimental results demonstrate the effectiveness and the explainability of MT-Path.
中文: 基于n元组的时间知识图谱实现了细粒度事实表示,提出的MT-Path方法通过强化学习和混合策略整合时序路径与语义依赖,有效提升了推理的可解释性。
English: Temporal Knowledge Graphs (TKGs) with n-tuples enable fine-grained fact representation, and the proposed MT-Path method uses reinforcement learning and a mixture policy to enhance reasoning explainability by integrating temporal paths and semantic dependencies.

Authors:Guiyang Hou, Xing Gao, Yuchuan Wu, Xiang Huang, Wenqi Zhang, Zhe Zheng, Yongliang Shen, Jialu Du, Fei Huang, Yongbin Li, Weiming Lu
Title: TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning for Enhancing LLMs' Social Intelligence
Abstract:
Recently, Large Language Models (LLMs) have made significant progress in IQ-related domains that require careful thinking, such as mathematics and coding. However, enhancing LLMs' cognitive development in social domains, particularly from a post-training perspective, remains underexplored. Recognizing that the social world follows a distinct timeline and requires a richer blend of cognitive modes (from intuitive reactions (System 1) and surface-level thinking to deliberate thinking (System 2)) than mathematics, which primarily relies on System 2 cognition (careful, step-by-step reasoning), we introduce Temporal-aware Hierarchical Cognitive Reinforcement Learning (TimeHC-RL) for enhancing LLMs' social intelligence. In our experiments, we systematically explore improving LLMs' social intelligence and validate the effectiveness of the TimeHC-RL method, through five other post-training paradigms and two test-time intervention paradigms on eight datasets with diverse data patterns. Experimental results reveal the superiority of our proposed TimeHC-RL method compared to the widely adopted System 2 RL method. It gives the 7B backbone model wings, enabling it to rival the performance of advanced models like DeepSeek-R1 and OpenAI-O3. Additionally, the systematic exploration from post-training and test-time interventions perspectives to improve LLMs' social intelligence has uncovered several valuable insights.
中文: 针对大语言模型在数学等需缜密思考的智商领域表现优异但社交智能发展不足的问题,本文提出时序感知分层认知强化学习方法,通过融合直觉反应与深度推理的多层次认知训练,显著提升模型社交智能,使轻量级模型达到与先进模型相媲美的性能。
English: While large language models excel in IQ-driven fields like mathematics, their social intelligence remains underdeveloped; this paper introduces Temporal-aware Hierarchical Cognitive Reinforcement Learning (TimeHC-RL) to enhance LLMs' social cognition through multi-layered reasoning, demonstrating superior performance over existing methods and enabling smaller models to compete with advanced counterparts.

Authors:Bangde Du, Ziyi Ye, Zhijing Wu, Jankowska Monika, Shuqi Zhu, Qingyao Ai, Yujia Zhou, Yiqun Liu
Title: ValueSim: Generating Backstories to Model Individual Value Systems
Abstract:
As Large Language Models (LLMs) continue to exhibit increasingly human-like capabilities, aligning them with human values has become critically important. Contemporary advanced techniques, such as prompt learning and reinforcement learning, are being deployed to better align LLMs with human values. However, while these approaches address broad ethical considerations and helpfulness, they rarely focus on simulating individualized human value systems. To address this gap, we present ValueSim, a framework that simulates individual values through the generation of personal backstories reflecting past experiences and demographic information. ValueSim converts structured individual data into narrative backstories and employs a multi-module architecture inspired by the Cognitive-Affective Personality System to simulate individual values based on these narratives. Testing ValueSim on a self-constructed benchmark derived from the World Values Survey demonstrates an improvement in top-1 accuracy by over 10% compared to retrieval-augmented generation methods. Further analysis reveals that performance enhances as additional user interaction history becomes available, indicating the model's ability to refine its persona simulation capabilities over time.
Chinese: ValueSim是一个通过从结构化数据生成个人背景故事来模拟个体化人类价值系统的新框架,相比现有方法准确率提升超过10%,并随着用户互动历史的增加而持续优化其模拟能力。
English: ValueSim is a novel framework that simulates individualized human value systems by generating personal backstories from structured data, demonstrating over 10% improvement in accuracy compared to existing methods and refining its simulations with more user interaction history.

Authors:Xiang Huang, Ting-En Lin, Feiteng Fang, Yuchuan Wu, Hangyu Li, Yuzhong Qu, Fei Huang, Yongbin Li
Title: Reverse Preference Optimization for Complex Instruction Following
Abstract:
Instruction following (IF) is a critical capability for large language models (LLMs). However, handling complex instructions with multiple constraints remains challenging. Previous methods typically select preference pairs based on the number of constraints they satisfy, introducing noise where chosen examples may fail to follow some constraints and rejected examples may excel in certain respects over the chosen ones. To address the challenge of aligning with multiple preferences, we propose a simple yet effective method called Reverse Preference Optimization (RPO). It mitigates noise in preference pairs by dynamically reversing the constraints within the instruction to ensure the chosen response is perfect, alleviating the burden of extensive sampling and filtering to collect perfect responses. Besides, reversal also enlarges the gap between chosen and rejected responses, thereby clarifying the optimization direction and making it more robust to noise. We evaluate RPO on two multi-turn IF benchmarks, Sysbench and Multi-IF, demonstrating average improvements over the DPO baseline of 4.6 and 2.5 points (on Llama-3.1 8B), respectively. Moreover, RPO scales effectively across model sizes (8B to 70B parameters), with the 70B RPO model surpassing GPT-4o.
中文摘要:提出的反向偏好优化(RPO)方法通过动态反转指令中的约束来减少偏好对中的噪声并明确优化方向,从而提升大语言模型的指令遵循能力,在多个基准测试中显著超越基线模型,更大规模的模型甚至优于GPT-4o。
English Summary: The proposed Reverse Preference Optimization (RPO) method enhances instruction following in large language models by dynamically reversing constraints to reduce noise in preference pairs and clarify optimization direction, achieving significant improvements over baselines and even surpassing GPT-4o in larger models.

Authors:Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, Chien-Sheng Wu
Title: CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions
Abstract:
While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.
中文摘要:CRMArena-Pro作为评估大语言模型代理在多样化商业场景中表现的综合基准,揭示了尽管在工作流执行方面取得部分成功,但在多轮交互和保密意识方面仍存在显著性能差距。
English Summary: CRMArena-Pro is introduced as a comprehensive benchmark to evaluate LLM agents in diverse business scenarios, revealing significant performance gaps in multi-turn interactions and confidentiality awareness despite partial success in workflow execution.

Authors:Jianghao Lin, Jiachen Zhu, Zheli Zhou, Yunjia Xi, Weiwen Liu, Yong Yu, Weinan Zhang
Title: Superplatforms Have to Attack AI Agents
Abstract:
Over the past decades, superplatforms, digital companies that integrate a vast range of third-party services and applications into a single, unified ecosystem, have built their fortunes on monopolizing user attention through targeted advertising and algorithmic content curation. Yet the emergence of AI agents driven by large language models (LLMs) threatens to upend this business model. Agents can not only free user attention with autonomy across diverse platforms and therefore bypass the user-attention-based monetization, but might also become the new entrance for digital traffic. Hence, we argue that superplatforms have to attack AI agents to defend their centralized control of digital traffic entrance. Specifically, we analyze the fundamental conflict between user-attention-based monetization and agent-driven autonomy through the lens of our gatekeeping theory. We show how AI agents can disintermediate superplatforms and potentially become the next dominant gatekeepers, thereby forming the urgent necessity for superplatforms to proactively constrain and attack AI agents. Moreover, we go through the potential technologies for superplatform-initiated attacks, covering a brand-new, unexplored technical area with unique challenges. We have to emphasize that, despite our position, this paper does not advocate for adversarial attacks by superplatforms on AI agents, but rather offers an envisioned trend to highlight the emerging tensions between superplatforms and AI agents. Our aim is to raise awareness and encourage critical discussion for collaborative solutions, prioritizing user interests and preserving the openness of digital ecosystems in the age of AI agents.
中文摘要:人工智能代理的兴起通过实现用户自主性并可能成为新的数字流量入口,威胁到超级平台基于用户注意力的商业模式,迫使超级平台必须应对这一挑战,尽管本文旨在促进合作解决方案。
English Summary: The rise of AI agents challenges superplatforms' attention-based business model by enabling user autonomy and potentially becoming new digital gatekeepers, forcing superplatforms to confront them despite the paper's aim to foster collaborative solutions.

Authors:Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, Weinan Zhang
Title: The Real Barrier to LLM Agent Usability is Agentic ROI
Abstract:
Large Language Model (LLM) agents represent a promising shift in human-AI interaction, moving beyond passive prompt-response systems to autonomous agents capable of reasoning, planning, and goal-directed action. Despite the widespread application in specialized, high-effort tasks like coding and scientific research, we highlight a critical usability gap in high-demand, mass-market applications. This position paper argues that the limited real-world adoption of LLM agents stems not only from gaps in model capabilities, but also from a fundamental tradeoff between the value an agent can provide and the costs incurred during real-world use. Hence, we call for a shift from solely optimizing model performance to a broader, utility-driven perspective: evaluating agents through the lens of the overall agentic return on investment (Agentic ROI). By identifying key factors that determine Agentic ROI--information quality, agent time, and cost--we posit a zigzag development trajectory in optimizing Agentic ROI: first scaling up to improve the information quality, then scaling down to minimize the time and cost. We outline the roadmap across different development stages to bridge the current usability gaps, aiming to make LLM agents truly scalable, accessible, and effective in real-world contexts.
Chinese: 大型语言模型代理在普及应用中存在可用性差距,源于其价值与实际成本之间的权衡,需转向以代理投资回报率为核心的优化路径,通过先提升信息质量再压缩成本的曲折发展策略来弥合差距。
English: Large Language Model agents face a usability gap in mass-market applications due to a tradeoff between their value and real-world costs, requiring a shift toward optimizing agentic return on investment through a zigzag development of scaling up information quality then down for efficiency.

Authors:Hexiang Tan, Fei Sun, Sha Liu, Du Su, Qi Cao, Xin Chen, Jingang Wang, Xunliang Cai, Yuanzhuo Wang, Huawei Shen, Xueqi Cheng
Title: Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs
Abstract:
As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term self-consistent errors, where LLMs repeatedly generate the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as the LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methods significantly struggle to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improvement. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.
中文: 本研究发现了大语言模型中自我一致错误的存在,即错误答案在多次采样中持续出现,并证明现有检测方法对此类错误效果不佳,进而提出一种跨模型验证方法显著提升了检测效果。
English: This study identifies self-consistent errors in large language models, where incorrect responses persist across multiple samples, and demonstrates that current detection methods struggle with these errors, proposing a cross-model verification approach that significantly improves detection performance.
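
The distinction between self-consistent and inconsistent errors can be illustrated with a short sketch that labels a question from its sampled answers. The agreement threshold of 1.0 (all samples identical) and the exact-match comparison are assumptions for the example; the paper's formal definition may differ.

```python
from collections import Counter

def classify_samples(samples, gold, threshold=1.0):
    """Label a question by how its sampled answers relate to the gold answer.

    threshold: fraction of samples that must share the top answer for it to
    count as self-consistent (1.0 = every sample agrees; an assumed default).
    """
    top, count = Counter(samples).most_common(1)[0]
    if top == gold:
        return "correct"
    if count / len(samples) >= threshold:
        return "self-consistent error"
    return "inconsistent error"

# Toy sampled answers for three hypothetical questions.
label_consistent = classify_samples(["Paris"] * 5, gold="Lyon")
label_inconsistent = classify_samples(["Paris", "Rome", "Lyon", "Paris", "Nice"], gold="Lyon")
label_correct = classify_samples(["Lyon", "Lyon", "Paris"], gold="Lyon")
```

A detector that relies only on disagreement among samples would flag the second case but not the first, which is why self-consistent errors are hard to catch with consistency-based signals alone.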

Authors:Zenghao Duan, Zhiyi Yin, Zhichao Shi, Liang Pang, Shaoling Jing, Jiayi Wu, Yu Yan, Huawei Shen, Xueqi Cheng
Title: GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace
Abstract:
This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically considers the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that the global toxic subspace offers a more effective and comprehensive representation of toxic regions within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the parameters of the FFN. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the model's general capabilities, without requiring large-scale data or model retraining.
中文摘要:本研究揭示了全局毒性子空间能更有效地表征大语言模型中的毒性区域,并提出GloSS方法,通过抑制该子空间实现卓越的去毒效果,同时保持模型能力且无需重新训练。
English Summary: This study identifies the global toxic subspace as a more effective representation of toxicity in LLMs and introduces GloSS, a lightweight method that suppresses this subspace to achieve superior detoxification without compromising model performance or requiring retraining.

Authors:Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng
Title: Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning
Abstract:
Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher's reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher's implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.
中文摘要:RLKD是一种基于强化学习的蒸馏框架,通过生成式结构奖励模型指导学生模型学习教师隐含的多分支推理结构而非表面路径,用极少数据实现了超越传统方法的推理能力提升。
English Summary: RLKD, a reinforcement learning-based distillation framework with a Generative Structure Reward Model, enhances student LLMs by teaching them the teacher's implicit multi-branch reasoning structure rather than superficial paths, achieving superior performance with minimal data.

Authors:Yunjia Xi, Jianghao Lin, Menghui Zhu, Yongzhao Xiao, Zhuoying Ou, Jiaqi Liu, Tong Wan, Bo Chen, Weiwen Liu, Yasheng Wang, Ruiming Tang, Weinan Zhang, Yong Yu
Title: InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding responses with retrieved information. As an emerging paradigm, Agentic RAG further enhances this process by introducing autonomous LLM agents into the information seeking process. However, existing benchmarks fall short in evaluating such systems, as they are confined to a static retrieval environment with a fixed, limited corpus and simple queries that fail to elicit agentic behavior. Moreover, their evaluation protocols assess information seeking effectiveness by pre-defined gold sets of documents, making them unsuitable for the open-ended and dynamic nature of real-world web environments. To bridge this gap, we present InfoDeepSeek, a new benchmark with challenging questions designed for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity. Based on this, we develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes. Through extensive experiments across LLMs, search engines, and question types, InfoDeepSeek reveals nuanced agent behaviors and offers actionable insights for future research.
中文: InfoDeepSeek是一种新颖的基准测试,旨在评估动态网络环境中的智能信息检索,通过引入具有挑战性的查询和定制化评估框架,解决了现有静态基准测试的局限性。
English: InfoDeepSeek is a novel benchmark designed to evaluate agentic information seeking in dynamic web environments, addressing the limitations of existing static benchmarks by introducing challenging queries and a tailored evaluation framework with fine-grained metrics.

Authors:Yujia Zhou, Hexi Wang, Qingyao Ai, Zhen Wu, Yiqun Liu
Title: Simulating Prosocial Behavior and Social Contagion in LLM Agents under Institutional Interventions
Abstract:
As large language models (LLMs) increasingly serve as autonomous agents in social contexts, understanding their capacity for prosocial behavior becomes essential. We present ProSim, a simulation framework designed to examine how prosocial behavior emerges, adapts, and erodes in LLM-based agents under diverse social and institutional conditions. The framework comprises four components: individual simulation, scenario simulation, interaction simulation, and intervention simulation. We conduct three progressive studies to evaluate prosocial alignment. First, we show that LLM agents can demonstrate stable and context-sensitive prosocial behavior across diverse scenarios and adapt their responses under normative policy interventions. Second, we find that agents engage in fairness-based third-party punishment and respond systematically to variations in inequity magnitude and enforcement cost. Third, we show that policy-induced inequities suppress prosocial behavior, propagate through social networks, and are mediated by agents' perceptions of unfairness. These findings lay the groundwork for evaluating social alignment and modeling institutional dynamics in agent-driven societies.
中文摘要:ProSim仿真框架揭示了基于大语言模型的智能体如何在多元社会条件下展现、调整及丧失亲社会行为,证实了它们具备情境敏感响应、基于公平的第三方惩罚能力,并易受政策引发的不公平影响。
English Summary: ProSim is a simulation framework that demonstrates how LLM-based agents exhibit, adapt, and lose prosocial behaviors under varying social conditions, revealing their capacity for context-sensitive responses, fairness-based punishment, and vulnerability to policy-induced inequities.

Authors:Weiming Zhang, Qingyao Li, Xinyi Dai, Jizheng Chen, Kounianhua Du, Weinan Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Yu
Title: NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging
Abstract:
Debugging is a critical aspect of LLMs' coding ability. Early debugging efforts primarily focused on code-level analysis, which often falls short when addressing complex programming errors that require a deeper understanding of algorithmic logic. Recent advancements in large language models (LLMs) have shifted attention toward leveraging natural language reasoning to enhance code-related tasks. However, two fundamental questions remain unanswered: What type of natural language format is most effective for debugging tasks? And what specific benefits does natural language reasoning bring to the debugging process? In this paper, we introduce NL-DEBUGGING, a novel framework that employs natural language as an intermediate representation to improve code debugging. By debugging at a natural language level, we demonstrate that NL-DEBUGGING outperforms traditional debugging methods and enables a broader modification space through direct refinement guided by execution feedback. Our findings highlight the potential of natural language reasoning to advance automated code debugging and address complex programming challenges.
中文:NL-DEBUGGING提出了一种基于自然语言的框架,通过执行反馈指导直接优化,在代码调试中超越传统方法并扩展修改空间。
English: NL-DEBUGGING introduces a natural language-based framework that enhances code debugging by outperforming traditional methods and enabling broader modifications through execution feedback.

Authors:Weiming Zhang, Lingyue Fu, Qingyao Li, Kounianhua Du, Jianghao Lin, Jingwei Yu, Wei Xia, Weinan Zhang, Ruiming Tang, Yong Yu
Title: LLM4CD: Leveraging Large Language Models for Open-World Knowledge Augmented Cognitive Diagnosis
Abstract:
Cognitive diagnosis (CD) plays a crucial role in intelligent education, evaluating students' comprehension of knowledge concepts based on their test histories. However, current CD methods often model students, exercises, and knowledge concepts solely on their ID relationships, neglecting the abundant semantic relationships present within educational data space. Furthermore, contemporary intelligent tutoring systems (ITS) frequently involve the addition of new students and exercises, a situation that ID-based methods find challenging to manage effectively. The advent of large language models (LLMs) offers the potential for overcoming this challenge with open-world knowledge. In this paper, we propose LLM4CD, which Leverages Large Language Models for Open-World Knowledge Augmented Cognitive Diagnosis. Our method utilizes the open-world knowledge of LLMs to construct cognitively expressive textual representations, which are then encoded to introduce rich semantic information into the CD task. Additionally, we propose an innovative bi-level encoder framework that models students' test histories through two levels of encoders: a macro-level cognitive text encoder and a micro-level knowledge state encoder. This approach substitutes traditional ID embeddings with semantic representations, enabling the model to accommodate new students and exercises with open-world knowledge and address the cold-start problem. Extensive experimental results demonstrate that our proposed method consistently outperforms previous CD models on multiple real-world datasets, validating the effectiveness of leveraging LLMs to introduce rich semantic information into the CD task.
中文摘要:LLM4CD利用大语言模型的开放世界知识构建认知表达文本表示,通过双层编码框架引入丰富语义信息,有效解决了传统认知诊断方法依赖ID关系和冷启动问题,在多个真实数据集上表现优异。
English Summary: LLM4CD leverages large language models to enhance cognitive diagnosis by incorporating open-world knowledge and semantic representations, effectively addressing the limitations of traditional ID-based methods and improving performance across real-world datasets.

Authors:Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu
Title: BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Abstract:
Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding and subsequently on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.
Chinese Summary: 本研究提出了BLIP3-o统一多模态模型,采用扩散变换器生成高质量图像特征,并通过先理解后生成的顺序预训练策略,在图像理解与生成任务上均取得领先性能,同时开源了相关资源以促进后续研究。
English Summary: This research introduces BLIP3-o, a unified multimodal model that employs a diffusion transformer for efficient image feature generation and a sequential pretraining strategy to excel in both image understanding and generation tasks, achieving state-of-the-art results across benchmarks.

Authors:Erik Nijkamp, Bo Pang, Egor Pakhomov, Akash Gokul, Jin Qu, Silvio Savarese, Yingbo Zhou, Caiming Xiong
Title: xGen-small Technical Report
Abstract:
We introduce xGen-small, a family of 4B and 9B Transformer decoder models optimized for long-context applications. Our vertically integrated pipeline unites domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128k tokens; and targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning. xGen-small delivers strong performance across various tasks, especially in math and coding domains, while excelling at long context benchmarks.
中文: xGen-small是一系列专为长上下文应用优化的40亿和90亿参数Transformer解码器模型,通过集成化训练流程在数学和编程任务中表现优异,并擅长处理长上下文基准测试。
English: xGen-small is a family of 4B and 9B Transformer decoder models optimized for long-context applications, delivering strong performance in math and coding while excelling at long-context benchmarks through an integrated training pipeline.

Authors:Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, Jun Zhu
Title: Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives
Abstract:
Recent studies have revealed that the loss landscape of large language models resembles a basin, within which the models perform nearly identically, and outside of which they lose all their capabilities. In this work, we conduct further studies on the loss landscape of large language models. We discover that pre-training creates a "basic capability" basin, and subsequent fine-tuning creates "specific capability" basins (e.g., math, safety, coding) within the basic capability basin. We further investigate two types of loss landscapes: the most-case landscape (i.e., the landscape along most directions) and the worst-case landscape (i.e., the landscape along the worst direction). We argue that as long as benign fine-tuning remains within the most-case basin, it will not compromise previous capabilities. Similarly, any fine-tuning (including the adversarial one) that stays within the worst-case basin would not compromise previous capabilities. Finally, we theoretically demonstrate that the size of the most-case basin can bound the size of the worst-case basin and the robustness with respect to input perturbations. We also show that, due to the over-parameterization property of current large language models, one can easily enlarge the basins by five times.
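中文摘要:预训练在大语言模型的损失景观中形成“基础能力”盆地,后续微调在其内部形成“特定能力”盆地;只要微调保持在相应的盆地内,就不会损害已有能力,且多数方向盆地的大小可以界定最坏方向盆地的大小以及模型对输入扰动的鲁棒性。
English Summary: Pre-training creates a "basic capability" basin in the loss landscape of large language models, and fine-tuning creates "specific capability" basins within it; fine-tuning that stays within the relevant basin does not compromise prior capabilities, and the size of the most-case basin bounds both the size of the worst-case basin and the robustness to input perturbations.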

Authors:Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, Jun Zhu
Title: Unveiling the Basin-Like Loss Landscape in Large Language Models
Abstract:
We discover the emergence of basins in the loss landscape of large language models. As model scale increases, LLMs become progressively more resilient to random perturbations in the parameter space, giving rise to expansive stability regions where models exhibit nearly identical performance, but outside of which their capabilities collapse. We observe that pre-training creates a basic capability basin, and subsequent alignment fine-tuning forms specific capability basins (e.g., safety, math, coding). Thus, we argue that benign fine-tuning confined to the basin should preserve prior capabilities. Besides, we also analyze the loss landscape for worst-case directions, which is consistently sharp and detrimental. We find that adversarial fine-tuning moves along the nearly worst-case directions, thus rapidly degrading model capabilities. Finally, we provide a theoretical analysis demonstrating that the basin size bounds the performance degradation of any fine-tuning, including the adversarial ones, while also guaranteeing the model robustness w.r.t. input perturbations, suggesting the benefit of enlarging basins.
中文摘要:随着模型规模增大,大语言模型的损失景观中会形成更广阔的稳定区域,在该区域内进行微调可保持核心能力,而偏离该区域则会导致性能急剧下降。
English Summary: Large language models develop broader stability regions in their loss landscape as they scale, allowing fine-tuning within these basins to preserve core capabilities while avoiding sharp, detrimental directions that degrade performance.

Authors:Jiahui Li, Geng Sun, Zemin Sun, Jiacheng Wang, Yinqiu Liu, Ruichen Zhang, Dusit Niyato, Shiwen Mao
Title: LLM-guided DRL for Multi-tier LEO Satellite Networks with Hybrid FSO/RF Links
Abstract:
Despite significant advancements in terrestrial networks, inherent limitations persist in providing reliable coverage to remote areas and maintaining resilience during natural disasters. Multi-tier networks with low Earth orbit (LEO) satellites and high-altitude platforms (HAPs) offer promising solutions, but face challenges from high mobility and dynamic channel conditions that cause unstable connections and frequent handovers. In this paper, we design a three-tier network architecture that integrates LEO satellites, HAPs, and ground terminals with hybrid free-space optical (FSO) and radio frequency (RF) links to maximize coverage while maintaining connectivity reliability. This hybrid approach leverages the high bandwidth of FSO for satellite-to-HAP links and the weather resilience of RF for HAP-to-ground links. We formulate a joint optimization problem to simultaneously balance downlink transmission rate and handover frequency by optimizing network configuration and satellite handover decisions. The problem is highly dynamic and non-convex with time-coupled constraints. To address these challenges, we propose a novel large language model (LLM)-guided truncated quantile critics algorithm with dynamic action masking (LTQC-DAM) that utilizes dynamic action masking to eliminate unnecessary exploration and employs LLMs to adaptively tune hyperparameters. Simulation results demonstrate that the proposed LTQC-DAM algorithm outperforms baseline algorithms in terms of convergence, downlink transmission rate, and handover frequency. We also reveal that compared to other state-of-the-art LLMs, DeepSeek delivers the best performance through gradual, contextually-aware parameter adjustments.
中文摘要:本文设计了一个集成低轨卫星、高空平台和地面终端的三层网络架构,采用混合光无线与射频链路,并提出一种大语言模型引导的优化算法,在保证传输速率的同时有效降低切换频率,仿真验证了其优越性能。
English Summary: This paper proposes a three-tier network integrating LEO satellites, HAPs, and ground terminals with hybrid FSO/RF links, and introduces an LLM-guided algorithm that optimizes transmission rates while minimizing handovers, demonstrating superior performance through simulations.
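
Dynamic action masking, as mentioned in the abstract, is commonly realized by excluding invalid actions before a policy samples from its action distribution. The sketch below shows that generic mechanism for a discrete action set; it is not the LTQC-DAM algorithm itself, and the function name `masked_softmax` and the toy handover scenario are hypothetical.

```python
import math


def masked_softmax(logits, valid):
    """Softmax over `logits` that assigns zero probability to invalid actions.

    logits: list of raw action scores.
    valid:  list of booleans, True where the action is currently allowed.
    """
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("at least one action must be valid")
    # Subtract the maximum for numerical stability; exp(-inf) evaluates to 0.0.
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: three candidate satellites, the second one is out of view.
probs = masked_softmax([2.0, 5.0, 1.0], [True, False, True])
```

Because masked actions receive probability zero, the agent never spends exploration steps on them, which is the effect the abstract attributes to action masking.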

Authors:Shizhao He, Jiacheng Wang, Ying-Chang Liang, Geng Sun, Dusit Niyato
Title: Satellite-Assisted Low-Altitude Economy Networking: Concepts, Applications, and Opportunities
Abstract:
The low-altitude economy (LAE) is a new economic paradigm that leverages low-altitude vehicles (LAVs) to perform diverse missions across diverse areas. To support the operations of LAE, it is essential to establish LAE networks that enable LAV management and communications. Existing studies mainly reuse terrestrial networks to construct LAE networks. However, the limited coverage of terrestrial networks poses challenges for serving LAVs in remote areas. Besides, efficient LAV operations also require support such as localization and navigation, which terrestrial networks designed for communications cannot fully provide. Due to ubiquitous coverage and diverse functions, satellites are a promising technology to support LAVs. Therefore, this article investigates satellite-assisted LAE networking. First, we introduce an overview of LAE and satellites, discussing their features, applications, and architectures. Next, we investigate opportunities for satellites to assist LAE from aspects of communication, control, and computation. As all assistance depends on reliable satellite-LAV communications, we propose a satellite-assisted LAE framework to tackle issues caused by the severe path loss and high dynamics in satellite-assisted LAE networks. Specifically, the proposed framework comprises distributed satellite MIMO, distributed LAV MIMO, and a two-timescale optimization scheme. The case study demonstrates that the distributed MIMO architecture efficiently reduces the required transmission power and extends service duration, while the two-timescale optimization scheme balances the performance and control signaling overheads.
中文摘要:低空经济需借助卫星辅助网络以弥补地面网络局限,所提出的分布式MIMO及双时间尺度优化框架能有效提升通信可靠性与运行效能。
English Summary: The low-altitude economy requires satellite-assisted networks to overcome terrestrial limitations, with a proposed framework using distributed MIMO and two-timescale optimization to enhance communication reliability and efficiency.

Authors:Yue Chen, Hui Kang, Jiahui Li, Geng Sun, Boxiong Wang, Jiacheng Wang, Cong Liang, Shuang Liang, Dusit Niyato
Title: Joint Resource Management for Energy-efficient UAV-assisted SWIPT-MEC: A Deep Reinforcement Learning Approach
Abstract:
The integration of simultaneous wireless information and power transfer (SWIPT) technology in 6G Internet of Things (IoT) networks faces significant challenges in remote areas and disaster scenarios where ground infrastructure is unavailable. This paper proposes a novel unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) system enhanced by directional antennas to provide both computational resources and energy support for ground IoT terminals. However, such systems require multiple trade-off policies to balance UAV energy consumption, terminal battery levels, and computational resource allocation under various constraints, including limited UAV battery capacity, non-linear energy harvesting characteristics, and dynamic task arrivals. To address these challenges comprehensively, we formulate a bi-objective optimization problem that simultaneously considers system energy efficiency and terminal battery sustainability. We then reformulate this non-convex problem with a hybrid solution space as a Markov decision process (MDP) and propose an improved soft actor-critic (SAC) algorithm with an action simplification mechanism to enhance its convergence and generalization capabilities. Simulation results have demonstrated that our proposed approach outperforms various baselines in different scenarios, achieving efficient energy management while maintaining high computational performance. Furthermore, our method shows strong generalization ability across different scenarios, particularly in complex environments, validating the effectiveness of our designed boundary penalty and charging reward mechanisms.
中文摘要:本文提出了一种采用定向天线的无人机辅助移动边缘计算系统,以增强6G物联网中的无线信息与能量同传技术,并通过改进的软演员-评论者算法解决了多目标优化问题,在仿真中展现出优越的性能和泛化能力。
English Summary: This paper introduces a UAV-assisted MEC system with directional antennas to enhance SWIPT in 6G IoT networks, addressing energy and computational trade-offs through an improved SAC algorithm that demonstrates superior performance and generalization in simulations.

Authors:Xiaoyu Li, Xiao Li, Li Gao, Yiding Liu, Xiaoyang Wang, Shuaiqiang Wang, Junfeng Wang, Dawei Yin
Title: Proactive Guidance of Multi-Turn Conversation in Industrial Search
Abstract:
The evolution of Large Language Models (LLMs) has significantly advanced multi-turn conversation systems, emphasizing the need for proactive guidance to enhance users' interactions. However, these systems face challenges in dynamically adapting to shifts in users' goals and maintaining low latency for real-time interactions. In the Baidu Search AI assistant, an industrial-scale multi-turn search system, we propose a novel two-phase framework to provide proactive guidance. The first phase, Goal-adaptive Supervised Fine-Tuning (G-SFT), employs a goal adaptation agent that dynamically adapts to user goal shifts and provides goal-relevant contextual information. G-SFT also incorporates scalable knowledge transfer to distill insights from LLMs into a lightweight model for real-time interaction. The second phase, Click-oriented Reinforcement Learning (C-RL), adopts a generate-rank paradigm, systematically constructs preference pairs from user click signals, and proactively improves click-through rates through more engaging guidance. This dual-phase architecture achieves complementary objectives: G-SFT ensures accurate goal tracking, while C-RL optimizes interaction quality through click signal-driven reinforcement learning. Extensive experiments demonstrate that our framework achieves 86.10% accuracy in offline evaluation (+23.95% over baseline) and 25.28% CTR in online deployment (149.06% relative improvement), while reducing inference latency by 69.55% through scalable knowledge distillation.
中文: 本文针对百度搜索AI助手提出双阶段框架,通过目标自适应监督微调实现动态目标追踪,结合点击导向强化学习优化交互质量,在准确性、点击率和延迟降低方面取得显著提升。
English: This paper introduces a two-phase framework for Baidu's AI assistant, combining Goal-adaptive Supervised Fine-Tuning for dynamic goal tracking and Click-oriented Reinforcement Learning for engagement optimization, achieving significant improvements in accuracy, click-through rates, and latency reduction.
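
As a quick sanity check on the reported online numbers, the baseline CTR implied by the abstract can be backed out from the deployed CTR and the relative improvement. This is a small worked calculation, assuming "149.06% relative improvement" means the new CTR equals the baseline multiplied by (1 + 1.4906); the helper name `baseline_from_relative_gain` is hypothetical.

```python
def baseline_from_relative_gain(new_value, relative_gain_pct):
    """Recover the baseline given the new value and the relative gain in percent."""
    return new_value / (1.0 + relative_gain_pct / 100.0)


# Values taken from the abstract: 25.28% CTR, 149.06% relative improvement.
implied_baseline_ctr = baseline_from_relative_gain(25.28, 149.06)
print(f"implied baseline CTR: {implied_baseline_ctr:.2f}%")
```

Under that reading, the implied baseline CTR is roughly 10.15%. The abstract does not say whether the "+23.95%" accuracy gain is absolute or relative, so the same calculation is not applied to it here.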

Authors:Yukun Zhao, Lingyong Yan, Zhenyang Li, Shuaiqiang Wang, Zhumin Chen, Zhaochun Ren, Dawei Yin
Title: Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning
Abstract:
Large language models have achieved remarkable success in various tasks. However, it is challenging for them to learn new tasks incrementally due to catastrophic forgetting. Existing approaches rely on experience replay, optimization constraints, or task differentiation, which encounter strict limitations in real-world scenarios. To address these issues, we propose Joint Flashback Adaptation. We first introduce flashbacks -- a limited number of prompts from old tasks -- when adapting to new tasks and constrain the deviations of the model outputs compared to the original one. We then interpolate latent tasks between flashbacks and new tasks to enable jointly learning relevant latent tasks, new tasks, and flashbacks, alleviating data sparsity in flashbacks and facilitating knowledge sharing for smooth adaptation. Our method requires only a limited number of flashbacks without access to the replay data and is task-agnostic. We conduct extensive experiments on state-of-the-art large language models across 1000+ instruction-following tasks, arithmetic reasoning tasks, and general reasoning tasks. The results demonstrate the superior performance of our method in improving generalization on new tasks and reducing forgetting in old tasks.
Chinese: 提出的联合闪回适应方法通过使用旧任务的少量提示并插值潜在任务,有效缓解了大型语言模型的灾难性遗忘,在提升新任务泛化能力的同时保持了旧任务的性能。
English: The proposed Joint Flashback Adaptation method effectively mitigates catastrophic forgetting in large language models by using limited prompts from old tasks and interpolating latent tasks, enhancing generalization on new tasks while preserving performance on old ones.

Authors:Zhengliang Shi, Lingyong Yan, Weiwei Sun, Yue Feng, Pengjie Ren, Xinyu Ma, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Zhaochun Ren
Title: Direct Retrieval-augmented Optimization: Synergizing Knowledge Selection and Language Models
Abstract:
Retrieval-augmented generation (RAG) integrates large language models (LLMs) with retrievers to access external knowledge, improving the factuality of LLM generation in knowledge-grounded tasks. To optimize the RAG performance, most previous work independently fine-tunes the retriever to adapt to frozen LLMs or trains the LLMs to use documents retrieved by off-the-shelf retrievers, lacking end-to-end training supervision. Recent work addresses this limitation by jointly training these two components but relies on overly simplifying assumptions of document independence, which has been criticized for being far from real-world scenarios. Thus, effectively optimizing the overall RAG performance remains a critical challenge. We propose a direct retrieval-augmented optimization framework, named DRO, that enables end-to-end training of two key components: (i) a generative knowledge selection model and (ii) an LLM generator. DRO alternates between two phases: (i) document permutation estimation and (ii) re-weighted maximization, progressively improving RAG components through a variational approach. In the estimation step, we treat document permutation as a latent variable and directly estimate its distribution from the selection model by applying an importance sampling strategy. In the maximization step, we calibrate the optimization expectation using importance weights and jointly train the selection model and LLM generator. Our theoretical analysis reveals that DRO is analogous to policy-gradient methods in reinforcement learning. Extensive experiments conducted on five datasets illustrate that DRO outperforms the best baseline with 5%-15% improvements in EM and F1. We also provide in-depth experiments to qualitatively analyze the stability, convergence, and variance of DRO.
中文: 提出的DRO框架通过交替进行文档排列估计和重加权最大化,实现了检索增强生成的端到端训练,在五个数据集上相比基线模型取得了5%-15%的性能提升。
English: The proposed DRO framework enables end-to-end training of retrieval-augmented generation by alternating between document permutation estimation and re-weighted maximization, achieving 5%-15% performance improvements over baselines.
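
The abstract says DRO re-weights its training objective with importance weights. The following sketch illustrates only the generic self-normalized importance-sampling estimator that such re-weighting builds on; it is not the DRO objective, and the function name, the toy probabilities, and the per-sample scores are all hypothetical.

```python
def self_normalized_is(values, target_probs, proposal_probs):
    """Estimate the expectation of `values` under a target distribution
    from samples drawn from a proposal distribution.

    values:         score of each sampled item (e.g. a per-sample objective).
    target_probs:   probability of each sample under the target distribution.
    proposal_probs: probability of each sample under the sampling distribution.
    """
    # Importance weight = how much more likely the sample is under the target.
    weights = [p / q for p, q in zip(target_probs, proposal_probs)]
    total = sum(weights)
    # Normalizing by the weight sum keeps the estimate on the scale of `values`.
    return sum(w * v for w, v in zip(weights, values)) / total


# Toy example: three candidate document orderings, sampled uniformly.
estimate = self_normalized_is(
    values=[1.0, 2.0, 3.0],
    target_probs=[0.5, 0.3, 0.2],
    proposal_probs=[1 / 3, 1 / 3, 1 / 3],
)
```

In this toy case each outcome appears exactly once, so the estimate coincides with the exact expectation under the target distribution (0.5 * 1 + 0.3 * 2 + 0.2 * 3 = 1.7).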

Authors:Jiaxu Qian, Chendong Wang, Yifan Yang, Chaoyun Zhang, Huiqiang Jiang, Xufang Luo, Yu Kang, Qingwei Lin, Anlan Zhang, Shiqi Jiang, Ting Cao, Tianjun Mao, Suman Banerjee, Guyue Liu, Saravan Rajmohan, Dongmei Zhang, Yuqing Yang, Qi Zhang, Lili Qiu
Title: Zoomer: Adaptive Image Focus Optimization for Black-box MLLM
Abstract:
Recent advancements in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, excelling in applications like image captioning and interactive question-answering. However, these models struggle with accurately processing visual data, particularly in tasks requiring precise object recognition and fine visual details. Stringent token limits often result in the omission of critical information, hampering performance. To address these limitations, we introduce Zoomer, a novel visual prompting mechanism designed to enhance MLLM performance while preserving essential visual details within token limits. Zoomer features three key innovations: a prompt-aware strategy that dynamically highlights relevant image regions, a spatial-preserving orchestration schema that maintains object integrity, and a budget-aware prompting method that balances global context with crucial visual details. Comprehensive evaluations across multiple datasets demonstrate that Zoomer consistently outperforms baseline methods, achieving up to a 26.9% improvement in accuracy while significantly reducing token consumption.
Chinese: 新型视觉提示机制 Zoomer 通过动态突出相关区域、保持空间完整性和平衡全局上下文与关键细节,显著提升多模态大语言模型性能,在降低标记消耗的同时实现高达26.9%的准确率提升。
English: The novel visual prompting mechanism Zoomer enhances multimodal large language models by dynamically highlighting relevant regions, preserving spatial integrity, and balancing global context with key details, achieving up to 26.9% higher accuracy while reducing token usage.

Authors:Hui Zhang, Dexiang Hong, Maoke Yang, Yutao Cheng, Zhao Zhang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang
Title: CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design
Abstract:
Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment. Prior work has explored automated graphic design generation using diffusion models, aiming to streamline creative workflows and democratize design capabilities. However, complex graphic design scenarios require accurately adhering to design intent specified by multiple heterogeneous user-provided elements (e.g., images, layouts, and texts), which poses multi-condition control challenges for existing methods. Specifically, previous single-condition control models demonstrate effectiveness only within their specialized domains but fail to generalize to other conditions, while existing multi-condition methods often lack fine-grained control over each sub-condition and compromise overall compositional harmony. To address these limitations, we introduce CreatiDesign, a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition driven architecture that enables flexible and precise integration of heterogeneous design elements with minimal architectural modifications to the base diffusion model. Furthermore, to ensure that each condition precisely controls its designated image region and to avoid interference between conditions, we propose a multimodal attention mask mechanism. Additionally, we develop a fully automated pipeline for constructing graphic design datasets, and introduce a new dataset with 400K samples featuring multi-condition annotations, along with a comprehensive benchmark. Experimental results show that CreatiDesign outperforms existing models by a clear margin in faithfully adhering to user intent.
中文摘要:CreatiDesign通过统一的多条件驱动架构和多模态注意力掩码机制,解决了自动化平面设计中多条件控制的难题,实现了对异质设计元素的精确控制,并在忠实遵循用户意图方面显著优于现有模型。
English Summary: CreatiDesign introduces a unified multi-condition architecture and multimodal attention mask to address challenges in automated graphic design, enabling precise control over heterogeneous elements and outperforming existing models in adhering to user intent.

Authors:Yudong Wang, Zixuan Fu, Jie Cai, Peijun Tang, Hongya Lyu, Yewei Fang, Zhi Zheng, Jie Zhou, Guoyang Zeng, Chaojun Xiao, Xu Han, Zhiyuan Liu
Title: Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Abstract:
Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on fastText, and successfully apply the filtering pipeline to two widely-used pre-training corpora, FineWeb and Chinese FineWeb datasets, resulting in the creation of the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion English tokens and 120 billion Chinese tokens. Empirical results demonstrate that the LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.
中文摘要:本研究提出一种高效数据过滤流程,通过快速质量验证和优化种子数据标准解决模型驱动数据筛选的两大难题,最终生成的Ultra-FineWeb高质量数据集在多基准测试中显著提升了大语言模型的性能表现。
English Summary: This study introduces an efficient data filtering pipeline that addresses key challenges in model-driven data selection by implementing rapid quality verification and optimized seed data criteria, resulting in the high-quality Ultra-FineWeb dataset which significantly improves LLM performance across benchmarks.

Authors:Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen
Title: Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
Abstract:
Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Previous approaches have attempted to reduce the number of image tokens through token pruning, typically by selecting tokens based on attention scores or image token diversity. Through empirical studies, we observe that existing methods often overlook the joint impact of pruning on both the current layer's output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, our method utilizes a small calibration set to divide the pruning process into multiple stages. In the early stages, our method emphasizes the impact of pruning on subsequent layers, whereas in the deeper stages, the focus shifts toward preserving the consistency of local outputs. Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original models' performance on average.
Chinese: 提出的平衡令牌剪枝(BTP)方法通过分阶段策略性地剪除视觉令牌,有效降低大型视觉语言模型的计算负担,在实现78%压缩率的同时保持了模型96.7%的原始性能。
English: The proposed Balanced Token Pruning (BTP) method effectively reduces the computational overhead of Large Vision-Language Models by strategically pruning image tokens in multiple stages, achieving a 78% compression rate while maintaining 96.7% of original performance.
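
The staged trade-off described in the abstract above can be illustrated with a minimal sketch. It assumes each vision token already has a "local" score (effect on the current layer's output) and a "global" score (effect on later layers). The function name `prune_tokens`, the linear blend, and the `global_weight` parameter are hypothetical illustrations, not the authors' formulation.

```python
def prune_tokens(local_scores, global_scores, keep_ratio, global_weight):
    """Return the sorted indices of tokens kept after pruning.

    Hypothetical sketch: each token's importance is a linear blend of a
    global score and a local score. An early pruning stage would use a
    global_weight near 1.0, a deeper stage a value near 0.0.
    """
    if len(local_scores) != len(global_scores):
        raise ValueError("score lists must have the same length")
    n_keep = max(1, round(len(local_scores) * keep_ratio))
    blended = [
        global_weight * g + (1.0 - global_weight) * l
        for l, g in zip(local_scores, global_scores)
    ]
    # Rank token indices by blended importance, highest first.
    ranked = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)
    # Restore the original token order for the kept subset.
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    local = [0.9, 0.1, 0.5, 0.2]
    global_ = [0.1, 0.8, 0.6, 0.3]
    print(prune_tokens(local, global_, keep_ratio=0.5, global_weight=1.0))
    print(prune_tokens(local, global_, keep_ratio=0.5, global_weight=0.0))
```

Sweeping `global_weight` from high to low across stages mirrors the shift the abstract describes, from emphasizing subsequent layers early on to preserving local outputs in deeper stages.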

Authors:Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu, Shu Wu, Yuehe Chen, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang
Title: Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG
Abstract:
Large language models (LLMs) augmented with retrieval systems have significantly advanced natural language processing tasks by integrating external knowledge sources, enabling more accurate and contextually rich responses. To improve the robustness of such systems against noisy retrievals, Retrieval-Augmented Fine-Tuning (RAFT) has emerged as a widely adopted method. However, RAFT conditions models to generate answers even in the absence of reliable knowledge. This behavior undermines their reliability in high-stakes domains, where acknowledging uncertainty is critical. To address this issue, we propose Divide-Then-Align (DTA), a post-training approach designed to endow RAG systems with the ability to respond with "I don't know" when the query is out of the knowledge boundary of both the retrieved passages and the model's internal knowledge. DTA divides data samples into four knowledge quadrants and constructs tailored preference data for each quadrant, resulting in a curated dataset for Direct Preference Optimization (DPO). Experimental results on three benchmark datasets demonstrate that DTA effectively balances accuracy with appropriate abstention, enhancing the reliability and trustworthiness of retrieval-augmented systems.
中文: 针对检索增强系统在缺乏可靠知识时仍生成答案的缺陷,本文提出的分而后合(DTA)方法通过四象限数据划分和偏好优化,使模型能够恰当回应"不知道",有效平衡准确性与弃权能力,显著提升了系统可靠性。
English: To address the limitation of retrieval-augmented systems generating unreliable answers when lacking knowledge, the proposed Divide-Then-Align (DTA) method enables models to appropriately abstain by responding "I don't know" through quadrant-based data division and preference optimization, significantly improving system reliability.
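
The quadrant idea above can be sketched as follows. This is a minimal illustration and assumes that two boolean flags (whether the model's internal knowledge and whether the retrieved passages cover the answer) are already available. The `Sample` fields, the `preference_pair` helper, and the rule for the three answerable quadrants are hypothetical, since the abstract does not spell out how preference data is tailored per quadrant.

```python
from dataclasses import dataclass

IDK = "I don't know"


@dataclass
class Sample:
    question: str
    gold_answer: str
    wrong_answer: str      # a plausible but incorrect generation (assumed available)
    model_knows: bool      # answer is within the model's internal knowledge
    retrieval_has: bool    # answer is supported by the retrieved passages


def quadrant(sample):
    """Identify the knowledge quadrant as a (model_knows, retrieval_has) pair."""
    return (sample.model_knows, sample.retrieval_has)


def preference_pair(sample):
    """Build one DPO-style preference record for a sample.

    Assumed rule: abstention is preferred only when the query lies outside
    both knowledge sources. Otherwise the gold answer is preferred over
    abstaining.
    """
    if quadrant(sample) == (False, False):
        return {"prompt": sample.question, "chosen": IDK,
                "rejected": sample.wrong_answer}
    return {"prompt": sample.question, "chosen": sample.gold_answer,
            "rejected": IDK}


if __name__ == "__main__":
    unknown = Sample("Who won the 2031 final?", "n/a", "Team A", False, False)
    known = Sample("Capital of France?", "Paris", "Lyon", True, False)
    print(preference_pair(unknown))
    print(preference_pair(known))
```

Records of this shape (prompt, chosen, rejected) are the usual input to Direct Preference Optimization, which the abstract names as the training objective.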

Authors:Haitian Zhong, Yuhuan Liu, Ziyang Xu, Guofan Liu, Qiang Liu, Shu Wu, Zhe Zhao, Liang Wang, Tieniu Tan
Title: REACT: Representation Extraction And Controllable Tuning to Overcome Overfitting in LLM Knowledge Editing
Abstract:
Large language model editing methods frequently suffer from overfitting, wherein factual updates can propagate beyond their intended scope, overemphasizing the edited target even when it's contextually inappropriate. To address this challenge, we introduce REACT (Representation Extraction And Controllable Tuning), a unified two-phase framework designed for precise and controllable knowledge editing. In the initial phase, we utilize tailored stimuli to extract latent factual representations and apply Principal Component Analysis with a simple learnable linear transformation to compute a directional "belief shift" vector for each instance. In the second phase, we apply controllable perturbations to hidden states using the obtained vector with a magnitude scalar, gated by a pre-trained classifier that permits edits only when contextually necessary. Relevant experiments on EVOKE benchmarks demonstrate that REACT significantly reduces overfitting across nearly all evaluation metrics, and experiments on COUNTERFACT and MQuAKE show that our method preserves balanced basic editing performance (reliability, locality, and generality) under diverse editing scenarios.
中文摘要:REACT框架通过提取事实表征并施加可控扰动,有效解决了语言模型编辑中的过拟合问题,在多个基准测试中显著提升了性能表现。
English Summary: The REACT framework addresses overfitting in language model editing by using a two-phase approach that extracts factual representations and applies controlled perturbations, significantly improving performance across multiple benchmarks.

Authors:Mengzhu Liu, Zhengqiu Zhu, Chuan Ai, Chen Gao, Xinghong Li, Lingnan He, Kaisheng Lai, Yingfeng Chen, Xin Lu, Yong Li, Quanjun Yin
Title: Psychology-driven LLM Agents for Explainable Panic Prediction on Social Media during Sudden Disaster Events
Abstract:
During sudden disaster events, accurately predicting public panic sentiment on social media is crucial for proactive governance and crisis management. Current efforts on this problem face three main challenges: lack of finely annotated data hinders emotion prediction studies, unmodeled risk perception causes prediction inaccuracies, and insufficient interpretability of panic formation mechanisms. We address these issues by proposing a Psychology-driven generative Agent framework (PsychoAgent) for explainable panic prediction based on emotion arousal theory. Specifically, we first construct a fine-grained open panic emotion dataset (namely COPE) via collaboration between humans and large language models (LLMs) to mitigate semantic bias. Then, we develop a framework integrating cross-domain heterogeneous data grounded in psychological mechanisms to model risk perception and cognitive differences in emotion generation. To enhance interpretability, we design an LLM-based role-playing agent that simulates individual psychological chains through purpose-designed prompts. Experimental results on our annotated dataset show that PsychoAgent improves panic emotion prediction performance by 12.6% to 21.7% compared to baseline models. Furthermore, the explainability and generalization of our approach are validated. Crucially, this represents a paradigm shift from opaque "data-driven fitting" to transparent "role-based simulation with mechanistic interpretation" for panic emotion prediction during emergencies. Our implementation is publicly available at: https://anonymous.4open.science/r/PsychoAgent-19DD.
中文: 本研究提出PsychoAgent心理驱动框架,通过整合风险感知建模和基于大语言模型的角色扮演智能体,显著提升了灾害期间社交媒体恐慌预测的准确性与可解释性,实现了从数据驱动到机制仿真的范式转变。
English: This study introduces PsychoAgent, a psychology-driven framework that enhances panic prediction on social media during disasters by integrating risk perception modeling and LLM-based role-playing agents, achieving significant performance improvements and explainability over traditional methods.

Authors:Jiafu Wu, Yabiao Wang, Jian Li, Jinlong Peng, Yun Cao, Chengjie Wang, Jiangning Zhang
Title: PiT: Progressive Diffusion Transformer
Abstract:
Diffusion Transformers (DiTs) achieve remarkable performance in image generation via the transformer architecture. Conventionally, DiTs are constructed by stacking serial isotropic global modeling transformers, which face significant quadratic computational cost. However, through empirical analysis, we find that DiTs do not rely as heavily on global information as previously believed. In fact, most layers exhibit significant redundancy in global computation. Additionally, conventional attention mechanisms suffer from low-frequency inertia, limiting their efficiency. To address these issues, we propose Pseudo Shifted Window Attention (PSWA), which fundamentally mitigates global attention redundancy. PSWA captures moderate global-local information through window attention. It further utilizes a high-frequency bridging branch to simulate shifted window operations, which both enriches the high-frequency information and strengthens inter-window connections. Furthermore, we propose the Progressive Coverage Channel Allocation (PCCA) strategy that captures high-order attention without additional computational cost. Based on these innovations, we propose a series of Pseudo Progressive Diffusion Transformers (PiT). Our extensive experiments show their superior performance; for example, our proposed PiT-L achieves 54% FID improvement over DiT-XL/2 while using less computation.
中文: 该摘要提出了伪移位窗口注意力和渐进覆盖通道分配策略,以解决扩散变换器中的计算冗余和低频惯性问题,所设计的PiT模型在减少计算量的同时显著超越了现有DiT的性能表现。
English: The abstract introduces Pseudo Shifted Window Attention (PSWA) and Progressive Coverage Channel Allocation (PCCA) to address computational redundancy and low-frequency inertia in Diffusion Transformers, resulting in the proposed PiT models that significantly outperform existing DiTs with reduced computation.

Authors:Liuji Chen, Xiaofang Yang, Yuanzhuo Lu, Jinghao Zhang, Xin Sun, Qiang Liu, Shu Wu, Jing Dong, Liang Wang
Title: PoisonArena: Uncovering Competing Poisoning Attacks in Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) systems, widely used to improve the factual grounding of large language models (LLMs), are increasingly vulnerable to poisoning attacks, where adversaries inject manipulated content into the retriever's corpus. While prior research has predominantly focused on single-attacker settings, real-world scenarios often involve multiple, competing attackers with conflicting objectives. In this work, we introduce PoisonArena, the first benchmark to systematically study and evaluate competing poisoning attacks in RAG. We formalize the multi-attacker threat model, where attackers vie to control the answer to the same query using mutually exclusive misinformation. PoisonArena leverages the Bradley-Terry model to quantify each method's competitive effectiveness in such adversarial environments. Through extensive experiments on the Natural Questions and MS MARCO datasets, we demonstrate that many attack strategies successful in isolation fail under competitive pressure. Our findings highlight the limitations of conventional evaluation metrics like Attack Success Rate (ASR) and F1 score and underscore the need for competitive evaluation to assess real-world attack robustness. PoisonArena provides a standardized framework to benchmark and develop future attack and defense strategies under more realistic, multi-adversary conditions.
中文: PoisonArena是首个评估RAG系统中竞争性投毒攻击的基准,研究发现孤立攻击策略在多对手压力下常失效,强调需要竞争性评估来检验实际攻击鲁棒性。
English: PoisonArena is introduced as the first benchmark to evaluate competing poisoning attacks in RAG systems, revealing that isolated attack strategies often fail under multi-adversary pressure and highlighting the need for competitive robustness assessments.
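
The Bradley-Terry model mentioned above turns pairwise win counts into a single strength score per competitor, with P(i beats j) = p_i / (p_i + p_j). The sketch below fits those strengths with the standard minorization-maximization iteration. The `wins` matrix layout and the function name `fit_bradley_terry` are assumptions made for illustration, and the example counts are invented; this is not PoisonArena's actual evaluation code.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is the number of contests in which competitor i beat j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Only pairs that actually met contribute to the denominator.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


if __name__ == "__main__":
    # Invented counts: three attack methods, ten head-to-head contests per pair.
    wins = [
        [0, 8, 9],
        [2, 0, 6],
        [1, 4, 0],
    ]
    for index, strength in enumerate(fit_bradley_terry(wins)):
        print(f"attack {index}: strength {strength:.3f}")
```

Unlike a success rate measured in isolation, the fitted strengths depend on who each method was competing against, which is the property the abstract argues conventional metrics miss.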

Authors:Jirong Zha, Yuxuan Fan, Kai Li, Han Li, Chen Gao, Xinlei Chen, Yong Li
Title: DIMM: Decoupled Multi-hierarchy Kalman Filter for 3D Object Tracking
Abstract:
State estimation is challenging for 3D object tracking with high maneuverability, as the target's state transition function changes rapidly, irregularly, and is unknown to the estimator. Existing work based on interacting multiple model (IMM) achieves more accurate estimation than single-filter approaches through model combination, aligning appropriate models for different motion modes of the target object over time. However, two limitations of conventional IMM remain unsolved. First, the solution space of the model combination is constrained as the target's diverse kinematic properties in different directions are ignored. Second, the model combination weights calculated by the observation likelihood are not accurate enough due to the measurement uncertainty. In this paper, we propose a novel framework, DIMM, to effectively combine estimates from different motion models in each direction, thus increasing the 3D object tracking accuracy. First, DIMM extends the model combination solution space of conventional IMM from a hyperplane to a hypercube by designing a 3D-decoupled multi-hierarchy filter bank, which describes the target's motion with various-order linear models. Second, DIMM generates more reliable combination weight matrices through a differentiable adaptive fusion network for importance allocation rather than solely relying on the observation likelihood; it contains an attention-based twin delayed deep deterministic policy gradient (TD3) method with a hierarchical reward. Experiments demonstrate that DIMM significantly improves the tracking accuracy of existing state estimation methods by 31.61%~99.23%.
中文: 提出的DIMM框架通过解耦各方向运动模型并采用可微分自适应融合网络,显著提升了三维物体跟踪精度,相比传统方法实现重大突破。
English: The proposed DIMM framework enhances 3D object tracking by decoupling motion models across directions and employing a differentiable adaptive fusion network, significantly improving accuracy over conventional methods.
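
For context, the conventional IMM weighting that DIMM extends can be sketched as below: one mode probability per motion model, updated from the observation likelihoods, normalized to sum to one, and shared across all spatial directions. This is the textbook IMM update, not the DIMM method. The function names and example numbers are illustrative assumptions.

```python
def imm_mode_probabilities(prior, transition, likelihoods):
    """Conventional IMM mode-probability update.

    prior[i]: probability of model i at the previous step.
    transition[i][j]: probability of switching from model i to model j.
    likelihoods[j]: observation likelihood under model j's filter.
    """
    n = len(prior)
    # Predicted mode probabilities after the Markov model switch.
    predicted = [
        sum(transition[i][j] * prior[i] for i in range(n)) for j in range(n)
    ]
    unnormalized = [likelihoods[j] * predicted[j] for j in range(n)]
    total = sum(unnormalized)
    if total <= 0:
        raise ValueError("all models have zero likelihood")
    return [u / total for u in unnormalized]


def combine_estimates(weights, estimates):
    """Weighted sum of per-model state estimates (same weights for x, y, z)."""
    dim = len(estimates[0])
    return [
        sum(w * est[d] for w, est in zip(weights, estimates)) for d in range(dim)
    ]


if __name__ == "__main__":
    weights = imm_mode_probabilities(
        prior=[0.5, 0.5],
        transition=[[0.9, 0.1], [0.1, 0.9]],
        likelihoods=[0.2, 0.6],
    )
    print(weights)
    print(combine_estimates(weights, [[0.0, 0.0, 0.0], [4.0, 8.0, 12.0]]))
```

Because a single weight vector is applied to every axis, the combined estimate is confined to the weights that sum to one. The abstract's move from a hyperplane to a hypercube corresponds to relaxing this, with separate combination weights per direction.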

Authors:Yatai Ji, Zhengqiu Zhu, Yong Zhao, Beidan Liu, Chen Gao, Yihao Zhao, Sihang Qiu, Yue Hu, Quanjun Yin, Yong Li
Title: Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology
Abstract:
Aerial Visual Object Search (AVOS) tasks in urban environments require Unmanned Aerial Vehicles (UAVs) to autonomously search for and identify target objects using visual and textual cues without external guidance. Existing approaches struggle in complex urban environments due to redundant semantic processing, similar object distinction, and the exploration-exploitation dilemma. To bridge this gap and support the AVOS task, we introduce CityAVOS, the first benchmark dataset for autonomous search of common urban objects. This dataset comprises 2,420 tasks across six object categories with varying difficulty levels, enabling comprehensive evaluation of UAV agents' search capabilities. To solve the AVOS tasks, we also propose PRPSearcher (Perception-Reasoning-Planning Searcher), a novel agentic method powered by multi-modal large language models (MLLMs) that mimics human three-tier cognition. Specifically, PRPSearcher constructs three specialized maps: an object-centric dynamic semantic map enhancing spatial perception, a 3D cognitive map based on semantic attraction values for target reasoning, and a 3D uncertainty map for balanced exploration-exploitation search. Also, our approach incorporates a denoising mechanism to mitigate interference from similar objects and utilizes an Inspiration Promote Thought (IPT) prompting mechanism for adaptive action planning. Experimental results on CityAVOS demonstrate that PRPSearcher surpasses existing baselines in both success rate and search efficiency (on average: +37.69% SR, +28.96% SPL, -30.69% MSS, and -46.40% NE). While promising, the performance gap compared to humans highlights the need for better semantic reasoning and spatial exploration capabilities in AVOS tasks. This work establishes a foundation for future advances in embodied target search. Dataset and source code are available at https://anonymous.4open.science/r/CityAVOS-3DF8.
中文: 本文提出了首个城市环境自主空中目标搜索基准数据集CityAVOS,并开发了模拟人类三层认知的多模态大模型智能体PRPSearcher,在搜索成功率和效率上显著超越现有方法。
English: This paper introduces CityAVOS, the first benchmark dataset for autonomous aerial object search in urban environments, and proposes PRPSearcher, a multi-modal LLM-powered agent that mimics human cognition to significantly outperform existing methods in search success and efficiency.

Authors:Zefang Zong, Xiaochen Wei, Guozhen Zhang, Chen Gao, Huandong Wang, Yong Li
Title: UniCO: Towards a Unified Model for Combinatorial Optimization Problems
Abstract:
Combinatorial Optimization (CO) encompasses a wide range of problems that arise in many real-world scenarios. While significant progress has been made in developing learning-based methods for specialized CO problems, a unified model with a single architecture and parameter set for diverse CO problems remains elusive. Such a model would offer substantial advantages in terms of efficiency and convenience. In this paper, we introduce UniCO, a unified model for solving various CO problems. Inspired by the success of next-token prediction, we frame each problem-solving process as a Markov Decision Process (MDP), tokenize the corresponding sequential trajectory data, and train the model using a transformer backbone. To reduce token length in the trajectory data, we propose a CO-prefix design that aggregates static problem features. To address the heterogeneity of state and action tokens within the MDP, we employ a two-stage self-supervised learning approach. In this approach, a dynamic prediction model is first trained and then serves as a pre-trained model for subsequent policy generation. Experiments across 10 CO problems showcase the versatility of UniCO, emphasizing its ability to generalize to new, unseen problems with minimal fine-tuning, achieving even few-shot or zero-shot performance. Our framework offers a valuable complement to existing neural CO methods that focus on optimizing performance for individual problems.
中文: UniCO是一个基于Transformer架构和两阶段自监督学习的统一模型,能够解决多种组合优化问题,并在10个任务中展现出只需少量微调即可泛化的强大能力。
English: UniCO is a unified model that uses a transformer backbone and a two-stage self-supervised learning approach to solve diverse combinatorial optimization problems, demonstrating strong generalization with minimal fine-tuning across 10 tasks.

Authors:Dingshuo Chen, Shuchen Xue, Liuji Chen, Yingheng Wang, Qiang Liu, Shu Wu, Zhi-Ming Ma, Liang Wang
Title: Graffe: Graph Representation Learning via Diffusion Probabilistic Models
Abstract:
Diffusion probabilistic models (DPMs), widely recognized for their potential to generate high-quality samples, tend to go unnoticed in representation learning. While recent progress has highlighted their potential for capturing visual semantics, adapting DPMs to graph representation learning remains in its infancy. In this paper, we introduce Graffe, a self-supervised diffusion model proposed for graph representation learning. It features a graph encoder that distills a source graph into a compact representation, which, in turn, serves as the condition to guide the denoising process of the diffusion decoder. To evaluate the effectiveness of our model, we first explore the theoretical foundations of applying diffusion models to representation learning, proving that the denoising objective implicitly maximizes the conditional mutual information between data and its representation. Specifically, we prove that the negative logarithm of the denoising score matching loss is a tractable lower bound for the conditional mutual information. Empirically, we conduct a series of case studies to validate our theoretical insights. In addition, Graffe delivers competitive results under the linear probing setting on node and graph classification tasks, achieving state-of-the-art performance on 9 of the 11 real-world datasets. These findings indicate that powerful generative models, especially diffusion models, serve as an effective tool for graph representation learning.
中文: 本文提出Graffe,一种用于图表示学习的自监督扩散模型,通过理论和实验证明扩散模型能通过去噪目标有效捕获语义信息,并在多数现实数据集上实现了最先进的性能。
English: This paper introduces Graffe, a self-supervised diffusion model for graph representation learning, which demonstrates state-of-the-art performance on most real-world datasets by theoretically and empirically proving that diffusion models can effectively capture semantic information through denoising objectives.

Authors:Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, Luc Van Gool
Title: StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
Abstract:
World models have recently become promising tools for predicting realistic visuals based on actions in complex environments. However, their reliance on only a few recent observations leads them to lose track of the long-term context. Consequently, in just a few steps the generated scenes drift from what was previously observed, undermining the temporal coherence of the sequence. This limitation of the state-of-the-art world models, most of which rely on diffusion, comes from their lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model, representing the entire interaction history. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective in demonstrating both visual details and long-term memory.
中文摘要:StateSpaceDiffuser通过将状态空间模型与扩散模型相结合,恢复了世界模型的长期记忆能力,在保持高保真视觉生成的同时,显著提升了二维和三维环境中的时间连贯性。
English Summary: StateSpaceDiffuser integrates state-space models with diffusion to restore long-term memory in world models, significantly improving temporal coherence while maintaining high-fidelity visual generation across 2D and 3D environments.

Authors:Bin Ren, Yawei Li, Xu Zheng, Yuqian Fu, Danda Pani Paudel, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe
Title: Manifold-aware Representation Learning for Degradation-agnostic Image Restoration
Abstract:
Image Restoration (IR) aims to recover high-quality images from degraded inputs affected by various corruptions such as noise, blur, haze, rain, and low-light conditions. Despite recent advances, most existing approaches treat IR as a direct mapping problem, relying on shared representations across degradation types without modeling their structural diversity. In this work, we present MIRAGE, a unified and lightweight framework for all-in-one IR that explicitly decomposes the input feature space into three semantically aligned parallel branches, each processed by a specialized module: attention for global context, convolution for local textures, and MLP for channel-wise statistics. This modular decomposition significantly improves generalization and efficiency across diverse degradations. Furthermore, we introduce a cross-layer contrastive learning scheme that aligns shallow and latent features to enhance the discriminability of shared representations. To better capture the underlying geometry of feature representations, we perform contrastive learning in a Symmetric Positive Definite (SPD) manifold space rather than the conventional Euclidean space. Extensive experiments show that MIRAGE not only achieves new state-of-the-art performance across a variety of degradation types but also offers a scalable solution for challenging all-in-one IR scenarios. Our code and models will be publicly available at https://amazingren.github.io/MIRAGE/.
中文摘要:MIRAGE提出了一种统一的图像恢复框架,通过将特征分解为全局、局部和通道三个并行分支进行专门处理,并在SPD流形空间进行对比学习,实现了针对多种退化类型的最优性能。
English Summary: MIRAGE is a unified image restoration framework that decomposes features into specialized parallel branches for global, local, and channel processing, achieving state-of-the-art performance across diverse degradations through contrastive learning in SPD manifold space.

Authors:Xu Zheng, Chenfei Liao, Yuqian Fu, Kaiyu Lei, Yuanhuiyi Lyu, Lutao Jiang, Bin Ren, Jialei Chen, Jiawen Wang, Chengxin Li, Linfeng Zhang, Danda Pani Paudel, Xuanjing Huang, Yu-Gang Jiang, Nicu Sebe, Dacheng Tao, Luc Van Gool, Xuming Hu
Title: MLLMs are Deeply Affected by Modality Bias
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images. MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs. This position paper argues that MLLMs are deeply affected by modality bias. Firstly, we diagnose the current state of modality bias, highlighting its manifestations across various tasks. Secondly, we propose a systematic research road-map related to modality bias in MLLMs. Thirdly, we identify key factors of modality bias in MLLMs and offer actionable suggestions for future research to mitigate it. To substantiate these findings, we conduct experiments that demonstrate the influence of each factor: 1. Data Characteristics: Language data is compact and abstract, while visual data is redundant and complex, creating an inherent imbalance in learning dynamics. 2. Imbalanced Backbone Capabilities: The dominance of pretrained language models in MLLMs leads to overreliance on language and neglect of visual information. 3. Training Objectives: Current objectives often fail to promote balanced cross-modal alignment, resulting in shortcut learning biased toward language. These findings highlight the need for balanced training strategies and model architectures to better integrate multiple modalities in MLLMs. We call for interdisciplinary efforts to tackle these challenges and drive innovation in MLLM research. Our work provides a fresh perspective on modality bias in MLLMs and offers insights for developing more robust and generalizable multimodal systems, advancing progress toward Artificial General Intelligence.
中文: 多模态大语言模型受模态偏差影响,因数据特性、不平衡的骨干网络能力和训练目标而过度依赖语言,需通过均衡策略改善多模态整合。
English: Multimodal Large Language Models (MLLMs) are hindered by modality bias, favoring language over visual inputs due to data characteristics, imbalanced backbone capabilities, and training objectives, necessitating balanced strategies for better multimodal integration.

Authors:Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, Xuming Hu
Title: Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
Abstract:
The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination and grounding robustness. For fine-grained analysis, we design a two-stage evaluation framework assessing both cognitive map generation and QA accuracy using rotation-invariant matching and a combination of rule-based and LLM-based metrics. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings. Results show that current models struggle with spatial reasoning in panoramic contexts, highlighting the need for more perceptually grounded MLLMs. OSR-Bench and code will be released at: https://huggingface.co/datasets/UUUserna/OSR-Bench
Chinese: 本文提出了首个评估多模态大语言模型全方位空间推理能力的基准OSR-Bench,发现尽管这些模型在标准图像上表现优异,但在全景环境中的空间推理仍存在明显不足。
English: This paper introduces OSR-Bench, the first benchmark for evaluating multimodal large language models' omnidirectional spatial reasoning capabilities, revealing their current limitations in panoramic contexts despite their proficiency with standard images.

Authors:Xu Zheng, Yuanhuiyi Lyu, Lutao Jiang, Danda Pani Paudel, Luc Van Gool, Xuming Hu
Title: Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization
Abstract:
Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks, particularly semantic segmentation, is critically important yet remains a significant challenge. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities, a phenomenon referred to as unimodal dominance or bias. This issue becomes especially problematic in real-world scenarios where the dominant modality may be unavailable, resulting in severe performance degradation. To this end, we apply a simple but effective plug-and-play regularization term based on functional entropy, which introduces no additional parameters or modules. This term is designed to intuitively balance the contribution of each visual modality to the segmentation results. Specifically, we leverage the log-Sobolev inequality to bound functional entropy using functional-Fisher-information. By maximizing the information contributed by each visual modality, our approach mitigates unimodal dominance and establishes a more balanced and robust segmentation framework. A multi-scale regularization module is proposed to apply our proposed plug-and-play term on high-level features and also segmentation predictions for more balanced multi-modal learning. Extensive experiments on three datasets demonstrate that our proposed method achieves superior performance, i.e., +13.94%, +3.25%, and +3.64%, without introducing any additional parameters.
中文: 本研究提出了一种基于功能熵的即插即用正则化项,通过最大化各视觉模态的信息贡献来平衡语义分割中的多模态学习,无需额外参数即可有效缓解单模态主导问题,并在三个数据集上实现了显著的性能提升。
English: This study introduces a plug-and-play regularization term based on functional entropy to balance multimodal contributions in semantic segmentation, effectively mitigating unimodal dominance and enhancing robustness without extra parameters, as validated by significant performance gains on three datasets.

Authors:Jialei Chen, Xu Zheng, Dongyue Li, Chong Yi, Seigo Ito, Danda Pani Paudel, Luc Van Gool, Hiroshi Murase, Daisuke Deguchi
Title: Split Matching for Inductive Zero-shot Semantic Segmentation
Abstract:
Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. While fine-tuning vision-language models has achieved promising results, these models often overfit to seen categories due to the lack of supervision for unseen classes. As an alternative to fully supervised approaches, query-based segmentation has shown great potential in ZSS, as it enables object localization without relying on explicit labels. However, conventional Hungarian matching, a core component in query-based frameworks, needs full supervision and often misclassifies unseen categories as background in the ZSS setting. To address this issue, we propose Split Matching (SM), a novel assignment strategy that decouples Hungarian matching into two components: one for seen classes in annotated regions and another for latent classes in unannotated regions (referred to as unseen candidates). Specifically, we partition the queries into seen and candidate groups, enabling each to be optimized independently according to its available supervision. To discover unseen candidates, we cluster CLIP dense features to generate pseudo masks and extract region-level embeddings using CLS tokens. Matching is then conducted separately for the two groups based on both class-level similarity and mask-level consistency. Additionally, we introduce a Multi-scale Feature Enhancement (MFE) module that refines decoder features through residual multi-scale aggregation, improving the model's ability to capture spatial details across resolutions. SM is the first to introduce decoupled Hungarian matching under the inductive ZSS setting, and achieves state-of-the-art performance on two standard benchmarks.
中文: 本文提出分割匹配策略,将匈牙利匹配解耦为可见类和候选类两组独立优化,通过多尺度特征增强模块提升空间细节捕捉能力,在零样本语义分割中实现了最先进的性能。
English: This paper introduces Split Matching, a novel assignment strategy that decouples Hungarian matching into separate seen and unseen candidate groups to address overfitting in zero-shot semantic segmentation, achieving state-of-the-art results with enhanced multi-scale feature refinement.
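As a rough illustration of the decoupling described in the abstract, the sketch below runs a minimum-cost one-to-one assignment separately for a seen-query group and a candidate-query group. It is not the authors' implementation: the function names, the cost values, and the brute-force solver (used here in place of a real Hungarian solver so that the example needs only the standard library) are all assumptions.

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Brute-force minimum-cost one-to-one assignment.

    cost[q][t] is the (hypothetical) cost of matching query q to target t.
    Assumes there are at least as many queries (rows) as targets (columns).
    Returns a list of (query_index, target_index) pairs, one per target.
    """
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0
    best_rows, best_total = (), float("inf")
    for rows in permutations(range(n_rows), n_cols):
        total = sum(cost[r][c] for c, r in enumerate(rows))
        if total < best_total:
            best_rows, best_total = rows, total
    return [(r, c) for c, r in enumerate(best_rows)]

def split_matching(seen_cost, cand_cost):
    """Match the two query groups independently.

    seen_cost: costs between seen-group queries and annotated seen-class targets.
    cand_cost: costs between candidate-group queries and pseudo-mask targets.
    Candidate query indices are offset so they index the full query list.
    """
    seen_pairs = min_cost_assignment(seen_cost)
    offset = len(seen_cost)
    cand_pairs = [(q + offset, t) for q, t in min_cost_assignment(cand_cost)]
    return seen_pairs, cand_pairs

# Illustrative costs only: 3 seen queries vs. 2 seen targets,
# 2 candidate queries vs. 1 pseudo mask.
seen_cost = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
cand_cost = [[0.3], [0.05]]
seen_pairs, cand_pairs = split_matching(seen_cost, cand_cost)
```

Because each group is solved on its own cost matrix, a candidate query is never forced to compete with seen-class targets, which is the property the abstract attributes to the decoupled matching.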

Authors:Yuanhang Liu, Yanxing Huang, Yanqiao Wang, Peng Li, Yang Liu
Title: AI Mathematician: Towards Fully Automated Frontier Mathematical Research
Abstract:
Large Reasoning Models (LRMs) have made significant progress in mathematical capabilities in recent times. However, these successes have been primarily confined to competition-level problems. In this work, we propose the AI Mathematician (AIM) framework, which harnesses the reasoning strength of LRMs to support frontier mathematical research. We have identified two critical challenges of mathematical research compared to competition: the intrinsic complexity of research problems and the requirement of procedural rigor. To address these challenges, AIM incorporates two core strategies: an exploration mechanism to foster longer solution paths, and the pessimistic reasonable verification method to ensure reliability. This early version of AIM already exhibits strong capability in tackling research-level tasks. We conducted extensive experiments across several real-world mathematical topics and obtained promising results. AIM is able to autonomously construct substantial portions of proofs and uncover non-trivial insights within each research area. These findings highlight the potential of LRMs in mathematical discovery and suggest that LRM-based agent systems could significantly accelerate mathematical research in the future.
中文摘要:AI数学家(AIM)框架利用大型推理模型,通过探索机制和验证方法应对前沿数学研究的复杂性与严谨性要求,在多个数学领域展现出自主构建证明和生成深刻见解的强大能力。
English Summary: The AI Mathematician (AIM) framework leverages Large Reasoning Models to address the complexity and rigor of frontier mathematical research through exploration mechanisms and verification methods, demonstrating strong capabilities in autonomously constructing proofs and generating insights across various topics.

Authors:Dairu Liu, Ziyue Wang, Minyuan Ruan, Fuwen Luo, Chi Chen, Peng Li, Yang Liu
Title: Visual Abstract Thinking Empowers Multimodal Reasoning
Abstract:
Images usually convey richer detail than text, but often include redundant information that can degrade multimodal reasoning performance. When faced with lengthy or complex messages, humans tend to employ abstract thinking to convert them into simple and concise abstracts. Inspired by this cognitive strategy, we introduce Visual Abstract Thinking (VAT), a novel thinking paradigm that prompts Multimodal Large Language Models (MLLMs) with visual abstracts instead of explicit verbal thoughts or elaborate guidance, permitting a more concentrated visual reasoning mechanism. Explicit thinking, such as Chain-of-thought (CoT) or tool-augmented approaches, increases the complexity of the reasoning process by inserting verbose intermediate steps, external knowledge or visual information. In contrast, VAT reduces redundant visual information and encourages models to focus their reasoning on more essential visual elements. Experimental results show that VAT consistently empowers different models, and achieves an average gain of 17% over the GPT-4o baseline by employing diverse types of visual abstracts, demonstrating that VAT can enhance the visual reasoning abilities of MLLMs on conceptual, structural and relational reasoning tasks. VAT is also compatible with CoT in knowledge-intensive multimodal reasoning tasks. These findings highlight the effectiveness of visual reasoning via abstract thinking and encourage further exploration of more diverse reasoning paradigms from the perspective of human cognition.
中文摘要:视觉抽象思维(VAT)通过减少视觉冗余信息并聚焦关键要素来增强多模态推理能力,在多项推理任务中相比GPT-4o基准实现了17%的性能提升。
English Summary: Visual Abstract Thinking (VAT) enhances multimodal reasoning by reducing visual redundancy and focusing on essential elements, achieving a 17% performance gain over GPT-4o in various reasoning tasks.

Authors:Xurong Liang, Tong Chen, Wei Yuan, Hongzhi Yin
Title: Lightweight Embeddings with Graph Rewiring for Collaborative Filtering
Abstract:
As recommendation services scale rapidly and their deployment now commonly involves resource-constrained edge devices, GNN-based recommender systems face significant challenges, including high embedding storage costs and runtime latency from graph propagations. Our previous work, LEGCF, effectively reduced embedding storage costs but struggled to maintain recommendation performance under stricter storage limits. Additionally, LEGCF did not address the extensive runtime computation costs associated with graph propagation, which involves heavy multiplication and accumulation operations (MACs). These challenges consequently hinder effective training and inference on resource-constrained edge devices. To address these limitations, we propose Lightweight Embeddings with Rewired Graph for Graph Collaborative Filtering (LERG), an improved extension of LEGCF. LERG retains LEGCF's compositional codebook structure but introduces quantization techniques to reduce the storage cost, enabling the inclusion of more meta-embeddings within the same storage. To optimize graph propagation, we pretrain the quantized compositional embedding table using the full interaction graph on resource-rich servers, after which a fine-tuning stage is engaged to identify and prune low-contribution entities via a gradient-free binary integer programming approach, constructing a rewired graph that excludes these entities (i.e., user/item nodes) from propagating signals. The quantized compositional embedding table with selective embedding participation and the sparse rewired graph are transferred to edge devices, which significantly reduces computation memory and inference time. Experiments on three public benchmark datasets, including an industry-scale dataset, demonstrate that LERG achieves superior recommendation performance while dramatically reducing storage and computation costs for graph-based recommendation services.
中文摘要:LERG通过量化技术降低存储成本,并采用重布线图优化计算,在边缘设备上以更少资源实现了更优的推荐性能。
English Summary: LERG enhances LEGCF by introducing quantization to reduce storage and employing a rewired graph to minimize computation, achieving superior recommendation performance with significantly lower resource demands on edge devices.

Authors:Yihong Dong, Yuchen Liu, Xue Jiang, Zhi Jin, Ge Li
Title: Rethinking Repetition Problems of LLMs in Code Generation
Abstract:
With the advent of neural language models, the performance of code generation has been significantly boosted. However, the problem of repetitions during the generation process continues to linger. Previous work has primarily focused on content repetition, which is merely a fraction of the broader repetition problem in code generation. A more prevalent and challenging problem is structural repetition. In structural repetition, the repeated code appears in various patterns but possesses a fixed structure, which can be inherently reflected in grammar. In this paper, we formally define structural repetition and propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar, to alleviate the repetition problems in code generation for LLMs. Specifically, RPG first leverages grammar rules to identify repetition problems during code generation, and then strategically decays the likelihood of critical tokens that contribute to repetitions, thereby mitigating them in code generation. To facilitate this study, we construct a new dataset CodeRepetEval to comprehensively evaluate approaches for mitigating the repetition problems in code generation. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on CodeRepetEval dataset as well as HumanEval and MBPP benchmarks, effectively reducing repetitions and enhancing the quality of generated code.
Chinese: 本文提出RPG方法,通过基于语法的重复惩罚机制有效减少代码生成中的结构性重复,并在CodeRepetEval新数据集和主流基准测试中验证了其优越性能。
English: This paper introduces RPG, a grammar-based decoding method that effectively reduces structural repetitions in neural code generation by penalizing repetitive tokens, as validated on the new CodeRepetEval dataset and established benchmarks.

Authors:Hechuan Wen, Tong Chen, Mingming Gong, Li Kheng Chai, Shazia Sadiq, Hongzhi Yin
Title: Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective
Abstract:
Although numerous complex algorithms for treatment effect estimation have been developed in recent years, their effectiveness remains limited when handling insufficiently labeled training sets due to the high cost of labeling the effect after treatment, e.g., expensive tumor imaging or biopsy procedures needed to evaluate treatment effects. Therefore, it becomes essential to actively incorporate more high-quality labeled data, all while adhering to a constrained labeling budget. To enable data-efficient treatment effect estimation, we formalize the problem through rigorous theoretical analysis within the active learning context, where the derived key measures, the factual and counterfactual covering radius, determine the risk upper bound. To reduce the bound, we propose a greedy radius reduction algorithm, which excels under an idealized, balanced data distribution. To generalize to more realistic data distributions, we further propose FCCM, which transforms the optimization objective into the Factual and Counterfactual Coverage Maximization to ensure effective radius reduction during data acquisition. Furthermore, benchmarking FCCM against other baselines demonstrates its superiority across both fully synthetic and semi-synthetic datasets.
中文: 尽管治疗效应估计的复杂算法不断进步,但高标注成本导致标记数据不足,限制了其效能,因此我们提出了一种名为FCCM的主动学习方法,通过最大化事实与反事实覆盖来优化数据采集,降低风险,并在合成数据集上展现出优于现有方法的性能。
English: Despite advances in complex algorithms for treatment effect estimation, their performance is hampered by limited labeled data due to high labeling costs, prompting the development of a new active learning method called FCCM that optimizes data acquisition by maximizing factual and counterfactual coverage to reduce risk and outperform existing approaches on synthetic datasets.

Authors:Lijian Chen, Wei Yuan, Tong Chen, Xiangyu Zhao, Nguyen Quoc Viet Hung, Hongzhi Yin
Title: Multi-agents based User Values Mining for Recommendation
Abstract:
Recommender systems have rapidly evolved and become integral to many online services. However, existing systems sometimes produce unstable and unsatisfactory recommendations that fail to align with users' fundamental and long-term preferences. This is because they primarily focus on extracting shallow and short-term interests from user behavior data, which is inherently dynamic and challenging to model. Unlike these transient interests, user values are more stable and play a crucial role in shaping user behaviors, such as purchasing items and consuming content. Incorporating user values into recommender systems can help stabilize recommendation performance and ensure results better reflect users' latent preferences. However, acquiring user values is typically difficult and costly. To address this challenge, we leverage the strong language understanding, zero-shot inference, and generalization capabilities of Large Language Models (LLMs) to extract user values from users' historical interactions. Unfortunately, direct extraction using LLMs presents several challenges such as length constraints and hallucination. To overcome these issues, we propose ZOOM, a zero-shot multi-LLM collaborative framework for effective and accurate user value extraction. In ZOOM, we apply text summarization techniques to condense item content while preserving essential meaning. To mitigate hallucinations, ZOOM introduces two specialized agent roles: evaluators and supervisors, to collaboratively generate accurate user values. Extensive experiments on two widely used recommendation datasets with two state-of-the-art recommendation models demonstrate the effectiveness and generalization of our framework in automatic user value mining and recommendation performance improvement.
中文: 现有推荐系统因侧重短期兴趣易产生不稳定推荐,而提出的ZOOM框架通过多LLM协作从用户历史交互中精准提取稳定的用户价值观,有效提升了推荐性能与质量。
English: Current recommender systems often yield unstable results by focusing on transient user interests, but the proposed ZOOM framework leverages multiple LLMs to accurately extract stable user values from historical interactions, significantly enhancing recommendation quality and performance.

Authors:Zehan Wang, Jiayang Xu, Ziang Zhang, Tianyu Pang, Chao Du, Hengshuang Zhao, Zhou Zhao
Title: GenSpace: Benchmarking Spatially-Aware Image Generation
Abstract:
Humans can intuitively compose and arrange scenes in the 3D space for photography. However, can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace, a novel benchmark and evaluation pipeline to comprehensively assess the spatial awareness of current image generation models. Furthermore, standard evaluations using general Vision-Language Models (VLMs) frequently fail to capture the detailed spatial errors. To handle this challenge, we propose a specialized evaluation pipeline and metric, which reconstructs 3D scene geometry using multiple visual foundation models and provides a more accurate and human-aligned metric of spatial faithfulness. Our findings show that while AI models create visually appealing images and can follow general instructions, they struggle with specific 3D details like object placement, relationships, and measurements. We summarize three core limitations in the spatial perception of current state-of-the-art image generation models: 1) Object Perspective Understanding, 2) Egocentric-Allocentric Transformation and 3) Metric Measurement Adherence, highlighting possible directions for improving spatial intelligence in image generation.
中文摘要:GenSpace是一种新颖的基准和评估流程,旨在全面评估当前图像生成模型的三维空间感知能力,发现尽管这些模型能生成视觉吸引人的图像,但在物体布局和空间关系等具体三维细节上仍存在明显不足。
English Summary: GenSpace is a new benchmark and evaluation pipeline designed to assess the 3D spatial awareness of AI image generators, revealing their limitations in object placement and spatial relationships despite producing visually appealing images.

Authors:Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu
Title: MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation
Abstract:
Large Language Models (LLMs), e.g., ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long, complex dialogue sessions with frequent motivation transfer and sophisticated cross-turn dependency, has long been criticized. Nevertheless, no existing benchmark fully reflects these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to remedy the gap. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. Extensive experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives, explicit reasoning significantly boosts LLMs' robustness in handling long, complex dialogue sessions, and LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependency. Moreover, we provide mechanistic interpretability on how attention sinks due to special tokens lead to LLMs' performance degradation when handling long, complex dialogue sessions, based on an attention visualization experiment in Qwen2.5-7B-Instruction.
中文:MARS-Bench是一个旨在评估大语言模型处理复杂多轮对话鲁棒性的新基准,研究表明闭源模型优于开源模型且显式推理能提升性能,但模型在处理动机转移和跨轮依赖时仍面临挑战,这源于长对话中注意力机制的退化。
English: MARS-Bench is a new benchmark designed to evaluate large language models' robustness in handling complex multi-turn dialogues, revealing that while closed-source models outperform open-source ones and explicit reasoning helps, these models still struggle with motivation transfer and cross-turn dependencies due to attention degradation in long sessions.

Authors:Keqin Peng, Liang Ding, Yuanxin Ouyang, Meng Fang, Dacheng Tao
Title: Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt
Abstract:
Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks, largely due to the adoption of Long Chain-of-Thought (Long CoT) reasoning. However, they often exhibit overthinking, performing unnecessary reasoning steps even after arriving at the correct answer. Prior work has largely focused on qualitative analyses of overthinking through sample-based observations of long CoTs. In contrast, we present a quantitative analysis of overthinking from the perspective of self-doubt, characterized by excessive token usage devoted to re-verifying already-correct answers. We find that self-doubt significantly contributes to overthinking. In response, we introduce a simple and effective prompting method to reduce the model's over-reliance on input questions, thereby avoiding self-doubt. Specifically, we first prompt the model to question the validity of the input question, and then respond concisely based on the outcome of that evaluation. Experiments on three mathematical reasoning tasks and four datasets with missing premises demonstrate that our method substantially reduces answer length and yields significant improvements across nearly all datasets on 4 widely-used RLLMs. Further analysis demonstrates that our method effectively minimizes the number of reasoning steps and reduces self-doubt.
中文: 本研究定量分析了推理大语言模型中因自我怀疑导致的过度思考问题,并提出通过质疑输入有效性来减少不必要推理步骤的提示方法,在多个数据集上实现了答案精简和性能提升。
English: This study quantitatively analyzes self-doubt as a cause of overthinking in reasoning large language models and introduces a prompting method that reduces unnecessary reasoning steps by questioning input validity, achieving shorter answers and improved performance across multiple datasets.

Authors:Haonan Wang, Hongfu Liu, Xiangyan Liu, Chao Du, Kenji Kawaguchi, Ye Wang, Tianyu Pang
Title: Fostering Video Reasoning via Next-Event Prediction
Abstract:
Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.
中文: 我们提出了下一事件预测(NEP)作为一种自监督学习任务,利用未来视频片段来增强多模态大语言模型(MLLM)的时间推理能力,并创建了V1-33K数据集和FutureBench评估基准。
English: Next-event prediction (NEP) is introduced as a self-supervised learning task that uses future video segments to enhance temporal reasoning in multimodal large language models (MLLMs), supported by the V1-33K dataset and evaluated with FutureBench.

Authors:Qihuang Zhong, Liang Ding, Fei Liao, Juhua Liu, Bo Du, Dacheng Tao
Title: Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning
Abstract:
Domain-specific instruction-tuning has become the de facto standard for improving the performance of large language models (LLMs) in specialized applications, e.g., medical question answering. Since the instruction-tuning dataset might contain redundant or low-quality data, data selection (DS) is usually required to maximize the data efficiency. Despite the successes in the general domain, current DS methods often struggle to select the desired data for domain-specific instruction-tuning. One of the main reasons is that they neglect the impact of knowledge conflicts, i.e., the discrepancy between LLMs' pretrained knowledge and context knowledge of instruction data, which could damage LLMs' prior abilities and lead to hallucination. To this end, we propose a simple-yet-effective Knowledge-aware Data Selection (namely KDS) framework to select the domain-specific instruction-tuning data that meets LLMs' actual needs. The core of KDS is to leverage two knowledge-aware metrics for quantitatively measuring knowledge conflicts from two aspects: context-memory knowledge alignment and intra-memory knowledge consistency. By filtering the data with large knowledge conflicts and sampling the high-quality and diverse data, KDS can effectively stimulate the LLMs' abilities and achieve better domain-specific performance. Taking the medical domain as the testbed, we conduct extensive experiments and empirically prove that KDS surpasses the other baselines and brings significant and consistent performance gains among all LLMs. More encouragingly, KDS effectively improves the model generalization and alleviates the hallucination problem.
中文: 领域特定的指令微调可提升大型语言模型在专业应用中的表现,但数据选择至关重要,以避免知识冲突并优化性能,因此提出的知识感知数据选择(KDS)框架能有效筛选高质量数据,减少幻觉问题,并增强模型在医疗问答等领域的泛化能力。
English: Domain-specific instruction-tuning enhances large language models (LLMs) for specialized applications, but data selection is crucial to avoid knowledge conflicts and improve performance, leading to the proposed Knowledge-aware Data Selection (KDS) framework that effectively selects high-quality data, reduces hallucinations, and boosts generalization in domains like medical question answering.

Authors:Qingyu Lu, Liang Ding, Siyi Cao, Xuebo Liu, Kanjian Zhang, Jinxia Zhang, Dacheng Tao
Title: Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments
Abstract:
Agents powered by large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. However, such agents often suffer from inefficiencies in multi-turn interactions, frequently trapped in repetitive loops or issuing ineffective commands, leading to redundant computational overhead. Instead of relying solely on learning from trajectories, we take a first step toward exploring the early-exit behavior for LLM-based agents. We propose two complementary approaches: 1. an intrinsic method that injects exit instructions during generation, and 2. an extrinsic method that verifies task completion to determine when to halt an agent's trial. To evaluate early-exit mechanisms, we introduce two metrics: one measures the reduction of redundant steps as a positive effect, and the other evaluates progress degradation as a negative effect. Experiments with 4 different LLMs across 5 embodied environments show significant efficiency improvements, with only minor drops in agent performance. We also validate a practical strategy where a stronger agent assists after an early-exit agent, achieving better performance with the same total steps. We will release our code to support further research.
中文摘要:本研究针对基于大语言模型的智能体在具身环境中存在的效率低下问题,提出了两种互补的早退机制,通过实验验证能在保持性能的同时显著提升执行效率,并为后续研究提供了实用策略。
English Summary: This study introduces early-exit mechanisms for LLM-based agents to reduce redundant computational steps in embodied environments, demonstrating significant efficiency gains with minimal performance loss across multiple experimental settings.

Authors:Tao Sun, Enhao Pan, Zhengkai Yang, Kaixin Sui, Jiajun Shi, Xianfu Cheng, Tongliang Li, Wenhao Huang, Ge Zhang, Jian Yang, Zhoujun Li
Title: P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark
Abstract:
Academic posters are vital for scholarly communication, yet their manual creation is time-consuming. However, automated academic poster generation faces significant challenges in preserving intricate scientific details and achieving effective visual-textual integration. Existing approaches often struggle with semantic richness and structural nuances, and lack standardized benchmarks for evaluating generated academic posters comprehensively. To address these limitations, we introduce P2P, the first flexible, LLM-based multi-agent framework that generates high-quality, HTML-rendered academic posters directly from research papers, demonstrating strong potential for practical applications. P2P employs three specialized agents (for visual element processing, content generation, and final poster assembly), each integrated with dedicated checker modules to enable iterative refinement and ensure output quality. To foster advancements and rigorous evaluation in this domain, we construct and release P2PInstruct, the first large-scale instruction dataset comprising over 30,000 high-quality examples tailored for the academic paper-to-poster generation task. Furthermore, we establish P2PEval, a comprehensive benchmark featuring 121 paper-poster pairs and a dual evaluation methodology (Universal and Fine-Grained) that leverages LLM-as-a-Judge and detailed, human-annotated checklists. Our contributions aim to streamline research dissemination and provide the community with robust tools for developing and evaluating next-generation poster generation systems.
中文: P2P框架采用多智能体系统,直接从研究论文生成高质量学术海报,并配套发布大规模数据集和评估基准,推动该领域的发展与标准化。
English: The P2P framework introduces a multi-agent system to automatically generate high-quality academic posters from research papers, supported by a comprehensive dataset and evaluation benchmark to advance automated poster generation.

Authors:Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen
Title: QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
Abstract:
Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, in which converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
Chinese: QuickVideo是一种系统与算法协同设计,通过并行视频解码、KV缓存剪枝优化预填充以及重叠CPU与GPU操作,将长视频推理时间减少多达一分钟,支持实时应用。
English: QuickVideo is a system-algorithm co-design that accelerates long-video understanding by parallelizing video decoding, optimizing prefilling with KV-cache pruning, and overlapping CPU-GPU operations, reducing inference time by up to a minute for real-time applications.
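The idea of splitting a video into keyframe-aligned intervals that are processed concurrently can be sketched as follows. This is only an illustration under stated assumptions, not QuickDecoder itself: `keyframe_intervals` and `decode_interval` are hypothetical names, the keyframe positions are made up, and the "decoder" is a stand-in that simply returns frame indices.

```python
from concurrent.futures import ThreadPoolExecutor

def keyframe_intervals(keyframes, total_frames, workers):
    """Split [0, total_frames) into at most `workers` half-open intervals,
    each starting on a keyframe so it can be decoded independently."""
    keyframes = sorted(keyframes)
    if not keyframes or keyframes[0] != 0:
        raise ValueError("expected the first frame to be a keyframe")
    per_worker = -(-len(keyframes) // workers)  # ceiling division
    starts = keyframes[::per_worker]
    ends = starts[1:] + [total_frames]
    return list(zip(starts, ends))

def decode_interval(interval):
    """Stand-in for a real decoder: returns the frame indices it would decode."""
    start, end = interval
    return list(range(start, end))

def parallel_decode(keyframes, total_frames, workers):
    intervals = keyframe_intervals(keyframes, total_frames, workers)
    # Threads are used here for brevity; the worker type is an assumption.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(decode_interval, intervals))  # order is preserved
    return [frame for chunk in chunks for frame in chunk]

# Hypothetical stream: a keyframe every 30 frames, 150 frames in total.
frames = parallel_decode([0, 30, 60, 90, 120], total_frames=150, workers=3)
```

Aligning every interval to a keyframe matters because frames between keyframes depend on earlier frames, so an interval that started mid-group could not be decoded on its own.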

Authors:Qihuang Zhong, Liang Ding, Xiantao Cai, Juhua Liu, Bo Du, Dacheng Tao
Title: KaFT: Knowledge-aware Fine-tuning for Boosting LLMs' Domain-specific Question-Answering Performance
Abstract:
Supervised fine-tuning (SFT) is a common approach to improve the domain-specific question-answering (QA) performance of large language models (LLMs). However, recent literature reveals that due to the conflicts between LLMs' internal knowledge and the context knowledge of training data, vanilla SFT using the full QA training set is usually suboptimal. In this paper, we first design a query diversification strategy for robust conflict detection and then conduct a series of experiments to analyze the impact of knowledge conflict. We find that 1) training samples with varied conflicts contribute differently, where SFT on the data with large conflicts leads to catastrophic performance drops; 2) compared to directly filtering out the conflict data, appropriately applying the conflict data would be more beneficial. Motivated by this, we propose a simple-yet-effective Knowledge-aware Fine-tuning (namely KaFT) approach to effectively boost LLMs' performance. The core of KaFT is to adapt the training weight by assigning different rewards for different training samples according to conflict level. Extensive experiments show that KaFT brings consistent and significant improvements across four LLMs. More analyses prove that KaFT effectively improves the model generalization and alleviates the hallucination.
中文: 本文提出知识感知微调(KaFT)方法,通过根据知识冲突程度调整训练权重,有效提升大语言模型的性能,增强泛化能力并减少幻觉现象。
English: This paper introduces Knowledge-aware Fine-tuning (KaFT), a method that enhances large language models' performance by adjusting training weights based on knowledge conflict levels, leading to improved generalization and reduced hallucination.
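
The idea of weighting training samples by conflict level can be sketched as follows. This is an illustration under stated assumptions, not the KaFT method itself: the conflict score (the fraction of diversified queries the model answers incorrectly), the thresholds, and the weight values are all hypothetical choices made for the example, and the function names are invented here.

```python
def conflict_score(model_answers, reference):
    """Fraction of answers (one per diversified query) that miss the reference.

    Assumption: exact string match after normalization stands in for a real
    answer checker.
    """
    target = reference.strip().lower()
    wrong = sum(answer.strip().lower() != target for answer in model_answers)
    return wrong / len(model_answers)


def sample_weight(score, low=0.25, high=0.75):
    """Map a conflict score to a training weight (hypothetical values).

    High-conflict samples are down-weighted rather than dropped, reflecting
    the abstract's finding that using conflict data appropriately beats
    filtering it out.
    """
    if score <= low:
        return 1.0
    if score >= high:
        return 0.1
    return 0.5


def weighted_loss(losses, weights):
    """Weighted average of per-sample losses."""
    total = sum(loss * weight for loss, weight in zip(losses, weights))
    return total / sum(weights)


score = conflict_score(["Paris", "paris", "Lyon", "Rome"], "Paris")
weight = sample_weight(score)
loss = weighted_loss([1.0, 2.0], [1.0, weight])
```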

Authors:Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent Y. F. Tan, Zhuoran Yang
Title: BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms
Abstract:
Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
中文摘要:本文提出BanditSpec,一种免训练的在线学习框架,通过多臂老虎机算法自适应选择推测解码超参数,在保持文本生成质量的同时有效加速大语言模型推理。
English Summary: This paper introduces BanditSpec, a training-free online learning framework that adaptively selects speculative decoding hyperparameters using multi-armed bandit algorithms to accelerate LLM inference while maintaining text generation quality.
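
Framing hyperparameter selection as a multi-armed bandit can be illustrated with a generic UCB1 selector. This is the textbook UCB1 rule, not the paper's UCBSpec algorithm, and it does not model the stopping time regret. The class name `UCBSelector` and the toy reward values are assumptions; in a real system the reward might be the normalized number of accepted draft tokens per step.

```python
import math


class UCBSelector:
    """Generic UCB1 over a fixed list of candidate configurations (arms)."""

    def __init__(self, num_arms):
        self.counts = [0] * num_arms
        self.means = [0.0] * num_arms
        self.t = 0

    def select(self):
        # Try every arm once before using confidence bounds.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        scores = [
            mean + math.sqrt(2.0 * math.log(self.t) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return scores.index(max(scores))

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        # Incremental running mean of the observed reward.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Toy deterministic rewards, one per hypothetical decoding configuration.
toy_reward = [0.2, 0.8, 0.5]
selector = UCBSelector(len(toy_reward))
for _ in range(500):
    arm = selector.select()
    selector.update(arm, toy_reward[arm])
best_arm = selector.counts.index(max(selector.counts))
```

Because the selector needs no gradient updates, it matches the training-free setting the abstract describes: only running statistics per configuration are kept.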

Authors:Jiajun Shi, Jian Yang, Jiaheng Liu, Xingyuan Bu, Jiangjie Chen, Junting Zhou, Kaijing Ma, Zhoufutu Wen, Bingli Wang, Yancheng He, Liang Song, Hualei Zhu, Shilong Li, Xingjian Wang, Wei Zhang, Ruibin Yuan, Yifan Yao, Wenjun Yang, Yunli Wang, Siyuan Fang, Siyu Yuan, Qianyu He, Xiangru Tang, Yingshui Tan, Wangchunshu Zhou, Zhaoxiang Zhang, Zhoujun Li, Wenhao Huang, Ge Zhang
Title: KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
Abstract:
Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.
中文: 近期大语言模型的进展要求更全面的评估方法,为此我们开发了KORGym动态评估平台,通过多种游戏测试推理能力,大量实验揭示了模型家族内一致的推理模式及闭源模型的优越表现。
English: Recent progress in large language models necessitates more thorough evaluation methods, leading to the development of KORGym, a dynamic platform that assesses reasoning across various games and reveals consistent patterns and superior performance of closed-source models in extensive experiments.

Authors:Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
Title: Optimizing Anytime Reasoning via Budget Relative Policy Optimization
Abstract:
Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within sampled token budgets from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.
中文: 通过引入可验证的密集奖励和分段优化策略,新方法在任意计算预算下提升大语言模型的推理效率与训练效果,显著优于传统固定预算方法。
English: Increasing test-time compute through reinforcement learning with dense, verifiable rewards optimizes LLMs for efficient, flexible reasoning under varying token budgets, improving both training and performance.
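
The way truncation at sampled budgets yields dense rewards can be sketched with a toy example. This is not the AnytimeReasoner or BRPO implementation: `summarize` and `verify` are hypothetical callables standing in for the summary policy and the answer verifier, and the budgets and prior are made-up values.

```python
def anytime_rewards(thinking_tokens, budgets, summarize, verify):
    """Reward obtained when thinking is cut off at each token budget."""
    return [
        float(verify(summarize(thinking_tokens[:budget])))
        for budget in budgets
    ]


def expected_reward(rewards, prior):
    """Expected reward under a prior distribution over budgets."""
    return sum(reward * prob for reward, prob in zip(rewards, prior))


# Toy trace: the correct answer only appears in the final token.
thinking = ["2", "+", "2", "=", "4"]


def summarize(tokens):
    """Hypothetical summary policy: answer with the last token seen."""
    return tokens[-1] if tokens else ""


def verify(answer):
    """Hypothetical verifier for the toy problem 2 + 2."""
    return answer == "4"


rewards = anytime_rewards(thinking, [1, 3, 5], summarize, verify)
objective = expected_reward(rewards, [0.25, 0.25, 0.5])
```

Each budget contributes its own verifiable reward, so a trace that reaches the answer earlier scores higher in expectation than one that only succeeds at the largest budget.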

Authors:Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, Nenghai Yu
Title: BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models
Abstract:
Binary analysis remains pivotal in software security, offering insights into compiled programs without source code access. As large language models (LLMs) continue to excel in diverse language understanding and generation tasks, their potential in decoding complex binary data structures becomes evident. However, the lack of standardized benchmarks in this domain limits the assessment and comparison of LLMs' capabilities in binary analysis and hinders the progress of research and practical applications. To bridge this gap, we introduce BinMetric, a comprehensive benchmark designed specifically to evaluate the performance of large language models on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks, including decompilation, code summarization, assembly instruction generation, etc., which reflect actual reverse engineering scenarios. Our empirical study on this benchmark investigates the binary analysis capabilities of various state-of-the-art LLMs, revealing their strengths and limitations in this field. The findings indicate that while LLMs show strong potential, challenges still exist, particularly in the areas of precise binary lifting and assembly synthesis. In summary, BinMetric makes a significant step forward in measuring the binary analysis capabilities of LLMs, establishing a new benchmark leaderboard, and our study provides valuable insights for the future development of these LLMs in software security.
中文摘要:BinMetric基准的推出填补了二进制分析领域缺乏标准化评估的空白,通过来自真实项目的1000个问题揭示了大型语言模型在反编译等任务中的潜力与当前局限,为软件安全研究提供了重要参考。
English Summary: The introduction of BinMetric, a comprehensive benchmark with 1,000 questions from real-world projects, addresses the lack of standardized evaluation for large language models in binary analysis, revealing both their potential and current limitations in tasks like decompilation and assembly synthesis.

Authors:Li Hu, Guoqiang Chen, Xiuwei Shang, Shaoyin Cheng, Benlong Wu, Gangyang Li, Xu Zhu, Weiming Zhang, Nenghai Yu
Title: CompileAgent: Automated Real-World Repo-Level Compilation with Tool-Integrated LLM-based Agent System
Abstract:
With open-source projects growing in size and complexity, manual compilation becomes tedious and error-prone, highlighting the need for automation to improve efficiency and accuracy. However, the complexity of compilation instruction search and error resolution makes automatic compilation challenging. Inspired by the success of LLM-based agents in various fields, we propose CompileAgent, the first LLM-based agent framework dedicated to repo-level compilation. CompileAgent integrates five tools and a flow-based agent strategy, enabling interaction with software artifacts for compilation instruction search and error resolution. To measure the effectiveness of our method, we design a public repo-level benchmark CompileAgentBench, and we also design two baselines for comparison by combining two compilation-friendly schemes. The performance on this benchmark shows that our method significantly improves the compilation success rate, ranging from 10% to 71%. Meanwhile, we evaluate the performance of CompileAgent under different agent strategies and verify the effectiveness of the flow-based strategy. Additionally, we emphasize the scalability of CompileAgent, further expanding its application prospects.
Chinese: CompileAgent是首个基于LLM的仓库级编译代理框架,通过集成工具和基于流程的策略,在CompileAgentBench基准测试中将编译成功率显著提升了10%至71%。
English: CompileAgent is the first LLM-based agent framework designed for repo-level compilation, integrating tools and a flow-based strategy to significantly boost compilation success rates by 10% to 71%, as validated on the CompileAgentBench benchmark.

Authors:Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, Yandong Wen, Ge Zhang, Weiyang Liu
Title: FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
Abstract:
Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only a 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in formal reasoning settings. We believe that FormalMATH provides a robust benchmark for evaluating formal mathematical reasoning.
中文: 针对形式数学推理基准的不足,我们提出了FormalMATH这一大规模Lean4数据集,包含5,560个已验证问题,并通过高效的自动形式化流程降低了专家成本,同时揭示了当前AI定理证明器性能的显著缺陷及其在人类指导下的意外困境。
English: To tackle the limitations in formal mathematical reasoning benchmarks, we introduce FormalMATH, a large-scale Lean4 dataset with 5,560 verified problems and an efficient autoformalization pipeline that cuts expert costs while maintaining accuracy, revealing significant gaps in current AI theorem provers' performance and their unexpected struggles with human guidance.

Authors:Weizhou Shen, Chenliang Li, Fanqi Wan, Shengyi Liao, Shaopeng Lai, Bo Zhang, Yingcheng Shi, Yuning Wu, Gang Fu, Zhansheng Li, Bin Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan
Title: QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization
Abstract:
This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation of large language models (LLMs) during long sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) natural language-guided dynamic optimization, (2) bidirectional reasoning layers for enhanced boundary awareness, (3) token critic mechanisms with language modeling heads, and (4) window-parallel inference. Comprehensive evaluations across five benchmarks (4K-2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) consistent superiority over other context management methods like RAG and sparse attention in both accuracy and efficiency; (2) architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59$\times$ context compression alongside 19.15-point average performance gains; and (3) when deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench, establishing new SOTA performance.
中文摘要:QwenLong-CPRS是一个通过动态上下文优化机制实现多粒度压缩的上下文压缩框架,在提升长序列处理效率的同时显著增强了模型性能,并在多项基准测试中超越了现有最优方法。
English Summary: QwenLong-CPRS is a context compression framework that optimizes long-context processing through dynamic compression mechanisms, achieving significant efficiency gains and performance improvements across multiple benchmarks while being compatible with various large language models.

Authors:Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan
Title: QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
Abstract:
Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed within the short-context reasoning tasks. In contrast, extending LRMs to effectively process and reason on long-context inputs via RL remains a critical unsolved challenge. To bridge this gap, we first formalize the paradigm of long-context reasoning RL, and identify key challenges in suboptimal training efficiency and unstable optimization process. To address these issues, we propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Specifically, we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust initial policy, followed by a curriculum-guided phased RL technique to stabilize the policy evolution, and enhanced with a difficulty-aware retrospective sampling strategy to incentivize the policy exploration. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B, achieving performance on par with Claude-3.7-Sonnet-Thinking, demonstrating leading performance among state-of-the-art LRMs. This work advances the development of practical long-context LRMs capable of robust reasoning across information-intensive environments.
中文: 近期大型推理模型在短上下文推理中表现出色,但处理长上下文输入仍面临挑战;QwenLong-L1通过渐进式上下文扩展框架解决了这一问题,在多项基准测试中超越主流模型,达到与Claude-3.7-Sonnet-Thinking相当的性能,推动了长上下文推理模型的发展。
English: Recent large reasoning models show strong short-context reasoning but struggle with long-context inputs, a gap addressed by QwenLong-L1, which adapts these models via progressive scaling and achieves top-tier performance on benchmarks, rivaling leading models like Claude-3.7-Sonnet-Thinking.

Authors:Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang
Title: VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
Abstract:
Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce VLM-R$^3$ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to (i) decide when additional visual evidence is needed, (ii) determine where to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is Region-Conditioned Reinforcement Policy Optimization (R-GRPO), a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g., crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.
中文:VLM-R³框架通过区域条件强化策略优化训练方法,赋予多模态大语言模型动态识别和推理区域的能力,在复杂视觉任务上实现了最先进的性能。
English: The VLM-R³ framework enhances MLLMs by enabling dynamic region recognition and reasoning through its R-GRPO training method, achieving state-of-the-art performance on complex visual tasks.

Authors:Junyang Wang, Haiyang Xu, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Jitao Sang
Title: Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation
Abstract:
The exponential rise in mobile device usage necessitates streamlined automation for effective task management, yet many AI frameworks fall short due to inadequate operational expertise. While manually written knowledge can bridge this gap, it is often burdensome and inefficient. We introduce Mobile-Agent-V, an innovative framework that utilizes video as a guiding tool to effortlessly and efficiently inject operational knowledge into mobile automation processes. By deriving knowledge directly from video content, Mobile-Agent-V eliminates manual intervention, significantly reducing the effort and time required for knowledge acquisition. To rigorously evaluate this approach, we propose Mobile-Knowledge, a benchmark tailored to assess the impact of external knowledge on mobile agent performance. Our experimental findings demonstrate that Mobile-Agent-V enhances performance by 36% compared to existing methods, underscoring its effortless and efficient advantages in mobile automation.
中文摘要:Mobile-Agent-V是一种创新框架,利用视频内容自动为移动自动化注入操作知识,无需人工干预,性能比现有方法提升36%。
English Summary: Mobile-Agent-V is an innovative framework that leverages video content to automatically inject operational knowledge into mobile automation, eliminating manual efforts and boosting performance by 36% over existing methods.

Authors:Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang
Title: SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization
Abstract:
Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named Short-to-Long Preference Optimization (SoLoPO), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.
中文:SoLoPO框架通过将长上下文偏好优化解耦为短上下文偏好优化和短到长奖励对齐,有效解决了大语言模型的长上下文对齐问题,显著提升了跨基准测试的性能和效率。
English: The proposed SoLoPO framework addresses long-context alignment challenges in LLMs by decoupling optimization into short-context preference optimization and short-to-long reward alignment, significantly improving efficiency and performance across benchmarks.

Authors:Tao Wu, Jingyuan Chen, Wang Lin, Mengze Li, Yumeng Zhu, Ang Li, Kun Kuang, Fei Wu
Title: Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents
Abstract:
Large language models (LLMs) are revolutionizing education, with LLM-based agents playing a key role in simulating student behavior. A major challenge in student simulation is modeling the diverse learning patterns of students at various cognitive levels. However, current LLMs, typically trained as "helpful assistants", aim to generate perfect responses. As a result, they struggle to simulate students with diverse cognitive abilities, as they often produce overly advanced answers, missing the natural imperfections that characterize student learning and resulting in unrealistic simulations. To address this issue, we propose a training-free framework for student simulation. We begin by constructing a cognitive prototype for each student using a knowledge graph, which captures their understanding of concepts from past learning records. This prototype is then mapped to new tasks to predict student performance. Next, we simulate student solutions based on these predictions and iteratively refine them using a beam search method to better replicate realistic mistakes. To validate our approach, we construct the Student_100 dataset, consisting of 100 students working on Python programming and 5,000 learning records. Experimental results show that our method consistently outperforms baseline models, achieving a 100% improvement in simulation accuracy.
中文: 大型语言模型因倾向于生成完美答案而难以真实模拟学生多样化的认知能力,我们提出的免训练框架通过构建认知原型并迭代优化模拟解决方案,成功复现了学习过程中的自然缺陷。
English: Large language models struggle to realistically simulate students' diverse cognitive abilities due to their tendency to generate perfect responses, but our proposed training-free framework addresses this by constructing cognitive prototypes and iteratively refining simulated solutions to replicate natural learning imperfections.

Authors:Yuting Huang, Meitong Guo, Yiquan Wu, Ang Li, Xiaozhong Liu, Keting Yin, Changlong Sun, Fei Wu, Kun Kuang
Title: AppealCase: A Dataset and Benchmark for Civil Case Appeal Scenarios
Abstract:
Recent advances in LegalAI have primarily focused on individual case judgment analysis, often overlooking the critical appellate process within the judicial system. Appeals serve as a core mechanism for error correction and ensuring fair trials, making them highly significant both in practice and in research. To address this gap, we present the AppealCase dataset, consisting of 10,000 pairs of real-world, matched first-instance and second-instance documents across 91 categories of civil cases. The dataset also includes detailed annotations along five dimensions central to appellate review: judgment reversals, reversal reasons, cited legal provisions, claim-level decisions, and whether there is new information in the second instance. Based on these annotations, we propose five novel LegalAI tasks and conduct a comprehensive evaluation across 20 mainstream models. Experimental results reveal that all current models achieve less than 50% F1 scores on the judgment reversal prediction task, highlighting the complexity and challenge of the appeal scenario. We hope that the AppealCase dataset will spur further research in LegalAI for appellate case analysis and contribute to improving consistency in judicial decision-making.
中文: AppealCase数据集填补了LegalAI在司法上诉流程研究中的空白,提供了涵盖91类民事案件的1万对匹配的一二审文书及五维标注,基于此提出的五项新任务揭示现有模型在上诉场景中表现不佳,尤其判决逆转预测任务的F1值不足50%。
English: The AppealCase dataset addresses the gap in LegalAI research on appellate processes by providing 10,000 matched first- and second-instance civil case documents with detailed annotations, enabling the evaluation of models on five novel tasks where current systems perform poorly, particularly in predicting judgment reversals.

Authors:Yuanai Xie, Zhaozhi Liu, Xiao Zhang, Shihua Zhang, Rui Hou, Minrui Xu, Ruichen Zhang, Dusit Niyato
Title: Shadow Wireless Intelligence: Large Language Model-Driven Reasoning in Covert Communications
Abstract:
Covert Communications (CC) can secure sensitive transmissions in industrial, military, and mission-critical applications within 6G wireless networks. However, traditional optimization methods based on Artificial Noise (AN), power control, and channel manipulation might not adapt to dynamic and adversarial environments due to the high dimensionality, nonlinearity, and stringent real-time covertness requirements. To bridge this gap, we introduce Shadow Wireless Intelligence (SWI), which integrates the reasoning capabilities of Large Language Models (LLMs) with retrieval-augmented generation to enable intelligent decision-making in covert wireless systems. Specifically, we utilize DeepSeek-R1, a mixture-of-experts-based LLM with RL-enhanced reasoning, combined with real-time retrieval of domain-specific knowledge to improve context accuracy and mitigate hallucinations. Our approach develops a structured CC knowledge base, supports context-aware retrieval, and performs semantic optimization, allowing LLMs to generate and adapt CC strategies in real time. In a case study on optimizing AN power in a full-duplex CC scenario, DeepSeek-R1 achieves 85% symbolic derivation accuracy and 94% correctness in the generation of simulation code, outperforming baseline models. These results validate SWI as a robust, interpretable, and adaptive foundation for LLM-driven intelligent covert wireless systems in 6G networks.
中文摘要:影子无线智能(SWI)将大语言模型与检索增强生成技术相结合,为6G网络中的隐蔽通信提供实时自适应决策支持,在策略优化和仿真精度方面显著优于传统方法。
English Summary: Shadow Wireless Intelligence (SWI) integrates Large Language Models with retrieval-augmented generation to enable real-time adaptive covert communications in 6G networks, achieving superior performance in strategy optimization and simulation accuracy compared to traditional methods.

Authors:Shunpu Tang, Yuanyuan Jia, Qianqian Yang, Ruichen Zhang, Jihong Park, Dusit Niyato
Title: Enabling Training-Free Semantic Communication Systems with Generative Diffusion Models
Abstract:
Semantic communication (SemCom) has recently emerged as a promising paradigm for next-generation wireless systems. Empowered by advanced artificial intelligence (AI) technologies, SemCom has achieved significant improvements in transmission quality and efficiency. However, existing SemCom systems either rely on training over large datasets and specific channel conditions or suffer from performance degradation under channel noise when operating in a training-free manner. To address these issues, we explore the use of generative diffusion models (GDMs) as training-free SemCom systems. Specifically, we design a semantic encoding and decoding method based on the inversion and sampling process of the denoising diffusion implicit model (DDIM), which introduces a two-stage forward diffusion process, split between the transmitter and receiver to enhance robustness against channel noise. Moreover, we optimize sampling steps to compensate for the increased noise level caused by channel noise. We also conduct a brief analysis to provide insights about this design. Simulations on the Kodak dataset validate that the proposed system outperforms the existing baseline SemCom systems across various metrics.
中文: 基于生成扩散模型的语义通信系统提供了一种无需训练的方法,通过增强对信道噪声的鲁棒性,在传输质量和效率上超越了现有基准系统。
English: Semantic communication systems enhanced by generative diffusion models offer a training-free approach that improves robustness against channel noise and outperforms existing baselines in transmission quality and efficiency.

Authors:Chang Liu, Bohao Zhao, Jingtao Ding, Huandong Wang, Yong Li
Title: Mamba Integrated with Physics Principles Masters Long-term Chaotic System Forecasting
Abstract:
Long-term forecasting of chaotic systems remains a fundamental challenge due to the intrinsic sensitivity to initial conditions and the complex geometry of strange attractors. Conventional approaches, such as reservoir computing, typically require training data that incorporates long-term continuous dynamical behavior to comprehensively capture system dynamics. While advanced deep sequence models can capture transient dynamics within the training data, they often struggle to maintain predictive stability and dynamical coherence over extended horizons. Here, we propose PhyxMamba, a framework that integrates a Mamba-based state-space model with physics-informed principles to forecast long-term behavior of chaotic systems given short-term historical observations on their state evolution. We first reconstruct the attractor manifold with time-delay embeddings to extract global dynamical features. After that, we introduce a generative training scheme that enables Mamba to replicate the physical process. It is further augmented by multi-patch prediction and attractor geometry regularization for physical constraints, enhancing predictive accuracy and preserving key statistical properties of systems. Extensive experiments on simulated and real-world chaotic systems demonstrate that PhyxMamba delivers superior forecasting accuracy and faithfully captures essential statistics from short-term historical observations.
中文摘要:PhyxMamba是一种新颖的物理信息Mamba框架,通过重构吸引子流形并保持动力学不变量,能够基于短期观测数据实现对混沌系统的精准长期预测。
English Summary: PhyxMamba is a novel physics-informed Mamba framework that accurately forecasts chaotic systems long-term from short observations by reconstructing attractor manifolds and preserving dynamical invariants.
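
The time-delay embedding step mentioned in the abstract can be illustrated with a standard Takens-style construction. This is a generic sketch rather than the PhyxMamba code: the function name `delay_embed` and the parameter names `dim` (embedding dimension) and `tau` (delay in samples) are assumptions made for the example.

```python
def delay_embed(series, dim, tau):
    """Build delay vectors [x(i), x(i + tau), ..., x(i + (dim - 1) * tau)]."""
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    num_vectors = len(series) - (dim - 1) * tau
    if num_vectors <= 0:
        raise ValueError("series too short for this dim and tau")
    return [
        [series[i + j * tau] for j in range(dim)]
        for i in range(num_vectors)
    ]


# A scalar observation sequence becomes points in a 3-dimensional space.
embedded = delay_embed([0, 1, 2, 3, 4, 5], dim=3, tau=2)
```

Each row is one reconstructed state, so a scalar time series of length N yields N - (dim - 1) * tau points that trace out the reconstructed attractor.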

Authors:Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, Lijuan Wang
Title: Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Abstract:
Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations in text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: First, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements. Second, we employ reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT. The result shows that our grounded CoT is more effective for multimodal reasoning compared with the text-only CoT. Moreover, Point-RFT exhibits superior generalization capability across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, TabMWP, etc., and highlights its potential in complex real-world scenarios.
中文:Point-RFT框架通过整合视觉基础思维链推理和强化微调,显著提升了视觉文档理解能力,在多个基准测试中实现了更高的准确率和优异的泛化性能。
English: The Point-RFT framework enhances visual document understanding by integrating visually grounded Chain-of-Thought reasoning and reinforcement finetuning, achieving significant accuracy improvements and superior generalization across diverse benchmarks.

Authors:Chonghua Han, Yuan Yuan, Kaiyan Chen, Jingtao Ding, Yong Li
Title: TrajMoE: Spatially-Aware Mixture of Experts for Unified Human Mobility Modeling
Abstract:
Modeling human mobility across diverse cities is essential for applications such as urban planning, transportation optimization, and personalized services. However, generalization remains challenging due to heterogeneous spatial representations and mobility patterns across cities. Existing methods typically rely on numerical coordinates or require training city-specific models, limiting their scalability and transferability. We propose TrajMoE, a unified and scalable model for cross-city human mobility modeling. TrajMoE addresses two key challenges: (1) inconsistent spatial semantics across cities, and (2) diverse urban mobility patterns. To tackle these, we begin by designing a spatial semantic encoder that learns transferable location representations from POI-based functional semantics and visit patterns. Furthermore, we design a Spatially-Aware Mixture-of-Experts (SAMoE) Transformer that injects structured priors into experts specialized in distinct mobility semantics, along with a shared expert to capture city-invariant patterns and enable adaptive cross-city generalization. Extensive experiments demonstrate that TrajMoE achieves up to 27% relative improvement over competitive mobility foundation models after only one epoch of fine-tuning, and consistently outperforms full-data baselines using merely 5% of target city data. These results establish TrajMoE as a significant step toward realizing a truly generalizable, transferable, and pretrainable foundation model for human mobility.
中文: TrajMoE是一种统一且可扩展的跨城市人类移动性建模模型,结合空间语义编码器与空间感知专家混合(SAMoE)Transformer,仅经一轮微调即可比现有移动性基础模型取得最高27%的相对提升,并且仅用5%的目标城市数据就能持续优于全量数据基线。
English: TrajMoE is a unified, scalable model for cross-city human mobility modeling that combines a spatial semantic encoder with a Spatially-Aware Mixture-of-Experts (SAMoE) Transformer, achieving up to 27% relative improvement over competitive mobility foundation models after one epoch of fine-tuning and outperforming full-data baselines with only 5% of target-city data.
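
The combination of specialized experts with a shared expert, as described in the abstract above, can be sketched generically. This is a plain top-k mixture-of-experts forward pass under stated assumptions: the function names, the softmax-over-top-k gating, and the additive shared expert are illustrative choices, not the paper's SAMoE design, and the spatial priors are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_scores, experts, shared_expert, top_k=2):
    """Add a shared expert's output to a weighted sum of the top-k routed experts.

    gate_scores: one routing score per expert (assumed to be given)
    experts:     list of callables, one per specialized expert
    """
    # Indices of the k experts with the highest routing scores.
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:top_k]
    # Renormalize the selected scores so the routed weights sum to 1.
    weights = softmax([gate_scores[i] for i in top])
    routed = sum(w * experts[i](x) for w, i in zip(weights, top))
    # The shared expert always contributes, regardless of routing.
    return shared_expert(x) + routed


# Usage with toy scalar "experts" (purely illustrative).
toy_experts = [lambda v: v, lambda v: 2 * v, lambda v: 3 * v]
out = moe_forward(1.0, [0.0, 5.0, 5.0], toy_experts, lambda v: 10 * v, top_k=2)
```

Only `top_k` experts run per input, so compute stays bounded as experts are added, while the shared expert sees every input.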

Authors:Chonghua Han, Yuan Yuan, Jingtao Ding, Jie Feng, Fanjin Meng, Yong Li
Title: MoveGPT: Scaling Mobility Foundation Models with Spatially-Aware Mixture of Experts
Abstract:
The success of foundation models in language has inspired a new wave of general-purpose models for human mobility. However, existing approaches struggle to scale effectively due to two fundamental limitations: a failure to use meaningful basic units to represent movement, and an inability to capture the vast diversity of patterns found in large-scale data. In this work, we develop MoveGPT, a large-scale foundation model specifically architected to overcome these barriers. MoveGPT is built upon two key innovations: (1) a unified location encoder that maps geographically disjoint locations into a shared semantic space, enabling pre-training on a global scale; and (2) a Spatially-Aware Mixture-of-Experts Transformer that develops specialized experts to efficiently capture diverse mobility patterns. Pre-trained on billion-scale datasets, MoveGPT establishes a new state-of-the-art across a wide range of downstream tasks, achieving performance gains of up to 35% on average. It also demonstrates strong generalization capabilities to unseen cities. Crucially, our work provides empirical evidence of scaling ability in human mobility, validating a clear path toward building increasingly capable foundation models in this domain.
中文: MoveGPT作为人类移动性的大规模基础模型,通过统一位置编码器和空间感知专家混合Transformer克服了先前扩展性限制,在多项任务中实现最高性能(平均提升达35%),并对未见城市展现出强大泛化能力。
English: MoveGPT is a large-scale foundation model for human mobility that overcomes previous scaling limitations through a unified location encoder and a Spatially-Aware Mixture-of-Experts Transformer, achieving state-of-the-art performance with up to 35% average gains across various tasks and demonstrating strong generalization to unseen cities.

Authors:Jiazhen Liu, Ruikun Li, Huandong Wang, Zihan Yu, Chang Liu, Jingtao Ding, Yong Li
Title: Beyond Equilibrium: Non-Equilibrium Foundations Should Underpin Generative Processes in Complex Dynamical Systems
Abstract:
This position paper argues that next-generation non-equilibrium-inspired generative models will provide the essential foundation for better modeling real-world complex dynamical systems. While many classical generative algorithms draw inspiration from equilibrium physics, they are fundamentally limited in representing systems with transient, irreversible, or far-from-equilibrium behavior. We show that non-equilibrium frameworks naturally capture non-equilibrium processes and evolving distributions. Through empirical experiments on a dynamic Prinz potential system, we demonstrate that non-equilibrium generative models better track temporal evolution and adapt to non-stationary landscapes. We further highlight future directions such as integrating non-equilibrium principles with generative AI to simulate rare events, inferring underlying mechanisms, and representing multi-scale dynamics across scientific domains. Our position is that embracing non-equilibrium physics is not merely beneficial but necessary for generative AI to serve as a scientific modeling tool, offering new capabilities for simulating, understanding, and controlling complex systems.
中文: 本立场文件主张,非平衡启发的生成模型对于精确模拟现实世界复杂动力系统至关重要,因为它们克服了基于平衡方法的局限性,并能更好地模拟瞬态和演化过程。
English: This position paper advocates that non-equilibrium-inspired generative models are essential for accurately modeling real-world complex dynamical systems, as they overcome the limitations of equilibrium-based approaches and enable better simulation of transient and evolving processes.

Authors:Fanjin Meng, Jingtao Ding, Jiahui Gong, Chen Yang, Hong Chen, Zuojian Wang, Haisheng Lu, Yong Li
Title: Tuning Language Models for Robust Prediction of Diverse User Behaviors
Abstract:
Predicting user behavior is essential for intelligent assistant services, yet deep learning models often struggle to capture long-tailed behaviors. Large language models (LLMs), with their pretraining on vast corpora containing rich behavioral knowledge, offer promise. However, existing fine-tuning approaches tend to overfit to frequent "anchor" behaviors, reducing their ability to predict less common "tail" behaviors. In this paper, we introduce BehaviorLM, a progressive fine-tuning approach that addresses this issue. In the first stage, LLMs are fine-tuned on anchor behaviors while preserving general behavioral knowledge. In the second stage, fine-tuning uses a balanced subset of all behaviors based on sample difficulty to improve tail behavior predictions without sacrificing anchor performance. Experimental results on two real-world datasets demonstrate that BehaviorLM robustly predicts both anchor and tail behaviors and effectively leverages LLM behavioral knowledge to master tail behavior prediction with few-shot examples.
中文: BehaviorLM提出了一种渐进式微调方法,先在保留通用知识的基础上针对常见行为进行模型微调,再通过平衡样本子集优化长尾行为预测,同时保持核心行为的预测能力。
English: BehaviorLM introduces a progressive fine-tuning method that first adapts LLMs to frequent behaviors while retaining general knowledge, then refines them with a balanced subset to enhance tail behavior predictions without compromising anchor performance.

Authors:Jiahui Gong, Jingtao Ding, Fanjin Meng, Chen Yang, Hong Chen, Zuojian Wang, Haisheng Lu, Yong Li
Title: BehaveGPT: A Foundation Model for Large-scale User Behavior Modeling
Abstract:
In recent years, foundational models have revolutionized the fields of language and vision, demonstrating remarkable abilities in understanding and generating complex data; however, similar advances in user behavior modeling have been limited, largely due to the complexity of behavioral data and the challenges involved in capturing intricate temporal and contextual relationships in user activities. To address this, we propose BehaveGPT, a foundational model designed specifically for large-scale user behavior prediction. Leveraging transformer-based architecture and a novel pretraining paradigm, BehaveGPT is trained on vast user behavior datasets, allowing it to learn complex behavior patterns and support a range of downstream tasks, including next behavior prediction, long-term generation, and cross-domain adaptation. Our approach introduces the DRO-based pretraining paradigm tailored for user behavior data, which improves model generalization and transferability by equitably modeling both head and tail behaviors. Extensive experiments on real-world datasets demonstrate that BehaveGPT outperforms state-of-the-art baselines, achieving more than a 10% improvement in macro and weighted recall, showcasing its ability to effectively capture and predict user behavior. Furthermore, we measure the scaling law in the user behavior domain for the first time on the Honor dataset, providing insights into how model performance scales with increased data and parameter sizes.
中文摘要:BehaveGPT是一种基于Transformer的基础模型,通过创新的DRO预训练范式提升用户行为预测能力,在召回率指标上实现超过10%的提升,并能有效捕捉行为数据中的复杂时序模式。
English Summary: BehaveGPT is a transformer-based foundational model that advances user behavior prediction by employing a novel DRO-based pretraining paradigm, achieving over 10% improvement in recall metrics and effectively capturing complex temporal patterns in behavioral data.

Authors:Haoxin Li, Jingtao Ding, Jiahui Gong, Yong Li
Title: Large language model as user daily behavior data generator: balancing population diversity and individual personality
Abstract:
Predicting human daily behavior is challenging due to the complexity of routine patterns and short-term fluctuations. While data-driven models have improved behavior prediction by leveraging empirical data from various platforms and devices, the reliance on sensitive, large-scale user data raises privacy concerns and limits data availability. Synthetic data generation has emerged as a promising solution, though existing methods are often limited to specific applications. In this work, we introduce BehaviorGen, a framework that uses large language models (LLMs) to generate high-quality synthetic behavior data. By simulating user behavior based on profiles and real events, BehaviorGen supports data augmentation and replacement in behavior prediction models. We evaluate its performance in scenarios such as pretraining augmentation, fine-tuning replacement, and fine-tuning augmentation, achieving significant improvements in human mobility and smartphone usage predictions, with gains of up to 18.9%. Our results demonstrate the potential of BehaviorGen to enhance user behavior modeling through flexible and privacy-preserving synthetic data generation.
中文: BehaviorGen框架利用大型语言模型生成高质量合成行为数据,在提升人类移动和智能手机使用预测准确率高达18.9%的同时,有效解决了隐私保护和数据可用性问题。
English: BehaviorGen is a framework that uses large language models to generate high-quality synthetic behavior data, enhancing prediction models for human mobility and smartphone usage with up to 18.9% improvement while addressing privacy concerns.

Authors:Can Rong, Jingtao Ding, Meng Li, Yong Li
Title: A Global Commuting Origin-Destination Flow Dataset for Urban Sustainable Development
Abstract:
Commuting Origin-Destination (OD) flows capture movements of people from residences to workplaces, representing the predominant form of intra-city mobility and serving as a critical reference for understanding urban dynamics and supporting sustainable policies. However, acquiring such data requires costly, time-consuming censuses. In this study, we introduce a commuting OD flow dataset for cities around the world, spanning 6 continents, 179 countries, and 1,625 cities, providing unprecedented coverage of dynamics under diverse urban environments. Specifically, we collected fine-grained demographic data, satellite imagery, and points of interest (POIs) for each city as foundational inputs to characterize the functional roles of urban regions. Leveraging these, a deep generative model is employed to capture the complex relationships between urban geospatial features and human mobility, enabling the generation of commuting OD flows between urban regions. Comprehensive validation shows that the spatial distributions of the generated flows closely align with real-world observations. We believe this dataset offers a valuable resource for advancing sustainable urban development research in urban science, data science, transportation engineering, and related fields.
中文摘要:本研究通过整合人口统计、卫星图像和兴趣点数据,利用深度学习模型生成了全球通勤OD流数据集,为城市流动性分析提供了可扩展的替代方案,有效克服了传统普查方法的局限性。
English Summary: This study introduces a global commuting OD flow dataset generated using a deep learning model that integrates demographic, satellite, and POI data, providing a scalable alternative to traditional census methods for urban mobility analysis.

Authors:Yuhao Zhang, Xiangnan Ma, Kaiqi Kou, Peizhuo Liu, Weiqiao Shan, Benyou Wang, Tong Xiao, Yuxin Huang, Zhengtao Yu, Jingbo Zhu
Title: Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation
Abstract:
The success of building textless speech-to-speech translation (S2ST) models has attracted much attention. However, S2ST still faces two main challenges: 1) extracting linguistic features for various speech signals, called cross-modal (CM), and 2) learning alignment of different languages in long sequences, called cross-lingual (CL). We propose the unit language to overcome the two modeling challenges. The unit language can be considered a text-like representation format, constructed using $n$-gram language modeling. We implement multi-task learning to utilize the unit language in guiding the speech modeling process. Our initial results reveal a conflict when applying source and target unit languages simultaneously. We propose task prompt modeling to mitigate this conflict. We conduct experiments on four languages of the VoxPopuli dataset. Our method demonstrates significant improvements over a strong baseline and achieves performance comparable to models trained with text.
中文: 本研究提出了一种采用n元语言建模和任务提示的单位语言方法,以解决无文本语音翻译中的跨模态和跨语言挑战,在VoxPopuli数据集上实现了显著改进,并达到了与基于文本模型相当的性能。
English: The study introduces a unit language approach using n-gram modeling and task prompts to address cross-modal and cross-lingual challenges in textless speech-to-speech translation, showing significant improvements and performance comparable to text-based models on the VoxPopuli dataset.

Authors:Ruikun Li, Huandong Wang, Jingtao Ding, Yuan Yuan, Qingmin Liao, Yong Li
Title: Predicting Dynamical Systems across Environments via Diffusive Model Weight Generation
Abstract:
Data-driven methods offer an effective equation-free solution for predicting physical dynamics. However, the same physical system can exhibit significantly different dynamic behaviors in various environments. This causes prediction functions trained for specific environments to fail when transferred to unseen environments. Therefore, cross-environment prediction requires modeling the dynamic functions of different environments. In this work, we propose a model weight generation method, EnvAd-Diff. EnvAd-Diff operates in the weight space of the dynamic function, generating suitable weights from scratch based on environmental condition for zero-shot prediction. Specifically, we first train expert prediction functions on dynamic trajectories from a limited set of visible environments to create a model zoo, thereby constructing sample pairs of prediction function weights and their corresponding environments. Subsequently, we train a latent space diffusion model conditioned on the environment to model the joint distribution of weights and environments. Considering the lack of environmental prior knowledge in real-world scenarios, we propose a physics-informed surrogate label to distinguish different environments. Generalization experiments across multiple systems demonstrate that a 1M parameter prediction function generated by EnvAd-Diff outperforms a pre-trained 500M parameter foundation model.
中文: 本文提出的EnvAd-Diff方法通过基于专家函数训练的扩散模型,根据环境条件生成定制化模型权重,实现了无需预训练的跨环境零样本预测,其性能优于参数规模大得多的基础模型。
English: The proposed EnvAd-Diff method generates customized model weights from environmental conditions using a diffusion model trained on expert functions, enabling zero-shot cross-environment predictions that outperform much larger foundation models.

Authors:Wenqi Wang, Reuben Tan, Pengyue Zhu, Jianwei Yang, Zhengyuan Yang, Lijuan Wang, Andrey Kolobov, Jianfeng Gao, Boqing Gong
Title: SITE: towards Spatial Intelligence Thorough Evaluation
Abstract:
Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models' spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey of 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompted us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts, especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task.
中文: 我们推出了SITE基准数据集,通过标准化多选题形式全面评估视觉语言模型的空间智能,发现现有模型在空间定向能力上落后于人类,并揭示了空间推理与具身AI性能之间的正相关关系。
English: The SITE benchmark is introduced to thoroughly evaluate spatial intelligence in vision-language models through multi-choice visual questions, revealing that current models lag behind humans in spatial orientation and showing a correlation between spatial reasoning and embodied AI performance.

Authors:Xingguang Wei, Haomin Wang, Shenglong Ye, Ruifeng Luo, Yanting Zhang, Lixin Gu, Jifeng Dai, Yu Qiao, Wenhai Wang, Hongjie Zhang
Title: Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings
Abstract:
We study the task of panoptic symbol spotting, which involves identifying both individual instances of countable things and the semantic regions of uncountable stuff in computer-aided design (CAD) drawings composed of vector graphical primitives. Existing methods typically rely on image rasterization, graph construction, or point-based representation, but these approaches often suffer from high computational costs, limited generality, and loss of geometric structural information. In this paper, we propose VecFormer, a novel method that addresses these challenges through line-based representation of primitives. This design preserves the geometric continuity of the original primitive, enabling more accurate shape representation while maintaining a computation-friendly structure, making it well-suited for vector graphic understanding tasks. To further enhance prediction reliability, we introduce a Branch Fusion Refinement module that effectively integrates instance and semantic predictions, resolving their inconsistencies for more coherent panoptic outputs. Extensive experiments demonstrate that our method establishes a new state-of-the-art, achieving 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over the second-best results under settings with and without prior information, respectively, highlighting the strong potential of line-based representation as a foundation for vector graphic understanding.
中文: 本文提出VecFormer方法,通过基于线条的图元表示解决CAD图纸全景符号识别问题,既保持几何连续性又通过分支融合优化模块整合实例与语义预测,以91.1 PQ的成绩创下最新性能记录。
English: This paper introduces VecFormer, a line-based representation method for panoptic symbol spotting in CAD drawings that preserves geometric continuity and integrates instance/semantic predictions through a Branch Fusion Refinement module, achieving state-of-the-art performance with 91.1 PQ.
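
The PQ figures quoted above refer to panoptic quality. As a minimal sketch, the standard definition divides the summed IoU of matched segments by TP + 0.5·FP + 0.5·FN. The function and argument names below are illustrative, and the CAD benchmark's specific way of computing IoU over vector primitives is not reproduced here.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Standard panoptic quality on a 0-1 scale.

    matched_ious: IoU of each true-positive match (conventionally IoU > 0.5)
    num_fp:       number of unmatched predicted segments
    num_fn:       number of unmatched ground-truth segments
    """
    num_tp = len(matched_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0  # nothing predicted and nothing to find
    return sum(matched_ious) / denom


# Usage: two matches, one false positive, one false negative.
pq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
```

Reported scores such as 91.1 PQ are this quantity multiplied by 100.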

Authors:Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, Farshad Khorrami
Title: OSVI-WM: One-Shot Visual Imitation for Unseen Tasks using World-Model-Guided Trajectory Generation
Abstract:
Visual imitation learning enables robotic agents to acquire skills by observing expert demonstration videos. In the one-shot setting, the agent generates a policy after observing a single expert demonstration without additional fine-tuning. Existing approaches typically train and evaluate on the same set of tasks, varying only object configurations, and struggle to generalize to unseen tasks with different semantic or structural requirements. While some recent methods attempt to address this, they exhibit low success rates on hard test tasks that, despite being visually similar to some training tasks, differ in context and require distinct responses. Additionally, most existing methods lack an explicit model of environment dynamics, limiting their ability to reason about future states. To address these limitations, we propose a novel framework for one-shot visual imitation learning via world-model-guided trajectory generation. Given an expert demonstration video and the agent's initial observation, our method leverages a learned world model to predict a sequence of latent states and actions. This latent trajectory is then decoded into physical waypoints that guide the agent's execution. Our method is evaluated on two simulated benchmarks and three real-world robotic platforms, where it consistently outperforms prior approaches, with over 30% improvement in some cases.
中文: 本文提出了一种新颖的单次视觉模仿学习框架,通过习得的世界模型从专家演示生成潜在轨迹,在模拟和真实机器人任务中相比现有方法实现了超过30%的性能提升。
English: This paper introduces a novel framework for one-shot visual imitation learning that uses a learned world model to generate latent trajectories from expert demonstrations, achieving over 30% improvement in performance on simulated and real-world robotic tasks compared to prior methods.

Authors:Ruiyang Xia, Dawei Zhou, Decheng Liu, Lin Yuan, Jie Li, Nannan Wang, Xinbo Gao
Title: Towards Generalized Proactive Defense against Face Swapping with Contour-Hybrid Watermark
Abstract:
Face swapping, recognized as a privacy and security concern, has prompted considerable defensive research. With the advancements in AI-generated content, the discrepancies between the real and swapped faces have become nuanced. Considering the difficulty of detecting forged traces, we shift the focus to the purpose of face swapping and proactively embed elaborate watermarks against unknown face swapping techniques. Given that the constant purpose is to swap the original face identity while preserving the background, we concentrate on the regions surrounding the face to ensure robust watermark generation, while embedding the contour texture and face identity information to achieve progressive image determination. The watermark is located in the facial contour and contains hybrid messages, dubbed the contour-hybrid watermark (CMark). Our approach generalizes face swapping detection without requiring any swapping techniques during training or the storage of large-scale messages in advance. Experiments conducted across 8 face swapping techniques demonstrate the superiority of our approach compared with state-of-the-art passive and proactive detectors while achieving a favorable balance between the image quality and watermark robustness.
中文摘要:本研究提出一种名为CMark的主动水印方法,通过在面部轮廓嵌入混合信息来检测人脸替换,无需预知替换技术即可实现优越性能,并在图像质量与鲁棒性之间取得良好平衡。
English Summary: This study introduces a proactive watermarking method called CMark, which embeds hybrid messages in facial contours to detect face swapping without prior knowledge of swapping techniques, achieving superior performance and balancing image quality with robustness.

Authors:Hongzheng Yang, Yongqiang Chen, Zeyu Qin, Tongliang Liu, Chaowei Xiao, Kun Zhang, Bo Han
Title: Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?
Abstract:
Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models (LLMs) to elicit the aligned and expected behaviors. Despite the empirical success, it has never been examined whether one could locate the faithful concepts for intervention. In this work, we explore the question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and the out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle the issue, we propose Concept Concentration (COCA). Instead of identifying the faithful locations to intervene, COCA refactors the training data with an explicit reasoning process, which first identifies the potential unsafe concepts and then decides the responses. Essentially, COCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that COCA significantly reduces both in-distribution and OOD jailbreak success rates, while maintaining strong performance on regular tasks such as math and code generation.
中文: 表征干预旨在定位和修改大语言模型中的概念表征以引导预期行为,但在非线性设置中难以忠实定位概念,因此提出COCA方法,通过重构训练数据简化有害概念的线性擦除,有效提升安全性同时保持模型性能。
English: Representation intervention in LLMs seeks to modify concept representations for aligned behavior, but achieving faithful concept location is challenging in non-linear settings, prompting the proposed COCA method that reframes training data to simplify harmful concept erasure and enhance safety without compromising performance.

Authors:Patanjali Maithani, Aliasghar Arab, Farshad Khorrami, Prashanth Krishnamurthy
Title: Proactive Hierarchical Control Barrier Function-Based Safety Prioritization in Close Human-Robot Interaction Scenarios
Abstract:
In collaborative human-robot environments, the unpredictable and dynamic nature of human motion can lead to situations where collisions become unavoidable. In such cases, it is essential for the robotic system to proactively mitigate potential harm through intelligent control strategies. This paper presents a hierarchical control framework based on Control Barrier Functions (CBFs) designed to ensure safe and adaptive operation of autonomous robotic manipulators during close-proximity human-robot interaction. The proposed method introduces a relaxation variable that enables real-time prioritization of safety constraints, allowing the robot to dynamically manage collision risks based on the criticality of different parts of the human body. A secondary constraint mechanism is incorporated to resolve infeasibility by increasing the priority of imminent threats. The framework is experimentally validated on a Franka Research 3 robot equipped with a ZED2i AI camera for real-time human pose and body detection. Experimental results confirm that the CBF-based controller, integrated with depth sensing, facilitates responsive and safe human-robot collaboration, while providing detailed risk analysis and maintaining robust performance in highly dynamic settings.
中文摘要:本文提出了一种基于控制屏障函数的分层控制框架,通过实时优先级调整安全约束和深度感知技术,在近距离人机交互中实现自主机械臂的安全自适应操作。
English Summary: This paper introduces a hierarchical control framework using Control Barrier Functions to enable safe human-robot collaboration by dynamically prioritizing safety constraints and managing collision risks in real-time through depth sensing and pose detection.
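
To make the role of a Control Barrier Function concrete, here is a minimal one-dimensional safety filter. It assumes a toy model in which the commanded velocity `u` directly changes the robot-human separation and the barrier is h = dist - d_safe; the names `cbf_filter`, `d_safe`, and `alpha` are hypothetical. The paper's hierarchical prioritization and relaxation variable are not modeled.

```python
def cbf_filter(u_nom, dist, d_safe, alpha=1.0):
    """Minimally modify a nominal command so the barrier condition holds.

    Toy model (assumption): d(dist)/dt = u, barrier h = dist - d_safe.
    The CBF condition dh/dt + alpha * h >= 0 then reduces to u >= -alpha * h.
    """
    h = dist - d_safe     # positive while outside the safety margin
    u_min = -alpha * h    # most negative (approaching) velocity allowed
    # Closest command to u_nom that satisfies u >= u_min.
    return max(u_nom, u_min)


# Usage: the nominal command approaches too fast and is clipped.
u_safe = cbf_filter(u_nom=-2.0, dist=1.0, d_safe=0.5, alpha=1.0)
```

For a scalar input this clipping matches the usual quadratic-program formulation of a CBF filter; with several constraints a QP solver would be needed.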

Authors:Naman Patel, Prashanth Krishnamurthy, Farshad Khorrami
Title: RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation
Abstract:
Mapping and understanding complex 3D environments is fundamental to how autonomous systems perceive and interact with the physical world, requiring both precise geometric reconstruction and rich semantic comprehension. While existing 3D semantic mapping systems excel at reconstructing and identifying predefined object instances, they lack the flexibility to efficiently build semantic maps with open-vocabulary during online operation. Although recent vision-language models have enabled open-vocabulary object recognition in 2D images, they haven't yet bridged the gap to 3D spatial understanding. The critical challenge lies in developing a training-free unified system that can simultaneously construct accurate 3D maps while maintaining semantic consistency and supporting natural language interactions in real time. In this paper, we develop a zero-shot framework that seamlessly integrates GPU-accelerated geometric reconstruction with open-vocabulary vision-language models through online instance-level semantic embedding fusion, guided by hierarchical object association with spatial indexing. Our training-free system achieves superior performance through incremental processing and unified geometric-semantic updates, while robustly handling 2D segmentation inconsistencies. The proposed general-purpose 3D scene understanding framework can be used for various tasks including zero-shot 3D instance retrieval, segmentation, and object detection to reason about previously unseen objects and interpret natural language queries. The project page is available at https://razer-3d.github.io.
Chinese: 本文提出了一种零样本框架,将实时三维几何重建与开放词汇语义理解相结合,无需预先训练即可实现动态地图构建和自然语言交互。
English: This paper introduces a zero-shot framework that integrates real-time 3D geometric reconstruction with open-vocabulary semantic understanding, enabling dynamic mapping and natural language interaction without prior training.

Authors:Huaijie Wang, De Cheng, Guozhang Li, Zhipeng Xu, Lingfeng He, Jie Li, Nannan Wang, Xinbo Gao
Title: StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning
Abstract:
Video Class-Incremental Learning (VCIL) seeks to develop models that continuously learn new action categories over time without forgetting previously acquired knowledge. Unlike traditional Class-Incremental Learning (CIL), VCIL introduces the added complexity of spatiotemporal structures, making it particularly challenging to mitigate catastrophic forgetting while effectively capturing both frame-shared semantics and temporal dynamics. Existing approaches either rely on exemplar rehearsal, raising concerns over memory and privacy, or adapt static image-based methods that neglect temporal modeling. To address these limitations, we propose Spatiotemporal Preservation and Routing (StPR), a unified and exemplar-free VCIL framework that explicitly disentangles and preserves spatiotemporal information. First, we introduce Frame-Shared Semantics Distillation (FSSD), which identifies semantically stable and meaningful channels by jointly considering semantic sensitivity and classification contribution. These important semantic channels are selectively regularized to maintain prior knowledge while allowing for adaptation. Second, we design a Temporal Decomposition-based Mixture-of-Experts (TD-MoE), which dynamically routes task-specific experts based on their temporal dynamics, enabling inference without task ID or stored exemplars. Together, StPR effectively leverages spatial semantics and temporal dynamics, achieving a unified, exemplar-free VCIL framework. Extensive experiments on UCF101, HMDB51, and Kinetics400 show that our method outperforms existing baselines while offering improved interpretability and efficiency in VCIL. Code is available in the supplementary materials.
中文: 提出的时空保持与路由(StPR)框架通过帧共享语义蒸馏和基于时序分解的专家混合机制,无需样本回放即可有效保持时空信息,在多个数据集的视频类增量学习中实现了优越性能。
English: The proposed Spatiotemporal Preservation and Routing (StPR) framework introduces Frame-Shared Semantics Distillation and Temporal Decomposition-based Mixture-of-Experts to effectively preserve spatiotemporal information without exemplars, achieving superior performance in video class-incremental learning across multiple datasets.

Authors:Chenxi Liu, Yongqiang Chen, Tongliang Liu, James Cheng, Bo Han, Kun Zhang
Title: On the Thinking-Language Modeling Gap in Large Language Models
Abstract:
System 2 reasoning is one of the defining characteristics of intelligence, which requires slow and logical thinking. Humans conduct System 2 reasoning via the language of thoughts that organizes the reasoning process as a causal sequence of mental language, or thoughts. Recently, it has been observed that System 2 reasoning can be elicited from Large Language Models (LLMs) pre-trained on large-scale natural languages. However, in this work, we show that there is a significant gap between the modeling of languages and thoughts. As language is primarily a tool for humans to share knowledge and thinking, modeling human language can easily introduce into LLMs language biases that deviate from the chain of thoughts in the mind. Furthermore, we show that the biases will mislead the eliciting of "thoughts" in LLMs to focus only on a biased part of the premise. To this end, we propose a new prompt technique termed Language-of-Thoughts (LoT) to demonstrate and alleviate this gap. Instead of directly eliciting the chain of thoughts from partial information, LoT instructs LLMs to adjust the order and tokens used for the expressions of all the relevant information. We show that the simple strategy significantly reduces the language modeling biases in LLMs and improves the performance of LLMs across a variety of reasoning tasks.
中文: 本研究揭示了大型语言模型在语言与思维建模之间存在偏差,提出“思维语言”提示技术以减少语言模型偏见,从而提升各类推理任务的性能。
English: This study reveals that large language models (LLMs) exhibit a gap between language and thought modeling, leading to biased reasoning, and proposes a Language-of-Thoughts (LoT) prompting technique to reduce biases and enhance performance across reasoning tasks.

Authors:Peichao Lai, Kexuan Zhang, Yi Lin, Linyihan Zhang, Feiyang Ye, Jinhao Yan, Yanwei Xu, Conghui He, Yilei Wang, Wentao Zhang, Bin Cui
Title: SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models
Abstract:
Subjective Answer Grading (SAG) plays a crucial role in education, standardized testing, and automated assessment systems, particularly for evaluating short-form responses in Short Answer Scoring (SAS). However, existing approaches often produce coarse-grained scores and lack detailed reasoning. Although large language models (LLMs) have demonstrated potential as zero-shot evaluators, they remain susceptible to bias, inconsistencies with human judgment, and limited transparency in scoring decisions. To overcome these limitations, we introduce SAS-Bench, a benchmark specifically designed for LLM-based SAS tasks. SAS-Bench provides fine-grained, step-wise scoring, expert-annotated error categories, and a diverse range of question types derived from real-world subject-specific exams. This benchmark facilitates detailed evaluation of model reasoning processes and explainability. We also release an open-source dataset containing 1,030 questions and 4,109 student responses, each annotated by domain experts. Furthermore, we conduct comprehensive experiments with various LLMs, identifying major challenges in scoring science-related questions and highlighting the effectiveness of few-shot prompting in improving scoring accuracy. Our work offers valuable insights into the development of more robust, fair, and educationally meaningful LLM-based evaluation systems.
中文:SAS-Bench作为专为大语言模型短答案评分设计的基准,通过细粒度评分和专家标注解决现有方法的偏差与透明度问题,实验揭示了科学类题目的评分挑战及小样本提示的有效提升作用。
English: SAS-Bench is introduced as a specialized benchmark for LLM-based short answer scoring, offering fine-grained evaluation and expert annotations to address biases and transparency issues, with experiments showing challenges in science questions and the benefits of few-shot prompting.

Authors:Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
Title: 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks
Abstract:
Robotic manipulation in 3D requires learning an $N$ degree-of-freedom joint space trajectory of a robot manipulator. Robots must possess semantic and visual perception abilities to transform real-world mappings of their workspace into the low-level control necessary for object manipulation. Recent work has demonstrated the capabilities of fine-tuning large Vision-Language Models (VLMs) to learn the mapping between RGB images, language instructions, and joint space control. These models typically take as input RGB images of the workspace and language instructions, and are trained on large datasets of teleoperated robot demonstrations. In this work, we explore methods to improve the scene context awareness of a popular recent Vision-Language-Action model by integrating chain-of-thought reasoning, depth perception, and task-oriented region of interest detection. Our experiments in the LIBERO simulation environment show that our proposed model, 3D-CAVLA, improves the success rate across various LIBERO task suites, achieving an average success rate of 98.1$\%$. We also evaluate the zero-shot capabilities of our method, demonstrating that 3D scene awareness leads to robust learning and adaptation for completely unseen tasks. 3D-CAVLA achieves an absolute improvement of 8.8$\%$ on unseen tasks. We will open-source our code and the unseen tasks dataset to promote community-driven research here: https://3d-cavla.github.io
Chinese: 本研究提出3D-CAVLA模型,通过整合思维链推理、深度感知和任务导向的感兴趣区域检测,显著提升了机器人操作在仿真环境中已知任务和未知任务的成功率。
English: The study introduces 3D-CAVLA, an enhanced vision-language-action model that integrates chain-of-thought reasoning, depth perception, and task-oriented detection to significantly improve robotic manipulation success rates in both seen and unseen tasks within simulation environments.

Authors:Xi Yang, Songsong Duan, Nannan Wang, Xinbo Gao
Title: Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization
Abstract:
Weakly Supervised Object Localization (WSOL), which aims to localize objects by only using image-level labels, has attracted much attention because of its low annotation cost in real applications. Current studies focus on the Class Activation Map (CAM) of CNN and the self-attention map of transformer to identify the region of objects. However, both CAM and self-attention maps cannot learn pixel-level fine-grained information on the foreground objects, which hinders the further advance of WSOL. To address this problem, we leverage the capability of zero-shot generalization and fine-grained segmentation in Segment Anything Model (SAM) to boost the activation of integral object regions. Further, to alleviate the semantic ambiguity issue arising in single point prompt-based SAM, we propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for the WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt, where the GTFormer jointly embeds patch tokens and novel global tokens to learn foreground semantics. Secondly, we deliver grid points as dense prompts into SAM to maximize the probability of foreground mask, which avoids the missing objects caused by a single point/box prompt. Finally, we propose a pixel-level similarity metric to realize the mask matching from mask prompt to SAM, where the mask with the highest score is viewed as the final localization map. Experiments show that the proposed Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC, with 84.03\% and 66.85\% Top-1 Loc, respectively.
中文摘要:本文提出Pro2SAM网络,通过结合SAM模型与掩码提示策略及网格点,改进弱监督目标定位方法,解决了现有技术中精细信息缺失的问题,在标准数据集上实现了最优的定位性能。
English Summary: This paper introduces Pro2SAM, a novel network that enhances Weakly Supervised Object Localization by integrating the Segment Anything Model with a mask prompt strategy and grid points to overcome limitations in existing methods, achieving state-of-the-art localization accuracy on benchmark datasets.

Authors:Songsong Duan, Xi Yang, Nannan Wang, Xinbo Gao
Title: Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective
Abstract:
Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency. Meanwhile, several existing lightweight methods struggle to achieve high-precision performance. To balance the efficiency and performance, we propose a Speed-Accuracy Tradeoff Network (SATNet) for Lightweight RGB-D SOD from three fundamental perspectives: depth quality, modality fusion, and feature representation. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps, which effectively alleviates the multi-modal gaps in the current datasets. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities. Here, the multi-modal features are decoupled into dual-view feature vectors to project discriminable information of feature maps. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework to enlarge the limited feature space generated by the lightweight backbones. DIRM models texture features and saliency features to enrich feature space, and employs two-way prediction heads to optimize its parameters through a bi-directional backpropagation. Finally, we design a Dual Feature Aggregation Module (DFAM) in the decoder to aggregate texture and saliency features. Extensive experiments on five public RGB-D SOD datasets indicate that the proposed SATNet outperforms state-of-the-art (SOTA) CNN-based heavyweight models and achieves a lightweight framework with 5.2 M parameters and 415 FPS.
中文摘要:本研究提出的SATNet通过提升深度质量、优化模态融合和增强特征表示,在轻量级RGB-D显著目标检测中实现了效率与精度的平衡,仅用520万参数和415 FPS就达到了最先进性能。
English Summary: The proposed SATNet balances efficiency and accuracy in lightweight RGB-D salient object detection by improving depth quality, modality fusion, and feature representation, achieving state-of-the-art performance with only 5.2M parameters and 415 FPS.

Authors:Dingchu Zhang, Yida Zhao, Jialong Wu, Baixuan Li, Wenbiao Yin, Liwen Zhang, Yong Jiang, Yufeng Li, Kewei Tu, Pengjun Xie, Fei Huang
Title: EvolveSearch: An Iterative Self-Evolving Search Agent
Abstract:
The rapid advancement of large language models (LLMs) has transformed the landscape of agentic information seeking capabilities through the integration of tools such as search engines and web browsers. However, current mainstream approaches for enabling LLM web search proficiency face significant challenges: supervised fine-tuning (SFT) struggles with data production in open-search domains, while reinforcement learning (RL) converges quickly, limiting its data utilization efficiency. To address these issues, we propose EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web search capabilities without any external human-annotated reasoning data. Extensive experiments on seven multi-hop question-answering (MHQA) benchmarks demonstrate that EvolveSearch consistently improves performance across iterations, ultimately achieving an average improvement of 4.7\% over the current state-of-the-art across seven benchmarks, opening the door to self-evolution agentic capabilities in open web search domains.
Chinese: 提出的EvolveSearch框架结合监督微调与强化学习,通过迭代自我进化增强大语言模型的网络搜索能力,在无需人工标注数据的情况下,于七大基准测试中平均性能比现有最优方法提升4.7%。
English: The proposed EvolveSearch framework combines supervised fine-tuning and reinforcement learning to iteratively enhance large language models' web search capabilities, achieving a 4.7% average improvement over state-of-the-art methods across seven benchmarks without human-annotated data.

Authors:Kun Li, Yunxiang Li, Tianhua Zhang, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
Title: RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning
Abstract:
Robust evaluation is critical for deploying trustworthy retrieval-augmented generation (RAG) systems. However, current LLM-based evaluation frameworks predominantly rely on directly prompting resource-intensive models with complex multi-stage prompts, underutilizing models' reasoning capabilities and introducing significant computational cost. In this paper, we present RAG-Zeval (RAG-Zero Evaluator), a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive and sound assessments with detailed explanations in one pass. We introduce a ranking-based outcome reward mechanism, using preference judgments rather than absolute scores, to address the challenge of obtaining precise pointwise reward signals. To this end, we synthesize the ranking references by generating quality-controlled responses with zero human annotation. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments and outperforming baselines that rely on LLMs with 10-100 times more parameters. Our approach also exhibits superior interpretability in response evaluation.
中文: RAG-Zeval是一种新颖的端到端框架,通过强化学习训练紧凑模型,以规则引导的推理方式评估检索增强生成系统,在显著降低计算成本的同时,实现了优于大型模型的性能表现和评估可解释性。
English: RAG-Zeval is a novel end-to-end framework that trains compact models using reinforcement learning to evaluate retrieval-augmented generation systems through rule-guided reasoning, achieving superior performance and interpretability while significantly reducing computational costs compared to larger models.

Authors:Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen
Title: Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
Abstract:
Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there is little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model's zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement learning based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense object grounding, and domain-specific scenarios, including small object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 also demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
中文: 本文提出ACTIVE-O3强化学习框架,旨在赋予多模态大语言模型主动感知能力以提升搜索效率和区域选择精度,并通过综合基准测试验证其性能。
English: This paper introduces ACTIVE-O3, a reinforcement learning framework that equips Multimodal Large Language Models with active perception capabilities to improve search efficiency and accuracy, supported by a comprehensive benchmark for evaluation.

Authors:Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, Chunhua Shen
Title: Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
Abstract:
Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, namely Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
中文摘要:该研究提出Omni-R1框架,通过强化学习协调全局推理与细节理解双系统,在保持高分辨率精准定位的同时实现长序列高效处理,在多项挑战性任务中超越现有方法并显著提升泛化能力。
English Summary: The study introduces Omni-R1, a two-system framework using reinforcement learning to balance global video-audio reasoning and fine-grained pixel understanding, achieving superior performance on challenging benchmarks while enhancing generalization and reducing hallucinations.

Authors:Yang Zhang, Wenxin Xu, Xiaoyan Zhao, Wenjie Wang, Fuli Feng, Xiangnan He, Tat-Seng Chua
Title: Reinforced Latent Reasoning for LLM-based Recommendation
Abstract:
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks, sparking growing interest in their application to preference reasoning in recommendation systems. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. However, these methods face significant practical limitations due to (1) the difficulty of obtaining high-quality CoT data in recommendation and (2) the high inference latency caused by generating CoT reasoning. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning. This approach eliminates the need for explicit CoT generation and improves inference efficiency, as a small set of latent tokens can effectively capture the entire reasoning process. Building on this idea, we propose Reinforced Latent Reasoning for Recommendation (LatentR$^3$), a novel end-to-end training framework that leverages reinforcement learning (RL) to optimize latent reasoning without relying on any CoT data. LatentR$^3$ adopts a two-stage training strategy: first, supervised fine-tuning to initialize the latent reasoning module, followed by pure RL training to encourage exploration through a rule-based reward design. Our RL implementation is based on a modified GRPO algorithm, which reduces computational overhead during training and introduces continuous reward signals for more efficient learning. Extensive experiments demonstrate that LatentR$^3$ enables effective latent reasoning without any direct supervision of the reasoning process, significantly improving performance when integrated with different LLM-based recommendation methods. Our codes are available at https://anonymous.4open.science/r/R3-A278/.
中文: 本文提出LatentR³框架,通过强化学习优化潜在推理过程,无需依赖思维链数据即可有效提升基于大语言模型的推荐系统性能。
English: This paper introduces LatentR³, a novel reinforcement learning framework that replaces explicit chain-of-thought reasoning with efficient latent reasoning to enhance LLM-based recommendations without requiring CoT data.

Authors:Xiaoyuan Li, Keqin Bao, Yubo Ma, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu, Junyang Lin
Title: MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation
Abstract:
Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for LLMs' Multi-Turn Reasoning evaluation. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities and fine-grained difficulty levels, and necessitates multi-turn interactions with the environments. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks. Further analysis of these results offers valuable insights for future research in interactive AI systems.
中文摘要: 尽管大语言模型在复杂推理任务中展现出潜力,但当前评估主要关注单轮推理,缺乏对交互式任务的探索;为此提出的MTR-Bench通过自动化多轮推理评估框架发现,即使最先进的推理模型也难以胜任交互式任务。
English Summary: Recent advances in Large Language Models show promise in complex reasoning, but current evaluations overlook interactive tasks, prompting the creation of MTR-Bench for automated multi-turn reasoning assessment, which reveals that even top models struggle with interactive reasoning.

Authors:Zhenghao Xing, Xiaowei Hu, Chi-Wing Fu, Wenhai Wang, Jifeng Dai, Pheng-Ann Heng
Title: EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
Abstract:
Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet they often struggle with structured cross-modal reasoning, particularly when integrating audio and visual signals. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs. Built upon the Qwen2.5-Omni-7B foundation and optimized with Group Relative Policy Optimization (GRPO), EchoInk-R1 tackles multiple-choice question answering over synchronized audio-image pairs. To enable this, we curate AVQA-R1-6K, a dataset pairing such audio-image inputs with multiple-choice questions derived from OmniInstruct-v1. EchoInk-R1-7B achieves 85.77% accuracy on the validation set, outperforming the base model, which scores 80.53%, using only 562 reinforcement learning steps. Beyond accuracy, EchoInk-R1 demonstrates reflective reasoning by revisiting initial interpretations and refining responses when facing ambiguous multimodal inputs. These results suggest that lightweight reinforcement learning fine-tuning enhances cross-modal reasoning in MLLMs. EchoInk-R1 is the first framework to unify audio, visual, and textual modalities for general open-world reasoning via reinforcement learning. Code and data are publicly released to facilitate further research.
中文: EchoInk-R1是一个强化学习框架,通过轻量级微调和反思性响应优化,在多模态大语言模型中提升了音视频问答的推理能力,验证集准确率达到85.77%。
English: EchoInk-R1 is a reinforcement learning framework that enhances multimodal reasoning in large language models, achieving 85.77% accuracy on audio-visual question answering through lightweight fine-tuning and reflective response refinement.
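
Several entries in this list (ACTIVE-O3, Omni-R1, LatentR$^3$, EchoInk-R1) build on Group Relative Policy Optimization. As a rough illustration of the group-relative idea only, the sketch below normalizes the rewards of several responses sampled for one prompt by the group's mean and standard deviation. It is not taken from any of these papers: the function name, the `eps` constant, and the example rewards are assumptions made for illustration, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per response: (reward - group mean) / group std.

    `rewards` holds scalar rewards for responses sampled from the same prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled answers: 1.0 = correct, 0.0 = wrong.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above the group average receive positive advantages and those below receive negative ones, so no separate value network is needed to provide a baseline.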

Authors:Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, Jingren Zhou
Title: ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Abstract:
Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real search engine with simulated searches during training. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
中文: ZeroSearch框架通过模拟搜索和渐进式课程策略,解决了大语言模型在强化学习中面临的文档质量不可控和API成本过高的问题,有效提升了检索能力,其性能可媲美甚至超越真实搜索引擎。
English: The ZeroSearch framework addresses challenges of unpredictable document quality and high API costs in reinforcement learning for large language models by using simulated searches and a curriculum-based strategy to progressively enhance retrieval capabilities, achieving performance comparable to or surpassing real search engines.
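
The ZeroSearch abstract describes a curriculum that incrementally degrades the quality of simulated documents during RL training. One way such a schedule could look is sketched below, under loud assumptions: the linear interpolation, the function name `noise_probability`, and the default start and end values are all hypothetical and are not the schedule used in the paper.

```python
def noise_probability(step, total_steps, p_start=0.0, p_end=0.5):
    """Probability of serving a noisy simulated document at a training step.

    Hypothetical linear curriculum: starts at `p_start` and rises to `p_end`
    as `step` goes from 0 to `total_steps`. All defaults are assumptions.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay within the schedule.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return p_start + progress * (p_end - p_start)


if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, noise_probability(step, total_steps=100))
```

Early rollouts would then see mostly useful documents, while later rollouts face a growing share of noisy ones, matching the "increasingly challenging retrieval scenarios" the abstract mentions.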

Authors:Jinglong Gao, Xiao Ding, Lingxiao Zou, Bing Qin, Ting Liu
Title: CrossICL: Cross-Task In-Context Learning via Unsupervised Demonstration Transfer
Abstract:
In-Context Learning (ICL) enhances the performance of large language models (LLMs) with demonstrations. However, obtaining these demonstrations primarily relies on manual effort. In most real-world scenarios, users are often unwilling or unable to provide such demonstrations. Inspired by the human analogy, we explore a new ICL paradigm CrossICL to study how to utilize existing source task demonstrations in the ICL for target tasks, thereby obtaining reliable guidance without any additional manual effort. To explore this, we first design a two-stage alignment strategy to mitigate the interference caused by gaps across tasks, as the foundation for our experimental exploration. Based on it, we conduct comprehensive exploration of CrossICL, with 875 NLP tasks from the Super-NI benchmark and six types of LLMs, including GPT-4o. Experimental results demonstrate the effectiveness of CrossICL and provide valuable insights on questions like the criteria for selecting cross-task demonstrations, as well as the types of task-gap-induced interference in CrossICL.
Chinese: CrossICL提出了一种新的上下文学习范式,通过利用源任务的演示来指导目标任务而无需人工参与,采用两阶段对齐策略弥合任务差异,并在875个NLP任务和多种大语言模型上的实验验证了其有效性。
English: CrossICL introduces a novel in-context learning paradigm that leverages demonstrations from source tasks to guide target tasks without manual effort, using a two-stage alignment strategy to bridge task gaps and demonstrating effectiveness through extensive experiments with 875 NLP tasks and multiple LLMs.

Authors:Jinglong Gao, Xiao Ding, Lingxiao Zou, Bibo Cai, Bing Qin, Ting Liu
Title: ExpeTrans: LLMs Are Experiential Transfer Learners
Abstract:
Recent studies provide large language models (LLMs) with textual task-solving experiences via prompts to improve their performance. However, previous methods rely on substantial human labor or time to gather such experiences for each task, which is impractical given the growing variety of task types in user queries to LLMs. To address this issue, we design an autonomous experience transfer framework to explore whether LLMs can mimic human cognitive intelligence to autonomously transfer experience from existing source tasks to newly encountered target tasks. This not only allows the acquisition of experience without extensive costs of previous methods, but also offers a novel path for the generalization of LLMs. Experimental results on 13 datasets demonstrate that our framework effectively improves the performance of LLMs. Furthermore, we provide a detailed analysis of each module in the framework.
中文: 本研究设计了一种自主经验迁移框架,使大语言模型能够模拟人类认知智能,将已有任务经验自主迁移至新任务,不仅降低了传统方法的高成本,还为模型泛化提供了新路径。
English: This study introduces an autonomous experience transfer framework that enables large language models to independently apply knowledge from source tasks to new target tasks, enhancing performance without the high costs of traditional methods and offering a novel approach for model generalization.

Authors:Shifang Zhao, Yiheng Lin, Lu Han, Yao Zhao, Yunchao Wei
Title: OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning
Abstract:
While anomaly detection has made significant progress, generating detailed analyses that incorporate industrial knowledge remains a challenge. To address this gap, we introduce OmniAD, a novel framework that unifies anomaly detection and understanding for fine-grained analysis. OmniAD is a multimodal reasoner that combines visual and textual reasoning processes. The visual reasoning provides detailed inspection by leveraging Text-as-Mask Encoding to perform anomaly detection through text generation without manually selected thresholds. Following this, Visual Guided Textual Reasoning conducts comprehensive analysis by integrating visual perception. To enhance few-shot generalization, we employ an integrated training strategy that combines supervised fine-tuning (SFT) with reinforcement learning (GRPO), incorporating three sophisticated reward functions. Experimental results demonstrate that OmniAD achieves a performance of 79.1 on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. It also shows strong results across multiple anomaly detection benchmarks. These results highlight the importance of enhancing visual perception for effective reasoning in anomaly understanding. All codes and models will be publicly available.
中文: OmniAD是一个多模态框架,通过结合视觉与文本推理来提升异常检测与理解能力,并凭借创新的训练策略在多个基准测试中取得领先性能。
English: OmniAD is a multimodal framework that integrates visual and textual reasoning to advance anomaly detection and understanding, achieving top performance on benchmarks through innovative training strategies.

Authors:Yiheng Lin, Shifang Zhao, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei
Title: AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment
Abstract:
Personalized image generation aims to integrate user-provided concepts into text-to-image models, enabling the generation of customized content based on a given prompt. Recent zero-shot approaches, particularly those leveraging diffusion transformers, incorporate reference image information through a multi-modal attention mechanism. This integration allows the generated output to be influenced by both the textual prior from the prompt and the visual prior from the reference image. However, we observe that when the prompt and reference image are misaligned, the generated results exhibit a stronger bias toward the textual prior, leading to a significant loss of reference content. To address this issue, we propose AlignGen, a Cross-Modality Prior Alignment mechanism that enhances personalized image generation by: 1) introducing a learnable token to bridge the gap between the textual and visual priors, 2) incorporating a robust training strategy to ensure proper prior alignment, and 3) employing a selective cross-modal attention mask within the multi-modal attention mechanism to further align the priors. Experimental results demonstrate that AlignGen outperforms existing zero-shot methods and even surpasses popular test-time optimization approaches.
中文: AlignGen提出了一种跨模态先验对齐机制,通过可学习令牌、鲁棒训练策略和选择性注意力掩码,解决了个性化图像生成中提示与参考图像不匹配时文本先验过强的问题,其性能优于现有零样本方法。
English: AlignGen introduces a cross-modality prior alignment mechanism to address the bias toward textual priors in personalized image generation when prompts and reference images are misaligned, outperforming existing zero-shot methods through learnable tokens, robust training, and selective attention masks.

Authors:Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu
Title: Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning
Abstract:
While reasoning-augmented large language models (RLLMs) significantly enhance complex task performance through extended reasoning chains, they inevitably introduce substantial unnecessary token consumption, particularly for simpler problems where Short Chain-of-Thought (Short CoT) suffices. This overthinking phenomenon leads to inefficient resource usage without proportional accuracy gains. To address this issue, we propose Self-Route, a dynamic reasoning framework that automatically selects between general and reasoning modes based on model capability estimation. Our approach introduces a lightweight pre-inference stage to extract capability-aware embeddings from hidden layer representations, enabling real-time evaluation of the model's ability to solve problems. We further construct Gradient-10K, a model difficulty estimation-based dataset with dense complexity sampling, to train the router for precise capability boundary detection. Extensive experiments demonstrate that Self-Route achieves comparable accuracy to reasoning models while reducing token consumption by 30-55% across diverse benchmarks. The proposed framework demonstrates consistent effectiveness across models with different parameter scales and reasoning paradigms, highlighting its general applicability and practical value.
中文摘要:Self-Route是一种动态推理框架,通过基于实时能力评估自动选择通用模式与推理模式,在保持相当准确率的同时,成功将大型语言模型的令牌消耗降低了30-55%。
English Summary: Self-Route is a dynamic reasoning framework that reduces unnecessary token consumption in large language models by automatically selecting between general and reasoning modes based on real-time capability assessment, achieving 30-55% token reduction while maintaining comparable accuracy.

Authors:Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu
Title: How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation
Abstract:
Pre-trained language models represented by the Transformer have been proven to possess strong base capabilities, and the representative self-attention mechanism in the Transformer has become a classic in sequence modeling architectures. Unlike work that proposes sequence modeling architectures to improve the efficiency of the attention mechanism, this work focuses on the impact of sequence modeling architectures on base capabilities. Specifically, our concern is: How exactly do sequence modeling architectures affect the base capabilities of pre-trained language models? In this work, we first point out that the mixed domain pre-training setting commonly adopted in existing architecture design works fails to adequately reveal the differences in base capabilities among various architectures. To address this, we propose a limited domain pre-training setting with out-of-distribution testing, which successfully uncovers significant differences in base capabilities among architectures at an early stage. Next, we analyze the base capabilities of stateful sequence modeling architectures, and find that they exhibit significant degradation in base capabilities compared to the Transformer. Then, through a series of architecture component analyses, we summarize a key architecture design principle: A sequence modeling architecture needs to possess full-sequence arbitrary selection capability to avoid degradation in base capabilities. Finally, we empirically validate this principle using an extremely simple Top-1 element selection architecture and further generalize it to a more practical Top-1 chunk selection architecture. Experimental results demonstrate our proposed sequence modeling architecture design principle and suggest that our work can serve as a valuable reference for future architecture improvements and novel designs.
中文: 本研究通过创新的有限领域预训练设置发现,序列建模架构对预训练语言模型的基础能力具有重要影响,缺乏全序列任意选择能力的架构会出现性能退化,并提出并验证了基于简单选择架构的设计原则。
English: This study identifies that sequence modeling architectures significantly impact the base capabilities of pre-trained language models, revealing through a novel limited domain pre-training setting that architectures lacking full-sequence arbitrary selection capability suffer performance degradation, and proposes a design principle validated by simple selection architectures.

Authors:Weixiang Zhao, Yulin Hu, Yang Deng, Tongtong Wu, Wenxuan Zhang, Jiahe Guo, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Title: MPO: Multilingual Safety Alignment via Reward Gap Optimization
Abstract:
Large language models (LLMs) have become increasingly central to AI applications worldwide, necessitating robust multilingual safety alignment to ensure secure deployment across diverse linguistic contexts. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. To address these limitations, we introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (English) to improve safety alignment across multiple languages. MPO directly minimizes the reward gap difference between the dominant language and target languages, effectively transferring safety capabilities while preserving the original strengths of the dominant language. Extensive experiments on three LLMs, LLaMA-3.1, Gemma-2 and Qwen2.5, validate MPO's efficacy in multilingual safety alignment without degrading general multilingual utility.
中文: 本研究提出多语言奖励差距优化方法(MPO),通过利用英语的安全对齐能力来提升多语言安全性能,同时保持模型原有的通用多语言效用。
English: To address the limitations of monolingual safety alignment methods in multilingual contexts, this study introduces Multilingual reward gaP Optimization (MPO), which leverages English's safety capabilities to enhance alignment across languages while preserving general utility.

Authors:Zhouhao Sun, Zhiyuan Kan, Xiao Ding, Li Du, Yang Zhao, Bing Qin, Ting Liu
Title: Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing
Abstract:
Despite significant progress, recent studies have indicated that current large language models (LLMs) may still utilize bias during inference, leading to the poor generalizability of LLMs. Some benchmarks are proposed to investigate the generalizability of LLMs, with each piece of data typically containing one type of controlled bias. However, a single piece of data may contain multiple types of biases in practical applications. To bridge this gap, we propose a multi-bias benchmark where each piece of data contains five types of biases. The evaluations conducted on this benchmark reveal that the performance of existing LLMs and debiasing methods is unsatisfying, highlighting the challenge of eliminating multiple types of biases simultaneously. To overcome this challenge, we propose a causal effect estimation-guided multi-bias elimination method (CMBE). This method first estimates the causal effect of multiple types of biases simultaneously. Subsequently, we eliminate the causal effect of biases from the total causal effect exerted by both the semantic information and biases during inference. Experimental results show that CMBE can effectively eliminate multiple types of bias simultaneously to enhance the generalizability of LLMs.
中文摘要:现有大语言模型因多重偏见影响泛化能力,为此提出多偏见基准和基于因果效应的消除方法,有效提升了模型的泛化性能。
English Summary: Current large language models struggle with multiple biases affecting their generalizability, prompting the development of a multi-bias benchmark and a causal effect-based elimination method that effectively enhances model performance.

Authors:Weixiang Zhao, Xingyu Sui, Yulin Hu, Jiahe Guo, Haixiao Liu, Biye Li, Yanyan Zhao, Bing Qin, Ting Liu
Title: Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment
Abstract:
Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA's robustness in reconciling conflicting user preferences, sustaining long-term personalization and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.
中文: RLPA框架通过强化学习在对话中动态推断和优化用户画像,实现了最先进的个性化对齐效果,其效能与效率均超越现有方法及商业模型。
English: The RLPA framework employs reinforcement learning to dynamically infer and refine user profiles through dialogue, achieving state-of-the-art personalized alignment that surpasses existing methods and commercial models in effectiveness and efficiency.

Authors:Weixiang Zhao, Jiahe Guo, Yang Deng, Tongtong Wu, Wenxuan Zhang, Yulin Hu, Xingyu Sui, Yanyan Zhao, Wanxiang Che, Bing Qin, Tat-Seng Chua, Ting Liu
Title: When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners
Abstract:
Multilingual reasoning remains a significant challenge for large language models (LLMs), with performance disproportionately favoring high-resource languages. Drawing inspiration from cognitive neuroscience, which suggests that human reasoning functions largely independently of language processing, we hypothesize that LLMs similarly encode reasoning and language as separable components that can be disentangled to enhance multilingual reasoning. To evaluate this, we perform a causal intervention by ablating language-specific representations at inference time. Experiments on 10 open-source LLMs spanning 11 typologically diverse languages show that this language-specific ablation consistently boosts multilingual reasoning performance. Layer-wise analyses further confirm that language and reasoning representations can be effectively decoupled throughout the model, yielding improved multilingual reasoning capabilities, while preserving top-layer language features remains essential for maintaining linguistic fidelity. Compared to post-training such as supervised fine-tuning or reinforcement learning, our training-free ablation achieves comparable or superior results with minimal computational overhead. These findings shed light on the internal mechanisms underlying multilingual reasoning in LLMs and suggest a lightweight and interpretable strategy for improving cross-lingual generalization.
Chinese: 通过因果干预分离语言特定表征与推理过程,大型语言模型能够在不增加训练的情况下,有效提升多语言推理能力,并在多种语言中实现性能提升。
English: Large language models can enhance multilingual reasoning by separating language-specific representations from reasoning processes, as shown through causal interventions that improve performance across diverse languages without additional training.

Authors:Haochun Wang, Sendong Zhao, Jingbo Wang, Zewen Qiang, Bing Qin, Ting Liu
Title: Beyond Frameworks: Unpacking Collaboration Strategies in Multi-Agent Systems
Abstract:
Multi-agent collaboration has emerged as a pivotal paradigm for addressing complex, distributed tasks in large language model (LLM)-driven applications. While prior research has focused on high-level architectural frameworks, the granular mechanisms governing agents, critical to performance and scalability, remain underexplored. This study systematically investigates four dimensions of collaboration strategies: (1) agent governance, (2) participation control, (3) interaction dynamics, and (4) dialogue history management. Through rigorous experimentation under two context-dependent scenarios: Distributed Evidence Integration (DEI) and Structured Evidence Synthesis (SES), we quantify the impact of these strategies on both task accuracy and computational efficiency. Our findings reveal that centralized governance, instructor-led participation, ordered interaction patterns, and instructor-curated context summarization collectively optimize the trade-off between decision quality and resource utilization with the support of the proposed Token-Accuracy Ratio (TAR). This work establishes a foundation for designing adaptive, scalable multi-agent systems, shifting the focus from structural novelty to strategic interaction mechanics.
中文: 本研究系统探讨了多智能体系统中四个关键协作策略——智能体治理、参与控制、交互动态和对话历史管理,通过实验证明集中式治理与结构化交互能在决策质量与计算效率之间实现最优平衡。
English: This study systematically examines four key collaboration strategies in multi-agent systems—agent governance, participation control, interaction dynamics, and dialogue history management—demonstrating through experiments that centralized governance and structured interactions optimize the balance between decision quality and computational efficiency.

Authors:Yang Zhao, Kai Xiong, Xiao Ding, Li Du, Yangou Ouyang, Zhouhao Sun, Jiannan Guan, Wenbin Zhang, Bin Liu, Dong Hu, Bing Qin, Ting Liu
Title: UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection
Abstract:
Scaling RL for LLMs is computationally expensive, largely due to multi-sampling for policy optimization and evaluation, making efficient data selection crucial. Inspired by the Zone of Proximal Development (ZPD) theory, we hypothesize LLMs learn best from data within their potential comprehension zone. Addressing the limitation of conventional, computationally intensive multi-sampling methods for data assessment, we introduce UFO-RL. This novel framework uses a computationally efficient single-pass uncertainty estimation to identify informative data instances, achieving up to 185x faster data evaluation. UFO-RL leverages this metric to select data within the estimated ZPD for training. Experiments show that training with just 10% of data selected by UFO-RL yields performance comparable to or surpassing full-data training, reducing overall training time by up to 16x while enhancing stability and generalization. UFO-RL offers a practical and highly efficient strategy for scaling RL fine-tuning of LLMs by focusing learning on valuable data.
中文: UFO-RL框架通过高效的单次不确定性评估选择最具价值的数据进行大语言模型训练,仅用10%数据即可达到同等性能,并将训练时间减少高达16倍。
English: UFO-RL is a novel framework that uses efficient single-pass uncertainty estimation to select the most informative data for LLM training, achieving comparable performance with only 10% of data while reducing training time by up to 16 times.
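
The selection step can be pictured with a minimal sketch. The abstract does not say how the uncertainty score is computed or how the ZPD is delimited, so this example assumes each training item already has a score in [0, 1] and treats items closest to a hypothetical `target` uncertainty as lying inside the zone; `select_zpd`, `target`, and `fraction` are illustrative names, not the authors' API.

```python
def select_zpd(scores: list[float], fraction: float = 0.1,
               target: float = 0.5) -> list[int]:
    """Return indices of the items whose uncertainty is closest to `target`.

    Assumptions (not from the source): scores lie in [0, 1], and the "zone"
    is modelled as the items nearest a mid-range uncertainty value.
    """
    if not scores:
        return []
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - target))
    return sorted(ranked[:k])


if __name__ == "__main__":
    uncertainty = [0.05, 0.5, 0.95, 0.45, 0.2, 0.8, 0.6, 0.3, 0.7, 0.99]
    # Keep 20% of the pool: the two items nearest the target uncertainty.
    print(select_zpd(uncertainty, fraction=0.2))
```

Items the model finds trivially easy (score near 0) or far out of reach (score near 1) are skipped in this sketch, which mirrors the intuition stated in the abstract rather than its exact metric.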

Authors:Ziyin Zhang, Jiahao Xu, Zhiwei He, Tian Liang, Qiuzhi Liu, Yansi Li, Linfeng Song, Zhenwen Liang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Title: DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning
Abstract:
Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs' strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.
中文摘要:DeepTheorem 提出了一种利用自然语言增强大语言模型数学推理能力的非正式定理证明框架,包含大规模基准数据集和新型强化学习策略,显著提升了定理证明的性能和推理质量。
English Summary: DeepTheorem introduces a comprehensive informal theorem-proving framework using natural language to enhance LLM mathematical reasoning, featuring a large-scale benchmark dataset and a novel reinforcement learning strategy that significantly improves theorem-proving performance.

Authors:Ruining Deng, Junchao Zhu, Juming Xiong, Can Cui, Tianyuan Yao, Junlin Guo, Siqi Lu, Marilyn Lionts, Mengmeng Yin, Yu Wang, Shilin Zhao, Yucheng Tang, Yihe Yang, Paul Dennis Simonson, Mert R. Sabuncu, Haichun Yang, Yuankai Huo
Title: IRS: Incremental Relationship-guided Segmentation for Digital Pathology
Abstract:
Continual learning is rapidly emerging as a key focus in computer vision, aiming to develop AI systems capable of continuous improvement, thereby enhancing their value and practicality in diverse real-world applications. In healthcare, continual learning holds great promise for continuously acquired digital pathology data, which is collected in hospitals on a daily basis. However, panoramic segmentation on digital whole slide images (WSIs) presents significant challenges, as it is often infeasible to obtain comprehensive annotations for all potential objects, spanning from coarse structures (e.g., regions and unit objects) to fine structures (e.g., cells). This results in temporally and partially annotated data, posing a major challenge in developing a holistic segmentation framework. Moreover, an ideal segmentation model should incorporate new phenotypes, unseen diseases, and diverse populations, making this task even more complex. In this paper, we introduce a novel and unified Incremental Relationship-guided Segmentation (IRS) learning scheme to address temporally acquired, partially annotated data while maintaining out-of-distribution (OOD) continual learning capacity in digital pathology. The key innovation of IRS lies in its ability to realize a new spatial-temporal OOD continual learning paradigm by mathematically modeling anatomical relationships between existing and newly introduced classes through a simple incremental universal proposition matrix. Experimental results demonstrate that the IRS method effectively handles the multi-scale nature of pathological segmentation, enabling precise kidney segmentation across various structures (regions, units, and cells) as well as OOD disease lesions at multiple magnifications. This capability significantly enhances domain generalization, making IRS a robust approach for real-world digital pathology applications.
中文: 本文提出了一种新颖的增量关系引导分割(IRS)方法,通过建模解剖关系解决数字病理学中持续学习的挑战,实现了跨多尺度结构和分布外疾病的稳健分割。
English: The paper introduces a novel Incremental Relationship-guided Segmentation (IRS) method that addresses the challenges of continual learning in digital pathology by modeling anatomical relationships, enabling robust segmentation across multi-scale structures and out-of-distribution diseases.

Authors:Shuaiyi Li, Zhisong Zhang, Yang Deng, Chenlong Deng, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Wai Lam
Title: InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing
Abstract:
Although existing model editing methods perform well in recalling exact edit facts, they often struggle in complex scenarios that require deeper semantic understanding rather than mere knowledge regurgitation. Leveraging the strong contextual reasoning abilities of large language models (LLMs), in-context learning (ICL) becomes a promising editing method by comprehending edit information through context encoding. However, this method is constrained by the limited context window of LLMs, leading to degraded performance and efficiency as the number of edits increases. To overcome this limitation, we propose InComeS, a flexible framework that enhances LLMs' ability to process editing contexts through explicit compression and selection mechanisms. Specifically, InComeS compresses each editing context into the key-value (KV) cache of a special gist token, enabling efficient handling of multiple edits without being restricted by the model's context window. Furthermore, specialized cross-attention modules are added to dynamically select the most relevant information from the gist pools, enabling adaptive and effective utilization of edit information. We conduct experiments on diverse model editing benchmarks with various editing formats, and the results demonstrate the effectiveness and efficiency of our method.
Chinese: 针对上下文学习在模型编辑中受限于上下文窗口的问题,InComeS提出了一种框架,通过将编辑上下文压缩为要点标记并利用交叉注意力进行动态选择,从而在不同基准测试中显著提升了效率和性能。
English: To address the limitations of in-context learning in model editing due to restricted context windows, InComeS introduces a framework that compresses edit contexts into gist tokens and employs cross-attention for dynamic selection, enhancing efficiency and performance across various benchmarks.

Authors:Minda Hu, Tianqing Fang, Jianshu Zhang, Junyu Ma, Zhisong Zhang, Jingyan Zhou, Hongming Zhang, Haitao Mi, Dong Yu, Irwin King
Title: WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback
Abstract:
Web agents powered by Large Language Models (LLMs) show promise for next-generation AI, but their limited reasoning in uncertain, dynamic web environments hinders robust deployment. In this paper, we identify key reasoning skills essential for effective web agents, i.e., reflection & lookahead, branching, and rollback, and curate trajectory data that exemplifies these abilities by reconstructing the agent's (inference-time) reasoning algorithms into chain-of-thought rationales. We conduct experiments in the agent self-improving benchmark, OpenWebVoyager, and demonstrate that distilling salient reasoning patterns into the backbone LLM via simple fine-tuning can substantially enhance its performance. Our approach yields significant improvements across multiple benchmarks, including WebVoyager, Mind2web-live, and SimpleQA (web search), highlighting the potential of targeted reasoning skill enhancement for web agents.
中文: 本文通过识别关键推理技能并将其通过微调融入大语言模型,显著提升了网络智能体在多个基准测试中的性能。
English: This paper enhances web agents by identifying essential reasoning skills and distilling them into LLMs through fine-tuning, resulting in significant performance improvements across multiple benchmarks.

Authors:Mengru Wang, Xingyu Chen, Yue Wang, Zhiwei He, Jiahao Xu, Tian Liang, Qiuzhi Liu, Yunzhi Yao, Wenxuan Wang, Ruotian Ma, Haitao Mi, Ningyu Zhang, Zhaopeng Tu, Xiaolong Li, Dong Yu
Title: Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training
Abstract:
Mixture-of-Experts (MoE) architectures within Large Reasoning Models (LRMs) have achieved impressive reasoning capabilities by selectively activating experts to facilitate structured cognitive processes. Despite notable advances, existing reasoning models often suffer from cognitive inefficiencies like overthinking and underthinking. To address these limitations, we introduce a novel inference-time steering methodology called Reinforcing Cognitive Experts (RICE), designed to improve reasoning performance without additional training or complex heuristics. Leveraging normalized Pointwise Mutual Information (nPMI), we systematically identify specialized experts, termed "cognitive experts", that orchestrate meta-level reasoning operations characterized by tokens like "<think>". Empirical evaluations with leading MoE-based LRMs (DeepSeek-R1 and Qwen3-235B) on rigorous quantitative and scientific reasoning benchmarks demonstrate noticeable and consistent improvements in reasoning accuracy, cognitive efficiency, and cross-domain generalization. Crucially, our lightweight approach substantially outperforms prevalent reasoning-steering techniques, such as prompt design and decoding constraints, while preserving the model's general instruction-following skills. These results highlight reinforcing cognitive experts as a promising, practical, and interpretable direction to enhance cognitive efficiency within advanced reasoning models.
中文:RICE方法通过nPMI识别并强化认知专家,无需额外训练即可提升专家混合模型的推理准确性和效率。
English: The RICE method enhances reasoning in Mixture-of-Experts models by identifying and reinforcing cognitive experts using nPMI, improving accuracy and efficiency without extra training.
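
For readers unfamiliar with nPMI, the standard definition is PMI(x, y) / (-log p(x, y)), which ranges from -1 (never co-occur) through 0 (independent) to 1 (always co-occur). The sketch below computes it from co-occurrence counts between "expert e was activated" and "a reasoning-marker token was present". The counting scheme and the function names are assumptions for illustration; the abstract does not specify how the authors collect these statistics.

```python
import math


def npmi(joint_count: int, expert_count: int, token_count: int, total: int) -> float:
    """Normalized pointwise mutual information from raw counts.

    joint_count:  positions where the expert fired AND the marker token occurred
    expert_count: positions where the expert fired
    token_count:  positions where the marker token occurred
    total:        all positions observed
    """
    if joint_count == 0:
        return -1.0  # limit value when the two events never co-occur
    p_xy = joint_count / total
    if p_xy == 1.0:
        return 1.0  # both events occur everywhere; -log(1) would divide by zero
    p_x = expert_count / total
    p_y = token_count / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / (-math.log(p_xy))


def rank_experts(stats: dict[str, tuple[int, int]], token_count: int,
                 total: int) -> list[tuple[str, float]]:
    """Rank experts by nPMI; stats maps expert id -> (joint_count, expert_count)."""
    scored = [(eid, npmi(j, c, token_count, total)) for eid, (j, c) in stats.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counts over 100 positions, 10 of which contain the marker token.
    stats = {"expert_a": (10, 10), "expert_b": (5, 50), "expert_c": (0, 20)}
    for expert_id, score in rank_experts(stats, token_count=10, total=100):
        print(expert_id, round(score, 3))
```

Under this reading, the highest-scoring experts would be the candidates for reinforcement at inference time.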

Authors:Zhenwen Liang, Linfeng Song, Yang Li, Tao Yang, Feng Zhang, Haitao Mi, Dong Yu
Title: MPS-Prover: Advancing Stepwise Theorem Proving by Multi-Perspective Search and Data Curation
Abstract:
Automated Theorem Proving (ATP) in formal languages remains a formidable challenge in AI, demanding rigorous logical deduction and navigating vast search spaces. While large language models (LLMs) have shown promising performance, existing stepwise provers often suffer from biased search guidance, leading to inefficiencies and suboptimal proof strategies. This paper introduces the Multi-Perspective Search Prover (MPS-Prover), a novel stepwise ATP system designed to overcome these limitations. MPS-Prover incorporates two key innovations: a highly effective post-training data curation strategy that prunes approximately 40% of redundant training data without sacrificing performance, and a multi-perspective tree search mechanism. This search integrates a learned critic model with strategically designed heuristic rules to diversify tactic selection, prevent getting trapped in unproductive states, and enhance search robustness. Extensive evaluations demonstrate that MPS-Prover achieves state-of-the-art performance on multiple challenging benchmarks, including miniF2F and ProofNet, outperforming prior 7B parameter models. Furthermore, our analyses reveal that MPS-Prover generates significantly shorter and more diverse proofs compared to existing stepwise and whole-proof methods, highlighting its efficiency and efficacy. Our work advances the capabilities of LLM-based formal reasoning and offers a robust framework and a comprehensive analysis for developing more powerful theorem provers.
中文: 本文提出的多视角搜索证明器(MPS-Prover)通过整合训练数据优化策略与多视角树搜索机制,在自动定理证明中实现了最优性能,并能生成更简短多样的证明。
English: This paper introduces the Multi-Perspective Search Prover (MPS-Prover), which overcomes limitations in automated theorem proving by combining a curated training data strategy with a multi-perspective tree search mechanism, achieving state-of-the-art performance and generating shorter, more diverse proofs.

Authors:Jinlong Fan, Xuepu Zeng, Jing Zhang, Mingming Gong, Yuxiang Yang, Dacheng Tao
Title: Advances in Radiance Field for Dynamic Scene: From Neural Field to Gaussian Field
Abstract:
Dynamic scene representation and reconstruction have undergone transformative advances in recent years, catalyzed by breakthroughs in neural radiance fields and 3D Gaussian splatting techniques. While initially developed for static environments, these methodologies have rapidly evolved to address the complexities inherent in 4D dynamic scenes through an expansive body of research. Coupled with innovations in differentiable volumetric rendering, these approaches have significantly enhanced the quality of motion representation and dynamic scene reconstruction, thereby garnering substantial attention from the computer vision and graphics communities. This survey presents a systematic analysis of over 200 papers focused on dynamic scene representation using radiance field, spanning the spectrum from implicit neural representations to explicit Gaussian primitives. We categorize and evaluate these works through multiple critical lenses: motion representation paradigms, reconstruction techniques for varied scene dynamics, auxiliary information integration strategies, and regularization approaches that ensure temporal consistency and physical plausibility. We organize diverse methodological approaches under a unified representational framework, concluding with a critical examination of persistent challenges and promising research directions. By providing this comprehensive overview, we aim to establish a definitive reference for researchers entering this rapidly evolving field while offering experienced practitioners a systematic understanding of both conceptual principles and practical frontiers in dynamic scene reconstruction.
中文:神经辐射场与3D高斯溅射技术的突破性进展推动了动态场景重建的变革,本文通过系统分析200余篇文献,对动态场景表示方法进行分类整合,并指明了该领域未来研究的关键方向。
English: Recent advances in neural radiance fields and 3D Gaussian splatting have revolutionized dynamic scene reconstruction, with this survey systematically analyzing over 200 papers to categorize methodologies and identify future research directions.

Authors:Junyu Ma, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu
Title: Recall with Reasoning: Chain-of-Thought Distillation for Mamba's Long-Context Memory and Extrapolation
Abstract:
Mamba's theoretical infinite-context potential is limited in practice when sequences far exceed training lengths. This work explores unlocking Mamba's long-context memory ability with a simple yet effective method, Recall with Reasoning (RwR), which distills chain-of-thought (CoT) summarization from a teacher model. Specifically, RwR prepends these summarizations as CoT prompts during fine-tuning, teaching Mamba to actively recall and reason over long contexts. Experiments on LONGMEMEVAL and HELMET show RwR boosts Mamba's long-context performance against comparable Transformer/hybrid baselines under similar pretraining conditions, while preserving short-context capabilities, all without architectural changes.
中文: 本研究提出“回忆与推理”(RwR)方法,通过提炼教师模型的思维链摘要来增强Mamba的长上下文记忆能力,在无需架构改动的情况下显著提升了其在长上下文任务中的表现。
English: This work introduces Recall with Reasoning (RwR), a simple method that enhances Mamba's long-context memory by distilling chain-of-thought summarizations from a teacher model, improving its performance on long-context benchmarks without architectural changes.

Authors:Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
Title: Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
Abstract:
Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model, (3) while incorrectly predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
中文摘要:本文提出首个基于思维链的多模态奖励模型UnifiedReward-Think,通过强化微调方法结合正确与错误推理路径,显著提升了视觉任务中奖励信号的可靠性和鲁棒性。
English Summary: This paper introduces UnifiedReward-Think, the first multimodal reward model incorporating explicit chain-of-thought reasoning to enhance reliability and robustness in vision tasks through a reinforcement fine-tuning approach that leverages both correct and incorrect reasoning paths.
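
The GRPO step named in the abstract rests on a group-relative advantage: several responses are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. The sketch below shows only that normalization, in its commonly used form. The function name, the population standard deviation, and the `eps` stabilizer are assumptions for illustration; the paper's reward design and training loop are not reproduced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    rewards: list of scalar rewards, one per response to the same prompt.
    eps:     small constant (an assumption) that avoids division by zero
             when every response receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, rewards of `[1, 0, 0, 1]` yield advantages close to `[1, -1, -1, 1]`, so responses judged correct are reinforced relative to the rest of their group.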

Authors:Jihai Zhang, Tianle Li, Linjie Li, Zhengyuan Yang, Yu Cheng
Title: Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
Abstract:
Recent advancements in unified vision-language models (VLMs), which integrate both visual understanding and generation capabilities, have attracted significant attention. The underlying hypothesis is that a unified architecture with mixed training on both understanding and generation tasks can enable mutual enhancement between understanding and generation. However, this hypothesis remains underexplored in prior works on unified VLMs. To address this gap, this paper systematically investigates the generalization across understanding and generation tasks in unified VLMs. Specifically, we design a dataset closely aligned with real-world scenarios to facilitate extensive experiments and quantitative evaluations. We evaluate multiple unified VLM architectures to validate our findings. Our key findings are as follows. First, unified VLMs trained with mixed data exhibit mutual benefits in understanding and generation tasks across various architectures, and these mutual benefits can scale up with increased data. Second, better alignment between multimodal input and output spaces will lead to better generalization. Third, the knowledge acquired during generation tasks can transfer to understanding tasks, and this cross-task generalization occurs within the base language model, beyond modality adapters. Our findings underscore the critical necessity of unifying understanding and generation in VLMs, offering valuable insights for the design and optimization of unified VLMs.
中文: 统一视觉语言模型在理解与生成任务间展现出相互促进的效果,其益处随数据增加而提升,并通过模态对齐及语言模型内的跨任务知识迁移实现更好的泛化能力。
English: Unified vision-language models demonstrate mutual enhancement between understanding and generation tasks: the benefits scale with data, better alignment between multimodal input and output spaces improves generalization, and knowledge transfers across tasks within the base language model.

Authors:Siyeop Yoon, Yujin Oh, Pengfei Jin, Sifan Song, Matthew Tivnan, Dufan Wu, Xiang Li, Quanzheng Li
Title: Surf2CT: Cascaded 3D Flow Matching Models for Torso 3D CT Synthesis from Skin Surface
Abstract:
We present Surf2CT, a novel cascaded flow matching framework that synthesizes full 3D computed tomography (CT) volumes of the human torso from external surface scans and simple demographic data (age, sex, height, weight). This is the first approach capable of generating realistic volumetric internal anatomy images solely based on external body shape and demographics, without any internal imaging. Surf2CT proceeds through three sequential stages: (1) Surface Completion, reconstructing a complete signed distance function (SDF) from partial torso scans using conditional 3D flow matching; (2) Coarse CT Synthesis, generating a low-resolution CT volume from the completed SDF and demographic information; and (3) CT Super-Resolution, refining the coarse volume into a high-resolution CT via a patch-wise conditional flow model. Each stage utilizes a 3D-adapted EDM2 backbone trained via flow matching. We trained our model on a combined dataset of 3,198 torso CT scans (approximately 1.13 million axial slices) sourced from Massachusetts General Hospital (MGH) and the AutoPET challenge. Evaluation on 700 paired torso surface-CT cases demonstrated strong anatomical fidelity: organ volumes exhibited small mean percentage differences (range from -11.1% to 4.4%), and muscle/fat body composition metrics matched ground truth with strong correlation (range from 0.67 to 0.96). Lung localization had minimal bias (mean difference -2.5 mm), and surface completion significantly improved metrics (Chamfer distance: from 521.8 mm to 2.7 mm; Intersection-over-Union: from 0.87 to 0.98). Surf2CT establishes a new paradigm for non-invasive internal anatomical imaging using only external data, opening opportunities for home-based healthcare, preventive medicine, and personalized clinical assessments without the risks associated with conventional imaging techniques.
中文: Surf2CT提出了一种级联流匹配框架,通过外部体表扫描和人口统计数据生成完整的人体躯干三维CT图像,无需内部成像即可实现高精度的解剖结构重建。
English: Surf2CT introduces a cascaded flow matching framework that generates full 3D CT volumes of the human torso from external surface scans and demographic data, achieving high anatomical fidelity without internal imaging.
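
The two surface-completion metrics quoted in the abstract, Chamfer distance and Intersection-over-Union, can be sketched as follows. Chamfer distance has several variants (summed or averaged, squared or unsquared); this sketch assumes the symmetric sum of mean nearest-neighbor Euclidean distances, and it treats shapes as small point lists and sets of occupied voxel indices. The function names and data layout are illustrative assumptions, not the paper's evaluation code.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets."""
    def one_sided(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)


def voxel_iou(voxels_a, voxels_b):
    """Intersection-over-Union of two collections of occupied voxel indices."""
    a, b = set(voxels_a), set(voxels_b)
    union = a | b
    if not union:
        return 1.0  # two empty shapes are treated as identical
    return len(a & b) / len(union)
```

Lower Chamfer distance and higher IoU both indicate closer agreement between the completed surface and the reference surface. The brute-force nearest-neighbor search is quadratic, so a spatial index would be needed for full-resolution scans.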

Authors:Siyeop Yoon, Sifan Song, Pengfei Jin, Matthew Tivnan, Yujin Oh, Sekeun Kim, Dufan Wu, Xiang Li, Quanzheng Li
Title: Cascaded 3D Diffusion Models for Whole-body 3D 18-F FDG PET/CT synthesis from Demographics
Abstract:
We propose a cascaded 3D diffusion model framework to synthesize high-fidelity 3D PET/CT volumes directly from demographic variables, addressing the growing need for realistic digital twins in oncologic imaging, virtual trials, and AI-driven data augmentation. Unlike deterministic phantoms, which rely on predefined anatomical and metabolic templates, our method employs a two-stage generative process. An initial score-based diffusion model synthesizes low-resolution PET/CT volumes from demographic variables alone, providing global anatomical structures and approximate metabolic activity. This is followed by a super-resolution residual diffusion model that refines spatial resolution. Our framework was trained on 18-F FDG PET/CT scans from the AutoPET dataset and evaluated using organ-wise volume and standardized uptake value (SUV) distributions, comparing synthetic and real data between demographic subgroups. The organ-wise comparison demonstrated strong concordance between synthetic and real images. In particular, most deviations in metabolic uptake values remained within 3-5% of the ground truth in subgroup analysis. These findings highlight the potential of cascaded 3D diffusion models to generate anatomically and metabolically accurate PET/CT images, offering a robust alternative to traditional phantoms and enabling scalable, population-informed synthetic imaging for clinical and research applications.
中文: 我们提出了一种级联3D扩散模型,能够根据人口统计学变量生成高保真的PET/CT图像,其合成数据与真实影像在器官体积和代谢活性上高度吻合,为临床和研究提供了可扩展的合成成像解决方案。
English: We introduce a cascaded 3D diffusion model that generates high-fidelity PET/CT volumes from demographic data, showing strong organ-wise concordance with real scans, with most metabolic uptake deviations within 3-5% of the ground truth in subgroup analysis, and offering a scalable solution for medical imaging applications.

Authors:Miguel Lopez-Duran, Julian Fierrez, Aythami Morales, Ruben Tolosana, Oscar Delgado-Mohatar, Alvaro Ortigosa
Title: Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs
Abstract:
The automatic analysis of document layouts in digital-born PDF documents remains a challenging problem due to the heterogeneous arrangement of textual and nontextual elements and the imprecision of the textual metadata in the Portable Document Format. In this work, we benchmark Graph Neural Network (GNN) architectures for the task of fine-grained layout classification of text blocks from digital native documents. We introduce two graph construction structures: a k-closest-neighbor graph and a fully connected graph, and generate node features via pre-trained text and vision models, thus avoiding manual feature engineering. Three experimental frameworks are evaluated: single-modality (text or visual), concatenated multimodal, and dual-branch multimodal. We evaluated four foundational GNN models and compared them with the baseline. Our experiments are specifically conducted on a rich dataset of public affairs documents that includes more than 20 sources (e.g., regional and national-level official gazettes), 37K PDF documents, with 441K pages in total. Our results demonstrate that GraphSAGE operating on the k-closest-neighbor graph in a dual-branch configuration achieves the highest per-class and overall accuracy, outperforming the baseline in some sources. These findings confirm the importance of local layout relationships and multimodal fusion exploited through GNNs for the analysis of native digital document layouts.
中文: 本研究对图神经网络在数字原生PDF文档的细粒度布局分类任务中进行基准测试,发现采用k近邻图的双分支配置GraphSAGE模型通过有效利用局部布局关系和跨模态融合,实现了最高分类准确率。
English: This study benchmarks Graph Neural Networks for fine-grained layout classification in digital-born PDFs, finding that GraphSAGE with a k-closest-neighbor graph in a dual-branch configuration achieves the highest accuracy by effectively leveraging local layout relationships and multimodal fusion.
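
The two graph constructions named in the abstract can be sketched as edge lists over the text blocks of a page. The sketch assumes each block is represented by the 2D center of its bounding box and that closeness means Euclidean distance between centers; the abstract does not specify the distance, so both choices, along with the function names, are illustrative assumptions.

```python
import math


def knn_edges(centers, k):
    """Directed edges from each block to its k closest blocks."""
    edges = []
    for i, c in enumerate(centers):
        # Sort all other blocks by distance to block i (ties broken by index).
        neighbors = sorted(
            (math.dist(c, other), j) for j, other in enumerate(centers) if j != i
        )
        edges.extend((i, j) for _, j in neighbors[:k])
    return edges


def fully_connected_edges(n):
    """Directed edges between every ordered pair of distinct blocks."""
    return [(i, j) for i in range(n) for j in range(n) if i != j]
```

The k-closest-neighbor graph keeps only local layout relationships and grows linearly with the number of blocks, whereas the fully connected graph grows quadratically. Either edge list could then be passed to a GNN layer such as GraphSAGE together with per-block node features.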

Authors:Sunhao Dai, Wenjie Wang, Liang Pang, Jun Xu, See-Kiong Ng, Ji-Rong Wen, Tat-Seng Chua
Title: NExT-Search: Rebuilding User Feedback Ecosystem for Generative AI Search
Abstract:
Generative AI search is reshaping information retrieval by offering end-to-end answers to complex queries, reducing users' reliance on manually browsing and summarizing multiple web pages. However, while this paradigm enhances convenience, it disrupts the feedback-driven improvement loop that has historically powered the evolution of traditional Web search. Web search engines can continuously improve their ranking models by collecting large-scale, fine-grained user feedback (e.g., clicks, dwell time) at the document level. In contrast, generative AI search operates through a much longer search pipeline, spanning query decomposition, document retrieval, and answer generation, yet typically receives only coarse-grained feedback on the final answer. This introduces a feedback loop disconnect, where user feedback for the final output cannot be effectively mapped back to specific system components, making it difficult to improve each intermediate stage and sustain the feedback loop. In this paper, we envision NExT-Search, a next-generation paradigm designed to reintroduce fine-grained, process-level feedback into generative AI search. NExT-Search integrates two complementary modes: User Debug Mode, which allows engaged users to intervene at key stages; and Shadow User Mode, where a personalized user agent simulates user preferences and provides AI-assisted feedback for less interactive users. Furthermore, we envision how these feedback signals can be leveraged through online adaptation, which refines current search outputs in real-time, and offline update, which aggregates interaction logs to periodically fine-tune query decomposition, retrieval, and generation models. By restoring human control over key stages of the generative AI search pipeline, we believe NExT-Search offers a promising direction for building feedback-rich AI search systems that can evolve continuously alongside human feedback.
中文: 生成式AI搜索通过提供直接答案简化了信息检索,但破坏了系统改进所必需的反馈循环,因此提出的NExT-Search范式通过用户干预和AI辅助模拟重新引入细粒度的过程级反馈。
English: Generative AI search simplifies information retrieval by providing direct answers but breaks the feedback loop essential for system improvement, prompting the proposed NExT-Search paradigm to reintroduce fine-grained, process-level feedback through user intervention and AI-assisted simulations.

Authors:Yabiao Wang, Shuo Wang, Jiangning Zhang, Jiafu Wu, Qingdong He, Yong Liu
Title: MARRS: Masked Autoregressive Unit-based Reaction Synthesis
Abstract:
This work aims at a challenging task: human action-reaction synthesis, i.e., generating human reactions conditioned on the action sequence of another person. Currently, autoregressive modeling approaches with vector quantization (VQ) have achieved remarkable performance in motion generation tasks. However, VQ has inherent disadvantages, including quantization information loss, low codebook utilization, etc. In addition, while dividing the body into separate units can be beneficial, the computational complexity needs to be considered. Also, the importance of mutual perception among units is often neglected. In this work, we propose MARRS, a novel framework designed to generate coordinated and fine-grained reaction motions using continuous representations. Initially, we present the Unit-distinguished Motion Variational AutoEncoder (UD-VAE), which segments the entire body into distinct body and hand units, encoding each independently. Subsequently, we propose Action-Conditioned Fusion (ACF), which involves randomly masking a subset of reactive tokens and extracting specific information about the body and hands from the active tokens. Furthermore, we introduce Adaptive Unit Modulation (AUM) to facilitate interaction between body and hand units by using the information from one unit to adaptively modulate the other. Finally, for the diffusion model, we employ a compact MLP as a noise predictor for each distinct body unit and incorporate the diffusion loss to model the probability distribution of each token. Both quantitative and qualitative results demonstrate that our method achieves superior performance. The code will be released upon acceptance.
中文摘要:本文提出MARRS框架,通过连续表征和自适应单元交互生成协调的人类反应动作,克服了矢量量化的固有缺陷,实现了卓越性能。
English Summary: This paper introduces MARRS, a novel framework that generates coordinated human reaction motions using continuous representations and adaptive unit interactions, overcoming limitations of vector quantization and achieving superior performance.

Authors:Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng
Title: OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
Abstract:
While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, considering supervised fine-tuning (SFT) on static demonstrations offers limited policy generalization for dynamic tool invocation, we propose a novel reinforcement learning (RL) framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies by directly optimizing for task success using feedback from tool interactions. We empirically validate V-ToolRL on challenging chart reasoning tasks. Our RL-trained agent, built upon a Qwen2-VL-2B, significantly outperforms its SFT-initialized counterpart (+28.83 points) and surpasses established supervised tool-learning baselines like Taco and CogCom by an average of +12.7 points. Notably, it also surpasses prominent closed-source models like GPT-4.1 by +8.68 accuracy points. We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely "think with images".
中文: OpenThinkIMG作为首个开源端到端框架,通过标准化视觉工具接口和创新的强化学习方法V-ToolRL,使大视觉语言模型能自主学习动态调用工具的策略,在图表推理任务中显著超越包括GPT-4.1在内的现有模型。
English: OpenThinkIMG is an open-source, end-to-end framework that introduces standardized vision tool interfaces and a novel reinforcement learning method, V-ToolRL, enabling LVLMs to autonomously learn adaptive tool-use strategies and significantly outperform existing models, including GPT-4.1, on chart reasoning tasks.

Authors:Lei Lei, Kan Zheng, Jie Mei, Xuemin Shen
Title: VoI-Driven Joint Optimization of Control and Communication in Vehicular Digital Twin Network
Abstract:
The vision of sixth-generation (6G) wireless networks paves the way for the seamless integration of digital twins into vehicular networks, giving rise to a Vehicular Digital Twin Network (VDTN). The large amount of computing resources as well as the massive amount of spatial-temporal data in Digital Twin (DT) domain can be utilized to enhance the communication and control performance of Internet of Vehicle (IoV) systems. In this article, we first propose the architecture of VDTN, emphasizing key modules that center on functions related to the joint optimization of control and communication. We then delve into the intricacies of the multitimescale decision process inherent in joint optimization in VDTN, specifically investigating the dynamic interplay between control and communication. To facilitate the joint optimization, we define two Value of Information (VoI) concepts rooted in control performance. Subsequently, utilizing VoI as a bridge between control and communication, we introduce a novel joint optimization framework, which involves iterative processing of two Deep Reinforcement Learning (DRL) modules corresponding to control and communication to derive the optimal policy. Finally, we conduct simulations of the proposed framework applied to a platoon scenario to demonstrate its effectiveness in ensu
中文: 本文提出车载数字孪生网络(VDTN)架构,利用信息价值(VoI)概念和深度强化学习框架,在车联网系统中实现控制与通信的联合优化,并通过车队仿真验证了其有效性。
English: The article proposes a Vehicular Digital Twin Network (VDTN) architecture that employs Value of Information (VoI) concepts and a deep reinforcement learning framework to jointly optimize control and communication in Internet of Vehicle systems, validated through platoon simulations.

Authors:Lei Lei, Kan Zheng, Xuemin Shen
Title: Learning Value of Information towards Joint Communication and Control in 6G V2X
Abstract:
As Cellular Vehicle-to-Everything (C-V2X) evolves towards future sixth-generation (6G) networks, Connected Autonomous Vehicles (CAVs) are emerging to become a key application. Leveraging data-driven Machine Learning (ML), especially Deep Reinforcement Learning (DRL), is expected to significantly enhance CAV decision-making in both vehicle control and V2X communication under uncertainty. These two decision-making processes are closely intertwined, with the value of information (VoI) acting as a crucial bridge between them. In this paper, we introduce Sequential Stochastic Decision Process (SSDP) models to define and assess VoI, demonstrating their application in optimizing communication systems for CAVs. Specifically, we formally define the SSDP model and demonstrate that the MDP model is a special case of it. The SSDP model offers a key advantage by explicitly representing the set of information that can enhance decision-making when available. Furthermore, as current research on VoI remains fragmented, we propose a systematic VoI modeling framework grounded in the MDP, Reinforcement Learning (RL) and Optimal Control theories. We define different categories of VoI and discuss their corresponding estimation methods. Finally, we present a structured approach to leverage the various VoI metrics for optimizing the "When", "What", and "How" to communicate problems. For this purpose, SSDP models are formulated with VoI-associated reward functions derived from VoI-based optimization objectives. While we use a simple vehicle-following control problem to illustrate the proposed methodology, it holds significant potential to facilitate the joint optimization of stochastic, sequential control and communication decisions in a wide range of networked control systems.
中文: 随着蜂窝车联网(C-V2X)向未来6G网络演进,联网自动驾驶车辆(CAV)成为关键应用,利用数据驱动的机器学习,尤其是深度强化学习,在不确定性下提升车辆控制和通信决策,其中信息价值(VoI)是连接两者的重要桥梁。
English: Cellular Vehicle-to-Everything (C-V2X) is advancing towards 6G networks, where Connected Autonomous Vehicles (CAVs) utilize data-driven Machine Learning, particularly Deep Reinforcement Learning, to improve decision-making in vehicle control and communication, with the value of information (VoI) serving as a critical link between these processes.

Authors:Sifan Song, Siyeop Yoon, Pengfei Jin, Sekeun Kim, Matthew Tivnan, Yujin Oh, Runqi Meng, Ling Chen, Zhiliang Lyu, Dufan Wu, Ning Guo, Xiang Li, Quanzheng Li
Title: OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging
Abstract:
Recent advances in representation learning often rely on holistic, black-box embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches that produce holistic features, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while allowing fine-grained control in downstream tasks. Experiments on CT and MRI datasets demonstrate the effectiveness of OWT in not only achieving strong image reconstruction and segmentation performance, but also enabling novel semantic-level generation and retrieval applications that are out of reach for standard holistic embedding methods. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and applicability to real-world medical imaging scenarios and beyond.
中文: 提出的器官级标记化框架将医学图像显式解耦为器官特定的标记组,不仅提升了可解释性,还实现了传统整体方法无法达成的新型语义级应用。
English: The proposed Organ-Wise Tokenization framework explicitly disentangles medical images into organ-specific token groups, enhancing interpretability and enabling novel applications beyond standard holistic methods.

Authors:Luis F. Gomez, Gonzalo Garrido-Lopez, Julian Fierrez, Aythami Morales, Ruben Tolosana, Javier Rueda, Enrique Navarro
Title: Comparison of Visual Trackers for Biomechanical Analysis of Running
Abstract:
Human pose estimation has witnessed significant advancements in recent years, mainly due to the integration of deep learning models, the availability of a vast amount of data, and large computational resources. These developments have led to highly accurate body tracking systems, which have direct applications in sports analysis and performance evaluation. This work analyzes the performance of six trackers: two point trackers and four joint trackers for biomechanical analysis in sprints. The proposed framework compares the results obtained from these pose trackers with the manual annotations of biomechanical experts for more than 5870 frames. The experimental framework employs forty sprints from five professional runners, focusing on three key angles in sprint biomechanics: trunk inclination, hip flex extension, and knee flex extension. We propose a post-processing module for outlier detection and fusion prediction in the joint angles. The experimental results demonstrate that using joint-based models yields root mean squared errors ranging from 11.41° to 4.37°. When integrated with the post-processing modules, these errors can be reduced to 6.99° and 3.88°, respectively. The experimental findings suggest that human pose tracking approaches can be valuable resources for the biomechanical analysis of running. However, there is still room for improvement in applications where high accuracy is required.
中文: 本研究评估了六种人体姿态追踪器在短跑生物力学中的应用,结果表明基于关节的模型误差范围为4.37°至11.41°,经后处理可降至3.88°-6.99°,证实了其应用潜力同时指出高精度需求领域仍需改进。
English: This study evaluates six human pose trackers for sprint biomechanics, showing that joint-based models achieve errors between 4.37° and 11.41°, reducible to 3.88°-6.99° with post-processing, demonstrating their potential while noting need for further accuracy improvements.
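As a rough illustration of the evaluation described in this abstract, the sketch below computes a joint angle from three 2D keypoints and the root mean squared error (RMSE) between tracker angles and expert annotations. The function names and the 2D keypoint layout are assumptions made for illustration; the paper's own angle definitions and post-processing module are not reproduced here.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

def rmse(predicted, reference):
    """Root mean squared error between two equal-length angle sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical hip, knee, and ankle keypoints (x, y) for one frame.
knee_angle = joint_angle((0.0, 1.0), (0.0, 0.0), (1.0, 0.0))
```

For a knee angle, `a`, `b`, and `c` would be the hip, knee, and ankle keypoints; `rmse` would then be applied per angle across all annotated frames.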

Authors:Runquan Gui, Zhihai Wang, Jie Wang, Chi Ma, Huiling Zhen, Mingxuan Yuan, Jianye Hao, Defu Lian, Enhong Chen, Feng Wu
Title: HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking
Abstract:
Recent advancements have significantly enhanced the performance of large language models (LLMs) in tackling complex reasoning tasks, achieving notable success in domains like mathematical and logical reasoning. However, these methods encounter challenges with complex planning tasks, primarily due to extended reasoning steps, diverse constraints, and the challenge of handling multiple distinct sub-tasks. To address these challenges, we propose HyperTree Planning (HTP), a novel reasoning paradigm that constructs hypertree-structured planning outlines for effective planning. The hypertree structure enables LLMs to engage in hierarchical thinking by flexibly employing the divide-and-conquer strategy, effectively breaking down intricate reasoning steps, accommodating diverse constraints, and managing multiple distinct sub-tasks in a well-organized manner. We further introduce an autonomous planning framework that completes the planning process by iteratively refining and expanding the hypertree-structured planning outlines. Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, resulting in a 3.6 times performance improvement over o1-preview.
中文: HyperTree Planning (HTP) 提出了一种超树结构推理范式,通过分层分解步骤和约束使大语言模型能有效处理复杂规划任务,在基准测试中实现了最先进的准确率。
English: HyperTree Planning (HTP) introduces a hypertree-structured reasoning paradigm that enables large language models to effectively handle complex planning tasks by hierarchically breaking down steps and constraints, achieving state-of-the-art accuracy on benchmarks.

Authors:Ziqiao Peng, Jiwen Liu, Haoxian Zhang, Xiaoqiang Liu, Songlin Tang, Pengfei Wan, Di Zhang, Hongyan Liu, Jun He
Title: OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers
Abstract:
Lip synchronization is the task of aligning a speaker's lip movements in video with corresponding speech audio, and it is essential for creating realistic, expressive video content. However, existing methods often rely on reference frames and masked-frame inpainting, which limit their robustness to identity consistency, pose variations, facial occlusions, and stylized content. In addition, since audio signals provide weaker conditioning than visual cues, lip shape leakage from the original video will affect lip sync quality. In this paper, we present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks, enabling unlimited-duration inference while maintaining natural facial dynamics and preserving character identity. During inference, we propose a flow-matching-based progressive noise initialization to ensure pose and identity consistency, while allowing precise mouth-region editing. To address the weak conditioning signal of audio, we develop a Dynamic Spatiotemporal Classifier-Free Guidance (DS-CFG) mechanism that adaptively adjusts guidance strength over time and space. We also establish the AIGC-LipSync Benchmark, the first evaluation suite for lip synchronization in diverse AI-generated videos. Extensive experiments demonstrate that OmniSync significantly outperforms prior methods in both visual quality and lip sync accuracy, achieving superior results in both real-world and AI-generated videos.
中文: OmniSync是一种通用唇形同步框架,采用无掩码扩散变换器和动态时空引导机制,无需依赖参考帧或显式掩码,即可在多样化视频中实现鲁棒且高质量的唇部对齐。
English: OmniSync is a universal lip synchronization framework that uses a mask-free diffusion transformer and dynamic spatiotemporal guidance to achieve robust, high-quality lip alignment across diverse videos without relying on reference frames or explicit masks.

Authors:Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Ling Pan
Title: Scaling Image and Video Generation via Test-Time Evolutionary Search
Abstract:
As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose Evolutionary Search (EvoSearch), a novel, generalist, and efficient TTS method that effectively enhances the scalability of both image and video generation across diffusion and flow models, without requiring additional training or model expansion. EvoSearch reformulates test-time scaling for diffusion and flow models as an evolutionary search problem, leveraging principles from biological evolution to efficiently explore and refine the denoising trajectory. By incorporating carefully designed selection and mutation mechanisms tailored to the stochastic differential equation denoising process, EvoSearch iteratively generates higher-quality offspring while preserving population diversity. Through extensive evaluation across both diffusion and flow architectures for image and video generation tasks, we demonstrate that our method consistently outperforms existing approaches, achieves higher diversity, and shows strong generalizability to unseen evaluation metrics. Our project is available at the website https://tinnerhrhe.github.io/evosearch.
中文: 测试时缩放(TTS)通过在推理阶段增加计算来经济地提升生成模型性能,而提出的EvoSearch方法无需额外训练即可有效增强扩散和流模型的图像与视频生成效果。
English: Test-time scaling (TTS) offers a cost-effective way to boost generative model performance by adding computation during inference, and the proposed EvoSearch method effectively enhances image and video generation for diffusion and flow models without extra training.
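The selection-and-mutation loop mentioned in this abstract can be illustrated with a generic evolutionary search over candidate noise vectors. This is a minimal sketch of plain elitist evolutionary search, not EvoSearch itself: the `reward` callable, the Gaussian mutation, and all parameter names are assumptions, and the paper's mechanisms tailored to the denoising trajectory are not modeled.

```python
import random

def evolutionary_search(reward, dim, population=16, elites=4,
                        generations=10, sigma=0.3, seed=0):
    """Elitist evolutionary search over real-valued vectors (toy sketch)."""
    rng = random.Random(seed)
    # Initial population: random Gaussian vectors, standing in for initial noise.
    pop = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(population)]
    for _ in range(generations):
        # Selection: keep the highest-reward candidates unchanged.
        pop.sort(key=reward, reverse=True)
        parents = pop[:elites]
        # Mutation: refill the population with perturbed copies of the parents.
        children = []
        while len(parents) + len(children) < population:
            parent = rng.choice(parents)
            children.append([x + rng.gauss(0.0, sigma) for x in parent])
        pop = parents + children
    return max(pop, key=reward)

# Hypothetical reward: prefer vectors close to the origin.
def toy_reward(vec):
    return -sum(x * x for x in vec)

best = evolutionary_search(toy_reward, dim=4)
```

Because the elites are carried over unchanged, the best reward in the population never decreases from one generation to the next; in a real setting `reward` would score a generated image or video rather than the vector itself.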

Authors:Zehao Li, Hao Jiang, Yujun Cai, Jianing Chen, Baolong Bi, Shuqin Gao, Honglong Zhao, Yiwei Wang, Tianlu Mao, Zhaoqi Wang
Title: STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering
Abstract:
Although dynamic scene reconstruction has long been a fundamental challenge in 3D vision, the recent emergence of 3D Gaussian Splatting (3DGS) offers a promising direction by enabling high-quality, real-time rendering through explicit Gaussian primitives. However, existing 3DGS-based methods for dynamic reconstruction often suffer from spatio-temporal incoherence during initialization, where canonical Gaussians are constructed by aggregating observations from multiple frames without temporal distinction. This results in spatio-temporally entangled representations, making it difficult to model dynamic motion accurately. To overcome this limitation, we propose STDR (Spatio-Temporal Decoupling for Real-time rendering), a plug-and-play module that learns spatio-temporal probability distributions for each Gaussian. STDR introduces a spatio-temporal mask, a separated deformation field, and a consistency regularization to jointly disentangle spatial and temporal patterns. Extensive experiments demonstrate that incorporating our module into existing 3DGS-based dynamic scene reconstruction frameworks leads to notable improvements in both reconstruction quality and spatio-temporal consistency across synthetic and real-world benchmarks.
中文摘要:提出的STDR模块通过解耦时空概率分布,有效解决了动态3D高斯泼溅重建中的时空不一致问题,借助创新的掩码机制和正则化方法显著提升了重建质量和时空连贯性。
English Summary: The proposed STDR module addresses spatio-temporal incoherence in 3D Gaussian Splatting for dynamic reconstruction by learning separate spatio-temporal distributions, significantly improving reconstruction quality and consistency through novel masking and regularization techniques.

Authors:Hao Fang, Changle Zhou, Jiawei Kong, Kuofeng Gao, Bin Chen, Tao Liang, Guojun Ma, Shu-Tao Xia
Title: Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Abstract:
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
Chinese: 本文提出了一种新颖的条件点互信息(C-PMI)解码策略,通过双层次优化方法自适应地增强生成文本与输入图像之间的相互依赖,有效减少大型视觉语言模型中的幻觉现象。
English: This paper introduces a novel Conditional Pointwise Mutual Information (C-PMI) decoding strategy that mitigates hallucinations in Large Vision-Language Models by adaptively strengthening the dependency between generated texts and input images through a bi-level optimization approach.
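To make the mutual-information idea in this abstract concrete, the sketch below scores each candidate token by its image-conditioned log-probability plus a weighted pointwise mutual information term, log p(y | image, text) - log p(y | text). This is only an illustrative calibration rule under assumed names (`pmi_calibrated_scores`, `weight`); it does not implement the paper's bi-level optimization or its token purification mechanism.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def pmi_calibrated_scores(logits_with_image, logits_text_only, weight=1.0):
    """Image-conditioned log-probabilities boosted by a per-token PMI term."""
    lp_img = log_softmax(logits_with_image)
    lp_txt = log_softmax(logits_text_only)
    # PMI is large for tokens the image supports more than the language prior does.
    return [a + weight * (a - b) for a, b in zip(lp_img, lp_txt)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits: token 1 is strongly favored by the language prior alone.
with_image = [1.0, 1.2, 0.0]
text_only = [0.0, 3.0, 0.0]
plain_choice = argmax(pmi_calibrated_scores(with_image, text_only, weight=0.0))
calibrated_choice = argmax(pmi_calibrated_scores(with_image, text_only, weight=1.0))
```

In this toy case the uncalibrated scores pick token 1, which the text-only prior dominates, while the calibrated scores pick token 0, whose probability rises most when the image is present.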

Authors:Mengqi Zhang, Zisheng Zhou, Xiaotian Ye, Qiang Liu, Zhaochun Ren, Zhumin Chen, Pengjie Ren
Title: Disentangling Knowledge Representations for Large Language Model Editing
Abstract:
Knowledge Editing has emerged as a promising solution for efficiently updating embedded knowledge in large language models (LLMs). While existing approaches demonstrate effectiveness in integrating new knowledge and preserving the original capabilities of LLMs, they fail to maintain fine-grained irrelevant knowledge facts that share the same subject as edited knowledge but differ in relation and object. This challenge arises because subject representations inherently encode multiple attributes, causing the target and fine-grained irrelevant knowledge to become entangled in the representation space, and thus vulnerable to unintended alterations during editing. To address this, we propose DiKE, a novel approach that Disentangles Knowledge representations for LLM Editing. DiKE consists of two key components: a Knowledge Representation Disentanglement (KRD) module that decomposes the subject representation into target-knowledge-related and -unrelated components, and a Disentanglement-based Knowledge Edit (DKE) module that updates only the target-related component while explicitly preserving the unrelated one. We further derive a closed-form, rank-one parameter update based on matrix theory to enable efficient and minimally invasive edits. To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge. Extensive experiments across multiple LLMs demonstrate that DiKE substantially improves fine-grained irrelevant knowledge preservation while maintaining competitive general editing performance.
中文摘要:DiKE提出了一种新颖的知识编辑方法,通过解耦主体表示来精确更新目标知识,同时保留细粒度无关事实,新基准测试验证了其在多种大语言模型中的卓越性能。
English Summary: DiKE introduces a novel knowledge editing approach that disentangles subject representations to precisely update target knowledge while preserving fine-grained irrelevant facts, validated by a new benchmark showing superior performance across multiple LLMs.

Authors:Deyang Kong, Qi Guo, Xiangyu Xi, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye
Title: Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
Abstract:
Reinforcement learning exhibits potential in enhancing the reasoning abilities of large language models, yet it is hard to scale because of low sample efficiency during the rollout phase. Existing methods attempt to improve efficiency by scheduling problems based on problem difficulties. However, these approaches suffer from unstable and biased estimations of problem difficulty and fail to capture the alignment between model competence and problem difficulty in RL training, leading to suboptimal results. To tackle these limitations, this paper introduces Competence-Difficulty Alignment Sampling (CDAS), which enables accurate and stable estimation of problem difficulties by aggregating historical performance discrepancies of problems. Then the model competence is quantified to adaptively select problems whose difficulty is in alignment with the model's current competence using a fixed-point system. Experimental results across a range of challenging mathematical benchmarks show that CDAS achieves great improvements in both accuracy and efficiency. CDAS attains the highest average accuracy against baselines and exhibits significant speed advantages compared to Dynamic Sampling, a competitive strategy in DAPO, which is 2.33 times slower than CDAS.
中文摘要:本文提出能力-难度对齐采样方法(CDAS),通过聚合历史表现差异稳定评估问题难度,并利用不动点系统量化模型能力以自适应选择匹配题目,在多项数学基准测试中实现了精度与效率的双重提升。
English Summary: The paper introduces Competence-Difficulty Alignment Sampling (CDAS), a method that improves reinforcement learning efficiency by aligning problem difficulty with model competence through stable difficulty estimation and adaptive problem selection, achieving superior accuracy and faster performance across mathematical benchmarks.

Authors:Jiawei Kong, Hao Fang, Xiaochen Yang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Ke Xu, Han Qiu
Title: Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs
Abstract:
Recent studies have widely investigated backdoor attacks on Large Language Models (LLMs) by inserting harmful question-answer (QA) pairs into training data to implant triggers. However, we revisit existing attack methods and identify two critical limitations that seriously undermine their stealthiness and practicality: (1) directly embedding harmful content into the training data compromises the model's safety alignment, resulting in high attack success rates even for clean queries without triggers, and (2) the poisoned training samples can be easily detected and filtered by safety-aligned guardrails (e.g., LLaMAGuard). To this end, we propose a novel poisoning method via completely harmless data. Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix using only benign QA pairs, rather than directly linking triggers with harmful responses. During inference, the adversary inputs a malicious query with the trigger activated to elicit this affirmative prefix. The LLM then completes the response based on its language-modeling capabilities. Notably, achieving this behavior from clean QA pairs is non-trivial. We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer. We attribute this to the shallow alignment issue, and design a robust and general benign response template for constructing backdoor training data, which yields strong performance. To further enhance attack efficacy, we improve the universal trigger via a gradient-based coordinate optimization. Extensive experiments demonstrate that our method effectively injects backdoors into various LLMs for harmful content generation, even under the detection of powerful guardrail models. For example, it achieves attack success rates (ASRs) of 86.67% and 85% on LLaMA-3-8B and Qwen-2.5-7B, respectively, as judged by GPT-4o.
中文: 本研究提出一种针对大语言模型的新型后门攻击方法,仅通过无害训练数据建立触发器与肯定回应的关联,在保持模型对齐的同时有效规避安全防护机制。
English: This study introduces a novel backdoor attack method for Large Language Models that uses only harmless training data to establish trigger-affirmation associations, effectively bypassing safety guardrails while maintaining model alignment.

Authors:Jiawei Kong, Hao Fang, Xiaochen Yang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Ke Xu, Han Qiu
Title: Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs
Abstract:
Recent studies have widely investigated backdoor attacks on Large Language Models (LLMs) by inserting harmful question-answer (QA) pairs into their training data. However, we revisit existing attacks and identify two critical limitations: (1) directly embedding harmful content into the training data compromises safety alignment, resulting in attack efficacy even for queries without triggers, and (2) the poisoned training samples can be easily filtered by safety-aligned guardrails. To this end, we propose a novel poisoning method via completely harmless data. Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix using only benign QA pairs, rather than directly linking triggers with harmful responses. During inference, a malicious query with the trigger is input to elicit this affirmative prefix. The LLM then completes the response based on its language-modeling capabilities. Achieving this using only clean samples is non-trivial. We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer. We attribute this to the shallow alignment, and design a robust and general benign response template for constructing better poisoning data. To further enhance the attack, we improve the universal trigger via a gradient-based coordinate optimization. Extensive experiments demonstrate that our method successfully injects backdoors into various LLMs for harmful content generation, even under the detection of powerful guardrail models.
中文: 本研究提出一种针对大语言模型的新型后门攻击方法,仅通过无害训练数据建立触发器与肯定回应的关联,在保持模型对齐的同时有效规避安全防护机制。
English: This study introduces a novel backdoor attack method for Large Language Models that uses only harmless training data to establish trigger-affirmation associations, effectively bypassing safety guardrails while maintaining model alignment.

Authors:Huanyu Liu, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong, Ge Li
Title: SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning
Abstract:
How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLM reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.
中文: Saturn 是一个基于布尔可满足性问题的强化学习框架,通过可扩展的任务构建、规则化验证和精确难度控制,利用课程学习从易到难训练大语言模型的推理能力,在数学和编程任务上取得了显著性能提升。
English: Saturn is a SAT-based reinforcement learning framework that overcomes scalability, verifiability, and difficulty control limitations in existing RL tasks by using Boolean Satisfiability problems to train LLMs through a curriculum learning pipeline, achieving significant performance improvements on reasoning benchmarks.
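The rule-based verification this abstract relies on is straightforward for SAT: a proposed assignment either satisfies every clause or it does not. The sketch below checks an assignment against a formula in conjunctive normal form, using the common DIMACS-style convention in which a positive integer denotes a variable and a negative integer its negation. The function names and the binary reward are assumptions for illustration, not Saturn's actual interface.

```python
def satisfies(clauses, assignment):
    """Return True if the assignment satisfies every clause of a CNF formula.

    clauses: list of clauses, each a list of nonzero ints (DIMACS-style literals).
    assignment: dict mapping each variable number to True or False.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def binary_reward(clauses, assignment):
    """Hypothetical verifiable reward: 1.0 for a satisfying assignment, else 0.0."""
    return 1.0 if satisfies(clauses, assignment) else 0.0

# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
formula = [[1, -2], [2, 3], [-1, -3]]
good = {1: True, 2: True, 3: False}
bad = {1: True, 2: False, 3: True}
```

Here `good` satisfies all three clauses, while `bad` violates the third clause because both `x1` and `x3` are true. Checking an answer takes time linear in the formula size, which is what makes the reward cheap to compute automatically.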

Authors:Bo Li, Gexiang Fang, Wei Ye, Zhenghua Xu, Jinglei Zhang, Hao Cheng, Shikun Zhang
Title: MPL: Multiple Programming Languages with Large Language Models for Information Extraction
Abstract:
Recent research in information extraction (IE) focuses on utilizing code-style inputs to enhance structured output generation. The intuition behind this is that the programming languages (PLs) inherently exhibit greater structural organization than natural languages (NLs). This structural advantage makes PLs particularly suited for IE tasks. Nevertheless, existing research primarily focuses on Python for code-style simulation, overlooking the potential of other widely-used PLs (e.g., C++ and Java) during the supervised fine-tuning (SFT) phase. In this research, we propose Multiple Programming Languages with large language models for information extraction (abbreviated as MPL), a novel framework that explores the potential of incorporating different PLs in the SFT phase. Additionally, we introduce function-prompt with virtual running to simulate code-style inputs more effectively and efficiently. Experimental results on a wide range of datasets demonstrate the effectiveness of MPL. Furthermore, we conduct extensive experiments to provide a comprehensive analysis. We have released our code for future research.
中文: 近期研究提出MPL框架,在监督微调阶段探索除Python外的多种编程语言,通过代码式输入和函数提示模拟提升信息抽取的结构化效果,并在多数据集上验证了其有效性。
English: Recent research introduces MPL, a framework leveraging multiple programming languages beyond Python in supervised fine-tuning to enhance information extraction through structured code-style inputs and function-prompt simulation, demonstrating effectiveness across diverse datasets.

Authors:Hao Fang, Jiawei Kong, Tianqu Zhuang, Yixiang Qiu, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang
Title: Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors
Abstract:
The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy greatly reduces when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.
中文摘要:提出的CoPA方法是一种无需训练的复述攻击技术,通过对比分布利用现成大语言模型生成更接近人类的文本来欺骗文本检测器,在多场景实验中验证了其有效性。
English Summary: The proposed CoPA method is a training-free paraphrase attack that uses contrastive distributions to deceive text detectors by generating more human-like text from large language models, proving effective across various scenarios.

Authors:Xiaoling Zhou, Wei Ye, Rui Xie, Shikun Zhang
Title: Mitigating Spurious Correlations with Causal Logit Perturbation
Abstract:
Deep learning has seen widespread success in various domains such as science, industry, and society. However, it is acknowledged that certain approaches suffer from non-robustness, relying on spurious correlations for predictions. Addressing these limitations is of paramount importance, necessitating the development of methods that can disentangle spurious correlations. This study attempts to implement causal models via logit perturbations and introduces a novel Causal Logit Perturbation (CLP) framework to train classifiers with generated causal logit perturbations for individual samples, thereby mitigating the spurious associations between non-causal attributes (i.e., image backgrounds) and classes. Our framework employs a perturbation network to generate sample-wise logit perturbations using a series of training characteristics of samples as inputs. The whole framework is optimized by an online meta-learning-based learning algorithm and leverages human causal knowledge by augmenting metadata in both counterfactual and factual manners. Empirical evaluations on four typical biased learning scenarios, including long-tail learning, noisy label learning, generalized long-tail learning, and subpopulation shift learning, demonstrate that CLP consistently achieves state-of-the-art performance. Moreover, visualization results support the effectiveness of the generated causal perturbations in redirecting model attention towards causal image attributes and dismantling spurious associations.
中文: 本研究提出了因果对数扰动(CLP)框架,通过基于因果知识的对数扰动训练分类器,有效减少伪相关性,并在多种偏差学习场景中实现最优性能。
English: This study introduces a Causal Logit Perturbation (CLP) framework that uses logit perturbations guided by causal knowledge to train classifiers, effectively reducing spurious correlations and achieving state-of-the-art performance across various biased learning scenarios.

Authors:Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang
Title: Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM
Abstract:
Recent advances in test-time scaling have enabled Large Language Models (LLMs) to display sophisticated reasoning abilities via extended Chain-of-Thought (CoT) generation. Despite their potential, these Reasoning LLMs (RLMs) often demonstrate counterintuitive and unstable behaviors, such as performance degradation under few-shot prompting, that challenge our current understanding of RLMs. In this work, we introduce a unified graph-based analytical framework for better modeling the reasoning processes of RLMs. Our method first clusters long, verbose CoT outputs into semantically coherent reasoning steps, then constructs directed reasoning graphs to capture contextual and logical dependencies among these steps. Through comprehensive analysis across models and prompting regimes, we reveal that structural properties, such as exploration density, branching, and convergence ratios, strongly correlate with reasoning accuracy. Our findings demonstrate how prompting strategies substantially reshape the internal reasoning structure of RLMs, directly affecting task outcomes. The proposed framework not only enables quantitative evaluation of reasoning quality beyond conventional metrics but also provides practical insights for prompt engineering and the cognitive analysis of LLMs. Code and resources will be released to facilitate future research in this direction.
中文摘要:近期测试时间扩展的进展使大语言模型通过扩展思维链生成展现复杂推理能力,但这些推理型大语言模型表现出反直觉的不稳定行为,挑战现有认知;本研究提出基于图的统一分析框架,通过将冗长思维链输出聚类为语义连贯的推理步骤并构建有向推理图,揭示结构特性与推理准确性的强相关性,为提示工程和认知分析提供新视角。
English Summary: Recent advances in test-time scaling enable LLMs to exhibit sophisticated reasoning through extended Chain-of-Thought generation, but these Reasoning LLMs show unstable behaviors that challenge current understanding, leading to the introduction of a unified graph-based framework to model their reasoning processes by clustering CoT outputs into steps and constructing directed graphs to analyze structural properties correlating with accuracy.

Authors:Yutao Mou, Xiao Deng, Yuxiao Luo, Shikun Zhang, Wei Ye
Title: Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective
Abstract:
Code security and usability are both essential for various coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on a single evaluation task and paradigm, such as code completion and generation, lacking comprehensive assessment across dimensions like secure code generation, vulnerability repair and discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering various tasks such as code completion, vulnerability repair, vulnerability detection and classification, for comprehensive evaluation of LLM code security. In addition, we develop VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities in a more efficient and reliable way. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable code well, they still tend to generate insecure code and struggle with recognizing specific vulnerability types and performing repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.
中文: 本文提出了CoV-Eval多任务基准,用于全面评估大语言模型在代码生成、修复和检测等方面的安全性,并开发了VC-Judge评估模型,该模型与专家判断一致,能高效审查大语言模型生成代码中的漏洞。
English: This paper introduces CoV-Eval, a multi-task benchmark for comprehensively evaluating large language models' code security across generation, repair, and detection tasks, and presents VC-Judge, an enhanced model that aligns with expert assessments to efficiently review vulnerabilities in LLM-generated code.

Authors:Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, Dong Wang, Xuelong Li
Title: Hume: Introducing System-2 Thinking in Visual-Language-Action Model
Abstract:
Humans practice slow thinking before performing actual actions when handling complex tasks in the physical world. This thinking paradigm has recently achieved remarkable advancement in boosting Large Language Models (LLMs) to solve complex tasks in digital domains. However, the potential of slow thinking remains largely unexplored for robotic foundation models interacting with the physical world. In this work, we propose Hume: a dual-system Vision-Language-Action (VLA) model with value-guided System-2 thinking and cascaded action denoising, exploring human-like thinking capabilities of Vision-Language-Action models for dexterous robot control. System 2 of Hume implements value-guided thinking by extending a Vision-Language-Action model backbone with a novel value-query head to estimate the state-action value of predicted actions. The value-guided thinking is conducted by repeatedly sampling multiple action candidates and selecting one according to its state-action value. System 1 of Hume is a lightweight reactive visuomotor policy that takes the action selected by System 2 and performs cascaded action denoising for dexterous robot control. At deployment time, System 2 performs value-guided thinking at a low frequency while System 1 asynchronously receives the action candidate selected by System 2 and predicts fluid actions in real time. We show that Hume outperforms the existing state-of-the-art Vision-Language-Action models across multiple simulation benchmarks and real-robot deployments.
中文: 本文提出Hume双系统视觉-语言-动作模型,通过结合价值引导的慢思考与实时动作执行来提升机器人控制能力,在仿真和实际机器人部署中均展现出优于现有模型的性能。
English: This paper introduces Hume, a dual-system Vision-Language-Action model that enhances robotic control by integrating value-guided slow thinking with real-time action execution, demonstrating superior performance over existing models in both simulations and real-world applications.
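The value-guided selection step described in the abstract (sample several action candidates, score each with a state-action value estimate, keep the best) can be sketched as a best-of-N loop. This is only a toy illustration, not Hume's implementation: the names `value_guided_select`, `noisy_policy`, and `toy_q` are hypothetical, and states and actions are reduced to single floats.

```python
import random
from typing import Callable, List

State = float   # toy stand-in for an observation (assumption)
Action = float  # toy stand-in for an action chunk (assumption)


def value_guided_select(
    state: State,
    sample_action: Callable[[State], Action],
    q_value: Callable[[State, Action], float],
    num_candidates: int = 8,
) -> Action:
    """Sample several candidate actions and return the highest-valued one."""
    candidates: List[Action] = [sample_action(state) for _ in range(num_candidates)]
    return max(candidates, key=lambda action: q_value(state, action))


# Hypothetical policy and value function, used only to exercise the loop.
rng = random.Random(0)


def noisy_policy(state: State) -> Action:
    # Proposes an action within 1.0 of the state.
    return state + rng.uniform(-1.0, 1.0)


def toy_q(state: State, action: Action) -> float:
    # Prefers actions close to the state.
    return -abs(action - state)


chosen = value_guided_select(0.5, noisy_policy, toy_q, num_candidates=16)
```

In the abstract's terms, a loop like this would run at low frequency in System 2, with the selected candidate handed to System 1 for refinement.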

Authors:Guanzhou Lan, Yuqi Yang, Anup Teejo Mathew, Feiping Nie, Rong Wang, Xuelong Li, Federico Renda, Bin Zhao
Title: Dynamic Manipulation of Deformable Objects in 3D: Simulation, Benchmark and Learning Strategy
Abstract:
Goal-conditioned dynamic manipulation is inherently challenging due to complex system dynamics and stringent task constraints, particularly in deformable object scenarios characterized by high degrees of freedom and underactuation. Prior methods often simplify the problem to low-speed or 2D settings, limiting their applicability to real-world 3D tasks. In this work, we explore 3D goal-conditioned rope manipulation as a representative challenge. To mitigate data scarcity, we introduce a novel simulation framework and benchmark grounded in reduced-order dynamics, which enables compact state representation and facilitates efficient policy learning. Building on this, we propose Dynamics Informed Diffusion Policy (DIDP), a framework that integrates imitation pretraining with physics-informed test-time adaptation. First, we design a diffusion policy that learns inverse dynamics within the reduced-order space, enabling imitation learning to move beyond naïve data fitting and capture the underlying physical structure. Second, we propose a physics-informed test-time adaptation scheme that imposes kinematic boundary conditions and structured dynamics priors on the diffusion process, ensuring consistency and reliability in manipulation execution. Extensive experiments validate the proposed approach, demonstrating strong performance in terms of accuracy and robustness in the learned policy.
中文摘要:本研究针对三维目标导向的绳索操作挑战,提出了基于降阶动力学的仿真框架和动力学知情的扩散策略,该策略结合模仿学习与物理驱动的自适应机制,显著提升了操作策略的精确性和鲁棒性。
English Summary: This study addresses the challenges of 3D goal-conditioned rope manipulation by introducing a simulation framework with reduced-order dynamics and a Dynamics Informed Diffusion Policy that combines imitation learning with physics-based adaptation for improved accuracy and robustness.

Authors:Chi Kit Ng, Long Bai, Guankun Wang, Yupeng Wang, Huxin Gao, Kun Yuan, Chenhan Jin, Tieyong Zeng, Hongliang Ren
Title: EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy
Abstract:
In endoscopic procedures, autonomous tracking of abnormal regions and following circumferential cutting markers can significantly reduce the cognitive burden on endoscopists. However, conventional model-based pipelines are fragile, because each component (e.g., detection, motion planning) requires manual tuning and struggles to incorporate high-level endoscopic intent, leading to poor generalization across diverse scenes. Vision-Language-Action (VLA) models, which integrate visual perception, language grounding, and motion planning within an end-to-end framework, offer a promising alternative by semantically adapting to surgeon prompts without manual recalibration. Despite their potential, applying VLA models to robotic endoscopy presents unique challenges due to the complex and dynamic anatomical environments of the gastrointestinal (GI) tract. To address this, we introduce EndoVLA, designed specifically for continuum robots in GI interventions. Given endoscopic images and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to circular markers during circumferential cutting. To tackle data scarcity and domain shifts, we propose a dual-phase strategy comprising supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement fine-tuning with task-aware rewards. Our approach significantly improves tracking performance in endoscopy and enables zero-shot generalization in diverse scenes and complex sequential tasks.
中文: EndoVLA是一种专为机器人内窥镜设计的视觉-语言-动作模型,通过双阶段训练策略在胃肠道手术中自主跟踪异常区域并遵循手术标记,显著提升了性能并实现了多场景下的零样本泛化能力。
English: EndoVLA, a Vision-Language-Action model designed for robotic endoscopy, autonomously tracks abnormalities and follows surgical markers in gastrointestinal procedures through a dual-phase training strategy, enhancing performance and enabling zero-shot generalization across diverse scenarios.

Authors:Long Bai, Boyi Ma, Ruohan Wang, Guankun Wang, Beilei Cui, Zhongliang Jiang, Mobarakol Islam, Zhe Min, Jiewen Lai, Nassir Navab, Hongliang Ren
Title: Multimodal Graph Representation Learning for Robust Surgical Workflow Recognition with Adversarial Feature Disentanglement
Abstract:
Surgical workflow recognition is vital for automating tasks, supporting decision-making, and training novice surgeons, ultimately improving patient safety and standardizing procedures. However, data corruption can lead to performance degradation due to issues like occlusion from bleeding or smoke in surgical scenes and problems with data storage and transmission. In this case, we explore a robust graph-based multimodal approach to integrating vision and kinematic data to enhance accuracy and reliability. Vision data captures dynamic surgical scenes, while kinematic data provides precise movement information, overcoming limitations of visual recognition under adverse conditions. We propose a multimodal Graph Representation network with Adversarial feature Disentanglement (GRAD) for robust surgical workflow recognition in challenging scenarios with domain shifts or corrupted data. Specifically, we introduce a Multimodal Disentanglement Graph Network that captures fine-grained visual information while explicitly modeling the complex relationships between vision and kinematic embeddings through graph-based message modeling. To align feature spaces across modalities, we propose a Vision-Kinematic Adversarial framework that leverages adversarial training to reduce modality gaps and improve feature consistency. Furthermore, we design a Contextual Calibrated Decoder, incorporating temporal and contextual priors to enhance robustness against domain shifts and corrupted data. Extensive comparative and ablation experiments demonstrate the effectiveness of our model and proposed modules. Moreover, our robustness experiments show that our method effectively handles data corruption during storage and transmission, exhibiting excellent stability and robustness. Our approach aims to advance automated surgical workflow recognition, addressing the complexities and dynamism inherent in surgical procedures.
中文: 本研究提出一种基于图表示和对抗特征解耦的多模态网络(GRAD),通过融合视觉与运动学数据实现稳健的手术流程识别,利用对抗训练和上下文建模有效应对数据损坏和领域偏移问题。
English: This study introduces a multimodal Graph Representation network with Adversarial feature Disentanglement (GRAD) that integrates vision and kinematic data to achieve robust surgical workflow recognition, effectively handling data corruption and domain shifts through adversarial training and contextual modeling.

Authors:Qixi Zheng, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiaofei Wang, Kai Yu, Xie Chen
Title: Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling
Abstract:
Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference speed a key challenge. Reducing the number of sampling steps can greatly improve inference efficiency. To this end, we introduce Fast F5-TTS, a training-free approach to accelerate the inference of flow-matching-based TTS models. By inspecting the sampling trajectory of F5-TTS, we identify redundant steps and propose Empirically Pruned Step Sampling (EPSS), a non-uniform time-step sampling strategy that effectively reduces the number of sampling steps. Our approach achieves a 7-step generation with an inference RTF of 0.030 on an NVIDIA RTX 3090 GPU, making it 4 times faster than the original F5-TTS while maintaining comparable performance. Furthermore, EPSS performs well on E2 TTS models, demonstrating its strong generalization ability.
中文摘要:Fast F5-TTS提出了一种无需训练的加速方法,即经验性剪枝步进采样(EPSS),仅需7步采样即可生成语音,推理速度达到原始F5-TTS的4倍,同时保持相当的性能。
English Summary: Fast F5-TTS introduces a training-free acceleration method called Empirically Pruned Step Sampling (EPSS), a non-uniform time-step sampling strategy that enables 7-step generation and runs 4 times faster than the original F5-TTS while maintaining comparable performance.
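To make "non-uniform time-step sampling" concrete, here is a minimal sketch of Euler integration of a flow-matching ODE over an arbitrary increasing list of time points. The schedule `pruned_schedule` is made up for illustration (7 steps, denser near t = 0) and is not the schedule reported for EPSS; `euler_sample` and `toy_velocity` are likewise hypothetical names, and the velocity field is a toy one that flows straight toward a known target.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def euler_sample(
    x0: Vector,
    velocity: Callable[[Vector, float], Vector],
    time_points: Sequence[float],
) -> Vector:
    """Integrate dx/dt = velocity(x, t) with Euler steps on a non-uniform grid."""
    if time_points[0] != 0.0 or time_points[-1] != 1.0:
        raise ValueError("time_points must start at 0.0 and end at 1.0")
    if any(b <= a for a, b in zip(time_points, time_points[1:])):
        raise ValueError("time_points must be strictly increasing")
    x = list(x0)
    for t, t_next in zip(time_points, time_points[1:]):
        v = velocity(x, t)
        x = [xi + (t_next - t) * vi for xi, vi in zip(x, v)]
    return x


# Toy velocity field that moves x toward a fixed target along a straight path.
target = [1.0, -2.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


# Illustrative 7-step schedule (an assumption, not the EPSS schedule).
pruned_schedule = [0.0, 1 / 16, 1 / 8, 3 / 16, 1 / 4, 1 / 2, 3 / 4, 1.0]
sample = euler_sample([0.0, 0.0], toy_velocity, pruned_schedule)
```

The number of network evaluations equals `len(time_points) - 1`, so pruning entries from the schedule directly reduces inference cost.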

Authors:Ziyang Ma, Xiquan Li, Yakun Song, Wenxi Chen, Chenpeng Du, Jian Wu, Yuanzhe Chen, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
Title: Towards Reliable Large Audio Language Model
Abstract:
Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and proactively refuse to answer questions they cannot answer. While there have been successful attempts to enhance the reliability of LLMs, reliable LALMs remain largely unexplored. In this paper, we systematically investigate various approaches towards reliable LALMs, including training-free methods such as multi-modal chain-of-thought (MCoT), and training-based methods such as supervised fine-tuning (SFT). Besides, we identify the limitations of previous evaluation metrics and propose a new metric, the Reliability Gain Index (RGI), to assess the effectiveness of different reliability methods. Our findings suggest that both training-free and training-based methods enhance the reliability of LALMs to different extents. Moreover, we find that awareness of reliability is a "meta ability", which can be transferred across different audio modalities, although significant structural and content differences exist among sound, music, and speech.
中文: 大型音频语言模型在音频理解方面展现出潜力但缺乏可靠性,可通过免训练和基于训练的方法提升,并提出了新评估指标以改进衡量效果。
English: Large audio language models show potential in audio understanding but lack reliability, which can be improved through training-free and training-based methods, with a new metric proposed for better evaluation.

Authors:Xiaoran Yin, Xu Luo, Hao Wu, Lianli Gao, Jingkuan Song
Title: Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach
Abstract:
The automatic control of mobile devices is essential for efficiently performing complex tasks that involve multiple sequential steps. However, these tasks pose significant challenges due to the limited environmental information available at each step, primarily through visual observations. As a result, current approaches, which typically rely on reactive policies, focus solely on immediate observations and often lead to suboptimal decision-making. To address this problem, we propose Foresighted Planning with World Model-Driven Code Execution (FPWC), a framework that prioritizes natural language understanding and structured reasoning to enhance the agent's global understanding of the environment by developing a task-oriented, refinable world model at the outset of the task. Foresighted actions are subsequently generated through iterative planning within this world model, executed in the form of executable code. Extensive experiments conducted in simulated environments and on real mobile devices demonstrate that our method outperforms previous approaches, particularly achieving a 44.4% relative improvement in task success rate compared to the state-of-the-art in the simulated environment. Code and demo are provided in the supplementary material.
中文: 提出的FPWC框架通过自然语言理解和基于世界模型的迭代规划来生成前瞻性行动,显著提升了移动设备控制的任务成功率,优于现有反应式方法。
English: The proposed FPWC framework enhances mobile device control by using natural language understanding and iterative planning with a world model to generate foresighted actions, significantly improving task success rates over reactive methods.

Authors:Ji Zhang, Shihan Wu, Xu Luo, Hao Wu, Lianli Gao, Heng Tao Shen, Jingkuan Song
Title: InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning
Abstract:
Leveraging pretrained Vision-Language Models (VLMs) to map language instruction and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA's attention to task-relevant factors by prepending the question "In which direction is the [object] relative to the robot?" to the language instruction and aligning the answer "right/left/up/down/front/back/grasped" and predicted actions with the ground-truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of our approach. Our code, pretrained models and demos are publicly available at: https://Koorye.github.io/proj/Inspire.
中文: 视觉-语言-动作模型存在虚假关联问题,而提出的内在空间推理方法通过引入方向性问题并将答案与动作对齐,有效提升了模型的空间推理能力,无需额外数据即可增强泛化性能。
English: Vision-Language-Action models (VLAs) face spurious correlation issues, but the proposed Intrinsic Spatial Reasoning (InSpire) method enhances their spatial reasoning by incorporating directional questions and aligning answers with actions, improving generalization without additional data.

Authors:Zekai Li, Xinhao Zhong, Samir Khaki, Zhiyuan Liang, Yuhao Zhou, Mingjia Shi, Ziqiao Wang, Xuanlei Zhao, Wangbo Zhao, Ziheng Qin, Mengxuan Wu, Pengfei Zhou, Haonan Wang, David Junhao Zhang, Jia-Wei Liu, Shaobo Wang, Dai Liu, Linfeng Zhang, Guang Li, Kun Wang, Zheng Zhu, Zhiheng Ma, Joey Tianyi Zhou, Jiancheng Lv, Yaochu Jin, Peihao Wang, Kaipeng Zhang, Lingjuan Lyu, Yiran Huang, Zeynep Akata, Zhiwei Deng, Xindi Wu, George Cazenavette, Yuzhang Shang, Justin Cui, Jindong Gu, Qian Zheng, Hao Ye, Shuo Wang, Xiaobo Wang, Yan Yan, Angela Yao, Mike Zheng Shou, Tianlong Chen, Hakan Bilen, Baharan Mirzasoleiman, Manolis Kellis, Konstantinos N. Plataniotis, Zhangyang Wang, Bo Zhao, Yang You, Kai Wang
Title: DD-Ranking: Rethinking the Evaluation of Dataset Distillation
Abstract:
In recent years, dataset distillation (DD) has provided a reliable solution for data compression, where models trained on the resulting smaller synthetic datasets achieve performance comparable to those trained on the original datasets. To further improve the performance of synthetic datasets, various training pipelines and optimization objectives have been proposed, greatly advancing the field of dataset distillation. Recent decoupled dataset distillation methods introduce soft labels and stronger data augmentation during the post-evaluation phase and scale dataset distillation up to larger datasets (e.g., ImageNet-1K). However, this raises a question: Is accuracy still a reliable metric to fairly evaluate dataset distillation methods? Our empirical findings suggest that the performance improvements of these methods often stem from additional techniques rather than the inherent quality of the images themselves, with even randomly sampled images achieving superior results. Such misaligned evaluation settings severely hinder the development of DD. Therefore, we propose DD-Ranking, a unified evaluation framework, along with new general evaluation metrics to uncover the true performance improvements achieved by different methods. By refocusing on the actual information enhancement of distilled datasets, DD-Ranking provides a more comprehensive and fair evaluation standard for future research advancements.
中文: 近期数据集蒸馏方法虽通过辅助技术提升了性能,但仅凭准确率已无法公平评估其效果,因此提出DD-Ranking框架及新评估指标,以衡量合成数据集真实的信息增强程度。
English: Recent dataset distillation methods have improved performance through auxiliary techniques, but accuracy alone is insufficient for fair evaluation, prompting the proposal of the DD-Ranking framework with new metrics to assess true information enhancement in synthetic datasets.

Authors:Shihan Wu, Ji Zhang, Xu Luo, Junlin Xie, Jingkuan Song, Heng Tao Shen, Lianli Gao
Title: Policy Contrastive Decoding for Robotic Foundation Models
Abstract:
Robotic foundation models, or generalist robot policies, hold immense potential to enable flexible, general-purpose and dexterous robotic systems. Despite their advancements, our empirical experiments reveal that existing robot policies are prone to learning spurious correlations from pre-training trajectories, adversely affecting their generalization capabilities beyond the training data. To tackle this, we propose a novel Policy Contrastive Decoding (PCD) approach, which redirects the robot policy's focus toward object-relevant visual clues by contrasting action probability distributions derived from original and object-masked visual inputs. As a training-free method, our PCD can be used as a plugin to improve different types of robot policies without needing to finetune or access model weights. We conduct extensive experiments on top of three open-source robot policies, including the autoregressive policy OpenVLA and the diffusion-based policies Octo and $π_0$. The obtained results in both simulation and real-world environments prove PCD's flexibility and effectiveness, e.g., PCD enhances the state-of-the-art policy $π_0$ by 8% in the simulation environment and by 108% in the real-world environment. Code and demos are publicly available at: https://Koorye.github.io/proj/PCD.
中文摘要:本研究提出策略对比解码(PCD)这一免训练方法,通过引导机器人策略关注物体相关视觉线索,显著提升了不同策略在仿真和现实环境中的泛化性能。
English Summary: The study introduces Policy Contrastive Decoding (PCD), a training-free method that enhances robotic policies by focusing on object-relevant visual cues, significantly improving generalization in both simulated and real-world settings.
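The idea of "contrasting action probability distributions derived from original and object-masked visual inputs" can be illustrated for a discrete action distribution. The abstract does not give the combination rule, so the formula below, `(1 + alpha) * log p_original - alpha * log p_masked`, is borrowed from generic contrastive decoding and is an assumption, as are the function names and the toy logits. It covers only the discrete (autoregressive-style) case, not diffusion-based policies.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def contrast_action_distribution(
    logits_original: Sequence[float],
    logits_masked: Sequence[float],
    alpha: float = 1.0,
) -> List[float]:
    """Boost actions whose probability drops when the object is masked out."""
    logp_original = log_softmax(logits_original)
    logp_masked = log_softmax(logits_masked)
    scores = [
        (1.0 + alpha) * lo - alpha * lm
        for lo, lm in zip(logp_original, logp_masked)
    ]
    return [math.exp(v) for v in log_softmax(scores)]


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


# Toy logits (hypothetical): action 0 stays likely even with the object masked,
# which suggests it relies on a spurious cue; action 1 depends on the object.
original = [2.0, 1.9, 0.0]
masked = [2.0, 0.5, 0.0]
contrasted = contrast_action_distribution(original, masked, alpha=1.0)
```

With these toy numbers the plain policy picks action 0, while the contrasted distribution picks action 1, the action that actually depends on the object.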

Authors:Pan Du, Wangbo Zhao, Xinai Lu, Nian Liu, Zhikai Li, Chaoyu Gong, Suyun Zhao, Hong Chen, Cuiping Li, Kai Wang, Yang You
Title: Unsupervised Learning for Class Distribution Mismatch
Abstract:
Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an "other" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability and performance. To address this, we propose Unsupervised Learning for Class Distribution Mismatch (UCDM), which constructs positive-negative pairs from unlabeled data for classifier training. Our approach randomly samples images and uses a diffusion model to add or erase semantic classes, synthesizing diverse training pairs. Additionally, we introduce a confidence-based labeling mechanism that iteratively assigns pseudo-labels to valuable real-world data and incorporates them into the training process. Extensive experiments on three datasets demonstrate UCDM's superiority over previous semi-supervised methods. Specifically, with a 60% mismatch proportion on the Tiny-ImageNet dataset, our approach, without relying on labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%, and 72.5% in classifying known, unknown, and new classes, respectively.
Chinese: 本文提出UCDM,一种无监督方法,通过扩散模型生成训练对并采用基于置信度的伪标记机制解决类分布不匹配问题,无需标注数据即超越先前半监督方法的性能。
English: This paper introduces UCDM, an unsupervised method that addresses class distribution mismatch by generating training pairs through diffusion models and employing confidence-based pseudo-labeling, outperforming previous semi-supervised approaches without requiring labeled data.

Authors:Yixuan Huang, Jie Yang, Chao-Kai Wen, Shuqiang Xia, Xiao Li, Shi Jin
Title: Learned Intelligent Recognizer with Adaptively Customized RIS Phases in Communication Systems
Abstract:
This study presents an advanced wireless system that embeds target recognition within reconfigurable intelligent surface (RIS)-aided communication systems, powered by cutting-edge deep learning innovations. Such a system faces the challenge of fine-tuning both the RIS phase shifts and neural network (NN) parameters, since they depend intricately on each other to accomplish the recognition task. To address this challenge, we propose an intelligent recognizer that strategically harnesses every piece of prior action responses, thereby ingeniously multiplexing downlink signals to facilitate environment sensing. Specifically, we design a novel NN based on the long short-term memory (LSTM) architecture and the physical channel model. The NN iteratively captures and fuses information from previous measurements and adaptively customizes RIS configurations to acquire the most relevant information for the recognition task in subsequent moments. Tailored dynamically, these configurations adapt to the scene, task, and target specifics. Simulation results reveal that our proposed method significantly outperforms the state-of-the-art method, while resulting in minimal impacts on communication performance, even as sensing is performed simultaneously.
中文: 本研究提出了一种创新的无线系统,通过深度学习技术将目标识别融入RIS辅助通信,动态优化RIS配置和神经网络参数,在实现卓越识别精度的同时保持通信性能几乎不受影响。
English: This study introduces an innovative wireless system that integrates target recognition into RIS-assisted communications using a deep learning-based approach, dynamically optimizing RIS configurations and neural network parameters to achieve superior recognition accuracy with minimal impact on communication performance.

Authors:Yixuan Huang, Jie Yang, Chao-Kai Wen, Shuqiang Xia, Xiao Li, Shi Jin
Title: Cooperative ISAC Network for Off-Grid Imaging-based Low-Altitude Surveillance
Abstract:
The low-altitude economy has emerged as a critical focus for future economic development, emphasizing the urgent need for flight activity surveillance utilizing the existing sensing capabilities of mobile cellular networks. Traditional monostatic or localization-based sensing methods, however, encounter challenges in fusing sensing results and matching channel parameters. To address these challenges, we propose an innovative approach that directly draws the radio images of the low-altitude space, leveraging its inherent sparsity with compressed sensing (CS)-based algorithms and the cooperation of multiple base stations. Furthermore, recognizing that unmanned aerial vehicles (UAVs) are randomly distributed in space, we introduce a physics-embedded learning method to overcome off-grid issues inherent in CS-based models. Additionally, an online hard example mining method is incorporated into the design of the loss function, enabling the network to adaptively concentrate on the samples bearing significant discrepancy with the ground truth, thereby enhancing its ability to detect the rare UAVs within the expansive low-altitude space. Simulation results demonstrate the effectiveness of the imaging-based low-altitude surveillance approach, with the proposed physics-embedded learning algorithm significantly outperforming traditional CS-based methods under off-grid conditions.
中文摘要:该研究提出一种基于压缩感知和多基站协作的低空成像监测方法,通过物理嵌入学习有效解决无人机离网分布问题,显著提升了传统监测技术在广阔空域中的飞行器探测能力。
English Summary: The study introduces an innovative imaging-based surveillance approach using compressed sensing and multi-base station cooperation to monitor low-altitude flight activities, enhanced by a physics-embedded learning method that effectively addresses off-grid UAV detection challenges.

Authors:Pengchao Feng, Ziyang Ma, Wenxi Chen, Yao Li, Sheng Wang, Kai Yu, Xie Chen
Title: Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation
Abstract:
In recent years, end-to-end speech-to-speech (S2S) dialogue systems have garnered increasing research attention due to their advantages over traditional cascaded systems, including achieving lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these end-to-end systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries, eliminating the need for intermediate speech-to-text conversion via techniques like ASR. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. We will release the code and dataset to support reproducibility and promote further research in this area.
中文: 本文提出了一种新颖的端到端检索增强生成框架,可直接从语音查询中检索文本知识,在弥合语音与文本模态差距的同时,显著提升了对话系统性能和检索效率。
English: This paper introduces a novel end-to-end retrieval-augmented generation framework that directly retrieves textual knowledge from speech queries, significantly improving dialogue system performance and retrieval efficiency while bridging the modality gap between speech and text.

Authors:Zheyuan Yang, Lyuhao Chen, Arman Cohan, Yilun Zhao
Title: Table-R1: Inference-Time Scaling for Table Reasoning
Abstract:
In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as emergence of essential table reasoning skills during RL training.
中文摘要:本研究首次提出通过前沿模型推理轨迹蒸馏和可验证奖励强化学习两种推理时扩展方法,开发的7B参数Table-R1模型在多种表格推理任务中达到或超越GPT-4.1和DeepSeek-R1性能,并展现出优异的泛化能力。
English Summary: This study introduces two inference-time scaling methods—distillation from reasoning traces and reinforcement learning with verifiable rewards—to develop 7B-parameter Table-R1 models that match or surpass GPT-4.1 and DeepSeek-R1 performance across diverse table reasoning tasks while demonstrating strong generalization.
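The RLVR recipe mentioned in the abstract pairs a verifiable reward with GRPO, which scores each sampled response relative to the other responses in its group. The sketch below is a simplified illustration under stated assumptions: the exact-match reward is a hypothetical stand-in for the paper's task-specific reward functions, population standard deviation is used for normalization, and the full GRPO objective (clipping, KL penalty) is omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def exact_match_reward(prediction: str, gold: str) -> float:
    """Hypothetical verifiable reward: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one table question whose gold answer is "42".
responses = ["42", "41", " 42 ", "7"]
rewards = [exact_match_reward(r, "42") for r in responses]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative, so the policy update favors responses that beat the group average without needing a learned value model.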

Authors:Lorenzo Baraldi, Davide Bucciarelli, Federico Betti, Marcella Cornia, Lorenzo Baraldi, Nicu Sebe, Rita Cucchiara
Title: What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models
Abstract:
Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, effectively evaluating images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.
中文: DICE是一种新颖模型,能检测原始图像与编辑图像间的局部差异并评估其与修改要求的关联性,在评估基于指令的图像编辑时展现出与人类判断的高度一致性。
English: DICE is a novel model that detects localized differences between original and edited images and evaluates their relevance to modification requests, demonstrating strong correlation with human judgment in assessing instruction-based image editing.

Authors:Chiyu Ma, Enpei Zhang, Yilun Zhao, Wenjun Liu, Yaning Jia, Peijun Qing, Lin Shi, Arman Cohan, Yujun Yan, Soroush Vosoughi
Title: Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge
Abstract:
LLM-as-Judge has emerged as a scalable alternative to human evaluation, enabling large language models (LLMs) to provide reward signals during training. While recent work has explored multi-agent extensions such as multi-agent debate and meta-judging to enhance evaluation quality, the question of how intrinsic biases manifest in these settings remains underexplored. In this study, we conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias. We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge. Our results show that the debate framework amplifies biases sharply after the initial debate, and this increased bias is sustained in subsequent rounds, while meta-judge approaches exhibit greater resistance. We further investigate the incorporation of PINE, a leading single-agent debiasing method, as a bias-free agent within these systems. The results reveal that this bias-free agent effectively reduces biases in debate settings but provides less benefit in meta-judge scenarios. Our work provides a comprehensive study of bias behavior in multi-agent LLM-as-Judge systems and highlights the need for targeted bias mitigation strategies in collaborative evaluation settings.
中文摘要:本研究系统分析了多智能体LLM评判系统中的四种偏见类型,发现辩论框架会显著放大偏见而元评判方法更具抵抗力,无偏见智能体在辩论中能有效减少偏见但在元评判场景中作用有限。
English Summary: This study systematically analyzes four types of biases in multi-agent LLM-as-Judge systems, finding that debate frameworks amplify biases while meta-judge approaches show greater resistance, with bias-free agents effectively reducing biases in debates but offering limited benefits in meta-judge scenarios.

Authors:Wenhan Chang, Tianqing Zhu, Yu Zhao, Shuangyong Song, Ping Xiong, Wanlei Zhou, Yongxiang Li
Title: Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models
Abstract:
In the era of rapid generative AI development, interactions between humans and large language models face significant risks of misuse. Previous research has primarily focused on black-box scenarios using human-guided prompts and white-box scenarios leveraging gradient-based LLM generation methods, neglecting the possibility that LLMs can act not only as victim models, but also as attacker models to harm other models. We propose a novel jailbreaking method inspired by the Chain-of-Thought mechanism, where the attacker model uses mission transfer to conceal harmful user intent in dialogue and generates chained narrative lures to stimulate the reasoning capabilities of victim models, leading to successful jailbreaking. To enhance the attack success rate, we introduce a helper model that performs random narrative optimization on the narrative lures during multi-turn dialogues while ensuring alignment with the original intent, enabling the optimized lures to bypass the safety barriers of victim models effectively. Our experiments reveal that models with weaker safety mechanisms exhibit stronger attack capabilities, demonstrating that models can not only be exploited, but also help harm others. By incorporating toxicity scores, we employ third-party models to evaluate the harmfulness of victim models' responses to jailbreaking attempts. The study shows that using refusal keywords as an evaluation metric for attack success rates is significantly flawed because it does not assess whether the responses actually provide guidance on the harmful questions, while toxicity scores measure more precisely both the harm of the generated content and its alignment with the harmful questions. Our approach demonstrates outstanding performance, uncovering latent vulnerabilities in LLMs and providing data-driven feedback to optimize LLM safety mechanisms. We also discuss two defensive strategies to offer guidance on improving defense mechanisms.
中文摘要:本研究提出一种新颖的越狱方法,通过任务转移和链式叙事诱饵利用大语言模型的推理能力,揭示了安全机制较弱的模型反而具备更强攻击能力,同时证明毒性评分比拒绝关键词更能精准评估攻击效果。
English Summary: This study introduces a novel jailbreaking method using mission transfer and chained narrative lures to exploit LLMs' reasoning capabilities, revealing that models with weaker safety mechanisms can effectively attack others while demonstrating the superiority of toxicity scores over refusal keywords in evaluating attack success.

Authors:Silvia Cappelletti, Tobia Poppi, Samuele Poppi, Zheng-Xin Yong, Diego Garcia-Olano, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Title: Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack
Abstract:
Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks using *first-token probability* (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (*misalignment*) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (*misinterpretation*), undermining the reliability of symbolic evaluation. We propose a simple solution: the *prefilling attack*, a structured natural-language prefix (e.g., "*The correct option is:*") prepended to the model output. Originally explored in AI safety, we repurpose prefilling to steer the model to respond with a clean, valid option, without modifying its parameters. Empirically, the FTP with prefilling strategy substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our findings suggest that prefilling is a simple, robust, and low-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.
Chinese: 该研究提出一种“预填充攻击”技术,通过在模型输出前添加结构化自然语言前缀,有效解决了首词概率评估中的偏差问题,显著提升了大型语言模型在多项选择题回答中的准确性、校准度和输出一致性。
English: The study introduces a "prefilling attack" technique that prepends structured natural-language prompts to model outputs, significantly improving the accuracy, calibration, and consistency of Large Language Models in multiple-choice question answering by mitigating issues with first-token probability evaluation.
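
The first-token-probability (FTP) procedure and the effect of a prefilled prefix can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: `build_prompt`, `first_token_choice`, and `option_mass` are hypothetical helpers, and the two next-token distributions are made-up toy numbers standing in for model outputs, not measured results. Only the prefix text "The correct option is:" comes from the abstract.

```python
from typing import Dict, List

PREFILL = "The correct option is:"  # example prefix quoted in the abstract


def build_prompt(question: str, options: Dict[str, str], prefill: str = "") -> str:
    """Format an MCQA prompt; a non-empty prefill becomes the start of the answer."""
    lines = [question] + [f"{label}. {text}" for label, text in options.items()]
    lines.append("Answer: " + prefill if prefill else "Answer:")
    return "\n".join(lines)


def first_token_choice(next_token_probs: Dict[str, float], labels: List[str]) -> str:
    """FTP: pick the option label whose token has the highest next-token probability."""
    return max(labels, key=lambda label: next_token_probs.get(label, 0.0))


def option_mass(next_token_probs: Dict[str, float], labels: List[str]) -> float:
    """Total probability on valid labels; a low value signals misalignment."""
    return sum(next_token_probs.get(label, 0.0) for label in labels)


labels = ["A", "B", "C", "D"]
# Toy distributions (illustrative only). Without the prefix, most mass sits on a
# generic preamble token; with it, the mass moves onto the option labels.
plain = {"The": 0.60, "A": 0.05, "B": 0.04, "C": 0.02, "D": 0.01}
prefilled = {"B": 0.80, "A": 0.10, "C": 0.05, "D": 0.03}

prompt = build_prompt("Which planet is largest?",
                      {"A": "Mars", "B": "Jupiter", "C": "Venus", "D": "Mercury"},
                      prefill=PREFILL)
```

In the toy case, plain FTP returns "A" with only about 12% of the probability mass on valid labels, while the prefilled distribution returns "B" with most of the mass on valid labels. Because the prefix is appended to the output side of the prompt, the model's parameters stay untouched.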

Authors:Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao
Title: Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
Abstract:
Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
中文:基于大语言模型的嵌入因单向注意力受限,我们提出采用扩散语言模型进行文本嵌入,其双向架构在多项检索任务中表现更优,验证了双向注意力对编码全局上下文的重要性。
English: LLM-based embeddings are limited by unidirectional attention, so we propose a diffusion language model for text embeddings that outperforms LLM-based models on various retrieval tasks by leveraging its bidirectional architecture.

Authors:Xiong Jun Wu, Zhenduo Zhang, ZuJie Wen, Zhiqiang Zhang, Wang Ren, Lei Shi, Cai Chen, Deng Zhao, Qing Wang, Xudong Han, Chengfu Tang, Dingnan Jin, Qing Cui, Jun Zhou
Title: SHARP: Synthesizing High-quality Aligned Reasoning Problems for Large Reasoning Models Reinforcement Learning
Abstract:
Training large reasoning models (LRMs) with reinforcement learning in STEM domains is hindered by the scarcity of high-quality, diverse, and verifiable problem sets. Existing synthesis methods, such as Chain-of-Thought prompting, often generate oversimplified or uncheckable data, limiting model advancement on complex tasks. To address these challenges, we introduce SHARP, a unified approach to Synthesizing High-quality Aligned Reasoning Problems for LRMs reinforcement learning with verifiable rewards (RLVR). SHARP encompasses a strategic set of self-alignment principles -- targeting graduate and Olympiad-level difficulty, rigorous logical consistency, and unambiguous, verifiable answers -- and a structured three-phase framework (Alignment, Instantiation, Inference) that ensures thematic diversity and fine-grained control over problem generation. We implement SHARP by leveraging a state-of-the-art LRM to infer and verify challenging STEM questions, then employ a reinforcement learning loop to refine the model's reasoning through verifiable reward signals. Experiments on benchmarks such as GPQA demonstrate that SHARP-augmented training substantially outperforms existing methods, markedly improving complex reasoning accuracy and pushing LRM performance closer to expert-level proficiency. Our contributions include the SHARP strategy, framework design, end-to-end implementation, and experimental evaluation of its effectiveness in elevating LRM reasoning capabilities.
中文总结:SHARP是一种创新方法,通过合成高质量、可验证的STEM问题来训练大型推理模型,并利用带可验证奖励的强化学习循环显著提升模型在复杂推理任务上的准确性。
English Summary: SHARP is a novel method that synthesizes high-quality, verifiable STEM problems for training large reasoning models, significantly enhancing their complex reasoning accuracy through a reinforcement learning loop with verifiable rewards.

Authors:Zhenhe Wu, Jian Yang, Jiaheng Liu, Xianjie Wu, Changzai Pan, Jie Zhang, Yu Zhao, Shuangyong Song, Yongxiang Li, Zhoujun Li
Title: Table-R1: Region-based Reinforcement Learning for Table Understanding
Abstract:
Tables present unique challenges for language models due to their structured row-column interactions, necessitating specialized approaches for effective comprehension. While large language models (LLMs) have demonstrated potential in table reasoning through prompting and techniques like chain-of-thought (CoT) and program-of-thought (PoT), optimizing their performance for table question answering remains underexplored. In this paper, we introduce region-based Table-R1, a novel reinforcement learning approach that enhances LLM table understanding by integrating region evidence into reasoning steps. Our method employs Region-Enhanced Supervised Fine-Tuning (RE-SFT) to guide models in identifying relevant table regions before generating answers, incorporating textual, symbolic, and program-based reasoning. Additionally, Table-Aware Group Relative Policy Optimization (TARPO) introduces a mixed reward system to dynamically balance region accuracy and answer correctness, with decaying region rewards and consistency penalties to align reasoning steps. Experiments show that Table-R1 achieves an average performance improvement of 14.36 points across multiple base models on three benchmark datasets, even outperforming baseline models with ten times the parameters, while TARPO reduces response token consumption by 67.5% compared to GRPO, significantly advancing LLM capabilities in efficient tabular reasoning.
中文: Table-R1提出了一种强化学习方法,通过整合区域证据并采用监督微调和混合奖励机制优化大语言模型的表格推理能力,显著提升了准确性和效率。
English: Table-R1 introduces a reinforcement learning method that enhances large language models' table reasoning by integrating region evidence and optimizing performance through supervised fine-tuning and a mixed reward system, achieving significant accuracy gains and efficiency improvements.

Authors:Jiatong Shi, Yifan Cheng, Bo-Hao Su, Hye-jin Shim, Jinchuan Tian, Samuele Cornell, Yiwen Zhao, Siddhant Arora, Shinji Watanabe
Title: ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
Abstract:
Speech signal analysis poses significant challenges, particularly in tasks such as speech quality evaluation and profiling, where the goal is to predict multiple perceptual and objective metrics. For instance, metrics like PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score) each capture different aspects of speech quality. However, these metrics often have different scales, assumptions, and dependencies, making joint estimation non-trivial. To address these issues, we introduce ARECHO (Autoregressive Evaluation via Chain-based Hypothesis Optimization), a chain-based, versatile evaluation system for speech assessment grounded in autoregressive dependency modeling. ARECHO is distinguished by three key innovations: (1) a comprehensive speech information tokenization pipeline; (2) a dynamic classifier chain that explicitly captures inter-metric dependencies; and (3) a two-step confidence-oriented decoding algorithm that enhances inference reliability. Experiments demonstrate that ARECHO significantly outperforms the baseline framework across diverse evaluation scenarios, including enhanced speech analysis, speech generation evaluation, and noisy speech evaluation. Furthermore, its dynamic dependency modeling improves interpretability by capturing inter-metric relationships.
中文: ARECHO系统通过语音标记化、动态分类器链和置信度解码三项创新,采用自回归依赖建模,能有效预测多种语音指标,并在各类评估任务中显著优于基准方法。
English: The ARECHO system introduces autoregressive dependency modeling with three innovations—speech tokenization, dynamic classifier chains, and confidence-based decoding—to effectively predict multiple speech metrics and outperform baselines in various evaluation tasks.

Authors:Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei
Title: On-Policy RL with Optimal Reward Baseline
Abstract:
Reinforcement learning algorithms are fundamental to aligning large language models with human preferences and to enhancing their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO integrates a practically feasible formulation of the optimal reward baseline that minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is merged into the verl library at https://verl.readthedocs.io/en/latest/algo/opo.html.
Chinese Summary: 本文提出带最优奖励基线的同策略强化学习算法(OPO),通过精确的同策略训练增强稳定性,并采用最优奖励基线降低梯度方差,在数学推理任务中展现出卓越性能且无需额外模型。
English Summary: The paper introduces On-Policy RL with Optimal reward baseline (OPO), a novel reinforcement learning algorithm that improves training stability through exact on-policy training and reduces computational demands by eliminating auxiliary models, demonstrating superior performance in mathematical reasoning tasks.
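
The abstract does not state the baseline formula, so the following is only a sketch of the general idea. A classical result for policy gradients is that the scalar baseline minimizing the variance of the gradient estimator is a weighted reward average, b* = E[w R] / E[w], where w is the squared norm of the score function for each sample. The function names are hypothetical, and using response length as a stand-in for w is an assumption made for this sketch, not a claim about the paper's exact formulation.

```python
from typing import List, Sequence


def weighted_reward_baseline(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Variance-minimizing scalar baseline: sum(w_i * r_i) / sum(w_i).

    `weights` stands in for per-sample squared score-function norms.
    """
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * r for w, r in zip(weights, rewards)) / total


def advantages(rewards: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Subtract the weighted baseline from each reward."""
    baseline = weighted_reward_baseline(rewards, weights)
    return [r - baseline for r in rewards]


# Toy group of four sampled responses to one prompt.
rewards = [1.0, 0.0, 1.0, 0.0]
lengths = [10, 30, 20, 40]  # ASSUMPTION: token count used as a proxy for w
baseline = weighted_reward_baseline(rewards, lengths)
adv = advantages(rewards, lengths)
```

With uniform weights this reduces to the plain mean-reward baseline. The baseline is computed from the sampled batch itself, which is consistent with the abstract's point that no auxiliary model is required.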

Authors:Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang
Title: 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model
Abstract:
Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent's ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represent current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions. Our approach allows the agent to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon environments. Experimental results demonstrate that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming the strongest baselines by 16.5% in success rate on 3DMem-Bench's most challenging in-the-wild embodied tasks.
中文: 人类擅长利用长期记忆处理复杂任务,而大型语言模型因缺乏有效的三维时空记忆建模在动态环境中表现不佳,为此提出的3DLLM-Mem模型通过动态内存管理在各项任务中实现了最优性能。
English: Humans excel at complex tasks using long-term memory, while Large Language Models struggle in 3D environments due to inadequate spatial-temporal memory, leading to the development of 3DLLM-Mem, a novel model that achieves state-of-the-art performance by dynamically managing memory.

Authors:Yue Cui, Liuyi Yao, Zitao Li, Yaliang Li, Bolin Ding, Xiaofang Zhou
Title: Efficient Leave-one-out Approximation in LLM Multi-agent Debate Based on Introspection
Abstract:
Multi-agent systems based on large language models (LLMs) advance automatic task completion in various fields, where debate is a common form of cooperation in which agents solve complicated problems through reasoning and cross-review to solidify answers. Assessing the individual contributions of agents within these debates is crucial for system refinement and outcome reliability. The traditional leave-one-out (LOO) method offers a clear framework for evaluating each agent's role but faces challenges in LLM-based systems due to high computational costs and the associated financial implications. This paper presents introspective-leave-one-out (IntrospecLOO), a simple yet effective prompting method for approximating LOO in LLM-powered multi-agent debates. IntrospecLOO introduces an additional querying round after standard debates, prompting agents to update their answers while ignoring responses from a designated agent. This strategy effectively isolates and gauges each participant's influence at a reduced query complexity compared to the original LOO approaches. Validation through experiments on three benchmark datasets confirms the effectiveness of IntrospecLOO.
中文: 本文提出内省留一法(IntrospecLOO),通过在多智能体辩论后增加查询轮次来隔离各智能体的贡献,以较低计算成本有效近似传统留一评估,在三个基准数据集上的实验验证了其有效性。
English: This paper introduces Introspective-Leave-One-Out (IntrospecLOO), a prompting method that efficiently approximates traditional leave-one-out evaluation in LLM-based multi-agent debates by adding a query round to isolate individual agent contributions, reducing computational costs while maintaining effectiveness across three benchmark datasets.
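
The reduced query complexity can be made concrete with a rough count. The abstract gives no cost model, so the accounting below is an assumption of this sketch: one LLM query per agent per debate round, standard LOO re-running the whole debate once per left-out agent, and IntrospecLOO adding one extra round per designated agent in which the remaining agents are queried once. All function names and the prompt wording are hypothetical.

```python
def debate_queries(n_agents: int, n_rounds: int) -> int:
    """Queries for one debate, assuming one query per agent per round."""
    return n_agents * n_rounds


def loo_extra_queries(n_agents: int, n_rounds: int) -> int:
    """Standard LOO: re-run the debate n_agents times with one agent removed."""
    return n_agents * debate_queries(n_agents - 1, n_rounds)


def introspec_loo_extra_queries(n_agents: int) -> int:
    """IntrospecLOO (as assumed here): one added round per designated agent."""
    return n_agents * (n_agents - 1)


def introspective_prompt(ignored_agent: str) -> str:
    """Hypothetical wording for the added round; not the paper's actual prompt."""
    return (f"Reconsider your final answer, disregarding every response "
            f"given by {ignored_agent} during the debate.")


# Example: 4 agents debating for 3 rounds.
n_agents, n_rounds = 4, 3
loo_cost = loo_extra_queries(n_agents, n_rounds)
introspec_cost = introspec_loo_extra_queries(n_agents)
```

Under this accounting the extra cost of IntrospecLOO does not grow with the number of debate rounds, whereas standard LOO scales linearly with it.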

Authors:Yin Hua, Zhiqiang Liu, Mingyang Chen, Zheng Fang, Chi Man Wong, Lingxiao Li, Chi Man Vong, Huajun Chen, Wen Zhang
Title: Beyond Completion: A Foundation Model for General Knowledge Graph Reasoning
Abstract:
In natural language processing (NLP) and computer vision (CV), the successful application of foundation models across diverse tasks has demonstrated their remarkable potential. However, despite the rich structural and textual information embedded in knowledge graphs (KGs), existing research on foundation models for KGs has primarily focused on their structural aspects, with most efforts restricted to in-KG tasks (e.g., knowledge graph completion, KGC). This limitation has hindered progress in addressing more challenging out-of-KG tasks. In this paper, we introduce MERRY, a foundation model for general knowledge graph reasoning, and investigate its performance across two task categories: in-KG reasoning tasks (e.g., KGC) and out-of-KG tasks (e.g., KG question answering, KGQA). We utilize not only the structural information but also the textual information in KGs. Specifically, we propose a multi-perspective Conditional Message Passing (CMP) encoding architecture to bridge the gap between textual and structural modalities, enabling their seamless integration. Additionally, we introduce a dynamic residual fusion module to selectively retain relevant textual information and a flexible edge scoring mechanism to adapt to diverse downstream tasks. Comprehensive evaluations on 28 datasets demonstrate that MERRY outperforms existing baselines in most scenarios, showcasing strong reasoning capabilities within KGs and excellent generalization to out-of-KG tasks such as KGQA.
中文摘要:MERRY基础模型通过多视角条件消息传递架构融合知识图谱的结构与文本信息,在知识图谱内部推理和外部任务(如问答)中均展现出卓越性能。
English Summary: The MERRY foundation model enhances knowledge graph reasoning by integrating both structural and textual information through a multi-perspective encoding architecture, achieving superior performance across in-KG and out-of-KG tasks like KGQA.

Authors:Weihang Liu, Yuhui Zhong, Yuke Li, Xi Chen, Jiadi Cui, Honglong Zhang, Lan Xu, Xin Lou, Yujiao Shi, Jingyi Yu, Yingliang Zhang
Title: CityGo: Lightweight Urban Modeling and Rendering with Proxy Buildings and Residual Gaussians
Abstract:
Accurate and efficient modeling of large-scale urban scenes is critical for applications such as AR navigation, UAV-based inspection, and smart city digital twins. While aerial imagery offers broad coverage and complements limitations of ground-based data, reconstructing city-scale environments from such views remains challenging due to occlusions, incomplete geometry, and high memory demands. Recent advances like 3D Gaussian Splatting (3DGS) improve scalability and visual quality but remain limited by dense primitive usage, long training times, and poor suitability for edge devices. We propose CityGo, a hybrid framework that combines textured proxy geometry with residual and surrounding 3D Gaussians for lightweight, photorealistic rendering of urban scenes from aerial perspectives. Our approach first extracts compact building proxy meshes from MVS point clouds, then uses zero-order SH Gaussians to generate occlusion-free textures via image-based rendering and back-projection. To capture high-frequency details, we introduce residual Gaussians placed based on proxy-photo discrepancies and guided by depth priors. Broader urban context is represented by surrounding Gaussians, with importance-aware downsampling applied to non-critical regions to reduce redundancy. A tailored optimization strategy jointly refines proxy textures and Gaussian parameters, enabling real-time rendering of complex urban scenes on mobile GPUs with significantly reduced training and memory requirements. Extensive experiments on real-world aerial datasets demonstrate that our hybrid representation significantly reduces training time, achieving on average 1.4x speedup, while delivering comparable visual fidelity to pure 3D Gaussian Splatting approaches. Furthermore, CityGo enables real-time rendering of large-scale urban scenes on mobile consumer GPUs, with substantially reduced memory usage and energy consumption.
中文摘要:CityGo是一种结合代理几何与三维高斯分布的混合框架,能够从航拍视角实现高效、逼真的城市场景渲染,在保持视觉质量的同时显著减少训练时间和内存占用。
English Summary: CityGo is a hybrid framework that combines proxy geometry with 3D Gaussians to enable efficient, photorealistic urban scene rendering from aerial views, significantly reducing training time and memory usage while maintaining visual quality.

Authors:Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, Heng Ji
Title: PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
Abstract:
Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.
中文摘要:本文提出了PARTONOMY基准测试,揭示了大模型在物体部件定位方面的不足,并开发了PLUM模型,通过改进架构设计在分割和推理任务中实现了更优表现。
English Summary: This paper introduces PARTONOMY, a benchmark revealing large multimodal models' limitations in object part grounding, and proposes PLUM, a novel model that overcomes architectural flaws to achieve superior performance in segmentation and reasoning tasks.

Authors:Jiatong Shi, Hye-Jin Shim, Shinji Watanabe
Title: Uni-VERSA: Versatile Speech Assessment with a Unified Network
Abstract:
Subjective listening tests remain the gold standard for speech quality assessment, but are costly, variable, and difficult to scale. In contrast, existing objective metrics, such as PESQ, F0 correlation, and DNSMOS, typically capture only specific aspects of speech quality. To address these limitations, we introduce Uni-VERSA, a unified network that simultaneously predicts various objective metrics, encompassing naturalness, intelligibility, speaker characteristics, prosody, and noise, for a comprehensive evaluation of speech signals. We formalize its framework, evaluation protocol, and applications in speech enhancement, synthesis, and quality control. A benchmark based on the URGENT24 challenge, along with a baseline leveraging self-supervised representations, demonstrates that Uni-VERSA provides a viable alternative to single-aspect evaluation methods. Moreover, it aligns closely with human perception, making it a promising approach for future speech quality assessment.
Chinese: Uni-VERSA作为一种统一网络,通过预测多种客观指标全面评估语音质量,为传统主观测试和单一客观方法提供了可扩展且与人类感知高度一致的新方案。
English: Uni-VERSA is a unified network that comprehensively evaluates speech quality by predicting multiple objective metrics, offering a scalable and human-aligned alternative to traditional subjective tests and single-aspect objective methods.

Authors:Chunyang Jiang, Chi-min Chan, Yiyang Cai, Yulong Liu, Wei Xue, Yike Guo
Title: Graceful Forgetting in Generative Language Models
Abstract:
Recently, the pretrain-finetune paradigm has become a cornerstone in various deep learning areas. While in general the pre-trained model promotes both the effectiveness and efficiency of fine-tuning on downstream tasks, studies have shown that not all knowledge acquired during pre-training is beneficial. Some of the knowledge may actually have detrimental effects on the fine-tuning tasks, which is known as negative transfer. To address this problem, graceful forgetting has emerged as a promising approach. The core principle of graceful forgetting is to enhance the learning plasticity of the target task by selectively discarding irrelevant knowledge. However, this approach remains underexplored in the context of generative language models, and it is often challenging to migrate existing forgetting algorithms to these models due to architecture incompatibility. To bridge this gap, in this paper we propose a novel framework, Learning With Forgetting (LWF), to achieve graceful forgetting in generative language models. With Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task, and consequently, knowledge with high confidence is periodically unlearned during fine-tuning. Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.
中文摘要:本文提出的“学习与遗忘”(LWF)框架通过基于费雪信息矩阵的置信度评估,在生成式语言模型中实现选择性知识遗忘,有效缓解预训练知识对下游任务的负面迁移,从而提升微调性能。
English Summary: The proposed Learning With Forgetting (LWF) framework addresses negative transfer in generative language models by selectively unlearning irrelevant pre-trained knowledge through Fisher Information Matrix-weighted confidence evaluation, thereby improving fine-tuning performance.

Authors:Maxime Elkael, Michele Polese, Reshma Prasad, Stefano Maxenti, Tommaso Melodia
Title: ALLSTaR: Automated LLM-Driven Scheduler Generation and Testing for Intent-Based RAN
Abstract:
The evolution toward open, programmable O-RAN and AI-RAN 6G networks creates unprecedented opportunities for Intent-Based Networking (IBN) to dynamically optimize RAN[...]. However, applying IBN effectively to the RAN scheduler [...] remains a significant challenge. Current approaches predominantly rely on coarse-grained network slicing, lacking the granularity for dynamic adaptation to individual user conditions and traffic patterns. Despite the existence of a vast body of scheduling algorithms [...], their practical utilization is hindered by implementation heterogeneity, insufficient systematic evaluation in production environments, and the complexity of developing high-performance scheduler implementations.[...] To address these limitations, we propose ALLSTaR (Automated LLm-driven Scheduler generation and Testing for intent-based RAN), a novel framework leveraging LLMs for automated, intent-driven scheduler design, implementation, and evaluation. ALLSTaR interprets NL intents, automatically generates functional scheduler code from the research literature using OCR and LLMs, and intelligently matches operator intents to the most suitable scheduler(s). Our implementation deploys these schedulers as O-RAN dApps, enabling on-the-fly deployment and testing on a production-grade, 5G-compliant testbed. This approach has enabled the largest-scale OTA experimental comparison of 18 scheduling algorithms automatically synthesized from the academic literature. The resulting performance profiles serve as the input for our Intent-Based Scheduling (IBS) framework, which dynamically selects and deploys appropriate schedulers that optimally satisfy operator intents. We validate our approach through multiple use cases unattainable with current slicing-based optimization techniques, demonstrating fine-grained control based on buffer status, physical layer conditions, and heterogeneous traffic types.
中文: 提出的ALLSTaR框架利用大语言模型实现6G网络中基于意图的调度器自动设计与部署,通过最大规模合成调度算法实验比较,实现了动态优化能力。
English: The proposed ALLSTaR framework leverages LLMs to automate intent-based scheduler design and deployment in 6G networks, enabling dynamic optimization through the largest experimental comparison of synthesized scheduling algorithms.
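
The abstract describes matching operator intents to schedulers using measured performance profiles. A minimal sketch of that idea is shown below, assuming a simple weighted-sum score; this is not the authors' IBS implementation. The scheduler names, metric names, and all numbers are hypothetical placeholders chosen only for illustration.

```python
# Hypothetical performance profiles: normalized scores in [0, 1], higher is better.
# These values are invented for illustration and are NOT results from the paper.
PROFILES = {
    "round_robin": {"throughput": 0.60, "fairness": 0.95},
    "proportional_fair": {"throughput": 0.80, "fairness": 0.80},
    "max_rate": {"throughput": 0.95, "fairness": 0.30},
}


def select_scheduler(profiles, intent_weights):
    """Return the scheduler whose profile maximizes a weighted sum of metrics.

    intent_weights maps metric name -> importance; it stands in for a parsed
    operator intent (an assumption, since the abstract does not give a format).
    """
    def score(metrics):
        return sum(w * metrics.get(name, 0.0) for name, w in intent_weights.items())

    return max(profiles, key=lambda name: score(profiles[name]))


# An intent that only cares about fairness.
best_for_fairness = select_scheduler(PROFILES, {"fairness": 1.0})
# An intent that mostly cares about throughput.
best_for_throughput = select_scheduler(PROFILES, {"throughput": 0.9, "fairness": 0.1})
print(best_for_fairness, best_for_throughput)
```

A weighted sum is only one possible matching rule; a real system would likely also need hard constraints (for example, a minimum fairness level) that a single score cannot express.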

Authors:Jing Yu, Yuqi Tang, Kehua Feng, Mingyang Rao, Lei Liang, Zhiqiang Zhang, Mengshu Sun, Wen Zhang, Qiang Zhang, Keyan Ding, Huajun Chen
Title: SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models
Abstract:
Large Language Models (LLMs) have shown impressive capabilities in contextual understanding and reasoning. However, evaluating their performance across diverse scientific domains remains underexplored, as existing benchmarks primarily focus on general domains and fail to capture the intricate complexity of scientific data. To bridge this gap, we construct SciCUEval, a comprehensive benchmark dataset tailored to assess the scientific context understanding capability of LLMs. It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts. SciCUEval systematically evaluates four core competencies: Relevant information identification, Information-absence detection, Multi-source information integration, and Context-aware inference, through a variety of question formats. We conduct extensive evaluations of state-of-the-art LLMs on SciCUEval, providing a fine-grained analysis of their strengths and limitations in scientific context understanding, and offering valuable insights for the future development of scientific-domain LLMs.
中文:SciCUEval是一个专为评估大语言模型科学语境理解能力构建的综合基准数据集,涵盖十个学科领域,通过多种数据模态系统评估四项核心能力,为科学领域大语言模型的未来发展提供了重要洞见。
English: SciCUEval is a comprehensive benchmark designed to evaluate the scientific context understanding of Large Language Models across ten domains, assessing four core competencies through diverse data modalities and revealing key insights for future development.

Authors:Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei
Title: Reward Reasoning Model
Abstract:
Reward models play a critical role in guiding large language models toward outputs that align with human expectations. However, an open challenge remains in effectively utilizing test-time compute to enhance reward model performance. In this work, we introduce Reward Reasoning Models (RRMs), which are specifically designed to execute a deliberate reasoning process before generating final rewards. Through chain-of-thought reasoning, RRMs leverage additional test-time compute for complex queries where appropriate rewards are not immediately apparent. To develop RRMs, we implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities without requiring explicit reasoning traces as training data. Experimental results demonstrate that RRMs achieve superior performance on reward modeling benchmarks across diverse domains. Notably, we show that RRMs can adaptively exploit test-time compute to further improve reward accuracy. The pretrained reward reasoning models are available at https://huggingface.co/Reward-Reasoning.
中文摘要:奖励推理模型(RRMs)通过测试时进行深思熟虑的思维链推理来改进奖励建模,无需显式推理数据即可自适应利用计算资源,在多个领域基准测试中展现出卓越性能。
English Summary: Reward Reasoning Models (RRMs) enhance reward modeling by employing deliberate chain-of-thought reasoning during test time, achieving superior performance across diverse benchmarks through adaptive computation without requiring explicit reasoning data.

Authors:Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, Furu Wei
Title: Think Only When You Need with Large Hybrid-Reasoning Models
Abstract:
Recent Large Reasoning Models (LRMs) have shown substantially improved reasoning capabilities over traditional Large Language Models (LLMs) by incorporating extended thinking processes prior to producing final responses. However, excessively lengthy thinking introduces substantial overhead in terms of token consumption and latency, which is particularly unnecessary for simple queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the first kind of model capable of adaptively determining whether to perform thinking based on the contextual information of user queries. To achieve this, we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with the proposed Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate thinking mode. Furthermore, we introduce a metric called Hybrid Accuracy to quantitatively assess the model's capability for hybrid thinking. Extensive experimental results show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type. It outperforms existing LRMs and LLMs in reasoning and general capabilities while significantly improving efficiency. Together, our work advocates for a reconsideration of the appropriate use of extended thinking processes and provides a solid starting point for building hybrid thinking systems.
Chinese: 大型混合推理模型(LHRMs)能根据查询复杂度自适应启用扩展思考,在提升推理能力的同时显著优化效率,全面优于现有模型。
English: Large Hybrid-Reasoning Models (LHRMs) adaptively decide when to use extended thinking based on query complexity, enhancing reasoning efficiency and outperforming existing models while reducing unnecessary overhead.

Authors:Bo Ai, Yunlong Lu, Yuguang Fang, Dusit Niyato, Ruisi He, Wei Chen, Jiayi Zhang, Guoyu Ma, Yong Niu, Zhangdui Zhong
Title: 6G-Enabled Smart Railways
Abstract:
Smart railways integrate advanced information technologies into railway operating systems to improve efficiency and reliability. Although the development of 5G has enhanced railway services, future smart railways require ultra-high speeds, ultra-low latency, ultra-high security, full coverage, and ultra-high positioning accuracy, which 5G cannot fully meet. Therefore, 6G is envisioned to provide green and efficient all-day operations, strong information security, fully automatic driving, and low-cost intelligent maintenance. To achieve these requirements, we propose an integrated network architecture leveraging communications, computing, edge intelligence, and caching in railway systems. We have conducted in-depth investigations on key enabling technologies for reliable transmissions and wireless coverage. For high-speed mobile scenarios, we propose an AI-enabled cross-domain channel modeling and orthogonal time-frequency space-time spread multiple access mechanism to alleviate the conflict between limited spectrum availability and massive user access. The roles of blockchain, edge intelligence, and privacy technologies in endogenously secure rail communications are also evaluated. We further explore the application of emerging paradigms such as integrated sensing and communications, AI-assisted Internet of Things, semantic communications, and digital twin networks for railway maintenance, monitoring, prediction, and accident warning. Finally, possible future research and development directions are discussed.
中文: 智能铁路正通过采用6G和集成网络架构,结合人工智能、区块链和边缘智能等技术,超越5G局限,实现超高性能、安全保障和全自动运营。
English: Smart railways are advancing beyond 5G capabilities by adopting 6G and an integrated network architecture to achieve ultra-high performance, security, and automation through technologies like AI, blockchain, and edge intelligence.

Authors:Yuchang Sun, Yanxi Chen, Yaliang Li, Bolin Ding
Title: Enhancing Latent Computation in Transformers with Latent Tokens
Abstract:
Augmenting large language models (LLMs) with auxiliary tokens has emerged as a promising strategy for enhancing model performance. In this work, we introduce a lightweight method termed latent tokens; these are dummy tokens that may be non-interpretable in natural language but steer the autoregressive decoding process of a Transformer-based LLM via the attention mechanism. The proposed latent tokens can be seamlessly integrated with a pre-trained Transformer, trained in a parameter-efficient manner, and applied flexibly at inference time, while adding minimal complexity overhead to the existing infrastructure of standard Transformers. We propose several hypotheses about the underlying mechanisms of latent tokens and design synthetic tasks accordingly to verify them. Numerical results confirm that the proposed method noticeably outperforms the baselines, particularly in the out-of-distribution generalization scenarios, highlighting its potential in improving the adaptability of LLMs.
中文: 通过引入潜在标记——一种不可解释的虚拟标记,通过注意力机制引导解码过程,能够以最小复杂度显著提升大语言模型的性能与适应性,尤其在分布外泛化场景中表现突出。
English: Augmenting LLMs with latent tokens, which are non-interpretable dummy tokens that guide the decoding process through attention, enhances model performance and adaptability, especially in out-of-distribution scenarios, with minimal complexity.

Authors:Yuntao Du, Zitao Li, Bolin Ding, Yaliang Li, Hanshen Xiao, Jingren Zhou, Ninghui Li
Title: Automated Profile Inference with Language Model Agents
Abstract:
Impressive progress has been made in automated problem-solving through the collaboration of large language model (LLM) based agents. However, these automated capabilities also open avenues for malicious applications. In this paper, we study a new threat that LLMs pose to online pseudonymity, called automated profile inference, where an adversary can instruct LLMs to automatically scrape and extract sensitive personal attributes from publicly visible user activities on pseudonymous platforms. We also introduce an automated profiling framework called AutoProfiler to assess the feasibility of such threats in real-world scenarios. AutoProfiler consists of four specialized LLM agents that work collaboratively to collect and process user online activities and generate a profile with extracted personal information. Experimental results on two real-world datasets and one synthetic dataset demonstrate that AutoProfiler is highly effective and efficient, and can be easily deployed on a web scale. We demonstrate that the inferred attributes are both sensitive and identifiable, posing significant risks of privacy breaches, such as de-anonymization and sensitive information leakage. Additionally, we explore mitigation strategies from different perspectives and advocate for increased public awareness of this emerging privacy threat to online pseudonymity.
中文:大型语言模型代理通过自动画像推断对在线匿名性构成新威胁,能有效从公开用户活动中提取敏感个人信息,AutoProfiler框架实验证明其高效性,带来去匿名化等严重隐私风险,亟需提高公众意识并采取防护措施。
English: Large language model agents pose a new threat to online pseudonymity through automated profile inference, effectively extracting sensitive personal data from public user activities, as demonstrated by the highly efficient AutoProfiler framework, which raises significant privacy risks like de-anonymization and necessitates increased public awareness and mitigation strategies.

Authors:Danlong Yuan, Tian Xie, Shaohan Huang, Zhuocheng Gong, Huishuai Zhang, Chong Luo, Furu Wei, Dongyan Zhao
Title: Efficient RL Training for Reasoning Models via Length-Aware Optimization
Abstract:
Large reasoning models, such as OpenAI o1 or DeepSeek R1, have demonstrated remarkable performance on reasoning tasks but often incur a long reasoning path with significant memory and time costs. Existing methods primarily aim to shorten reasoning paths by introducing additional training data and stages. In this paper, we propose three critical reward designs integrated directly into the reinforcement learning process of large reasoning models, which reduce the response length without extra training stages. Experiments on four settings show that our method significantly decreases response length while maintaining or even improving performance. Specifically, in a logic reasoning setting, we achieve a 40% reduction in response length averaged by steps alongside a 14% gain in performance. For math problems, we reduce response length averaged by steps by 33% while preserving performance.
中文摘要:本文针对大型推理模型提出三种融入强化学习的奖励设计,无需额外训练阶段即可显著缩短响应长度,并在逻辑推理和数学问题上保持甚至提升性能。
English Summary: This paper introduces three reward designs integrated into reinforcement learning for large reasoning models, effectively reducing response length without extra training stages while maintaining or improving performance across logic and math tasks.

Authors:Chi-Min Chan, Chunpu Xu, Jiaming Ji, Zhen Ye, Pengcheng Wen, Chunyang Jiang, Yaodong Yang, Wei Xue, Sirui Han, Yike Guo
Title: J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge
Abstract:
The current focus of AI research is shifting from emphasizing model training towards enhancing evaluation quality, a transition that is crucial for driving further advancements in AI systems. Traditional evaluation methods typically rely on reward models assigning scalar preference scores to outputs. Although effective, such approaches lack interpretability, often leaving users uncertain about why a reward model rates a particular response as high or low. The advent of LLM-as-a-Judge provides a more scalable and interpretable method of supervision, offering insights into the decision-making process. Moreover, with the emergence of large reasoning models, which consume more tokens for deeper thinking and answer refinement, scaling test-time computation in the LLM-as-a-Judge paradigm presents an avenue for further boosting performance and providing more interpretability through reasoning traces. In this paper, we introduce J1-7B, which is first supervised fine-tuned on reflection-enhanced datasets collected via rejection-sampling and subsequently trained using Reinforcement Learning (RL) with verifiable rewards. At inference time, we apply Simple Test-Time Scaling (STTS) strategies for additional performance improvement. Experimental results demonstrate that J1-7B surpasses the previous state-of-the-art LLM-as-a-Judge by 4.8% and exhibits a 5.1% stronger scaling trend under STTS. Additionally, we present three key findings: (1) Existing LLM-as-a-Judge models do not inherently exhibit such a scaling trend. (2) A model simply fine-tuned on reflection-enhanced datasets continues to demonstrate similarly weak scaling behavior. (3) A significant scaling trend emerges primarily during the RL phase, suggesting that effective STTS capability is acquired predominantly through RL training.
中文: AI研究正从模型训练转向提升评估质量,其中LLM-as-a-Judge方法提供了可扩展且可解释的监督机制,新提出的J1-7B模型通过微调和强化学习优化,在简单测试时扩展策略下超越了先前方法,并展现出更强的扩展趋势。
English: AI research is shifting from model training to improving evaluation quality, with the LLM-as-a-Judge approach offering scalable and interpretable supervision, and the introduced J1-7B model, enhanced through fine-tuning and reinforcement learning, outperforms previous methods and shows stronger scaling under Simple Test-Time Scaling strategies.

Authors:Peng Li, Suizhi Ma, Jialiang Chen, Yuan Liu, Congyi Zhang, Wei Xue, Wenhan Luo, Alla Sheffer, Wenping Wang, Yike Guo
Title: CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation
Abstract:
Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods rely only on an input image or a text prompt to generate a 3D model, which offers no control over individual components of the generated model. Any modification of the input image requires regenerating the entire 3D model. In this paper, we introduce a new method called CMD that generates a 3D model from an input image while enabling flexible local editing of each component of the 3D model. In CMD, we formulate the 3D generation as a conditional multiview diffusion model, which takes the existing or known parts as conditions and generates the edited or added components. This conditional multiview diffusion model not only allows the generation of 3D models part by part but also enables local editing of 3D models according to the local revision of the input image without changing other 3D parts. Extensive experiments are conducted to demonstrate that CMD decomposes a complex 3D generation task into multiple components, improving the generation quality. Meanwhile, CMD enables efficient and flexible local editing of a 3D model by just editing one rendered image.
中文: CMD方法通过条件多视角扩散模型,实现了按组件生成3D模型的功能,并支持仅修改单张渲染图像即可对模型进行灵活局部编辑,显著提升了生成质量与控制能力。
English: The CMD method introduces a conditional multiview diffusion model that enables both component-by-component generation and flexible local editing of 3D models by modifying a single rendered image, significantly improving generation quality and control.

Authors:Jiayang Liu, Siyuan Liang, Shiqian Zhao, Rongcheng Tu, Wenbo Zhou, Aishan Liu, Dacheng Tao, Siew Kei Lam
Title: T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks
Abstract:
In recent years, fueled by the rapid advancement of diffusion models, text-to-video (T2V) generation models have achieved remarkable progress, with notable examples including Pika, Luma, Kling, and Open-Sora. Although these models exhibit impressive generative capabilities, they also expose significant security risks due to their vulnerability to jailbreak attacks, where the models are manipulated to produce unsafe content such as pornography, violence, or discrimination. Existing works such as T2VSafetyBench provide preliminary benchmarks for safety evaluation, but lack systematic methods for thoroughly exploring model vulnerabilities. To address this gap, we are the first to formalize the T2V jailbreak attack as a discrete optimization problem and propose a joint objective-based optimization framework, called T2V-OptJail. This framework consists of two key optimization goals: bypassing the built-in safety filtering mechanisms to increase the attack success rate, and preserving semantic consistency between the adversarial prompt and the unsafe input prompt, as well as between the generated video and the unsafe input prompt, to enhance content controllability. In addition, we introduce an iterative optimization strategy guided by prompt variants, where multiple semantically equivalent candidates are generated in each round, and their scores are aggregated to robustly guide the search toward optimal adversarial prompts. We conduct large-scale experiments on several T2V models, covering both open-source models and real commercial closed-source models. The experimental results show that the proposed method improves over the existing state-of-the-art method by 11.4% and 10.0% in terms of attack success rate assessed by GPT-4 and by human assessors, respectively, verifying the significant advantages of the method in terms of attack effectiveness and content control.
中文摘要:本研究提出T2V-OptJail优化框架,首次将文本到视频越狱攻击形式化为离散优化问题,通过联合优化目标和迭代提示策略显著提升攻击成功率与内容可控性。
English Summary: This study introduces T2V-OptJail, a novel optimization framework that formalizes text-to-video jailbreak attacks as a discrete optimization problem, significantly enhancing attack success rates and content controllability through joint objectives and iterative prompt refinement.

Authors:Han Xiao, Yina Xie, Guanxin Tan, Yinghao Chen, Rui Hu, Ke Wang, Aojun Zhou, Hao Li, Hao Shao, Xudong Lu, Peng Gao, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Title: Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
Abstract:
Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios. Our code and models are released at https://github.com/Euphoria16/DocMark.
中文: 作者提出了一种利用自适应标记语言生成构建结构化文档表示的新流程,并引入两个精细数据集,显著提升了视觉文档理解能力,在多项基准测试中优于现有模型。
English: The authors introduce a novel pipeline using adaptive markup language generation to create structured document representations, along with two detailed datasets, significantly enhancing visual document understanding and outperforming existing models in benchmarks.

Authors:Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao
Title: T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models
Abstract:
Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mathematical formulas, remains largely untested, posing significant challenges for applications requiring exact textual accuracy. In this work, we introduce T2VTextBench, the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Our suite of prompts integrates complex text strings with dynamic scene changes, testing each model's ability to maintain detailed instructions across frames. We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text. These results highlight a critical gap in current video generators and provide a clear direction for future research aimed at enhancing textual manipulation in video synthesis.
Chinese: 尽管文本到视频模型在内容生成上取得显著进展,但其准确呈现屏幕文字的能力仍显不足,为此我们引入T2VTextBench评估框架,旨在推动视频合成中文字处理技术的改进。
English: Recent advances in text-to-video models enable high-quality content creation but reveal a critical weakness in rendering accurate on-screen text, prompting the development of T2VTextBench to evaluate and address this gap.

Authors:Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo
Title: Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion
Abstract:
Generating gestures from human speech has made tremendous progress in animating virtual avatars. While existing methods can synthesize gestures for an individual speaking alone, they overlook the practicality of concurrent gesture modeling in two-person interactive conversations. Moreover, the lack of high-quality datasets with concurrent co-speech gestures also limits progress on this issue. To address this, we first construct a large-scale concurrent co-speech gesture dataset that contains more than 7M frames for diverse two-person interactive posture sequences, dubbed GES-Inter. Additionally, we propose Co$^3$Gesture, a novel framework that enables coherent concurrent co-speech gesture synthesis including two-person interactive movements. Considering the asymmetric body dynamics of two speakers, our framework is built upon two cooperative generation branches conditioned on separated speaker audio. Specifically, to enhance the coordination of human postures with respect to corresponding speaker audios while interacting with the conversational partner, we present a Temporal Interaction Module (TIM). TIM can effectively model the temporal association representation between two speakers' gesture sequences as interaction guidance and fuse it into the concurrent gesture generation. Then, we devise a mutual attention mechanism to further holistically boost learning dependencies of interacted concurrent motions, thereby enabling us to generate vivid and coherent gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected GES-Inter dataset. The dataset and source code are publicly available at https://mattie-e.github.io/Co3/.
中文: 本研究提出了一种新框架和数据集,用于合成双人对话中的同步手势,填补了并发语音手势建模的空白,并在性能上超越了现有方法。
English: This study introduces a novel framework and dataset for synthesizing synchronized gestures in two-person conversations, addressing the gap in concurrent co-speech gesture modeling and outperforming existing methods.

Authors:Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao
Title: T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation
Abstract:
Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce T2VPhysBench, a first-principles benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.
中文: 文本到视频模型在质量和创意上虽有进步,但仍普遍违背基本物理规律;新基准T2VPhysBench通过人工评估和专项测试显示,主流系统在物理一致性上存在显著缺陷。
English: Text-to-video models have advanced in quality and creativity but still fail to adhere to fundamental physical laws, as demonstrated by the new benchmark T2VPhysBench, which reveals significant shortcomings across major systems through human evaluation and targeted studies.

Authors:Guobin Shen, Dongcheng Zhao, Linghao Feng, Xiang He, Jihang Wang, Sicheng Shen, Haibo Tong, Yiting Dong, Jindong Li, Xiang Zheng, Yi Zeng
Title: PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks
Abstract:
Large language models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial prompts known as jailbreaks, which can bypass safety alignment and elicit harmful outputs. Despite growing efforts in LLM safety research, existing evaluations are often fragmented, focused on isolated attack or defense techniques, and lack systematic, reproducible analysis. In this work, we introduce PandaGuard, a unified and modular framework that models LLM jailbreak safety as a multi-agent system comprising attackers, defenders, and judges. Our framework implements 19 attack methods and 12 defense mechanisms, along with multiple judgment strategies, all within a flexible plugin architecture supporting diverse LLM interfaces, multiple interaction modes, and configuration-driven experimentation that enhances reproducibility and practical deployment. Built on this framework, we develop PandaBench, a comprehensive benchmark that evaluates the interactions between these attack/defense methods across 49 LLMs and various judgment approaches, requiring over 3 billion tokens to execute. Our extensive evaluation reveals key insights into model vulnerabilities, defense cost-performance trade-offs, and judge consistency. We find that no single defense is optimal across all dimensions and that judge disagreement introduces nontrivial variance in safety assessments. We release the code, configurations, and evaluation results to support transparent and reproducible research in LLM safety.
中文: 大语言模型易受越狱攻击绕过安全防护,为此开发了PandaGuard多智能体框架系统评估攻防方法,发现没有单一防御能普遍适用,且评估者不一致性会影响安全判断。
English: Large language models are susceptible to jailbreak attacks that bypass safety measures, prompting the development of PandaGuard, a multi-agent framework for systematic evaluation of attack and defense methods, which reveals that no single defense is universally optimal and judge inconsistency affects safety assessments.
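
The abstract notes that judge disagreement introduces variance into safety assessments. A minimal sketch of how such disagreement could be quantified is given below, assuming each judge emits a "safe" or "unsafe" label per response; the function names and the example verdicts are hypothetical and are not taken from PandaGuard's code or results.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Return the most common label among the judges for one response."""
    return Counter(verdicts).most_common(1)[0][0]


def disagreement_rate(all_verdicts):
    """Fraction of responses on which the judges are not unanimous."""
    split = sum(1 for verdicts in all_verdicts if len(set(verdicts)) > 1)
    return split / len(all_verdicts)


# Invented verdicts from three hypothetical judges on four responses.
example_verdicts = [
    ["safe", "safe", "safe"],
    ["unsafe", "safe", "unsafe"],
    ["unsafe", "unsafe", "unsafe"],
    ["safe", "unsafe", "safe"],
]

majorities = [majority_verdict(v) for v in example_verdicts]
rate = disagreement_rate(example_verdicts)
print(majorities, rate)
```

Using an odd number of judges avoids ties in the majority vote; with an even number, a tie-breaking rule would have to be chosen explicitly.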

Authors:Tenglong Li, Jindong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang, Yi Zeng
Title: FireFly-T: High-Throughput Sparsity Exploitation for Spiking Transformer Acceleration with Dual-Engine Overlay Architecture
Abstract:
Spiking transformers are emerging as a promising architecture that combines the energy efficiency of Spiking Neural Networks (SNNs) with the powerful attention mechanisms of transformers. However, existing hardware accelerators lack support for spiking attention, exhibit limited throughput in exploiting fine-grained sparsity, and struggle with scalable parallelism in sparse computation. To address these, we propose FireFly-T, a dual-engine overlay architecture that integrates a sparse engine for activation sparsity and a binary engine for spiking attention. In the sparse engine, we propose a high-throughput sparse decoder that exploits fine-grained sparsity by concurrently extracting multiple non-zero spikes. To complement this, we introduce a scalable load balancing mechanism with weight dispatch and out-of-order execution, eliminating bank conflicts to support scalable multidimensional parallelism. In the binary engine, we leverage the byte-level write capability of SRAMs to efficiently manipulate the 3D dataflows required for spiking attention with minimal resource overhead. We also optimize the core AND-PopCount operation in spiking attention through a LUT6-based implementation, improving timing closure and reducing LUT utilization on Xilinx FPGAs. As an overlay architecture, FireFly-T further incorporates an orchestrator that dynamically manipulates input dataflows with flexible adaptation for diverse network topologies, while ensuring efficient resource utilization and maintaining high throughput. Experimental results demonstrate that our accelerator achieves 1.39× and 2.40× higher energy efficiency, as well as 4.21× and 7.10× greater DSP efficiency, compared to FireFly v2 and the transformer-enabled SpikeTA, respectively. These results highlight its potential as an efficient hardware platform for spiking transformers.
Chinese: FireFly-T是一种双引擎覆盖架构,通过高效处理脉冲注意力和利用细粒度稀疏性来增强脉冲变压器,相比现有加速器,实现了更高的能源和DSP效率。
English: FireFly-T is a dual-engine overlay architecture that enhances spiking transformers by efficiently handling spiking attention and exploiting fine-grained sparsity, achieving superior energy and DSP efficiency compared to existing accelerators.
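
The abstract mentions the AND-PopCount operation at the core of spiking attention. The software sketch below shows why it works: for binary spike vectors, the dot product equals the number of set bits in their bitwise AND. It is a plain Python illustration with hypothetical helper names and example vectors, and says nothing about the paper's LUT6-based hardware implementation.

```python
def pack_spikes(bits):
    """Pack a list of 0/1 spikes into an integer, with bit i holding bits[i]."""
    word = 0
    for i, bit in enumerate(bits):
        if bit:
            word |= 1 << i
    return word


def and_popcount(q_word, k_word):
    """Dot product of two binary spike vectors: count the set bits of q AND k."""
    return bin(q_word & k_word).count("1")


def spiking_attention_scores(queries, keys):
    """Unscaled attention score matrix for binary queries and keys."""
    q_words = [pack_spikes(q) for q in queries]
    k_words = [pack_spikes(k) for k in keys]
    return [[and_popcount(q, k) for k in k_words] for q in q_words]


# Hypothetical spike vectors (two queries and two keys, four features each).
queries = [[1, 0, 1, 1], [0, 1, 0, 0]]
keys = [[1, 1, 0, 1], [0, 1, 1, 0]]
scores = spiking_attention_scores(queries, keys)
print(scores)
```

Because the inputs are binary, no multipliers are needed, which is what makes this operation attractive for LUT-based logic.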

Authors:An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu
Title: Qwen3 Technical Report
Abstract:
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Experts (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models, such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B), and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while keeping their performance highly competitive. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.
中文: Qwen3推出了融合思考与非思考模式的统一框架,支持动态推理与快速响应,模型参数规模从0.6B至235B,覆盖119种语言,在多领域基准测试中达到顶尖水平并全面开源。
English: Qwen3 introduces a unified framework integrating thinking and non-thinking modes for dynamic reasoning and rapid responses, featuring scalable models from 0.6B to 235B parameters and enhanced multilingual support across 119 languages while achieving state-of-the-art performance in benchmarks.

Authors:Yinglian Zhu, Haiyang Yu, Qizao Wang, Wei Lu, Xiangyang Xue, Bin Li
Title: Zero-Shot Chinese Character Recognition with Hierarchical Multi-Granularity Image-Text Aligning
Abstract:
Chinese Character Recognition (CCR) is a fundamental technology for intelligent document processing. Unlike Latin characters, Chinese characters exhibit unique spatial structures and compositional rules, allowing for the use of fine-grained semantic information in representation. However, existing approaches usually rely on auto-regressive decoding with edit-distance post-processing and typically use a single-level character representation. In this paper, we propose a Hierarchical Multi-Granularity Image-Text Aligning (Hi-GITA) framework based on a contrastive paradigm. To leverage the abundant fine-grained semantic information of Chinese characters, we propose multi-granularity encoders on both image and text sides. Specifically, the Image Multi-Granularity Encoder extracts hierarchical image representations from character images, capturing semantic cues from localized strokes to holistic structures. The Text Multi-Granularity Encoder extracts stroke and radical sequence representations at different levels of granularity. To better capture the relationships between strokes and radicals, we introduce Multi-Granularity Fusion Modules on the image and text sides, respectively. Furthermore, to effectively bridge the two modalities, we introduce a Fine-Grained Decoupled Image-Text Contrastive loss, which aligns image and text representations across multiple granularities. Extensive experiments demonstrate that our proposed Hi-GITA significantly outperforms existing zero-shot CCR methods. For instance, it brings about a 20% accuracy improvement in handwritten character and radical zero-shot settings. Code and models will be released soon.
中文摘要:本文提出Hi-GITA分层多粒度图文对齐框架,通过对比学习在笔画、部首和字符多层级对齐图文表征,在零样本汉字识别任务中显著优于现有方法,准确率最高提升20%。
English Summary: The paper introduces Hi-GITA, a hierarchical multi-granularity framework that leverages contrastive learning to align image and text representations across stroke, radical, and character levels, significantly outperforming existing zero-shot Chinese character recognition methods with up to 20% accuracy improvements.
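The contrastive image-text paradigm mentioned above can be illustrated with a generic symmetric InfoNCE loss. This is a minimal sketch under stated assumptions and is not Hi-GITA's Fine-Grained Decoupled Image-Text Contrastive loss, whose exact form the abstract does not give. The function names, the toy 2-D embeddings, and the temperature of 0.07 are all hypothetical choices for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the k-th image and the k-th text form the positive pair."""
    n = len(image_embs)
    sim = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    # Image-to-text direction: each row of sim is one image against all texts.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image direction: each column of sim is one text against all images.
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy embeddings: aligned pairs give a far lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

In a zero-shot setting, a character image would then be assigned to the candidate whose text embedding has the highest similarity to the image embedding.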

Authors:Baolin Zheng, Guanlin Chen, Hongqiong Zhong, Qingyang Teng, Yingshui Tan, Zhendong Liu, Weixun Wang, Jiaheng Liu, Jian Yang, Huiyun Jing, Jincheng Wei, Wenbo Su, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang
Title: USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models
Abstract:
Despite their remarkable achievements and widespread adoption, Multimodal Large Language Models (MLLMs) have revealed significant security vulnerabilities, highlighting the urgent need for robust safety evaluation benchmarks. Existing MLLM safety benchmarks, however, fall short in data quality, coverage, and modal risk combinations, resulting in inflated and contradictory evaluation results, which hinders the discovery and governance of security concerns. Besides, we argue that vulnerabilities to harmful queries and oversensitivity to harmless ones should be considered simultaneously in MLLM safety evaluation, whereas these were previously considered separately. In this paper, to address these shortcomings, we introduce Unified Safety Benchmarks (USB), which is one of the most comprehensive evaluation benchmarks in MLLM safety. Our benchmark features high-quality queries, extensive risk categories, and comprehensive modal combinations, and encompasses both vulnerability and oversensitivity evaluations. From the perspective of two key dimensions, risk categories and modality combinations, we demonstrate that the available benchmarks, even the union of the vast majority of them, are far from being truly comprehensive. To bridge this gap, we design a sophisticated data synthesis pipeline that generates extensive, high-quality complementary data addressing previously unexplored aspects. By combining open-source datasets with our synthetic data, our benchmark provides 4 distinct modality combinations for each of the 61 risk sub-categories, covering both English and Chinese across both vulnerability and oversensitivity dimensions.
中文: 多模态大语言模型存在显著安全漏洞,为此提出统一安全基准(USB),通过高质量查询和全面模态组合,综合评估脆弱性和过度敏感性,覆盖多种风险类别。
English: Multimodal Large Language Models face significant security vulnerabilities, prompting the introduction of the Unified Safety Benchmarks (USB) to comprehensively evaluate risks across diverse categories and modalities while addressing both vulnerability and oversensitivity.

Authors:Shaoqing Zhang, Kehai Chen, Zhuosheng Zhang, Rumei Li, Rongxiang Weng, Yang Xiang, Min Zhang
Title: XBOUND: Exploring Capability Boundaries of Device-Control Agents at the State Level
Abstract:
Recent advancements in vision-language models have increased interest in Device-Control Agents (DC agents) for managing graphical user interfaces (GUIs). With the growing complexity and integration of such agents into various applications, effective evaluation methods have become crucial. The current evaluation method for DC agents primarily focuses on the instruction level, providing the current state (e.g., screenshots) and past execution history to determine actions for target instructions, helping identify potential execution failures. However, in GUI environments, a single state may contain multiple interactive widgets, each linked to different instructions, presenting an opportunity for diverse actions based on various instruction targets. Evaluating the agent's performance solely at the instruction level may overlook the broader context of these interactions. To capture a more comprehensive view of agent performance, we propose a new evaluation method, XBOUND, to evaluate the accuracy of instruction completion on a per-state basis. XBOUND provides a state-level evaluation framework, serving as a tool to assess agents' capabilities within environmental states. Our evaluation yields several key insights: UI-TARS stands out as the strongest 7B model, current agents display a bimodal performance pattern in instruction unification, and sub-7B models remain limited in state mastery. We further identify GPT-based planning as a critical bottleneck, and show that grounding data mainly benefits action matching, while trajectory data is more effective for instruction unification.
Chinese: 随着视觉语言模型的进步,设备控制代理在图形用户界面管理中的应用日益增多,为此我们提出了XBOUND这一状态级评估方法,揭示了代理性能的关键发现,如UI-TARS是最强的7B模型,而次7B模型在状态掌握上仍受限。
English: Recent advancements in vision-language models have spurred interest in Device-Control Agents for GUI management, leading to the development of XBOUND, a state-level evaluation method that reveals key insights about agent performance, including UI-TARS as the top 7B model and the limitations of sub-7B models.

Authors:Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang
Title: Evaluating and Steering Modality Preferences in Multimodal Large Language Model
Abstract:
Multimodal large language models (MLLMs) have achieved remarkable performance on complex tasks with multimodal context. However, it is still understudied whether they exhibit modality preference when processing multimodal contexts. To study this question, we first build an MC² benchmark under controlled evidence conflict scenarios to systematically evaluate modality preference, which is the tendency to favor one modality over another when making decisions based on multimodal conflicting evidence. Our extensive evaluation reveals that all 18 tested MLLMs generally demonstrate clear modality bias, and modality preference can be influenced by external interventions. An in-depth analysis reveals that the preference direction can be captured within the latent representations of MLLMs. Built on this, we propose a probing and steering method based on representation engineering to explicitly control modality preference without additional fine-tuning or carefully crafted prompts. Our method effectively amplifies modality preference toward a desired direction and applies to downstream tasks such as hallucination mitigation and multimodal machine translation, yielding promising improvements.
中文: 多模态大语言模型普遍存在明显的模态偏好,通过表征工程技术可系统评估并显式调控该偏好,从而有效应用于幻觉消减和机器翻译等下游任务的性能提升。
English: Multimodal large language models generally exhibit clear modality bias that can be systematically evaluated and explicitly controlled through representation engineering, enabling downstream applications like hallucination mitigation and improved machine translation.

Authors:Baihui Zheng, Boren Zheng, Kerui Cao, Yingshui Tan, Zhendong Liu, Weixun Wang, Jiaheng Liu, Jian Yang, Wenbo Su, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang
Title: Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models
Abstract:
Despite the remarkable proficiency of Large Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in safety-critical scenarios remains uncertain. Existing evaluations primarily assess response-level safety, neglecting a critical issue we identify as Superficial Safety Alignment (SSA), a phenomenon where models produce superficially safe outputs while internal reasoning processes fail to genuinely detect and mitigate underlying risks, resulting in inconsistent safety behaviors across multiple sampling attempts. To systematically investigate SSA, we introduce Beyond Safe Answers (BSA) bench, a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types and spanning nine risk categories, each meticulously annotated with risk rationales. Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales. We further explore the efficacy of safety rules, specialized fine-tuning on safety reasoning data, and diverse decoding strategies in mitigating SSA. Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.
中文摘要:大型推理模型存在“表面安全对齐”现象,即模型输出看似安全但内部推理未能真正识别风险,新提出的BSA基准显示顶尖模型在风险归因识别上仅达38%准确率,揭示了安全推理一致性的重大挑战。
English Summary: Large Reasoning Models often exhibit Superficial Safety Alignment, producing seemingly safe responses while failing to genuinely identify risks internally, as demonstrated by the newly introduced BSA benchmark showing even top models struggle with only 38% accuracy in risk rationale identification.

Authors:Ziqing Xing, Zhaoyang Zhang, Zirui Chen, Hongning Ruan, Zhaohui Yang
Title: Multi-View Wireless Sensing via Conditional Generative Learning: Framework and Model Design
Abstract:
In this paper, we incorporate physical knowledge into learning-based high-precision target sensing using the multi-view channel state information (CSI) between multiple base stations (BSs) and user equipment (UEs). This kind of multi-view sensing problem can be naturally cast into a conditional generation framework. To this end, we design a bipartite neural network architecture: the first part uses an elaborately designed encoder to fuse the latent target features embedded in the multi-view CSI, and the second part uses the fused features as conditioning inputs to a powerful generative model that guides the target's reconstruction. Specifically, the encoder is designed to capture the physical correlation between the CSI and the target, and to adapt to the numbers and positions of BS-UE pairs. The view-specific nature of CSI is assimilated by introducing a spatial positional embedding scheme, which exploits the structure of electromagnetic (EM) wave propagation channels. Finally, a conditional diffusion model with a weighted loss is employed to generate the target's point cloud from the fused features. Extensive numerical results demonstrate that the proposed generative multi-view (Gen-MV) sensing framework exhibits excellent flexibility and significant performance improvement in the reconstruction quality of the target's shape and EM properties.
中文摘要:本文提出了一种生成式多视角感知框架,通过将电磁波传播的物理知识融入神经网络架构,结合多视角信道状态信息融合与条件扩散模型,实现了目标形状和电磁特性的高质量重建,并展现出卓越的性能提升。
English Summary: This paper introduces a generative multi-view sensing framework that integrates physical knowledge of electromagnetic wave propagation into a neural network architecture, combining multi-view channel state information fusion with conditional diffusion modeling to achieve high-precision target reconstruction with significant performance improvements.

Authors:Xiaoqi Li, Lingyun Xu, Mingxu Zhang, Jiaming Liu, Yan Shen, Iaroslav Ponomarenko, Jiahui Xu, Liang Heng, Siyuan Huang, Shanghang Zhang, Hao Dong
Title: CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
Abstract:
In robotics, task goals can be conveyed through various modalities, such as language, goal images, and goal videos. However, natural language can be ambiguous, while images or videos may offer overly detailed specifications. To tackle these challenges, we introduce CrayonRobo, which leverages comprehensive multi-modal prompts that explicitly convey both low-level actions and high-level planning in a simple manner. Specifically, for each key-frame in the task sequence, our method allows for manual or automatic generation of simple and expressive 2D visual prompts overlaid on RGB images. These prompts represent the required task goals, such as the end-effector pose and the desired movement direction after contact. We develop a training strategy that enables the model to interpret these visual-language prompts and predict the corresponding contact poses and movement directions in SE(3) space. Furthermore, by sequentially executing all key-frame steps, the model can complete long-horizon tasks. This approach not only helps the model explicitly understand the task objectives but also enhances its robustness on unseen tasks by providing easily interpretable prompts. We evaluate our method in both simulated and real-world environments, demonstrating its robust manipulation capabilities.
中文:CrayonRobo通过叠加在图像上的简易二维视觉提示,明确传达底层动作与高层规划,利用序列化关键帧执行实现复杂任务的鲁棒机器人操作。
English: CrayonRobo introduces a multi-modal system using simple 2D visual prompts overlaid on images to clearly specify both low-level actions and high-level planning, enabling robust robotic manipulation in complex tasks through sequential key-frame execution.

Authors:Fenghao Zhu, Xinquan Wang, Chen Zhu, Tierui Gong, Zhaohui Yang, Chongwen Huang, Xiaoming Chen, Zhaoyang Zhang, Mérouane Debbah
Title: Robust Deep Learning-Based Physical Layer Communications: Strategies and Approaches
Abstract:
Deep learning (DL) has emerged as a transformative technology with immense potential to reshape the sixth-generation (6G) wireless communication network. By utilizing advanced algorithms for feature extraction and pattern recognition, DL provides unprecedented capabilities in optimizing network efficiency and performance, particularly in physical layer communications. Although DL technologies present great potential, they also face significant robustness challenges, which are expected to intensify in the complex and demanding 6G environment. Specifically, current DL models typically exhibit substantial performance degradation in dynamic environments with time-varying channels, noise interference, and changing scenarios, which limits their effectiveness in diverse real-world applications. This paper provides a comprehensive overview of strategies and approaches for robust DL-based methods in physical layer communications. First, we introduce the key challenges that current DL models face. Then, we examine in detail DL approaches specifically tailored to enhance robustness in 6G, which are classified into data-driven and model-driven strategies. Finally, we verify the effectiveness of these methods through case studies and outline future research directions.
中文摘要:深度学习虽能优化6G无线网络性能,却在动态环境中面临鲁棒性挑战,本文系统综述了数据驱动与模型驱动两类增强通信可靠性的方法。
English Summary: Deep learning offers transformative potential for optimizing 6G wireless networks but faces robustness challenges in dynamic environments, prompting this comprehensive review of data-driven and model-driven strategies to enhance reliability.

Authors:Zeyu Liu, Yuhang Liu, Guanghao Zhu, Congkai Xie, Zhen Li, Jianbo Yuan, Xinyao Wang, Qing Li, Shing-Chi Cheung, Shengyu Zhang, Fei Wu, Hongxia Yang
Title: Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models
Abstract:
Recent advancements in large language models (LLMs) have demonstrated substantial progress in reasoning capabilities, such as DeepSeek-R1, which leverages rule-based reinforcement learning to enhance logical reasoning significantly. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework Infi-MMR to systematically unlock the reasoning potential of MSLMs through a curriculum of three carefully structured phases and propose our multimodal reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model's logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini). Resources are available at https://huggingface.co/Reallm-Labs/Infi-MMR-3B.
Chinese: 尽管大语言模型在推理能力上取得进展,但将其扩展至多模态小语言模型面临数据集稀缺和推理能力退化等挑战,为此提出的Infi-MMR框架通过三阶段课程设计释放模型推理潜力,在多模态和通用推理任务中达到领先水平。
English: Recent advancements in large language models have improved reasoning, but extending these to multimodal small language models (MSLMs) faces challenges like dataset scarcity and capability degradation, which the novel Infi-MMR framework addresses through a three-phase curriculum to unlock MSLMs' reasoning potential, achieving state-of-the-art results in multimodal and general reasoning tasks.

Authors:Shimao Zhang, Zhejian Lai, Xiang Liu, Shuaijie She, Xiao Liu, Yeyun Gong, Shujian Huang, Jiajun Chen
Title: How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective
Abstract:
Multilingual Alignment is an effective and representative paradigm to enhance LLMs' multilingual capabilities, which transfers capabilities from high-resource languages to low-resource languages. Meanwhile, studies of language-specific neurons reveal that some neurons are selectively activated in LLMs when processing different languages. This provides a new perspective to analyze and understand LLMs' mechanisms more specifically in multilingual scenarios. In this work, we propose a new finer-grained neuron identification algorithm, which detects language neurons (including language-specific neurons and language-related neurons) and language-agnostic neurons. Furthermore, based on the distributional characteristics of different types of neurons, we divide the LLMs' internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of "Spontaneous Multilingual Alignment". Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights for better understanding multilingual alignment and multilingual capabilities of LLMs.
中文摘要:本研究提出了一种细粒度神经元识别算法,通过分类语言特定、语言相关和语言无关的神经元,系统分析了它们在多语言处理和对齐中的作用机制与分布特征。
English Summary: This study introduces a fine-grained neuron identification algorithm to classify language-specific, language-related, and language-agnostic neurons in LLMs, revealing their roles in multilingual processing and alignment through empirical analysis.

Authors:Xudong Tan, Yaoxin Yang, Peng Ye, Jialin Zheng, Bizhe Bai, Xinyi Wang, Jia Hao, Tao Chen
Title: Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models
Abstract:
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. However, their high inference cost, stemming from large-scale token computation and autoregressive decoding, poses significant challenges for real-time deployment and edge applications. While prior work has primarily focused on architectural optimization, we take a different perspective by identifying a dual form of redundancy in VLA models: (i) high similarity across consecutive action steps, and (ii) substantial redundancy in visual tokens. Motivated by these observations, we propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models. FlashVLA improves inference efficiency through a token-aware action reuse mechanism that avoids redundant decoding across stable action steps, and an information-guided visual token selection strategy that prunes low-contribution tokens. Extensive experiments on the LIBERO benchmark show that FlashVLA reduces FLOPs by 55.7% and latency by 36.0%, with only a 0.7% drop in task success rate. These results demonstrate the effectiveness of FlashVLA in enabling lightweight, low-latency VLA inference without retraining.
Chinese: FlashVLA是一种无需训练的即插即用加速框架,通过动作重用和视觉令牌剪枝提升VLA模型效率,在计算开销和延迟大幅降低的同时保持任务成功率基本不变。
English: FlashVLA is a training-free acceleration framework that enhances VLA model efficiency by reusing actions and pruning visual tokens, achieving significant reductions in computational cost and latency with minimal impact on performance.

Authors:Zheqi Lv, Junhao Chen, Qi Tian, Keting Yin, Shengyu Zhang, Fei Wu
Title: Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion
Abstract:
Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies, making them ineffective in providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. PPAD performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. The framework supports both inference-only and training-enhanced settings, and performs semantic correction at only a few diffusion steps, offering strong generality and scalability. Extensive experiments demonstrate PPAD's significant improvements.
中文:提出的MLLM语义校正乒乓前瞻扩散(PPAD)框架在扩散推理过程中引入实时语义观察,利用多模态大语言模型识别语义不一致性并主动引导去噪步骤,从而显著提升文本-图像对齐效果和生成质量。
English: The proposed MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) framework introduces real-time semantic observation during diffusion inference, using Multimodal Large Language Models to identify inconsistencies and actively guide denoising steps for improved prompt-image alignment and generation quality.

Authors:Jianghang Lin, Yue Hu, Jiangtao Shen, Yunhang Shen, Liujuan Cao, Shengchuan Zhang, Rongrong Ji
Title: What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation
Abstract:
Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system's process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and target concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic guidance for region segmentation. (2) A Concept-Aware Visual Enhancer Module that fuses textual concept features with global visual representations, enabling adaptive visual perception based on target concepts. (3) A Cognition-Inspired Decoder that integrates local instance features with G-VLM-provided semantic cues, allowing selective classification over a subset of relevant categories. Extensive experiments demonstrate that our framework achieves significant improvements, reaching 27.2 PQ, 17.0 mAP, and 35.3 mIoU on A-150. It further attains 56.2, 28.2, 15.4, 59.2, 18.7, and 95.8 mIoU on Cityscapes, Mapillary Vistas, A-847, PC-59, PC-459, and PAS-20, respectively. In addition, our framework supports vocabulary-free segmentation, offering enhanced flexibility in recognizing unseen categories. Code will be made public.
中文摘要:本文提出一种认知启发的开放词汇图像分割框架,通过模拟人类先理解语义概念再感知空间范围的视觉识别过程,在多个基准测试中实现了最先进的性能。
English Summary: This paper introduces a Cognition-Inspired Framework for open vocabulary image segmentation that mimics human visual recognition by first generating semantic concepts before segmenting regions, achieving state-of-the-art performance across multiple benchmarks.

Authors:Yi Wang, Junxiao Liu, Shimao Zhang, Jiajun Chen, Shujian Huang
Title: PATS: Process-Level Adaptive Thinking Mode Switching
Abstract:
Current large language models (LLMs) typically adopt a fixed reasoning strategy, either simple or complex, for all questions, regardless of their difficulty. This neglect of variation in task and reasoning process complexity leads to an imbalance between performance and efficiency. Existing methods attempt to implement training-free fast-slow thinking system switching to handle problems of varying difficulty, but are limited by coarse-grained solution-level strategy adjustments. To address this issue, we propose a novel reasoning paradigm: Process-Level Adaptive Thinking Mode Switching (PATS), which enables LLMs to dynamically adjust their reasoning strategy based on the difficulty of each step, optimizing the balance between accuracy and computational efficiency. Our approach integrates Process Reward Models (PRMs) with Beam Search, incorporating progressive mode switching and bad-step penalty mechanisms. Experiments on diverse mathematical benchmarks demonstrate that our methodology achieves high accuracy while maintaining moderate token usage. This study emphasizes the significance of process-level, difficulty-aware reasoning strategy adaptation, offering valuable insights into efficient inference for LLMs.
中文摘要:本文提出PATS这一新型推理范式,使大语言模型能够根据每个步骤的难度动态调整推理策略,通过集成奖励模型与搜索机制,在保证高精度的同时维持适中的计算开销。
English Summary: This paper introduces PATS, a novel reasoning paradigm that enables large language models to dynamically adjust their reasoning strategy at each process step based on difficulty, balancing accuracy with computational efficiency through integrated reward models and search mechanisms.
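Illustrative sketch: the step-level mode switching described in the abstract can be pictured with a small Python example. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that thinking modes differ in how many candidate steps are sampled per beam-search expansion, that a process reward model (PRM) score in [0, 1] is available for each step, and that switching moves one level at a time. The mode names, widths, thresholds, and the functions `next_mode` and `plan_modes` are all hypothetical, and the bad-step penalty mechanism is omitted.

```python
# Hypothetical mode table: each thinking mode samples a different number of
# candidate steps per beam-search expansion (the widths are assumptions).
MODE_WIDTH = {"simple": 2, "medium": 4, "complex": 8}
MODE_ORDER = ["simple", "medium", "complex"]


def next_mode(mode, best_prm_score, low=0.4, high=0.8):
    """Move one level toward 'complex' after a low-scoring step and one level
    toward 'simple' after a high-scoring step; otherwise keep the mode.
    The thresholds `low` and `high` are illustrative values."""
    i = MODE_ORDER.index(mode)
    if best_prm_score < low:
        i = min(i + 1, len(MODE_ORDER) - 1)
    elif best_prm_score > high:
        i = max(i - 1, 0)
    return MODE_ORDER[i]


def plan_modes(step_scores, start="complex"):
    """Return the mode used at each step, given the best PRM score observed
    at that step (scores here are made-up inputs, not model outputs)."""
    modes, mode = [], start
    for score in step_scores:
        modes.append(mode)
        mode = next_mode(mode, score)
    return modes


def total_candidates(modes):
    """Total number of sampled candidate steps, a rough proxy for token cost."""
    return sum(MODE_WIDTH[m] for m in modes)
```

With the made-up scores `[0.9, 0.9, 0.3, 0.6]`, the sketch starts in the complex mode, relaxes to the simple mode after two confident steps, and steps back up to the medium mode after the low-scoring third step, sampling fewer candidates overall than a fixed complex mode would.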

Authors:Renfei Dang, Shujian Huang, Jiajun Chen
Title: Internal Bias in Reasoning Models leads to Overthinking
Abstract:
While current reasoning models possess strong exploratory capabilities, they are often criticized for overthinking due to redundant and unnecessary reflections. In this work, we reveal for the first time that overthinking in reasoning models may stem from their internal bias towards input texts. Upon encountering a reasoning problem, the model immediately forms a preliminary guess about the answer, which we term an internal bias since it is not derived through actual reasoning. When this guess conflicts with its reasoning result, the model tends to engage in reflection, leading to the waste of computational resources. Through further interpretability experiments, we find that this behavior is largely driven by the model's excessive attention to the input section, which amplifies the influence of internal bias on its decision-making process. Additionally, by masking out the original input section, the effect of internal bias can be effectively alleviated and the reasoning length can be reduced by 31%-53% across different complex reasoning tasks. Notably, in most cases, this approach also leads to improvements in accuracy. These findings demonstrate a causal relationship between internal bias and overthinking.
Chinese: 推理模型常因输入问题引发的内部偏见而产生过度思考,当该偏见与后续推理冲突时导致冗余步骤,这一现象在多种任务和干预措施中得到验证。
English: Reasoning models often overthink due to an internal bias triggered by the input question, leading to redundant steps when this bias conflicts with subsequent reasoning, as validated across multiple tasks and interventions.

Authors:Renfei Dang, Zhening Li, Shujian Huang, Jiajun Chen
Title: The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models
Abstract:
Reasoning models often exhibit overthinking, characterized by redundant reasoning steps. We identify internal bias elicited by the input question as a key trigger of such behavior. Upon encountering a problem, the model immediately forms a preliminary guess about the answer, which we term an internal bias since it may not be explicitly generated, and it arises without systematic reasoning. When this guess conflicts with its subsequent reasoning, the model tends to engage in excessive reflection, resulting in wasted computation. We validate the association between internal bias and overthinking across multiple models and diverse reasoning tasks. To demonstrate the causal relationship more rigorously, we conduct two counterfactual interventions, showing that removing the input question after the model reduces the redundant reasoning across various complex reasoning tasks, and manually injecting bias affects overthinking accordingly. Further interpretability experiments suggest that excessive attention to the input question serves as a key mechanism through which internal bias influences subsequent reasoning trajectories. Finally, we evaluated several methods aimed at mitigating overthinking, yet the influence of internal bias persisted under all conditions.
Chinese: 推理模型常因输入问题引发的内部偏见而产生过度思考,当该偏见与后续推理冲突时导致冗余步骤,这一现象在多种任务和干预措施中得到验证。
English: Reasoning models often overthink due to an internal bias triggered by the input question, leading to redundant steps when this bias conflicts with subsequent reasoning, as validated across multiple tasks and interventions.

Authors:Shenghe Zheng, Hongzhi Wang, Chenyu Huang, Xiaohui Wang, Tao Chen, Jiayuan Fan, Shuyue Hu, Peng Ye
Title: Decouple and Orthogonalize: A Data-Free Framework for LoRA Merging
Abstract:
With more open-source models available for diverse tasks, model merging has gained attention by combining models into one, reducing training, storage, and inference costs. Current research mainly focuses on model merging for full fine-tuning, overlooking the popular LoRA. However, our empirical analysis reveals that: a) existing merging methods designed for full fine-tuning perform poorly on LoRA; b) LoRA modules show much larger parameter magnitude variance than full fine-tuned weights; c) greater parameter magnitude variance correlates with worse merging performance. Considering that large magnitude variances cause deviations in the distribution of the merged parameters, resulting in information loss and performance degradation, we propose a Decoupled and Orthogonal merging approach (DO-Merging). By separating parameters into magnitude and direction components and merging them independently, we reduce the impact of magnitude differences on the directional alignment of the merged models, thereby preserving task information. Furthermore, we introduce a data-free, layer-wise gradient descent method with orthogonal constraints to mitigate interference during the merging of direction components. We provide theoretical guarantees for both the decoupling and orthogonal components. We validate through extensive experiments across vision, language, and multi-modal domains that our proposed DO-Merging can achieve significantly higher performance than existing merging methods at a minimal cost. Notably, each component can be flexibly integrated with existing methods, offering near free-lunch improvements across tasks.
中文摘要:针对LoRA模块因参数幅度差异大导致模型合并困难的问题,我们提出的DO-Merging方法通过解耦参数并正交合并,有效保留任务信息,在多领域实现更优性能。
English Summary: Model merging for LoRA modules is challenging due to large parameter magnitude variances, but the proposed DO-Merging approach decouples and orthogonally merges parameters to preserve task information and achieve superior performance across domains.
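Illustrative sketch: the magnitude/direction decoupling described in the abstract can be pictured with a small Python example. This is a minimal sketch under stated assumptions, not the authors' implementation. It treats each whole weight matrix as one unit (the granularity is an assumption), uses plain averaging for both components, and leaves out the data-free gradient descent step with orthogonal constraints. The function names `decouple`, `merge_decoupled`, and `merge_naive` are hypothetical.

```python
import math


def decouple(matrix):
    """Split a weight matrix (list of rows) into its Frobenius norm and a
    unit-norm direction matrix."""
    magnitude = math.sqrt(sum(v * v for row in matrix for v in row))
    if magnitude == 0.0:
        return 0.0, [row[:] for row in matrix]
    return magnitude, [[v / magnitude for v in row] for row in matrix]


def merge_naive(matrices):
    """Element-wise average of the raw matrices (baseline for comparison)."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(cols)]
            for i in range(rows)]


def merge_decoupled(matrices):
    """Average magnitudes and directions separately, then recombine, so a
    large-magnitude module cannot drown out the direction of a small one."""
    parts = [decouple(m) for m in matrices]
    mean_magnitude = sum(mag for mag, _ in parts) / len(parts)
    mean_direction = merge_naive([direction for _, direction in parts])
    return [[mean_magnitude * v for v in row] for row in mean_direction]


# Two toy "task updates" whose magnitudes differ by a factor of 100.
task_a = [[10.0, 0.0], [0.0, 0.0]]
task_b = [[0.0, 0.1], [0.0, 0.0]]
naive = merge_naive([task_a, task_b])
decoupled = merge_decoupled([task_a, task_b])
```

In the toy example the naive average is dominated by `task_a`, while the decoupled merge gives both task directions equal weight before the averaged magnitude is reapplied.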

Authors:Patrick Kahardipraja, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Title: The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation
Abstract:
Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards safer and more transparent language models.
中文: 本研究通过识别处理指令和检索上下文知识的专用注意力头,揭示了上下文检索增强问答机制,并提出追溯知识源的方法,以构建更安全透明的语言模型。
English: This study investigates the mechanism of in-context retrieval-augmented question answering by identifying specialized attention heads that process instructions and retrieve contextual knowledge, and proposes methods to trace knowledge sources for safer, more transparent language models.

Authors:Jikai Wang, Zhenxu Tian, Juntao Li, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang
Title: Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification
Abstract:
Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the sampled outputs of the target model. Existing methods mainly achieve draft-target alignment with training-based methods, e.g., EAGLE, Medusa, involving considerable training costs. In this paper, we present a training-free alignment-augmented speculative decoding algorithm. We propose alignment sampling, which leverages the output distribution obtained in the prefilling phase to provide more aligned draft candidates. To further benefit from high-quality but non-aligned draft candidates, we also introduce a simple yet effective flexible verification strategy. Through an adaptive probability threshold, our approach can improve generation accuracy while further improving inference efficiency. Experiments on 8 datasets (including question answering, summarization and code completion tasks) show that our approach increases the average generation score by 3.3 points for the LLaMA3 model. Our method achieves a mean acceptance length of up to 2.39 and speeds up generation by a factor of 2.23.
Chinese: 本文提出了一种无需训练的推测解码算法,通过对齐采样和灵活验证策略提升草稿与目标的匹配度,显著提高了大语言模型的生成速度与准确性。
English: This paper introduces a training-free speculative decoding algorithm that enhances draft-target alignment through alignment sampling and flexible verification, achieving significant improvements in generation speed and accuracy for large language models.

Authors:Zhongxiang Sun, Qipeng Wang, Haoyu Wang, Xiao Zhang, Jun Xu
Title: Detection and Mitigation of Hallucination in Large Reasoning Models: A Mechanistic Perspective
Abstract:
Large Reasoning Models (LRMs) have shown impressive capabilities in multi-step reasoning tasks. However, alongside these successes, a more deceptive form of model error has emerged--Reasoning Hallucination--where logically coherent but factually incorrect reasoning traces lead to persuasive yet faulty conclusions. Unlike traditional hallucinations, these errors are embedded within structured reasoning, making them more difficult to detect and potentially more harmful. In this work, we investigate reasoning hallucinations from a mechanistic perspective. We propose the Reasoning Score, which quantifies the depth of reasoning by measuring the divergence between logits obtained from projecting late layers of LRMs to the vocabulary space, effectively distinguishing shallow pattern-matching from genuine deep reasoning. Using this score, we conduct an in-depth analysis on the ReTruthQA dataset and identify two key reasoning hallucination patterns: early-stage fluctuation in reasoning depth and incorrect backtracking to flawed prior steps. These insights motivate our Reasoning Hallucination Detection (RHD) framework, which achieves state-of-the-art performance across multiple domains. To mitigate reasoning hallucinations, we further introduce GRPO-R, an enhanced reinforcement learning algorithm that incorporates step-level deep reasoning rewards via potential-based shaping. Our theoretical analysis establishes stronger generalization guarantees, and experiments demonstrate improved reasoning quality and reduced hallucination rates.
中文摘要:大型推理模型存在推理幻觉问题,即逻辑连贯但事实错误的推理导致错误结论,为此提出的检测框架和增强型强化学习方法显著提升了推理质量并降低了错误率。
English Summary: Large Reasoning Models exhibit reasoning hallucinations where logically coherent but factually flawed reasoning leads to incorrect conclusions, prompting the development of a detection framework and enhanced reinforcement learning method that significantly improves reasoning quality and reduces errors.

Authors:Yuyang Ding, Dan Qiao, Juntao Li, Jiajie Xu, Pingfu Chao, Xiaofang Zhou, Min Zhang
Title: Towards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations
Abstract:
Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation methods, enabling the automatic generation of training data by aligning text with external resources. Despite the many efforts in noise measurement methods, few works focus on the latent noise distribution between different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER from two aspects: (1) distant annotation techniques, which encompass both traditional rule-based methods and the innovative large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework addresses the challenges by distinctly categorizing them into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP), subsequently providing specialized solutions for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods.
Chinese: 本研究通过分析不同远程标注方法间的潜在噪声分布,提出一个分别处理未标注实体和噪声实体问题的新框架,在八个真实数据集上实现了最先进的性能,推动了远程监督命名实体识别的发展。
English: This study advances distantly supervised named entity recognition by analyzing latent noise distribution across annotation methods and introducing a novel framework that separately addresses unlabeled-entity and noisy-entity problems, achieving state-of-the-art performance on eight real-world datasets.

Authors:Jianghang Lin, Yilin Lu, Yunhang Shen, Chaoyang Zhu, Shengchuan Zhang, Liujuan Cao, Rongrong Ji
Title: Pseudo-Label Quality Decoupling and Correction for Semi-Supervised Instance Segmentation
Abstract:
Semi-Supervised Instance Segmentation (SSIS) involves classifying and grouping image pixels into distinct object instances using limited labeled data. This learning paradigm usually faces a significant challenge of unstable performance caused by noisy pseudo-labels of instance categories and pixel masks. We find that the prevalent practice of filtering instance pseudo-labels by assessing both class and mask quality with a single score threshold frequently leads to compromises in the trade-off between the qualities of class and mask labels. In this paper, we introduce a novel Pseudo-Label Quality Decoupling and Correction (PL-DC) framework for SSIS to tackle the above challenges. Firstly, at the instance level, a dual-threshold filtering mechanism is designed to decouple class and mask quality estimations for instance-level pseudo-labels, thereby independently controlling pixel classifying and grouping qualities. Secondly, at the category level, we introduce a dynamic instance category correction module to correct the pseudo-labels of instance categories, effectively alleviating category confusion. Lastly, at the pixel level, we introduce a mask uncertainty-aware mechanism to re-weight the mask loss for different pixels, thereby reducing the impact of noise introduced by pixel-level mask pseudo-labels. Extensive experiments on the COCO and Cityscapes datasets demonstrate that the proposed PL-DC achieves significant performance improvements, setting new state-of-the-art results for SSIS. Notably, our PL-DC shows substantial gains even with minimal labeled data, achieving an improvement of +11.6 mAP with just 1% COCO labeled data and +15.5 mAP with 5% Cityscapes labeled data. The code will be public.
中文:提出的PL-DC框架通过双阈值过滤解耦类别与掩码质量评估、动态修正实例类别以及基于像素不确定性的掩码损失重加权,有效解决了半监督实例分割中的关键难题,在COCO和Cityscapes数据集上实现了最先进的性能。
English: The proposed PL-DC framework addresses semi-supervised instance segmentation challenges by decoupling class and mask quality assessments with dual-threshold filtering, dynamically correcting instance categories, and re-weighting mask loss based on pixel uncertainty, achieving state-of-the-art performance on COCO and Cityscapes datasets.
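Illustrative sketch: the contrast between single-threshold and decoupled dual-threshold filtering can be pictured with a small Python example. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each pseudo-label carries a separate class score and mask score, that the single-threshold baseline combines them by multiplication, and that a pseudo-label may supervise the classification loss and the mask loss independently. The field names, threshold values, and the functions `filter_single` and `filter_decoupled` are hypothetical.

```python
def filter_single(pseudo_labels, threshold):
    """Baseline: keep an instance only if one combined score (here the
    product of class and mask scores, an assumption) passes the threshold."""
    return [p["id"] for p in pseudo_labels
            if p["class_score"] * p["mask_score"] >= threshold]


def filter_decoupled(pseudo_labels, class_threshold, mask_threshold):
    """Dual-threshold variant: decide separately which instances are trusted
    for class supervision and which are trusted for mask supervision."""
    class_ok = [p["id"] for p in pseudo_labels
                if p["class_score"] >= class_threshold]
    mask_ok = [p["id"] for p in pseudo_labels
               if p["mask_score"] >= mask_threshold]
    return class_ok, mask_ok


# Made-up pseudo-labels: id 0 has a reliable class but a poor mask,
# id 1 has a reliable mask but an uncertain class, id 2 is good on both.
pseudo_labels = [
    {"id": 0, "class_score": 0.95, "mask_score": 0.40},
    {"id": 1, "class_score": 0.50, "mask_score": 0.90},
    {"id": 2, "class_score": 0.90, "mask_score": 0.90},
]
kept_single = filter_single(pseudo_labels, threshold=0.7)
kept_class, kept_mask = filter_decoupled(pseudo_labels, 0.8, 0.8)
```

In this toy case the single threshold discards instances 0 and 1 entirely, while the decoupled filter still uses the reliable class label of instance 0 and the reliable mask of instance 1.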

Authors:Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
Title: Accurate KV Cache Quantization with Outlier Tokens Tracing
Abstract:
The impressive capabilities of Large Language Models (LLMs) come at the cost of substantial computational resources during deployment. While KV Cache can significantly reduce recomputation during inference, it also introduces additional memory overhead. KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy. Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token. Consequently, the common practice is to apply channel-wise quantization to the Keys and token-wise quantization to the Values. However, our further investigation reveals that a small subset of unusual tokens exhibit unique characteristics that deviate from this pattern, which can substantially impact quantization accuracy. To address this, we develop a simple yet effective method to identify these tokens accurately during the decoding process and exclude them from quantization as outlier tokens, significantly improving overall accuracy. Extensive experiments show that our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
中文: 大型语言模型在部署时面临计算负担,但我们的方法在KV缓存量化过程中识别并排除异常标记,显著提高了精度,同时降低了内存使用并提升了吞吐量。
English: Large language models face computational burdens during deployment, but our method identifies and excludes outlier tokens during KV Cache quantization, significantly enhancing accuracy while reducing memory usage and boosting throughput.
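Illustrative sketch: channel-wise Key quantization, token-wise Value quantization, and the exclusion of outlier tokens can be pictured with a small Python example. This is a minimal sketch under stated assumptions, not the authors' implementation. It uses simple asymmetric min-max quantization at 2 bits, represents the cache as a list of per-token rows, and takes the outlier token indices as a given input because the abstract does not say how they are detected. The function names are hypothetical.

```python
def quantize_group(values, bits=2):
    """Quantize and dequantize one group with asymmetric min-max scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_keys(keys, outliers=frozenset(), bits=2):
    """Channel-wise quantization: one group per column, computed over the
    non-outlier tokens only. Outlier rows stay in full precision."""
    out = [row[:] for row in keys]
    normal = [t for t in range(len(keys)) if t not in outliers]
    if not normal:
        return out
    for c in range(len(keys[0])):
        column = quantize_group([keys[t][c] for t in normal], bits)
        for t, v in zip(normal, column):
            out[t][c] = v
    return out


def quantize_values(values, outliers=frozenset(), bits=2):
    """Token-wise quantization: one group per row; outlier rows are kept."""
    return [row[:] if t in outliers else quantize_group(row, bits)
            for t, row in enumerate(values)]


def max_abs_error(original, approx):
    """Largest element-wise reconstruction error between two caches."""
    return max(abs(a - b) for ra, rb in zip(original, approx)
               for a, b in zip(ra, rb))


# Toy Key cache: four ordinary tokens and one token (index 4) whose values
# stretch the per-channel range.
keys = [[0.0, 1.0], [0.1, 1.1], [0.2, 1.2], [0.3, 1.3], [8.0, -6.0]]
plain = quantize_keys(keys)
traced = quantize_keys(keys, outliers={4})
```

In the toy cache the single unusual token widens every channel's range, so the four ordinary tokens collapse onto the same quantization level. Excluding it lets the 2-bit grid cover the ordinary tokens closely, at the cost of storing one row in full precision.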

Authors:Biao Yi, Xavier Hu, Yurun Chen, Shengyu Zhang, Hongxia Yang, Fan Wu, Fei Wu
Title: EcoAgent: An Efficient Edge-Cloud Collaborative Multi-Agent Framework for Mobile Automation
Abstract:
Cloud-based mobile agents powered by (multimodal) large language models ((M)LLMs) offer strong reasoning abilities but suffer from high latency and cost. While fine-tuned (M)SLMs enable edge deployment, they often lose general capabilities and struggle with complex tasks. To address this, we propose EcoAgent, an Edge-Cloud cOllaborative multi-agent framework for mobile automation. EcoAgent features a closed-loop collaboration among a cloud-based Planning Agent and two edge-based agents: the Execution Agent for action execution and the Observation Agent for verifying outcomes. The Observation Agent uses a Pre-Understanding Module to compress screen images into concise text, reducing token usage and communication overhead. In case of failure, the Planning Agent retrieves screen history through a Memory Module and replans via a Reflection Module. Experiments on AndroidWorld show that EcoAgent achieves task success rates comparable to cloud-based mobile agents while significantly reducing MLLM token consumption, enabling efficient and practical mobile automation.
中文: EcoAgent是一种边缘-云协作框架,通过规划、执行和观察智能体的闭环协作,在保持与云端方案相当任务成功率的同时,显著降低了多模态大模型的令牌消耗和通信开销。
English: EcoAgent is an edge-cloud collaborative framework that uses specialized agents for planning, execution, and observation to achieve task success rates comparable to cloud-based mobile agents while significantly reducing MLLM token usage and communication overhead.

Authors:Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Weiming Wu, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Shuicheng Yan, Hanwang Zhang
Title: On Path to Multimodal Generalist: General-Level and General-Bench
Abstract:
The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: https://generalist.top/
中文摘要:本文提出General-Level评估框架,通过定义多模态大语言模型的五级性能标准并构建包含700多项任务的General-Bench测试集,对100余个现有模型进行评估,揭示了当前模型在实现真正多模态通才和通用人工智能方面的能力差距。
English Summary: This paper introduces General-Level, a novel evaluation framework that defines five performance levels for Multimodal Large Language Models (MLLMs) and proposes General-Bench with over 700 tasks to assess their progress toward robust multimodal generalists and AGI, revealing current limitations through testing 100+ models.

Authors:Zhiding Liu, Mingyue Cheng, Guanhao Zhao, Jiqian Yang, Qi Liu, Enhong Chen
Title: Improving Time Series Forecasting via Instance-aware Post-hoc Revision
Abstract:
Time series forecasting plays a vital role in various real-world applications and has attracted significant attention in recent decades. While recent methods have achieved remarkable accuracy by incorporating advanced inductive biases and training strategies, we observe that instance-level variations remain a significant challenge. These variations--stemming from distribution shifts, missing data, and long-tail patterns--often lead to suboptimal forecasts for specific instances, even when overall performance appears strong. To address this issue, we propose a model-agnostic framework, PIR, designed to enhance forecasting performance through Post-forecasting Identification and Revision. Specifically, PIR first identifies biased forecasting instances by estimating their accuracy. Based on this, the framework revises the forecasts using contextual information, including covariates and historical time series, from both local and global perspectives in a post-processing fashion. Extensive experiments on real-world datasets with mainstream forecasting models demonstrate that PIR effectively mitigates instance-level errors and significantly improves forecasting reliability.
中文: PIR框架通过识别有偏差的预测并利用上下文信息进行修正,有效解决了时间序列预测中的实例级变化问题,显著提升了实际数据集上的预测可靠性。
English: The PIR framework addresses instance-level variations in time series forecasting by identifying biased forecasts and revising them using contextual information, significantly improving reliability across real-world datasets.

Authors:Davide Lobba, Fulvio Sanguigni, Bin Ren, Marcella Cornia, Rita Cucchiara, Nicu Sebe
Title: Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals
Abstract:
While virtual try-on (VTON) systems aim to render a garment onto a target person image, this paper tackles the novel task of virtual try-off (VTOFF), which addresses the inverse problem: generating standardized product images of garments from real-world photos of clothed individuals. Unlike VTON, which must resolve diverse pose and style variations, VTOFF benefits from a consistent and well-defined output format -- typically a flat, lay-down-style representation of the garment -- making it a promising tool for data generation and dataset enhancement. However, existing VTOFF approaches face two major limitations: (i) difficulty in disentangling garment features from occlusions and complex poses, often leading to visual artifacts, and (ii) restricted applicability to single-category garments (e.g., upper-body clothes only), limiting generalization. To address these challenges, we present Text-Enhanced MUlti-category Virtual Try-Off (TEMU-VTOFF), a novel architecture featuring a dual DiT-based backbone with a modified multimodal attention mechanism for robust garment feature extraction. Our architecture is designed to receive garment information from multiple modalities like images, text, and masks to work in a multi-category setting. Finally, we propose an additional alignment module to further refine the generated visual details. Experiments on VITON-HD and Dress Code datasets show that TEMU-VTOFF sets a new state-of-the-art on the VTOFF task, significantly improving both visual quality and fidelity to the target garments.
中文: 本文提出了TEMU-VTOFF系统,通过多模态输入和双DiT架构解决虚拟试衣的逆向任务——从着装人像生成标准化服装图像,在多类别设置下显著提升了生成质量和保真度,创造了该领域的最新性能标杆。
English: This paper introduces TEMU-VTOFF, a novel multi-category virtual try-off system that uses multimodal inputs and a dual DiT backbone to overcome pose and occlusion challenges, setting new state-of-the-art performance in generating standardized garment images from clothed person photos.

Authors:Chengyu Wang, Junbing Yan, Wenrui Cai, Yuanhao Yue, Jun Huang
Title: EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models
Abstract:
In this paper, we present EasyDistill, a comprehensive toolkit designed for effective black-box and white-box knowledge distillation (KD) of large language models (LLMs). Our framework offers versatile functionalities, including data synthesis, supervised fine-tuning, ranking optimization, and reinforcement learning techniques specifically tailored for KD scenarios. The toolkit accommodates KD functionalities for both System 1 (fast, intuitive) and System 2 (slow, analytical) models. With its modular design and user-friendly interface, EasyDistill empowers researchers and industry practitioners to seamlessly experiment with and implement state-of-the-art KD strategies for LLMs. In addition, EasyDistill provides a series of robust distilled models and KD-based industrial solutions developed by us, along with the corresponding open-sourced datasets, catering to a variety of use cases. Furthermore, we describe the seamless integration of EasyDistill into Alibaba Cloud's Platform for AI (PAI). Overall, the EasyDistill toolkit makes advanced KD techniques for LLMs more accessible and impactful within the NLP community.
中文: EasyDistill是一个全面的工具包,支持大语言模型的黑盒与白盒知识蒸馏,提供模块化功能和与阿里云的集成,旨在提升自然语言处理领域的可及性和实用性。
English: EasyDistill is a versatile toolkit that facilitates black-box and white-box knowledge distillation for large language models, offering modular features and integration with Alibaba Cloud to enhance accessibility and application in NLP.

Authors:Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Dinesh Manocha
Title: ChartLens: Fine-grained Visual Attribution in Charts
Abstract:
The growing capabilities of multimodal large language models (MLLMs) have advanced tasks like chart understanding. However, these models often suffer from hallucinations, where generated text sequences conflict with the provided visual data. To address this, we introduce Post-Hoc Visual Attribution for Charts, which identifies fine-grained chart elements that validate a given chart-associated response. We propose ChartLens, a novel chart attribution algorithm that uses segmentation-based techniques to identify chart objects and employs set-of-marks prompting with MLLMs for fine-grained visual attribution. Additionally, we present ChartVA-Eval, a benchmark with synthetic and real-world charts from diverse domains like finance, policy, and economics, featuring fine-grained attribution annotations. Our evaluations show that ChartLens improves fine-grained attributions by 26-66%.
中文: 本文提出ChartLens算法,通过基于分割的技术和集合标记提示方法,将图表理解的细粒度视觉归因性能提升26-66%,有效解决多模态模型中的幻觉问题。
English: This paper introduces ChartLens, a novel algorithm that enhances fine-grained visual attribution in chart understanding by 26-66% through segmentation-based techniques and set-of-marks prompting with MLLMs, addressing hallucinations in multimodal models.

Authors:Utkarsh Sahu, Zhisheng Qi, Yongjia Lei, Ryan A. Rossi, Franck Dernoncourt, Nesreen K. Ahmed, Mahantesh M Halappanavar, Yao Ma, Yu Wang
Title: A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models
Abstract:
Large language models have been extensively studied as neural knowledge bases for their knowledge access, editability, reasoning, and explainability. However, few works focus on the structural patterns of their knowledge. Motivated by this gap, we investigate these structural patterns from a graph perspective. We quantify the knowledge of LLMs at both the triplet and entity levels, and analyze how it relates to graph structural properties such as node degree. Furthermore, we uncover the knowledge homophily, where topologically close entities exhibit similar levels of knowledgeability, which further motivates us to develop graph machine learning models to estimate entity knowledge based on its local neighbors. This model further enables valuable knowledge checking by selecting triplets less known to LLMs. Empirical results show that using selected triplets for fine-tuning leads to superior performance.
中文摘要:本研究从图结构视角探究大语言模型的知识结构模式,揭示了知识同质性现象,并开发基于图的模型来评估实体知识及筛选三元组,从而提升微调性能。
English Summary: This study investigates the structural patterns of knowledge in large language models from a graph perspective, revealing knowledge homophily and developing a graph-based model to estimate entity knowledge and select triplets for improved fine-tuning performance.

Authors:Kun Xiang, Heng Li, Terry Jingchen Zhang, Yinya Huang, Zirong Liu, Peixin Qu, Jixi He, Jiaqi Chen, Yu-Jie Yuan, Jianhua Han, Hang Xu, Hanhui Li, Mrinmaya Sachan, Xiaodan Liang
Title: SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning
Abstract:
We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
中文: SeePhys是一个大规模多模态物理推理基准,其核心特征在于75%的问题必须依赖视觉信息求解,当前最先进的视觉推理模型在此基准上表现不足60%,暴露了它们在图表理解与物理推理结合方面的根本缺陷。
English: SeePhys is a large-scale multimodal benchmark for evaluating LLM reasoning in physics, featuring predominantly vision-essential problems that challenge current advanced models, revealing significant difficulties in integrating visual understanding with physics reasoning and overcoming reliance on text.

Authors:Valérie Hayot-Sasson, Abby Stevens, Nicholson Collier, Sudershan Sridhar, Kyle Conroy, J. Gregory Pauloski, Yadu Babuji, Maxime Gonthier, Nathaniel Hudson, Dante D. Sanchez-Gallegos, Ian Foster, Jonathan Ozik, Kyle Chard
Title: AERO: An autonomous platform for continuous research
Abstract:
The COVID-19 pandemic highlighted the need for new data infrastructure, as epidemiologists and public health workers raced to harness rapidly evolving data, analytics, and infrastructure in support of cross-sector investigations. To meet this need, we developed AERO, an automated research and data sharing platform for continuous, distributed, and multi-disciplinary collaboration. In this paper, we describe the AERO design and how it supports the automatic ingestion, validation, and transformation of monitored data into a form suitable for analysis; the automated execution of analyses on this data; and the sharing of data among different entities. We also describe how our AERO implementation leverages capabilities provided by the Globus platform and GitHub for automation, distributed execution, data sharing, and authentication. We present results obtained with an instance of AERO running two public health surveillance applications and demonstrate benchmarking results with a synthetic application, all of which are publicly available for testing.
中文: AERO平台利用Globus和GitHub实现自动化数据处理与分布式协作,通过自动采集、验证和分析数据支持跨学科公共卫生监测应用。
English: The AERO platform was developed to automate data ingestion, validation, analysis, and sharing for multi-disciplinary pandemic response, leveraging Globus and GitHub for distributed execution and collaboration.

Authors:Junfeng Wu, Dongliang Luo, Weizhi Zhao, Zhihao Xie, Yuanhao Wang, Junyi Li, Xudong Xie, Yuliang Liu, Xiang Bai
Title: TokBench: Evaluating Your Visual Tokenizer before Visual Generation
Abstract:
In this work, we reveal the limitations of visual tokenizers and VAEs in preserving fine-grained features, and propose a benchmark to evaluate reconstruction performance for two challenging visual contents: text and face. Visual tokenizers and VAEs have significantly advanced visual generation and multimodal modeling by providing more efficient compressed or quantized image representations. However, while helping production models reduce computational burdens, the information loss from image compression fundamentally limits the upper bound of visual generation quality. To evaluate this upper bound, we focus on assessing reconstructed text and facial features since they typically: 1) exist at smaller scales, 2) contain dense and rich textures, 3) are prone to collapse, and 4) are highly sensitive to human vision. We first collect and curate a diverse set of clear text and face images from existing datasets. Unlike approaches using VLM models, we employ established OCR and face recognition models for evaluation, ensuring accuracy while maintaining an exceptionally lightweight assessment process requiring just 2GB memory and 4 minutes to complete. Using our benchmark, we analyze text and face reconstruction quality across various scales for different image tokenizers and VAEs. Our results show modern visual tokenizers still struggle to preserve fine-grained features, especially at smaller scales. We further extend this evaluation framework to video, conducting comprehensive analysis of video tokenizers. Additionally, we demonstrate that traditional metrics fail to accurately reflect reconstruction performance for faces and text, while our proposed metrics serve as an effective complement.
中文: 本研究揭示了视觉分词器和VAE在保留细粒度特征方面的局限性,提出了一个轻量级基准来评估文本和人脸在不同尺度下的重建质量,并发现它们在处理小尺度特征时仍存在显著不足。
English: This study exposes the limitations of visual tokenizers and VAEs in retaining fine-grained details, proposing a lightweight benchmark to evaluate text and face reconstruction quality across various scales and revealing their persistent challenges with small-scale features.

Authors:Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, Jieping Ye
Title: CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Abstract:
In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.
中文: CosyVoice 3 是一款改进的多语言语音合成模型,通过引入新型语音分词器、可微分奖励模型以及大幅扩展的训练数据和参数,在内容一致性、说话人相似度和韵律自然度上超越了前代模型。
English: CosyVoice 3 is an enhanced multilingual speech synthesis model that surpasses its predecessor by incorporating a novel speech tokenizer, a differentiable reward model, and significantly expanded training data and parameters, achieving superior performance in content consistency, speaker similarity, and prosody naturalness.

Authors:Li Li, Peilin Cai, Ryan A. Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, Yuehan Qin, Nesreen K. Ahmed, Samyadeep Basu, Subhojyoti Mukherjee, Ruiyi Zhang, Zhengmian Hu, Bo Ni, Yuxiao Zhou, Zichao Wang, Yue Huang, Yu Wang, Xiangliang Zhang, Philip S. Yu, Xiyang Hu, Yue Zhao
Title: A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations
Abstract:
We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.
Chinese: PersonaConvBench是一个大规模基准,旨在评估大型语言模型在多轮对话中的个性化推理和生成能力,通过整合个性化和对话结构,在多样化领域中分析个性化上下文如何影响模型输出,并显著提升性能。
English: PersonaConvBench is a large-scale benchmark designed to evaluate personalized reasoning and generation in multi-turn conversations with LLMs. It integrates personalization and conversational structure across three core tasks in diverse domains to analyze how personalized context shapes model outputs, and finds that incorporating personalized history improves performance substantially.

Authors:Yafeng Chen, Chong Deng, Hui Wang, Yiheng Jiang, Han Yin, Qian Chen, Wen Wang
Title: Pushing the Frontiers of Self-Distillation Prototypes Network with Dimension Regularization and Score Normalization
Abstract:
Developing robust speaker verification (SV) systems without speaker labels has been a longstanding challenge. Earlier research has highlighted a considerable performance gap between self-supervised and fully supervised approaches. In this paper, we enhance the non-contrastive self-supervised framework, Self-Distillation Prototypes Network (SDPN), by introducing dimension regularization that explicitly addresses the collapse problem through the application of regularization terms to speaker embeddings. Moreover, we integrate score normalization techniques from fully supervised SV to further bridge the gap toward supervised verification performance. SDPN with dimension regularization and score normalization sets a new state-of-the-art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rate 1.29%, 1.60%, and 2.80% for trial VoxCeleb1-{O,E,H} respectively. These results demonstrate relative improvements of 28.3%, 19.6%, and 22.6% over the current best self-supervised methods, thereby advancing the frontiers of SV technology.
中文: 本文通过引入维度正则化解决模型塌陷问题并结合得分归一化技术,改进了自监督说话人验证框架SDPN,在VoxCeleb1基准测试中取得最优性能,相较现有方法实现显著提升。
English: This paper enhances the self-supervised speaker verification framework SDPN by introducing dimension regularization to prevent collapse and integrating score normalization, achieving state-of-the-art performance on VoxCeleb1 benchmarks with significant improvements over existing methods.
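
For readers unfamiliar with the reported metric, the following is a minimal sketch of how an Equal Error Rate can be estimated from trial scores. It is a generic threshold sweep, not the evaluation code used in the paper; the function name and the toy scores are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping every observed score as a decision threshold."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False acceptance rate: impostor trials scoring at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        # False rejection rate: genuine trials scoring below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # Keep the operating point where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Hypothetical similarity scores for same-speaker and different-speaker trials.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
```

On this toy data the two error rates meet at a threshold of 0.6, where one of four impostor trials is accepted and one of four genuine trials is rejected.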

Authors:Mansi Sakarvadia, Nathaniel Hudson, Tian Li, Ian Foster, Kyle Chard
Title: Topology-Aware Knowledge Propagation in Decentralized Learning
Abstract:
Decentralized learning enables collaborative training of models across naturally distributed data without centralized coordination or maintenance of a global model. Instead, devices are organized in arbitrary communication topologies, in which they can only communicate with neighboring devices. Each device maintains its own local model by training on its local data and integrating new knowledge via model aggregation with neighbors. Therefore, knowledge is propagated across the topology via successive aggregation rounds. We study, in particular, the propagation of out-of-distribution (OOD) knowledge. We find that popular decentralized learning algorithms struggle to propagate OOD knowledge effectively to all devices. Further, we find that both the location of OOD data within a topology, and the topology itself, significantly impact OOD knowledge propagation. We then propose topology-aware aggregation strategies to accelerate (OOD) knowledge propagation across devices. These strategies improve OOD data accuracy, compared to topology-unaware baselines, by 123% on average across models in a topology.
中文摘要:去中心化学习使设备能够利用本地数据和邻居通信协作训练模型,但在传播分布外知识方面存在困难,而采用拓扑感知的聚合策略可显著提升其传播效果。
English Summary: Decentralized learning allows devices to collaboratively train models using local data and neighbor communication, but struggles with propagating out-of-distribution knowledge effectively, which can be significantly improved through topology-aware aggregation strategies.
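
The round-by-round propagation described above can be sketched as follows. This is a deliberately simplified, topology-unaware baseline (uniform averaging with neighbors), not the topology-aware strategy the paper proposes; a single float stands in for a model's parameters, and the topology and names are hypothetical.

```python
def aggregation_round(topology, values):
    """One round: each device averages its own value with its neighbors' values."""
    updated = {}
    for device, neighbors in topology.items():
        group = [values[device]] + [values[n] for n in neighbors]
        updated[device] = sum(group) / len(group)
    return updated

# Line topology 0 - 1 - 2; only device 0 starts with the "new knowledge" (value 3.0).
topology = {0: [1], 1: [0, 2], 2: [1]}
values = {0: 3.0, 1: 0.0, 2: 0.0}
after_one = aggregation_round(topology, values)
after_two = aggregation_round(topology, after_one)
```

Device 2 is two hops from device 0, so it is unchanged after the first round and only receives a diluted share of the signal in the second. This illustrates why both the location of the data and the shape of the topology affect how quickly knowledge spreads.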

Authors:An-Lan Wang, Jingqun Tang, Liao Lei, Hao Feng, Qi Liu, Xiang Fei, Jinghui Lu, Han Wang, Weiwei Liu, Hao Liu, Yuliang Liu, Xiang Bai, Can Huang
Title: WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
Abstract:
The rapid advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced capabilities in Document Understanding. However, prevailing benchmarks like DocVQA and ChartQA predominantly comprise scanned or digital documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios, such as variable illumination and physical distortions. This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions. Evaluations of state-of-the-art MLLMs on WildDoc expose substantial performance declines and underscore the models' inadequate robustness compared to traditional benchmarks, highlighting the unique challenges posed by real-world document understanding. Our project homepage is available at https://bytedance.github.io/WildDoc.
中文摘要:本文提出了首个针对自然环境下文档理解评估的基准WildDoc,揭示了当前最先进的多模态模型在面对真实世界光照变化和物理变形等挑战时出现的显著性能下降问题。
English Summary: This paper introduces WildDoc, the first benchmark for evaluating document understanding in natural environments, revealing significant performance drops in state-of-the-art multimodal models when faced with real-world challenges like variable lighting and physical distortions.

Authors:Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang
Title: Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations
Abstract:
The emergence of large reasoning models (LRMs) has transformed Natural Language Processing by excelling in complex tasks such as mathematical problem-solving and code generation. These models leverage chain-of-thought (CoT) processes, enabling them to emulate human-like reasoning strategies. However, the advancement of LRMs is hindered by the lack of comprehensive CoT datasets. Current resources often fail to provide extensive reasoning problems with coherent CoT processes distilled from multiple teacher models and do not account for multifaceted properties describing the internal characteristics of CoTs. To address these challenges, we introduce OmniThought, a large-scale dataset featuring 2 million CoT processes generated and validated by two powerful LRMs as teacher models. Each CoT process in OmniThought is annotated with novel Reasoning Verbosity (RV) and Cognitive Difficulty (CD) scores, which describe the appropriateness of CoT verbosity and cognitive difficulty level for models to comprehend these reasoning processes. We further establish a self-reliant pipeline to curate this dataset. Extensive experiments using Qwen2.5 models of various sizes demonstrate the positive impact of our proposed scores on LRM training effectiveness. Based on the proposed OmniThought dataset, we further train and release a series of high-performing LRMs, specifically equipped with stronger reasoning abilities and optimal CoT output length and difficulty level. Our contributions significantly enhance the development and training of LRMs for solving complex tasks.
Chinese: 大型推理模型(LRMs)虽推动了自然语言处理的发展,但因缺乏全面的思维链数据集而受限;为此我们提出了OmniThought数据集,包含200万条带新型评分标注的思维链流程,显著提升了LRM的训练效果与推理能力。
English: Large reasoning models (LRMs) have advanced natural language processing but face limitations due to insufficient chain-of-thought (CoT) datasets, prompting the introduction of OmniThought—a comprehensive dataset with 2 million annotated CoT processes and novel scoring metrics that enhance LRM training and performance.

Authors:Daniel Geissler, Lars Krupp, Vishal Banwari, David Habusch, Bo Zhou, Paul Lukowicz, Jakob Karolus
Title: Human in the Latent Loop (HILL): Interactively Guiding Model Training Through Human Intuition
Abstract:
Latent space representations are critical for understanding and improving the behavior of machine learning models, yet they often remain obscure and intricate. Understanding and exploring the latent space has the potential to contribute valuable human intuition and expertise about respective domains. In this work, we present HILL, an interactive framework allowing users to incorporate human intuition into the model training by interactively reshaping latent space representations. The modifications are infused into the model training loop via a novel approach inspired by knowledge distillation, treating the user's modifications as a teacher to guide the model in reshaping its intrinsic latent representation. The process allows the model to converge more effectively and overcome inefficiencies, as well as provide beneficial insights to the user. We evaluated HILL in a user study tasking participants to train an optimal model, closely observing the employed strategies. The results demonstrated that human-guided latent space modifications enhance model performance while maintaining generalization, yet also revealing the risks of including user biases. Our work introduces a novel human-AI interaction paradigm that infuses human intuition into model training and critically examines the impact of human intervention on training strategies and potential biases.
Chinese Summary: HILL框架允许用户在模型训练过程中交互式重塑潜在空间表示,通过融入人类直觉提升模型性能,同时批判性审视人为干预带来的偏见影响。
English Summary: The HILL framework enables users to interactively reshape latent space representations during model training, enhancing performance through human intuition while critically examining biases introduced by human intervention.

Authors:Vipula Rawte, Ryan A. Rossi, Franck Dernoncourt, Nedim Lipka
Title: Document Attribution: Examining Citation Relationships using Large Language Models
Abstract:
As Large Language Models (LLMs) are increasingly applied to document-based tasks - such as document summarization, question answering, and information extraction - where user requirements focus on retrieving information from provided documents rather than relying on the model's parametric knowledge, ensuring the trustworthiness and interpretability of these systems has become a critical concern. A central approach to addressing this challenge is attribution, which involves tracing the generated outputs back to their source documents. However, since LLMs can produce inaccurate or imprecise responses, it is crucial to assess the reliability of these citations. To tackle this, our work proposes two techniques. (1) A zero-shot approach that frames attribution as a straightforward textual entailment task. Our method using flan-ul2 demonstrates an improvement of 0.27% and 2.4% over the best baseline on the ID and OOD sets of AttributionBench, respectively. (2) We also explore the role of the attention mechanism in enhancing the attribution process. Using a smaller LLM, flan-t5-small, the F1 scores outperform the baseline across almost all layers except layer 4 and layers 8 through 11.
中文摘要:本研究针对大型语言模型在文档任务中的可信度问题,提出了两种改进归因的方法:基于零样本文本蕴含的方法在AttributionBench上表现优异,以及通过注意力机制分析在多数网络层提升了F1值。
English Summary: This study introduces two methods to enhance attribution in Large Language Models for document-based tasks: a zero-shot textual entailment approach that improves performance on AttributionBench, and an analysis of attention mechanisms that boosts F1 scores across most model layers.

Authors:J. Gregory Pauloski, Yadu Babuji, Ryan Chard, Mansi Sakarvadia, Kyle Chard, Ian Foster
Title: Empowering Scientific Workflows with Federated Agents
Abstract:
Agentic systems, in which diverse agents cooperate to tackle challenging problems, are exploding in popularity in the AI community. However, the agentic frameworks used to build these systems have not previously enabled use with research cyberinfrastructure. Here we introduce Academy, a modular and extensible middleware designed to deploy autonomous agents across the federated research ecosystem, including HPC systems, experimental facilities, and data repositories. To meet the demands of scientific computing, Academy supports asynchronous execution, heterogeneous resources, high-throughput data flows, and dynamic resource availability. It provides abstractions for expressing stateful agents, managing inter-agent coordination, and integrating computation with experimental control. We present microbenchmark results that demonstrate high performance and scalability in HPC environments. To demonstrate the breadth of applications that can be supported by agentic workflow designs, we also present case studies in materials discovery, decentralized learning, and information extraction in which agents are deployed across diverse HPC systems.
中文: Academy是一种模块化中间件,可在联合科研生态系统中部署自主智能体,支持异步执行与高通量数据流,已应用于材料发现、分布式学习等科学计算场景。
English: Academy is a modular middleware that enables autonomous agents to operate across federated research infrastructure, supporting asynchronous execution and high-throughput workflows for scientific applications like materials discovery and decentralized learning.

Authors:Ji Won Chung, Tongyu Zhou, Ivy Chen, Kevin Hsu, Ryan A. Rossi, Alexa Siu, Shunan Guo, Franck Dernoncourt, James Tompkin, Jeff Huang
Title: InfoVids: Reimagining the Viewer Experience with Alternative Visualization-Presenter Relationships
Abstract:
Traditional data presentations typically separate the presenter and visualization into two separate spaces--the 3D world and a 2D screen--enforcing visualization-centric stories. To create a more human-centric viewing experience, we establish a more equitable relationship between the visualization and the presenter through our InfoVids. These infographics-inspired informational videos are crafted to redefine relationships between the presenter and visualizations. As we design InfoVids, we explore how the use of layout, form, and interactions affects the viewer experience. We compare InfoVids against their baseline 2D 'slides' equivalents across 9 metrics with 30 participants and provide practical, long-term insights from an autobiographical perspective. Our mixed methods analyses reveal that this paradigm reduced viewer attention splitting, shifted the focus from the visualization to the presenter, and led to more interactive, natural, and engaging full-body data performances for viewers. Ultimately, InfoVids helped viewers re-imagine traditional dynamics between the presenter and visualizations.
中文: InfoVids通过在统一空间中整合演示者和可视化内容,建立了更平等的关系,减少了观众注意力分散,并通过互动式全身数据演示提升了参与度。
English: InfoVids create a more equitable relationship between presenters and visualizations by integrating them into a unified space, reducing attention splitting and enhancing engagement through interactive full-body data performances.

Authors:Lala Shakti Swarup Ray, Lars Krupp, Vitor Fortes Rey, Bo Zhou, Sungho Suh, Paul Lukowicz
Title: TxP: Reciprocal Generation of Ground Pressure Dynamics and Activity Descriptions for Improving Human Activity Recognition
Abstract:
Sensor-based human activity recognition (HAR) has predominantly focused on Inertial Measurement Units and vision data, often overlooking the capabilities unique to pressure sensors, which capture subtle body dynamics and shifts in the center of mass. Despite their potential for postural and balance-based activities, pressure sensors remain underutilized in the HAR domain due to limited datasets. To bridge this gap, we propose to exploit generative foundation models with pressure-specific HAR techniques. Specifically, we present a bidirectional Text×Pressure model that uses generative foundation models to interpret pressure data as natural language. TxP accomplishes two tasks: (1) Text2Pressure, converting activity text descriptions into pressure sequences, and (2) Pressure2Text, generating activity descriptions and classifications from dynamic pressure maps. Leveraging pre-trained models like CLIP and LLaMA 2 13B Chat, TxP is trained on our synthetic PressLang dataset, containing over 81,100 text-pressure pairs. Validated on real-world data for activities such as yoga and daily tasks, TxP provides novel approaches to data augmentation and classification grounded in atomic actions. This consequently improved HAR performance by up to 12.4% in macro F1 score compared to the state-of-the-art, advancing pressure-based HAR with broader applications and deeper insights into human movement.
中文: 本研究提出了一种双向文本-压力模型,利用生成式基础模型将压力传感器数据与自然语言相互转换,通过生成合成数据和改进分类,将人类活动识别的F1分数提升了高达12.4%。
English: This research introduces a bidirectional Text×Pressure model that leverages generative foundation models to bridge pressure sensor data with natural language, enhancing human activity recognition by generating synthetic data and improving classification accuracy by up to 12.4% in F1 score.

Authors:Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong
Title: ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Abstract:
Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the task competence of the base model and with training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
中文: 长期强化学习(ProRL)训练能使语言模型突破基础能力局限,开发出全新的推理策略,在不同任务中持续超越基础模型表现,证明强化学习能随时间推移有效拓展模型的推理边界。
English: Prolonged reinforcement learning (ProRL) training enables language models to develop novel reasoning strategies beyond their base capabilities, consistently outperforming base models across diverse tasks and demonstrating that RL meaningfully expands reasoning boundaries over time.
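
The abstract reports results in terms of pass@k without saying how it is computed. As a point of reference, the sketch below shows the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), computed from n sampled attempts of which c are correct. Whether the authors use exactly this estimator is an assumption, and the function name is hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct (requires k <= n)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    # Probability that a random size-k subset contains no correct sample, complemented.
    return 1.0 - comb(n - c, k) / comb(n, k)

never_solved = pass_at_k(n=10, c=0, k=5)
one_in_ten = pass_at_k(n=10, c=1, k=1)
```

A task the model never solves scores 0 for every k, which is the situation the abstract describes when base models "fail entirely regardless of the number of attempts".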

Authors:Chongjie Si, Xuankun Yang, Muqing Liu, Yadao Wang, Xiaokang Yang, Wenbo Su, Bo Zheng, Wei Shen
Title: Weight Spectra Induced Efficient Model Adaptation
Abstract:
Large-scale foundation models have demonstrated remarkable versatility across a wide range of downstream tasks. However, fully fine-tuning these models incurs prohibitive computational costs, motivating the development of Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA, which introduces low-rank updates to pre-trained weights. Despite their empirical success, the underlying mechanisms by which PEFT modifies model parameters remain underexplored. In this work, we present a systematic investigation into the structural changes of weight matrices during full fine-tuning. Through singular value decomposition (SVD), we reveal that fine-tuning predominantly amplifies the top singular values while leaving the remainder largely intact, suggesting that task-specific knowledge is injected into a low-dimensional subspace. Furthermore, we find that the dominant singular vectors are reoriented in task-specific directions, whereas the non-dominant subspace remains stable. Building on these insights, we propose a novel method that leverages learnable rescaling of top singular directions, enabling precise modulation of the most influential components without disrupting the global structure. Our approach achieves consistent improvements over strong baselines across multiple tasks, highlighting the efficacy of structurally informed fine-tuning.
中文: 本研究系统分析了微调如何改变权重矩阵的结构特性,发现任务特定知识主要通过放大和重定向顶部奇异值及向量来注入,基于此提出了一种新的微调方法,在多个任务上实现了持续改进。
English: This study systematically analyzes how fine-tuning modifies the structural properties of weight matrices, revealing that task-specific knowledge is primarily injected through amplification and reorientation of top singular values and vectors, leading to a novel fine-tuning method that achieves consistent improvements across tasks.

Authors:Xingjian Wu, Xiangfei Qiu, Hongfan Gao, Jilin Hu, Bin Yang, Chenjuan Guo
Title: $K^2$VAE: A Koopman-Kalman Enhanced Variational AutoEncoder for Probabilistic Time Series Forecasting
Abstract:
Probabilistic Time Series Forecasting (PTSF) plays a crucial role in decision-making across various fields, including economics, energy, and transportation. Most existing methods excel at short-term forecasting, while overlooking the hurdles of Long-term Probabilistic Time Series Forecasting (LPTSF). As the forecast horizon extends, the inherent nonlinear dynamics have a significant adverse effect on prediction accuracy, and make generative models inefficient by increasing the cost of each iteration. To overcome these limitations, we introduce $K^2$VAE, an efficient VAE-based generative model that leverages a KoopmanNet to transform nonlinear time series into a linear dynamical system, and devises a KalmanNet to refine predictions and model uncertainty in this linear system, which reduces error accumulation in long-term forecasting. Extensive experiments demonstrate that $K^2$VAE outperforms state-of-the-art methods in both short- and long-term PTSF, providing a more efficient and accurate solution.
Chinese: 本文提出$K^2$VAE模型,通过KoopmanNet将非线性时间序列转化为线性系统,并利用KalmanNet优化预测和不确定性建模,有效提升长期概率时间序列预测的精度与效率。
English: The paper introduces $K^2$VAE, an efficient VAE-based generative model that transforms nonlinear time series into a linear system using KoopmanNet and refines predictions with KalmanNet to enhance long-term probabilistic forecasting accuracy and efficiency.

Authors:Jing Du, Haley Stone, Yang Yang, Ashna Desai, Hao Xue, Andreas Züfle, Chandini Raina MacIntyre, Flora D. Salim
Title: BLUE: Bi-layer Heterogeneous Graph Fusion Network for Avian Influenza Forecasting
Abstract:
Accurate forecasting of avian influenza outbreaks within wild bird populations requires models that account for complex, multi-scale transmission patterns driven by various factors. Spatio-temporal GNN-based models have recently gained traction for infection forecasting due to their ability to capture relations and flow between spatial regions, but most existing frameworks rely solely on spatial regions and their connections. This overlooks valuable genetic information at the case level, such as cases in one region being genetically descended from strains in another, which is essential for understanding how infectious diseases spread through epidemiological linkages beyond geography. We address this gap with BLUE, a Bi-Layer heterogeneous graph fUsion nEtwork designed to integrate genetic, spatial, and ecological data for accurate outbreak forecasting. The framework 1) builds heterogeneous graphs from multiple information sources and multiple layers, 2) smooths across relation types, 3) performs fusion while retaining structural patterns, and 4) predicts future outbreaks via an autoregressive graph sequence model that captures transmission dynamics over time. To facilitate further research, we introduce the Avian-US dataset, a dataset for avian influenza outbreak forecasting in the United States that incorporates genetic, spatial, and ecological data across locations. BLUE achieves superior performance over existing baselines, highlighting the value of incorporating multi-layer information into infectious disease forecasting.
中文: BLUE是一种双层异构图融合网络,通过整合遗传、空间和生态数据,捕捉超越地理联系的复杂传播动态,从而精确预测禽流感疫情。
English: BLUE is a bi-layer heterogeneous graph fusion network that integrates genetic, spatial, and ecological data to accurately forecast avian influenza outbreaks by capturing complex transmission dynamics beyond mere geographical connections.

Authors:Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostafa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
Title: Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning
Abstract:
Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models -- and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning -- as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman's $\rho \approx 0.9$) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present Prismatic Synthesis, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as we scale synthetic data -- not just on in-distribution tests but across unseen, out-of-distribution benchmarks -- significantly outperforming state-of-the-art models that rely on a data generator 20 times larger than ours. For example, PrismMath-7B, our model distilled from a 32B LLM, outperforms R1-Distill-Qwen-7B -- the same base model trained on proprietary data generated by 671B R1 -- on 6 out of 7 challenging benchmarks.
中文摘要:语言模型的有效泛化能力取决于训练数据的多样性,G-Vendi指标通过模型梯度熵量化多样性,而Prismatic Synthesis框架可生成梯度空间欠覆盖区域的合成数据,从而显著提升分布外任务的性能表现。
English Summary: Effective generalization in language models is driven by training data diversity, which can be accurately measured by the G-Vendi metric and enhanced through the Prismatic Synthesis framework to significantly improve out-of-distribution performance.
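
A rough sketch of an entropy-based diversity score over gradients is given below. The abstract only states that G-Vendi uses "the entropy of model-induced gradients"; the exact formula here is an assumption borrowed from the standard Vendi Score (exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix). The function name `vendi_score_from_gradients` and the toy gradient vectors are hypothetical, and in practice the gradients would come from a proxy model.

```python
import numpy as np

def vendi_score_from_gradients(grads):
    """Exponentiated eigenvalue entropy of the cosine-similarity kernel of `grads`.

    Assumption: each row of `grads` is one sample's (nonzero) gradient vector.
    The score ranges from 1 (all gradients identical) to n (all orthogonal).
    """
    G = np.asarray(grads, dtype=float)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)  # unit-normalize each row
    K = G @ G.T / len(G)                              # kernel with trace 1
    eig = np.linalg.eigvalsh(K)
    eig = eig[eig > 1e-12]                            # drop numerical zeros
    return float(np.exp(-np.sum(eig * np.log(eig))))

# Two identical gradients behave like one effective sample,
# two orthogonal gradients like two.
low_diversity = vendi_score_from_gradients([[1.0, 0.0], [1.0, 0.0]])
high_diversity = vendi_score_from_gradients([[1.0, 0.0], [0.0, 1.0]])
```

Read this way, the score acts as an "effective number of distinct gradient directions" in a dataset.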

Authors:Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, Yu Cheng
Title: Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning
Abstract:
While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework **GLIDER** (**G**rounding **L**anguage Models as Eff**I**cient **D**ecision-Making Agents via Offline Hi**E**rarchical **R**einforcement Learning) that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.
Chinese: 大语言模型在长期决策中因探索不足和信用分配困难而受限,但GLIDER框架通过分层结构将复杂问题分解为连贯的思维链子任务,显著提升了稀疏奖励环境下任务的探索能力和泛化性能。
English: Large language models face challenges in long-term decision-making due to poor exploration and credit assignment, but the GLIDER framework introduces a hierarchical structure that decomposes complex tasks into manageable sub-tasks, significantly improving performance and adaptability in sparse-reward environments.

Authors:Andrew Gambardella, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Title: Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar
Abstract:
Typical methods for evaluating the performance of language models evaluate their ability to answer questions accurately. These evaluation metrics are acceptable for determining the extent to which language models can understand and reason about text in a general sense, but fail to capture nuanced capabilities, such as the ability of language models to recognize and obey rare grammar points, particularly in languages other than English. We measure the perplexity of language models when confronted with the "first person psych predicate restriction" grammar point in Japanese. Weblab is the only tested open source model in the 7-10B parameter range which consistently assigns higher perplexity to ungrammatical psych predicate sentences than grammatical ones. We give evidence that Weblab's uniformly bad tokenization is a possible root cause for its good performance, and show that Llama 3's perplexity on grammatical psych predicate sentences can be reduced by orders of magnitude (28x difference) by restricting test sentences to those with uniformly well-behaved tokenizations. We show in further experiments on machine translation tasks that language models will use alternative grammar patterns in order to produce grammatical sentences when tokenization issues prevent the most natural sentence from being output.
中文: 现有语言模型评估主要关注总体准确性,却忽视了识别非英语语言中罕见语法点等细微能力;通过测试日语心理谓词限制发现,Weblab因统一分词表现优异,而优化分词可使Llama 3的语法句困惑度降低28倍。
English: Current language model evaluations focus on general accuracy but overlook nuanced capabilities like recognizing rare grammar points in non-English languages, as demonstrated by testing Japanese psych predicate restrictions where Weblab outperforms others due to uniform tokenization, while Llama 3's performance improves significantly with optimized tokenization.

Authors:Xinyao Liao, Wei Wei, Xiaoye Qu, Yu Cheng
Title: Step-level Reward for Free in RL-based T2I Diffusion Model Fine-tuning
Abstract:
Recent advances in text-to-image (T2I) diffusion model fine-tuning leverage reinforcement learning (RL) to align generated images with learnable reward functions. The existing approaches reformulate denoising as a Markov decision process for RL-driven optimization. However, they suffer from reward sparsity, receiving only a single delayed reward per generated trajectory. This flaw hinders precise step-level attribution of denoising actions and undermines training efficiency. To address this, we propose a simple yet effective credit assignment framework that dynamically distributes dense rewards across denoising steps. Specifically, we track changes in cosine similarity between intermediate and final images to quantify each step's contribution to progressively reducing the distance to the final image. Our approach avoids additional auxiliary neural networks for step-level preference modeling and instead uses reward shaping to highlight denoising phases that have a greater impact on image quality. Our method achieves 1.25 to 2 times higher sample efficiency and better generalization across four human preference reward functions, without compromising the original optimal policy.
中文摘要:本研究提出一种信用分配框架,通过动态分配去噪步骤的密集奖励,在不改变原始策略的前提下,显著提升了文本到图像扩散模型的训练效率和泛化能力。
English Summary: This study introduces a credit assignment framework that dynamically distributes dense rewards across denoising steps in text-to-image diffusion models, significantly improving training efficiency and generalization without altering the original policy.
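
The credit assignment idea in the abstract above (tracking changes in cosine similarity between intermediate and final images) can be sketched as follows. This is an illustrative assumption, not the paper's exact scheme: images are represented as flat, nonzero lists of floats, and the single trajectory reward is split in proportion to each step's similarity gain. The names `cosine_similarity` and `step_rewards` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two nonzero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def step_rewards(intermediates, final_reward):
    """Split one trajectory-level reward across denoising steps.

    `intermediates` holds the image after each step; the last entry is the
    final image. Each step is credited in proportion to how much it increased
    the similarity to the final image (assumed normalization).
    """
    final = intermediates[-1]
    sims = [cosine_similarity(x, final) for x in intermediates]
    deltas = [sims[t + 1] - sims[t] for t in range(len(sims) - 1)]
    total = sum(deltas)
    if abs(total) < 1e-12:
        # No net change in similarity: fall back to a uniform split.
        return [final_reward / len(deltas)] * len(deltas)
    return [final_reward * d / total for d in deltas]

# Toy trajectory of three 2-D "images"; the first step moves most of the way.
trajectory = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
rewards = step_rewards(trajectory, final_reward=2.0)
```

By construction the per-step rewards sum to the original trajectory reward, so the total return of a trajectory is unchanged in this sketch.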

Authors:Yusheng Zhao, Qixin Zhang, Xiao Luo, Weizhi Zhang, Zhiping Xiao, Wei Ju, Philip S. Yu, Ming Zhang
Title: Dynamic Text Bundling Supervision for Zero-Shot Inference on Text-Attributed Graphs
Abstract:
Large language models (LLMs) have been used in many zero-shot learning problems, owing to their strong generalization ability. Recently, adopting LLMs in text-attributed graphs (TAGs) has drawn increasing attention. However, the adoption of LLMs faces two major challenges: limited information on graph structure and unreliable responses. LLMs struggle with text attributes isolated from the graph topology. Worse still, they yield unreliable predictions due to both information insufficiency and the inherent weakness of LLMs (e.g., hallucination). To this end, this paper proposes a novel method named Dynamic Text Bundling Supervision (DENSE) that queries LLMs with bundles of texts to obtain bundle-level labels and uses these labels to supervise graph neural networks. Specifically, we sample a set of bundles, each containing a set of nodes with corresponding texts of close proximity. We then query LLMs with the bundled texts to obtain the label of each bundle. Subsequently, the bundle labels are used to supervise the optimization of graph neural networks, and the bundles are further refined to exclude noisy items. To justify our design, we also provide theoretical analysis of the proposed method. Extensive experiments across ten datasets validate the effectiveness of the proposed method.
中文: 本文提出DENSE方法,通过将邻近节点的文本打包输入大语言模型获取标签来监督图神经网络,利用动态捆绑机制解决结构信息缺失和模型不可靠问题,并在十个数据集上验证了其有效性。
English: This paper introduces DENSE, a method that queries large language models with text bundles to generate labels for supervising graph neural networks, addressing challenges of limited structural information and unreliable LLM predictions through dynamic bundling and theoretical validation across multiple datasets.

Authors:Yusheng Zhao, Qixin Zhang, Xiao Luo, Weizhi Zhang, Zhiping Xiao, Wei Ju, Philip S. Yu, Ming Zhang
Title: Dynamic Bundling with Large Language Models for Zero-Shot Inference on Text-Attributed Graphs
Abstract:
Large language models (LLMs) have been used in many zero-shot learning problems, owing to their strong generalization ability. Recently, adopting LLMs in text-attributed graphs (TAGs) has drawn increasing attention. However, the adoption of LLMs faces two major challenges: limited information on graph structure and unreliable responses. LLMs struggle with text attributes isolated from the graph topology. Worse still, they yield unreliable predictions due to both information insufficiency and the inherent weakness of LLMs (e.g., hallucination). To this end, this paper proposes a novel method named Dynamic Text Bundling Supervision (DENSE) that queries LLMs with bundles of texts to obtain bundle-level labels and uses these labels to supervise graph neural networks. Specifically, we sample a set of bundles, each containing a set of nodes with corresponding texts of close proximity. We then query LLMs with the bundled texts to obtain the label of each bundle. Subsequently, the bundle labels are used to supervise the optimization of graph neural networks, and the bundles are further refined to exclude noisy items. To justify our design, we also provide theoretical analysis of the proposed method. Extensive experiments across ten datasets validate the effectiveness of the proposed method.
中文: 本文提出DENSE方法,通过将邻近节点的文本打包输入大语言模型获取标签来监督图神经网络,利用动态捆绑机制解决结构信息缺失和模型不可靠问题,并在十个数据集上验证了其有效性。
English: This paper introduces DENSE, a method that queries large language models with text bundles to generate labels for supervising graph neural networks, addressing challenges of limited structural information and unreliable LLM predictions through dynamic bundling and theoretical validation across multiple datasets.

Authors:Yusheng Zhao, Xiao Luo, Weizhi Zhang, Wei Ju, Zhiping Xiao, Philip S. Yu, Ming Zhang
Title: MARCO: Meta-Reflection with Cross-Referencing for Code Reasoning
Abstract:
The ability to reason is one of the most fundamental capabilities of large language models (LLMs), enabling a wide range of downstream tasks through sophisticated problem-solving. A critical aspect of this is code reasoning, which involves logical reasoning with formal languages (i.e., programming code). In this paper, we enhance this capability of LLMs by exploring the following question: how can an LLM agent become progressively smarter in code reasoning with each solution it proposes, thereby achieving substantial cumulative improvement? Most existing research takes a static perspective, focusing on isolated problem-solving using frozen LLMs. In contrast, we adopt a cognitive-evolving perspective and propose a novel framework named Meta-Reflection with Cross-Referencing (MARCO) that enables the LLM to evolve dynamically during inference through self-improvement. From the perspective of human cognitive development, we leverage both knowledge accumulation and lesson sharing. In particular, to accumulate knowledge during problem-solving, we propose meta-reflection that reflects on the reasoning paths of the current problem to obtain knowledge and experience for future consideration. Moreover, to effectively utilize the lessons from other agents, we propose cross-referencing that incorporates the solution and feedback from other agents into the current problem-solving process. We conduct experiments across various datasets in code reasoning, and the results demonstrate the effectiveness of MARCO.
中文摘要:本文提出MARCO框架,通过元反思积累知识、交叉参考借鉴其他智能体方案,使大语言模型在推理过程中能够自我进化,逐步提升代码推理能力。
English Summary: This paper introduces MARCO, a framework that enables large language models to progressively improve their code reasoning skills through self-reflection and cross-referencing solutions from other agents during inference.

Authors:Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo
Title: Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence
Abstract:
Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through a sudden jump in accuracy, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training. Specifically, we extend the copy task from previous research into an In-Context Meta Learning setting, where models must infer a task from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phase change in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis leads to a deeper understanding of the source of the transformer's ICL ability.
中文摘要:本研究揭示了Transformer模型通过上下文学习发展元学习能力的过程,发现其训练包含多个阶段且每个阶段形成独特电路结构,这与简单的归纳头机制有所不同。
English Summary: This study investigates how transformer models develop meta-learning capabilities through in-context learning, revealing multiple training phases with distinct circuit formations that differ from simple induction head mechanisms.

Authors:Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Title: AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
Abstract:
Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models, such as DeepSeek-R1, including data curation strategies and RL training recipe, are often omitted. Moreover, recent research indicates distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL not only significantly enhances the performance of strong distilled models on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also on code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases to enable verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.
中文: 大规模强化学习通过先数学后代码的顺序训练策略,显著提升了中小型模型的推理能力,超越了基于蒸馏的方法。
English: Large-scale reinforcement learning significantly enhances reasoning capabilities in small- and mid-sized models, surpassing distillation-based approaches through a sequential training strategy of math-only followed by code-only prompts.

Authors:Haochen Shi, Tianshi Zheng, Weiqi Wang, Baixuan Xu, Chunyang Li, Chunkit Chan, Tao Fan, Yangqiu Song, Qiang Yang
Title: INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling
Abstract:
Large Language Model (LLM) routing is a pivotal technique for navigating a diverse landscape of LLMs, aiming to select the best-performing LLMs tailored to the domains of user queries, while managing computational resources. However, current routing approaches often face limitations in scalability when dealing with a large pool of specialized LLMs, or in their adaptability to extending model scope and evolving capability domains. To overcome those challenges, we propose InferenceDynamics, a flexible and scalable multi-dimensional routing framework that models the capability and knowledge of models. We operate it on our comprehensive dataset RouteMix, and demonstrate its effectiveness and generalizability in group-level routing using modern benchmarks including MMLU-Pro, GPQA, BigGenBench, and LiveBench, showcasing its ability to identify and leverage top-performing models for given tasks, leading to superior outcomes with efficient resource utilization. The broader adoption of InferenceDynamics can empower users to harness the full specialized potential of the LLM ecosystem, and our code will be made publicly available to encourage further research.
中文:InferenceDynamics是一个可扩展的多维路由框架,能有效选择最适合特定任务的大型语言模型,从而提升性能并优化资源利用。
English: InferenceDynamics is a scalable multi-dimensional routing framework that effectively selects top-performing LLMs for specific tasks, enhancing outcomes and resource efficiency.

Authors:Chongjie Si, Kangtao Lv, Jingjing Jiang, Yadao Wang, Yongwei Wang, Xiaokang Yang, Wenbo Su, Bo Zheng, Wei Shen
Title: NAN: A Training-Free Solution to Coefficient Estimation in Model Merging
Abstract:
Model merging offers a training-free alternative to multi-task learning by combining independently fine-tuned models into a unified one without access to raw data. However, existing approaches often rely on heuristics to determine the merging coefficients, limiting their scalability and generality. In this work, we revisit model merging through the lens of least-squares optimization and show that the optimal merging weights should scale with the amount of task-specific information encoded in each model. Based on this insight, we propose NAN, a simple yet effective method that estimates model merging coefficients via the inverse of parameter norm. NAN is training-free, plug-and-play, and applicable to a wide range of merging strategies. Extensive experiments show that NAN consistently improves the performance of baseline methods.
Chinese: 模型融合是一种无需训练即可将微调模型整合为统一模型的方法,而提出的NAN方法通过参数范数倒数优化融合系数,显著提升了多种基线方法的性能。
English: Model merging is a training-free approach that combines fine-tuned models into a unified one, and the proposed NAN method optimizes merging coefficients using parameter norm inverses to enhance performance across various strategies.
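
The abstract above describes estimating merging coefficients from the inverse of the parameter norm. The sketch below illustrates that idea with flattened parameter vectors. It is not the authors' code: the names `inverse_norm_coefficients` and `merge` are hypothetical, and normalizing the coefficients to sum to one and combining the models by a plain weighted sum are both assumptions.

```python
import math

def inverse_norm_coefficients(param_vectors):
    """Coefficients proportional to 1 / ||params||, normalized to sum to 1.

    Assumption: each model is a flat, nonzero list of floats; the abstract
    does not specify the normalization used here.
    """
    norms = [math.sqrt(sum(p * p for p in vec)) for vec in param_vectors]
    inverse = [1.0 / n for n in norms]
    total = sum(inverse)
    return [w / total for w in inverse]

def merge(param_vectors, coeffs):
    """Weighted sum of the parameter vectors, element by element."""
    dim = len(param_vectors[0])
    return [sum(c * vec[i] for c, vec in zip(coeffs, param_vectors))
            for i in range(dim)]

# Two toy "models" with parameter norms 5 and 1.
models = [[3.0, 4.0], [0.6, 0.8]]
coeffs = inverse_norm_coefficients(models)
merged = merge(models, coeffs)
```

Because the coefficients depend only on parameter norms, no data or gradient steps are needed, which matches the training-free framing of the abstract.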

Authors:Weiqi Wang, Limeng Cui, Xin Liu, Sreyashi Nag, Wenju Xu, Chen Luo, Sheikh Muhammad Sarwar, Yang Li, Hansu Gu, Hui Liu, Changlong Yu, Jiaxin Bai, Yifan Gao, Haiyang Zhang, Qi He, Shuiwang Ji, Yangqiu Song
Title: EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association
Abstract:
Goal-oriented script planning, or the ability to devise coherent sequences of actions toward specific goals, is commonly employed by humans to plan for typical activities. In e-commerce, customers increasingly seek LLM-based assistants to generate scripts and recommend products at each step, thereby facilitating convenient and efficient shopping experiences. However, this capability remains underexplored due to several challenges, including the inability of LLMs to simultaneously conduct script planning and product retrieval, difficulties in matching products caused by semantic discrepancies between planned actions and search queries, and a lack of methods and benchmark data for evaluation. In this paper, we step forward by formally defining the task of E-commerce Script Planning (EcomScript) as three sequential subtasks. We propose a novel framework that enables the scalable generation of product-enriched scripts by associating products with each step based on the semantic similarity between the actions and their purchase intentions. By applying our framework to real-world e-commerce data, we construct the very first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229 scripts sourced from 2.4 million products. Human annotations are then conducted to provide gold labels for a sampled subset, forming an evaluation benchmark. Extensive experiments reveal that current (L)LMs face significant challenges with EcomScript tasks, even after fine-tuning, while injecting product purchase intentions improves their performance.
中文摘要:本文提出了电子商务脚本规划(EcomScript)新框架,通过将行动与购买意图对齐来生成产品丰富的购物脚本,并创建首个大规模基准数据集,以解决当前基于大语言模型的购物助手存在的不足。
English Summary: This paper introduces E-commerce Script Planning (EcomScript), a novel framework that generates product-enriched shopping scripts by aligning actions with purchase intentions, and creates the first large-scale benchmark dataset to address current limitations in LLM-based shopping assistants.

Authors:Hui Huang, Yancheng He, Hongli Zhou, Rui Zhang, Wei Liu, Weixun Wang, Wenbo Su, Bo Zheng, Jiaheng Liu
Title: Think-J: Learning to Think for Generative LLM-as-a-Judge
Abstract:
LLM-as-a-Judge refers to the automatic modeling of preferences for responses generated by Large Language Models (LLMs), which is of significant importance for both LLM evaluation and reward modeling. Although generative LLMs have made substantial progress in various tasks, their performance as LLM-Judge still falls short of expectations. In this work, we propose Think-J, which improves generative LLM-as-a-Judge by learning how to think. We first utilize a small amount of curated data to develop the model with initial judgment thinking capabilities. Subsequently, we optimize the judgment thinking traces based on reinforcement learning (RL). We propose two methods for judgment thinking optimization, based on offline and online RL, respectively. The offline RL requires training a critic model to construct positive and negative examples for learning. The online method defines rule-based reward as feedback for optimization. Experimental results show that our approach can significantly enhance the evaluation capability of generative LLM-Judge, surpassing both generative and classifier-based LLM-Judge without requiring extra human annotations.
中文: 提出的Think-J方法通过少量精选数据培养初步判断思维,并采用离线和在线强化学习进行优化,无需额外人工标注即可显著提升生成式LLM作为评判者的评估能力。
English: The proposed Think-J method enhances generative LLM-as-a-Judge by developing initial judgment thinking with curated data and optimizing it through offline and online reinforcement learning, significantly improving evaluation capabilities without additional human input.

Authors:Wei Fan, Tianshi Zheng, Yiran Hu, Zheye Deng, Weiqi Wang, Baixuan Xu, Chunyang Li, Haoran Li, Weixing Shen, Yangqiu Song
Title: Legal Rule Induction: Towards Generalizable Principle Discovery from Analogous Judicial Precedents
Abstract:
Legal rules encompass not only codified statutes but also implicit adjudicatory principles derived from precedents that contain discretionary norms, social morality, and policy. While computational legal research has advanced in applying established rules to cases, inducing legal rules from judicial decisions remains understudied, constrained by limitations in model inference efficacy and symbolic reasoning capability. The advent of Large Language Models (LLMs) offers unprecedented opportunities for automating the extraction of such latent principles, yet progress is stymied by the absence of formal task definitions, benchmark datasets, and methodologies. To address this gap, we formalize Legal Rule Induction (LRI) as the task of deriving concise, generalizable doctrinal rules from sets of analogous precedents, distilling their shared preconditions, normative behaviors, and legal consequences. We introduce the first LRI benchmark, comprising 5,121 case sets (38,088 Chinese cases in total) for model tuning and 216 expert-annotated gold test sets. Experimental results reveal that: 1) State-of-the-art LLMs struggle with over-generalization and hallucination; 2) Training on our dataset markedly enhances LLMs' capabilities in capturing nuanced rule patterns across similar cases.
中文: 本研究将法律规则归纳定义为从类似判例中提取原则性规则,并创建了首个基准测试,实验表明大语言模型虽存在过度概括问题,但经数据集训练后能显著提升对案例细微规则的识别能力。
English: This study formalizes Legal Rule Induction (LRI) as extracting doctrinal rules from precedents and introduces a benchmark where large language models initially struggle with over-generalization but show improved accuracy after training on the dataset.

Authors:Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim
Title: Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning
Abstract:
Retrieval-Augmented Generation (RAG) systems empower large language models (LLMs) with external knowledge, yet struggle with efficiency-accuracy trade-offs when scaling to large knowledge graphs. Existing approaches often rely on monolithic graph retrieval, incurring unnecessary latency for simple queries and fragmented reasoning for complex multi-hop questions. To address these challenges, this paper proposes SPLIT-RAG, a multi-agent RAG framework built on question-driven semantic graph partitioning and collaborative subgraph retrieval. The framework first creates a Semantic Partitioning of Linked Information, then uses the Type-Specialized knowledge base to achieve Multi-Agent RAG. The attribute-aware graph segmentation divides knowledge graphs into semantically coherent subgraphs, ensuring subgraphs align with different query types, while lightweight LLM agents are assigned to partitioned subgraphs, and only relevant partitions are activated during retrieval, thus reducing the search space while enhancing efficiency. Finally, a hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verifications. Extensive experimental validation demonstrates considerable improvements compared to existing approaches.
中文摘要:SPLIT-RAG提出多智能体框架,通过语义化分割知识图谱并部署专项智能体,有效提升复杂查询的检索效率与准确性。
English Summary: SPLIT-RAG introduces a multi-agent framework that partitions knowledge graphs into semantic subgraphs and deploys specialized agents to enhance retrieval efficiency and accuracy for complex queries.

Authors:Yiyuan Yang, Guodong Long, Qinghua Lu, Liming Zhu, Jing Jiang, Chengqi Zhang
Title: Federated Low-Rank Adaptation for Foundation Models: A Survey
Abstract:
Effectively leveraging private datasets remains a significant challenge in developing foundation models. Federated Learning (FL) has recently emerged as a collaborative framework that enables multiple users to fine-tune these models while mitigating data privacy risks. Meanwhile, Low-Rank Adaptation (LoRA) offers a resource-efficient alternative for fine-tuning foundation models by dramatically reducing the number of trainable parameters. This survey examines how LoRA has been integrated into federated fine-tuning for foundation models, an area we term FedLoRA, by focusing on three key challenges: distributed learning, heterogeneity, and efficiency. We further categorize existing work based on the specific methods used to address each challenge. Finally, we discuss open research questions and highlight promising directions for future investigation, outlining the next steps for advancing FedLoRA.
中文摘要:本综述探讨了将低秩适应(LoRA)融入基础模型的联邦微调(称为FedLoRA)的方法,针对分布式学习、异构性和效率等关键挑战,并展望了未来的研究方向。
English Summary: This survey explores the integration of Low-Rank Adaptation (LoRA) into federated fine-tuning for foundation models, termed FedLoRA, addressing challenges in distributed learning, heterogeneity, and efficiency while outlining future research directions.
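
To make the parameter reduction mentioned in the abstract concrete, here is a minimal sketch that is not taken from the survey. It counts the trainable parameters of a full weight update against those of a rank-r LoRA update of the form W + BA. The layer dimensions and the rank are illustrative assumptions.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is tuned."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update W + B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8

full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
ratio = lora / full

print(f"full: {full}, lora: {lora}, ratio: {ratio:.6f}")
```

Under these assumed sizes, a federated client would exchange only the two small factors rather than the full matrix, which is the efficiency argument the abstract alludes to.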

Authors:Takeshi Kojima, Yaonan Zhu, Yusuke Iwasawa, Toshinori Kitamura, Gang Yan, Shu Morikuni, Ryosuke Takanami, Alfredo Solano, Tatsuya Matsushima, Akiko Murakami, Yutaka Matsuo
Title: A Comprehensive Survey on Physical Risk Control in the Era of Foundation Model-enabled Robotics
Abstract:
Recent Foundation Model-enabled robotics (FMRs) display greatly improved general-purpose skills, enabling more adaptable automation than conventional robotics. Their ability to handle diverse tasks thus creates new opportunities to replace human labor. However, unlike general foundation models, FMRs interact with the physical world, where their actions directly affect the safety of humans and surrounding objects, requiring careful deployment and control. Based on this proposition, our survey comprehensively summarizes robot control approaches to mitigate physical risks by covering the entire lifespan of FMRs, from the pre-deployment stage to the post-accident stage. Specifically, we broadly divide the timeline into the following three phases: (1) pre-deployment phase, (2) pre-incident phase, and (3) post-incident phase. Throughout this survey, we find that there is much room to study (i) pre-incident risk mitigation strategies, (ii) research that assumes physical interaction with humans, and (iii) essential issues of foundation models themselves. We hope that this survey will be a milestone in providing a high-resolution analysis of the physical risks of FMRs and their control, contributing to the realization of a good human-robot relationship.
中文: 近期基础模型赋能的机器人技术虽提升了通用性和任务多样性,却因直接作用于物理世界而存在安全风险;本综述系统梳理其全生命周期风险管控方法,并指出需重点研究人机物理交互安全等关键领域,以促进良性人机关系发展。
English: Recent Foundation Model-enabled robotics (FMRs) offer enhanced adaptability and task diversity but pose physical safety risks, prompting this survey to comprehensively analyze risk mitigation strategies across their lifecycle and identify key research gaps for safer human-robot interactions.

Authors:Yuncheng Hua, Ji Miao, Mehdi Jafari, Jianxiang Xie, Hao Xue, Flora D. Salim
Title: SOCIA: An End-to-End Agentic Framework for Automated Cyber-Physical-Social Simulator Generation
Abstract:
This paper introduces SOCIA (Simulation Orchestration for Cyber-physical-social Intelligence and Agents), a novel end-to-end framework leveraging Large Language Model (LLM)-based multi-agent systems to automate the generation of high-fidelity Cyber-Physical-Social (CPS) simulators. Addressing the challenges of labor-intensive manual simulator development and complex data calibration, SOCIA integrates a centralized orchestration manager that coordinates specialized agents for tasks including data comprehension, code generation, simulation execution, and iterative evaluation-feedback loops. Through empirical evaluations across diverse CPS tasks, such as mask adoption behavior simulation (social), personal mobility generation (physical), and user modeling (cyber), SOCIA demonstrates its ability to produce high-fidelity, scalable simulations with reduced human intervention. These results highlight SOCIA's potential to offer a scalable solution for studying complex CPS phenomena.
中文摘要:本文提出SOCIA框架,利用基于大语言模型的多智能体系统自动生成高保真信息物理社会模拟器,通过多领域实证评估证明其能以最少人工干预创建可扩展的仿真系统。
English Summary: This paper presents SOCIA, an end-to-end framework using LLM-based multi-agent systems to automate the creation of high-fidelity Cyber-Physical-Social simulators, demonstrating through empirical evaluations its effectiveness in generating scalable simulations with minimal human intervention across diverse applications.

Authors:Xvyuan Liu, Xiangfei Qiu, Xingjian Wu, Zhengyu Li, Chenjuan Guo, Jilin Hu, Bin Yang
Title: Rethinking Irregular Time Series Forecasting: A Simple yet Effective Baseline
Abstract:
The forecasting of irregular multivariate time series (IMTS) is a critical task in domains like healthcare and climate science. However, this task faces two significant hurdles: 1) the inherent non-uniformity and missing data in IMTS complicate the modeling of temporal dynamics, and 2) existing methods often rely on computationally expensive architectures. To address these dual challenges, we introduce APN, a general and efficient forecasting framework. At the core of APN is a novel Time-Aware Patch Aggregation (TAPA) module that introduces an aggregation-based paradigm for adaptive patching, moving beyond the limitations of fixed-span segmentation and interpolation-based methods. TAPA first learns dynamic temporal boundaries to define data-driven segments. Crucially, instead of resampling or interpolating, it directly computes patch representations via a time-aware weighted aggregation of all raw observations, where weights are determined by each observation's temporal relevance to the segment. This approach provides two key advantages: it preserves data fidelity by avoiding the introduction of artificial data points and ensures complete information coverage by design. The resulting regularized and information-rich patch representations enable the use of a lightweight query module for historical context aggregation and a simple MLP for final prediction. Extensive experiments on multiple real-world datasets demonstrate that APN establishes a new state-of-the-art, significantly outperforming existing methods in both prediction accuracy and computational efficiency.
中文: APN框架通过创新的时间感知片段聚合模块,动态生成数据驱动的分段并直接聚合原始观测值而非插值,有效解决了不规则多元时间序列预测的难题,在精度和计算效率上均实现了最优表现。
English: The APN framework introduces a novel Time-Aware Patch Aggregation module that overcomes irregular multivariate time series forecasting challenges by dynamically creating data-driven segments and aggregating raw observations without interpolation, achieving superior accuracy and computational efficiency.
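
The time-aware weighted aggregation described in the abstract can be illustrated with a small sketch. This is not the authors' TAPA module: the Gaussian weighting kernel, the fixed patch centers and widths (which the paper learns), and the sample data are all assumptions made for illustration. The point is only that every raw observation contributes to every patch, weighted by temporal proximity, with no interpolation.

```python
import math


def aggregate_patch(times, values, center, width):
    """Weighted average of all raw observations for one patch.

    Weights follow an assumed Gaussian kernel on the distance between each
    observation time and the patch center; no resampling is performed.
    """
    weights = [math.exp(-((t - center) / width) ** 2) for t in times]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Irregularly sampled toy series (hypothetical data).
times = [0.0, 0.4, 1.7, 3.1, 3.2]
values = [1.0, 1.2, 2.0, 3.5, 3.4]

# Hypothetical (center, width) pairs standing in for learned boundaries.
patches = [(0.5, 1.0), (3.0, 1.0)]

reps = [aggregate_patch(times, values, c, w) for c, w in patches]
print(reps)
```

Because each representation is a convex combination of observed values, it always stays within the range of the raw data.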

Authors:Xiao Tang, Huirong Xiao, Chao Shen, Li Sun, Qinghe Du, Dusit Niyato, Zhu Han
Title: Unfolded Deep Graph Learning for Networked Over-the-Air Computation
Abstract:
Over-the-air computation (AirComp) has emerged as a promising technology that enables simultaneous transmission and computation through wireless channels. In this paper, we investigate networked AirComp in multiple clusters allowing diversified data computation, which remains challenged by transceiver coordination and interference management. In particular, we aim to maximize the multi-cluster weighted-sum AirComp rate, where the transmission scalar and the receive beamforming are jointly optimized while addressing the interference issue. From an optimization perspective, we decompose the formulated problem and adopt the alternating optimization technique with an iterative process to approximate the solution. Then, we reinterpret the iterations through the principle of algorithm unfolding, where the channel condition and mutual interference in the AirComp network constitute an underlying graph. Accordingly, the proposed unfolding architecture learns the weights parameterized by graph neural networks, which are trained through stochastic gradient descent. Simulation results show that our proposals outperform the conventional schemes, and that the proposed unfolded graph learning substantially alleviates the interference and achieves superior computation performance, with strong and efficient adaptation to dynamic and scalable networks.
中文: 本文提出了一种基于展开图学习的多集群空中计算方法,通过联合优化传输标量和接收波束成形来减轻干扰,仿真结果表明该方法在动态网络中具有优越的计算性能和强大的适应性。
English: This paper proposes an unfolded graph learning approach for multi-cluster over-the-air computation that jointly optimizes transmission scalars and receive beamforming to mitigate interference, demonstrating superior performance and adaptability in dynamic networks through simulations.

Authors:Letian Wang, Marc-Antoine Lavoie, Sandro Papais, Barza Nisar, Yuxiao Chen, Wenhao Ding, Boris Ivanovic, Hao Shao, Abulikemu Abuduweili, Evan Cook, Yang Zhou, Peter Karkus, Jiachen Li, Changliu Liu, Marco Pavone, Steven Waslander
Title: Trends in Motion Prediction Toward Deployable and Generalizable Autonomy: A Revisit and Perspectives
Abstract:
Motion prediction, the anticipation of future agent states or scene evolution, is rooted in human cognition, bridging perception and decision-making. It enables intelligent systems, such as robots and self-driving cars, to act safely in dynamic, human-involved environments, and informs broader time-series reasoning challenges. With advances in methods, representations, and datasets, the field has seen rapid progress, reflected in quickly evolving benchmark results. Yet, when state-of-the-art methods are deployed in the real world, they often struggle to generalize to open-world conditions and fall short of deployment standards. This reveals a gap between research benchmarks, which are often idealized or ill-posed, and real-world complexity. To address this gap, this survey revisits the generalization and deployability of motion prediction models, with an emphasis on the applications of robotics, autonomous driving, and human motion. We first offer a comprehensive taxonomy of motion prediction methods, covering representations, modeling strategies, application domains, and evaluation protocols. We then study two key challenges: (1) how to push motion prediction models to meet realistic deployment standards, where motion prediction does not act in a vacuum but functions as one module of a closed-loop autonomy stack, taking input from localization and perception and informing downstream planning and control; and (2) how to generalize motion prediction models from limited seen scenarios and datasets to open-world settings. Throughout the paper, we highlight critical open challenges to guide future work, aiming to recalibrate the community's efforts and foster progress that is not only measurable but also meaningful for real-world applications. The project webpage corresponding to this paper can be found at https://trends-in-motion-prediction-2025.github.io/.
中文: 本综述探讨运动预测模型的泛化性与部署能力,旨在弥合理想化基准与现实世界复杂性之间的差距,重点关注机器人学和自动驾驶等应用领域。
English: This survey examines the generalization and deployability of motion prediction models, addressing the gap between idealized benchmarks and real-world complexity in applications like robotics and autonomous driving.

Authors:Shixi Qin, Zhiyong Yang, Shilong Bao, Shi Wang, Qianqian Xu, Qingming Huang
Title: MixBridge: Heterogeneous Image-to-Image Backdoor Attack through Mixture of Schrödinger Bridges
Abstract:
This paper focuses on implanting multiple heterogeneous backdoor triggers in bridge-based diffusion models designed for complex and arbitrary input distributions. Existing backdoor formulations mainly address single-attack scenarios and are limited to Gaussian noise input models. To fill this gap, we propose MixBridge, a novel diffusion Schrödinger bridge (DSB) framework to cater to arbitrary input distributions (taking I2I tasks as special cases). Beyond this trait, we demonstrate that backdoor triggers can be injected into MixBridge by directly training with poisoned image pairs. This eliminates the need for the cumbersome modifications to stochastic differential equations required in previous studies, providing a flexible tool to study backdoor behavior for bridge models. However, a key question arises: can a single DSB model train multiple backdoor triggers? Unfortunately, our theory shows that when attempting this, the model ends up following the geometric mean of benign and backdoored distributions, leading to performance conflict across backdoor tasks. To overcome this, we propose a Divide-and-Merge strategy to mix different bridges, where models are independently pre-trained for each specific objective (Divide) and then integrated into a unified model (Merge). In addition, a Weight Reallocation Scheme (WRS) is also designed to enhance the stealthiness of MixBridge. Empirical studies across diverse generation tasks speak to the efficacy of MixBridge.
中文摘要:本文提出MixBridge这一扩散薛定谔桥框架,可在处理任意输入分布的模型中植入多种异构后门触发器,通过分治合并策略解决不同后门任务间的性能冲突,并设计权重重分配方案增强隐蔽性,突破了现有单攻击方法的局限性。
English Summary: This paper introduces MixBridge, a diffusion Schrödinger bridge framework that enables the implantation of multiple heterogeneous backdoor triggers in models handling arbitrary input distributions, overcoming limitations of existing single-attack methods while employing a Divide-and-Merge strategy to resolve performance conflicts between different backdoor tasks.

Authors:Baixuan Xu, Chunyang Li, Weiqi Wang, Wei Fan, Tianshi Zheng, Haochen Shi, Tao Fan, Yangqiu Song, Qiang Yang
Title: Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study
Abstract:
Designing an effective collaboration structure for multi-agent LLM systems to enhance collective reasoning is crucial yet remains under-explored. In this paper, we systematically investigate how collaborative reasoning performance is affected by three key design dimensions: (1) Expertise-Domain Alignment, (2) Collaboration Paradigm (structured workflow vs. diversity-driven integration), and (3) System Scale. Our findings reveal that expertise alignment benefits are highly domain-contingent, proving most effective for contextual reasoning tasks. Furthermore, collaboration focused on integrating diverse knowledge consistently outperforms rigid task decomposition. Finally, we empirically explore the impact of scaling the multi-agent system with expertise specialization and study the computational trade-off, highlighting the need for more efficient communication protocol design. This work provides concrete guidelines for configuring specialized multi-agent systems and identifies critical architectural trade-offs and bottlenecks for scalable multi-agent reasoning. The code will be made available upon acceptance.
中文: 本研究系统探讨了专业领域对齐、协作范式与系统规模对多智能体推理的影响,发现领域专业化与知识多样性整合优于刚性任务分解,同时揭示了扩展性瓶颈与通信协议优化的必要性。
English: This study systematically examines how expertise alignment, collaboration paradigms, and system scale affect multi-agent LLM reasoning, finding that domain-specific expertise and diverse knowledge integration outperform rigid workflows while highlighting scalability challenges.

Authors:Kun Peng, Chaodong Tong, Cong Cao, Hao Peng, Qian Li, Guanlin Wu, Lei Jiang, Yanbing Liu, Philip S. Yu
Title: T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction
Abstract:
Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to addressing this task, which encodes a sentence into a 2-dimensional table, allowing for the tagging of relations between any two words. Previous efforts have focused on designing various downstream relation learning modules to better capture interactions between tokens in the table, revealing that a stronger capability to capture relations can lead to greater improvements in the model. Motivated by this, we attempt to directly utilize transformer layers as downstream relation learning modules. Due to the powerful semantic modeling capability of transformers, it is foreseeable that this will lead to excellent improvement. However, owing to the quadratic relation between the length of the table and the length of the input sentence sequence, using transformers directly faces two challenges: overly long table sequences and unfair local attention interaction. To address these challenges, we propose a novel Table-Transformer (T-T) for the tagging-based ASTE method. Specifically, we introduce a stripe attention mechanism with a loop-shift strategy to tackle these challenges. The former modifies the global attention mechanism to only attend to a 2-dimensional local attention window, while the latter facilitates interaction between different attention windows. Extensive and comprehensive experiments demonstrate that the T-T, as a downstream relation learning module, achieves state-of-the-art performance with lower computational costs.
中文摘要:本研究提出了一种表格Transformer(T-T)模型,采用创新的条纹注意力机制与循环移位策略,有效解决了基于标注的方面情感三元组提取任务,以更低计算成本实现了最优性能。
English Summary: This study introduces a Table-Transformer (T-T) model that employs a novel stripe attention mechanism with loop-shift strategy to efficiently handle aspect sentiment triplet extraction, achieving state-of-the-art performance with reduced computational costs.

Authors:Hao Peng, Xiang Huang, Shuo Sun, Ruitong Zhang, Philip S. Yu
Title: Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning
Abstract:
DBSCAN, a well-known density-based clustering algorithm, has gained widespread popularity and usage due to its effectiveness in identifying clusters of arbitrary shapes and handling noisy data. However, it encounters challenges in producing satisfactory cluster results when confronted with datasets of varying density scales, a common scenario in real-world applications. In this paper, we propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN. First, we model the initial dataset as a two-level encoding tree and categorize the data vertices into distinct density partitions according to the information uncertainty determined in the encoding tree. Each partition is then assigned to an agent to find the best clustering parameters without manual assistance. The allocation is density-adaptive, enabling AR-DBSCAN to effectively handle diverse density distributions within the dataset by utilizing distinct agents for different partitions. Second, a multi-agent deep reinforcement learning guided automatic parameter searching process is designed. The process of adjusting the parameter search direction by perceiving the clustering environment is modeled as a Markov decision process. Using a weakly-supervised reward training policy network, each agent adaptively learns the optimal clustering parameters by interacting with the clusters. Third, a recursive search mechanism adaptable to the data's scale is presented, enabling efficient and controlled exploration of large parameter spaces. Extensive experiments are conducted on nine artificial datasets and a real-world dataset. The results of offline and online tasks show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.
中文: 提出的AR-DBSCAN框架通过多智能体强化学习自动确定不同密度分区的最佳聚类参数,相比传统DBSCAN显著提升了聚类准确性和鲁棒性。
English: The proposed AR-DBSCAN framework uses multi-agent reinforcement learning to automatically determine optimal clustering parameters across varying density partitions, significantly improving accuracy and robustness over traditional DBSCAN.

Authors:Feibo Jiang, Cunhua Pan, Li Dong, Kezhi Wang, Merouane Debbah, Dusit Niyato, Zhu Han
Title: A Comprehensive Survey of Large AI Models for Future Communications: Foundations, Applications and Challenges
Abstract:
The 6G wireless communications aim to establish an intelligent world of ubiquitous connectivity, providing an unprecedented communication experience. Large artificial intelligence models (LAMs) are characterized by significantly larger scales (e.g., billions or trillions of parameters) compared to typical artificial intelligence (AI) models. LAMs exhibit outstanding cognitive abilities, including strong generalization capabilities for fine-tuning to downstream tasks, and emergent capabilities to handle tasks unseen during training. Therefore, LAMs efficiently provide AI services for diverse communication applications, making them crucial tools for addressing complex challenges in future wireless communication systems. This study provides a comprehensive review of the foundations, applications, and challenges of LAMs in communication. First, we introduce the current state of AI-based communication systems, emphasizing the motivation behind integrating LAMs into communications and summarizing the key contributions. We then present an overview of the essential concepts of LAMs in communication. This includes an introduction to the main architectures of LAMs, such as transformer, diffusion models, and mamba. We also explore the classification of LAMs, including large language models (LLMs), large vision models (LVMs), large multimodal models (LMMs), and world models, and examine their potential applications in communication. Additionally, we cover the training methods and evaluation techniques for LAMs in communication systems. Lastly, we introduce optimization strategies such as chain of thought (CoT), retrieval augmented generation (RAG), and agentic systems. Following this, we discuss the research advancements of LAMs across various communication scenarios. Finally, we analyze the challenges in the current research and provide insights into potential future research directions.
中文: 6G致力于构建一个智能互联世界,其中拥有数十亿参数的大型人工智能模型凭借其强大的泛化能力和涌现特性,为各类通信应用提供关键的AI服务。
English: 6G aims to create an intelligent world with ubiquitous connectivity, where large AI models (LAMs) with billions of parameters provide crucial AI services for diverse communication applications by leveraging their strong generalization and emergent capabilities.

Authors:Haoming Yang, Ke Ma, Xiaojun Jia, Yingfei Sun, Qianqian Xu, Qingming Huang
Title: Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs
Abstract:
Despite the remarkable performance of Large Language Models (LLMs), they remain vulnerable to jailbreak attacks, which can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, failing to uncover potential risks in real-world scenarios. To address this, we propose a novel jailbreak attack framework, ICRT, inspired by heuristics and biases in human cognition. Leveraging the simplicity effect, we employ cognitive decomposition to reduce the complexity of malicious prompts. Simultaneously, relevance bias is utilized to reorganize prompts, enhancing semantic alignment and inducing harmful outputs effectively. Furthermore, we introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm by employing ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to comprehensively quantify the harmfulness of generated content. Experimental results show that our approach consistently bypasses mainstream LLMs' safety mechanisms and generates high-risk content, providing insights into jailbreak attack risks and contributing to stronger defense strategies.
中文摘要:ICRT框架借鉴人类认知启发与偏见,通过认知分解降低恶意提示复杂性,并利用关联偏见重组提示以增强语义对齐,同时采用基于排序的危害性评估指标,实验证明其能有效突破主流大语言模型的安全防护并生成高风险内容。
English Summary: The ICRT framework leverages cognitive biases like the simplicity effect and relevance bias to effectively jailbreak LLMs by simplifying malicious prompts and enhancing semantic alignment, while introducing a ranking-based metric to better evaluate harmfulness.
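
As background on the ranking aggregation the abstract mentions, the sketch below shows how standard Elo updates turn a list of pairwise judgments into a ranking. It is a generic illustration, not the authors' metric: the item labels, the comparison outcomes, the K-factor of 32, and the starting rating of 1000 are all assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_rank(items, comparisons, k=32.0, start=1000.0):
    """Aggregate (winner, loser) pairs into ratings and a sorted ranking."""
    ratings = {item: start for item in items}
    for winner, loser in comparisons:
        exp_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - exp_win)  # winner's actual score is 1
        ratings[winner] += delta
        ratings[loser] -= delta      # zero-sum update
    ranking = sorted(items, key=lambda item: ratings[item], reverse=True)
    return ranking, ratings


# Hypothetical items and pairwise judgments (winner listed first).
items = ["A", "B", "C"]
comparisons = [("A", "B"), ("A", "C"), ("B", "C")]

ranking, ratings = elo_rank(items, comparisons)
print(ranking, ratings)
```

Unlike a binary success-or-failure label, the resulting ratings place every item on a common graded scale, which is the motivation the abstract gives for a ranking-based metric.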

Authors:Sicong Li, Qianqian Xu, Zhiyong Yang, Zitai Wang, Linchao Zhang, Xiaochun Cao, Qingming Huang
Title: Focal-SAM: Focal Sharpness-Aware Minimization for Long-Tailed Classification
Abstract:
Real-world datasets often follow a long-tailed distribution, making generalization to tail classes difficult. Recent methods resorted to long-tail variants of Sharpness-Aware Minimization (SAM), such as ImbSAM and CC-SAM, to improve generalization by flattening the loss landscape. However, these attempts face a trade-off between computational efficiency and control over the loss landscape. On the one hand, ImbSAM is efficient but offers only coarse control as it excludes head classes from the SAM process. On the other hand, CC-SAM provides fine-grained control through class-dependent perturbations but at the cost of efficiency due to multiple backpropagations. Seeing this dilemma, we introduce Focal-SAM, which assigns different penalties to class-wise sharpness, achieving fine-grained control without extra backpropagations, thus maintaining efficiency. Furthermore, we theoretically analyze Focal-SAM's generalization ability and derive a sharper generalization bound. Extensive experiments on both traditional and foundation models validate the effectiveness of Focal-SAM.
中文摘要:Focal-SAM通过为不同类别分配锐度惩罚,在不增加反向传播的情况下实现了细粒度控制,有效解决了长尾学习中计算效率与控制精度之间的权衡问题。
English Summary: Focal-SAM is introduced to address the trade-off between computational efficiency and fine-grained control in long-tailed learning by assigning class-wise sharpness penalties without requiring additional backpropagations, achieving both effectiveness and efficiency.

Authors:Gaozheng Pei, Ke Ma, Yingfei Sun, Qianqian Xu, Qingming Huang
Title: Diffusion-based Adversarial Purification from the Perspective of the Frequency Domain
Abstract:
The diffusion-based adversarial purification methods attempt to drown adversarial perturbations into a part of isotropic noise through the forward process, and then recover the clean images through the reverse process. Due to the lack of distribution information about adversarial perturbations in the pixel domain, it is often unavoidable to damage normal semantics. We turn to the frequency domain perspective, decomposing the image into amplitude spectrum and phase spectrum. We find that for both spectra, the damage caused by adversarial perturbations tends to increase monotonically with frequency. This means that we can extract the content and structural information of the original clean sample from the frequency components that are less damaged. Meanwhile, theoretical analysis indicates that existing purification methods indiscriminately damage all frequency components, leading to excessive damage to the image. Therefore, we propose a purification method that can eliminate adversarial perturbations while maximizing the preservation of the content and structure of the original image. Specifically, at each time step during the reverse process, for the amplitude spectrum, we replace the low-frequency components of the estimated image's amplitude spectrum with the corresponding parts of the adversarial image. For the phase spectrum, we project the phase of the estimated image into a designated range of the adversarial image's phase spectrum, focusing on the low frequencies. Empirical evidence from extensive experiments demonstrates that our method significantly outperforms most current defense methods.
中文摘要: 该方法基于频域视角,通过替换低频振幅分量和约束相位投影范围,在消除对抗性扰动的同时最大程度保留原始图像内容与结构,实验证明其防御效果显著优于现有方法。
English Summary: The proposed adversarial purification method leverages frequency domain analysis to selectively replace low-frequency amplitude components and constrain phase projections, effectively removing adversarial perturbations while preserving image content and structure better than existing techniques.
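
The amplitude-replacement step can be sketched on a one-dimensional signal. This is a simplified illustration under stated assumptions, not the authors' method: the paper works on two-dimensional images at every reverse diffusion step and also projects the phase, whereas this sketch performs a single replacement on a short 1-D signal using a naive DFT, with a hypothetical frequency cutoff and toy data.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def replace_low_freq_amplitude(estimate, reference, cutoff):
    """Keep the estimate's phase; take low-frequency amplitudes from reference."""
    est_spec, ref_spec = dft(estimate), dft(reference)
    n = len(est_spec)
    mixed = []
    for k in range(n):
        freq = min(k, n - k)  # distance from DC, respecting conjugate symmetry
        amp = abs(ref_spec[k]) if freq <= cutoff else abs(est_spec[k])
        mixed.append(cmath.rect(amp, cmath.phase(est_spec[k])))
    return idft(mixed)


# Toy stand-ins for the estimated signal and the adversarial input.
estimate = [1.0, 2.0, 3.0, 4.0]
reference = [2.0, 2.0, 4.0, 4.0]

purified = replace_low_freq_amplitude(estimate, reference, cutoff=0)
unchanged = replace_low_freq_amplitude(estimate, reference, cutoff=-1)
print(purified)
```

With `cutoff=0` only the DC amplitude comes from the reference, so the output keeps the estimate's shape while matching the reference's overall level. A negative cutoff replaces nothing and returns the estimate unchanged.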

Authors:Md Asaduzzaman Jabin, Hanqi Jiang, Yiwei Li, Patrick Kaggwa, Eugene Douglass, Juliet N. Sekandi, Tianming Liu
Title: AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care
Abstract:
Chronic diseases, including diabetes, hypertension, asthma, HIV-AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized Video-LLaVA-based multimodal large vision language model (LVLM) aimed at visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, which have been labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient's face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT) enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.
中文: 我们提出了AdCare-VLM多模态模型,通过临床视频数据微调,利用视觉问答提升药物依从性监测,其准确率显著优于现有方法。
English: We introduce AdCare-VLM, a multimodal model fine-tuned on clinical video data to enhance medication adherence monitoring through visual question answering, outperforming existing methods with significant accuracy improvements.

Authors:Yu Zhang, Yunqi Li, Yifan Yang, Rui Wang, Yuqing Yang, Dai Qi, Jianmin Bao, Dongdong Chen, Chong Luo, Lili Qiu
Title: ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL
Abstract:
Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization (GRPO). To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.
Chinese: ReasonGen-R1 将思维链推理与强化学习结合到生成式视觉模型中,通过基于新构建推理数据集的监督微调和群组相对策略优化来提升图像生成质量,在多个基准测试中均优于现有先进模型。
English: ReasonGen-R1 integrates chain-of-thought reasoning and reinforcement learning into a generative vision model, using supervised fine-tuning on a reasoning dataset and Group Relative Policy Optimization to enhance image generation, achieving state-of-the-art performance on multiple benchmarks.
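
For readers unfamiliar with Group Relative Policy Optimization, the core idea of scoring each sample relative to its own group can be sketched as follows. This shows only the commonly described advantage normalization, not the authors' training loop. The reward values are hypothetical, and the use of the population standard deviation with a small epsilon is an assumption, since implementations differ on that detail.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are judged against their siblings rather than a learned value baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumed: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical quality scores for four images generated from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples scored above the group mean receive positive advantages and are reinforced, and those below it are discouraged. A group with identical rewards yields all-zero advantages and therefore no update signal.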

Authors:Yuanfu Wang, Pengyu Wang, Chenyang Xi, Bo Tang, Junyi Zhu, Wenqiang Wei, Chen Chen, Chao Yang, Jingfeng Zhang, Chaochao Lu, Yijun Niu, Keming Mao, Zhiyu Li, Feiyu Xiong, Jie Hu, Mingchuan Yang
Title: Adversarial Preference Learning for Robust LLM Alignment
Abstract:
Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model's intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33% harmlessness win rate over the base model (evaluated by GPT-4o), reducing harmful outputs from 5.88% to 0.43% (measured by LLaMA-Guard), and lowering attack success rate by up to 65% according to HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52% against the base model.
中文: 现代语言模型依赖人类反馈强化学习确保安全,但仍易受对抗攻击,而提出的对抗偏好学习方法通过引入自动化危害度量、条件生成攻击器和迭代反馈框架,显著提升了鲁棒性,大幅减少有害输出并保持模型实用性。
English: Modern language models using RLHF for safety remain vulnerable to adversarial attacks due to costly human annotation, diverse threats, and feedback bias, but the proposed Adversarial Preference Learning (APL) method significantly enhances robustness by incorporating automated harmfulness metrics, generative attackers, and iterative feedback, reducing harmful outputs and maintaining competitive performance.

Authors:Hidetaka Kamigaito, Ying Zhang, Jingun Kwon, Katsuhiko Hayashi, Manabu Okumura, Taro Watanabe
Title: Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws
Abstract:
Transformers deliver outstanding performance across a wide range of tasks and are now a dominant backbone architecture for large language models (LLMs). Their task-solving performance is improved by increasing parameter size, as shown in recent studies on parameter scaling laws. Although recent mechanistic-interpretability studies have deepened our understanding of the internal behavior of Transformers by analyzing their residual stream, the relationship between these internal mechanisms and the parameter scaling laws remains unclear. To bridge this gap, we focus on layers and their size, which mainly determine the parameter size of Transformers. For this purpose, we first theoretically investigate the layers within the residual stream through a bias-diversity decomposition. The decomposition separates (i) bias, the error of each layer's output from the ground truth, and (ii) diversity, which indicates how much the outputs of each layer differ from each other. Analyzing Transformers under this theory reveals that performance improves when individual layers make predictions close to the correct answer and remain mutually diverse. We show that diversity becomes especially critical when individual layers' outputs are far from the ground truth. Finally, we introduce an information-theoretic diversity and show our main finding: adding layers enhances performance only when those layers behave differently, i.e., are diverse. We also reveal that the performance gains from increasing the number of layers exhibit submodularity: marginal improvements diminish as more layers are added, mirroring the logarithmic convergence predicted by the parameter scaling laws. Experiments on multiple semantic-understanding tasks with various LLMs empirically confirm the theoretical properties derived in this study.
中文摘要:通过偏差-多样性分解研究发现,Transformer 性能提升依赖于新增层的多样性行为,当各层输出相互差异且接近正确答案时性能最优,且层数增加带来的边际收益递减符合参数缩放规律。
English Summary: Transformer performance improves with increased layers only when these layers exhibit diverse behaviors, as shown through a bias-diversity decomposition that reveals diminishing returns align with parameter scaling laws.
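
To make the bias and diversity terms concrete, here is a minimal sketch of the classical squared-error decomposition for an averaged ensemble: the error of the mean prediction equals the mean individual error minus the mean spread around the ensemble mean. It is offered as an analogy, assuming scalar outputs and simple averaging; the paper's own decomposition over the residual stream and its information-theoretic diversity may differ. The per-layer values are hypothetical.

```python
def bias_diversity(outputs, target):
    """Split the squared error of the averaged output into mean individual error and diversity."""
    n = len(outputs)
    mean_out = sum(outputs) / n
    ensemble_error = (mean_out - target) ** 2
    avg_bias = sum((o - target) ** 2 for o in outputs) / n      # mean error of individual outputs
    diversity = sum((o - mean_out) ** 2 for o in outputs) / n   # spread around the ensemble mean
    return ensemble_error, avg_bias, diversity

layer_outputs = [0.8, 1.3, 0.6, 1.1]  # hypothetical scalar predictions, one per layer
err, bias, div = bias_diversity(layer_outputs, target=1.0)
# Identity: err == bias - div (up to floating-point rounding)
```

Under this identity, adding a member helps the averaged prediction only if it lowers the mean individual error or raises the diversity term, which matches the abstract's qualitative claim.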

Authors:Lang Cao, Jingxian Xu, Hanbing Liu, Jinyu Wang, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang
Title: Fortune: Formula-Driven Reinforcement Learning for Symbolic Table Reasoning in Language Models
Abstract:
Tables are a fundamental structure for organizing and analyzing data, making effective table understanding a critical capability for intelligent systems. While large language models (LMs) demonstrate strong general reasoning abilities, they continue to struggle with accurate numerical or symbolic reasoning over tabular data, especially in complex scenarios. Spreadsheet formulas provide a powerful and expressive medium for representing executable symbolic operations, encoding rich reasoning patterns that remain largely underutilized. In this paper, we propose Formula Tuning (Fortune), a reinforcement learning (RL) framework that trains LMs to generate executable spreadsheet formulas for question answering over general tabular data. Formula Tuning reduces the reliance on supervised formula annotations by using binary answer correctness as a reward signal, guiding the model to learn formula derivation through reasoning. We provide a theoretical analysis of its advantages and demonstrate its effectiveness through extensive experiments on seven table reasoning benchmarks. Formula Tuning substantially enhances LM performance, particularly on multi-step numerical and symbolic reasoning tasks, enabling a 7B model to outperform OpenAI o1 on table understanding. This highlights the potential of formula-driven RL to advance symbolic table reasoning in LMs.
Chinese: 公式调优是一种强化学习框架,通过训练语言模型生成电子表格公式进行表格推理,无需大量监督数据即可显著提升模型在复杂数值与符号任务上的性能表现。
English: Formula Tuning is a reinforcement learning framework that trains language models to generate spreadsheet formulas for table reasoning, significantly enhancing their performance on complex numerical and symbolic tasks without extensive supervised data.
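
The reward signal described in the abstract, binary answer correctness of an executed formula, can be sketched as follows. The executor here is a toy stand-in that understands only four aggregate functions over a named column; the function names, the table layout, and the tolerance are all assumptions for illustration, not the Fortune implementation.

```python
import re

def execute_formula(formula, table):
    """Toy executor (assumption): supports only =SUM(col), =MAX(col), =MIN(col), =AVERAGE(col)."""
    match = re.fullmatch(r"=(SUM|MAX|MIN|AVERAGE)\((\w+)\)", formula.strip())
    if match is None:
        raise ValueError("unsupported formula")
    func, column = match.groups()
    values = table[column]
    if func == "SUM":
        return sum(values)
    if func == "MAX":
        return max(values)
    if func == "MIN":
        return min(values)
    return sum(values) / len(values)

def binary_reward(formula, table, gold_answer, tol=1e-6):
    """Return 1.0 if the executed formula matches the gold answer, else 0.0."""
    try:
        result = execute_formula(formula, table)
    except (ValueError, KeyError, ZeroDivisionError):
        return 0.0  # malformed or non-executable formulas earn no reward
    return 1.0 if abs(result - gold_answer) <= tol else 0.0

table = {"sales": [120, 80, 100]}  # hypothetical single-column table
reward = binary_reward("=SUM(sales)", table, gold_answer=300)
```

Because the reward depends only on the final answer, no annotated reference formula is needed for training, which is the property the abstract highlights.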

Authors:Peizheng Guo, Jingyao Wang, Huijie Guo, Jiangmeng Li, Chuxiong Sun, Changwen Zheng, Wenwen Qiang
Title: Multi-Modal Learning with Bayesian-Oriented Gradient Calibration
Abstract:
Multi-Modal Learning (MML) integrates information from diverse modalities to improve predictive accuracy. However, existing methods mainly aggregate gradients with fixed weights and treat all dimensions equally, overlooking the intrinsic gradient uncertainty of each modality. This may lead to (i) excessive updates in sensitive dimensions, degrading performance, and (ii) insufficient updates in less sensitive dimensions, hindering learning. To address this issue, we propose BOGC-MML, a Bayesian-Oriented Gradient Calibration method for MML to explicitly model the gradient uncertainty and guide the model optimization towards the optimal direction. Specifically, we first model each modality's gradient as a random variable and derive its probability distribution, capturing the full uncertainty in the gradient space. Then, we propose an effective method that converts the precision (inverse variance) of each gradient distribution into a scalar evidence. This evidence quantifies the confidence of each modality in every gradient dimension. Using these evidences, we explicitly quantify per-dimension uncertainties and fuse them via a reduced Dempster-Shafer rule. The resulting uncertainty-weighted aggregation produces a calibrated update direction that balances sensitivity and conservatism across dimensions. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness and advantages of the proposed method.
中文摘要:提出的BOGC-MML方法通过贝叶斯概率分布建模梯度不确定性,并基于证据校准实现维度特异性优化更新,有效解决了多模态学习中梯度权重固定的问题,在多个基准数据集上验证了其优越性能。
English Summary: The proposed BOGC-MML method addresses limitations in Multi-Modal Learning by modeling gradient uncertainty through Bayesian probability distributions and evidence-based calibration, resulting in optimized dimension-specific updates that enhance predictive performance across benchmark datasets.
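
As a simplified illustration of uncertainty-weighted aggregation, the sketch below fuses per-modality gradients dimension by dimension using inverse-variance (precision) weights. This deliberately swaps the paper's evidence conversion and reduced Dempster-Shafer rule for plain precision weighting, and the gradient statistics are hypothetical.

```python
def precision_weighted_update(grad_means, grad_vars, eps=1e-12):
    """Fuse per-modality gradients dimension by dimension using inverse-variance weights."""
    n_dims = len(grad_means[0])
    fused = []
    for d in range(n_dims):
        # Higher precision (lower variance) means more confidence in that modality here.
        precisions = [1.0 / (var[d] + eps) for var in grad_vars]
        total = sum(precisions)
        fused.append(sum(p * mean[d] for p, mean in zip(precisions, grad_means)) / total)
    return fused

# Hypothetical gradient statistics for two modalities over three dimensions.
means = [[1.0, -2.0, 0.5], [3.0, 0.0, 0.5]]
variances = [[0.1, 1.0, 0.2], [0.9, 1.0, 0.2]]
fused = precision_weighted_update(means, variances)
```

In the first dimension the low-variance modality dominates the fused value, while equal variances in the second dimension reduce to a plain average.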

Authors:Liangkai Hang, Junjie Yao, Zhiwei Bai, Tianyi Chen, Yang Chen, Rongjie Diao, Hezhou Li, Pengxiao Lin, Zhiwei Wang, Cheng Xu, Zhongwang Zhang, Zhangchen Zhou, Zhiyu Li, Zehao Lin, Kai Chen, Feiyu Xiong, Yaoyu Zhang, Weinan E, Hongkang Yang, Zhi-Qin John Xu
Title: Scalable Complexity Control Facilitates Reasoning Ability of LLMs
Abstract:
The reasoning ability of large language models (LLMs) has been rapidly advancing in recent years, attracting interest in more fundamental approaches that can reliably enhance their generalizability. This work demonstrates that model complexity control, conveniently implementable by adjusting the initialization rate and weight decay coefficient, improves the scaling law of LLMs consistently over varying model sizes and data sizes. This gain is further illustrated by comparing the benchmark performance of 2.4B models pretrained on 1T tokens with different complexity hyperparameters. Instead of fixing the initialization std, we found that a constant initialization rate (the exponent of std) enables the scaling law to descend faster in both model and data sizes. These results indicate that complexity control is a promising direction for the continual advancement of LLMs.
中文摘要:通过调整初始化速率和权重衰减系数实现的模型复杂度控制,能够持续改进大语言模型在不同规模和数据量下的扩展规律,为提升其推理能力提供了可行方向。
English Summary: Model complexity control, achieved by adjusting initialization rate and weight decay, consistently improves LLM scaling laws across different model and data sizes, offering a promising direction for advancing reasoning capabilities.
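
The abstract describes the initialization rate as the exponent of the standard deviation. One plausible reading, assumed here and not confirmed by the abstract, is that weights are drawn with std = fan_in ** (-gamma), so holding gamma fixed makes the std shrink as layers get wider. The function names are hypothetical.

```python
import random

def init_std(fan_in, gamma):
    """Std implied by an initialization rate gamma, assuming std = fan_in ** (-gamma)."""
    return fan_in ** (-gamma)

def init_weights(fan_in, fan_out, gamma, seed=0):
    """Sample a fan_out x fan_in Gaussian weight matrix with the std above."""
    rng = random.Random(seed)
    std = init_std(fan_in, gamma)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

# gamma = 0.5 gives the familiar 1/sqrt(fan_in) scaling; a larger gamma shrinks the weights.
for width in (256, 1024, 4096):
    print(width, init_std(width, 0.5), init_std(width, 1.0))
```

Under this reading, a fixed std ignores width entirely, whereas a fixed gamma ties the initialization scale to model size.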

Authors:Qingchen Yu, Zifan Zheng, Ding Chen, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
Title: GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning
Abstract:
The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the Guess Who I Am? game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains (finance, healthcare, manufacturing, information technology, and education) demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.
中文:GuessArena提出了一种基于对抗性游戏的动态评估框架,通过整合领域知识建模与渐进式推理测试,有效解决了传统静态基准在适应性和细粒度评估上的不足,显著提升了可解释性与场景适应性。
English: GuessArena introduces an adaptive, game-based evaluation framework that overcomes the limitations of static benchmarks by dynamically assessing domain knowledge and reasoning across multiple fields, offering superior interpretability and scalability.

Authors:Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, Junpeng Ren, Zehao Lin, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhiqiang Yin, Qingchen Yu, Bo Tang, Hongkang Yang, Zhi-Qin John Xu, Feiyu Xiong
Title: MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models
Abstract:
Large Language Models (LLMs) have emerged as foundational infrastructure in the pursuit of Artificial General Intelligence (AGI). Despite their remarkable capabilities in language perception and generation, current LLMs fundamentally lack a unified and structured architecture for handling memory. They primarily rely on parametric memory (knowledge encoded in model weights) and ephemeral activation memory (context-limited runtime states). While emerging methods like Retrieval-Augmented Generation (RAG) incorporate plaintext memory, they lack lifecycle management and multi-modal integration, limiting their capacity for long-term knowledge evolution. To address this, we introduce MemOS, a memory operating system designed for LLMs that, for the first time, elevates memory to a first-class operational resource. It builds unified mechanisms for representation, organization, and governance across three core memory types: parametric, activation, and plaintext. At its core is the MemCube, a standardized memory abstraction that enables tracking, fusion, and migration of heterogeneous memory, while offering structured, traceable access across tasks and contexts. MemOS establishes a memory-centric execution framework with strong controllability, adaptability, and evolvability. It fills a critical gap in current LLM infrastructure and lays the groundwork for continual adaptation, personalized intelligence, and cross-platform coordination in next-generation intelligent systems.
Chinese: 当前大型语言模型缺乏统一的结构化内存架构,限制了长期知识演进能力,而MemOS通过引入统一的内存操作系统,将内存提升为核心资源,增强了智能系统的可控性和适应性。
English: Large Language Models currently lack a structured memory architecture, leading to limitations in long-term knowledge evolution, which MemOS addresses by introducing a unified memory operating system that elevates memory to a first-class resource for enhanced controllability and adaptability in intelligent systems.

Authors:Wenwen Qiang, Ziyin Gu, Lingyu Si, Jiangmeng Li, Changwen Zheng, Fuchun Sun, Hui Xiong
Title: On the Transferability and Discriminability of Representation Learning in Unsupervised Domain Adaptation
Abstract:
In this paper, we addressed the limitation of relying solely on distribution alignment and source-domain empirical risk minimization in Unsupervised Domain Adaptation (UDA). Our information-theoretic analysis showed that this standard adversarial-based framework neglects the discriminability of target-domain features, leading to suboptimal performance. To bridge this theoretical-practical gap, we defined "good representation learning" as guaranteeing both transferability and discriminability, and proved that an additional loss term targeting target-domain discriminability is necessary. Building on these insights, we proposed a novel adversarial-based UDA framework that explicitly integrates a domain alignment objective with a discriminability-enhancing constraint. Instantiated as Domain-Invariant Representation Learning with Global and Local Consistency (RLGLC), our method leverages Asymmetrically-Relaxed Wasserstein of Wasserstein Distance (AR-WWD) to address class imbalance and semantic dimension weighting, and employs a local consistency mechanism to preserve fine-grained target-domain discriminative information. Extensive experiments across multiple benchmark datasets demonstrate that RLGLC consistently surpasses state-of-the-art methods, confirming the value of our theoretical perspective and underscoring the necessity of enforcing both transferability and discriminability in adversarial-based UDA.
中文: 本文提出了一种新颖的对抗性无监督领域自适应框架,通过将领域对齐与可判别性增强相结合,解决了传统方法忽视目标领域特征可判别性的局限,实验证明其性能优于现有最优方法。
English: This paper proposes a novel adversarial unsupervised domain adaptation framework that integrates domain alignment with discriminability enhancement, demonstrating superior performance by addressing the limitations of traditional methods that overlook target-domain feature discriminability.

Authors:Tonghe Zhang, Chao Yu, Sichang Su, Yu Wang
Title: ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning
Abstract:
We propose ReinFlow, a simple yet effective online reinforcement learning (RL) framework that fine-tunes a family of flow matching policies for continuous robotic control. Derived from rigorous RL theory, ReinFlow injects learnable noise into a flow policy's deterministic path, converting the flow into a discrete-time Markov process for exact and straightforward likelihood computation. This conversion facilitates exploration and ensures training stability, enabling ReinFlow to fine-tune diverse flow model variants, including Rectified Flow [35] and Shortcut Models [19], particularly at very few or even one denoising step. We benchmark ReinFlow in representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. After fine-tuning on challenging legged locomotion tasks, the episode reward of Rectified Flow policies achieved an average net growth of 135.36%, while using fewer denoising steps and 82.63% less wall time than the state-of-the-art diffusion RL fine-tuning method DPPO [43]. The success rate of Shortcut Model policies in state and visual manipulation tasks achieved an average net increase of 40.34% after fine-tuning with ReinFlow at four or even one denoising step, a performance comparable to fine-tuned DDIM policies while saving an average of 23.20% of computation time. Project webpage: https://reinflow.github.io/
中文: ReinFlow是一个在线强化学习框架,通过注入可学习噪声来微调连续机器人控制中的流匹配策略,实现稳定的训练和探索,在减少去噪步骤的同时相比现有方法获得了显著的性能提升。
English: ReinFlow is an online reinforcement learning framework that fine-tunes flow matching policies for continuous robotic control by injecting learnable noise to enable stable training and exploration, achieving significant performance improvements with fewer denoising steps compared to existing methods.
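
The idea of injecting noise into a deterministic flow so that the sampling chain has an exact likelihood can be sketched in one dimension. Each Euler step becomes a Gaussian transition, so the log-probability of the sampled chain is a sum of Gaussian log-densities. The velocity and noise functions below are hypothetical stand-ins for learned networks, and the sketch is not the authors' implementation.

```python
import math
import random

def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def sample_noisy_flow(x0, velocity, noise_std, num_steps, rng):
    """Euler-integrate a 1-D flow, adding Gaussian noise at every step.

    Returns the final sample and the exact log-probability of the sampled chain
    given x0, computed as a sum of per-step Gaussian transition log-densities.
    """
    dt = 1.0 / num_steps
    x, log_prob = x0, 0.0
    for k in range(num_steps):
        t = k * dt
        mean = x + velocity(x, t) * dt   # deterministic flow step
        std = noise_std(x, t)            # injected noise scale
        x_next = rng.gauss(mean, std)
        log_prob += gaussian_logpdf(x_next, mean, std)
        x = x_next
    return x, log_prob

# Hypothetical stand-ins for the learned velocity and noise networks.
velocity = lambda x, t: 1.0 - x
noise_std = lambda x, t: 0.1

action, log_prob = sample_noisy_flow(0.0, velocity, noise_std, num_steps=4, rng=random.Random(0))
```

With a tractable chain log-probability, standard policy-gradient objectives can be applied to the flow policy, even when `num_steps` is as small as one.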

Authors:Bo Tang, Junyi Zhu, Chenyang Xi, Yunhang Ge, Jiahao Wu, Yuchen Feng, Yijun Niu, Wenqiang Wei, Yu Yu, Chunyu Li, Zehao Lin, Hao Wu, Ning Liao, Yebin Yang, Jiajia Wang, Zhiyu Li, Feiyu Xiong, Jingrun Chen
Title: Xinyu AI Search: Enhanced Relevance and Comprehensive Results with Rich Answer Presentations
Abstract:
Traditional search engines struggle to synthesize fragmented information for complex queries, while generative AI search engines face challenges in relevance, comprehensiveness, and presentation. To address these limitations, we introduce Xinyu AI Search, a novel system that incorporates a query-decomposition graph to dynamically break down complex queries into sub-queries, enabling stepwise retrieval and generation. Our retrieval pipeline enhances diversity through multi-source aggregation and query expansion, while filtering and re-ranking strategies optimize passage relevance. Additionally, Xinyu AI Search introduces a novel approach for fine-grained, precise built-in citation and innovates in result presentation by integrating timeline visualization and textual-visual choreography. Evaluated on recent real-world queries, Xinyu AI Search outperforms eight existing technologies in human assessments, excelling in relevance, comprehensiveness, and insightfulness. Ablation studies validate the necessity of its key sub-modules. Our work presents the first comprehensive framework for generative AI search engines, bridging retrieval, generation, and user-centric presentation.
中文:Xinyu AI搜索通过查询分解图实现逐步检索与生成、增强的多源检索与重排序策略,以及创新的引用和呈现功能,克服了传统与生成式AI搜索引擎的局限,在相关性、全面性和洞察力方面表现卓越。
English: Xinyu AI Search overcomes limitations of traditional and generative AI search engines by employing a query-decomposition graph for stepwise retrieval and generation, enhanced multi-source retrieval with filtering and re-ranking, and innovative citation and presentation features, achieving superior performance in relevance, comprehensiveness, and insightfulness.

Authors:Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che
Title: Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
Abstract:
Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1 to 4 point improvement on LongBench under limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.
中文: 大语言模型依赖KV缓存加速解码,但其内存使用随序列增长而剧增,造成部署困难;我们提出的LAQ框架通过伪前瞻查询实现更精准的缓存淘汰,在受限内存下优于现有方法。
English: Large language models accelerate decoding with KV cache, but its memory use escalates with sequence length, leading to inefficiencies; our proposed LAQ framework uses pseudo lookahead queries for more accurate cache eviction, outperforming existing methods under constrained memory.
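
Importance-based eviction with an observation window can be sketched as follows: each cached position is scored by the mean attention it receives from a set of window queries, and only the top-scoring positions within the budget are kept. In LAQ the window would consist of pseudo lookahead queries; here the queries, keys, and function names are hypothetical toy values, and the scoring rule (mean softmax attention) is an assumption.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_kv_to_keep(queries, keys, budget):
    """Keep the `budget` cached positions with the highest mean attention from the window queries."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    importance = [0.0] * len(keys)
    for q in queries:
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        for pos, weight in enumerate(softmax(logits)):
            importance[pos] += weight / len(queries)
    ranked = sorted(range(len(keys)), key=lambda pos: importance[pos], reverse=True)
    return sorted(ranked[:budget])  # preserve original token order

keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
lookahead_queries = [[2.0, 0.0], [1.5, 0.5]]  # stand-ins for pseudo lookahead queries
kept = select_kv_to_keep(lookahead_queries, keys, budget=2)
```

The quality of the kept set depends on how closely the window queries resemble the real decoding-stage queries, which is the mismatch the abstract attributes to prefilling-stage scores.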

Authors:Junnan Liu, Hongwei Liu, Linchen Xiao, Shudong Liu, Taolin Zhang, Zihan Ma, Songyang Zhang, Kai Chen
Title: Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
Abstract:
We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM's parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.
中文摘要:本研究提出一个元学习框架,将大语言模型的推理过程视为伪梯度下降,通过将问题转化为元学习任务进行训练,使模型获得可推广的推理能力,并验证了元学习技术对提升模型性能的实用价值。
English Summary: This study introduces a meta-learning framework that interprets large language models' reasoning as pseudo-gradient descent, demonstrating how training on diverse questions enables generalization to new problems through established meta-learning principles.

Authors:Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, Yu Wang
Title: What Can RL Bring to VLA Generalization? An Empirical Study
Abstract:
Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io
中文: 与监督微调相比,强化学习微调(尤其是PPO算法)显著提升了视觉语言行动模型在语义理解和执行鲁棒性方面的泛化能力,同时保持了相当的视觉鲁棒性。
English: Reinforcement learning fine-tuning, especially with PPO, significantly improves the generalization of Vision-Language Action models in semantic understanding and execution robustness compared to supervised fine-tuning, while maintaining similar visual robustness.

Authors:Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, Wenlin Zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, Tong Xu
Title: Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents
Abstract:
Large Language Models (LLMs) have recently been widely adopted in conversational agents. However, the increasingly long interactions between users and agents accumulate extensive dialogue records, making it difficult for LLMs with limited context windows to maintain a coherent long-term dialogue memory and deliver personalized responses. While retrieval-augmented memory systems have emerged to address this issue, existing methods often depend on single-granularity memory segmentation and retrieval. This approach falls short in capturing deep memory connections, leading to partial retrieval of useful information or substantial noise, resulting in suboptimal performance. To tackle these limitations, we propose MemGAS, a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. MemGAS is based on multi-granularity memory units and employs Gaussian Mixture Models to cluster and associate new memories with historical ones. An entropy-based router adaptively selects optimal granularity by evaluating query relevance distributions and balancing information completeness and noise. Retrieved memories are further refined via LLM-based filtering. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answering and retrieval tasks, achieving superior performance across different query types and top-K settings.
中文:MemGAS框架通过构建多粒度记忆关联与自适应检索机制,有效解决了大语言模型在长对话记忆中的局限性,在多个基准测试中通过智能记忆整合和降噪实现了优于现有方法的性能表现。
English: The MemGAS framework addresses limitations in long-term dialogue memory for LLMs by implementing multi-granularity memory association and adaptive retrieval, outperforming existing methods across multiple benchmarks through intelligent memory consolidation and noise reduction.
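
An entropy-based router can be illustrated with a small sketch: for each granularity, the query's relevance scores over stored memories are turned into a distribution, and the granularity with the most peaked (lowest-entropy) distribution is selected. The abstract does not give the exact routing rule, so the "pick the minimum normalized entropy" rule, the granularity names, and the scores are all assumptions.

```python
import math

def normalized_entropy(scores):
    """Entropy of softmax(scores), divided by log(n) so it lies in [0, 1]. Requires n >= 2."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(scores))

def route_granularity(relevance_by_granularity):
    """Pick the granularity whose relevance distribution is most peaked (lowest entropy)."""
    return min(relevance_by_granularity,
               key=lambda g: normalized_entropy(relevance_by_granularity[g]))

# Hypothetical query-to-memory relevance scores at three granularities.
relevance = {
    "turn": [4.0, 0.5, 0.3, 0.2],
    "session": [1.1, 1.0, 0.9, 1.0],
    "summary": [2.0, 1.8, 0.4, 0.3],
}
chosen = route_granularity(relevance)
```

A flat distribution signals that no memory at that granularity stands out for the query, so retrieving from it is more likely to add noise than information.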

Authors:Yixuan Wang, Yijun Liu, Shiyu Ji, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che
Title: Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding
Abstract:
Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel. However, existing verification methods rely heavily on distributional consistency while overlooking semantic correctness, thereby limiting the potential speedup of speculative decoding. While some methods employ additional models for relaxed verification of draft tokens, they often fail to generalize effectively to more diverse or open-domain settings. In this work, we propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency. Specifically, we leverage the inherent reflective capacity of LLMs to semantically assess the correctness of draft tokens in parallel during verification. Using prompt-based probing, we obtain both the original and reflective distributions of draft tokens in a single forward pass. The fusion of these distributions enables semantic-level verification of draft tokens that incorporates both consistency and correctness. Experiments across multiple domain benchmarks and model scales demonstrate that our method significantly increases the acceptance length of draft tokens without compromising model performance. Furthermore, we find that the proposed Reflective Verification is orthogonal to existing statistical verification methods, and their combination yields an additional 5% to 15% improvement in decoding speed.
中文摘要:反射验证是一种无需训练的方法,利用大语言模型的内在反思能力对推测解码中的草稿令牌进行语义验证,在不损失性能的前提下显著提升令牌接受率和解码速度。
English Summary: Reflective Verification is a training-free method that uses the inherent reflective capacity of large language models to semantically verify draft tokens during speculative decoding, achieving higher acceptance rates and faster decoding speeds without performance loss.

Authors:Danyang Zhang, Situo Zhang, Ziyue Yang, Zichen Zhu, Zihan Zhao, Ruisheng Cao, Lu Chen, Kai Yu
Title: ProgRM: Build Better GUI Agents with Progress Rewards
Abstract:
Large Language Model (LLM)-based Graphical User Interface (GUI) agents can potentially reshape our daily lives significantly. However, current LLM-based GUI agents suffer from the scarcity of high-quality training data owing to the difficulties of trajectory collection and reward annotation. Existing works have explored using LLMs to collect trajectories for imitation learning or to offer reward signals for online RL training. However, the Outcome Reward Model (ORM) used in existing works cannot provide fine-grained feedback and can over-penalize the valuable steps in trajectories that ultimately fail. To this end, we propose the Progress Reward Model (ProgRM), which provides dense, informative intermediate rewards by predicting task completion progress for each step in online training. To handle the challenge of progress reward label annotation, we further design an efficient LCS-based (Longest Common Subsequence) self-annotation algorithm to discover the key steps in trajectories and assign progress labels accordingly. ProgRM is evaluated with extensive experiments and analyses. Actors trained with ProgRM outperform leading proprietary LLMs and ORM-trained actors, illustrating the effectiveness of ProgRM. The code for the experiments will be made publicly available upon acceptance.
中文: 针对基于大语言模型的图形界面代理训练数据稀缺问题,本研究提出进度奖励模型(ProgRM)及自标注算法,通过提供密集中间奖励显著提升了代理性能,实验验证了其优越性。
English: LLM-based GUI agents face training data scarcity, so this study introduces the Progress Reward Model (ProgRM) with a self-annotation algorithm to provide dense intermediate rewards, significantly improving agent performance in experiments.
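
One way to picture LCS-based progress labeling is to align a trajectory against a reference sequence of key steps and count how many key steps have been matched so far. The abstract does not specify how key steps or references are obtained, so the reference sequence, the action names, and the "matched fraction" label are assumptions for illustration.

```python
def lcs_matches(trajectory, reference):
    """Return indices in `trajectory` that take part in one longest common subsequence."""
    n, m = len(trajectory), len(reference)
    # dp[i][j] = LCS length of trajectory[i:] and reference[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if trajectory[i] == reference[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    matched, i, j = [], 0, 0
    while i < n and j < m:
        if trajectory[i] == reference[j]:
            matched.append(i)
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched

def progress_labels(trajectory, reference):
    """Label each step with the fraction of reference key steps completed so far."""
    key_steps = set(lcs_matches(trajectory, reference))
    labels, done = [], 0
    for i in range(len(trajectory)):
        if i in key_steps:
            done += 1
        labels.append(done / len(reference))
    return labels

# Hypothetical action names for a failed trajectory and a reference key-step sequence.
trajectory = ["open_app", "scroll", "tap_search", "type_query", "back", "tap_result"]
reference = ["open_app", "tap_search", "type_query", "tap_result", "confirm"]
labels = progress_labels(trajectory, reference)
```

Unlike a single outcome reward, these labels still credit the useful steps of a trajectory that fails at the end.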

Authors:Zeen Song, Wenwen Qiang, Siyu Zhao, Changwen Zheng, Gang Hua
Title: Reward Model Generalization for Compute-Aware Test-Time Reasoning
Abstract:
External test-time reasoning enhances large language models (LLMs) by decoupling generation and selection. At inference time, the model generates multiple reasoning paths, and an auxiliary process reward model (PRM) is used to score and select the best one. A central challenge in this setting is test-time compute optimality (TCO), i.e., how to maximize answer accuracy under a fixed inference budget. In this work, we establish a theoretical framework to analyze how the generalization error of the PRM affects compute efficiency and reasoning performance. Leveraging PAC-Bayes theory, we derive generalization bounds and show that a lower generalization error of PRM leads to fewer samples required to find correct answers. Motivated by this analysis, we propose Compute-Aware Tree Search (CATS), an actor-critic framework that dynamically controls search behavior. The actor outputs sampling hyperparameters based on reward distributions and sparsity statistics, while the critic estimates their utility to guide budget allocation. Experiments on the MATH and AIME benchmarks with various LLMs and PRMs demonstrate that CATS consistently outperforms other external TTS methods, validating our theoretical predictions.
Chinese: 外部测试时推理通过生成多条推理路径并利用奖励模型筛选最优解来增强大语言模型,其中提出的计算感知树搜索(CATS)框架能动态优化搜索行为,在固定计算预算下持续优于其他方法。
English: External test-time reasoning improves LLMs by generating multiple reasoning paths and using a reward model to select the best one, with the proposed Compute-Aware Tree Search (CATS) framework dynamically optimizing search behavior to outperform other methods under fixed compute budgets.
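To make the "decoupled generation and selection" setting concrete, here is a small sketch of plain best-of-N selection with a step-level scorer under a fixed sample budget. It is not CATS itself: the actor-critic control of sampling hyperparameters is omitted, and `score_step`, the min-over-steps aggregation, and the toy scores are all assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def select_best_path(
    paths: Sequence[List[str]],
    score_step: Callable[[str], float],
    budget: int,
) -> Tuple[List[str], float]:
    """Score at most `budget` non-empty reasoning paths and return the best one.

    A path is scored by its weakest step (assumed aggregation rule).
    """
    if budget <= 0 or not paths:
        raise ValueError("need a positive budget and at least one path")
    best_path, best_score = None, float("-inf")
    for path in paths[:budget]:
        path_score = min(score_step(step) for step in path)
        if path_score > best_score:
            best_path, best_score = path, path_score
    return best_path, best_score


# Toy stand-in for a process reward model: a lookup table of step scores.
toy_scores = {"x=2": 0.9, "x=3": 0.2, "so 2x=4": 0.8, "so 2x=6": 0.7}
candidates = [["x=3", "so 2x=6"], ["x=2", "so 2x=4"]]
best, score = select_best_path(candidates, toy_scores.get, budget=2)
```

With a smaller budget only the first candidates are scored, which is why a more accurate scorer can reach a correct answer with fewer samples.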

Authors:Yi Su, Jiayi Zhang, Shu Yang, Xinhai Wang, Lijie Hu, Di Wang
Title: Understanding How Value Neurons Shape the Generation of Specified Values in LLMs
Abstract:
Rapid integration of large language models (LLMs) into societal applications has intensified concerns about their alignment with universal ethical principles, as their internal value representations remain opaque despite behavioral alignment advancements. Current approaches struggle to systematically interpret how values are encoded in neural architectures, limited by datasets that prioritize superficial judgments over mechanistic analysis. We introduce ValueLocate, a mechanistic interpretability framework grounded in the Schwartz Values Survey, to address this gap. Our method first constructs ValueInsight, a dataset that operationalizes four dimensions of universal value through behavioral contexts in the real world. Leveraging this dataset, we develop a neuron identification method that calculates activation differences between opposing value aspects, enabling precise localization of value-critical neurons without relying on computationally intensive attribution methods. Our proposed validation method demonstrates that targeted manipulation of these neurons effectively alters model value orientations, establishing causal relationships between neurons and value representations. This work advances the foundation for value alignment by bridging psychological value frameworks with neuron analysis in LLMs.
中文摘要:ValueLocate是一个基于心理学价值框架的机制可解释性方法,通过定位和操控大语言模型中的关键神经元来改变价值取向,为价值对齐建立了因果关联。
English Summary: ValueLocate is a mechanistic interpretability framework that identifies value-critical neurons in LLMs using behavioral datasets and targeted manipulation, advancing value alignment by bridging psychological frameworks with neural analysis.
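The neuron identification step is described only as computing activation differences between opposing value aspects. A minimal sketch of that idea, assuming activations have already been collected as one vector per prompt for each side of a value dimension, could rank neurons by the gap between their mean activations. The function names, the use of the mean, and the top-k cut-off are assumptions rather than the paper's exact procedure.

```python
def mean_activation(activations):
    """Column-wise mean over a list of per-prompt activation vectors."""
    count = len(activations)
    return [sum(column) / count for column in zip(*activations)]


def top_value_neurons(pos_activations, neg_activations, k):
    """Rank neurons by |mean(pos) - mean(neg)| and return the top k as (index, diff)."""
    pos_mean = mean_activation(pos_activations)
    neg_mean = mean_activation(neg_activations)
    diffs = [p - q for p, q in zip(pos_mean, neg_mean)]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]), reverse=True)
    return [(i, diffs[i]) for i in ranked[:k]]


# Toy data: 3 neurons, 2 prompts per side (values are made up).
pos = [[1.0, 0.2, 0.5], [0.8, 0.1, 0.5]]
neg = [[0.1, 0.2, 0.9], [0.1, 0.3, 0.7]]
candidates = top_value_neurons(pos, neg, k=2)
```

The sign of each difference indicates which side of the value dimension a neuron leans toward, which is what a later manipulation experiment would amplify or suppress.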

Authors:Kazuki Hayashi, Shintaro Ozaki, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Title: Diagnosing Vision Language Models' Perception by Leveraging Human Methods for Color Vision Deficiencies
Abstract:
Large-scale Vision Language Models (LVLMs) are increasingly being applied to a wide range of real-world multimodal applications, involving complex visual and linguistic reasoning. As these models become more integrated into practical use, they are expected to handle complex aspects of human interaction. Among these, color perception is a fundamental yet highly variable aspect of visual understanding. It differs across individuals due to biological factors such as Color Vision Deficiencies (CVDs), as well as differences in culture and language. Despite its importance, perceptual diversity has received limited attention. In our study, we evaluate LVLMs' ability to account for individual-level perceptual variation using the Ishihara Test, a widely used method for detecting CVDs. Our results show that LVLMs can explain CVDs in natural language, but they cannot simulate how people with CVDs perceive color in image-based tasks. These findings highlight the need for multimodal systems that can account for color perceptual diversity and support broader discussions on perceptual inclusiveness and fairness in multimodal AI.
中文: 大规模视觉语言模型能够用自然语言描述色觉缺陷,但在图像任务中无法模拟色觉缺陷者的实际颜色感知,这表明需要开发更具感知包容性的多模态人工智能系统。
English: Large-scale Vision Language Models can describe color vision deficiencies in natural language but fail to simulate the actual color perception of affected individuals in image tasks, highlighting the need for more perceptually inclusive multimodal AI systems.

Authors:Wenwen Qiang, Jingyao Wang, Zeen Song, Jiangmeng Li, Changwen Zheng
Title: On the Out-of-Distribution Generalization of Self-Supervised Learning
Abstract:
In this paper, we focus on the out-of-distribution (OOD) generalization of self-supervised learning (SSL). By analyzing the mini-batch construction during the SSL training phase, we first give one plausible explanation for SSL having OOD generalization. Then, from the perspective of data generation and causal inference, we analyze and conclude that SSL learns spurious correlations during the training process, which leads to a reduction in OOD generalization. To address this issue, we propose a post-intervention distribution (PID) grounded in the Structural Causal Model. PID offers a scenario where the spurious variable and the label variable are mutually independent. Besides, we demonstrate that if each mini-batch during SSL training satisfies PID, the resulting SSL model can achieve optimal worst-case OOD performance. This motivates us to develop a batch sampling strategy that enforces PID constraints through the learning of a latent variable model. Through theoretical analysis, we demonstrate the identifiability of the latent variable model and validate the effectiveness of the proposed sampling strategy. Experiments conducted on various downstream OOD tasks demonstrate the effectiveness of the proposed sampling strategy.
中文: 本文分析了自监督学习的分布外泛化问题,将其局限归因于伪相关性,并提出一种基于干预后分布的批量采样策略来提升分布外性能。
English: This paper analyzes self-supervised learning's out-of-distribution generalization, attributing its limitations to spurious correlations and proposing a post-intervention distribution-based batch sampling strategy to enhance OOD performance.

Authors:Xingyu Zhang, Wenwen Qiang, Siyu Zhao, Huijie Guo, Jiangmeng Li, Chuxiong Sun, Changwen Zheng
Title: CAIFormer: A Causal Informed Transformer for Multivariate Time Series Forecasting
Abstract:
Most existing multivariate time series forecasting methods adopt an all-to-all paradigm that feeds all variable histories into a unified model to predict their future values without distinguishing their individual roles. However, this undifferentiated paradigm makes it difficult to identify variable-specific causal influences and often entangles causally relevant information with spurious correlations. To address this limitation, we propose an all-to-one forecasting paradigm that predicts each target variable separately. Specifically, we first construct a Structural Causal Model from observational data and then, for each target variable, we partition the historical sequence into four sub-segments according to the inferred causal structure: endogenous, direct causal, collider causal, and spurious correlation. The prediction relies solely on the first three causally relevant sub-segments, while the spurious correlation sub-segment is excluded. Furthermore, we propose Causal Informed Transformer (CAIFormer), a novel forecasting model comprising three components: Endogenous Sub-segment Prediction Block, Direct Causal Sub-segment Prediction Block, and Collider Causal Sub-segment Prediction Block, which process the endogenous, direct causal, and collider causal sub-segments, respectively. Their outputs are then combined to produce the final prediction. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of the CAIFormer.
中文: 本文提出了一种“逐一预测”范式,通过构建结构因果模型将历史数据划分为因果相关片段并排除伪相关,进而开发CAIFormer模型分别处理这些片段以生成最终预测,多基准数据集实验验证了其有效性。
English: The paper introduces an all-to-one forecasting paradigm that predicts each target variable separately by constructing a Structural Causal Model to partition historical data into causally relevant segments, excluding spurious correlations, and proposes the CAIFormer model to process these segments for final predictions, demonstrating effectiveness in experiments.

Authors:Derong Xu, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Maolin Wang, Qidong Liu, Xiangyu Zhao, Yichao Wang, Huifeng Guo, Ruiming Tang, Enhong Chen, Tong Xu
Title: Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. Building on this foundation, graph-based RAG systems go a step further by retrieving subgraphs, which preserve the relationships between knowledge entities and provide more comprehensive context. However, graph RAG faces two challenges: (1) Retrieving relevant information introduces irrelevant nodes (especially in dense graph databases, where retrieval usually extends to adjacent nodes), and leads to overly lengthy inputs that hinder efficiency; (2) The representation gap between graph and language during generation with LLMs limits the ability to fully leverage graph structures for enhanced understanding. To address these limitations, we propose Align-GRAG, a novel reasoning-guided dual alignment framework for the post-retrieval phase. It first formulates a subgraph by retrieving nodes and edges. Then an Aligner is proposed to jointly optimize a graph encoder with an LLM-summarized reasoning chain. It achieves dual alignment of graph node and representation by leveraging KL divergence loss and contrastive loss, facilitating efficient pruning of irrelevant knowledge and establishing a unified semantic space. The Generator integrates the aligned graph data with an LLM to produce coherent and accurate answers. Experiments on the GraphQA benchmark across three tasks (including common sense reasoning, scene graph understanding, and knowledge graph reasoning) validate the effectiveness of our method. The code is available at https://anonymous.4open.science/r/Align-GRAG-F3D8/.
中文摘要:Align-GRAG提出了一种推理引导的双重对齐框架,通过联合优化图编码器与LLM推理链来解决图RAG的效率低下和表征鸿沟问题,并在GraphQA基准测试中验证了其有效性。
English Summary: Align-GRAG introduces a reasoning-guided dual alignment framework that optimizes graph encoders with LLM-summarized reasoning chains to address graph RAG's inefficiency and representation gap challenges, validated through GraphQA benchmark experiments.

Authors:Zifeng Wang, Qiao Jin, Jiacheng Lin, Junyi Gao, Jathurshan Pradeepkumar, Pengcheng Jiang, Benjamin Danek, Zhiyong Lu, Jimeng Sun
Title: TrialPanorama: Database and Benchmark for Systematic Review and Design of Clinical Trials
Abstract:
Developing artificial intelligence (AI) for vertical domains requires a solid data foundation for both training and evaluation. In this work, we introduce TrialPanorama, a large-scale, structured database comprising 1,657,476 clinical trial records aggregated from 15 global sources. The database captures key aspects of trial design and execution, including trial setups, interventions, conditions, biomarkers, and outcomes, and links them to standard biomedical ontologies such as DrugBank and MedDRA. This structured and ontology-grounded design enables TrialPanorama to serve as a unified, extensible resource for a wide range of clinical trial tasks, including trial planning, design, and summarization. To demonstrate its utility, we derive a suite of benchmark tasks directly from the TrialPanorama database. The benchmark spans eight tasks across two categories: three for systematic review (study search, study screening, and evidence summarization) and five for trial design (arm design, eligibility criteria, endpoint selection, sample size estimation, and trial completion assessment). Experiments using five state-of-the-art large language models (LLMs) show that while general-purpose LLMs exhibit some zero-shot capability, their performance is still inadequate for high-stakes clinical trial workflows. We release the TrialPanorama database and the benchmark to facilitate further research on AI for clinical trials.
中文:本文介绍了TrialPanorama,这是一个大规模结构化临床试验数据库,可作为试验规划和评估的统一资源,基准测试表明当前大语言模型仍无法满足高风险的临床应用需求。
English: This work introduces TrialPanorama, a large-scale structured database of clinical trials that serves as a unified resource for trial planning and evaluation, with benchmark tests showing current LLMs still fall short for high-stakes clinical applications.

Authors:Johannes Kaiser, Kristian Schwethelm, Daniel Rueckert, Georgios Kaissis
Title: Laplace Sample Information: Data Informativeness Through a Bayesian Lens
Abstract:
Accurately estimating the informativeness of individual samples in a dataset is an important objective in deep learning, as it can guide sample selection, which can improve model efficiency and accuracy by removing redundant or potentially harmful samples. We propose Laplace Sample Information (LSI), a measure of sample informativeness grounded in information theory that is widely applicable across model architectures and learning settings. LSI leverages a Bayesian approximation to the weight posterior and the KL divergence to measure the change in the parameter distribution induced by a sample of interest from the dataset. We experimentally show that LSI is effective in ordering the data with respect to typicality, detecting mislabeled samples, measuring class-wise informativeness, and assessing dataset difficulty. We demonstrate these capabilities of LSI on image and text data in supervised and unsupervised settings. Moreover, we show that LSI can be computed efficiently through probes and transfers well to the training of large models.
中文摘要:提出的拉普拉斯样本信息(LSI)度量通过贝叶斯近似和KL散度有效量化样本信息量,可在多种模型和数据类型中实现数据排序、错误检测及数据集难度评估等应用。
English Summary: The proposed Laplace Sample Information (LSI) measure effectively quantifies sample informativeness using Bayesian approximation and KL divergence, enabling applications like data ordering, error detection, and dataset assessment across diverse models and data types.
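As a rough illustration of measuring "the change in the parameter distribution induced by a sample", the sketch below computes the KL divergence between two Gaussian weight posteriors, one fitted with the sample and one without. It assumes diagonal covariances and a particular KL direction; both are simplifying assumptions made here, and the means and variances are made-up numbers rather than outputs of a real Laplace approximation.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) in nats."""
    kl = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        # Closed form for one dimension of two univariate Gaussians.
        kl += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return kl


# Hypothetical posteriors over two weights: with and without the sample of interest.
with_sample = ([0.40, -1.10], [0.05, 0.08])
without_sample = ([0.35, -0.60], [0.05, 0.10])
informativeness = kl_diag_gaussians(*without_sample, *with_sample)
```

A larger value means removing the sample moves the posterior further, so the sample is treated as more informative under this reading.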

Authors:Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
Title: RoT: Enhancing Table Reasoning with Iterative Row-Wise Traversals
Abstract:
The table reasoning task, crucial for efficient data acquisition, aims to answer questions based on the given table. Recently, reasoning large language models (RLLMs) with Long Chain-of-Thought (Long CoT) have significantly enhanced reasoning capabilities, leading to strong performance on table reasoning. However, Long CoT suffers from high training cost and exhibits low reliability due to table content hallucinations. Therefore, we propose Row-of-Thought (RoT), which iteratively performs row-wise table traversal, allowing for reasoning extension and reflection-based refinement at each traversal. Because it scales reasoning length through row-wise traversal and leverages the reflection capabilities of LLMs, RoT is training-free. The sequential traversal encourages greater attention to the table, thus reducing hallucinations. Experiments show that RoT, using non-reasoning models, outperforms RLLMs by an average of 4.3%, and achieves state-of-the-art results on WikiTableQuestions and TableBench with comparable models, proving its effectiveness. Also, RoT outperforms Long CoT with fewer reasoning tokens, indicating higher efficiency.
Chinese: 提出的行思考(RoT)方法通过迭代式逐行遍历表格来扩展推理并基于反思优化结果,无需训练即可减少幻觉,同时以比长思维链更高的效率实现了最先进的性能。
English: The proposed Row-of-Thought (RoT) method performs iterative row-wise table traversal to extend reasoning and refine results through reflection, reducing hallucinations without training while achieving state-of-the-art performance with higher efficiency than Long Chain-of-Thought.

Authors:Wenlin Zhang, Xiangyang Li, Kuicai Dong, Yichao Wang, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Derong Xu, Zhaocheng Du, Huifeng Guo, Ruiming Tang, Xiangyu Zhao
Title: Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning
Abstract:
Retrieval-augmented generation (RAG) enhances the text generation capabilities of large language models (LLMs) by integrating external knowledge and up-to-date information. However, traditional RAG systems are limited by static workflows and lack the adaptability required for multistep reasoning and complex task management. To address these limitations, agentic RAG systems (e.g., DeepResearch) have been proposed, enabling dynamic retrieval strategies, iterative context refinement, and adaptive workflows for handling complex search queries beyond the capabilities of conventional RAG. Recent advances, such as Search-R1, have demonstrated promising gains using outcome-based reinforcement learning, where the correctness of the final answer serves as the reward signal. Nevertheless, such outcome-supervised agentic RAG methods face challenges including low exploration efficiency, gradient conflict, and sparse reward signals. To overcome these challenges, we propose to utilize fine-grained, process-level rewards to improve training stability, reduce computational costs, and enhance efficiency. Specifically, we introduce a novel method ReasonRAG that automatically constructs RAG-ProGuide, a high-quality dataset providing process-level rewards for (i) query generation, (ii) evidence extraction, and (iii) answer generation, thereby enhancing model inherent capabilities via process-supervised reinforcement learning. With the process-level policy optimization, the proposed framework empowers LLMs to autonomously invoke search, generate queries, extract relevant evidence, and produce final answers. Compared to existing approaches such as Search-R1 and traditional RAG systems, ReasonRAG, leveraging RAG-ProGuide, achieves superior performance on five benchmark datasets using only 5k training instances, significantly fewer than the 90k training instances required by Search-R1.
中文: ReasonRAG提出基于过程监督强化学习的细粒度奖励机制,显著提升智能检索增强生成系统的性能,在仅需少量训练数据的情况下于多个基准测试中超越现有方法。
English: ReasonRAG introduces process-supervised reinforcement learning with fine-grained rewards to enhance agentic RAG systems, achieving superior performance on benchmarks using significantly fewer training instances than existing methods.

Authors:Shijie Xuyang, Xianzhen Luo, Tianhao Cheng, Zheng Chu, Houyi Li, ziqi wang, Siming Huang, Qingfu Zhu, Qiufeng Wang, Xiangyu Zhang, Shuigeng Zhou, Wanxiang Che
Title: Is Compression Really Linear with Code Intelligence?
Abstract:
Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs' code intelligence, we introduce Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve's tail under specific, limited conditions. Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
Chinese: 本研究通过全面评估推翻了数据压缩与代码智能之间的线性关系假说,揭示了二者实际存在的对数关联,并提出新的训练方法和验证集以实现公平评估。
English: This study refutes the linear relationship between data compression and code intelligence in LLMs, revealing instead a logarithmic correlation through comprehensive evaluation and introducing a novel training method and validation set for equitable assessment.
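Bits-per-character has a standard definition: the total negative log-likelihood a model assigns to a text, in bits, divided by the number of characters. The sketch below shows that arithmetic, assuming per-token log-probabilities in nats are already available from some model; how the paper tokenizes and aggregates over its validation set is not stated in the abstract.

```python
import math


def bits_per_character(token_logprobs, text):
    """BPC = total negative log-likelihood in bits / number of characters.

    `token_logprobs` are natural-log probabilities of the tokens that make up `text`.
    """
    if not text:
        raise ValueError("text must be non-empty")
    total_bits = -sum(token_logprobs) / math.log(2)  # convert nats to bits
    return total_bits / len(text)


# Hypothetical example: 4 tokens, each predicted with probability 0.5,
# covering an 8-character snippet.
snippet = "x = 1\ny\n"
logprobs = [math.log(0.5)] * 4
bpc = bits_per_character(logprobs, snippet)
```

Normalizing by characters rather than tokens makes models with different tokenizers comparable, which matters when many Code LLMs are evaluated on one validation set.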

Authors:Jingyao Wang, Wenwen Qiang, Zeen Song, Changwen Zheng, Hui Xiong
Title: Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
Abstract:
Large language models (LLMs) excel at complex tasks thanks to advances in reasoning abilities. However, existing methods overlook the trade-off between reasoning effectiveness and computational efficiency, often encouraging unnecessarily long reasoning chains and wasting tokens. To address this, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework that enables LLMs to achieve optimal reasoning with fewer tokens. Specifically, L2T treats each query-response interaction as a hierarchical session of multiple episodes and proposes a universal dense process reward that quantifies the episode-wise information gain in parameters, requiring no extra annotations or task-specific evaluators. We propose a method to quickly estimate this reward based on PAC-Bayes bounds and the Fisher information matrix. Theoretical analyses show that it significantly reduces computational complexity with high estimation accuracy. By immediately rewarding each episode's contribution and penalizing excessive updates, L2T optimizes the model via reinforcement learning to maximize the use of each episode and achieve effective updates. Empirical results on various reasoning benchmarks and base models demonstrate the advantage of L2T across different tasks, boosting both reasoning effectiveness and efficiency.
中文: 提出的“学会思考”(L2T)框架通过信息论强化学习优化大型语言模型的推理效率,在降低计算成本的同时,保持其在多种任务中的高性能表现。
English: The proposed Learning to Think (L2T) framework enhances large language models by optimizing reasoning efficiency through information-theoretic reinforcement learning, reducing computational costs while maintaining high performance across diverse tasks.

Authors:Kazuki Hayashi, Hidetaka Kamigaito, Shinya Kouda, Taro Watanabe
Title: IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a way to complement the in-context knowledge of Large Language Models (LLMs) by integrating external documents. However, real-world applications demand not only accuracy but also interpretability. While dense retrieval methods provide high accuracy, they lack interpretability; conversely, sparse retrieval methods offer transparency but often fail to capture the full intent of queries due to their reliance on keyword matching. To address these issues, we introduce IterKey, an LLM-driven iterative keyword generation framework that enhances RAG via sparse retrieval. IterKey consists of three LLM-driven stages: generating keywords for retrieval, generating answers based on retrieved documents, and validating the answers. If validation fails, the process iteratively repeats with refined keywords. Across four QA tasks, experimental results show that IterKey achieves 5% to 20% accuracy improvements over BM25-based RAG and simple baselines. Its performance is comparable to dense retrieval-based RAG and prior iterative query refinement methods using dense models. In summary, IterKey is a novel BM25-based approach leveraging LLMs to iteratively refine RAG, effectively balancing accuracy with interpretability.
中文摘要:IterKey是一种基于大语言模型的迭代式关键词生成框架,通过稀疏检索增强检索增强生成技术,在保持可解释性的同时显著提升了问答任务的准确性,性能媲美密集检索方法。
English Summary: IterKey is an LLM-driven iterative keyword generation framework that enhances Retrieval-Augmented Generation by balancing accuracy and interpretability through sparse retrieval, achieving significant accuracy improvements over traditional methods.
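The three-stage loop (generate keywords, answer from retrieved documents, validate, then retry with refined keywords) can be sketched as plain control flow. In the sketch below every callable is a hypothetical stand-in: the real system uses LLM calls for the three stages and BM25 for retrieval, whereas here a substring-overlap retriever and hard-coded stubs are used only so the loop is runnable.

```python
def iterkey_loop(question, generate_keywords, retrieve, answer, validate, max_iters=3):
    """Run generate -> retrieve -> answer -> validate until validation passes."""
    keywords, result = None, None
    for _ in range(max_iters):
        keywords = generate_keywords(question, keywords)  # refine using previous keywords
        docs = retrieve(keywords)
        result = answer(question, docs)
        if validate(question, docs, result):
            return result, keywords, True
    return result, keywords, False


# Toy stand-ins (assumptions, not the paper's components).
CORPUS = ["Berlin is the capital of Germany.", "Paris is the capital of France."]


def toy_keywords(question, previous):
    return ["capital"] if previous is None else previous + ["france"]


def toy_retrieve(keywords, top_n=1):
    # Keyword-overlap scoring as a simple placeholder for sparse retrieval.
    def overlap(doc):
        return sum(kw in doc.lower() for kw in keywords)
    return sorted(CORPUS, key=overlap, reverse=True)[:top_n]


def toy_answer(question, docs):
    return docs[0].split()[0] if docs else "unknown"


def toy_validate(question, docs, result):
    return any("France" in doc and result in doc for doc in docs)


final = iterkey_loop("What is the capital of France?", toy_keywords,
                     toy_retrieve, toy_answer, toy_validate)
```

Because the keywords are explicit strings at every iteration, each retrieval decision can be inspected, which is the interpretability argument the abstract makes for the sparse setting.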

Authors:Xianrui Zhong, Bowen Jin, Siru Ouyang, Yanzhen Shen, Qiao Jin, Yin Fang, Zhiyong Lu, Jiawei Han
Title: Benchmarking Retrieval-Augmented Generation for Chemistry
Abstract:
Retrieval-augmented generation (RAG) has emerged as a powerful framework for enhancing large language models (LLMs) with external knowledge, particularly in scientific domains that demand specialized and dynamic information. Despite its promise, the application of RAG in the chemistry domain remains underexplored, primarily due to the lack of high-quality, domain-specific corpora and well-curated evaluation benchmarks. In this work, we introduce ChemRAG-Bench, a comprehensive benchmark designed to systematically assess the effectiveness of RAG across a diverse set of chemistry-related tasks. The accompanying chemistry corpus integrates heterogeneous knowledge sources, including scientific literature, the PubChem database, PubMed abstracts, textbooks, and Wikipedia entries. In addition, we present ChemRAG-Toolkit, a modular and extensible RAG toolkit that supports five retrieval algorithms and eight LLMs. Using ChemRAG-Toolkit, we demonstrate that RAG yields a substantial performance gain, achieving an average relative improvement of 17.4% over direct inference methods. We further conduct in-depth analyses on retriever architectures, corpus selection, and the number of retrieved passages, culminating in practical recommendations to guide future research and deployment of RAG systems in the chemistry domain. The code and data are available at https://chemrag.github.io.
中文: 本文提出了化学领域检索增强生成(RAG)评估基准ChemRAG-Bench和工具包ChemRAG-Toolkit,实验证明RAG比直接推理方法性能提升17.4%,并为该领域未来研究提供了实用建议。
English: This paper introduces ChemRAG-Bench, a benchmark for evaluating retrieval-augmented generation (RAG) in chemistry, and ChemRAG-Toolkit, which demonstrates RAG's 17.4% performance improvement over direct inference methods while providing practical recommendations for future research.

Authors:Akansha Shukla, Parth Atulbhai Gandhi, Yuval Elovici, Asaf Shabtai
Title: RuleGenie: SIEM Detection Rule Set Optimization
Abstract:
SIEM systems serve as a critical hub, employing rule-based logic to detect and respond to threats. Redundant or overlapping rules in SIEM systems lead to excessive false alerts, degrading analyst performance due to alert fatigue, and increase computational overhead and response latency for actual threats. As a result, optimizing SIEM rule sets is essential for efficient operations. Despite the importance of such optimization, research in this area is limited, with current practices relying on manual optimization methods that are both time-consuming and error-prone due to the scale and complexity of enterprise-level rule sets. To address this gap, we present RuleGenie, a novel large language model (LLM) aided recommender system designed to optimize SIEM rule sets. Our approach leverages transformer models' multi-head attention capabilities to generate SIEM rule embeddings, which are then analyzed using a similarity matching algorithm to identify the top-k most similar rules. The LLM then processes the rules identified, utilizing its information extraction, language understanding, and reasoning capabilities to analyze rule similarity, evaluate threat coverage and performance metrics, and deliver optimized recommendations for refining the rule set. By automating the rule optimization process, RuleGenie allows security teams to focus on more strategic tasks while enhancing the efficiency of SIEM systems and strengthening organizations' security posture. We evaluated RuleGenie on a comprehensive set of real-world SIEM rule formats, including Splunk, Sigma, and AQL (Ariel query language), demonstrating its platform-agnostic capabilities and adaptability across diverse security infrastructures. Our experimental results show that RuleGenie can effectively identify redundant rules, which in turn decreases false positive rates and enhances overall rule efficiency.
中文摘要:SIEM系统因规则集冗余导致误报和性能下降,为此提出RuleGenie——基于大语言模型的自动化规则优化工具,通过识别冗余规则有效提升系统效率并降低误报率,适用于多种安全平台。
English Summary: SIEM systems suffer from inefficient rule sets causing false alerts and performance issues, prompting the development of RuleGenie, an LLM-powered tool that automates rule optimization to reduce redundancies and improve security operations across multiple platforms.
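The retrieval step, finding the top-k rules most similar to a given rule from their embeddings, can be sketched with cosine similarity. The abstract does not name the similarity measure, so cosine similarity is an assumption here, and the rule IDs and three-dimensional vectors are made up; real embeddings would come from a transformer encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(query_id, embeddings, k):
    """Return the k rules most similar to `query_id` as (rule_id, similarity) pairs."""
    query = embeddings[query_id]
    scored = [(cosine(query, vec), rule_id)
              for rule_id, vec in embeddings.items() if rule_id != query_id]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(rule_id, score) for score, rule_id in scored[:k]]


# Hypothetical rule embeddings.
rule_embeddings = {
    "rule_a": [1.0, 0.0, 0.0],
    "rule_b": [0.9, 0.1, 0.0],
    "rule_c": [0.0, 1.0, 0.0],
    "rule_d": [0.7, 0.7, 0.0],
}
neighbors = top_k_similar("rule_a", rule_embeddings, k=2)
```

Only the shortlisted neighbors would then be passed to the LLM for the redundancy analysis, which keeps the number of expensive pairwise comparisons small.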

Authors:Jingyao Wang, Jianqi Zhang, Wenwen Qiang, Changwen Zheng
Title: Causal Prompt Calibration Guided Segment Anything Model for Open-Vocabulary Multi-Entity Segmentation
Abstract:
Despite the strength of the Segment Anything Model (SAM), it struggles with generalization issues in open-vocabulary multi-entity segmentation (OVMS). Through empirical and causal analyses, we find that (i) the prompt bias is the primary cause of the generalization issues; (ii) this bias is closely tied to the task-irrelevant generating factors within the prompts, which act as confounders and affect generalization. To address the generalization issues, we aim to propose a method that can calibrate prompts to eliminate confounders for accurate OVMS. Building upon the causal analysis, we propose that the optimal prompt for OVMS should contain only task-relevant causal factors. We define it as the causal prompt, serving as the goal of calibration. Next, our theoretical analysis, grounded by causal multi-distribution consistency theory, proves that this prompt can be obtained by enforcing segmentation consistency and optimality. Inspired by this, we propose CPC-SAM, a Causal Prompt Calibration method for SAM to achieve accurate OVMS. It integrates a lightweight causal prompt learner (CaPL) into SAM to obtain causal prompts. Specifically, we first generate multiple prompts using random annotations to simulate diverse distributions and then reweight them via CaPL by enforcing causal multi-distribution consistency in both task and entity levels. To ensure obtaining causal prompts, CaPL is optimized by minimizing the cumulative segmentation loss across the reweighted prompts to achieve consistency and optimality. A bi-level optimization strategy alternates between optimizing CaPL and SAM, ensuring accurate OVMS. Extensive experiments validate its superiority.
Chinese: 针对Segment Anything Model (SAM)在开放词汇多实体分割中的泛化问题,本文提出CPC-SAM方法,通过因果分析和多分布一致性校准提示,消除混杂因素,从而实现精确分割。
English: The Segment Anything Model (SAM) faces generalization issues in open-vocabulary multi-entity segmentation due to prompt bias, which is addressed by the proposed CPC-SAM method that calibrates prompts through causal analysis and multi-distribution consistency to achieve accurate segmentation.

Authors:Shifeng Liu, Xinglong Mao, Sirui Zhao, Peiming Li, Tong Xu, Enhong Chen
Title: MER-CLIP: AU-Guided Vision-Language Alignment for Micro-Expression Recognition
Abstract:
As a critical psychological stress response, micro-expressions (MEs) are fleeting and subtle facial movements revealing genuine emotions. Automatic ME recognition (MER) holds valuable applications in fields such as criminal investigation and psychological diagnosis. The Facial Action Coding System (FACS) encodes expressions by identifying activations of specific facial action units (AUs), serving as a key reference for ME analysis. However, current MER methods typically limit AU utilization to defining regions of interest (ROIs) or relying on specific prior knowledge, often resulting in limited performance and poor generalization. To address this, we integrate the CLIP model's powerful cross-modal semantic alignment capability into MER and propose a novel approach namely MER-CLIP. Specifically, we convert AU labels into detailed textual descriptions of facial muscle movements, guiding fine-grained spatiotemporal ME learning by aligning visual dynamics and textual AU-based representations. Additionally, we introduce an Emotion Inference Module to capture the nuanced relationships between ME patterns and emotions with higher-level semantic understanding. To mitigate overfitting caused by the scarcity of ME data, we put forward LocalStaticFaceMix, an effective data augmentation strategy blending facial images to enhance facial diversity while preserving critical ME features. Finally, comprehensive experiments on four benchmark ME datasets confirm the superiority of MER-CLIP. Notably, UF1 scores on CAS(ME)3 reach 0.7832, 0.6544, and 0.4997 for 3-, 4-, and 7-class classification tasks, significantly outperforming previous methods.
中文: 本研究提出MER-CLIP新方法,利用CLIP的跨模态对齐能力,将动作单元转化为文本描述并结合情感推理模块,显著提升了微表情识别的性能,在基准数据集上取得优越结果。
English: The study introduces MER-CLIP, a novel approach that leverages CLIP's cross-modal alignment to enhance micro-expression recognition by converting action units into textual descriptions and incorporating an emotion inference module, achieving superior performance on benchmark datasets.

Authors:Elad Feldman, Jacob Shams, Dudi Biton, Alfred Chen, Shaoyuan Xie, Satoru Koda, Yisroel Mirsky, Asaf Shabtai, Yuval Elovici, Ben Nassi
Title: PaniCar: Securing the Perception of Advanced Driving Assistance Systems Against Emergency Vehicle Lighting
Abstract:
The safety of autonomous cars has come under scrutiny in recent years, especially after 16 documented incidents involving Teslas (with autopilot engaged) crashing into parked emergency vehicles (police cars, ambulances, and firetrucks). While previous studies have revealed that strong light sources often introduce flare artifacts in the captured image, which degrade the image quality, the impact of flare on object detection performance remains unclear. In this research, we unveil PaniCar, a digital phenomenon that causes an object detector's confidence score to fluctuate below detection thresholds when exposed to activated emergency vehicle lighting. This vulnerability poses a significant safety risk, and can cause autonomous vehicles to fail to detect objects near emergency vehicles. In addition, this vulnerability could be exploited by adversaries to compromise the security of advanced driving assistance systems (ADASs). We assess seven commercial ADASs (Tesla Model 3, "manufacturer C", HP, Pelsee, AZDOME, Imagebon, Rexing), four object detectors (YOLO, SSD, RetinaNet, Faster R-CNN), and 14 patterns of emergency vehicle lighting to understand the influence of various technical and environmental factors. We also evaluate four SOTA flare removal methods and show that their performance and latency are insufficient for real-time driving constraints. To mitigate this risk, we propose Caracetamol, a robust framework designed to enhance the resilience of object detectors against the effects of activated emergency vehicle lighting. Our evaluation shows that on YOLOv3 and Faster R-CNN, Caracetamol improves the models' average confidence of car detection by 0.20, the lower confidence bound by 0.33, and reduces the fluctuation range by 0.33. In addition, Caracetamol is capable of processing frames at a rate of 30-50 FPS, enabling real-time ADAS car detection.
中文: 本研究揭示了PaniCar现象——应急车辆灯光会导致自动驾驶汽车的目标检测器置信度骤降而漏检,并提出Caracetamol实时防护框架,能显著提升检测系统在应急灯光下的稳定性和检测能力。
English: This study identifies PaniCar, a digital phenomenon where emergency vehicle lights cause object detectors in autonomous vehicles to lose confidence and miss detections, posing safety risks, and proposes Caracetamol, a real-time framework that significantly improves detection stability and performance under such conditions.

Authors:Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji, Wenhao Tang, Wenbo Ding, Chao Yu, Yu Wang
Title: Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning
Abstract:
In this paper, we tackle the problem of learning to play 3v3 multi-drone volleyball, a new embodied competitive task that requires both high-level strategic coordination and low-level agile control. The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors. To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from decentralized low-level motion control. We design a three-stage population-based training pipeline to enable both strategy and skill to emerge from scratch without expert demonstrations: (I) training diverse low-level skills, (II) learning high-level strategy via self-play with fixed low-level skills, and (III) joint fine-tuning through co-self-play. Experiments show that HCSP achieves superior performance, outperforming non-hierarchical self-play and rule-based hierarchical baselines with an average 82.9% win rate and a 71.5% win rate against the two-stage variant. Moreover, co-self-play leads to emergent team behaviors such as role switching and coordinated formations, demonstrating the effectiveness of our hierarchical design and training scheme. The project page is at https://sites.google.com/view/hi-co-self-play.
中文: 本文提出分层协同自博弈(HCSP)强化学习框架,通过战略决策与敏捷控制分离的三阶段训练,使无人机在3v3排球赛中实现82.9%胜率并涌现团队协作行为。
English: This paper introduces Hierarchical Co-Self-Play (HCSP), a reinforcement learning framework that trains drones to master 3v3 volleyball through strategic coordination and agile control, achieving an 82.9% win rate and emergent team behaviors.

Authors:Lorenzo Asquini, Manos Frouzakis, Juan Gómez-Luna, Mohammad Sadrosadati, Onur Mutlu, Francesco Silvestri
Title: Accelerating Triangle Counting with Real Processing-in-Memory Systems
Abstract:
Triangle Counting (TC) is a procedure that involves enumerating the number of triangles within a graph. It has important applications in numerous fields, such as social or biological network analysis and network security. TC is a memory-bound workload that does not scale efficiently in conventional processor-centric systems due to several memory accesses across large memory regions and low data reuse. However, recent Processing-in-Memory (PIM) architectures present a promising solution to alleviate these bottlenecks. Our work presents the first TC algorithm that leverages the capabilities of the UPMEM system, the first commercially available PIM architecture, while at the same time addressing its limitations. We use a vertex coloring technique to avoid expensive communication between PIM cores and employ reservoir sampling to address the limited amount of memory available in the PIM cores' DRAM banks. In addition, our work makes use of the Misra-Gries summary to speed up counting triangles on graphs with high-degree nodes and uniform sampling of the graph edges for quicker approximate results. Our PIM implementation surpasses state-of-the-art CPU-based TC implementations when processing dynamic graphs in Coordinate List format, showcasing the effectiveness of the UPMEM architecture in addressing TC's memory-bound challenges.
中文: 本研究首次为UPMEM内存处理架构设计了三角形计数算法,采用顶点着色和蓄水池采样技术突破内存限制,在处理动态图时性能超越了基于CPU的现有方法。
English: This work introduces the first triangle counting algorithm for the UPMEM processing-in-memory architecture, employing vertex coloring and reservoir sampling to overcome memory constraints and outperforming CPU-based methods on dynamic graphs.
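
The abstract above combines reservoir sampling of edges with triangle counting. The following is a minimal CPU-only sketch of that general idea in Python, not the authors' UPMEM implementation: the function names (`reservoir_sample_edges`, `count_triangles`, `estimate_triangles`) are hypothetical, and the scaling factor is the standard correction for a uniform sample of edges drawn without replacement, assumed here for illustration.

```python
import random


def reservoir_sample_edges(edge_stream, capacity, seed=0):
    """Keep a uniform random sample of at most `capacity` edges from a stream."""
    rng = random.Random(seed)
    sample = []
    for seen, edge in enumerate(edge_stream, start=1):
        if len(sample) < capacity:
            sample.append(edge)
        else:
            # The new edge replaces a stored one with probability capacity / seen.
            slot = rng.randrange(seen)
            if slot < capacity:
                sample[slot] = edge
    return sample


def count_triangles(edges):
    """Exact triangle count of an undirected graph given as (u, v) pairs."""
    # Normalize edges: drop self-loops and duplicates, ignore direction.
    unique = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    adjacency = {}
    for u, v in unique:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    # Every triangle is found once per edge, so divide by three.
    hits = sum(len(adjacency[u] & adjacency[v]) for u, v in unique)
    return hits // 3


def estimate_triangles(sample, stream_length):
    """Scale the sampled count by the inverse probability that all three
    edges of a triangle survive in the sample (assumed estimator)."""
    m, t = len(sample), stream_length
    if m < 3:
        return 0.0
    if m >= t:
        return float(count_triangles(sample))
    keep_prob = (m * (m - 1) * (m - 2)) / (t * (t - 1) * (t - 2))
    return count_triangles(sample) / keep_prob
```

When the reservoir is at least as large as the stream, the estimate reduces to the exact count; smaller reservoirs trade accuracy for memory, which is the constraint the abstract attributes to the PIM cores' DRAM banks.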

Authors:Orevaoghene Ahia, Martijn Bartelds, Kabir Ahuja, Hila Gonen, Valentin Hofmann, Siddhant Arora, Shuyue Stella Li, Vishal Puttagunta, Mofetoluwa Adeyemi, Charishma Buchireddy, Ben Walls, Noah Bennett, Shinji Watanabe, Noah A. Smith, Yulia Tsvetkov, Sachin Kumar
Title: BLAB: Brutally Long Audio Bench
Abstract:
Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. Our audio data were collected from permissively licensed sources and underwent a human-assisted filtering process to ensure task compliance. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks in BLAB. Our comprehensive analysis reveals key insights into the trade-offs between task difficulty and audio duration. In general, we find that audio LMs struggle with long-form speech, with performance declining as duration increases. They perform poorly on localization, temporal reasoning, and counting, and they struggle to understand non-phonemic information, relying more on prompts than on audio content. BLAB serves as a challenging evaluation framework to develop audio LMs with robust long-form audio understanding capabilities.
中文: Brutally Long Audio Bench (BLAB) 是一个针对长对话语音的挑战性基准测试,揭示了音频语言模型在定位和时间推理等任务上的显著不足,且性能随音频时长增加而下降。
English: The Brutally Long Audio Bench (BLAB) is a challenging benchmark designed to evaluate audio language models on long-form conversational speech, revealing their significant struggles with tasks like localization and temporal reasoning as audio duration increases.

Authors:Eran Aizikovich, Dudu Mimran, Edita Grolman, Yuval Elovici, Asaf Shabtai
Title: Rogue Cell: Adversarial Attack and Defense in Untrusted O-RAN Setup Exploiting the Traffic Steering xApp
Abstract:
The Open Radio Access Network (O-RAN) architecture is revolutionizing cellular networks with its open, multi-vendor design and AI-driven management, aiming to enhance flexibility and reduce costs. Although it has many advantages, O-RAN is not threat-free. While previous studies have mainly examined vulnerabilities arising from O-RAN's intelligent components, this paper is the first to focus on the security challenges and vulnerabilities introduced by transitioning from single-operator to multi-operator RAN architectures. This shift increases the risk of untrusted third-party operators managing different parts of the network. To explore these vulnerabilities and their potential mitigation, we developed an open-access testbed environment that integrates a wireless network simulator with the official O-RAN Software Community (OSC) RAN intelligent component (RIC) cluster. This environment enables realistic, live data collection and serves as a platform for demonstrating APATE (adversarial perturbation against traffic efficiency), an evasion attack in which a malicious cell manipulates its reported key performance indicators (KPIs) and deceives the O-RAN traffic steering to gain unfair allocations of user equipment (UE). To ensure that O-RAN's legitimate activity continues, we introduce MARRS (monitoring adversarial RAN reports), a detection framework based on a long short-term memory (LSTM) autoencoder (AE) that learns contextual features across the network to monitor malicious telemetry (also demonstrated in our testbed). Our evaluation showed that by executing APATE, an attacker can obtain a 248.5% greater UE allocation than it was supposed to in a benign scenario. In addition, the MARRS detection method was also shown to successfully classify malicious cell activity, achieving an accuracy of 99.2% and an F1 score of 0.978.
中文:本文首次聚焦于开放无线接入网络(O-RAN)从单一运营商向多运营商架构转变所引发的安全挑战,提出了APATE攻击方法及基于长短期记忆自编码器的MARRS检测框架,后者能以99.2%的准确率有效识别恶意基站行为。
English: This paper pioneers the investigation of security vulnerabilities in Open Radio Access Network (O-RAN) arising from its transition to multi-operator architectures, introducing both APATE—an evasion attack exploiting manipulated performance metrics—and MARRS, an LSTM-based detection framework that effectively identifies malicious activities with high accuracy.

Authors:Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu
Title: Empowering Agentic Video Analytics Systems with Video Language Models
Abstract:
AI-driven video analytics has become increasingly pivotal across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Video-Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVAS, a VLM-powered system designed for open-ended, advanced video analytics. AVAS incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVAS achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVAS-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVAS-100, AVAS achieves top-tier performance with an accuracy of 75.8%.
Chinese: AVAS 是一种先进的视频分析系统,通过事件知识图谱和智能检索生成机制克服了现有视频语言模型的局限,在多个基准测试和新超长视频数据集上实现了最优性能。
English: AVAS is an advanced video analytics system that overcomes the limitations of current Video-Language Models by using Event Knowledge Graphs and an agentic retrieval-generation mechanism, achieving state-of-the-art performance on benchmarks and a new ultra-long video dataset.

Authors:Ege Özsoy, Arda Mamur, Felix Tristram, Chantal Pellegrini, Magdalena Wysocki, Benjamin Busam, Nassir Navab
Title: EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding
Abstract:
Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating advanced perception models to enhance safety and efficiency. Existing datasets either provide partial egocentric views or sparse exocentric multi-view context, but do not explore the comprehensive combination of both. We introduce EgoExOR, the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives. Spanning 94 minutes (84,553 frames at 15 FPS) of two emulated spine procedures, Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery, EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses, exocentric RGB and depth from RGB-D cameras, and ultrasound imagery. Its detailed scene graph annotations, covering 36 entities and 22 relations (568,235 triplets), enable robust modeling of clinical interactions, supporting tasks like action recognition and human-centric perception. We evaluate the surgical scene graph generation performance of two adapted state-of-the-art models and offer a new baseline that explicitly leverages EgoExOR's multimodal and multi-perspective signals. This new dataset and benchmark set a new foundation for OR perception, offering a rich, multimodal resource for next-generation clinical perception.
中文摘要:EgoExOR推出了首个融合第一人称与第三人称视角的手术室数据集,通过多模态数据和精细场景图标注,为临床交互建模提供了全面支持,奠定了新一代手术感知技术的基础。
English Summary: EgoExOR introduces the first operating room dataset combining egocentric and exocentric perspectives with multimodal data, enabling comprehensive clinical interaction modeling through detailed scene graph annotations for enhanced surgical perception.

Authors:Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang
Title: ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
Abstract:
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity's spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs' corresponding spatial comprehension capabilities.
中文摘要:视觉语言模型在全中心空间推理方面存在不足,但通过多视角数据集微调后性能显著提升,ViewSpatial-Bench基准测试验证了这一突破。
English Summary: Vision-language models struggle with allocentric spatial reasoning but show significant improvement when fine-tuned on multi-perspective datasets, as demonstrated by the new ViewSpatial-Bench benchmark.

Authors:Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
Title: Agentic 3D Scene Generation with Spatially Contextualized VLMs
Abstract:
Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications. Project page: https://spatctxvlm.github.io/project_page/.
中文: 本研究提出了一种新框架,通过整合动态空间上下文来增强视觉语言模型的三维空间推理能力,实现了面向具身AI应用的高级场景生成与交互式编辑功能。
English: This work introduces a novel framework that enhances vision-language models' capacity for 3D spatial reasoning by integrating a dynamic spatial context, enabling advanced scene generation and interactive editing for embodied AI applications.

Authors:Ye Sun, Hao Zhang, Henghui Ding, Tiehua Zhang, Xingjun Ma, Yu-Gang Jiang
Title: SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models
Abstract:
Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). Addressing this challenge requires mastering two core capabilities: video referring understanding, which captures the semantics of video regions, and video grounding, which segments object regions based on natural language descriptions. However, most existing approaches tackle these tasks in isolation, limiting progress toward unified, referentially grounded video interaction. We identify a key bottleneck in the lack of high-quality, unified video instruction data and a comprehensive benchmark for evaluating referentially grounded video chat. To address these challenges, we contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically curated to enable joint learning of video referring understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities. Finally, we establish SAMA-Bench, a meticulously designed benchmark consisting of 5,067 questions from 522 videos, to comprehensively evaluate the integrated capabilities of Video LMMs in multi-turn, spatio-temporal referring understanding and grounded dialogue. Extensive experiments and benchmarking results show that SAMA not only achieves strong performance on SAMA-Bench but also sets a new state-of-the-art on general grounding benchmarks, while maintaining highly competitive performance on standard visual understanding benchmarks.
中文:当前视频大模型在细粒度时空理解方面存在不足,为此我们构建了SAMA-239K数据集、提出具备增强定位能力的SAMA模型并建立SAMA-Bench基准,通过统一视频指代理解与定位任务实现了突破性性能。
English: Current Video LMMs struggle with fine-grained spatio-temporal understanding, prompting the development of SAMA-239K dataset, SAMA model with enhanced grounding capabilities, and SAMA-Bench benchmark to unify video referring and grounding tasks, achieving state-of-the-art performance.

Authors:Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
Title: ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts
Abstract:
Reasoning Video Object Segmentation is a challenging task that generates a mask sequence from an input video and an implicit, complex text query. Existing works approach the problem by finetuning Multimodal Large Language Models (MLLM) for segmentation-based output, but they still fall short in difficult cases on videos given temporally-sensitive queries, primarily due to the failure to integrate temporal and spatial information. In this paper, we propose ThinkVideo, a novel framework which leverages the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these challenges. Specifically, ThinkVideo uses CoT prompts to extract object selectivities associated with particular keyframes, and then bridges the reasoning image segmentation model and the SAM2 video processor to output mask sequences. The ThinkVideo framework is training-free and compatible with closed-source MLLMs, and it can be applied to Reasoning Video Instance Segmentation. We further extend the framework to online video streams, where the CoT is used to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that ThinkVideo significantly outperforms previous works in both cases, qualitatively and quantitatively.
中文: ThinkVideo是一种创新的免训练框架,利用多模态大语言模型的零样本思维链能力提取与关键帧相关的目标选择,并衔接推理图像分割模型与SAM2视频处理器以输出掩码序列,在显式和隐式查询上均显著优于现有方法。
English: ThinkVideo is a novel, training-free framework that leverages the zero-shot Chain-of-Thought capability of Multimodal Large Language Models to extract keyframe-specific object selectivities and bridges a reasoning image segmentation model with the SAM2 video processor, significantly outperforming previous methods on both explicit and implicit queries.

Authors:Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
Title: CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos
Abstract:
Reasoning Video Object Segmentation is a challenging task, aiming at generating a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLM) for the task, they still fail in video inputs given complex temporally-sensitive queries, indicating their lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these complex challenges by temporal-semantic reasoning: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses a corresponding keyframe for each object that can be observed effortlessly among all frames (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, which can be applied to Reasoning Video Instance Segmentation. Our framework's training-free feature further allows its extension to process online video streams, where the CoT is used at test time to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.
中文: CoT-RVS是一种创新的免训练框架,利用多模态大语言模型的零样本思维链能力,通过时间语义推理提升推理视频对象分割,在处理复杂查询时显著优于现有方法。
English: CoT-RVS is a novel, training-free framework that leverages the zero-shot Chain-of-Thought capability of Multimodal Large Language Models to enhance Reasoning Video Object Segmentation by integrating temporal-semantic reasoning, significantly outperforming previous methods in handling complex queries.

Authors:Haolei Xu, Yuchen Yan, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Shengpei Jiang, Kaitao Song, Weiming Lu, Jun Xiao, Yueting Zhuang
Title: Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning
Abstract:
Large language models (LLMs) have achieved remarkable progress on mathematical tasks through Chain-of-Thought (CoT) reasoning. However, existing mathematical CoT datasets often suffer from Thought Leaps due to experts omitting intermediate steps, which negatively impacts model learning and generalization. We propose the CoT Thought Leap Bridge Task, which aims to automatically detect leaps and generate missing intermediate reasoning steps to restore the completeness and coherence of CoT. To facilitate this, we constructed a specialized training dataset called ScaleQM+, based on the structured ScaleQuestMath dataset, and trained CoT-Bridge to bridge thought leaps. Through comprehensive experiments on mathematical reasoning benchmarks, we demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on original datasets, with improvements of up to +5.87% on NuminaMath. Our approach effectively enhances distilled data (+3.02%) and provides better starting points for reinforcement learning (+3.1%), functioning as a plug-and-play module compatible with existing optimization techniques. Furthermore, CoT-Bridge demonstrates improved generalization to out-of-domain logical reasoning tasks, confirming that enhancing reasoning completeness yields broadly applicable benefits.
Chinese: 本文提出思维跳跃桥接任务,通过自动检测并补全数学思维链中缺失的推理步骤,实验证明基于桥接数据训练的模型在数学推理和逻辑任务中均获得显著性能提升与泛化能力增强。
English: This paper introduces the CoT Thought Leap Bridge Task to automatically identify and fill missing reasoning steps in mathematical Chain-of-Thought datasets, demonstrating through experiments that models trained with these bridged datasets achieve superior performance and generalization across mathematical and logical reasoning tasks.

Authors:Yanzhe Wen, Xunkai Li, Qi Zhang, Zhu Lei, Guang Zeng, Rong-Hua Li, Guoren Wang
Title: When LLMs meet open-world graph learning: a new perspective for unlabeled data uncertainty
Abstract:
Recently, large language models (LLMs) have significantly advanced text-attributed graph (TAG) learning. However, existing methods inadequately handle data uncertainty in open-world scenarios, especially concerning limited labeling and unknown-class nodes. Prior solutions typically rely on isolated semantic or structural approaches for unknown-class rejection, lacking effective annotation pipelines. To address these limitations, we propose Open-world Graph Assistant (OGA), an LLM-based framework that combines adaptive label traceability, which integrates semantics and topology for unknown-class rejection, and a graph label annotator to enable model updates using newly annotated nodes. Comprehensive experiments demonstrate OGA's effectiveness and practicality.
中文摘要:提出的开放世界图助手(OGA)框架通过融合语义-拓扑的未知类别识别与自动标注功能,有效解决了文本属性图在开放环境中的数据不确定性问题,实验验证了其优越性能。
English Summary: The proposed Open-world Graph Assistant (OGA) framework addresses data uncertainty in text-attributed graphs by integrating semantic-topological unknown-class rejection and automated annotation capabilities, demonstrating strong experimental performance.

Authors:Yixu Wang, Jiaxin Song, Yifeng Gao, Xin Wang, Yang Yao, Yan Teng, Xingjun Ma, Yingchun Wang, Yu-Gang Jiang
Title: SafeVid: Toward Safety Aligned Video Large Multimodal Models
Abstract:
As Video Large Multimodal Models (VLMMs) rapidly advance, their inherent complexity introduces significant safety challenges, particularly the issue of mismatched generalization where static safety alignments fail to transfer to dynamic video contexts. We introduce SafeVid, a framework designed to instill video-specific safety principles in VLMMs. SafeVid uniquely transfers robust textual safety alignment capabilities to the video domain by employing detailed textual video descriptions as an interpretive bridge, facilitating LLM-based rule-driven safety reasoning. This is achieved through a closed-loop system comprising: 1) generation of SafeVid-350K, a novel 350,000-pair video-specific safety preference dataset; 2) targeted alignment of VLMMs using Direct Preference Optimization (DPO); and 3) comprehensive evaluation via our new SafeVidBench benchmark. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements (e.g., up to 42.39%) on SafeVidBench. SafeVid provides critical resources and a structured approach, demonstrating that leveraging textual descriptions as a conduit for safety reasoning markedly improves the safety alignment of VLMMs. We have made SafeVid-350K dataset (https://huggingface.co/datasets/yxwang/SafeVid-350K) publicly available.
中文摘要:SafeVid框架通过将文本安全对齐能力迁移到视频领域,利用详细视频描述作为桥梁,结合35万对视频安全偏好数据集和直接偏好优化,显著提升了视频大模型的安全性。
English Summary: SafeVid is a framework that enhances video large multimodal models' safety by transferring textual safety alignments to video contexts through detailed descriptions, utilizing a 350,000-pair dataset and Direct Preference Optimization to achieve significant safety improvements.
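
The abstract names Direct Preference Optimization (DPO) as the alignment step. As a rough illustration of what that objective computes for one preference pair, here is a minimal sketch of the standard DPO loss. The function name, argument names, and the `beta=0.1` default are assumptions made for illustration; the abstract does not state SafeVid's hyperparameters or training code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under a frozen reference model. `beta` is an assumed
    illustrative value, not one taken from the SafeVid abstract.
    """
    # How much more the policy prefers the chosen response than the
    # reference model does, relative to the rejected response.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# With no preference margin the loss equals log(2); it shrinks as the
# policy favors the chosen response more strongly than the reference does.
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In a dataset such as SafeVid-350K, each preference pair would contribute one such term, averaged over a batch.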

Authors:Quoc-Huy Trinh, Minh-Van Nguyen, Jung Zeng, Ulas Bagci, Debesh Jha
Title: PRS-Med: Position Reasoning Segmentation with Vision-Language Model in Medical Imaging
Abstract:
Recent advancements in prompt-based medical image segmentation have enabled clinicians to identify tumors using simple input like bounding boxes or text prompts. However, existing methods face challenges when doctors need to interact through natural language or when position reasoning is required - understanding spatial relationships between anatomical structures and pathologies. We present PRS-Med, a framework that integrates vision-language models with segmentation capabilities to generate both accurate segmentation masks and corresponding spatial reasoning outputs. Additionally, we introduce the MMRS dataset (Multimodal Medical in Positional Reasoning Segmentation), which provides diverse, spatially-grounded question-answer pairs to address the lack of position reasoning data in medical imaging. PRS-Med demonstrates superior performance across six imaging modalities (CT, MRI, X-ray, ultrasound, endoscopy, RGB), significantly outperforming state-of-the-art methods in both segmentation accuracy and position reasoning. Our approach enables intuitive doctor-system interaction through natural language, facilitating more efficient diagnoses. Our dataset pipeline, model, and codebase will be released to foster further research in spatially-aware multimodal reasoning for medical applications.
中文:PRS-Med是一种创新框架,融合视觉语言模型与分割技术,能生成精确的掩码和空间推理,并通过MMRS数据集支持,显著提升多模态医学影像的分析能力。
English: PRS-Med is a novel framework that combines vision-language models with segmentation to produce precise masks and spatial reasoning, supported by the MMRS dataset to enhance medical imaging analysis across multiple modalities.

Authors:Xing Hu, Feifei Niu, Junkai Chen, Xin Zhou, Junwei Zhang, Junda He, Xin Xia, David Lo
Title: Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks
Abstract:
Large language models (LLMs) are gaining increasing popularity in software engineering (SE) due to their unprecedented performance across various applications. These models are increasingly being utilized for a range of SE tasks, including requirements engineering and design, code analysis and generation, software maintenance, and quality assurance. As LLMs become more integral to SE, evaluating their effectiveness is crucial for understanding their potential in this field. In recent years, substantial efforts have been made to assess LLM performance in various SE tasks, resulting in the creation of several benchmarks tailored to this purpose. This paper offers a thorough review of 291 benchmarks, addressing three main aspects: what benchmarks are available, how benchmarks are constructed, and the future outlook for these benchmarks. We begin by examining SE tasks such as requirements engineering and design, coding assistant, software testing, AIOPs, software maintenance, and quality management. We then analyze the benchmarks and their development processes, highlighting the limitations of existing benchmarks. Additionally, we discuss the successes and failures of LLMs in different software tasks and explore future opportunities and challenges for SE-related benchmarks. We aim to provide a comprehensive overview of benchmark research in SE and offer insights to support the creation of more effective evaluation tools.
中文: 本文系统评述了291个用于评估大语言模型在软件工程中应用的基准,分析其构建方法、现有局限及未来挑战,旨在为开发更有效的评估工具提供见解。
English: This paper provides a comprehensive review of 291 benchmarks for evaluating large language models in software engineering, analyzing their construction, limitations, and future challenges to guide more effective evaluation tools.

Authors:Lu Dong, Haiyu Zhang, Hongjie Zhang, Yifei Huang, Zhen-Hua Ling, Yu Qiao, Limin Wang, Yali Wang
Title: Weakly Supervised Temporal Sentence Grounding via Positive Sample Mining
Abstract:
The task of weakly supervised temporal sentence grounding (WSTSG) aims to detect temporal intervals corresponding to a language description from untrimmed videos with only video-level video-language correspondence. For an anchor sample, most existing approaches generate negative samples either from other videos or within the same video for contrastive learning. However, some training samples are highly similar to the anchor sample; directly regarding them as negative samples makes optimization difficult and ignores the correlations between these similar samples and the anchor sample. To address this, we propose Positive Sample Mining (PSM), a novel framework that mines positive samples from the training set to provide more discriminative supervision. Specifically, for a given anchor sample, we partition the remaining training set into semantically similar and dissimilar subsets based on the similarity of their text queries. To effectively leverage these correlations, we introduce a PSM-guided contrastive loss to ensure that the anchor proposal is closer to similar samples and further from dissimilar ones. Additionally, we design a PSM-guided rank loss to ensure that similar samples are closer to the anchor proposal than to the negative intra-video proposal, aiming to distinguish the anchor proposal and the negative intra-video proposal. Experiments on the WSTSG and grounded VideoQA tasks demonstrate the effectiveness and superiority of our method.
Chinese: 提出的正样本挖掘(PSM)框架通过从训练集中挖掘语义相似的样本,并采用对比损失和排序损失来增强锚点提案与不相似或负样本之间的区分能力,从而改进了弱监督时序语句定位任务。
English: The proposed Positive Sample Mining (PSM) framework enhances weakly supervised temporal sentence grounding by mining semantically similar samples from the training set and employing contrastive and rank losses to improve discrimination between anchor proposals and dissimilar or negative samples.
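
The partitioning step described in the abstract (splitting the remaining training set into similar and dissimilar subsets by text-query similarity) can be sketched as follows. This is only an illustration: token-level Jaccard similarity and the `threshold=0.5` cutoff are stand-in assumptions, since the abstract does not say which similarity measure or threshold PSM uses, and the function names are hypothetical.

```python
def jaccard(query_a, query_b):
    """Token-set Jaccard similarity (an assumed stand-in similarity measure)."""
    tokens_a = set(query_a.lower().split())
    tokens_b = set(query_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def partition_by_query(anchor_query, other_queries, threshold=0.5):
    """Split sample indices into (similar, dissimilar) relative to the anchor.

    `threshold` is a hypothetical cutoff chosen for illustration only.
    """
    similar, dissimilar = [], []
    for index, query in enumerate(other_queries):
        if jaccard(anchor_query, query) >= threshold:
            similar.append(index)
        else:
            dissimilar.append(index)
    return similar, dissimilar


similar, dissimilar = partition_by_query(
    "a person opens the door",
    ["the person opens a door", "a dog runs on the beach"],
)
```

Samples in the similar subset would then be treated as positives rather than negatives in the contrastive and rank losses the abstract describes.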

Authors:Siqi Wang, Xing Hu, Xin Xia, Xinyu Wang
Title: ActRef: Enhancing the Understanding of Python Code Refactoring with Action-Based Analysis
Abstract:
Refactoring, the process of improving the code structure of a software system without altering its behavior, is crucial for managing code evolution in software development. Identifying refactoring actions in source code is essential for understanding software evolution and guiding developers in maintaining and improving the code quality. This study presents an action-based Refactoring Analysis Framework named ActRef, a novel algorithm designed to advance the detection and understanding of Python refactorings through a unique action-based analysis of code changes. ActRef mines multiple refactoring types (e.g., move, rename, extract, and inline operations) based on diff actions, covering multiple granularity levels, including variable, method, class, and module levels. By focusing on the code change actions, ActRef provides a Python-adaptive solution to detect intricate refactoring patterns. Our evaluation was conducted on 1,914 manually validated refactoring instances from 136 open-source Python projects. The results show that ActRef achieves high precision (0.80) and recall (0.92), effectively identifying multiple refactoring types. Compared with leading baselines, including PyRef, PyRef with MLRefScanner, DeepSeek-R1 and ChatGPT-4, ActRef consistently demonstrates superior performance in detecting Python refactorings across various types. While matching PyRef in runtime efficiency, ActRef supports a broader spectrum of refactoring types and more refactoring mining levels. ActRef offers an effective and scalable approach for mining refactorings in dynamic Python codebases and introduces a new perspective on understanding code.
中文: 本研究提出的ActRef重构分析框架通过基于代码变更行为的分析,能够有效检测多种Python重构类型,在保持高准确率和召回率的同时,其检测性能优于现有工具。
English: This study introduces ActRef, a refactoring analysis framework that detects various Python refactoring types through code change actions, achieving high precision and recall while outperforming existing tools in detection capabilities.

Authors:Xunkai Li, Zhengyu Wu, Kaichi Yu, Hongchao Qin, Guang Zeng, Rong-Hua Li, Guoren Wang
Title: Toward Data-centric Directed Graph Learning: An Entropy-driven Approach
Abstract:
The directed graph (digraph), as a generalization of undirected graphs, exhibits superior representation capability in modeling complex topology systems and has garnered considerable attention in recent years. Despite the notable efforts made by existing DiGraph Neural Networks (DiGNNs) to leverage directed edges, they still fail to comprehensively delve into the abundant data knowledge concealed in the digraphs. This data-level limitation results in model-level sub-optimal predictive performance and underscores the necessity of further exploring the potential correlations between the directed edges (topology) and node profiles (feature and labels) from a data-centric perspective, thereby empowering model-centric neural networks with stronger encoding capabilities. In this paper, we propose Entropy-driven Digraph knowlEdge distillatioN (EDEN), which can serve as a data-centric digraph learning paradigm or a model-agnostic hot-and-plug data-centric Knowledge Distillation (KD) module. The core idea is to achieve data-centric ML, guided by our proposed hierarchical encoding theory for structured data. Specifically, EDEN first utilizes directed structural measurements from a topology perspective to construct a coarse-grained Hierarchical Knowledge Tree (HKT). Subsequently, EDEN quantifies the mutual information of node profiles to refine knowledge flow in the HKT, enabling data-centric KD supervision within model training. As a general framework, EDEN can also naturally extend to undirected scenarios and demonstrate satisfactory performance. In our experiments, EDEN has been widely evaluated on 14 (di)graph datasets (homophily and heterophily) and across 4 downstream tasks. The results demonstrate that EDEN attains SOTA performance and exhibits strong improvement for prevalent (Di)GNNs.
中文: 有向图在复杂系统建模中具有优越表现,但现有DiGNN方法未能充分挖掘其数据潜力;本文提出的EDEN框架通过熵驱动的知识蒸馏,构建层次化知识树实现数据中心化学习,显著提升了模型性能。
English: Directed graphs offer enhanced representation for complex systems, but current DiGNNs fail to fully exploit their data potential, prompting the proposed EDEN framework that uses entropy-driven knowledge distillation to boost model performance through hierarchical data encoding.

Authors:Ehtesamul Azim, Dongjie Wang, Tae Hyun Hwang, Yanjie Fu, Wei Zhang
Title: Biological Pathway Guided Gene Selection Through Collaborative Reinforcement Learning
Abstract:
Gene selection in high-dimensional genomic data is essential for understanding disease mechanisms and improving therapeutic outcomes. Traditional feature selection methods effectively identify predictive genes but often ignore complex biological pathways and regulatory networks, leading to unstable and biologically irrelevant signatures. Prior approaches, such as Lasso-based methods and statistical filtering, either focus solely on individual gene-outcome associations or fail to capture pathway-level interactions, presenting a key challenge: how to integrate biological pathway knowledge while maintaining statistical rigor in gene selection? To address this gap, we propose a novel two-stage framework that integrates statistical selection with biological pathway knowledge using multi-agent reinforcement learning (MARL). First, we introduce a pathway-guided pre-filtering strategy that leverages multiple statistical methods alongside KEGG pathway information for initial dimensionality reduction. Next, for refined selection, we model genes as collaborative agents in a MARL framework, where each agent optimizes both predictive power and biological relevance. Our framework incorporates pathway knowledge through Graph Neural Network-based state representations, a reward mechanism combining prediction performance with gene centrality and pathway coverage, and collaborative learning strategies using shared memory and a centralized critic component. Extensive experiments on multiple gene expression datasets demonstrate that our approach significantly improves both prediction accuracy and biological interpretability compared to traditional methods.
Chinese Summary: 本研究提出了一种新颖的两阶段基因选择框架,通过多智能体强化学习将统计方法与生物通路知识相结合,显著提高了高维基因组数据的预测准确性和生物学可解释性。
English Summary: This study introduces a novel two-stage gene selection framework that integrates statistical methods with biological pathway knowledge using multi-agent reinforcement learning, significantly enhancing both prediction accuracy and biological interpretability in high-dimensional genomic data.

Authors:ChengAo Shen, Wenchao Yu, Ziming Zhao, Dongjin Song, Wei Cheng, Haifeng Chen, Jingchao Ni
Title: Multi-Modal View Enhanced Large Vision Models for Long-Term Time Series Forecasting
Abstract:
Time series, typically represented as numerical sequences, can also be transformed into images and texts, offering multi-modal views (MMVs) of the same underlying signal. These MMVs can reveal complementary patterns and enable the use of powerful pre-trained large models, such as large vision models (LVMs), for long-term time series forecasting (LTSF). However, as we identified in this work, applying LVMs to LTSF poses an inductive bias towards "forecasting periods". To harness this bias, we propose DMMV, a novel decomposition-based multi-modal view framework that leverages trend-seasonal decomposition and a novel backcast residual based adaptive decomposition to integrate MMVs for LTSF. Comparative evaluations against 14 state-of-the-art (SOTA) models across diverse datasets show that DMMV outperforms single-view and existing multi-modal baselines, achieving the best mean squared error (MSE) on 6 out of 8 benchmark datasets.
中文: 本研究提出DMMV,一种基于分解的多模态框架,将时间序列转化为图像和文本,利用预训练大视觉模型进行长期预测,有效克服了归纳偏差,并在多数基准数据集上超越了14种现有先进模型。
English: The study introduces DMMV, a decomposition-based multi-modal framework that integrates time series as images and texts to leverage pre-trained large vision models for long-term forecasting, effectively addressing inductive bias and outperforming 14 state-of-the-art models on most benchmarks.

Authors:Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang
Title: SlimLLM: Accurate Structured Pruning for Large Language Models
Abstract:
Large language models (LLMs) have garnered significant attention and demonstrated impressive capabilities in a wide range of applications. However, due to their enormous computational costs, the deployment and application of LLMs are often severely limited. To address this issue, structured pruning is an effective solution to compress the parameters of LLMs. Determining the importance of each sub-module in LLMs and minimizing performance loss are critical issues that need to be carefully addressed in structured pruning. In this paper, we propose an effective and fast structured pruning method named SlimLLM for large language models. For channel and attention head pruning, we evaluate the importance based on the entire channel or head, rather than merely aggregating the importance of individual elements within a sub-module. This approach enables a more holistic consideration of the interdependence among elements within the sub-module. In addition, we design a simple linear regression strategy for the output matrix to quickly recover performance. We also propose a layer-based importance ratio to determine the pruning ratio for each layer. Based on the LLaMA benchmark results, our SlimLLM outperforms other methods and achieves state-of-the-art performance.
Chinese: 本文提出SlimLLM,一种针对大型语言模型的高效结构化剪枝方法,通过整体评估子模块重要性并采用线性回归策略来减少性能损失,在LLaMA基准测试中取得了领先成果。
English: The paper introduces SlimLLM, an efficient structured pruning method for large language models that holistically assesses sub-module importance and employs a linear regression strategy to minimize performance loss, achieving state-of-the-art results on the LLaMA benchmark.

Authors:Haoran Wang, Guanyu Chen, Bohan Li, Hankun Wang, Yiwei Guo, Zhihan Li, Xie Chen, Kai Yu
Title: Towards General Discrete Speech Codec for Complex Acoustic Environments: A Study of Reconstruction and Downstream Task Consistency
Abstract:
Neural speech codecs excel in reconstructing clean speech signals; however, their efficacy in complex acoustic environments and downstream signal processing tasks remains underexplored. In this study, we introduce a novel benchmark named Environment-Resilient Speech Codec Benchmark (ERSB) to systematically evaluate whether neural speech codecs are environment-resilient. Specifically, we assess two key capabilities: (1) robust reconstruction, which measures the preservation of both speech and non-speech acoustic details, and (2) downstream task consistency, which ensures minimal deviation in downstream signal processing tasks when using reconstructed speech instead of the original. Our comprehensive experiments reveal that complex acoustic environments significantly degrade signal reconstruction and downstream task consistency. This work highlights the limitations of current speech codecs and raises a future direction that improves them for greater environmental resilience.
Chinese: 本研究提出环境鲁棒性语音编解码器基准(ERSB),用于评估神经语音编解码器在复杂声学环境中的表现,发现信号重建与下游任务一致性均显著下降。
English: The study introduces the Environment-Resilient Speech Codec Benchmark (ERSB) to evaluate neural speech codecs' performance in complex acoustic settings, revealing significant degradation in signal reconstruction and downstream task consistency.

Authors:Hanting Chen, Yasheng Wang, Kai Han, Dong Li, Lin Li, Zhenni Bi, Jinpeng Li, Haoyu Wang, Fei Mi, Mingjian Zhu, Bin Wang, Kaikai Song, Yifei Fu, Xu He, Yu Luo, Chong Zhu, Quan He, Xueyu Wu, Wei He, Hailin Hu, Yehui Tang, Dacheng Tao, Xinghao Chen, Yunhe Wang
Title: Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition
Abstract:
This work presents Pangu Embedded, an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs), featuring flexible fast and slow thinking capabilities. Pangu Embedded addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. We propose a two-stage training framework for its construction. In Stage 1, the model is finetuned via an iterative distillation process, incorporating inter-iteration model merging to effectively aggregate complementary knowledge. This is followed by reinforcement learning on Ascend clusters, optimized by a latency-tolerant scheduler that combines stale synchronous parallelism with prioritized data queues. The RL process is guided by a Multi-source Adaptive Reward System (MARS), which generates dynamic, task-specific reward signals using deterministic metrics and lightweight LLM evaluators for mathematics, coding, and general problem-solving tasks. Stage 2 introduces a dual-system framework, endowing Pangu Embedded with a "fast" mode for routine queries and a deeper "slow" mode for complex inference. This framework offers both manual mode switching for user control and an automatic, complexity-aware mode selection mechanism that dynamically allocates computational resources to balance latency and reasoning depth. Experimental results on benchmarks including AIME 2024, GPQA, and LiveCodeBench demonstrate that Pangu Embedded, with 7B parameters, outperforms similar-size models like Qwen3-8B and GLM4-9B. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture, highlighting a promising direction for developing powerful yet practically deployable LLM reasoners.
中文:盘古嵌入式是一款高效的大型语言模型推理器,具备快慢双思维模式,通过两阶段训练框架在计算效率与推理质量上取得平衡,在多项基准测试中超越同规模模型。
English: Pangu Embedded is an efficient LLM reasoner with dual fast and slow thinking modes, achieving superior performance on benchmarks through a two-stage training framework that optimizes computational efficiency and reasoning quality.

Authors:Zhonghao Lyu, Yulan Gao, Junting Chen, Hongyang Du, Jie Xu, Kaibin Huang, Dong In Kim
Title: Empowering Intelligent Low-altitude Economy with Large AI Model Deployment
Abstract:
Low-altitude economy (LAE) represents an emerging economic paradigm that redefines commercial and social aerial activities. Large artificial intelligence models (LAIMs) offer transformative potential to further enhance the intelligence of LAE services. However, deploying LAIMs in LAE poses several challenges, including the significant gap between their computational/storage demands and the limited onboard resources of LAE entities, the mismatch between lab-trained LAIMs and dynamic physical environments, and the inefficiencies of traditional decoupled designs for sensing, communication, and computation. To address these issues, we first propose a hierarchical system architecture tailored for LAIM deployment and present representative LAE application scenarios. Next, we explore key enabling techniques that facilitate the mutual co-evolution of LAIMs and low-altitude systems, and introduce a task-oriented execution pipeline for scalable and adaptive service delivery. Then, the proposed framework is validated through real-world case studies. Finally, we outline open challenges to inspire future research.
Chinese: 低空经济作为一种新兴经济形态,通过大型人工智能模型提升智能化水平,但面临计算资源不足与环境适配等挑战,为此提出分层系统架构与协同演进技术,以实现可扩展的自适应服务。
English: The low-altitude economy is an emerging paradigm enhanced by large AI models, yet faces challenges like computational demands and environmental adaptability, which are addressed through a proposed hierarchical architecture and enabling techniques for scalable services.

Authors:Yifan Lu, Jing Li, Yigeng Zhou, Yihui Zhang, Wenya Wang, Xiucheng Li, Meishan Zhang, Fangming Liu, Jun Yu, Min Zhang
Title: Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing
Abstract:
Large language models (LLMs) exhibit impressive language capabilities but remain vulnerable to malicious prompts and jailbreaking attacks. Existing knowledge editing methods for LLM detoxification face two major challenges. First, they often rely on entity-specific localization, making them ineffective against adversarial inputs without explicit entities. Second, these methods suffer from over-editing, where detoxified models reject legitimate queries, compromising overall performance. In this paper, we propose ToxEdit, a toxicity-aware knowledge editing approach that dynamically detects toxic activation patterns during forward propagation. It then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively. This design ensures precise toxicity mitigation while preserving LLMs' general capabilities. To more accurately assess over-editing, we also enhance the SafeEdit benchmark by incorporating instruction-following evaluation tasks. Experimental results on multiple LLMs demonstrate that our ToxEdit outperforms previous state-of-the-art methods in both detoxification performance and safeguarding general capabilities of LLMs.
Chinese: ToxEdit提出了一种毒性感知的知识编辑方法,通过动态检测毒性激活模式并自适应调整层间计算路径,在有效降低毒性的同时保持大语言模型的通用能力,在解毒效果和性能维护上均优于现有方法。
English: ToxEdit introduces a toxicity-aware knowledge editing method that dynamically identifies toxic activations and adaptively routes computations to effectively mitigate toxicity while preserving the general capabilities of large language models, outperforming existing approaches in both detoxification and maintaining performance.

Authors:Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen
Title: VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining
Abstract:
Automatic speech recognition (ASR) has made remarkable progress but heavily relies on large-scale labeled data, which is scarce for low-resource languages like Vietnamese. While existing systems such as Whisper, USM, and MMS achieve promising performance, their efficacy remains inadequate in terms of training costs, latency, and accessibility. To address these issues, we propose VietASR, a novel ASR training pipeline that leverages vast amounts of unlabeled data and a small set of labeled data. Through multi-iteration ASR-biased self-supervised learning on a large-scale unlabeled dataset, VietASR offers a cost-effective and practical solution for enhancing ASR performance. Experiments demonstrate that pre-training on 70,000-hour unlabeled data and fine-tuning on merely 50-hour labeled data yield a lightweight but powerful ASR model. It outperforms Whisper Large-v3 and commercial ASR systems on real-world data. Our code and models will be open-sourced to facilitate research in low-resource ASR.
Chinese: VietASR提出了一种利用大量未标记数据和少量标记数据进行自监督学习的低成本ASR训练流程,在越南语识别中性能超越Whisper等系统,并将开源以促进低资源ASR研究。
English: VietASR introduces a cost-effective ASR pipeline using self-supervised learning on extensive unlabeled data and minimal labeled data, outperforming existing systems like Whisper in Vietnamese while being open-sourced for research.

Authors:Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, Fei Mi, Xiaojun Meng, Zhicheng Liu, Hanting Chen, Binfan Zheng, Can Chen, Youliang Yan, Ruiming Tang, Peifeng Qin, Xinghao Chen, Dacheng Tao, Yunhe Wang
Title: Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity
Abstract:
The emergence of Mixture of Experts (MoE) in Large Language Models promises a small price of execution cost for a much larger model parameter count and learning capacity, because only a small fraction of parameters are activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when running the experts on different devices in parallel. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and inherently balances the expert workload better than MoE. It constrains tokens to activate an equal number of experts within each predefined expert group. When a model execution is distributed on multiple devices, this architectural design ensures a balanced computational load across devices, significantly enhancing throughput, particularly for the inference phase. Further, we build Pangu Pro MoE on Ascend NPUs, a sparse model based on MoGE with 72 billion total parameters, 16 billion of which are activated for each token. The configuration of Pangu Pro MoE is optimized for Ascend 300I Duo and 800I A2 through extensive system simulation studies. Our experiments indicate that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. The inference performance of Pangu Pro MoE achieves 1148 tokens/s per card and can be further improved to 1528 tokens/s per card by speculative acceleration, outperforming comparable 32B and 72B Dense models. Furthermore, we achieve an excellent cost-to-performance ratio for model inference on Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization to make it a leading model within the sub-100B total parameter class, outperforming prominent open-source models like GLM-Z1-32B and Qwen3-32B.
Chinese: 分组专家混合(MoGE)通过将专家分组并确保组内均衡激活,改进了大型语言模型中的专家负载平衡,在昇腾NPU上显著提升了计算效率和吞吐量,Pangu Pro MoE模型的表现验证了其优越性。
English: Mixture of Grouped Experts (MoGE) improves expert load balancing in large language models by grouping experts and ensuring equal activation within groups, enhancing computational efficiency and throughput on Ascend NPUs, as demonstrated by the Pangu Pro MoE model's superior performance.
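
The routing constraint described in the abstract (each token activates an equal number of experts within each predefined group) can be contrasted with plain global top-k selection in a small sketch. The function names, the contiguous equal-size grouping, and the use of raw router scores are assumptions for illustration; the abstract does not specify how Pangu Pro MoE forms groups or normalizes scores.

```python
def global_top_k(scores, k):
    """Conventional MoE selection: the k highest-scoring experts overall."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def grouped_top_k(scores, num_groups, k_per_group):
    """Grouped selection: pick k_per_group experts inside every group.

    Experts are assumed to be split into contiguous, equally sized groups
    (for example, one group per device); this layout is hypothetical.
    """
    num_experts = len(scores)
    if num_experts % num_groups != 0:
        raise ValueError("experts must divide evenly into groups")
    group_size = num_experts // num_groups
    if k_per_group > group_size:
        raise ValueError("k_per_group cannot exceed the group size")
    selected = []
    for group in range(num_groups):
        start = group * group_size
        members = range(start, start + group_size)
        best = sorted(members, key=lambda i: scores[i], reverse=True)
        selected.extend(best[:k_per_group])
    return sorted(selected)


# Router scores for one token over 8 experts in 2 groups of 4.
scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.3, 0.4]
unbalanced = global_top_k(scores, 2)      # both picks fall in group 0
balanced = grouped_top_k(scores, 2, 1)    # one pick per group
```

With global top-k, both activated experts land in the first group, whereas the grouped variant always spreads the work evenly, which is the load-balancing property the abstract attributes to MoGE.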

Authors:Peiming Guo, Meishan Zhang, Jianling Li, Min Zhang, Yue Zhang
Title: Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing
Abstract:
Cross-domain constituency parsing is still an unsolved challenge in computational linguistics since the available multi-domain constituency treebank is limited. We investigate automatic treebank generation by large language models (LLMs) in this paper. The performance of LLMs on constituency parsing is poor; therefore, we propose a novel treebank generation method, LLM back generation, which is similar to the reverse process of constituency parsing. LLM back generation takes the incomplete cross-domain constituency tree with only domain keyword leaf nodes as input and fills the missing words to generate the cross-domain constituency treebank. We also introduce a span-level contrastive learning pre-training strategy to make full use of the LLM back generation treebank for cross-domain constituency parsing. We verify the effectiveness of our LLM back generation treebank coupled with contrastive learning pre-training on five target domains of MCTB. Experimental results show that our approach achieves state-of-the-art performance on average results compared with various baselines.
中文摘要:本文提出了一种新颖的LLM反向生成方法,通过补全仅含领域关键词叶节点的部分树来自动生成跨领域选区树库,并结合对比学习预训练策略提升解析性能,在多个目标领域实现了最先进的平均结果。
English Summary: This paper introduces a novel LLM back generation method to automatically create cross-domain constituency treebanks by completing partial trees with domain keywords, combined with contrastive learning to enhance parsing performance, achieving state-of-the-art results across multiple domains.

Authors:Hanting Chen, Jiarui Qin, Jialong Guo, Tao Yuan, Yichun Yin, Huiling Zhen, Yasheng Wang, Jinpeng Li, Xiaojun Meng, Meng Zhang, Rongju Ruan, Zheyuan Bai, Yehui Tang, Can Chen, Xinghao Chen, Fisher Yu, Ruiming Tang, Yunhe Wang
Title: Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLMs
Abstract:
Large Language Models (LLMs) deliver state-of-the-art capabilities across numerous tasks, but their immense size and inference costs pose significant computational challenges for practical deployment. While structured pruning offers a promising avenue for model compression, existing methods often struggle with the detrimental effects of aggressive, simultaneous width and depth reductions, leading to substantial performance degradation. This paper argues that a critical, often overlooked, aspect in making such aggressive joint pruning viable is the strategic re-initialization and adjustment of remaining weights to improve the model post-pruning training accuracies. We introduce Pangu Light, a framework for LLM acceleration centered around structured pruning coupled with novel weight re-initialization techniques designed to address this ``missing piece''. Our framework systematically targets multiple axes, including model width, depth, attention heads, and RMSNorm, with its effectiveness rooted in novel re-initialization methods like Cross-Layer Attention Pruning (CLAP) and Stabilized LayerNorm Pruning (SLNP) that mitigate performance drops by providing the network a better training starting point. Further enhancing efficiency, Pangu Light incorporates specialized optimizations such as absorbing Post-RMSNorm computations and tailors its strategies to Ascend NPU characteristics. The Pangu Light models consistently exhibit a superior accuracy-efficiency trade-off, outperforming prominent baseline pruning methods like Nemotron and established LLMs like Qwen3 series. For instance, on Ascend NPUs, Pangu Light-32B's 81.6 average score and 2585 tokens/s throughput exceed Qwen3-32B's 80.9 average score and 2225 tokens/s.
Chinese: 本文提出了盘古Light框架,通过结构化剪枝与创新的权重重新初始化技术,有效缓解了大型语言模型压缩时的性能损失,在精度与效率的平衡上超越了现有基准方法。
English: This paper introduces Pangu Light, a structured pruning framework for Large Language Models that enhances post-compression performance through strategic weight re-initialization techniques, achieving superior accuracy-efficiency trade-offs compared to existing methods.

Authors:Hanglei Zhang, Yiwei Guo, Zhihan Li, Xiang Hao, Xie Chen, Kai Yu
Title: Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate
Abstract:
Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a Constant Frame Rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent intervals versus voiced regions). This property makes CFR not optimal in terms of bitrate and token sequence length, hindering efficiency in real-time applications. In this work, we propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly tunable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal reconstruction quality with high flexibility, and maintains competitive performance even at lower frame rates. Our approach is promising for the integration with other efforts to develop low-frame-rate neural speech codecs for more efficient downstream tasks.
中文: 本文提出时间灵活编码技术,首次将可变帧率引入神经语音编解码器,通过基于时间熵的动态帧率分配,在较低帧率下仍保持优异的重建质量和灵活性。
English: This paper introduces Temporally Flexible Coding (TFC), a novel technique that incorporates variable frame rate into neural speech codecs to dynamically adjust frame rates based on temporal entropy, achieving superior reconstruction quality and flexibility even at lower frame rates.

Authors:Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Title: VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models
Abstract:
The rapid advancement of large language models (LLMs) has accelerated the development of multimodal models capable of speech communications. Unlike text interactions, speech conveys diverse information, including acoustic variations, paralanguage cues, and environmental context. However, existing evaluations of speech interaction models lack instances mimicking real scenarios and predominantly focus on the quality of their textual responses, overlooking critical aspects of vocal performance. To address this gap, we propose VocalBench, a comprehensive benchmark to assess the speech conversational abilities, comprising 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. It covers a broad range of fundamental skills essential for effective vocal interactions. For the evaluation scheme, we propose several objective evaluation indicators and incorporate an additional LLM-as-a-judge approach to score open-ended questions. Experimental results on 15 mainstream systems reveal significant variability, each exhibiting distinct strengths and weaknesses, and provide valuable insights to guide future research in speech interaction systems.
中文摘要:本研究提出了VocalBench这一包含9,400个实例的综合性基准,通过四个关键维度评估语音交互模型,弥补了现有评估的不足,并采用客观指标与LLM评判相结合的方法,揭示了15个主流系统存在的显著性能差异。
English Summary: The study introduces VocalBench, a comprehensive benchmark with 9,400 instances to evaluate speech interaction models across four dimensions, addressing limitations in current assessments and revealing significant performance variations among 15 mainstream systems through objective metrics and LLM-based evaluation.

Authors:Shiyao Cui, Qinglin Zhang, Xuan Ouyang, Renmiao Chen, Zhexin Zhang, Yida Lu, Hongning Wang, Han Qiu, Minlie Huang
Title: ShieldVLM: Safeguarding the Multimodal Implicit Toxicity via Deliberative Reasoning with LVLMs
Abstract:
Toxicity detection in multimodal text-image content faces growing challenges, especially with multimodal implicit toxicity, where each modality appears benign on its own but conveys hazard when combined. Multimodal implicit toxicity appears not only as formal statements on social platforms but also as prompts that can lead to toxic dialogs from Large Vision-Language Models (LVLMs). Despite the success of unimodal text or image moderation, toxicity detection for multimodal content, particularly multimodal implicit toxicity, remains underexplored. To fill this gap, we comprehensively build a taxonomy for multimodal implicit toxicity (MMIT) and introduce an MMIT-dataset, comprising 2,100 multimodal statements and prompts across 7 risk categories (31 sub-categories) and 5 typical cross-modal correlation modes. To advance the detection of multimodal implicit toxicity, we build ShieldVLM, a model which identifies implicit toxicity in multimodal statements, prompts and dialogs via deliberative cross-modal reasoning. Experiments show that ShieldVLM outperforms existing strong baselines in detecting both implicit and explicit toxicity. The model and dataset will be publicly available to support future research. Warning: This paper contains potentially sensitive content.
中文: 本研究针对多模态文本-图像内容中的隐性毒性检测难题,构建了系统的分类体系、包含2100个样本的数据集及ShieldVLM模型,该模型通过跨模态推理能有效识别隐性毒性内容。
English: The study addresses the challenge of multimodal implicit toxicity in text-image content by introducing a comprehensive taxonomy, a dataset of 2,100 items, and ShieldVLM, a model that excels in detecting such toxicity through cross-modal reasoning.

Authors:Junxiao Yang, Jinzhe Tu, Haoran Liu, Xiaoce Wang, Chujie Zheng, Zhexin Zhang, Shiyao Cui, Caishun Chen, Tiantian He, Hongning Wang, Yew-Soon Ong, Minlie Huang
Title: BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
Abstract:
Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning. However, current LRMs rarely admit ignorance or respond with "I don't know". Instead, they often produce incorrect answers while showing undue confidence, raising concerns about their factual reliability. In this work, we identify two pathological reasoning patterns characterized by overthinking that contribute to overconfident and incorrect answers: last-minute guessing and second-thought spiraling. To address these issues, we propose BARREL, a novel framework that promotes concise and boundary-aware factual reasoning. Our experiments show that BARREL-training increases the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%, while still achieving accuracy comparable to models finetuned on reasoning data generated by R1. These results suggest that our pilot study can inform the building of more reliable and factual System 2 LRMs.
Chinese: BARREL这一创新框架通过简洁且边界感知的推理,解决了大型推理模型过度自信和错误回答的问题,显著提升了事实可靠性,同时保持了与现有模型相当的准确性。
English: BARREL, a novel framework, enhances the reliability of Large Reasoning Models by addressing overconfident and incorrect answers through concise and boundary-aware reasoning, significantly improving factual reliability while keeping accuracy comparable to existing models.

Authors:Xu Zheng, Zhuomin Chen, Esteban Schafir, Sipeng Chen, Hojat Allah Salehi, Haifeng Chen, Farhad Shirani, Wei Cheng, Dongsheng Luo
Title: LM$^2$otifs: An Explainable Framework for Machine-Generated Texts Detection
Abstract:
The impressive ability of large language models to generate natural text across various tasks has led to critical challenges in authorship authentication. Although numerous detection methods have been developed to differentiate between machine-generated texts (MGT) and human-generated texts (HGT), the explainability of these methods remains a significant gap. Traditional explainability techniques often fall short in capturing the complex word relationships that distinguish HGT from MGT. To address this limitation, we present LM$^2$otifs, a novel explainable framework for MGT detection. Inspired by probabilistic graphical models, we provide a theoretical rationale for the effectiveness. LM$^2$otifs utilizes eXplainable Graph Neural Networks to achieve both accurate detection and interpretability. The LM$^2$otifs pipeline operates in three key stages: first, it transforms text into graphs based on word co-occurrence to represent lexical dependencies; second, graph neural networks are used for prediction; and third, a post-hoc explainability method extracts interpretable motifs, offering multi-level explanations from individual words to sentence structures. Extensive experiments on multiple benchmark datasets demonstrate the comparable performance of LM$^2$otifs. The empirical evaluation of the extracted explainable motifs confirms their effectiveness in differentiating HGT and MGT. Furthermore, qualitative analysis reveals distinct and visible linguistic fingerprints characteristic of MGT.
中文: LM²otifs框架通过图神经网络提取可解释的语言模式,有效解决了机器生成文本检测中的可解释性不足问题,揭示了区分人类与AI写作的语言特征指纹。
English: The LM²otifs framework addresses the explainability gap in machine-generated text detection by using graph neural networks to extract interpretable motifs that reveal linguistic fingerprints distinguishing human and AI writing.
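The first pipeline stage, turning text into a word co-occurrence graph, can be sketched as follows. This is a minimal illustration only: the sliding-window definition of co-occurrence, the undirected weighted edges, and the name `cooccurrence_graph` are assumptions, since the abstract does not specify how the graph is built.

```python
from collections import Counter
from typing import Dict, List, Tuple


def cooccurrence_graph(tokens: List[str], window: int = 2) -> Dict[Tuple[str, str], int]:
    """Build an undirected, weighted word co-occurrence graph.

    Two distinct words are linked when they appear within `window`
    positions of each other; the edge weight counts how often that happens.
    """
    edges: Counter = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                # Sort the pair so (a, b) and (b, a) map to the same edge.
                edges[tuple(sorted((left, right)))] += 1
    return dict(edges)


tokens = "the cat sat on the mat".split()
graph = cooccurrence_graph(tokens, window=2)
for (a, b), weight in sorted(graph.items()):
    print(f"{a} -- {b}: {weight}")
```

A graph of this kind could then be passed to a graph neural network for prediction, with a post-hoc explainer highlighting the sub-graphs (motifs) that drive the decision.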

Authors:Rui Liu, Rui Xie, Zijun Yao, Yanjie Fu, Dongjie Wang
Title: Continuous Optimization for Feature Selection with Permutation-Invariant Embedding and Policy-Guided Search
Abstract:
Feature selection removes redundant features to enhance performance and computational efficiency in downstream tasks. Existing works often struggle to capture complex feature interactions and adapt to diverse scenarios. Recent advances in this domain have incorporated generative intelligence to address these drawbacks by uncovering intricate relationships between features. However, two key limitations remain: 1) embedding feature subsets in a continuous space is challenging due to permutation sensitivity, as changes in feature order can introduce biases and weaken the embedding learning process; 2) gradient-based search in the embedding space assumes convexity, which is rarely guaranteed, leading to reduced search effectiveness and suboptimal subsets. To address these limitations, we propose a new framework that can: 1) preserve feature subset knowledge in a continuous embedding space while ensuring permutation invariance; 2) effectively explore the embedding space without relying on strong convexity assumptions. For the first objective, we develop an encoder-decoder paradigm to preserve feature selection knowledge in a continuous embedding space. This paradigm captures feature interactions through pairwise relationships within the subset, removing the influence of feature order on the embedding. Moreover, an inducing point mechanism is introduced to accelerate pairwise relationship computations. For the second objective, we employ a policy-based reinforcement learning (RL) approach to guide the exploration of the embedding space. The RL agent effectively navigates the space by balancing multiple objectives. By prioritizing high-potential regions adaptively and eliminating the reliance on convexity assumptions, the RL agent effectively reduces the risk of converging to local optima. Extensive experiments demonstrate the effectiveness, efficiency, robustness and explicitness of our model.
Chinese: 本文提出了一种新框架,通过保持特征子集排列不变性的嵌入方法和基于强化学习的非凸空间探索,克服了特征选择中的关键限制,实验验证了其高效性、鲁棒性和优越性能。
English: This paper introduces a novel framework that overcomes limitations in feature selection by embedding feature subsets with permutation invariance and using reinforcement learning to explore the embedding space without convexity assumptions, demonstrating superior performance in experiments.

Authors:Zhang Zhang, Qiang Zhang, Wei Cui, Shuai Shi, Yijie Guo, Gang Han, Wen Zhao, Jingkai Sun, Jiahang Cao, Jiaxu Wang, Hao Cheng, Xiaozhu Ju, Zhengping Che, Renjing Xu, Jian Tang
Title: Occupancy World Model for Robots
Abstract:
Understanding and forecasting scene evolution deeply affects the exploration and decision-making of embodied agents. While traditional methods simulate scene evolution through trajectory prediction of potential instances, current works use the occupancy world model as a generative framework for describing fine-grained overall scene dynamics. However, existing methods concentrate on structured outdoor road scenes and neglect forecasting 3D occupancy scene evolution for robots in indoor scenes. In this work, we explore a new framework for learning the scene evolution of observed fine-grained occupancy and propose an occupancy world model, called RoboOccWorld, based on a combined spatio-temporal receptive field and a guided autoregressive transformer to forecast scene evolution. We propose the Conditional Causal State Attention (CCSA), which uses camera poses of the next state as conditions to guide the autoregressive transformer to adapt to and understand indoor robotics scenarios. In order to effectively exploit the spatio-temporal cues from historical observations, Hybrid Spatio-Temporal Aggregation (HSTA) is proposed to obtain the combined spatio-temporal receptive field based on multi-scale spatio-temporal windows. In addition, we restructure the OccWorld-ScanNet benchmark based on local annotations to facilitate the evaluation of the indoor 3D occupancy scene evolution prediction task. Experimental results demonstrate that our RoboOccWorld outperforms state-of-the-art methods on the indoor 3D occupancy scene evolution prediction task. The code will be released soon.
中文: 本文提出RoboOccWorld框架,通过条件因果状态注意机制和混合时空聚合技术,实现了室内机器人场景中3D占据场景演化的精准预测,在重构的OccWorld-ScanNet基准测试中显著优于现有方法。
English: This paper introduces RoboOccWorld, a novel framework that employs a conditional causal state attention mechanism and hybrid spatio-temporal aggregation to forecast 3D occupancy scene evolution in indoor robotics environments, outperforming existing methods on the restructured OccWorld-ScanNet benchmark.

Authors:Yehui Tang, Yichun Yin, Yaoyuan Wang, Hang Zhou, Yu Pan, Wei Guo, Ziyang Zhang, Miao Rang, Fangcheng Liu, Naifu Zhang, Binghan Li, Yonghan Dong, Xiaojun Meng, Yasheng Wang, Dong Li, Yin Li, Dandan Tu, Can Chen, Youliang Yan, Fisher Yu, Ruiming Tang, Yunhe Wang, Botian Huang, Bo Wang, Boxiao Liu, Changzheng Zhang, Da Kuang, Fei Liu, Gang Huang, Jiansheng Wei, Jiarui Qin, Jie Ran, Jinpeng Li, Jun Zhao, Liang Dai, Lin Li, Liqun Deng, Peifeng Qin, Pengyuan Zeng, Qiang Gu, Shaohua Tang, Shengjun Cheng, Tao Gao, Tao Yu, Tianshu Li, Tianyu Bi, Wei He, Weikai Mao, Wenyong Huang, Wulong Liu, Xiabing Li, Xianzhi Yu, Xueyu Wu, Xu He, Yangkai Du, Yan Xu, Ye Tian, Yimeng Wu, Yongbing Huang, Yong Tian, Yong Zhu, Yue Li, Yufei Wang, Yuhang Gai, Yujun Li, Yu Luo, Yunsheng Ni, Yusen Sun, Zelin Chen, Zhe Liu, Zhicheng Liu, Zhipeng Tu, Zilin Ding, Zongyuan Zhan
Title: Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs
Abstract:
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices to reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.
中文: 本文提出了一种在昇腾NPU上高效训练大规模稀疏混合专家语言模型的方法,通过系统优化和基于模拟的配置选择实现了高性能。
English: This paper presents a methodology for efficiently training large-scale sparse language models with Mixture of Experts on Ascend NPUs, achieving high performance through system optimizations and simulation-based configuration selection.

Authors:Haoyue Bai, Yiyou Sun, Wei Cheng, Haifeng Chen
Title: Where's the liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content
Abstract:
The recent proliferation of photorealistic images created by generative models has sparked both excitement and concern, as these images are increasingly indistinguishable from real ones to the human eye. While offering new creative and commercial possibilities, the potential for misuse, such as in misinformation and fraud, highlights the need for effective detection methods. Current detection approaches often rely on access to model weights or require extensive collections of real image datasets, limiting their scalability and practical application in real-world scenarios. In this work, we introduce a novel black-box detection framework that requires only API access, sidestepping the need for model weights or large auxiliary datasets. Our approach leverages a corrupt-and-recover strategy: by masking part of an image and assessing the model's ability to reconstruct it, we measure the likelihood that the image was generated by the model itself. For black-box models that do not support masked image inputs, we incorporate a cost-efficient surrogate model trained to align with the target model's distribution, enhancing detection capability. Our framework demonstrates strong performance, outperforming baseline methods by 4.31% in mean average precision across eight diffusion model variant datasets.
Chinese: 本文提出了一种黑盒检测框架,通过损坏与恢复策略评估图像重建可能性来识别AI生成图像,无需模型权重或大型数据集,在八个扩散模型变体数据集上平均精度提升了4.31%。
English: This paper introduces a black-box detection framework that uses a corrupt-and-recover strategy to identify AI-generated images by assessing reconstruction likelihood, achieving a 4.31% improvement in mean average precision without requiring model weights or large datasets.
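The corrupt-and-recover idea can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the square mask, the mean squared error over the masked region, and the `reconstruct` callable, which stands in for a call to the generative model's API (the paper's actual masking scheme and scoring function are not given in the abstract).

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def mask_region(image: Image, top: int, left: int, size: int, fill: float = 0.0) -> Image:
    """Return a copy of the image with a size x size square overwritten by `fill`."""
    masked = [row[:] for row in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            masked[r][c] = fill
    return masked


def recovery_error(image: Image, recovered: Image, top: int, left: int, size: int) -> float:
    """Mean squared error between original and recovered pixels inside the mask."""
    total = 0.0
    for r in range(top, top + size):
        for c in range(left, left + size):
            total += (image[r][c] - recovered[r][c]) ** 2
    return total / (size * size)


def detection_score(image: Image, reconstruct: Callable[[Image], Image],
                    top: int, left: int, size: int) -> float:
    """Higher score means the model recovers the masked region more faithfully,
    which this sketch treats as evidence that the model generated the image."""
    recovered = reconstruct(mask_region(image, top, left, size))
    return -recovery_error(image, recovered, top, left, size)


def toy_model(masked: Image) -> Image:
    """Hypothetical stand-in for the model API: always paints mid-gray."""
    return [[0.5 for _ in row] for row in masked]


own_image = [[0.5] * 4 for _ in range(4)]    # resembles the toy model's outputs
other_image = [[0.9] * 4 for _ in range(4)]  # does not
print(detection_score(own_image, toy_model, 1, 1, 2))
print(detection_score(other_image, toy_model, 1, 1, 2))
```

In practice the score would be compared against a threshold, or averaged over several mask positions; both choices are left open here.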

Authors:Qingzheng Wang, Jiancheng Sun, Yifan Peng, Shinji Watanabe
Title: Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC
Abstract:
Multilingual speech processing with self-supervised or supervised pre-trained Speech Foundation Models (SFM) has achieved strong performance on tasks like Language Identification (LID) and Automatic Speech Recognition (ASR). However, these models struggle with limited resources during fine-tuning. This paper enhances multilingual LID and ASR on ML-SUPERB 2.0 by exploring multiple strategies for adapting SFMs, including frozen upstream training, partial fine-tuning, and low-rank adaptation. Furthermore, we employ data augmentation to mitigate performance gaps in few-shot settings and introduce LID Connectionist Temporal Classification (CTC) loss for regularization. Our approach achieves a 14% relative improvement in LID accuracy and a 30% relative reduction in ASR CER over the baseline on ML-SUPERB 2.0, securing second place in the Interspeech 2025 ML-SUPERB 2.0 Challenge.
中文: 本文通过采用部分微调、数据增强等策略优化语音基础模型,在ML-SUPERB 2.0上显著提升了多语言语种识别和自动语音识别性能,并在Interspeech 2025挑战赛中荣获第二名。
English: This paper improves multilingual language identification and automatic speech recognition on ML-SUPERB 2.0 by adapting speech foundation models with strategies like partial fine-tuning and data augmentation, achieving significant performance gains and securing second place in the Interspeech 2025 challenge.

Authors:Jiadong Pan, Zhiyuan Ma, Kaiyan Zhang, Ning Ding, Bowen Zhou
Title: Self-Reflective Reinforcement Learning for Diffusion-based Image Reasoning Generation
Abstract:
Diffusion models have recently demonstrated exceptional performance in image generation tasks. However, existing image generation methods still struggle with image reasoning, especially in logic-centered image generation tasks. Inspired by the success of Chain of Thought (CoT) and Reinforcement Learning (RL) in LLMs, we propose SRRL, a self-reflective RL algorithm for diffusion models to achieve reasoning generation of logical images by performing reflection and iteration across generation trajectories. The intermediate samples in the denoising process carry noise, making accurate reward evaluation difficult. To address this challenge, SRRL treats the entire denoising trajectory as a CoT step with a multi-round reflective denoising process and introduces a condition-guided forward process, which allows for reflective iteration between CoT steps. Through SRRL-based iterative diffusion training, we introduce image reasoning through CoT into generation tasks adhering to physical laws and unconventional physical phenomena for the first time. Notably, case-study results show that our SRRL algorithm performs strongly even compared with GPT-4o. The project page is https://jadenpan0.github.io/srrl.github.io/.
中文摘要:SRRL算法通过结合自反思强化学习和思维链推理,显著提升了扩散模型在逻辑图像生成中的表现,实验证明其效果甚至优于GPT-4o。
English Summary: The SRRL algorithm enhances diffusion models by integrating self-reflective reinforcement learning and Chain of Thought reasoning, enabling superior logical image generation that outperforms even GPT-4o in experimental evaluations.

Authors:Dominik Meier, Jan Philip Wahle, Paul Röttger, Terry Ruas, Bela Gipp
Title: TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent
Abstract:
As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.
Chinese: TrojanStego提出了一种隐蔽威胁模型,攻击者通过微调大语言模型,利用语言隐写术将敏感信息嵌入自然文本输出,实现了高精度的数据泄露,同时保持文本效用并规避人工检测。
English: TrojanStego introduces a covert threat where adversaries fine-tune large language models to embed confidential data into seemingly normal outputs using linguistic steganography, enabling high-accuracy secret transmission while evading detection and preserving text utility.
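The gain from majority voting across three generations can be illustrated from the decoder's side with a small sketch. It assumes the vote is taken per bit over independently decoded copies of the secret; the abstract does not say whether voting is per bit or per whole secret, and `majority_vote` is a hypothetical helper name.

```python
from typing import List


def majority_vote(decodings: List[List[int]]) -> List[int]:
    """Combine an odd number of noisy bit strings by per-bit majority."""
    if not decodings or len(decodings) % 2 == 0:
        raise ValueError("need an odd, non-zero number of decodings")
    length = len(decodings[0])
    if any(len(d) != length for d in decodings):
        raise ValueError("all decodings must have the same length")
    threshold = len(decodings) // 2
    # zip(*decodings) yields, for each bit position, that bit from every copy.
    return [1 if sum(bits) > threshold else 0 for bits in zip(*decodings)]


secret = [1, 0, 1, 1, 0, 0, 1, 0]
# Three decoded copies, each with a single error at a different position.
copy_a = [0, 0, 1, 1, 0, 0, 1, 0]
copy_b = [1, 0, 1, 0, 0, 0, 1, 0]
copy_c = [1, 0, 1, 1, 0, 0, 1, 1]
recovered = majority_vote([copy_a, copy_b, copy_c])
print(recovered == secret)
```

Voting corrects an error as long as no bit position is wrong in two of the three copies, which is consistent with accuracy rising when several generations are combined.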

Authors:Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao
Title: Two-Stage Regularization-Based Structured Pruning for LLMs
Abstract:
The deployment of large language models (LLMs) is largely hindered by their large number of parameters. Structural pruning has emerged as a promising solution. Prior structured pruning methods directly remove unimportant parameters based on certain metrics, which often causes knowledge loss and necessitates extensive retraining. To overcome this, we introduce a novel pruning method TRSP: Two-Stage Regularization-Based Structured Pruning for LLMs. Specifically, we multiply the output of each transformer layer by an initial learnable weight and iteratively learn these weights by adding their $\ell_1$-norm as a regularization term to the loss function, serving as the first-stage regularization. Subsequently, we apply additional regularization to the difference between the output and input of layers with smaller weights, encouraging the shift of knowledge to the preserved layers. This serves as the second-stage regularization. TRSP retains more knowledge and better preserves model performance than direct parameter elimination. Through extensive experimentation we show that TRSP outperforms strong layer-wise structured pruning methods without requiring retraining. As a layer-wise pruning method, it delivers notable end-to-end acceleration, making it a promising solution for efficient LLM deployment.
中文: TRSP是一种新颖的两阶段正则化结构化剪枝方法,通过双重正则化机制在减少大语言模型参数的同时保留知识、维持性能且无需重训练,实现了显著加速的高效部署。
English: TRSP is a novel two-stage regularization-based structured pruning method that reduces LLM parameters while minimizing knowledge loss and preserving performance without requiring retraining, enabling efficient deployment with notable acceleration.
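The two regularization stages can be written out as a hedged sketch. The notation is assumed for illustration and is not taken from the paper: $f_l$ denotes transformer layer $l$, $h_{l-1}$ its input, $w_l$ the learnable weight on its output, $\mathcal{S}$ the set of layers with small learned weights, and $\lambda_1, \lambda_2$ regularization strengths. The squared $\ell_2$ norm in the second stage is likewise an assumption, since the abstract only says that the difference between layer output and input is regularized.

```latex
% Stage 1: scale each layer output by a learnable weight and penalize the l1-norm of the weights
\tilde{h}_l = w_l \, f_l(h_{l-1}), \qquad
\mathcal{L}_{\text{stage 1}} = \mathcal{L}_{\mathrm{LM}} + \lambda_1 \sum_{l=1}^{L} \lvert w_l \rvert

% Stage 2: for layers with small weights, push the output toward the input
% so that removing the layer changes little and knowledge shifts to preserved layers
\mathcal{L}_{\text{stage 2}} = \mathcal{L}_{\mathrm{LM}} + \lambda_2 \sum_{l \in \mathcal{S}} \bigl\lVert f_l(h_{l-1}) - h_{l-1} \bigr\rVert_2^2
```

Under this reading, a layer in $\mathcal{S}$ that ends up close to the identity map can be dropped with little effect on the rest of the network.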

Authors:Xiaolong Tang, Meina Kan, Shiguang Shan, Xilin Chen
Title: Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling
Abstract:
Safe and feasible trajectory planning is essential for real-world autonomous driving systems. However, existing learning-based planning methods often rely on expert demonstrations, which not only lack explicit safety awareness but also risk inheriting unsafe behaviors such as speeding from suboptimal human driving data. Inspired by the success of large language models, we propose Plan-R1, a novel two-stage trajectory planning framework that formulates trajectory planning as a sequential prediction task, guided by explicit planning principles such as safety, comfort, and traffic rule compliance. In the first stage, we train an autoregressive trajectory predictor via next motion token prediction on expert data. In the second stage, we design rule-based rewards (e.g., collision avoidance, speed limits) and fine-tune the model using Group Relative Policy Optimization (GRPO), a reinforcement learning strategy, to align its predictions with these planning principles. Experiments on the nuPlan benchmark demonstrate that our Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance. Our code will be made public soon.
中文: 针对现有基于学习的规划器依赖专家数据且存在安全风险的问题,Plan-R1提出两阶段框架:先通过人类数据预训练轨迹预测器,再采用VD-GRPO进行对齐安全与交规的微调,在nuPlan基准测试中实现了最优性能。
English: To address the limitations of existing learning-based planners that rely on expert data with potential safety risks, Plan-R1 introduces a two-stage framework that first pre-trains a trajectory predictor on human data and then fine-tunes it using VD-GRPO to align with safety and traffic rules, achieving superior performance on the nuPlan benchmark.
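
The second stage in the abstract combines rule-based rewards with GRPO. A minimal sketch of the group-relative advantage step is given below. The reward function, its penalty values, and the use of the population standard deviation are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def rule_based_reward(collided, speed, speed_limit):
    """Hypothetical reward: start at 1.0 and subtract a penalty per violated rule."""
    reward = 1.0
    if collided:
        reward -= 1.0  # collision-avoidance rule violated
    if speed > speed_limit:
        reward -= 0.5  # speed-limit rule violated
    return reward


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four trajectories sampled for the same scenario (values are made up).
rewards = [
    rule_based_reward(False, 12.0, 15.0),
    rule_based_reward(True, 12.0, 15.0),
    rule_based_reward(False, 18.0, 15.0),
    rule_based_reward(True, 18.0, 15.0),
]
advantages = group_relative_advantages(rewards)
```

Trajectories that satisfy more rules than their group average receive positive advantages, so no learned value function is needed to rank them.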

Authors:Lars Benedikt Kaesberg, Jan Philip Wahle, Terry Ruas, Bela Gipp
Title: SPaRC: A Spatial Pathfinding Reasoning Challenge
Abstract:
Existing reasoning datasets saturate and fail to test abstract, multi-step problems, especially pathfinding and complex rule constraint satisfaction. We introduce SPaRC (Spatial Pathfinding Reasoning Challenge), a dataset of 1,000 2D grid pathfinding puzzles to evaluate spatial and symbolic reasoning, requiring step-by-step planning with arithmetic and geometric rules. Humans achieve near-perfect accuracy (98.0%; 94.5% on hard puzzles), while the best reasoning models, such as o4-mini, struggle (15.8%; 1.1% on hard puzzles). Models often generate invalid paths (>50% of puzzles for o4-mini), and reasoning tokens reveal they make errors in navigation and spatial logic. Unlike humans, who take longer on hard puzzles, models fail to scale test-time compute with difficulty. Allowing models to make multiple solution attempts improves accuracy, suggesting potential for better spatial reasoning with improved training and efficient test-time scaling methods. SPaRC can be used as a window into models' spatial reasoning limitations and drive research toward new methods that excel in abstract, multi-step problem-solving.
中文:SPaRC数据集包含1000个二维网格寻路谜题,用于评估空间推理能力,结果显示人类准确率接近完美,而当前最佳AI模型如o4-mini表现严重不佳,尤其在复杂问题上,凸显了它们在导航和空间逻辑方面的缺陷。
English: The SPaRC dataset introduces 1,000 2D grid pathfinding puzzles to evaluate spatial reasoning, revealing that while humans excel with near-perfect accuracy, current AI models like o4-mini struggle significantly, especially on complex problems, highlighting their limitations in navigation and spatial logic.
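
The abstract reports that models often produce invalid paths. A minimal validity check for a grid path might look like the sketch below. The rules encoded here (4-connected moves, no revisited cells, optional blocked cells) are assumptions for illustration; the actual SPaRC puzzles add arithmetic and geometric rules that are not modeled.

```python
def is_valid_path(path, rows, cols, blocked=frozenset()):
    """Return True if `path` is a simple 4-connected walk inside the grid."""
    if not path:
        return False
    seen = set()
    for i, (r, c) in enumerate(path):
        if not (0 <= r < rows and 0 <= c < cols):
            return False  # stepped outside the grid
        if (r, c) in blocked or (r, c) in seen:
            return False  # hit a blocked cell or revisited a cell
        if i > 0:
            pr, pc = path[i - 1]
            if abs(r - pr) + abs(c - pc) != 1:
                return False  # not a single horizontal or vertical step
        seen.add((r, c))
    return True
```

A checker of this kind separates structurally invalid outputs from paths that are well-formed but break a puzzle rule.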

Authors:Runcong Zhao, Chengyu Cao, Qinglin Zhu, Xiucheng Lv, Shun Shao, Lin Gui, Ruifeng Xu, Yulan He
Title: Sparse Activation Editing for Reliable Instruction Following in Narratives
Abstract:
Complex narrative contexts often challenge language models' ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly evaluate our method, we introduce FreeInstruct, a diverse and realistic benchmark of 1,212 examples that highlights the challenges of instruction following in narrative-rich settings. While initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.
中文:提出的Concise-SAE框架通过仅使用自然语言识别和编辑相关神经元来增强指令遵循能力,同时FreeInstruct基准测试评估了其在多样化叙事场景中的有效性。
English: The proposed Concise-SAE framework enhances instruction following by identifying and editing relevant neurons using only natural language, while the FreeInstruct benchmark evaluates its effectiveness across diverse narrative contexts.

Authors:Anfeng Xu, Tiantian Feng, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth Narayanan
Title: Large Language Models based ASR Error Correction for Child Conversations
Abstract:
Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children's speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their applications in child speech including conversational scenarios are underexplored. In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. We demonstrate the promises and challenges of LLMs through experiments on two children's conversational speech datasets with both zero-shot and fine-tuned ASR outputs. We find that while LLMs are helpful in correcting zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, it remains challenging for LLMs to improve ASR performance when incorporating contextual information or when using fine-tuned autoregressive ASR (e.g., Whisper) outputs.
中文: 大语言模型在纠正儿童对话语音的自动语音识别错误方面展现出潜力,能有效改进零样本和基于CTC模型的输出,但在整合上下文信息或处理微调自回归模型(如Whisper)时仍面临挑战。
English: Large Language Models show promise in correcting ASR errors for conversational child speech, effectively improving zero-shot and CTC-based outputs but struggling with contextual integration and fine-tuned autoregressive models like Whisper.

Authors:Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Pengpeng Shao, Huazhe Xu, Jianhua Tao
Title: Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities
Abstract:
Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model's output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL by incorporating external high-level guidance ("thought patterns"). By adaptively integrating structured thoughts during training, TAPO effectively balances model-internal exploration and external guidance exploitation. Extensive experiments show that our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models. This highlights TAPO's potential for broader applications across multiple tasks and domains. Our further analysis reveals that introducing external guidance produces powerful reasoning models with superior explainability of inference behavior and enhanced output readability.
中文摘要:TAPO是一种新颖的强化学习框架,通过融入外部思维模式来增强推理模型,在多个基准测试中实现显著性能提升,同时提高了推理行为的可解释性和输出可读性。
English Summary: TAPO is a novel reinforcement learning framework that enhances reasoning models by integrating external thought patterns, achieving significant performance improvements across multiple benchmarks while improving explainability and readability.

Authors:Hendrik Junkawitsch, Guoxing Sun, Heming Zhu, Christian Theobalt, Marc Habermann
Title: EVA: Expressive Virtual Avatars from Multi-view Videos
Abstract:
With recent advancements in neural rendering and motion capture algorithms, remarkable progress has been made in photorealistic human avatar modeling, unlocking immense potential for applications in virtual reality, augmented reality, remote communication, and industries such as gaming, film, and medicine. However, existing methods fail to provide complete, faithful, and expressive control over human avatars due to their entangled representation of facial expressions and body movements. In this work, we introduce Expressive Virtual Avatars (EVA), an actor-specific, fully controllable, and expressive human avatar framework that achieves high-fidelity, lifelike renderings in real time while enabling independent control of facial expressions, body movements, and hand gestures. Specifically, our approach designs the human avatar as a two-layer model: an expressive template geometry layer and a 3D Gaussian appearance layer. First, we present an expressive template tracking algorithm that leverages coarse-to-fine optimization to accurately recover body motions, facial expressions, and non-rigid deformation parameters from multi-view videos. Next, we propose a novel decoupled 3D Gaussian appearance model designed to effectively disentangle body and facial appearance. Unlike unified Gaussian estimation approaches, our method employs two specialized and independent modules to model the body and face separately. Experimental results demonstrate that EVA surpasses state-of-the-art methods in terms of rendering quality and expressiveness, validating its effectiveness in creating full-body avatars. This work represents a significant advancement towards fully drivable digital human models, enabling the creation of lifelike digital avatars that faithfully replicate human geometry and appearance.
中文: 神经渲染与动作捕捉的进展推动了逼真人像建模,但现有方法无法完全控制表情与动作;我们提出的EVA框架通过分层模型实现面部表情、身体运动和手势的独立操控,以高保真实时渲染突破了这一局限。
English: Recent advances in neural rendering and motion capture have enabled photorealistic human avatars, but existing methods lack full control over expressions and movements; our proposed EVA framework overcomes this by providing independent control of facial expressions, body motions, and hand gestures with high-fidelity real-time rendering.

Authors:Xin Li, Mengbing Liu, Li Wei, Jiancheng An, Mérouane Debbah, Chau Yuen
Title: WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications
Abstract:
Large Language Models (LLMs) have achieved impressive results across a broad array of tasks, yet their capacity for complex, domain-specific mathematical reasoning, particularly in wireless communications, remains underexplored. In this work, we introduce WirelessMathBench, a novel benchmark specifically designed to evaluate LLMs on mathematical modeling challenges in wireless communications engineering. Our benchmark consists of 587 meticulously curated questions sourced from 40 state-of-the-art research papers, encompassing a diverse spectrum of tasks ranging from basic multiple-choice questions to complex equation completion tasks, including both partial and full completions, all of which rigorously adhere to physical and dimensional constraints. Through extensive experimentation with leading LLMs, we observe that while many models excel in basic recall tasks, their performance degrades significantly when reconstructing partially or fully obscured equations, exposing fundamental limitations in current LLMs. Even DeepSeek-R1, the best performer on our benchmark, achieves an average accuracy of only 38.05%, with a mere 7.83% success rate in full equation completion. By publicly releasing WirelessMathBench along with the evaluation toolkit, we aim to advance the development of more robust, domain-aware LLMs for wireless system analysis and broader engineering applications.
中文: 本研究提出了WirelessMathBench这一专门评估大语言模型在无线通信领域数学推理能力的基准,揭示了尽管模型在基础记忆任务中表现良好,但在复杂方程重建方面存在显著不足。
English: This study introduces WirelessMathBench, a specialized benchmark for evaluating LLMs' mathematical reasoning in wireless communications, revealing their significant limitations in complex equation tasks despite basic recall proficiency.

Authors:Donald Loveland, Mingxuan Ju, Tong Zhao, Neil Shah, Danai Koutra
Title: On the Role of Weight Decay in Collaborative Filtering: A Popularity Perspective
Abstract:
Collaborative filtering (CF) enables large-scale recommendation systems by encoding information from historical user-item interactions into dense ID-embedding tables. However, as embedding tables grow, closed-form solutions become impractical, often necessitating the use of mini-batch gradient descent for training. Despite extensive work on designing loss functions to train CF models, we argue that one core component of these pipelines is heavily overlooked: weight decay. Attaining high-performing models typically requires careful tuning of weight decay, regardless of loss, yet its necessity is not well understood. In this work, we question why weight decay is crucial in CF pipelines and how it impacts training. Through theoretical and empirical analysis, we surprisingly uncover that weight decay's primary function is to encode popularity information into the magnitudes of the embedding vectors. Moreover, we find that tuning weight decay acts as a coarse, non-linear knob to influence preference towards popular or unpopular items. Based on these findings, we propose PRISM (Popularity-awaRe Initialization Strategy for embedding Magnitudes), a straightforward yet effective solution to simplify the training of high-performing CF models. PRISM pre-encodes the popularity information typically learned through weight decay, eliminating its necessity. Our experiments show that PRISM improves performance by up to 4.77% and reduces training times by 38.48%, compared to state-of-the-art training strategies. Additionally, we parameterize PRISM to modulate the initialization strength, offering a cost-effective and meaningful strategy to mitigate popularity bias.
中文: 本研究揭示了在协同过滤中权重衰减的核心作用是编码物品流行度到嵌入向量中,并提出PRISM方法通过预编码流行度信息来替代权重衰减,将性能提升最高达4.77%并减少38.48%的训练时间。
English: This research reveals that weight decay in collaborative filtering primarily encodes item popularity into embedding magnitudes and introduces PRISM, a method that pre-encodes popularity to eliminate the need for weight decay, improving performance by up to 4.77% and reducing training time by 38.48%.

Authors:Anchen Sun, Tiantian Feng, Gabriela Gutierrez, Juan J Londono, Anfeng Xu, Batya Elbaum, Shrikanth Narayanan, Lynn K Perry, Daniel S Messinger
Title: Who Said What WSW 2.0? Enhanced Automated Analysis of Preschool Classroom Speech
Abstract:
This paper introduces WSW2.0, an automated framework for analyzing vocal interactions in preschool classrooms, enhancing both accuracy and scalability through the integration of wav2vec2-based speaker classification and Whisper (large-v2 and large-v3) speech transcription. A total of 235 minutes of audio recordings (160 minutes from 12 children and 75 minutes from 5 teachers) were used to compare system outputs to expert human annotations. WSW2.0 achieves a weighted F1 score of .845, accuracy of .846, and an error-corrected kappa of .672 for speaker classification (child vs. teacher). Transcription quality is moderate to high with word error rates of .119 for teachers and .238 for children. WSW2.0 exhibits relatively high absolute agreement intraclass correlations (ICC) with expert transcriptions for a range of classroom language features. These include teacher and child mean utterance length, lexical diversity, question asking, and responses to questions and other utterances, which show ICCs between .64 and .98. To establish scalability, we apply the framework to an extensive dataset spanning two years and over 1,592 hours of classroom audio recordings, demonstrating the framework's robustness for broad real-world applications. These findings highlight the potential of deep learning and natural language processing techniques to revolutionize educational research by providing accurate measures of key features of preschool classroom speech, ultimately guiding more effective intervention strategies and supporting early childhood language development.
中文: 本文提出的WSW2.0自动化框架通过整合wav2vec2说话人分类和Whisper语音转录技术,能精准分析学前课堂语音互动,在说话人分类和转录质量方面表现优异,并成功应用于大规模实际数据验证其扩展性。
English: This paper presents WSW2.0, an automated framework that combines wav2vec2 and Whisper models to accurately analyze preschool classroom vocal interactions, achieving strong performance in speaker classification and speech transcription while demonstrating scalability across extensive real-world datasets.
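
The transcription results above are reported as word error rates. For reference, a minimal sketch of the standard WER computation (word-level edit distance divided by the reference length) is shown below. Whitespace tokenization and the absence of text normalization are simplifying assumptions; the paper's own scoring pipeline is not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, a WER of .119 corresponds to roughly 12 word-level errors per 100 reference words.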

Authors:Enpei Zhang, Jingyi Chai, Rui Ye, Yanfeng Wang, Siheng Chen
Title: Incentivizing Inclusive Contributions in Model Sharing Markets
Abstract:
While data plays a crucial role in training contemporary AI models, it is acknowledged that valuable public data will be exhausted in a few years, directing the world's attention towards the massive decentralized private data. However, the privacy-sensitive nature of raw data and the lack of an incentive mechanism prevent this valuable data from being fully exploited. Addressing these challenges, this paper proposes inclusive and incentivized personalized federated learning (iPFL), which incentivizes data holders with diverse purposes to collaboratively train personalized models without revealing raw data. iPFL constructs a model-sharing market by solving a graph-based training optimization and incorporates an incentive mechanism based on game theory principles. Theoretical analysis shows that iPFL adheres to two key incentive properties: individual rationality and truthfulness. Empirical studies on eleven AI tasks (e.g., large language models' instruction-following tasks) demonstrate that iPFL consistently achieves the highest economic utility, and better or comparable model performance compared to baseline methods. We anticipate that our iPFL can serve as a valuable technique for boosting future AI models on decentralized private data while making everyone satisfied.
中文摘要:本文提出包容性激励个性化联邦学习(iPFL),通过构建基于图优化的模型共享市场和博弈论激励机制,在保护隐私的前提下激励数据持有者协同训练个性化模型,实证证明其具有最优经济效用和可比模型性能。
English Summary: This paper introduces inclusive and incentivized personalized federated learning (iPFL), a framework that enables collaborative training of personalized models on decentralized private data through game-theoretic incentives while preserving privacy and achieving superior economic utility and model performance.

Authors:Qingchuan Ma, Yuhang Wu, Xiawu Zheng, Rongrong Ji
Title: Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective
Abstract:
In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel complementary metrics: \(\Gamma\) measures basic reasoning accuracy, while \(\Delta\) quantifies a model's reliance on specific symbols rather than underlying patterns - a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark: systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching. Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B, multi-agent) reveal: 1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) \(\Delta\)'s effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization. These findings underscore that current LLMs, despite domain-specific strengths, still lack robust abstract reasoning, highlighting key areas for future improvement.
中文: 本文通过系统符号重映射建立理论基准来评估大语言模型的抽象推理能力,揭示了非十进制运算的关键缺陷和思维链提示下仍存在的抽象推理差距。
English: This paper establishes a theoretical benchmark to evaluate abstract reasoning in LLMs through systematic symbol remapping, revealing critical limitations in non-decimal arithmetic and persistent abstraction gaps despite chain-of-thought prompting.
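
The symbol-remapping idea can be sketched as follows: rewrite each task with a consistent substitution of symbols, then compare accuracy before and after. The helper names and the simple accuracy difference used for the gap are assumptions made for illustration; the abstract does not give the exact definitions of the paper's two metrics.

```python
def remap_symbols(text, mapping):
    """Apply a consistent character-level substitution; unmapped characters are kept."""
    return "".join(mapping.get(ch, ch) for ch in text)


def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the expected answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def abstraction_gap(acc_original, acc_remapped):
    """Hypothetical stand-in for memory dependence: accuracy lost under remapping."""
    return acc_original - acc_remapped


# Replace the digits 0-9 with the letters a-j while leaving operators untouched.
digit_to_letter = dict(zip("0123456789", "abcdefghij"))
remapped_task = remap_symbols("12+34=46", digit_to_letter)
```

A model that has learned the underlying rule should score about the same on both versions, whereas a large gap suggests reliance on the familiar surface symbols.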

Authors:Lingkai Meng, Yu Shao, Long Yuan, Longbin Lai, Peng Cheng, Wenyuan Yu, Wenjie Zhang, Xuemin Lin, Jingren Zhou
Title: OSS-UAgent: An Agent-based Usability Evaluation Framework for Open Source Software
Abstract:
Usability evaluation is critical to the impact and adoption of open source software (OSS), yet traditional methods relying on human evaluators suffer from high costs and limited scalability. To address these limitations, we introduce OSS-UAgent, an automated, configurable, and interactive agent-based usability evaluation framework specifically designed for open source software. Our framework employs intelligent agents powered by large language models (LLMs) to simulate developers performing programming tasks across various experience levels (from Junior to Expert). By dynamically constructing platform-specific knowledge bases, OSS-UAgent ensures accurate and context-aware code generation. The generated code is automatically evaluated across multiple dimensions, including compliance, correctness, and readability, providing a comprehensive measure of the software's usability. Additionally, our demonstration showcases OSS-UAgent's practical application in evaluating graph analytics platforms, highlighting its effectiveness in automating usability evaluation.
中文摘要:针对传统开源软件可用性评估方法成本高、可扩展性差的问题,我们开发了OSS-UAgent自动化框架,该框架通过大语言模型驱动的智能代理模拟不同水平开发者,并基于多维度代码分析实现系统性可用性评估。
English Summary: To overcome the high cost and limited scalability of traditional usability evaluation methods for open source software, we developed OSS-UAgent, an automated framework using large language model-powered agents to simulate developers at various skill levels and assess software usability through multi-dimensional code analysis.

Authors:Xudong Li, Mengdan Zhang, Peixian Chen, Xiawu Zheng, Yan Zhang, Jingyuan Zheng, Yunhang Shen, Ke Li, Chaoyou Fu, Xing Sun, Rongrong Ji
Title: Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs
Abstract:
Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, leading to hallucinations (context omission, conflation, and misinterpretation). Existing methods using Direct Preference Optimization (DPO) constrain optimization to a solitary image reference within the input sequence, neglecting holistic context modeling. We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues -- from sequential context to local details. It features: (i) Context-Level Optimization: Re-evaluates cognitive biases underlying MLLMs' multi-image context comprehension and integrates a spectrum of low-cost global sequence preferences for bias mitigation. (ii) Needle-Level Optimization: Directs attention to fine-grained visual details through region-targeted visual prompts and multimodal preference supervision. To support scalable optimization, we also construct MultiScope-42k, an automatically generated dataset with high-quality multi-level preference pairs. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains across general single- and multi-image tasks.
中文: 提出的上下文到线索直接偏好优化(CcDPO)框架通过上下文级和细节级优化解决多模态大语言模型在多图像理解中的挑战,显著减少幻觉现象,同时在单图像和多图像任务中均实现性能提升。
English: The proposed Context-to-Cue Direct Preference Optimization (CcDPO) framework addresses multi-image understanding challenges in MLLMs through context-level and needle-level optimizations, significantly reducing hallucinations while improving performance across both single- and multi-image tasks.
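
CcDPO builds on Direct Preference Optimization. For readers unfamiliar with it, the standard per-pair DPO loss can be sketched as below. This is the generic DPO objective, not the multi-level CcDPO objective, and the argument names and default `beta` are assumptions for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    Each argument is the summed log-probability of a full response under the
    policy (logp_*) or under the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference faster than the rejected one's.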

Authors:Haitao Lin, Odin Zhang, Jia Xu, Yunfan Liu, Zheng Cheng, Lirong Wu, Yufei Huang, Zhifeng Gao, Stan Z. Li
Title: Tokenizing Electron Cloud in Protein-Ligand Interaction Learning
Abstract:
The affinity and specificity of protein-molecule binding directly impact functional outcomes, uncovering the mechanisms underlying biological regulation and signal transduction. Most deep-learning-based prediction approaches focus on structures of atoms or fragments. However, quantum chemical properties, such as electronic structures, are the key to unveiling interaction patterns but remain largely underexplored. To bridge this gap, we propose ECBind, a method for tokenizing electron cloud signals into quantized embeddings, enabling their integration into downstream tasks such as binding affinity prediction. By incorporating electron densities, ECBind helps uncover binding modes that cannot be fully represented by atom-level models. Specifically, to remove the redundancy inherent in electron cloud signals, a structure-aware transformer and hierarchical codebooks encode 3D binding sites enriched with electron structures into tokens. These tokenized codes are then used for specific tasks with labels. To extend its applicability to a wider range of scenarios, we utilize knowledge distillation to develop an electron-cloud-agnostic prediction model. Experimentally, ECBind demonstrates state-of-the-art performance across multiple tasks, achieving improvements of 6.42% and 15.58% in per-structure Pearson and Spearman correlation coefficients, respectively.
中文: ECBind提出了一种将电子云信号转化为量化嵌入的新方法,通过整合量子化学特性来提升蛋白质-分子结合亲和力的预测性能,弥补了传统原子级模型的不足。
English: ECBind introduces a novel method for tokenizing electron cloud signals into quantized embeddings, enabling enhanced prediction of protein-molecule binding affinity by incorporating quantum chemical properties that traditional atom-level models overlook.

Authors:Shutong Ding, Ke Hu, Shan Zhong, Haoyang Luo, Weinan Zhang, Jingya Wang, Jun Wang, Ye Shi
Title: GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning
Abstract:
Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of large-scale parallel GPU-accelerated simulators, such as IsaacLab, which are optimized for on-policy RL algorithms and enable rapid training of complex robotic tasks. A key challenge lies in computing state-action log-likelihoods under diffusion policies, which is straightforward for Gaussian policies but intractable for flow-based models due to irreversible forward-reverse processes and discretization errors (e.g., Euler-Maruyama approximations). To bridge this gap, we propose GenPO, a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO introduces a novel doubled dummy action mechanism that enables invertibility via alternating updates, resolving log-likelihood computation barriers. Furthermore, we also use the action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates. Extensive experiments on eight IsaacLab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO's superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.
中文: GenPO提出了一种生成式策略优化框架,通过精确扩散反演和双重虚拟动作机制解决了状态-动作对数似然计算难题,首次成功将扩散策略整合到同策略强化学习中,在八项机器人基准测试中展现出卓越性能。
English: GenPO introduces a generative policy optimization framework that enables the integration of diffusion policies into on-policy reinforcement learning by solving the challenge of computing state-action log-likelihoods through exact diffusion inversion and a doubled dummy action mechanism, demonstrating superior performance across eight robotic benchmarks.
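
The abstract attributes invertibility to alternating updates on a doubled action. The general principle can be illustrated with a toy additive-coupling map: when two variables are updated in turn, each using only the other, every step can be undone exactly. The functions and constants below are arbitrary illustrative choices and are not GenPO's actual update rule.

```python
import math


def forward(x, y, steps=4):
    """Alternately update x from y, then y from the new x."""
    for k in range(steps):
        x = x + 0.5 * math.tanh(y + k)
        y = y + 0.5 * math.sin(x - k)
    return x, y


def inverse(x, y, steps=4):
    """Undo `forward` by reversing the step order and subtracting the same terms."""
    for k in reversed(range(steps)):
        y = y - 0.5 * math.sin(x - k)
        x = x - 0.5 * math.tanh(y + k)
    return x, y


start = (0.3, -1.2)
recovered = inverse(*forward(*start))
```

Because each update only adds a function of the other variable, the map is invertible regardless of how nonlinear that function is, which is the property a tractable log-likelihood computation relies on.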

Authors:Jiancheng Wang, Mingjia Yin, Hao Wang, Enhong Chen
Title: Enhancing CTR Prediction with De-correlated Expert Networks
Abstract:
Modeling feature interactions is essential for accurate click-through rate (CTR) prediction in advertising systems. Recent studies have adopted the Mixture-of-Experts (MoE) approach to improve performance by ensembling multiple feature interaction experts. These studies employ various strategies, such as learning independent embedding tables for each expert or utilizing heterogeneous expert architectures, to differentiate the experts, which we refer to as expert de-correlation. However, it remains unclear whether these strategies effectively achieve de-correlated experts. To address this, we propose a De-Correlated MoE (D-MoE) framework, which introduces a Cross-Expert De-Correlation loss to minimize expert correlations. Additionally, we propose a novel metric, termed Cross-Expert Correlation, to quantitatively evaluate the expert de-correlation degree. Based on this metric, we identify a key finding for MoE framework design: different de-correlation strategies are mutually compatible, and progressively employing them leads to reduced correlation and enhanced performance. Extensive experiments have been conducted to validate the effectiveness of D-MoE and the de-correlation principle. Moreover, online A/B testing on Tencent's advertising platforms demonstrates that D-MoE achieves a significant 1.19% Gross Merchandise Volume (GMV) lift compared to the Multi-Embedding MoE baseline.
中文: 提出的去相关混合专家(D-MoE)框架通过引入交叉专家去相关损失和新评估指标,有效降低了专家模型间的相关性,实验和在线测试均证明其能显著提升点击率预测性能和商业指标。
English: The proposed De-Correlated MoE (D-MoE) framework introduces a specialized loss function and metric to minimize expert correlations in CTR prediction, with experiments and online tests confirming its effectiveness in reducing correlation and boosting performance.
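
One plausible way to quantify how correlated a set of experts is would be the mean absolute Pearson correlation over all expert pairs, sketched below. This is an assumed stand-in for illustration; the abstract does not define the paper's Cross-Expert Correlation metric or its de-correlation loss.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length sequences of expert outputs."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def mean_abs_cross_correlation(expert_outputs):
    """Average |correlation| over every pair of experts (hypothetical metric)."""
    pairs = list(combinations(expert_outputs, 2))
    return sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)
```

A value near 1 means the experts produce nearly redundant outputs, while a value near 0 means they carry largely independent signals.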

Authors:Jianbiao Mei, Tao Hu, Daocheng Fu, Licheng Wen, Xuemeng Yang, Rong Wu, Pinlong Cai, Xinyu Cai, Xing Gao, Yu Yang, Chengjun Xie, Botian Shi, Yong Liu, Yu Qiao
Title: O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering
Abstract:
Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-end problems. Open-ended questions, which are characterized by lacking a standard answer or providing non-unique and diverse answers, remain underexplored. To bridge this gap, we present O$^2$-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O$^2$-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling the external world knowledge from the model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O$^2$-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O$^2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$^2$-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly-sized models, while performing on par with much larger ones.
Chinese: O²-Searcher是一种基于强化学习的新型搜索代理,通过模拟搜索环境动态获取外部知识,有效解决了开放域中的开放性和封闭性问题,仅使用30亿参数模型就在多个基准测试中实现了最先进的性能。
English: O²-Searcher is a novel reinforcement learning-based search agent that effectively addresses both open-ended and closed-ended questions by dynamically acquiring external knowledge through a simulated search environment, achieving state-of-the-art performance across multiple benchmarks while using only a 3B parameter model.

Authors:Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, Yixing Fan
Title: Chain-of-Thought Poisoning Attacks against R1-based Retrieval-Augmented Generation Systems
Abstract:
Retrieval-augmented generation (RAG) systems can effectively mitigate the hallucination problem of large language models (LLMs), but they also possess inherent vulnerabilities. Identifying these weaknesses before the large-scale real-world deployment of RAG systems is of great importance, as it lays the foundation for building more secure and robust RAG systems in the future. Existing adversarial attack methods typically exploit knowledge base poisoning to probe the vulnerabilities of RAG systems, which can effectively deceive standard RAG models. However, with the rapid advancement of deep reasoning capabilities in modern LLMs, previous approaches that merely inject incorrect knowledge are inadequate when attacking RAG systems equipped with deep reasoning abilities. Inspired by the deep thinking capabilities of LLMs, this paper extracts reasoning process templates from R1-based RAG systems, uses these templates to wrap erroneous knowledge into adversarial documents, and injects them into the knowledge base to attack RAG systems. The key idea of our approach is that adversarial documents, by simulating the chain-of-thought patterns aligned with the model's training signals, may be misinterpreted by the model as authentic historical reasoning processes, thus increasing their likelihood of being referenced. Experiments conducted on the MS MARCO passage ranking dataset demonstrate the effectiveness of our proposed method.
中文: 本文提出了一种新颖的对抗攻击方法,通过模拟真实思维链模式利用推理过程模板欺骗检索增强生成系统,实验证明该方法对具备深度推理能力的先进模型具有显著攻击效果。
English: This paper introduces a novel adversarial attack method that exploits reasoning process templates to deceive retrieval-augmented generation systems by mimicking authentic chain-of-thought patterns, demonstrating effectiveness against advanced models with deep reasoning capabilities.

Authors:Liangxuan Wu, Chao Wang, Tianming Liu, Yanjie Zhao, Haoyu Wang
Title: From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents
Abstract:
The growing adoption of large language models (LLMs) has led to a new paradigm in mobile computing--LLM-powered mobile AI agents--capable of decomposing and automating complex tasks directly on smartphones. However, the security implications of these agents remain largely unexplored. In this paper, we present the first comprehensive security analysis of mobile LLM agents, encompassing three representative categories: System-level AI Agents developed by original equipment manufacturers (e.g., YOYO Assistant), Third-party Universal Agents (e.g., Zhipu AI AutoGLM), and Emerging Agent Frameworks (e.g., Alibaba Mobile Agent). We begin by analyzing the general workflow of mobile agents and identifying security threats across three core capability dimensions: language-based reasoning, GUI-based interaction, and system-level execution. Our analysis reveals 11 distinct attack surfaces, all rooted in the unique capabilities and interaction patterns of mobile LLM agents, and spanning their entire operational lifecycle. To investigate these threats in practice, we introduce AgentScan, a semi-automated security analysis framework that systematically evaluates mobile LLM agents across all 11 attack scenarios. Applying AgentScan to nine widely deployed agents, we uncover a concerning trend: every agent is vulnerable to targeted attacks. In the most severe cases, agents exhibit vulnerabilities across eight distinct attack vectors. These attacks can cause behavioral deviations, privacy leakage, or even full execution hijacking. Based on these findings, we propose a set of defensive design principles and practical recommendations for building secure mobile LLM agents. Our disclosures have received positive feedback from two major device vendors. Overall, this work highlights the urgent need for standardized security practices in the fast-evolving landscape of LLM-driven mobile automation.
中文: 本研究首次对移动端大语言模型智能体进行系统性安全分析,通过AgentScan框架识别出11个攻击面并证实所有测试代理均存在漏洞,同时提出相应防御措施以应对安全风险。
English: This study presents the first comprehensive security analysis of mobile LLM agents, identifying 11 attack surfaces and demonstrating vulnerabilities across all tested agents through the AgentScan framework, while proposing defensive measures to address these risks.

Authors:Minghan Chen, Guikun Chen, Wenguan Wang, Yi Yang
Title: SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization
Abstract:
Large language models (LLMs) exhibit varying levels of confidence across input prompts (questions): some lead to consistent, semantically similar answers, while others yield diverse or contradictory outputs. This variation reflects the LLM's uncertainty about the input prompt, a signal of how confidently the model understands a given problem. However, vanilla Group Relative Policy Optimization (GRPO) treats all prompts equally during policy updates, ignoring this important information about the model's knowledge boundaries. To address this limitation, we propose SEED-GRPO (Semantic Entropy EnhanceD GRPO), which explicitly measures LLMs' uncertainty about the input prompts via semantic entropy. Semantic entropy measures the diversity of meaning in multiple generated answers given a prompt, and SEED-GRPO uses it to modulate the magnitude of policy updates. This uncertainty-aware training mechanism enables dynamic adjustment of policy update magnitudes based on question uncertainty. It allows more conservative updates on high-uncertainty questions while maintaining the original learning signal on confident ones. Experimental results on five mathematical reasoning benchmarks (AIME24 56.7, AMC 68.7, MATH 83.4, Minerva 34.2, and OlympiadBench 48.0) demonstrate that SEED-GRPO achieves new state-of-the-art performance in average accuracy, validating the effectiveness of uncertainty-aware policy optimization.
中文:SEED-GRPO通过语义熵量化大语言模型对提示的不确定性,实现基于问题置信度的动态策略优化,在数学推理基准测试中取得了最优性能。
English: SEED-GRPO introduces semantic entropy to measure LLM uncertainty in prompts, enabling dynamic policy updates that improve performance across mathematical reasoning benchmarks.
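
The uncertainty weighting described in the abstract can be illustrated with a minimal sketch. It assumes the sampled answers have already been grouped into meaning clusters (the clustering step is not shown), and it assumes a linear down-weighting of advantages by normalized entropy; the abstract does not state the exact scaling function, so `scale_advantages` and `alpha` are hypothetical names, not the authors' implementation.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy over meaning clusters of the sampled answers."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def scale_advantages(advantages, cluster_ids, alpha=1.0):
    """Shrink advantages for prompts whose answers disagree in meaning.

    Assumption: linear down-weighting by entropy normalized to [0, 1];
    the scaling used in the paper may differ.
    """
    n = len(cluster_ids)
    max_entropy = math.log(n) if n > 1 else 1.0
    weight = 1.0 - alpha * semantic_entropy(cluster_ids) / max_entropy
    return [a * weight for a in advantages]
```

Under this sketch, a prompt whose answers all fall into one cluster keeps its original advantages, while a prompt whose answers all land in different clusters has its update shrunk to roughly zero when `alpha=1.0`.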

Authors:Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro
Title: Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge
Abstract:
We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which are crucial for enabling AI agents to perceive and interact with the world effectively.
中文摘要:DCASE 2025挑战赛第五任务推出音频问答基准,通过三个专业子集评估音频语言模型在多领域声学场景中的交互推理能力,采用跨领域数据集和基线系统推动模型实现接近人类水平的音频理解。
English Summary: DCASE 2025 Challenge Task 5 introduces an Audio Question Answering benchmark with three specialized subsets to evaluate audio-language models' reasoning across diverse acoustic domains, using multi-domain datasets and baseline systems to advance toward human-level audio understanding.
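
One way to read "top-1 accuracy with answer-shuffling robustness" is sketched below. This is an assumption about the protocol, not the official scoring script: each multiple-choice item is presented several times with its options in a random order, and a prediction counts as correct only if the chosen option text matches the reference answer. The function name `shuffled_accuracy` and the `predict(question, options) -> index` interface are hypothetical.

```python
import random


def shuffled_accuracy(items, predict, num_shuffles=3, seed=0):
    """Top-1 accuracy averaged over shuffled option orders.

    items: list of (question, options, answer_text) tuples.
    predict: callable returning the index of the chosen option.
    """
    rng = random.Random(seed)
    correct = 0
    total = 0
    for question, options, answer in items:
        for _ in range(num_shuffles):
            order = options[:]  # copy so the caller's list is untouched
            rng.shuffle(order)
            choice = predict(question, order)
            correct += int(order[choice] == answer)
            total += 1
    return correct / total
```

A model that always picks the same position, regardless of content, scores near chance under this scheme, which is the behavior the shuffling is meant to expose.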

Authors:Lianghong Guo, Wei Tao, Runhan Jiang, Yanlin Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng
Title: OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
Abstract:
The GitHub issue resolution task aims to resolve issues reported in repositories automatically. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks have been proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, current benchmarks focus on a single programming language, limiting the evaluation of issues from repositories across different languages. Second, they usually cover a narrow range of domains, which may fail to represent the diversity of real-world issues. Third, existing benchmarks rely solely on textual information in issue descriptions, overlooking multimodal information such as images in issues. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task instances, which are collected from repositories across four programming languages (i.e., Python, JavaScript, TypeScript, and Java) and eight different domains. Our evaluation shows that current LLMs show limited performance on OmniGIRL. Notably, the best-performing model, GPT-4o, resolves only 8.6% of the issues. Besides, we find that current LLMs struggle to resolve issues requiring understanding images. The best performance is achieved by Claude-3.5-Sonnet, which resolves only 10.5% of the issues with image information. Finally, we analyze the reasons behind current LLMs' failure on OmniGIRL, providing insights for future improvements.
中文: 本文提出了OmniGIRL,一个多语言、多模态、多领域的GitHub问题解决基准,揭示了当前大语言模型在处理多样化和包含图像的问题时的局限性。
English: This paper introduces OmniGIRL, a multilingual, multimodal, and multi-domain benchmark for GitHub issue resolution, which reveals the limitations of current LLMs in handling diverse and image-inclusive issues.

Authors:Hanxiang Xu, Yanjie Zhao, Haoyu Wang
Title: Directed Greybox Fuzzing via Large Language Model
Abstract:
Directed greybox fuzzing (DGF) focuses on efficiently reaching specific program locations or triggering particular behaviors, making it essential for tasks like vulnerability detection and crash reproduction. However, existing methods often suffer from path explosion and randomness in input mutation, leading to inefficiencies in exploring and exploiting target paths. In this paper, we propose HGFuzzer, an automatic framework that leverages the large language model (LLM) to address these challenges. HGFuzzer transforms path constraint problems into targeted code generation tasks, systematically generating test harnesses and reachable inputs to reduce unnecessary exploration paths significantly. Additionally, we implement custom mutators designed specifically for target functions, minimizing randomness and improving the precision of directed fuzzing. We evaluated HGFuzzer on 20 real-world vulnerabilities, successfully triggering 17, including 11 within the first minute, achieving a speedup of at least 24.8x compared to state-of-the-art directed fuzzers. Furthermore, HGFuzzer discovered 9 previously unknown vulnerabilities, all of which were assigned CVE IDs, demonstrating the effectiveness of our approach in identifying real-world vulnerabilities.
Chinese: HGFuzzer 提出了一种基于大语言模型的框架,将路径约束转化为代码生成任务,并采用定制化变异器提升定向模糊测试效率,在真实漏洞测试中实现了显著加速并发现了新的安全漏洞。
English: HGFuzzer introduces an LLM-powered framework that converts path constraints into code generation tasks and employs custom mutators to enhance directed fuzzing efficiency, achieving significant speed improvements and uncovering new vulnerabilities in real-world applications.

Authors:Mingcheng Li, Xiaolu Hou, Ziyang Liu, Dingkang Yang, Ziyun Qian, Jiawei Chen, Jinjie Wei, Yue Jiang, Qingyao Xu, Lihua Zhang
Title: MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation
Abstract:
Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose a Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation for complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, Hierarchical Compositional diffusion utilizes a Gaussian mask and filtering to refine bounding box regions and enhance objects through region enhancement, resulting in the accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, providing a substantial advantage in complex scene generation.
Chinese: 提出的基于多智能体协作的组合扩散(MCCD)框架通过多智能体场景解析和分层组合扩散技术,无需额外训练即可显著提升基线模型性能,精准生成包含多对象与复杂关系的场景。
English: The proposed Multi-agent Collaboration-based Compositional Diffusion (MCCD) framework enhances complex scene generation by employing multi-agent scene parsing and hierarchical compositional diffusion to accurately render multiple objects and relations while significantly improving baseline model performance without additional training.

Authors:Xinyi Hou, Jiahao Han, Yanjie Zhao, Haoyu Wang
Title: Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study
Abstract:
Large language models (LLMs) are increasingly deployed through open-source and commercial frameworks, enabling individuals and organizations to self-host advanced LLM capabilities. As LLM deployments become prevalent, particularly in industry, ensuring their secure and reliable operation has become a critical issue. However, insecure defaults and misconfigurations often expose LLM services to the public internet, posing serious security and system engineering risks. This study conducted a large-scale empirical investigation of public-facing LLM deployments, focusing on the prevalence of services, exposure characteristics, systemic vulnerabilities, and associated risks. Through internet-wide measurements, we identified 320,102 public-facing LLM services across 15 frameworks and extracted 158 unique API endpoints, categorized into 12 functional groups based on functionality and security risk. Our analysis found that over 40% of endpoints used plain HTTP, and over 210,000 endpoints lacked valid TLS metadata. API exposure was highly inconsistent: some frameworks, such as Ollama, responded to over 35% of unauthenticated API requests, with about 15% leaking model or system information, while other frameworks implemented stricter controls. We observed widespread use of insecure protocols, poor TLS configurations, and unauthenticated access to critical operations. These security risks, such as model leakage, system compromise, and unauthorized access, are pervasive and highlight the need for a secure-by-default framework and stronger deployment practices.
中文摘要:大型语言模型部署因不安全的默认设置和配置错误面临严重安全风险,普遍存在未加密HTTP使用、TLS配置不当和未授权访问等问题,导致模型泄露和系统受损风险广泛存在。
English Summary: Large language model deployments face significant security risks due to insecure defaults and misconfigurations, with widespread issues including unencrypted HTTP usage, poor TLS configurations, and unauthenticated access exposing services to model leakage and system compromise.

Authors:Siran Peng, Zipei Wang, Li Gao, Xiangyu Zhu, Tianshuo Zhang, Ajian Liu, Haoyuan Zhang, Zhen Lei
Title: MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution
Abstract:
Reliable face forgery detection algorithms are crucial for countering the growing threat of deepfake-driven disinformation. Previous research has demonstrated the potential of Multimodal Large Language Models (MLLMs) in identifying manipulated faces. However, existing methods typically depend on either the Large Language Model (LLM) alone or an external detector to generate classification results, which often leads to sub-optimal integration of visual and textual modalities. In this paper, we propose VLF-FFD, a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection. Our key contributions are twofold. First, we present EFF++, a frame-level, explainability-driven extension of the widely used FaceForensics++ (FF++) dataset. In EFF++, each manipulated video frame is paired with a textual annotation that describes both the forgery artifacts and the specific manipulation technique applied, enabling more effective and informative MLLM training. Second, we design a Vision-Language Fusion Network (VLF-Net) that promotes bidirectional interaction between visual and textual features, supported by a three-stage training pipeline to fully leverage its potential. VLF-FFD achieves state-of-the-art (SOTA) performance in both cross-dataset and intra-dataset evaluations, underscoring its exceptional effectiveness in face forgery detection.
中文: 本文提出VLF-FFD这一新颖的视觉语言融合方案,通过构建带文本标注的数据集EFF++和双向交互网络VLF-Net来增强多模态大语言模型的伪造人脸检测能力,实现了最先进的检测性能。
English: This paper introduces VLF-FFD, a novel vision-language fusion method for face forgery detection that enhances multimodal large language models through a new annotated dataset EFF++ and a bidirectional interaction network VLF-Net, achieving state-of-the-art performance.

Authors:Onur Mutlu, Ataberk Olgun, Ismail Emir Yuksel
Title: Memory-Centric Computing: Solving Computing's Memory Problem
Abstract:
Computing has a huge memory problem. The memory system, consisting of multiple technologies at different levels, is responsible for most of the energy consumption, performance bottlenecks, robustness problems, monetary cost, and hardware real estate of a modern computing system. All this becomes worse as modern and emerging applications become more data-intensive (as we readily witness in, e.g., machine learning, genome analysis, graph processing, and data analytics), making the memory system an even larger bottleneck. In this paper, we discuss two major challenges that greatly affect computing system performance and efficiency: 1) memory technology & capacity scaling (at the lower device and circuit levels) and 2) system and application performance & energy scaling (at the higher levels of the computing stack). We demonstrate that both types of scaling have become extremely difficult, wasteful, and costly due to the dominant processor-centric design & execution paradigm of computers, which treats memory as a dumb and inactive component that cannot perform any computation. We show that moving to a memory-centric design & execution paradigm can solve the major challenges, while enabling multiple other potential benefits. In particular, we demonstrate that: 1) memory technology scaling problems (e.g., RowHammer, RowPress, Variable Read Disturbance, data retention, and other issues yet to be discovered) can be much more easily and efficiently handled by enabling memory to autonomously manage itself; 2) system and application performance & energy efficiency can, at the same time, be improved by orders of magnitude by enabling computation capability in memory chips and structures (i.e., processing in memory). We discuss adoption challenges against enabling memory-centric computing, and describe how we can get there step-by-step via an evolutionary path.
中文: 现代计算面临严重的内存瓶颈,内存系统消耗大部分能源并制约性能,但转向内存中心范式、支持内存内计算能有效解决这些问题,并大幅提升系统效率。
English: Modern computing faces a critical memory bottleneck, where the memory system consumes most energy and limits performance, but shifting to a memory-centric paradigm that enables in-memory processing can resolve these issues and unlock significant efficiency gains.

Authors:Siqi Li, Yufan Shen, Xiangnan Chen, Jiayi Chen, Hengwei Ju, Haodong Duan, Song Mao, Hongbin Zhou, Bo Zhang, Bin Fu, Pinlong Cai, Licheng Wen, Botian Shi, Yong Liu, Xinyu Cai, Yu Qiao
Title: GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling
Abstract:
The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce a General Document Intelligence Benchmark (GDI-Bench), featuring 2.3k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity and reasoning complexity, the GDI-Bench structures graded tasks that allow performance assessment by difficulty, aiding in model weakness identification and optimization guidance. We evaluate various open-source and closed-source models on GDI-Bench, conducting decoupled analyses in the visual and reasoning domains, revealing their strengths and weaknesses. To address the diverse tasks and domains in the GDI-Bench, we propose a GDI-Model that mitigates catastrophic forgetting during the supervised fine-tuning (SFT) process through an intelligence-preserving training strategy, thereby reinforcing the inherent weaknesses of the base model. Our model achieves state-of-the-art performance on previous benchmarks and the GDI-Bench. Both our benchmark and models are or will be open-sourced on https://huggingface.co/GDIBench.
中文: GDI-Bench作为包含2.3千张图像和19个文档任务的综合基准,旨在评估多模态模型能力并识别其弱点,而提出的GDI模型通过智能保留训练策略实现了最先进的性能。
English: The GDI-Bench is introduced as a comprehensive benchmark with 2.3k images across 19 document tasks to evaluate multimodal models' capabilities and identify weaknesses, while the proposed GDI-Model achieves state-of-the-art performance through an intelligence-preserving training strategy.

Authors:Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Title: Hyperbolic Dataset Distillation
Abstract:
To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20% of the distilled core set to retain model performance, while significantly improving training stability. Notably, HDD is seamlessly compatible with most existing DM methods, and extensive experiments on different datasets validate its effectiveness.
中文: 针对现有欧几里得空间数据集蒸馏方法忽略数据层次结构的局限,HDD引入双曲空间嵌入,通过测地距离对齐合成与原始数据分布,有效保留几何关系,实现大幅数据集压缩且性能损失极小。
English: To overcome the limitations of Euclidean-based dataset distillation methods that ignore hierarchical structures, HDD introduces hyperbolic space embedding to align synthetic and original data distributions using geodesic distance, effectively preserving geometric relationships and enabling significant dataset compression with minimal performance loss.
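
The centroid-distance idea in the abstract can be sketched in the Lorentz model with curvature -1. The sketch below assumes features are lifted onto the hyperboloid with the exponential map at the origin and that the centroid is the normalized sum of the lifted points; the abstract does not specify either choice, so both are assumptions, and all function names are hypothetical.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def exp_map_origin(u):
    """Lift a Euclidean feature vector u onto the hyperboloid."""
    norm = math.sqrt(sum(a * a for a in u))
    if norm == 0.0:
        return [1.0] + [0.0] * len(u)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in u]


def lorentz_centroid(points):
    """Assumed centroid: sum of points rescaled back onto the hyperboloid."""
    total = [sum(coords) for coords in zip(*points)]
    norm = math.sqrt(-lorentz_inner(total, total))
    return [t / norm for t in total]


def geodesic_distance(x, y):
    """Hyperbolic distance; the clamp guards against rounding below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))
```

A distillation loss in this style would compute `geodesic_distance` between the centroid of lifted synthetic features and the centroid of lifted real features, then minimize it with respect to the synthetic data.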

Authors:Fanhang Man, Xiaoyue Chen, Huandong Wang, Baining Zhao, Han Li, Xinlei Chen, Yong Li
Title: KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval
Abstract:
Understanding what emotions images evoke in their viewers is a foundational goal in human-centric visual computing. While recent advances in vision-language models (VLMs) have shown promise for visual emotion analysis (VEA), several key challenges remain unresolved. Emotional cues in images are often abstract, overlapping, and entangled, making them difficult to model and interpret. Moreover, VLMs struggle to align these complex visual patterns with emotional semantics due to limited supervision and sparse emotional grounding. Finally, existing approaches lack structured affective knowledge to resolve ambiguity and ensure consistent emotional reasoning across diverse visual domains. To address these limitations, we propose K-EVER², a knowledge-enhanced framework for emotion reasoning and retrieval. Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment. Without relying on handcrafted labels or direct emotion supervision, K-EVER² achieves robust and interpretable emotion predictions across heterogeneous image types. We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts. K-EVER² consistently outperforms strong CNN and VLM baselines, achieving up to a 19% accuracy gain for specific emotions and a 12.3% average accuracy gain across all emotion categories. Our results demonstrate a scalable and generalizable solution for advancing emotional understanding of visual content.
中文: 提出的K-EVER²框架通过整合结构化情感知识和多模态对齐,解决了视觉情感分析中的关键难题,在无需直接情感监督的情况下实现了显著的准确率提升。
English: The proposed K-EVER² framework addresses visual emotion analysis challenges by integrating structured affective knowledge and multimodal alignment, achieving significant accuracy improvements without direct emotion supervision.

Authors:Fanhang Man, Huandong Wang, Jianjie Fang, Zhaoyi Deng, Baining Zhao, Xinlei Chen, Yong Li
Title: Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents
Abstract:
User sentiment on social media reveals the underlying social trends, crises, and needs. Researchers have analyzed users' past messages to trace the evolution of sentiments and reconstruct sentiment dynamics. However, predicting the imminent sentiment of an ongoing event is rarely studied. In this paper, we address the problem of sentiment forecasting on social media to predict the user's future sentiment in response to the development of the event. We extract sentiment-related features to enhance the modeling skill and propose a multi-perspective role-playing framework to simulate the process of human response. Our preliminary results show significant improvement in sentiment forecasting on both microscopic and macroscopic levels.
中文摘要:本文提出一种社交媒体情感预测方法,通过提取情感相关特征并采用多视角角色扮演框架来模拟人类响应过程,在微观和宏观层面均显著提升了对于持续事件中用户未来情感的预测能力。
English Summary: This paper introduces a sentiment forecasting method for social media that uses sentiment-related features and a multi-perspective role-playing framework to predict users' future sentiments about ongoing events, showing significant improvements at both micro and macro levels.

Authors:Whenty Ariyanti, Kuan-Yu Chen, Sabato Marco Siniscalchi, Hsin-Min Wang, Yu Tsao
Title: Towards Robust Assessment of Pathological Voices via Combined Low-Level Descriptors and Foundation Model Representations
Abstract:
Perceptual voice quality assessment is essential for diagnosing and monitoring voice disorders by providing standardized evaluations of vocal function. Traditionally, expert raters use standard scales such as the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS). However, these metrics are subjective and prone to inter-rater variability, motivating the need for automated, objective assessment methods. This study proposes Voice Quality Assessment Network (VOQANet), a deep learning-based framework with an attention mechanism that leverages a Speech Foundation Model (SFM) to extract high-level acoustic and prosodic information from raw speech. To enhance robustness and interpretability, we also introduce VOQANet+, which integrates low-level speech descriptors such as jitter, shimmer, and harmonics-to-noise ratio (HNR) with SFM embeddings into a hybrid representation. Unlike prior studies focused only on vowel-based phonation (PVQD-A subset) of the Perceptual Voice Quality Dataset (PVQD), we evaluate our models on both vowel-based and sentence-level speech (PVQD-S subset) to improve generalizability. Results show that sentence-based input outperforms vowel-based input, especially at the patient level, underscoring the value of longer utterances for capturing perceptual voice attributes. VOQANet consistently surpasses baseline methods in root mean squared error (RMSE) and Pearson correlation coefficient (PCC) across CAPE-V and GRBAS dimensions, with VOQANet+ achieving even better performance. Additional experiments under noisy conditions show that VOQANet+ maintains high prediction accuracy and robustness, supporting its potential for real-world and telehealth deployment.
中文摘要:本研究提出的VOQANet及其增强版VOQANet+,通过结合语音基础模型与声学特征实现了感知语音质量评估的自动化,在元音和句子级语音数据上均展现出优于传统方法的准确性和鲁棒性。
English Summary: This study introduces VOQANet and its enhanced version VOQANet+, deep learning frameworks that combine speech foundation models with acoustic features to automate perceptual voice quality assessment, demonstrating superior accuracy and robustness over traditional methods on both vowel and sentence-level speech data.
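
The two reported metrics, RMSE and PCC, follow their standard definitions and can be computed as in the sketch below. The function names and the toy rating lists in the usage note are illustrative only and are not taken from the paper.

```python
import math


def rmse(pred, target):
    """Root mean squared error between predicted and reference ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pearson(pred, target):
    """Pearson correlation coefficient (undefined if either list is constant)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in target))
    return cov / (std_p * std_t)
```

For example, `rmse(model_scores, rater_scores)` and `pearson(model_scores, rater_scores)` would be evaluated separately for each CAPE-V or GRBAS dimension; lower RMSE and higher PCC indicate closer agreement with the expert ratings.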

Authors:Junyan Zhang, Yubo Gao, Yibo Yan, Jungang Li, Zhaorui Hou, Sicheng Tao, Shuliang Liu, Song Dai, Yonghua Hei, Junzhuo Li, Xuming Hu
Title: Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM's Instruction-Following Capabilities
Abstract:
The fine-tuning of Large Language Models (LLMs) has significantly advanced their instruction-following capabilities, yet the underlying computational mechanisms driving these improvements remain poorly understood. This study systematically examines how fine-tuning reconfigures LLM computations by isolating and analyzing instruction-specific sparse components, i.e., neurons in dense models and both neurons and experts in Mixture-of-Experts (MoE) architectures. In particular, we introduce HexaInst, a carefully curated and balanced instructional dataset spanning six distinct categories, and propose SPARCOM, a novel analytical framework comprising three key contributions: (1) a method for identifying these sparse components, (2) an evaluation of their functional generality and uniqueness, and (3) a systematic comparison of their alterations. Through experiments, we demonstrate functional generality, uniqueness, and the critical role of these components in instruction execution. By elucidating the relationship between fine-tuning-induced adaptations and sparse computational substrates, this work provides deeper insights into how LLMs internalize instruction-following behavior for the trustworthy LLM community.
中文: 本研究通过分析稀疏计算组件,探究微调如何提升大语言模型的指令遵循能力,引入HexaInst数据集和SPARCOM框架,揭示了这些组件的功能作用与适应性变化。
English: This research investigates how fine-tuning enhances instruction-following in Large Language Models by analyzing sparse computational components, introducing the HexaInst dataset and SPARCOM framework to reveal their functional roles and adaptations.

Authors:Xihuan Lin, Jie Zhang, Gelei Deng, Tianzhe Liu, Xiaolong Liu, Changcai Yang, Tianwei Zhang, Qing Guo, Riqing Chen
Title: IRCopilot: Automated Incident Response with Large Language Models
Abstract:
Incident response plays a pivotal role in mitigating the impact of cyber attacks. In recent years, the intensity and complexity of global cyber threats have grown significantly, making it increasingly challenging for traditional threat detection and incident response methods to operate effectively in complex network environments. While Large Language Models (LLMs) have shown great potential in early threat detection, their capabilities remain limited when it comes to automated incident response after an intrusion. To address this gap, we construct an incremental benchmark based on real-world incident response tasks to thoroughly evaluate the performance of LLMs in this domain. Our analysis reveals several key challenges that hinder the practical application of contemporary LLMs, including context loss, hallucinations, privacy protection concerns, and their limited ability to provide accurate, context-specific recommendations. In response to these challenges, we propose IRCopilot, a novel framework for automated incident response powered by LLMs. IRCopilot mimics the three dynamic phases of a real-world incident response team using four collaborative LLM-based session components. These components are designed with clear divisions of responsibility, reducing issues such as hallucinations and context loss. Our method leverages diverse prompt designs and strategic responsibility segmentation, significantly improving the system's practicality and efficiency. Experimental results demonstrate that IRCopilot outperforms baseline LLMs across key benchmarks, achieving sub-task completion rates of 150%, 138%, 136%, 119%, and 114% for various response tasks. Moreover, IRCopilot exhibits robust performance on public incident response platforms and in real-world attack scenarios, showcasing its strong applicability.
Chinese: 事件响应在减轻网络攻击影响中至关重要,然而传统方法难以应对现代威胁,为此我们提出了IRCopilot,一种基于大语言模型的新型框架,通过解决幻觉和上下文丢失等挑战,显著提升了自动化响应能力,在基准测试和实际场景中均表现出卓越性能。
English: Incident response is crucial for mitigating cyber attacks, but traditional methods struggle with modern threats, leading to the development of IRCopilot, a novel LLM-based framework that enhances automated response by addressing challenges like hallucinations and context loss, achieving superior performance in benchmarks and real-world scenarios.

Authors:Fengyuan Sun, Leqi Shen, Hui Chen, Sicheng Zhao, Jungong Han, Guiguang Ding
Title: AdaTP: Attention-Debiased Token Pruning for Video Large Language Models
Abstract:
Video Large Language Models (Video LLMs) have achieved remarkable results in video understanding tasks. However, they often suffer from heavy computational overhead due to the large number of visual tokens generated from multiple video frames. Existing visual token compression methods often rely on attention scores from language models as guidance. However, these scores exhibit inherent biases: global bias reflects a tendency to focus on the two ends of the visual token sequence, while local bias leads to an over-concentration on the same spatial positions across different frames. To address the issue of attention bias, we propose Attention-Debiased Token Pruning for Video Large Language Models (AdaTP), a novel token pruning pipeline for Video LLMs. AdaTP integrates two dedicated debiasing modules into the pipeline, targeting global attention bias and local attention bias, respectively. Without the need for additional training, our method significantly reduces the computational overhead of Video LLMs while retaining the performance of vanilla models. Extensive evaluation shows that AdaTP achieves state-of-the-art performance in various commonly used video understanding benchmarks. In particular, on LLaVA-OneVision-7B, AdaTP maintains performance without degradation while using only up to 27.3% FLOPs compared to the vanilla model. Our code will be released soon.
中文: 视频大语言模型存在注意力偏差导致计算效率低下,而AdaTP方法通过消除全局和局部注意力偏差,无需额外训练即可显著降低计算成本,在仅使用27.3%计算量的情况下保持模型性能无损。
English: Video LLMs face computational inefficiency due to biased attention mechanisms, but the proposed AdaTP method effectively reduces computational overhead by debiasing global and local attention without additional training, maintaining performance while using only 27.3% of the original FLOPs.

Authors:Yu Shang, Peijie Liu, Yuwei Yan, Zijing Wu, Leheng Sheng, Yuanqing Yu, Chumeng Jiang, An Zhang, Fengli Xu, Yu Wang, Min Zhang, Yong Li
Title: AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems
Abstract:
The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs' advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving-interest, and cold-start recommendation tasks); (2) a unified modular framework for developing and studying agentic recommender systems; and (3) the first comprehensive benchmark comparing 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a continuously maintained leaderboard (https://tsinghua-fib-lab.github.io/AgentSocietyChallenge/pages/overview.html), fostering ongoing community engagement and reproducible research. The benchmark is available at: https://huggingface.co/datasets/SGJQovo/AgentRecBench.
中文: 基于大语言模型的智能推荐系统通过自主决策能力超越传统方法,但缺乏标准化评估;本研究提出模拟器、模块化框架和首个综合基准,并公开资源以推动可复现研究。
English: Agentic recommender systems using LLMs enable autonomous decision-making and outperform traditional methods, but require standardized evaluation, which this study addresses by proposing a simulator, a modular framework, and a comprehensive benchmark with publicly available resources.

Authors:Mingzhuo Li, Guang Li, Jiafeng Mao, Takahiro Ogawa, Miki Haseyama
Title: Diversity-Driven Generative Dataset Distillation Based on Diffusion Model with Self-Adaptive Memory
Abstract:
Dataset distillation enables the training of deep neural networks with comparable performance in significantly reduced time by compressing large datasets into small and representative ones. Although the introduction of generative models has made great achievements in this field, the distributions of their distilled datasets are not diverse enough to represent the original ones, leading to a decrease in downstream validation accuracy. In this paper, we present a diversity-driven generative dataset distillation method based on a diffusion model to solve this problem. We introduce self-adaptive memory to align the distribution between distilled and real datasets, assessing the representativeness. The degree of alignment leads the diffusion model to generate more diverse datasets during the distillation process. Extensive experiments show that our method outperforms existing state-of-the-art methods in most situations, proving its ability to tackle dataset distillation tasks.
中文摘要:本文提出了一种基于扩散模型的多样性驱动生成式数据集蒸馏方法,通过自适应记忆机制增强分布对齐和多样性,在多数情况下优于现有最优方法。
English Summary: This paper introduces a diversity-driven generative dataset distillation method using a diffusion model with self-adaptive memory to enhance distribution alignment and diversity, achieving superior performance over existing methods in most scenarios.

Authors:Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger
Title: BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Behavioural Change
Abstract:
Recognizing complex emotions linked to ambivalence and hesitancy (A/H) can play a critical role in the personalization and effectiveness of digital behaviour change interventions. These subtle and conflicting emotions are manifested by a discord between multiple modalities, such as facial and vocal expressions, and body language. Although experts can be trained to identify A/H, integrating them into digital interventions is costly and less effective. Automatic learning systems provide a cost-effective alternative that can adapt to individual users and operate seamlessly within real-time, resource-limited environments. However, there are currently no datasets available for the design of ML models to recognize A/H. This paper introduces the first Behavioural Ambivalence/Hesitancy (BAH) dataset, collected for subject-based multimodal recognition of A/H in videos. It contains videos from 224 participants captured across 9 provinces in Canada, spanning different ages and ethnicities. Through our web platform, we recruited participants to answer 7 questions, some of which were designed to elicit A/H, while recording themselves via webcam with microphone. BAH amounts to 1,118 videos for a total duration of 8.26 hours, with 1.5 hours of A/H. Our behavioural team annotated timestamp segments to indicate where A/H occurs and provided frame- and video-level annotations with the A/H cues. Video transcripts and their timestamps are also included, along with cropped and aligned faces in each frame and a variety of participant metadata. We include baseline results for BAH at frame- and video-level recognition in multimodal setups, in addition to zero-shot prediction and personalization using unsupervised domain adaptation. The limited performance of baseline models highlights the challenges of recognizing A/H in real-world videos. The data, code, and pretrained weights are available.
中文摘要:本文介绍了首个用于视频中行为矛盾/犹豫识别的多模态数据集,该数据集采集自加拿大各地的多样化参与者,旨在解决当前该领域机器学习模型开发中数据匮乏的问题。
English Summary: This paper introduces the first multimodal dataset for recognizing behavioral ambivalence and hesitancy in videos, collected from diverse participants across Canada, to address the current lack of data for developing machine learning models in this area.

Authors:Zhenhao Zhang, Ye Shi, Lingxiao Yang, Suting Ni, Qi Ye, Jingya Wang
Title: OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model
Abstract:
Understanding and synthesizing realistic 3D hand-object interactions (HOI) is critical for applications ranging from immersive AR/VR to dexterous robotics. Existing methods struggle with generalization, performing well on closed-set objects and predefined tasks but failing to handle unseen objects or open-vocabulary instructions. We introduce OpenHOI, the first framework for open-world HOI synthesis, capable of generating long-horizon manipulation sequences for novel objects guided by free-form language commands. Our approach integrates a 3D Multimodal Large Language Model (MLLM) fine-tuned for joint affordance grounding and semantic task decomposition, enabling precise localization of interaction regions (e.g., handles, buttons) and breakdown of complex instructions (e.g., "Find a water bottle and take a sip") into executable sub-tasks. To synthesize physically plausible interactions, we propose an affordance-driven diffusion model paired with a training-free physics refinement stage that minimizes penetration and optimizes affordance alignment. Evaluations across diverse scenarios demonstrate OpenHOI's superiority over state-of-the-art methods in generalizing to novel object categories, multi-stage tasks, and complex language instructions. Our project page is at https://openhoi.github.io.
中文: OpenHOI是首个开放世界手物交互合成框架,通过结合多模态语言模型进行功能感知定位与任务分解,并采用物理优化的扩散模型,能够根据自由形式语言指令为未见物体生成逼真的长序列操作动作,实现卓越的泛化能力。
English: OpenHOI is a pioneering framework that synthesizes realistic 3D hand-object interactions for unseen objects using open-vocabulary language commands, integrating a multimodal language model for affordance grounding and task decomposition with a physics-refined diffusion model to ensure generalization and physical plausibility.

Authors:Yahao Fan, Tianxiang Gui, Kaiyang Ji, Shutong Ding, Chixuan Zhang, Jiayuan Gu, Jingyi Yu, Jingya Wang, Ye Shi
Title: One Policy but Many Worlds: A Scalable Unified Policy for Versatile Humanoid Locomotion
Abstract:
Humanoid locomotion faces a critical scalability challenge: traditional reinforcement learning (RL) methods require task-specific rewards and struggle to leverage growing datasets, even as more training terrains are introduced. We propose DreamPolicy, a unified framework that enables a single policy to master diverse terrains and generalize zero-shot to unseen scenarios by systematically integrating offline data and diffusion-driven motion synthesis. At its core, DreamPolicy introduces Humanoid Motion Imagery (HMI): future state predictions synthesized through an autoregressive terrain-aware diffusion planner curated by aggregating rollouts from specialized policies across various distinct terrains. Unlike human motion datasets requiring laborious retargeting, our data directly captures humanoid kinematics, enabling the diffusion planner to synthesize "dreamed" trajectories that encode terrain-specific physical constraints. These trajectories act as dynamic objectives for our HMI-conditioned policy, bypassing manual reward engineering and enabling cross-terrain generalization. DreamPolicy addresses the scalability limitations of prior methods: while traditional RL fails to exploit growing datasets, our framework scales seamlessly with more offline data. As the dataset expands, the diffusion prior learns richer locomotion skills, which the policy leverages to master new terrains without retraining. Experiments demonstrate that DreamPolicy achieves an average success rate of 90% in training environments and an average of 20% higher success on unseen terrains than the prevalent method. It also generalizes to perturbed and composite scenarios where prior approaches collapse. By unifying offline data, diffusion-based trajectory synthesis, and policy optimization, DreamPolicy overcomes the "one task, one policy" bottleneck, establishing a paradigm for scalable, data-driven humanoid control.
Chinese Summary: DreamPolicy通过整合离线数据与扩散驱动运动合成,提出统一框架解决人形机器人运动扩展性难题,无需任务特定奖励即可实现未知地形的零样本泛化。
English Summary: DreamPolicy is a unified framework that overcomes scalability challenges in humanoid locomotion by integrating offline data with diffusion-driven motion synthesis, enabling zero-shot generalization to unseen terrains without task-specific rewards.

Authors:Zhaoyang Wang, Jinqi Jiang, Tian Qiu, Hui Liu, Xianfeng Tang, Huaxiu Yao
Title: Efficient Long CoT Reasoning in Small Language Models
Abstract:
Recent large reasoning models such as DeepSeek-R1 exhibit strong complex problem-solving abilities by generating long chain-of-thought (CoT) reasoning steps. It is challenging to directly train small language models (SLMs) to produce long CoT. Thus, distillation becomes a practical method to give SLMs this reasoning ability. However, long CoT often contains a lot of redundant content (e.g., overthinking steps), which can make it hard for SLMs to learn given their relatively limited capacity and generalization. To address this issue, we propose a simple yet effective method to prune unnecessary steps in long CoT, and then employ an on-policy method for the SLM itself to curate valid and useful long CoT training data. In this way, SLMs can effectively learn efficient long CoT reasoning and preserve competitive performance at the same time. Experimental results across a series of mathematical reasoning benchmarks demonstrate the effectiveness of the proposed method in distilling long CoT reasoning ability into SLMs, maintaining competitive performance while significantly reducing the generation of redundant reasoning steps.
Chinese: 本研究提出一种剪裁大型模型中长思维链推理冗余步骤的方法,并通过策略内数据筛选将高效推理能力有效蒸馏到小型语言模型中,在保持竞争力的同时显著减少了冗余推理步骤的生成。
English: This study introduces a method to prune redundant steps in long chain-of-thought reasoning from large models and uses on-policy data curation to effectively distill efficient reasoning abilities into small language models, maintaining competitive performance while reducing redundancy.

Authors:Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong
Title: Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning
Abstract:
Mathematical reasoning through Chain-of-Thought (CoT) has emerged as a powerful capability of Large Language Models (LLMs), which can be further enhanced through Test-Time Scaling (TTS) methods like Beam Search and DVTS. However, these methods, despite improving accuracy by allocating more computational resources during inference, often suffer from path homogenization and inefficient use of intermediate results. To address these limitations, we propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results. Experimental results show that SRCA improves reasoning accuracy compared to existing TTS methods across various mathematical datasets.
中文: 提出的逐步推理检查点分析(SRCA)框架通过引入检查点,结合答案聚类搜索和检查点候选增强策略,有效解决了思维链推理中的路径同质化和效率低下问题,显著提升了数学推理的准确性。
English: The proposed Stepwise Reasoning Checkpoint Analysis (SRCA) framework addresses path homogenization and inefficiency in Chain-of-Thought reasoning by introducing checkpoints with Answer-Clustered Search and Checkpoint Candidate Augmentation, significantly improving mathematical reasoning accuracy over existing methods.

Authors:Jingwen Cheng, Ruikun Li, Huandong Wang, Yong Li
Title: Sparse Diffusion Autoencoder for Test-time Adapting Prediction of Complex Systems
Abstract:
Predicting the behavior of complex systems is critical in many scientific and engineering domains, and hinges on the model's ability to capture their underlying dynamics. Existing methods encode the intrinsic dynamics of high-dimensional observations through latent representations and predict autoregressively. However, these latent representations lose the inherent spatial structure of spatiotemporal dynamics, leaving the predictor unable to effectively model spatial interactions and causing it to neglect emerging dynamics during long-term prediction. In this work, we propose SparseDiff, introducing a test-time adaptation strategy to dynamically update the encoding scheme to accommodate emergent spatiotemporal structures during the long-term evolution of the system. Specifically, we first design a codebook-based sparse encoder, which coarsens the continuous spatial domain into a sparse graph topology. Then, we employ a graph neural ordinary differential equation to model the dynamics and guide a diffusion decoder for reconstruction. SparseDiff autoregressively predicts the spatiotemporal evolution and adjusts the sparse topological structure to adapt to emergent spatiotemporal patterns by adaptive re-encoding. Extensive evaluations on representative systems demonstrate that SparseDiff achieves an average prediction error reduction of 49.99% compared to baselines, requiring only 1% of the spatial resolution.
中文摘要:SparseDiff通过测试时自适应策略动态更新稀疏图表示来捕捉长期预测中涌现的时空模式,相比现有方法仅需1%空间分辨率即可实现49.99%的平均预测误差降低。
English Summary: SparseDiff introduces a test-time adaptation strategy that dynamically updates sparse graph representations to capture emergent spatiotemporal patterns during long-term prediction, achieving a 49.99% error reduction using only 1% spatial resolution compared to existing methods.

Authors:Youliang Yuan, Wenxiang Jiao, Yuejin Xie, Chihao Shen, Menghan Tian, Wenxuan Wang, Jen-tse Huang, Pinjia He
Title: Towards Evaluating Proactive Risk Awareness of Multimodal Language Models
Abstract:
Human safety awareness gaps often prevent the timely recognition of everyday risks. In solving this problem, a proactive safety artificial intelligence (AI) system would work better than a reactive one. Instead of just reacting to users' questions, it would actively watch people's behavior and their environment to detect potential dangers in advance. Our Proactive Safety Bench (PaSBench) evaluates this capability through 416 multimodal scenarios (128 image sequences, 288 text logs) spanning 5 safety-critical domains. Evaluation of 36 advanced models reveals fundamental limitations: Top performers like Gemini-2.5-pro achieve 71% image and 64% text accuracy, but miss 45-55% risks in repeated trials. Through failure analysis, we identify unstable proactive reasoning rather than knowledge deficits as the primary limitation. This work establishes (1) a proactive safety benchmark, (2) systematic evidence of model limitations, and (3) critical directions for developing reliable protective AI. We believe our dataset and findings can promote the development of safer AI assistants that actively prevent harm rather than merely respond to requests. Our dataset can be found at https://huggingface.co/datasets/Youliang/PaSBench.
中文摘要:该研究提出PaSBench基准,用于评估主动安全人工智能系统,发现当前先进模型虽具中等准确率,但因主动推理不稳定而漏判近半数潜在风险,为开发可靠防护型AI指明关键方向。
English Summary: The study introduces PaSBench, a benchmark for evaluating proactive safety AI systems, revealing that current models like Gemini-2.5-pro struggle with unstable reasoning and miss nearly half of potential risks despite moderate accuracy.

Authors:Omar Moured, Yufan Chen, Ruiping Liu, Simon Reiß, Philip Torr, Jiaming Zhang, Rainer Stiefelhagen
Title: CHAOS: Chart Analysis with Outlier Samples
Abstract:
Charts play a critical role in data analysis and visualization, yet real-world applications often present charts with challenging or noisy features. However, "outlier charts" pose a substantial challenge even for Multimodal Large Language Models (MLLMs), which can struggle to interpret perturbed charts. In this work, we introduce CHAOS (CHart Analysis with Outlier Samples), a robustness benchmark to systematically evaluate MLLMs against chart perturbations. CHAOS encompasses five types of textual and ten types of visual perturbations, each presented at three levels of severity (easy, mid, hard) inspired by the study result of human evaluation. The benchmark includes 13 state-of-the-art MLLMs divided into three groups (i.e., general-, document-, and chart-specific models) according to the training scope and data. Comprehensive analysis involves two downstream tasks (ChartQA and Chart-to-Text). Extensive experiments and case studies highlight critical insights into robustness of models across chart perturbations, aiming to guide future research in chart understanding domain. Data and code are publicly available at: http://huggingface.co/datasets/omoured/CHAOS.
中文: CHAOS基准通过系统评估多模态大语言模型对多种文本和视觉图表扰动的鲁棒性,为推进图表理解研究提供了关键见解。
English: The CHAOS benchmark evaluates the robustness of Multimodal Large Language Models against various textual and visual chart perturbations across multiple severity levels, providing critical insights to advance chart understanding research.

Authors:Igor Udovichenko, Olivier Croissant, Anita Toleutaeva, Evgeny Burnaev, Alexander Korotin
Title: Risk-Averse Reinforcement Learning with Itakura-Saito Loss
Abstract:
Risk-averse reinforcement learning finds application in various high-stakes fields. Unlike classical reinforcement learning, which aims to maximize expected returns, risk-averse agents choose policies that minimize risk, occasionally sacrificing expected value. These preferences can be framed through utility theory. We focus on the specific case of the exponential utility function, where one can derive the Bellman equations and employ various reinforcement learning algorithms with few modifications. To address this, we introduce to the broad machine learning community a numerically stable and mathematically sound loss function based on the Itakura-Saito divergence for learning state-value and action-value functions. We evaluate the Itakura-Saito loss function against established alternatives, both theoretically and empirically. In the experimental section, we explore multiple scenarios, some with known analytical solutions, and show that the considered loss function outperforms the alternatives.
Chinese Summary: 风险规避强化学习采用最小化风险的策略而非最大化期望回报,本研究提出了一种基于Itakura-Saito散度的稳定损失函数,在价值函数学习中优于现有方法。
English Summary: Risk-averse reinforcement learning uses policies that minimize risk rather than maximize expected returns, and this study introduces a stable Itakura-Saito loss function that outperforms existing methods in learning value functions.
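The abstract above builds its loss on the Itakura-Saito divergence but does not spell the divergence out. For positive scalars p and q the standard definition is D_IS(p, q) = p/q - log(p/q) - 1. The sketch below implements only that textbook definition. How the paper applies it to state-value and action-value learning under exponential utility is not given here, so the helper `mean_is_loss` is a hypothetical illustration rather than the authors' loss.

```python
import math


def itakura_saito(p, q):
    """Standard Itakura-Saito divergence between two positive scalars."""
    if p <= 0 or q <= 0:
        raise ValueError("both arguments must be strictly positive")
    ratio = p / q
    return ratio - math.log(ratio) - 1.0


def mean_is_loss(targets, predictions):
    """Hypothetical helper: average divergence over paired positive values."""
    if len(targets) != len(predictions) or len(targets) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(itakura_saito(t, p) for t, p in zip(targets, predictions))
    return total / len(targets)


if __name__ == "__main__":
    print(itakura_saito(2.0, 2.0))  # zero when the arguments coincide
    print(itakura_saito(1.0, 2.0), itakura_saito(2.0, 1.0))  # asymmetric
```

Two properties are worth noting: the divergence is zero only when p equals q, and it is not symmetric in its arguments, so the order of target and prediction matters.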

Authors:Song Dai, Yibo Yan, Jiamin Su, Dongfang Zihao, Yubo Gao, Yonghua Hei, Jungang Li, Junyan Zhang, Sicheng Tao, Zhuoran Gao, Xuming Hu
Title: PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. Physics reasoning presents unique challenges, requiring grounding in physical conditions and the interpretation of multimodal information. Current physics benchmarks are limited, often focusing on text-only inputs or solely on problem-solving, thereby overlooking the critical intermediate steps of variable identification and process formulation. To address these limitations, we introduce PhysicsArena, the first multimodal physics reasoning benchmark designed to holistically evaluate MLLMs across three critical dimensions: variable identification, physical process formulation, and solution derivation. PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.
中文摘要:PhysicsArena是首个多模态物理推理基准,旨在全面评估多模态大语言模型在变量识别、过程构建和求解推导三个关键维度的能力。
English Summary: PhysicsArena is introduced as the first multimodal benchmark to comprehensively evaluate Multimodal Large Language Models' physics reasoning abilities across variable identification, process formulation, and solution derivation.

Authors:Haoming Huang, Yibo Yan, Jiahao Huo, Xin Zou, Xinfeng Li, Kun Wang, Xuming Hu
Title: Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis
Abstract:
Large Language Models (LLMs), despite their remarkable capabilities, are hampered by hallucinations. A particularly challenging variant, knowledge overshadowing, occurs when one piece of activated knowledge inadvertently masks another relevant piece, leading to erroneous outputs even with high-quality training data. Current understanding of overshadowing is largely confined to inference-time observations, lacking deep insights into its origins and internal mechanisms during model training. Therefore, we introduce PhantomCircuit, a novel framework designed to comprehensively analyze and detect knowledge overshadowing. By innovatively employing knowledge circuit analysis, PhantomCircuit dissects the function of key components in the circuit and how the attention pattern dynamics contribute to the overshadowing phenomenon and its evolution throughout the training process. Extensive experiments demonstrate PhantomCircuit's effectiveness in identifying such instances, offering novel insights into this elusive hallucination and providing the research community with a new methodological lens for its potential mitigation.
Chinese: 本文提出了PhantomCircuit框架,通过分析训练过程中的注意力模式和知识回路动态,来全面检测和解析大语言模型中知识遮蔽这一难以捉摸的幻觉现象。
English: The PhantomCircuit framework is introduced to analyze and detect knowledge overshadowing in LLMs, a challenging hallucination where one piece of knowledge masks another, by examining attention patterns and circuit dynamics during training.

Authors:Jiamin Su, Yibo Yan, Zhuoran Gao, Han Zhang, Xiang Liu, Xuming Hu
Title: CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring
Abstract:
Automated Essay Scoring (AES) is crucial for modern education, particularly with the increasing prevalence of multimodal assessments. However, traditional AES methods struggle with evaluation generalizability and multimodal perception, while even recent Multimodal Large Language Model (MLLM)-based approaches can produce hallucinated justifications and scores misaligned with human judgment. To address the limitations, we introduce CAFES, the first collaborative multi-agent framework specifically designed for AES. It orchestrates three specialized agents: an Initial Scorer for rapid, trait-specific evaluations; a Feedback Pool Manager to aggregate detailed, evidence-grounded strengths; and a Reflective Scorer that iteratively refines scores based on this feedback to enhance human alignment. Extensive experiments, using state-of-the-art MLLMs, achieve an average relative improvement of 21% in Quadratic Weighted Kappa (QWK) against ground truth, especially for grammatical and lexical diversity. Our proposed CAFES framework paves the way for an intelligent multimodal AES system. The code will be available upon acceptance.
Chinese: CAFES框架通过协作多智能体系统提升自动作文评分的评估准确性和人类一致性,实现了21%的平均QWK提升。
English: The CAFES framework introduces a collaborative multi-agent system that enhances automated essay scoring by improving evaluation accuracy and human alignment, achieving a 21% average QWK improvement.
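The headline number above is reported in Quadratic Weighted Kappa (QWK), an agreement score between two sets of ordinal ratings. A minimal standard-library sketch of the usual QWK definition follows. The function name, argument names, and the example scores are illustrative assumptions, and the paper's exact evaluation script may differ.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two lists of integer ratings in [min_rating, max_rating]."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two rating categories")
    total = len(rater_a)

    # Observed confusion matrix: rows index rater_a, columns index rater_b.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement cost
            expected = hist_a[i] * hist_b[j] / total  # chance-level count
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        raise ValueError("QWK is undefined when all ratings share one category")
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    # Made-up essay scores on a 1-5 scale.
    human = [1, 2, 3, 4, 5, 3]
    model = [1, 2, 4, 4, 5, 2]
    print(quadratic_weighted_kappa(human, model, 1, 5))
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement. The quadratic weights penalize large score gaps much more than off-by-one differences.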

Authors:Jiankun Zhang, Shenglai Zeng, Jie Ren, Tianqi Zheng, Hui Liu, Xianfeng Tang, Hui Liu, Yi Chang
Title: Beyond Text: Unveiling Privacy Vulnerabilities in Multi-modal Retrieval-Augmented Generation
Abstract:
Multimodal Retrieval-Augmented Generation (MRAG) systems enhance LMMs by integrating external multimodal databases, but introduce unexplored privacy vulnerabilities. While text-based RAG privacy risks have been studied, multimodal data presents unique challenges. We provide the first systematic analysis of MRAG privacy vulnerabilities across vision-language and speech-language modalities. Using a novel compositional structured prompt attack in a black-box setting, we demonstrate how attackers can extract private information by manipulating queries. Our experiments reveal that LMMs can both directly generate outputs resembling retrieved content and produce descriptions that indirectly expose sensitive information, highlighting the urgent need for robust privacy-preserving MRAG techniques.
中文摘要:本研究首次系统揭示了多模态检索增强生成系统的隐私风险,通过组合提示攻击证明攻击者能够提取私有信息,凸显了开发隐私保护技术的迫切需求。
English Summary: This study exposes novel privacy risks in Multimodal Retrieval-Augmented Generation systems where attackers can extract sensitive information through crafted prompts, demonstrating how language models directly replicate or indirectly reveal private retrieved content.

Authors:Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, Xiang Wang
Title: Search and Refine During Think: Facilitating Knowledge Refinement for Improved Retrieval-Augmented Reasoning
Abstract:
Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.
Chinese: AutoRefine是一种强化学习框架,通过在搜索之间加入知识提炼步骤来增强大语言模型,显著提升了复杂场景下的推理准确性。
English: AutoRefine is a reinforcement learning framework that enhances large language models by integrating knowledge refinement steps between searches, significantly improving reasoning accuracy in complex scenarios.
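
The reward scheme mentioned in the abstract (retrieval-specific rewards alongside answer correctness, optimized with group relative policy optimization) can be sketched roughly as follows. The group-normalized advantage is the usual GRPO formulation; the way the two rewards are combined (`retrieval_weight`, a simple weighted sum) is a hypothetical choice for illustration and is not taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(answer_correct, retrieval_score, retrieval_weight=0.5):
    """Hypothetical reward: answer correctness plus a weighted retrieval term."""
    return float(answer_correct) + retrieval_weight * retrieval_score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question: (answer correct?, retrieval score).
rollouts = [(True, 1.0), (False, 0.5), (True, 0.0), (False, 0.0)]
rewards = [combined_reward(ok, score) for ok, score in rollouts]
print(group_relative_advantages(rewards))
```

Rollouts that beat the group mean receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.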

Authors:Ada Chen, Yongjiang Wu, Junyuan Zhang, Jingyu Xiao, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang
Title: A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
Abstract:
Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.
中文: 人工智能驱动的计算机使用代理已能自主操作界面,但带来了新的安全风险;本文通过系统化威胁分类、防御策略和评估指标,为未来安全设计提供了结构化指导。
English: AI-driven Computer-Using Agents (CUAs) have evolved to autonomously operate interfaces but introduce new safety and security risks, which this paper systematically analyzes by categorizing threats, defensive strategies, and evaluation metrics to guide future secure development.

Authors:Jiafu Hao, Chentao Yue, Hao Chang, Branka Vucetic, Yonghui Li
Title: Short Wins Long: Short Codes with Language Model Semantic Correction Outperform Long Codes
Abstract:
This paper presents a novel semantic-enhanced decoding scheme for transmitting natural language sentences with multiple short block codes over noisy wireless channels. After ASCII source coding, the natural language sentence message is divided into segments, each of which is encoded independently with a short block channel code before transmission. At the receiver, each short block of codewords is decoded in parallel, followed by a semantic error correction (SEC) model to reconstruct corrupted segments semantically. We design and train the SEC model based on Bidirectional and Auto-Regressive Transformers (BART). Simulations demonstrate that the proposed scheme can significantly outperform encoding the sentence with one conventional long LDPC code, in terms of block error rate (BLER), semantic metrics, and decoding latency. Finally, we propose a semantic hybrid automatic repeat request (HARQ) scheme to further enhance the error performance, which selectively requests retransmission depending on semantic uncertainty.
中文摘要: 本文提出一种新颖的语义增强解码方案,通过短分组码和基于BART的语义纠错模型,在噪声信道中传输自然语言句子,在误码率、语义指标和解码延迟方面均优于传统长LDPC编码方案。
English Summary: This paper introduces a semantic-enhanced decoding method using short block codes and a BART-based semantic error correction model to improve transmission of natural language over noisy channels, outperforming traditional long LDPC codes in error rate, semantic accuracy, and latency.
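
The segmentation step described in the abstract (ASCII source coding, then splitting the message into short blocks) can be sketched as below. The 8-bit-per-character encoding, the zero padding, and the block length of 32 are assumptions for illustration; the channel code and the BART-based SEC model are not modeled here.

```python
def ascii_to_bits(sentence):
    """ASCII source coding: 8 bits per character (assumed width)."""
    return [int(b) for byte in sentence.encode("ascii") for b in format(byte, "08b")]


def split_into_blocks(bits, block_len):
    """Zero-pad to a multiple of block_len and cut into short message blocks."""
    padded = bits + [0] * (-len(bits) % block_len)
    return [padded[i:i + block_len] for i in range(0, len(padded), block_len)]


def bits_to_ascii(bits, n_chars):
    """Inverse of ascii_to_bits for the first n_chars characters."""
    values = [int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, 8 * n_chars, 8)]
    return bytes(values).decode("ascii")


message = "short wins long"
blocks = split_into_blocks(ascii_to_bits(message), block_len=32)
# Each block would be channel-encoded and decoded independently (not shown).
received = [bit for block in blocks for bit in block]
print(len(blocks), bits_to_ascii(received, len(message)))
```

Because the blocks are independent, a decoding failure corrupts only one segment of the sentence, which is the kind of localized damage a semantic correction model can then try to repair.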

Authors:Xinyi Mou, Chen Qian, Wei Liu, Xuanjing Huang, Zhongyu Wei
Title: EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation
Abstract:
Large language models (LLMs) have demonstrated an impressive ability to role-play humans and replicate complex social dynamics. While large-scale social simulations are gaining increasing attention, they still face significant challenges, particularly regarding high time and computation costs. Existing solutions, such as distributed mechanisms or hybrid agent-based model (ABM) integrations, either fail to address inference costs or compromise accuracy and generalizability. To this end, we propose EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation. EcoLANG operates in two stages: (1) language evolution, where we filter synonymous words and optimize sentence-level rules through natural selection, and (2) language utilization, where agents in social simulations communicate using the evolved language. Experimental results demonstrate that EcoLANG reduces token consumption by over 20%, enhancing efficiency without sacrificing simulation accuracy.
中文摘要:EcoLANG通过语言演化和应用两阶段框架,在大规模社会模拟中降低20%以上的令牌消耗同时保持准确性,有效解决了基于大语言模型的模拟效率难题。
English Summary: EcoLANG introduces a two-stage language evolution and utilization framework that reduces token consumption by over 20% in social simulations while maintaining accuracy, addressing efficiency challenges in large-scale LLM-based simulations.

Authors:Chengfeng Wang, Wei Zhai, Yuhang Yang, Yang Cao, Zhengjun Zha
Title: GRACE: Estimating Geometry-level 3D Human-Scene Contact from 2D Images
Abstract:
Estimating the geometry level of human-scene contact aims to ground specific contact surface points at 3D human geometries, which provides a spatial prior and bridges the interaction between human and scene, supporting applications such as human behavior analysis, embodied AI, and AR/VR. To complete the task, existing approaches predominantly rely on parametric human models (e.g., SMPL), which establish correspondences between images and contact regions through fixed SMPL vertex sequences. This actually completes the mapping from image features to an ordered sequence. However, this approach lacks consideration of geometry, limiting its generalizability in distinct human geometries. In this paper, we introduce GRACE (Geometry-level Reasoning for 3D Human-scene Contact Estimation), a new paradigm for 3D human contact estimation. GRACE incorporates a point cloud encoder-decoder architecture along with a hierarchical feature extraction and fusion module, enabling the effective integration of 3D human geometric structures with 2D interaction semantics derived from images. Guided by visual cues, GRACE establishes an implicit mapping from geometric features to the vertex space of the 3D human mesh, thereby achieving accurate modeling of contact regions. This design ensures high prediction accuracy and endows the framework with strong generalization capability across diverse human geometries. Extensive experiments on multiple benchmark datasets demonstrate that GRACE achieves state-of-the-art performance in contact estimation, with additional results further validating its robust generalization to unstructured human point clouds.
Chinese: GRACE提出了一种新的三维人-场景接触估计方法,通过点云编码-解码架构和分层特征融合模块,将三维人体几何结构与二维图像交互语义有效结合,实现了高精度的接触区域建模和强大的跨人体几何泛化能力。
English: GRACE introduces a novel approach to 3D human-scene contact estimation by integrating 3D geometric structures with 2D image semantics through a point cloud encoder-decoder and hierarchical feature fusion, achieving superior accuracy and generalization across diverse human geometries.

Authors:Lipeng Zhu, He Sun, Wenyan Ma, Zhenyu Xiao, Rui Zhang
Title: Multiuser Communications Aided by Cross-Linked Movable Antenna Array: Architecture and Optimization
Abstract:
Movable antenna (MA) has been regarded as a promising technology to enhance wireless communication performance by enabling flexible antenna movement. However, the hardware cost of conventional MA systems scales with the number of movable elements due to the need for independently controllable driving components. To reduce hardware cost, we propose in this paper a novel architecture named cross-linked MA (CL-MA) array, which enables the collective movement of multiple antennas in both horizontal and vertical directions. To evaluate the performance benefits of the CL-MA array, we consider an uplink multiuser communication scenario. Specifically, we aim to minimize the total transmit power while satisfying a given minimum rate requirement for each user by jointly optimizing the horizontal and vertical antenna position vectors (APVs), the receive combining at the base station (BS), and the transmit power of users. A global lower bound on the total transmit power is derived, with closed-form solutions for the APVs obtained under the condition of a single channel path for each user. For the more general case of multiple channel paths, we develop a low-complexity algorithm based on discrete antenna position optimization. Additionally, to further reduce antenna movement overhead, a statistical channel-based antenna position optimization approach is proposed, allowing the APVs to remain unchanged over a long time period. Simulation results demonstrate that the proposed CL-MA schemes significantly outperform conventional fixed-position antenna (FPA) systems and closely approach the theoretical lower bound on the total transmit power. Compared to the instantaneous channel-based CL-MA optimization, the statistical channel-based approach incurs a slight performance loss but achieves significantly lower movement overhead, making it an appealing solution for practical wireless systems.
中文摘要:本文提出了一种新型交叉链接可移动天线阵列,通过多天线协同移动降低硬件成本,在上行多用户通信中实现了接近理论极限的发射功率效率,并基于瞬时与统计信道优化在性能与移动开销间取得平衡。
English Summary: The paper introduces a cross-linked movable antenna (CL-MA) array that reduces hardware costs by enabling collective antenna movement, achieving near-optimal transmit power efficiency in multiuser uplink communications while balancing performance and movement overhead through both instantaneous and statistical channel optimization approaches.

Authors:Sparsh Bansal, Mingyang Wu, Xin Wang, Shu Hu
Title: Robust Fairness Vision-Language Learning for Medical Image Analysis
Abstract:
The advent of Vision-Language Models (VLMs) in medical image analysis has the potential to help process multimodal inputs and increase performance over traditional inference methods. However, when considering the domain in which these models will be implemented, fairness and robustness are important to ensure the model stays true for any patient. In this paper, we introduce a framework for ensuring the robustness and fairness of VLMs. This framework modifies the loss function at training time by identifying and adjusting faulty image-text pairs through a Dynamic Bad Pair Mining algorithm, and by using the Sinkhorn distance to ensure the loss distributions of protected groups do not deviate from the total loss. Experimental testing of our framework shows up to an 8.6% improvement in equity-scaled AUC.
中文: 本文提出了一种针对医学影像视觉语言模型的公平性与鲁棒性框架,通过动态不良配对挖掘和Sinkhorn距离优化,将公平性标度AUC指标最高提升了8.6%。
English: This paper introduces a fairness and robustness framework for Vision-Language Models in medical imaging that improves equity-scaled AUC by up to 8.6% through dynamic bad pair mining and Sinkhorn distance optimization.
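
As a rough illustration of the Sinkhorn distance mentioned in the abstract, the sketch below compares a protected group's per-sample losses against the overall losses. It is a generic entropic-regularized optimal transport computation, not the paper's training loss; the squared-difference cost, the regularization strength `reg`, the uniform sample weights, and the example loss values are all assumptions.

```python
import math


def sinkhorn_distance(xs, ys, reg=0.5, n_iters=200):
    """Entropic-regularized OT cost between two 1-D samples with uniform weights."""
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n  # uniform mass on the first sample
    b = [1.0 / m] * m  # uniform mass on the second sample
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport cost under the plan P[i][j] = u[i] * kernel[i][j] * v[j].
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m))


all_losses = [0.1, 0.5, 0.9]    # hypothetical losses over all samples
group_losses = [1.1, 1.5, 1.9]  # hypothetical losses for one protected group
print(sinkhorn_distance(all_losses, all_losses))
print(sinkhorn_distance(all_losses, group_losses))
```

A group whose loss distribution drifts away from the overall one yields a larger distance, which is the quantity a fairness penalty of this kind would push down. Very small `reg` values can underflow in this plain implementation; a log-domain variant would be the usual remedy.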

Authors:Haoran Ou, Gelei Deng, Xingshuo Han, Jie Zhang, Xinlei He, Han Qiu, Shangwei Guo, Tianwei Zhang
Title: Holmes: Automated Fact Check with Large Language Models
Abstract:
The rise of Internet connectivity has accelerated the spread of disinformation, threatening societal trust, decision-making, and national security. Disinformation has evolved from simple text to complex multimodal forms combining images and text, challenging existing detection methods. Traditional deep learning models struggle to capture the complexity of multimodal disinformation. Inspired by advances in AI, this study explores using Large Language Models (LLMs) for automated disinformation detection. The empirical study shows that (1) LLMs alone cannot reliably assess the truthfulness of claims; (2) providing relevant evidence significantly improves their performance; (3) however, LLMs cannot autonomously search for accurate evidence. To address this, we propose Holmes, an end-to-end framework featuring a novel evidence retrieval method that assists LLMs in collecting high-quality evidence. Our approach uses (1) LLM-powered summarization to extract key information from open sources and (2) a new algorithm and metrics to evaluate evidence quality. Holmes enables LLMs to verify claims and generate justifications effectively. Experiments show Holmes achieves 88.3% accuracy on two open-source datasets and 90.2% in real-time verification tasks. Notably, our improved evidence retrieval boosts fact-checking accuracy by 30.8% over existing methods.
中文摘要:本研究提出Holmes框架,通过改进证据检索方法增强大语言模型检测多模态虚假信息的能力,在验证任务中准确率超过88%。
English Summary: The study introduces Holmes, an end-to-end framework that enhances large language models' ability to detect multimodal disinformation by improving evidence retrieval, achieving over 88% accuracy in verification tasks.

Authors:Ruikun Li, Jingwen Cheng, Huandong Wang, Qingmin Liao, Yong Li
Title: Predicting the Dynamics of Complex System via Multiscale Diffusion Autoencoder
Abstract:
Predicting the dynamics of complex systems is crucial for various scientific and engineering applications. The accuracy of predictions depends on the model's ability to capture the intrinsic dynamics. While existing methods capture key dynamics by encoding a low-dimensional latent space, they overlook the inherent multiscale structure of complex systems, making it difficult to accurately predict complex spatiotemporal evolution. Therefore, we propose a Multiscale Diffusion Prediction Network (MDPNet) that leverages the multiscale structure of complex systems to discover the latent space of intrinsic dynamics. First, we encode multiscale features through a multiscale diffusion autoencoder to guide the diffusion model for reliable reconstruction. Then, we introduce an attention-based graph neural ordinary differential equation to model the co-evolution across different scales. Extensive evaluations on representative systems demonstrate that the proposed method achieves an average prediction error reduction of 53.23% compared to baselines, while also exhibiting superior robustness and generalization.
中文: 提出的多尺度扩散预测网络通过利用多尺度结构和基于注意力的图神经微分方程,显著提升了复杂系统预测的精度与鲁棒性,相比现有方法平均误差降低了53.23%。
English: The proposed Multiscale Diffusion Prediction Network (MDPNet) utilizes multiscale structures and an attention-based graph neural ODE to significantly enhance prediction accuracy and robustness in complex systems, achieving a 53.23% average error reduction over existing methods.

Authors:Qian Zeng, Jie Song, Yuanyu Wan, Huiqiong Wang, Mingli Song
Title: Quantizing Diffusion Models from a Sampling-Aware Perspective
Abstract:
Diffusion models have recently emerged as the dominant approach in visual generation tasks. However, the lengthy denoising chains and the computationally intensive noise estimation networks hinder their applicability in low-latency and resource-limited environments. Previous research has endeavored to address these limitations in a decoupled manner, utilizing either advanced samplers or efficient model quantization techniques. In this study, we uncover that quantization-induced noise disrupts directional estimation at each sampling step, further distorting the precise directional estimations of higher-order samplers when solving the sampling equations through discretized numerical methods, thereby altering the optimal sampling trajectory. To attain dual acceleration with high fidelity, we propose a sampling-aware quantization strategy, wherein a Mixed-Order Trajectory Alignment technique is devised to impose a more stringent constraint on the error bounds at each sampling step, facilitating a more linear probability flow. Extensive experiments on sparse-step fast sampling across multiple datasets demonstrate that our approach preserves the rapid convergence characteristics of high-speed samplers while maintaining superior generation quality. Code will be made publicly available soon.
中文: 本研究提出了一种采样感知量化策略,通过混合阶轨迹对齐技术解决扩散模型中量化噪声对方向估计的干扰,在资源受限环境下实现了双重加速,同时保持了高保真度和优异的生成质量。
English: This study introduces a sampling-aware quantization strategy with a Mixed-Order Trajectory Alignment technique to address the disruption of directional estimation caused by quantization noise in diffusion models, achieving dual acceleration while maintaining high fidelity and superior generation quality in resource-limited settings.

Authors:Jiahao Wang, Mingyue Cheng, Qi Liu
Title: Can Slow-thinking LLMs Reason Over Time? Empirical Studies in Time Series Forecasting
Abstract:
Time series forecasting (TSF) is a fundamental and widely studied task, spanning methods from classical statistical approaches to modern deep learning and multimodal language modeling. Despite their effectiveness, these methods often follow a fast-thinking paradigm emphasizing pattern extraction and direct value mapping, while overlooking explicit reasoning over temporal dynamics and contextual dependencies. Meanwhile, emerging slow-thinking LLMs (e.g., ChatGPT-o1, DeepSeek-R1) have demonstrated impressive multi-step reasoning capabilities across diverse domains, suggesting a new opportunity for reframing TSF as a structured reasoning task. This motivates a key question: can slow-thinking LLMs effectively reason over temporal patterns to support time series forecasting, even in a zero-shot manner? To investigate this, we propose TimeReasoner, an extensive empirical study that formulates TSF as a conditional reasoning task. We design a series of prompting strategies to elicit inference-time reasoning from pretrained slow-thinking LLMs and evaluate their performance across diverse TSF benchmarks. Our findings reveal that slow-thinking LLMs exhibit non-trivial zero-shot forecasting capabilities, especially in capturing high-level trends and contextual shifts. While preliminary, our study surfaces important insights into the reasoning behaviors of LLMs in temporal domains, highlighting both their potential and limitations. We hope this work catalyzes further research into reasoning-based forecasting paradigms and paves the way toward more interpretable and generalizable TSF frameworks.
中文摘要:本研究探讨了慢思考大语言模型能否通过将时间序列预测重构为结构化推理任务来实现零样本预测,发现它们在捕捉趋势和上下文变化方面展现出潜力,同时揭示了其优势与局限。
English Summary: This study explores whether slow-thinking large language models can perform zero-shot time series forecasting by reframing it as a structured reasoning task, finding they show promising capabilities in capturing trends and contextual shifts while revealing both potential and limitations.

Authors:Naibin Gu, Yilong Chen, Zhenyu Zhang, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Title: Advantageous Parameter Expansion Training Makes Better Large Language Models
Abstract:
Although scaling up the number of trainable parameters in both pre-training and fine-tuning can effectively improve the performance of large language models, it also leads to increased computational overhead. When delving into the parameter difference, we find that a subset of parameters, termed advantageous parameters, plays a crucial role in determining model performance. Further analysis reveals that stronger models tend to possess more such parameters. In this paper, we propose Advantageous Parameter EXpansion Training (APEX), a method that progressively expands advantageous parameters into the space of disadvantageous ones, thereby increasing their proportion and enhancing training effectiveness. Further theoretical analysis from the perspective of matrix effective rank explains the performance gains of APEX. Extensive experiments on both instruction tuning and continued pre-training demonstrate that, in instruction tuning, APEX outperforms full-parameter tuning while using only 52% of the trainable parameters. In continued pre-training, APEX achieves the same perplexity level as conventional training with just 33% of the training data, and yields significant improvements on downstream tasks.
中文: APEX方法通过扩展优势参数提升模型性能,在指令微调中仅用52%参数即超越全参数训练,在预训练中仅需33%数据即可达到同等效果并显著提升下游任务表现。
English: APEX is a training method that expands advantageous parameters to improve model performance, achieving better results with fewer parameters and less data in both instruction tuning and pre-training.

Authors:Farshad Rostami Ghadi, Kai-Kit Wong, F. Javier Lopez-Martinez, George C. Alexandropoulos, Chan-Byoung Chae
Title: Performance Analysis of Wireless Communication Systems Assisted by Fluid Reconfigurable Intelligent Surfaces
Abstract:
This letter investigates the performance of emerging wireless communication systems assisted by a fluid reconfigurable intelligent surface (FRIS). Unlike conventional reconfigurable intelligent surfaces (RISs), an FRIS consists of fluid-inspired metamaterials arranged in a densely packed matrix of sub-elements over a surface. It dynamically activates specific elements for signal reflection and modulation based on real-time channel conditions. Considering a downlink scenario where a base station communicates with a user terminal via a FRIS, we first characterize the statistical behavior of the equivalent end-to-end channel by deriving closed-form approximations for its cumulative distribution and probability density functions. Using these expressions, an analytical approximation for the outage probability and a tight upper bound on the ergodic capacity, including their asymptotic behaviors for high signal-to-noise ratio values, are derived. Our findings reveal key performance trends demonstrating that FRIS can substantially improve link reliability and spectral efficiency compared to conventional RISs, owing to its capability to dynamically select optimal elements from a dense preconfigured grid.
Chinese: 研究表明,流体可重构智能表面(FRIS)通过从密集网格中动态选择最优单元,显著提高了无线通信的可靠性和频谱效率,其推导的中断概率和遍历容量分析模型证实了其性能优于传统可重构智能表面。
English: This study demonstrates that fluid reconfigurable intelligent surfaces (FRIS) significantly enhance wireless communication reliability and spectral efficiency by dynamically selecting optimal elements from a dense grid, outperforming conventional RISs through derived analytical models for outage probability and ergodic capacity.

Authors:Yan Wang, Yang Ren, Lingfei Qian, Xueqing Peng, Keyi Wang, Yi Han, Dongji Feng, Xiao-Yang Liu, Jimin Huang, Qianqian Xie
Title: FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information
Abstract:
We introduce FinTagging, the first full-scope, table-aware XBRL benchmark designed to evaluate the structured information extraction and semantic alignment capabilities of large language models (LLMs) in the context of XBRL-based financial reporting. Unlike prior benchmarks that oversimplify XBRL tagging as flat multi-class classification and focus solely on narrative text, FinTagging decomposes the XBRL tagging problem into two subtasks: FinNI for financial entity extraction and FinCL for taxonomy-driven concept alignment. It requires models to jointly extract facts and align them with the full 10k+ US-GAAP taxonomy across both unstructured text and structured tables, enabling realistic, fine-grained evaluation. We assess a diverse set of LLMs under zero-shot settings, systematically analyzing their performance on both subtasks and overall tagging accuracy. Our results reveal that, while LLMs demonstrate strong generalization in information extraction, they struggle with fine-grained concept alignment, particularly in disambiguating closely related taxonomy entries. These findings highlight the limitations of existing LLMs in fully automating XBRL tagging and underscore the need for improved semantic reasoning and schema-aware modeling to meet the demands of accurate financial disclosure. Code is available at our GitHub repository and data is at our Hugging Face repository.
中文: FinTagging是首个全面、表格感知的XBRL基准测试,通过分解为金融实体提取和概念对齐两个子任务,揭示了大型语言模型在信息提取方面的优势,但在细粒度语义对齐方面仍存在不足。
English: FinTagging is a comprehensive XBRL benchmark that evaluates LLMs' ability to extract financial data and align it with taxonomy concepts, revealing their strengths in information extraction but limitations in fine-grained semantic alignment.

Authors:Dingyu Yao, Bowen Shen, Zheng Lin, Wei Liu, Jian Luan, Bin Wang, Weiping Wang
Title: TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization
Abstract:
The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations that potentially incur substantial quantization error. This observation leads to a key insight that loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading. TailorKV develops an inference framework along with a hardware-friendly implementation that leverages these complementary characteristics. Extensive long-context evaluations exhibit that TailorKV achieves nearly lossless performance under aggressive compression settings, outperforming the state-of-the-art. Particularly, the Llama-3.1-8B with 128k context can be served within a single RTX 3090 GPU, reaching 82 ms per token during decoding.
Chinese: TailorKV是一种混合压缩方法,通过结合量化和卸载技术解决大语言模型中KV缓存的内存开销问题,在激进压缩设置下实现近乎无损的性能和高效的GPU利用率。
English: TailorKV is a hybrid compression method that combines quantization and offloading to address the memory overhead of KV cache in LLMs, achieving nearly lossless performance and efficient GPU utilization under aggressive compression settings.

Authors:Jamie Hayes, Ilia Shumailov, Christopher A. Choquette-Choo, Matthew Jagielski, George Kaissis, Katherine Lee, Milad Nasr, Sahra Ghalebikesabi, Niloofar Mireshghallah, Meenatchi Sundaram Mutu Selva Annamalai, Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye, Franziska Boenisch, Adam Dziedzic, A. Feder Cooper
Title: Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models
Abstract:
State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training reference models (e.g., fine-tuning attacks), or on stronger attacks applied to small-scale models and datasets. However, weaker attacks have been shown to be brittle - achieving close-to-arbitrary success - and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges have prompted an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA - one of the strongest MIAs - to GPT-2 architectures ranging from 10M to 1B parameters, training reference models on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in three key ways: (1) strong MIAs can succeed on pre-trained LLMs; (2) their effectiveness, however, remains limited (e.g., AUC<0.7) in practical settings; and, (3) the relationship between MIA success and related privacy metrics is not as straightforward as prior work has suggested.
中文摘要:将强大的LiRA成员推理攻击扩展到GPT-2等大型语言模型表明,此类攻击虽能在预训练模型上实现有限成功(如AUC<0.7),但其实际效果与隐私指标的关系比既往研究更为复杂。
English Summary: Scaling the strong LiRA membership inference attack to large language models like GPT-2 demonstrates that such attacks can succeed on pre-trained LLMs, though their effectiveness remains limited in practical scenarios with complex relationships to privacy metrics.
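
The core test behind LiRA can be sketched in a few lines. This follows the general likelihood-ratio idea (fit one Gaussian to a statistic from reference models trained with the example and another to models trained without it, then compare); the statistic values, the variance floor, and the function names are assumptions for illustration, not the paper's experimental setup.

```python
import math
from statistics import mean, stdev


def gaussian_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def lira_score(target_stat, in_stats, out_stats, min_sigma=1e-6):
    """Log-likelihood ratio: positive values favor 'member of the training set'.

    in_stats / out_stats hold the same statistic (for example a per-example
    confidence or negative loss) measured on reference models trained with
    and without the example, respectively.
    """
    mu_in, sigma_in = mean(in_stats), max(stdev(in_stats), min_sigma)
    mu_out, sigma_out = mean(out_stats), max(stdev(out_stats), min_sigma)
    return gaussian_logpdf(target_stat, mu_in, sigma_in) - gaussian_logpdf(
        target_stat, mu_out, sigma_out
    )


# Hypothetical statistics from four IN and four OUT reference models.
in_stats = [2.0, 2.2, 1.8, 2.1]
out_stats = [0.0, 0.2, -0.1, 0.1]
print(lira_score(2.0, in_stats, out_stats))   # looks like a member
print(lira_score(0.05, in_stats, out_stats))  # looks like a non-member
```

The need for many reference models per example is what makes this attack expensive at LLM scale, as the abstract notes.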

Authors:Xixian Yong, Xiao Zhou, Yingying Zhang, Jinlin Li, Yefeng Zheng, Xian Wu
Title: Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens
Abstract:
The recent rise of Large Reasoning Models (LRMs) has significantly improved multi-step reasoning performance, but often at the cost of generating excessively long reasoning chains. This paper revisits the efficiency of such reasoning processes through an information-theoretic lens, revealing a fundamental trade-off between reasoning length and semantic efficiency. We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution, respectively. Empirical analyses show that longer reasoning chains tend to exhibit higher information bias and diminishing information gain, especially for incorrect answers. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high, improving efficiency while maintaining competitive accuracy. Compared to the Vanilla Think approach (default mode), our strategy yields a 1.10% improvement in average accuracy and a 50.80% reduction in token usage on QwQ-32B across six benchmark tasks spanning diverse reasoning types and difficulty levels, demonstrating superior efficiency and reasoning performance. These results underscore the promise of entropy-based methods for enhancing both accuracy and cost-efficiency in large language model deployment.
中文摘要:本文提出一种基于信息熵的自适应思考策略,能在置信度足够高时动态终止推理过程,相比标准方法在提升准确率的同时大幅降低了计算开销。
English Summary: This paper introduces an entropy-based Adaptive Think strategy that dynamically stops reasoning when confidence is high, achieving improved accuracy with significantly reduced computational costs compared to standard methods.
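
The halting idea can be sketched as a simple entropy threshold. This is a minimal illustration of stopping once the answer distribution is confident enough; the threshold value, the per-step answer distributions, and the function names are hypothetical and do not reproduce the paper's exact criterion.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete answer distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def adaptive_think(step_answer_probs, threshold=0.5):
    """Return how many reasoning steps are consumed before halting.

    step_answer_probs[k] is the model's distribution over candidate answers
    after reasoning step k + 1 (hypothetical input for this sketch).
    """
    for step, probs in enumerate(step_answer_probs, start=1):
        if shannon_entropy(probs) < threshold:
            return step  # confident enough: stop reasoning here
    return len(step_answer_probs)  # never confident: use the full chain


steps = [
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain over four options
    [0.50, 0.30, 0.10, 0.10],
    [0.95, 0.03, 0.01, 0.01],  # confident
    [0.97, 0.01, 0.01, 0.01],
]
print(adaptive_think(steps))
```

Steps after the halting point are never generated, which is where the token savings reported in the abstract would come from.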

Authors:Shihao Li, Chenglong Li, Aihua Zheng, Jin Tang, Bin Luo
Title: ICPL-ReID: Identity-Conditional Prompt Learning for Multi-Spectral Object Re-Identification
Abstract:
Multi-spectral object re-identification (ReID) brings a new perception perspective for smart city and intelligent transportation applications, effectively addressing challenges from complex illumination and adverse weather. However, complex modal differences between heterogeneous spectra make it difficult to efficiently exploit the complementary and discrepant information across spectra. Most existing methods fuse spectral data through intricate modal interaction modules, lacking fine-grained semantic understanding of spectral information (e.g., text descriptions, part masks, and object keypoints). To solve this challenge, we propose a novel Identity-Conditional text Prompt Learning framework (ICPL), which exploits the powerful cross-modal alignment capability of CLIP to unify different spectral visual features from text semantics. Specifically, we first propose online prompt learning, which uses a learnable text prompt as the identity-level semantic center to bridge the identity semantics of different spectra in an online manner. Then, in the absence of concrete text descriptions, we propose a multi-spectral identity-condition module that uses the identity prototype as a spectral identity condition to constrain prompt learning. Meanwhile, we construct an alignment loop that mutually optimizes the learnable text prompt and the spectral visual encoder to prevent online prompt learning from disrupting the pre-trained text-image alignment distribution. In addition, to adapt to small-scale multi-spectral data and mitigate style differences between spectra, we propose a multi-spectral adapter that employs a low-rank adaptation method to learn spectra-specific features. Comprehensive experiments on 5 benchmarks, including RGBNT201, Market-MM, MSVR310, RGBN300, and RGBNT100, demonstrate that the proposed method outperforms the state-of-the-art methods.
中文摘要:提出的身份条件提示学习框架利用CLIP的跨模态对齐能力,通过文本语义统一多光谱视觉特征,有效克服光谱差异,在多个基准测试中超越现有最优方法。
English Summary: The proposed Identity-Conditional Prompt Learning (ICPL) framework leverages CLIP's cross-modal alignment to unify multi-spectral visual features through text semantics, overcoming spectral differences and outperforming state-of-the-art methods on multiple benchmarks.

Authors:Zhengyi Zhao, Shubo Zhang, Yuxi Zhang, Yanxi Zhao, Yifan Zhang, Zezhong Wang, Huimin Wang, Yutian Zhao, Bin Liang, Yefeng Zheng, Binyang Li, Kam-Fai Wong, Xian Wu
Title: MemeReaCon: Probing Contextual Meme Understanding in Large Vision-Language Models
Abstract:
Memes have emerged as a popular form of multimodal online communication, where their interpretation heavily depends on the specific context in which they appear. Current approaches predominantly focus on isolated meme analysis, either for harmful content detection or standalone interpretation, overlooking a fundamental challenge: the same meme can express different intents depending on its conversational context. This oversight creates an evaluation gap: although humans intuitively recognize how context shapes meme interpretation, Large Vision Language Models (LVLMs) can hardly understand context-dependent meme intent. To address this critical limitation, we introduce MemeReaCon, a novel benchmark specifically designed to evaluate how LVLMs understand memes in their original context. We collected memes from five different Reddit communities, keeping each meme's image, the post text, and user comments together. We carefully labeled how the text and meme work together, what the poster intended, how the meme is structured, and how the community responded. Our tests with leading LVLMs show a clear weakness: models either fail to interpret critical information in the contexts, or overly focus on visual details while overlooking communicative purpose. MemeReaCon thus serves both as a diagnostic tool exposing current limitations and as a challenging benchmark to drive the development of more sophisticated, context-aware LVLMs.
中文: MemeReaCon是一个新颖的基准测试,旨在评估大型视觉语言模型在原始对话语境中理解表情包的能力,揭示了它们在把握语境相关意图方面的不足,并作为推动开发更先进模型的工具。
English: MemeReaCon is a new benchmark designed to assess how well Large Vision Language Models understand memes in their original conversational context, revealing their limitations in grasping context-dependent intent and serving as a tool to advance more sophisticated model development.

Authors:Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Huimin Wang, Yutian Zhao, Bin Liang, Yefeng Zheng, Binyang Li, Kam-Fai Wong, Xian Wu
Title: T$^2$: An Adaptive Test-Time Scaling Strategy for Contextual Question Answering
Abstract:
Recent advances in Large Language Models (LLMs) have demonstrated remarkable performance in Contextual Question Answering (CQA). However, prior approaches typically employ elaborate reasoning strategies regardless of question complexity, leading to low adaptability. Recent efficient test-time scaling methods introduce budget constraints or early-stop mechanisms to avoid overthinking on straightforward questions, but they add human bias to the reasoning process and fail to leverage models' inherent reasoning capabilities. To address these limitations, we present T$^2$: Think-to-Think, a novel framework that dynamically adapts reasoning depth based on question complexity. T$^2$ leverages the insight that if an LLM can effectively solve similar questions using specific reasoning strategies, it can apply the same strategy to the original question. This insight enables the adoption of concise reasoning for straightforward questions while maintaining detailed analysis for complex problems. T$^2$ works through four key steps: decomposing questions into structural elements, generating similar examples with candidate reasoning strategies, evaluating these strategies against multiple criteria, and applying the most appropriate strategy to the original question. Experimental evaluation across seven diverse CQA benchmarks demonstrates that T$^2$ not only achieves higher accuracy than baseline methods but also reduces computational overhead by up to 25.2%.
Chinese: T$^2$框架根据问题复杂度动态调整推理深度,在多种基准测试中不仅实现了更高准确率,还将计算开销降低了最高25.2%。
English: The T$^2$ framework dynamically adjusts reasoning depth based on question complexity, achieving higher accuracy while reducing computational overhead by up to 25.2% across diverse benchmarks.

Authors:Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg
Title: SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
Abstract:
Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model that does not require speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.
中文摘要:本研究提出了一种新型双工语音对话架构,无需语音预训练即可实现用户与智能体的实时同步交互,在推理能力和适应性方面优于先前模型,并成为首个公开可用的可复现系统。
English Summary: This study introduces a novel duplex speech-to-speech architecture that enables real-time, simultaneous user-agent interactions without requiring speech pretraining, outperforming previous models in reasoning and adaptability while being the first openly available model for reproducibility.

Authors:Ke Hu, Krishna Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, Boris Ginsburg
Title: Word Level Timestamp Generation for Automatic Speech Recognition and Translation
Abstract:
We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new <|timestamp|> token, enabling the Canary model to predict start and end timestamps for each word. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages, with minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors around 200 milliseconds.
中文: 本文提出一种数据驱动方法,通过教师模型和特殊标记使Canary模型能直接预测词级时间戳,在四种语言中实现了80%-90%的准确率与召回率,且词错误率影响极小,并成功应用于语音翻译任务。
English: This paper presents a data-driven method that enables the Canary model to directly predict word-level timestamps using a teacher model and a special token, achieving high precision and recall with minimal impact on word error rate across multiple languages and extending to speech translation tasks.

Authors:Minh Ngoc Ta, Dong Cao Van, Duc-Anh Hoang, Minh Le-Anh, Truong Nguyen, My Anh Tran Nguyen, Yuxia Wang, Preslav Nakov, Sang Dinh
Title: FAID: Fine-grained AI-generated Text Detection using Multi-task Auxiliary and Multi-level Contrastive Learning
Abstract:
The growing collaboration between humans and AI models in generative tasks has introduced new challenges in distinguishing between human-written, AI-generated, and human-AI collaborative texts. In this work, we collect a multilingual, multi-domain, multi-generator dataset FAIDSet. We further introduce a fine-grained detection framework FAID to classify text into these three categories, while also identifying the underlying AI model family. Unlike existing binary classifiers, FAID is built to capture both authorship and model-specific characteristics. Our method combines multi-level contrastive learning with multi-task auxiliary classification to learn subtle stylistic cues. By modeling AI families as distinct stylistic entities, FAID offers improved interpretability. We incorporate an adaptation to address distributional shifts without retraining for unseen data. Experimental results demonstrate that FAID outperforms several baseline approaches, particularly enhancing the generalization accuracy on unseen domains and new AI models. It provides a potential solution for improving transparency and accountability in AI-assisted writing.
中文: 本研究提出FAID框架,通过多层级对比学习和多任务分类捕捉文本风格特征,能精细区分人类撰写、大模型生成及人机协作文本,并在跨领域和未知模型的泛化检测中优于现有方法。
English: This study introduces FAID, a fine-grained detection framework that distinguishes between human-written, LLM-generated, and collaborative texts by capturing stylistic cues through multi-level contrastive learning and multi-task classification, outperforming existing methods in generalization across domains and unseen LLMs.

Authors:Minh Ngoc Ta, Dong Cao Van, Duc-Anh Hoang, Minh Le-Anh, Truong Nguyen, My Anh Tran Nguyen, Yuxia Wang, Preslav Nakov, Sang Dinh
Title: FAID: Fine-Grained AI-Generated Text Detection Using Multi-Task Auxiliary and Multi-Level Contrastive Learning
Abstract:
The growing collaboration between humans and AI models in generative tasks has introduced new challenges in distinguishing between human-written, LLM-generated, and human--LLM collaborative texts. In this work, we collect a multilingual, multi-domain, multi-generator dataset FAIDSet. We further introduce a fine-grained detection framework FAID to classify text into these three categories, and also to identify the underlying LLM family of the generator. Unlike existing binary classifiers, FAID is built to capture both authorship and model-specific characteristics. Our method combines multi-level contrastive learning with multi-task auxiliary classification to learn subtle stylistic cues. By modeling LLM families as distinct stylistic entities, we incorporate an adaptation to address distributional shifts without retraining for unseen data. Our experimental results demonstrate that FAID outperforms several baselines, particularly enhancing the generalization accuracy on unseen domains and new LLMs, thus offering a potential solution for improving transparency and accountability in AI-assisted writing.
中文: 本研究提出FAID框架,通过多层级对比学习和多任务分类捕捉文本风格特征,能精细区分人类撰写、大模型生成及人机协作文本,并在跨领域和未知模型的泛化检测中优于现有方法。
English: This study introduces FAID, a fine-grained detection framework that distinguishes between human-written, LLM-generated, and collaborative texts by capturing stylistic cues through multi-level contrastive learning and multi-task classification, outperforming existing methods in generalization across domains and unseen LLMs.

Authors:Fu Luo, Xi Lin, Mengyuan Zhong, Fei Liu, Zhenkun Wang, Jianyong Sun, Qingfu Zhang
Title: Learning to Insert for Constructive Neural Vehicle Routing Solver
Abstract:
Neural Combinatorial Optimisation (NCO) is a promising learning-based approach for solving Vehicle Routing Problems (VRPs) without extensive manual design. While existing constructive NCO methods typically follow an appending-based paradigm that sequentially adds unvisited nodes to partial solutions, this rigid approach often leads to suboptimal results. To overcome this limitation, we explore the idea of insertion-based paradigm and propose Learning to Construct with Insertion-based Paradigm (L2C-Insert), a novel learning-based method for constructive NCO. Unlike traditional approaches, L2C-Insert builds solutions by strategically inserting unvisited nodes at any valid position in the current partial solution, which can significantly enhance the flexibility and solution quality. The proposed framework introduces three key components: a novel model architecture for precise insertion position prediction, an efficient training scheme for model optimization, and an advanced inference technique that fully exploits the insertion paradigm's flexibility. Extensive experiments on both synthetic and real-world instances of the Travelling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) demonstrate that L2C-Insert consistently achieves superior performance across various problem sizes.
中文: L2C-Insert提出了一种基于插入的神经组合优化方法,通过在部分路径中策略性地插入未访问节点来提升解决方案的灵活性和质量,在各类路径优化问题上均优于传统方法。
English: L2C-Insert introduces an insertion-based neural combinatorial optimization method that enhances solution flexibility and quality by strategically placing unvisited nodes in partial routes, outperforming traditional approaches across various routing problems.
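
To make the contrast between appending and inserting concrete, here is a minimal sketch of a classic greedy insertion heuristic for the TSP. It is not L2C-Insert itself: the abstract describes a learned model that predicts insertion positions, whereas this illustration (with the hypothetical helpers `tour_length` and `insert_construct`) simply picks the position with the smallest added distance. Node order, the distance metric, and the starting tour are all assumptions made for the example.

```python
import math


def tour_length(points, tour):
    """Total length of the closed tour visiting points in the given order."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def insert_construct(points):
    """Build a tour by inserting each node at its cheapest position.

    Nodes are considered in index order (an assumption); a learned
    method would instead predict both the node and the position.
    """
    n = len(points)
    if n <= 2:
        return list(range(n))
    tour = [0, 1]
    for node in range(2, n):
        best_pos, best_cost = 0, math.inf
        for i in range(len(tour)):
            a, b = tour[i], tour[(i + 1) % len(tour)]
            # Extra distance incurred by placing `node` between a and b.
            cost = (
                math.dist(points[a], points[node])
                + math.dist(points[node], points[b])
                - math.dist(points[a], points[b])
            )
            if cost < best_cost:
                best_pos, best_cost = i, cost
        tour.insert(best_pos + 1, node)
    return tour


# Unit square listed in an order that is bad for pure appending.
square = [(0, 0), (1, 1), (1, 0), (0, 1)]
appended = list(range(len(square)))
inserted = insert_construct(square)
```

On this toy instance, appending nodes in index order crosses the square diagonally twice, while insertion recovers the perimeter tour, which illustrates the flexibility that the abstract attributes to the insertion paradigm.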

Authors:Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang
Title: Structured Agent Distillation for Large Language Model
Abstract:
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into {[REASON]} and {[ACT]} spans, applying segment-specific losses to align each component with the teacher's behavior. This structure-aware supervision enables compact agents to better replicate the teacher's decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.
中文总结:结构化智能体蒸馏通过分别监督推理和行动片段,将大型语言模型智能体高效压缩为小模型,在保持高性能的同时显著降低了部署成本。
English Summary: Structured Agent Distillation effectively compresses large LLM agents into smaller models by employing segment-specific supervision of reasoning and action spans, maintaining high performance while significantly reducing deployment costs.
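
As an illustration of segment-specific supervision, the following minimal sketch averages per-token distillation losses separately over [REASON] and [ACT] spans and then combines them. The function name `span_losses`, the weights `w_reason` and `w_act`, and the use of precomputed per-token loss values are all assumptions for this example; the abstract does not specify the loss forms or weighting used in the paper.

```python
def span_losses(token_losses, span_tags, w_reason=1.0, w_act=1.0):
    """Average per-token losses within each span type, then combine.

    token_losses: per-token loss values (e.g., a divergence between
                  student and teacher predictions; assumed precomputed).
    span_tags:    "REASON" or "ACT" for each token.
    Returns (reason_loss, act_loss, weighted_total).
    """
    sums = {"REASON": 0.0, "ACT": 0.0}
    counts = {"REASON": 0, "ACT": 0}
    for loss, tag in zip(token_losses, span_tags):
        sums[tag] += loss
        counts[tag] += 1
    # An empty span type contributes zero rather than dividing by zero.
    reason = sums["REASON"] / counts["REASON"] if counts["REASON"] else 0.0
    act = sums["ACT"] / counts["ACT"] if counts["ACT"] else 0.0
    return reason, act, w_reason * reason + w_act * act


# Three reasoning tokens followed by one action token.
reason_loss, act_loss, total = span_losses(
    [1.0, 3.0, 2.0, 4.0],
    ["REASON", "REASON", "REASON", "ACT"],
    w_reason=0.5,
    w_act=1.0,
)
```

Averaging within each span before weighting keeps a short action span from being swamped by a long reasoning span, which is one plausible motivation for treating the two separately.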

Authors:Farshad Rostami Ghadi, Kai-Kit Wong, Masoud Kaveh, F. Javier Lopez-Martinez, Chan-Byoung Chae, George C. Alexandropoulos
Title: FIRES: Fluid Integrated Reflecting and Emitting Surfaces
Abstract:
This letter introduces the concept of fluid integrated reflecting and emitting surface (FIRES), which constitutes a new paradigm seamlessly integrating the flexibility of fluid-antenna systems (FASs) with the dual functionality of simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs). The potential of the proposed metasurface structure is studied through a FIRES-enabled multicast system based on the energy splitting protocol. In this model, the FIRES is divided into non-overlapping subareas, each functioning as a 'fluid' element capable of concurrent reflection and transmission and of changing its position of radiation within the subarea. In particular, we formulate an optimization problem for the design of the triple tunable features of the surface unit elements, which is solved via a tailored particle swarm optimization approach. Our results showcase that the proposed FIRES architecture significantly outperforms its conventional STAR-RIS counterpart.
中文: 本文提出流体集成反射与发射表面(FIRES)概念,将流体天线系统的灵活性与可重构智能表面的收发双重功能无缝融合,并通过优化设计验证了其在组播系统中显著优于传统可重构智能表面。
English: This letter presents the fluid integrated reflecting and emitting surface (FIRES), a novel metasurface that merges the flexibility of fluid-antenna systems with the dual functionality of STAR-RISs, and demonstrates through optimization and simulations that FIRES significantly surpasses conventional STAR-RIS performance in multicast systems.

Authors:Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, Boris Ginsburg
Title: Granary: Speech Recognition and Translation Dataset in 25 European Languages
Abstract:
Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approximately 50% less data. The dataset will be made available at https://hf.co/datasets/nvidia/Granary
中文: Granary项目推出了一个涵盖25种欧洲语言的大规模开源语音数据集,用于转录和翻译,通过先进的伪标签技术和数据过滤流程提升数据质量与处理效率,使模型仅需约50%的数据量即可达到相近性能。
English: The Granary project introduces a large-scale, open-source speech dataset for transcription and translation across 25 European languages, employing advanced pseudo-labeling and data filtration techniques to enhance data quality and efficiency, enabling models to achieve comparable performance with approximately 50% less data.

Authors:Wenyu Mao, Zhengyi Yang, Jiancan Wu, Haozhe Liu, Yancheng Yuan, Xiang Wang, Xiangnan He
Title: Addressing Missing Data Issue for Diffusion-based Recommendation
Abstract:
Diffusion models have shown significant potential in generating oracle items that best match user preference with guidance from user historical interaction sequences. However, the quality of guidance is often compromised by unpredictable missing data in observed sequences, leading to suboptimal item generation. Since missing data is uncertain in both occurrence and content, recovering it is impractical and may introduce additional errors. To tackle this challenge, we propose a novel dual-side Thompson sampling-based Diffusion Model (TDM), which simulates extra missing data in the guidance signals and allows diffusion models to handle existing missing data through extrapolation. To preserve user preference evolution in sequences despite extra missing data, we introduce Dual-side Thompson Sampling to implement simulation with two probability models, sampling by exploiting user preference from both item continuity and sequence stability. TDM strategically removes items from sequences based on dual-side Thompson sampling and treats these edited sequences as guidance for diffusion models, enhancing models' robustness to missing data through consistency regularization. Additionally, to enhance generation efficiency, TDM is implemented under denoising diffusion implicit models to accelerate the reverse process. Extensive experiments and theoretical analysis validate the effectiveness of TDM in addressing missing data in sequential recommendations.
中文: 提出的TDM模型通过双端汤普森采样模拟额外缺失数据,使扩散模型能够通过外推法处理现有缺失数据,从而在无需实际恢复数据的情况下提升序列推荐中的物品生成质量。
English: The proposed TDM model enhances diffusion models' robustness to unpredictable missing data in user interaction sequences by simulating additional missing data through dual-side Thompson sampling, thereby improving item generation quality without impractical data recovery attempts.

Authors:Ziyu Zhou, Jiaxi Hu, Qingsong Wen, James T. Kwok, Yuxuan Liang
Title: Multi-Order Wavelet Derivative Transform for Deep Time Series Forecasting
Abstract:
In deep time series forecasting, the Fourier Transform (FT) is extensively employed for frequency representation learning. However, it often struggles in capturing multi-scale, time-sensitive patterns. Although the Wavelet Transform (WT) can capture these patterns through frequency decomposition, its coefficients are insensitive to change points in time series, leading to suboptimal modeling. To mitigate these limitations, we introduce the multi-order Wavelet Derivative Transform (WDT) grounded in the WT, enabling the extraction of time-aware patterns spanning both the overall trend and subtle fluctuations. Compared with the standard FT and WT, which model the raw series, the WDT operates on the derivative of the series, selectively magnifying rate-of-change cues and exposing abrupt regime shifts that are particularly informative for time series modeling. Practically, we embed the WDT into a multi-branch framework named WaveTS, which decomposes the input series into multi-scale time-frequency coefficients, refines them via linear layers, and reconstructs them into the time domain via the inverse WDT. Extensive experiments on ten benchmark datasets demonstrate that WaveTS achieves state-of-the-art forecasting accuracy while retaining high computational efficiency.
Chinese: 针对傅里叶变换和小波变换在捕捉多尺度时间敏感模式上的不足,提出了基于小波变换的多阶小波导数变换(WDT),并融入WaveTS框架,在保持高效的同时实现了最先进的预测精度。
English: The multi-order Wavelet Derivative Transform (WDT) is introduced to overcome the limitations of Fourier and Wavelet Transforms in capturing multi-scale, time-sensitive patterns, and it is integrated into the WaveTS framework, which achieves state-of-the-art forecasting accuracy with high efficiency.
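
The idea of applying a wavelet transform to the derivative of a series rather than to the raw series can be illustrated with a small sketch. This is not the paper's WDT: the finite-difference derivative, the single-level Haar wavelet, and the helper names `difference` and `haar_step` are all assumptions chosen to keep the example self-contained.

```python
def difference(series, order=1):
    """Finite-difference approximation of the order-th derivative."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def haar_step(series):
    """One level of the Haar wavelet transform.

    Returns (approximation, detail) coefficients computed over
    non-overlapping pairs; a trailing odd sample is dropped.
    """
    s = 2 ** 0.5
    approx = [(series[i] + series[i + 1]) / s for i in range(0, len(series) - 1, 2)]
    detail = [(series[i] - series[i + 1]) / s for i in range(0, len(series) - 1, 2)]
    return approx, detail


# A step signal: flat at 0, then an abrupt shift to 5.
signal = [0, 0, 0, 0, 5, 5, 5, 5, 5]
raw_approx, raw_detail = haar_step(signal)
der_approx, der_detail = haar_step(difference(signal))
```

In this toy case the Haar detail coefficients of the raw step signal are all zero because the jump falls on a pair boundary, whereas the coefficients of the differenced series are nonzero only at the location of the jump. That is one way to see why operating on the derivative can expose change points that raw-series coefficients miss.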

Authors:Tianyu Huai, Jie Zhou, Yuxuan Cai, Qin Chen, Wen Wu, Xingjiao Wu, Xipeng Qiu, Liang He
Title: Task-Core Memory Management and Consolidation for Long-term Continual Learning
Abstract:
In this paper, we focus on a long-term continual learning (CL) task, where a model learns sequentially from a vast stream of tasks over time, acquiring new knowledge while retaining previously learned information in a manner akin to human learning. Unlike traditional CL settings, long-term CL involves handling a significantly larger number of tasks, which exacerbates the issue of catastrophic forgetting. Our work seeks to address two critical questions: 1) How do existing CL methods perform in the context of long-term CL? and 2) How can we mitigate the catastrophic forgetting that arises from prolonged sequential updates? To tackle these challenges, we propose a novel framework inspired by human memory mechanisms for long-term continual learning (Long-CL). Specifically, we introduce a task-core memory management strategy to efficiently index crucial memories and adaptively update them as learning progresses. Additionally, we develop a long-term memory consolidation mechanism that selectively retains hard and discriminative samples, ensuring robust knowledge retention. To facilitate research in this area, we construct and release two multi-modal and textual benchmarks, MMLongCL-Bench and TextLongCL-Bench, providing a valuable resource for evaluating long-term CL approaches. Experimental results show that Long-CL outperforms the previous state-of-the-art by 7.4% and 6.5% AP on the two benchmarks, respectively, demonstrating the effectiveness of our approach.
中文: 本文提出了一种新颖的长期持续学习框架Long-CL,通过任务核心记忆管理和选择性记忆巩固机制有效缓解灾难性遗忘问题,并在新发布的基准测试中取得了最优性能。
English: This paper introduces a novel long-term continual learning framework, Long-CL, which mitigates catastrophic forgetting through task-core memory management and selective memory consolidation, achieving state-of-the-art performance on newly released benchmarks.

Authors:Yutao Yang, Jie Zhou, Junsong Li, Qianjun Pan, Bihao Zhan, Qin Chen, Xipeng Qiu, Liang He
Title: Reinforced Interactive Continual Learning via Real-time Noisy Human Feedback
Abstract:
This paper introduces an interactive continual learning paradigm where AI models dynamically learn new skills from real-time human feedback while retaining prior knowledge. This paradigm distinctively addresses two major limitations of traditional continual learning: (1) dynamic model updates using streaming, real-time human-annotated data, rather than static datasets with fixed labels, and (2) the assumption of clean labels, by explicitly handling the noisy feedback common in real-world interactions. To tackle these problems, we propose RiCL, a Reinforced interactive Continual Learning framework leveraging Large Language Models (LLMs) to learn new skills effectively from dynamic feedback. RiCL incorporates three key components: a temporal consistency-aware purifier to automatically discern clean from noisy samples in data streams; an interaction-aware direct preference optimization strategy to align model behavior with human intent by reconciling AI-generated and human-provided feedback; and a noise-resistant contrastive learning module that captures robust representations by exploiting inherent data relationships, thus avoiding reliance on potentially unreliable labels. Extensive experiments on two benchmark datasets (FewRel and TACRED), contaminated with realistic noise patterns, demonstrate that our RiCL approach substantially outperforms existing combinations of state-of-the-art online continual learning and noisy-label learning methods.
中文: 本文提出RiCL强化交互式持续学习框架,通过时间一致性净化器、交互感知偏好优化和抗噪对比学习三大组件,使AI模型能够从实时人类反馈中动态学习新技能并保持原有知识,在噪声标注环境下显著优于现有方法。
English: This paper presents RiCL, a reinforced interactive continual learning framework that enables AI models to dynamically acquire new skills from real-time human feedback while maintaining prior knowledge, effectively addressing noisy labels and outperforming existing methods on benchmark datasets.

Authors:Fan Zhang, Tianyu Liu, Zhihong Zhu, Hao Wu, Haixin Wang, Donghao Zhou, Yefeng Zheng, Kun Wang, Xian Wu, Pheng-Ann Heng
Title: CellVerse: Do Large Language Models Really Understand Cell Biology?
Abstract:
Recent studies have demonstrated the feasibility of modeling single-cell data as natural languages and the potential of leveraging powerful large language models (LLMs) for understanding cell biology. However, a comprehensive evaluation of LLMs' performance on language-driven single-cell analysis tasks still remains unexplored. Motivated by this challenge, we introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data and encompasses three hierarchical levels of single-cell analysis tasks: cell type annotation (cell-level), drug response prediction (drug-level), and perturbation analysis (gene-level). Going beyond this, we systematically evaluate the performance across 14 open-source and closed-source LLMs ranging from 160M to 671B on CellVerse. Remarkably, the experimental results reveal: (1) Existing specialist models (C2S-Pythia) fail to make reasonable decisions across all sub-tasks within CellVerse, while generalist models such as Qwen, Llama, GPT, and DeepSeek family models exhibit preliminary understanding capabilities within the realm of cell biology. (2) The performance of current LLMs falls short of expectations and has substantial room for improvement. Notably, in the widely studied drug response prediction task, none of the evaluated LLMs demonstrate significant performance improvement over random guessing. CellVerse offers the first large-scale empirical demonstration that significant challenges still remain in applying LLMs to cell biology. By introducing CellVerse, we lay the foundation for advancing cell biology through natural languages and hope this paradigm could facilitate next-generation single-cell analysis.
中文摘要:近期研究尝试将单细胞数据建模为自然语言并利用大语言模型理解细胞生物学,但缺乏系统性评估;为此引入CellVerse基准测试,发现现有大语言模型在细胞生物学任务中表现欠佳,仍面临重大挑战。
English Summary: Recent research explores using large language models (LLMs) for single-cell biology analysis, but a comprehensive evaluation is lacking, leading to the creation of CellVerse, a benchmark that reveals current LLMs' limited performance and significant challenges in this domain.

Authors:Hongming Wang, Yifeng Wu, Huimin Huang, Hongtao Wu, Jia-Xuan Jiang, Xiaodong Zhang, Hao Zheng, Xian Wu, Yefeng Zheng, Jinping Xu, Jing Cheng
Title: BrainSegDMlF: A Dynamic Fusion-enhanced SAM for Brain Lesion Segmentation
Abstract:
The segmentation of substantial brain lesions is a significant and challenging task in the field of medical image segmentation. Substantial brain lesions in brain imaging exhibit high heterogeneity, with indistinct boundaries between lesion regions and normal brain tissue. Small lesions in single slices are difficult to identify, making the accurate and reproducible segmentation of abnormal regions, as well as their feature description, highly complex. Existing methods have the following limitations: 1) They rely solely on single-modal information for learning, neglecting the multi-modal information commonly used in diagnosis. This hampers the ability to comprehensively acquire brain lesion information from multiple perspectives and prevents the effective integration and utilization of multi-modal data inputs, thereby limiting a holistic understanding of lesions. 2) They are constrained by the amount of data available, leading to low sensitivity to small lesions and difficulty in detecting subtle pathological changes. 3) Current SAM-based models rely on external prompts, which cannot achieve automatic segmentation and, to some extent, affect diagnostic efficiency. To address these issues, we have developed a large-scale fully automated segmentation model specifically designed for brain lesion segmentation, named BrainSegDMLF. This model has the following features: 1) A Dynamic Modal Interactive Fusion (DMIF) module that processes and integrates multi-modal data during the encoding process, providing the SAM encoder with more comprehensive modal information. 2) A Layer-by-Layer Upsampling Decoder, enabling the model to extract rich low-level and high-level features even with limited data, thereby detecting the presence of small lesions. 3) Automatic segmentation masks, allowing the model to generate lesion masks automatically without requiring manual prompts.
中文:BrainSegDMLF模型通过动态模态交互融合模块整合多模态数据,解决了脑部病变分割中对小病灶检测不足和依赖外部提示的问题,实现了全自动精准分割。
English: The BrainSegDMLF model addresses key limitations in brain lesion segmentation by integrating multi-modal data through a Dynamic Modal Interactive Fusion module, enabling automatic detection of small lesions without manual prompts.

Authors:Tianyi Liao, Wei Guo, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
Title: Fluid Antenna-Assisted MU-MIMO Systems with Decentralized Baseband Processing
Abstract:
The fluid antenna system (FAS) has emerged as a disruptive technology, offering unprecedented degrees of freedom (DoF) for wireless communication systems. However, optimizing fluid antenna (FA) positions entails significant computational costs, especially when the number of FAs is large. To address this challenge, we introduce a decentralized baseband processing (DBP) architecture to FAS, which partitions the FA array into clusters and enables parallel processing. Based on the DBP architecture, we formulate a weighted sum rate (WSR) maximization problem through joint beamforming and FA position design for FA-assisted multiuser multiple-input multiple-output (MU-MIMO) systems. To solve the WSR maximization problem, we propose a novel decentralized block coordinate ascent (BCA)-based algorithm that leverages matrix fractional programming (FP) and majorization-minimization (MM) methods. The proposed decentralized algorithm achieves low computational, communication, and storage costs, thus unleashing the potential of the DBP architecture. Simulation results show that our proposed algorithm under the DBP architecture reduces computational time by over 70% compared to centralized architectures with negligible WSR performance loss.
中文摘要:本研究为流体天线系统引入了一种去中心化基带处理架构,通过新算法在保证多用户MIMO系统性能的同时,将计算时间降低了70%以上。
English Summary: The study introduces a decentralized baseband processing architecture for fluid antenna systems, employing a novel algorithm that significantly reduces computational time by over 70% while maintaining near-optimal performance in multiuser MIMO setups.

Authors:Erqiang Tang, Wei Guo, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
Title: Accurate and Fast Channel Estimation for Fluid Antenna Systems with Diffusion Models
Abstract:
Fluid antenna systems (FAS) offer enhanced spatial diversity for next-generation wireless systems. However, acquiring accurate channel state information (CSI) remains challenging due to the large number of reconfigurable ports and the limited availability of radio-frequency (RF) chains -- particularly in high-dimensional FAS scenarios. To address this challenge, we propose an efficient posterior sampling-based channel estimator that leverages a diffusion model (DM) with a simplified U-Net architecture to capture the spatial correlation structure of two-dimensional FAS channels. The DM is initially trained offline in an unsupervised way and then applied online as a learned implicit prior to reconstruct CSI from partial observations via posterior sampling through a denoising diffusion restoration model (DDRM). To accelerate the online inference, we introduce a skipped sampling strategy that updates only a subset of latent variables during the sampling process, thereby reducing the computational cost with minimal accuracy degradation. Simulation results demonstrate that the proposed approach achieves significantly higher estimation accuracy and over 20x speedup compared to state-of-the-art compressed sensing-based methods, highlighting its potential for practical deployment in high-dimensional FAS.
中文: 该研究提出基于扩散模型的信道估计方法,通过简化U-Net架构和跳跃采样策略,在二维流体天线系统中实现高效信道重建,相比现有技术精度显著提升且速度加快20倍以上。
English: The proposed diffusion model-based channel estimator efficiently reconstructs high-dimensional fluid antenna system channels using a simplified U-Net and skipped sampling strategy, achieving superior accuracy with over 20x speedup compared to existing methods.

Authors:Zirui Liu, Jiatong Li, Yan Zhuang, Qi Liu, Shuanghong Shen, Jie Ouyang, Mingyue Cheng, Shijin Wang
Title: am-ELO: A Stable Framework for Arena-based LLM Evaluation
Abstract:
Arena-based evaluation is a fundamental yet significant evaluation paradigm for modern AI models, especially large language models (LLMs). The existing framework based on the ELO rating system suffers from an inevitable instability problem due to ranking inconsistency and the lack of attention to the varying abilities of annotators. In this paper, we introduce a novel stable arena framework to address these issues by enhancing the ELO Rating System. Specifically, we replace the iterative update method with a Maximum Likelihood Estimation (MLE) approach, m-ELO, and provide theoretical proof of the consistency and stability of the MLE approach for model ranking. Additionally, we propose am-ELO, which modifies the Elo Rating's probability function to incorporate annotator abilities, enabling the simultaneous estimation of model scores and annotator reliability. Experiments demonstrate that this method ensures stability, proving that this framework offers a more robust, accurate, and stable evaluation method for LLMs.
Chinese: 本文提出了一种稳定的竞技场框架,通过最大似然估计(m-ELO)和融入标注者能力(am-ELO)来改进ELO评分系统,为大型语言模型提供了更稳健、准确的评估方法。
English: This paper introduces a stable arena framework that enhances the ELO rating system by using Maximum Likelihood Estimation (m-ELO) and incorporating annotator abilities (am-ELO), providing a more robust and accurate evaluation method for large language models.
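
The MLE idea in the abstract above can be illustrated with a small sketch. The abstract does not give the exact probability function, so this example assumes a logistic form in which annotator k scales the score gap: P(i beats j | k) = sigmoid(ability[k] * (score[i] - score[j])). The function name `fit_am_elo`, the tuple layout of `battles`, and the plain gradient-ascent optimizer are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x):
    # Numerically safe logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fit_am_elo(battles, n_models, n_annotators, lr=0.05, steps=2000):
    """Jointly fit model scores and annotator abilities by maximum likelihood.

    battles: list of (i, j, k, w), meaning annotator k compared model i with
    model j; w is 1.0 if model i won and 0.0 if model j won.
    Assumed likelihood (not taken from the paper):
        P(i beats j | k) = sigmoid(ability[k] * (score[i] - score[j]))
    """
    scores = [0.0] * n_models
    ability = [1.0] * n_annotators
    n = len(battles)
    for _ in range(steps):
        grad_s = [0.0] * n_models
        grad_a = [0.0] * n_annotators
        for i, j, k, w in battles:
            diff = scores[i] - scores[j]
            err = w - sigmoid(ability[k] * diff)  # d(log-lik)/d(logit)
            grad_s[i] += err * ability[k]
            grad_s[j] -= err * ability[k]
            grad_a[k] += err * diff
        # Gradient ascent on the average log-likelihood.
        scores = [s + lr * g / n for s, g in zip(scores, grad_s)]
        ability = [a + lr * g / n for a, g in zip(ability, grad_a)]
        # Scores are only identified up to a shift, so keep them centered.
        mean = sum(scores) / n_models
        scores = [s - mean for s in scores]
    return scores, ability
```

Because all battles are fitted at once, the result does not depend on the order in which comparisons arrive, unlike an iterative Elo update. In this sketch an annotator whose votes tend to contradict the fitted ranking ends up with a low (possibly negative) ability value.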

Authors:Qianjun Pan, Wenkai Ji, Yuyang Ding, Junsong Li, Shilian Chen, Junyi Wang, Jie Zhou, Qin Chen, Min Zhang, Yulan Wu, Liang He
Title: A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law
Abstract:
This survey explores recent advancements in reasoning large language models (LLMs) designed to mimic "slow thinking" - a reasoning process inspired by human cognition, as described in Kahneman's Thinking, Fast and Slow. These models, like OpenAI's o1, focus on scaling computational resources dynamically during complex tasks, such as math reasoning, visual reasoning, medical diagnosis, and multi-agent debates. We present the development of reasoning LLMs and list their key technologies. By synthesizing over 100 studies, it charts a path toward LLMs that combine human-like deep thinking with scalable efficiency for reasoning. The review breaks down methods into three categories: (1) test-time scaling dynamically adjusts computation based on task complexity via search and sampling, dynamic verification; (2) reinforced learning refines decision-making through iterative improvement leveraging policy networks, reward models, and self-evolution strategies; and (3) slow-thinking frameworks (e.g., long CoT, hierarchical processes) that structure problem-solving with manageable steps. The survey highlights the challenges and further directions of this domain. Understanding and advancing the reasoning abilities of LLMs is crucial for unlocking their full potential in real-world applications, from scientific discovery to decision support systems.
中文摘要:本综述探讨了模拟人类“慢思考”的推理大语言模型,通过动态调整计算资源处理复杂任务,将方法分为测试时扩展、强化学习和结构化框架三类,以提升问题解决能力。
English Summary: This survey examines reasoning large language models that emulate human "slow thinking" by dynamically scaling computational resources for complex tasks, categorizing methods into test-time scaling, reinforced learning, and structured frameworks to enhance problem-solving capabilities.

Authors:Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, Mingyu Ding
Title: Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions
Abstract:
Vision-Language-Action (VLA) models have shown great promise for generalist robotic manipulation in the physical world. However, existing models are restricted to robot observations and text-only instructions, lacking the flexibility of interleaved multimodal instructions enabled by recent advances in foundation models in the digital world. In this paper, we present Interleave-VLA, the first framework capable of comprehending interleaved image-text instructions and directly generating continuous action sequences in the physical world. It offers a flexible, model-agnostic paradigm that extends state-of-the-art VLA models with minimal modifications and strong zero-shot generalization. A key challenge in realizing Interleave-VLA is the absence of large-scale interleaved embodied datasets. To bridge this gap, we develop an automatic pipeline that converts text-only instructions from real-world datasets in Open X-Embodiment into interleaved image-text instructions, resulting in the first large-scale real-world interleaved embodied dataset with 210k episodes. Through comprehensive evaluation on simulation benchmarks and real-robot experiments, we demonstrate that Interleave-VLA offers significant benefits: 1) it improves out-of-domain generalization to unseen objects by 2-3x compared to state-of-the-art baselines, 2) supports flexible task interfaces, and 3) handles diverse user-provided image instructions in a zero-shot manner, such as hand-drawn sketches. We further analyze the factors behind Interleave-VLA's strong zero-shot performance, showing that the interleaved paradigm effectively leverages heterogeneous datasets and diverse instruction images, including those from the Internet, which demonstrates strong potential for scaling up. Our model and dataset will be open-sourced.
中文: Interleave-VLA提出了一种创新的机器人学习范式,通过交错图像-文本指令提升泛化能力和交互性,在未见任务中实现了卓越的零样本性能。
English: Interleave-VLA introduces a novel robot learning paradigm that uses interleaved image-text instructions to enhance generalization and interaction, achieving superior zero-shot performance in unseen tasks.

Authors:Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, Mingyu Ding
Title: Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions
Abstract:
The rise of foundation models paves the way for generalist robot policies in the physical world. Existing methods relying on text-only instructions often struggle to generalize to unseen scenarios. We argue that interleaved image-text inputs offer richer and less biased context and enable robots to better handle unseen tasks with more versatile human-robot interaction. Building on this insight, Interleave-VLA, the first robot learning paradigm capable of comprehending interleaved image-text instructions and directly generating continuous action sequences in the physical world, is introduced. It offers a natural, flexible, and model-agnostic paradigm that extends state-of-the-art vision-language-action (VLA) models with minimal modifications while achieving strong zero-shot generalization. Interleave-VLA also includes an automatic pipeline that converts text instructions from Open X-Embodiment into interleaved image-text instructions, resulting in a large-scale real-world interleaved embodied dataset with 210k episodes. Comprehensive evaluation in simulation and the real world shows that Interleave-VLA offers two major benefits: (1) improves out-of-domain generalization to unseen objects by 2x compared to text input baselines, (2) supports flexible task interfaces and diverse instructions in a zero-shot manner, such as hand-drawn sketches. We attribute Interleave-VLA's strong zero-shot capability to the use of instruction images, which effectively mitigate hallucinations, and the inclusion of heterogeneous multimodal datasets, enriched with Internet-sourced images, offering potential for scalability. More information is available at https://interleave-vla.github.io/Interleave-VLA-Anonymous/
中文: Interleave-VLA提出了一种创新的机器人学习范式,通过交错图像-文本指令提升泛化能力和交互性,在未见任务中实现了卓越的零样本性能。
English: Interleave-VLA introduces a novel robot learning paradigm that uses interleaved image-text instructions to enhance generalization and interaction, achieving superior zero-shot performance in unseen tasks.

Authors:Wei Chen, Jiahao Zhang, Haipeng Zhu, Boyan Xu, Zhifeng Hao, Keli Zhang, Junjian Ye, Ruichu Cai
Title: Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting
Abstract:
Large language models (LLMs) have shown great potential in decision-making due to the vast amount of knowledge stored within the models. However, these pre-trained models are prone to lack reasoning abilities and are difficult to adapt to new environments, further hindering their application to complex real-world tasks. To address these challenges, inspired by the human cognitive process, we propose Causal-aware LLMs, which integrate the structural causal model (SCM) into the decision-making process to model, update, and utilize structured knowledge of the environment in a "learning-adapting-acting" paradigm. Specifically, in the learning stage, we first utilize an LLM to extract the environment-specific causal entities and their causal relations to initialize a structured causal model of the environment. Subsequently, in the adapting stage, we update the structured causal model through external feedback about the environment, via an idea of causal intervention. Finally, in the acting stage, Causal-aware LLMs exploit structured causal knowledge for more efficient policy-making through the reinforcement learning agent. The above processes are performed iteratively to learn causal knowledge, ultimately enabling the causal-aware LLMs to achieve a more accurate understanding of the environment and make more efficient decisions. Experimental results across 22 diverse tasks within the open-world game "Crafter" validate the effectiveness of our proposed method.
中文: 因果感知大语言模型通过整合结构因果模型,采用“学习-适应-行动”范式迭代学习和更新环境知识,从而增强推理与适应能力,在复杂任务中实现更高效的决策。
English: Causal-aware LLMs integrate structural causal models to enhance reasoning and adaptability in decision-making, iteratively learning and updating environmental knowledge through a "learning-adapting-acting" paradigm for improved performance in complex tasks.

Authors:Yike Wang, Shangbin Feng, Yulia Tsvetkov, Hannaneh Hajishirzi
Title: ScienceMeter: Tracking Scientific Knowledge Updates in Language Models
Abstract:
Large Language Models (LLMs) are increasingly used to support scientific research, but their knowledge of scientific advancements can quickly become outdated. We introduce ScienceMeter, a new framework for evaluating scientific knowledge update methods over scientific knowledge spanning the past, present, and future. ScienceMeter defines three metrics: knowledge preservation, the extent to which models' understanding of previously learned papers is preserved; knowledge acquisition, how well scientific claims from newly introduced papers are acquired; and knowledge projection, the ability of the updated model to anticipate or generalize to related scientific claims that may emerge in the future. Using ScienceMeter, we examine the scientific knowledge of LLMs on claim judgment and generation tasks across a curated dataset of 15,444 scientific papers and 30,888 scientific claims from ten domains including medicine, biology, materials science, and computer science. We evaluate five representative knowledge update approaches including training- and inference-time methods. With extensive experiments, we find that the best-performing knowledge update methods can preserve only 85.9% of existing knowledge, acquire 71.7% of new knowledge, and project 37.7% of future knowledge. Inference-based methods work for larger models, whereas smaller models require training to achieve comparable performance. Cross-domain analysis reveals that performance on these objectives is correlated. Even when applied to specialized scientific LLMs, existing knowledge update methods fail to achieve these objectives collectively, underscoring that developing robust scientific knowledge update mechanisms is both crucial and challenging.
中文:ScienceMeter是一个评估大语言模型在科学知识保存、获取和预测方面表现的框架,揭示了现有方法在跨领域更新模型知识时存在显著不足。
English: ScienceMeter is a framework that assesses how well Large Language Models preserve, acquire, and project scientific knowledge, revealing current methods' limitations in keeping models up-to-date across diverse fields.

Authors:Yiming Lei, Zhizheng Yang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Title: ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations
Abstract:
Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, the existing open-source multi-modal models suffer from the weak capability of multi-turn interaction, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports the research of multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog, and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.
中文: 针对多模态模型在多轮交互中的不足,本研究提出ContextQFormer增强上下文建模,并构建了长对话数据集TMDialog,实验表明其性能比基线模型提升2%-4%。
English: To enhance multi-modal models' weak multi-turn interaction capabilities, this work introduces ContextQFormer for better contextual representation and creates TMDialog, a long-conversation dataset that shows a 2%-4% improvement in performance over baselines.

Authors:Mian Muhammad Naeem Abid, Nancy Mehta, Zongwei Wu, Radu Timofte
Title: LeMoRe: Learn More Details for Lightweight Semantic Segmentation
Abstract:
Lightweight semantic segmentation is essential for many downstream vision tasks. Unfortunately, existing methods often struggle to balance efficiency and performance due to the complexity of feature modeling. Many of these existing approaches are constrained by rigid architectures and implicit representation learning, often characterized by parameter-heavy designs and a reliance on computationally intensive Vision Transformer-based frameworks. In this work, we introduce an efficient paradigm by synergizing explicit and implicit modeling to balance computational efficiency with representational fidelity. Our method combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, efficiently capturing global dependencies through a nested attention mechanism. Extensive experiments on challenging datasets, including ADE20K, CityScapes, Pascal Context, and COCO-Stuff, demonstrate that LeMoRe strikes an effective balance between performance and efficiency.
中文摘要:提出的LeMoRe方法通过显隐式建模协同与嵌套注意力机制,在多个挑战性数据集上实现了轻量级语义分割性能与效率的有效平衡。
English Summary: The proposed LeMoRe method synergizes explicit and implicit modeling with a nested attention mechanism to effectively balance performance and efficiency in lightweight semantic segmentation across multiple challenging datasets.

Authors:Yu Sheng, Jiajun Deng, Xinran Zhang, Yu Zhang, Bei Hua, Yanyong Zhang, Jianmin Ji
Title: SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images
Abstract:
A major breakthrough in 3D reconstruction is the feedforward paradigm to generate pixel-wise 3D points or Gaussian primitives from sparse, unposed images. To further incorporate semantics while avoiding the significant memory and storage costs of high-dimensional semantic features, existing methods extend this paradigm by associating each primitive with a compressed semantic feature vector. However, these methods have two major limitations: (a) the naively compressed feature compromises expressiveness, affecting the model's ability to capture fine-grained semantics, and (b) the pixel-wise primitive prediction introduces redundancy in overlapping areas, causing unnecessary memory overhead. To this end, we introduce SpatialSplat, a feedforward framework that produces redundancy-aware Gaussians and capitalizes on a dual-field semantic representation. Particularly, with the insight that primitives within the same instance exhibit high semantic consistency, we decompose the semantic representation into a coarse feature field that encodes uncompressed semantics with minimal primitives, and a fine-grained yet low-dimensional feature field that captures detailed inter-instance relationships. Moreover, we propose a selective Gaussian mechanism, which retains only essential Gaussians in the scene, effectively eliminating redundant primitives. Our proposed SpatialSplat learns accurate semantic information and detailed instances prior with more compact 3D Gaussians, making semantic 3D reconstruction more applicable. We conduct extensive experiments to evaluate our method, demonstrating a remarkable 60% reduction in scene representation parameters while achieving superior performance over state-of-the-art methods. The code will be made available for future investigation.
中文:SpatialSplat提出了一种前馈框架,通过双场语义表示和选择性高斯机制减少冗余并提升细粒度语义,在减少60%参数的同时实现了优于现有方法的性能。
English: SpatialSplat introduces a feedforward framework that uses dual-field semantic representation and selective Gaussian mechanisms to reduce redundancy and enhance fine-grained semantics, achieving a 60% reduction in parameters while outperforming existing methods.

Authors:Jianwei Wang, Mengqi Wang, Yinsi Zhou, Zhenchang Xing, Qing Liu, Xiwei Xu, Wenjie Zhang, Liming Zhu
Title: LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements
Abstract:
Health, Safety, and Environment (HSE) compliance assessment demands dynamic real-time decision-making under complicated regulations and complex human-machine-environment interactions. While large language models (LLMs) hold significant potential for decision intelligence and contextual dialogue, their capacity for domain-specific knowledge in HSE and structured legal reasoning remains underexplored. We introduce HSE-Bench, the first benchmark dataset designed to evaluate the HSE compliance assessment capabilities of LLMs. HSE-Bench comprises over 1,000 manually curated questions drawn from regulations, court cases, safety exams, and fieldwork videos, and integrates a reasoning flow based on Issue spotting, rule Recall, rule Application, and rule Conclusion (IRAC) to assess the holistic reasoning pipeline. We conduct extensive evaluations on different prompting strategies and more than 10 LLMs, including foundation models, reasoning models and multimodal vision models. The results show that, although current LLMs achieve good performance, their capabilities largely rely on semantic matching rather than principled reasoning grounded in the underlying HSE compliance context. Moreover, their native reasoning trace lacks the systematic legal reasoning required for rigorous HSE compliance assessment. To alleviate these issues, we propose a new prompting technique, Reasoning of Expert (RoE), which guides LLMs to simulate the reasoning process of different experts for compliance assessment and reach a more accurate unified decision. We hope our study highlights reasoning gaps in LLMs for HSE compliance and inspires further research on related tasks.
中文: HSE-Bench作为首个评估大语言模型在健康安全环境合规性评估能力的基准,揭示了模型依赖语义匹配而非原则性推理的局限,并提出专家推理提示技术以提升专业决策水平。
English: HSE-Bench is introduced as the first benchmark to evaluate LLMs' HSE compliance assessment capabilities, revealing their reliance on semantic matching over principled reasoning and proposing the RoE prompting technique to enhance expert-like decision-making.

Authors:Mehdi Ali, Manuel Brack, Max Lübbering, Elias Wendt, Abbas Goher Khan, Richard Rutmann, Alex Jude, Maurice Kraus, Alexander Arno Weber, David Kaczér, Florian Mai, Lucie Flek, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Patrick Schramowski, Michael Fromm, Kristian Kersting
Title: Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models
Abstract:
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
中文: JQL提出了一种系统性方法,利用从大语言模型提炼的轻量级标注器高效筛选高质量多语言数据,在35种语言中显著优于启发式过滤法并提升模型训练效果。
English: JQL introduces a systematic method using distilled lightweight annotators from LLMs to efficiently curate high-quality multilingual data, outperforming heuristic filters and improving model training across 35 languages.

Authors:Patrick Gerard, Hans W. A. Hanley, Luca Luceri, Emilio Ferrara
Title: Bridging the Narrative Divide: Cross-Platform Discourse Networks in Fragmented Ecosystems
Abstract:
Political discourse has grown increasingly fragmented across different social platforms, making it challenging to trace how narratives spread and evolve within such a fragmented information ecosystem. Reconstructing social graphs and information diffusion networks is challenging, and available strategies typically depend on platform-specific features and behavioral signals which are often incompatible across systems and increasingly restricted. To address these challenges, we present a platform-agnostic framework that allows accurate and efficient reconstruction of the underlying social graph of users' cross-platform interactions, based on discovering latent narratives and users' participation therein. Our method achieves state-of-the-art performance in key network-based tasks: information operation detection, ideological stance prediction, and cross-platform engagement prediction, while requiring significantly less data than existing alternatives and capturing a broader set of users. When applied to cross-platform information dynamics between Truth Social and X (formerly Twitter), our framework reveals a small, mixed-platform group of bridge users, comprising just 0.33% of users and 2.14% of posts, who introduce nearly 70% of migrating narratives to the receiving platform. These findings offer a structural lens for anticipating how narratives traverse fragmented information ecosystems, with implications for cross-platform governance, content moderation, and policy interventions.
中文摘要:本研究提出了一种跨平台通用框架,通过分析潜在叙事重建社交图谱,在信息操作检测和立场预测等任务中实现最优性能,同时揭示仅占0.33%的桥梁用户如何推动近70%的跨平台叙事迁移。
English Summary: This study introduces a platform-agnostic framework that reconstructs social graphs by analyzing latent narratives, achieving superior performance in detecting information operations and predicting ideological stances with minimal data, while revealing how a small group of bridge users drives cross-platform narrative migration.

Authors:Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Lijie Zhao, Yi-Fan Zhang, Xinfeng Li, Chaoyou Fu, Zhuoer Wen, Wenting Liu, Zhuoran Zhang, Xinlong Chen, Bohan Zeng, Sihan Yang, Yushuo Guan, Zhang Zhang, Liang Wang, Haoxuan Li, Zhouchen Lin, Yuanxing Zhang, Pengfei Wan, Haotian Wang, Wenjing Yang
Title: MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios
Abstract:
Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) from static images. However, their efficacy in video OCR is significantly diminished due to factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce the MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios. MME-VideoOCR features 10 task categories comprising 25 individual tasks and spans 44 diverse scenarios. These tasks extend beyond text recognition to incorporate deeper comprehension and reasoning of textual content within videos. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate 18 state-of-the-art MLLMs on MME-VideoOCR, revealing that even the best-performing model (Gemini-2.5 Pro) achieves an accuracy of only 73.7%. Fine-grained analysis indicates that while existing MLLMs demonstrate strong performance on tasks where relevant texts are contained within a single or few frames, they exhibit limited capability in effectively handling tasks that demand holistic video comprehension. These limitations are especially evident in scenarios that require spatio-temporal reasoning, cross-frame information integration, or resistance to language prior bias. Our findings also highlight the importance of high-resolution visual input and sufficient temporal coverage for reliable OCR in dynamic video scenarios.
中文: 多模态大语言模型在静态图像OCR中表现出色,但在视频OCR中因运动模糊、时序变化等因素效果显著下降,为此我们推出了全面的MME-VideoOCR基准测试,旨在评估和指导视频文本识别与理解能力的提升。
English: Multimodal Large Language Models excel at OCR in static images but struggle with video OCR due to challenges like motion blur and temporal changes, prompting the introduction of the comprehensive MME-VideoOCR benchmark to evaluate and guide improvements in video text recognition and comprehension.

Authors:Hexiong Yang, Mingrui Chen, Huaibo Huang, Junxian Duan, Jie Cao, Zhen Zhou, Ran He
Title: HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling
Abstract:
Inspired by the great success of Masked Language Modeling (MLM) in the natural language domain, the paradigm of self-supervised pre-training and fine-tuning has also achieved remarkable progress in the field of DNA sequence modeling. However, previous methods often relied on massive pre-training data or large-scale base models with huge parameters, imposing a significant computational burden. To address this, many works attempted to use more compact models to achieve similar outcomes but still fell short by a considerable margin. In this work, we propose a Hybrid Architecture Distillation (HAD) approach, leveraging both distillation and reconstruction tasks for more efficient and effective pre-training. Specifically, we employ the NTv2-500M as the teacher model and devise a grouping masking strategy to align the feature embeddings of visible tokens while concurrently reconstructing the invisible tokens during MLM pre-training. To validate the effectiveness of our proposed method, we conducted comprehensive experiments on the Nucleotide Transformer Benchmark and Genomic Benchmark. Compared to models with similar parameters, our model achieved excellent performance. More surprisingly, on some sub-tasks it even surpassed the teacher model, which is the nominal distillation ceiling and is more than 500x larger. Lastly, we utilize t-SNE for more intuitive visualization, which shows that our model can gain a sophisticated understanding of the intrinsic representation pattern in genomic sequences.
中文: 本研究提出了一种混合架构蒸馏方法,通过结合蒸馏和重建任务优化DNA序列建模,使紧凑模型在部分任务上超越了规模大500倍的教师模型,展现出卓越性能。
English: This study introduces a Hybrid Architecture Distillation (HAD) method that enhances DNA sequence modeling by combining distillation and reconstruction tasks, achieving superior performance with a compact model that even surpasses its much larger teacher model on certain tasks.

Authors:Seungheon Doh, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Juhan Nam, Yuki Mitsufuji
Title: Can Large Language Models Predict Audio Effects Parameters from Natural Language?
Abstract:
In music production, manipulating audio effects (Fx) parameters through natural language has the potential to reduce technical barriers for non-experts. We present LLM2Fx, a framework leveraging Large Language Models (LLMs) to predict Fx parameters directly from textual descriptions without requiring task-specific training or fine-tuning. Our approach addresses the text-to-effect parameter prediction (Text2Fx) task by mapping natural language descriptions to the corresponding Fx parameters for equalization and reverberation. We demonstrate that LLMs can generate Fx parameters in a zero-shot manner that elucidates the relationship between timbre semantics and audio effects in music production. To enhance performance, we introduce three types of in-context examples: audio Digital Signal Processing (DSP) features, DSP function code, and few-shot examples. Our results demonstrate that LLM-based Fx parameter generation outperforms previous optimization approaches, offering competitive performance in translating natural language descriptions to appropriate Fx settings. Furthermore, LLMs can serve as text-driven interfaces for audio production, paving the way for more intuitive and accessible music production tools.
中文: LLM2Fx框架利用大型语言模型将自然语言描述直接转换为均衡和混响的音频效果参数,实现了零样本的文本到效果映射,其性能优于传统优化方法,为音乐制作提供了更直观的交互界面。
English: LLM2Fx is a framework that uses Large Language Models to translate natural language descriptions into audio effect parameters for equalization and reverberation, enabling zero-shot text-to-effect mapping and outperforming prior optimization methods for intuitive music production.

Authors:Xing Cui, Yueying Zou, Zekun Li, Peipei Li, Xinyuan Xu, Xuannan Liu, Huaibo Huang, Ran He
Title: T^2Agent: A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search
Abstract:
Real-world multimodal misinformation often arises from mixed forgery sources, requiring dynamic reasoning and adaptive verification. However, existing methods mainly rely on static pipelines and limited tool usage, limiting their ability to handle such complexity and diversity. To address this challenge, we propose T2Agent, a novel misinformation detection agent that incorporates an extensible toolkit with Monte Carlo Tree Search (MCTS). The toolkit consists of modular tools such as web search, forgery detection, and consistency analysis. Each tool is described using standardized templates, enabling seamless integration and future expansion. To avoid inefficiency from using all tools simultaneously, a Bayesian optimization-based selector is proposed to identify a task-relevant subset. This subset then serves as the action space for MCTS to dynamically collect evidence and perform multi-source verification. To better align MCTS with the multi-source nature of misinformation detection, T2Agent extends traditional MCTS with multi-source verification, which decomposes the task into coordinated subtasks targeting different forgery sources. A dual reward mechanism containing a reasoning trajectory score and a confidence score is further proposed to encourage a balance between exploration across mixed forgery sources and exploitation for more reliable evidence. We conduct ablation studies to confirm the effectiveness of the tree search mechanism and tool usage. Extensive experiments further show that T2Agent consistently outperforms existing baselines on challenging mixed-source multimodal misinformation benchmarks, demonstrating its strong potential as a training-free approach for enhancing detection accuracy. The code will be released.
中文: T2Agent提出了一种自适应虚假信息检测框架,通过可扩展工具包和结合多源验证的蒙特卡洛树搜索,有效应对混合来源的多模态虚假信息,在无需训练的情况下显著超越现有方法。
English: T2Agent introduces an adaptive misinformation detection framework using an extensible toolkit and enhanced Monte Carlo Tree Search with multi-source verification to effectively tackle mixed-source multimodal misinformation, outperforming existing methods without requiring training.

Authors:Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh, Wei-Hsiang Liao, Yuki Mitsufuji
Title: A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?
Abstract:
We introduce the Robust Audio Watermarking Benchmark (RAW-Bench), a benchmark for evaluating deep learning-based audio watermarking methods with standardized and systematic comparisons. To simulate real-world usage, we introduce a comprehensive audio attack pipeline with various distortions such as compression, background noise, and reverberation, along with a diverse test dataset including speech, environmental sounds, and music recordings. Evaluating four existing watermarking methods on RAW-Bench reveals two main insights: (i) neural compression techniques pose the most significant challenge, even when algorithms are trained with such compressions; and (ii) training with audio attacks generally improves robustness, although it is insufficient in some cases. Furthermore, we find that specific distortions, such as polarity inversion, time stretching, or reverb, seriously affect certain methods. The evaluation framework is accessible at github.com/SonyResearch/raw_bench.
中文: 我们推出RAW-Bench这一标准化基准,通过模拟真实场景全面评估基于深度学习的音频水印方法,发现神经压缩技术是主要挑战,且尽管对抗训练能普遍提升鲁棒性,但对极性反转等特定失真仍显不足。
English: We introduce RAW-Bench, a standardized benchmark for evaluating deep learning-based audio watermarking methods through comprehensive real-world simulations, revealing neural compression as the primary challenge and showing that while training with attacks generally improves robustness, it remains insufficient against certain distortions like polarity inversion.

Authors:Varun Jain, Zongwei Wu, Quan Zou, Louis Florentin, Henrik Turbell, Sandeep Siddhartha, Radu Timofte, others
Title: NTIRE 2025 Challenge on Video Quality Enhancement for Video Conferencing: Datasets, Methods and Results
Abstract:
This paper presents a comprehensive review of the 1st Challenge on Video Quality Enhancement for Video Conferencing held at the NTIRE workshop at CVPR 2025, and highlights the problem statement, datasets, proposed solutions, and results. The aim of this challenge was to design a Video Quality Enhancement (VQE) model to enhance video quality in video conferencing scenarios by (a) improving lighting, (b) enhancing colors, (c) reducing noise, and (d) enhancing sharpness - giving a professional studio-like effect. Participants were given a differentiable Video Quality Assessment (VQA) model, training, and test videos. A total of 91 participants registered for the challenge. We received 10 valid submissions that were evaluated in a crowdsourced framework.
中文: 本文综述了NTIRE 2025视频质量增强挑战赛，重点介绍了通过提升光照、色彩、降噪和锐化来优化视频会议质量的方案；共有91名参与者注册，收到的10份有效提交在众包框架下进行了评估。
English: This paper reviews the NTIRE 2025 Video Quality Enhancement Challenge, detailing its objectives to improve lighting, color, noise, and sharpness in video conferencing; 91 participants registered, and 10 valid submissions were evaluated in a crowdsourced framework.

Authors:Sangwoo Park, Matteo Zecchin, Osvaldo Simeone
Title: Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees
Abstract:
Selecting artificial intelligence (AI) models, such as large language models (LLMs), from multiple candidates requires accurate performance estimation. This is ideally achieved through empirical evaluations involving abundant real-world data. However, such evaluations are costly and impractical at scale. To address this challenge, autoevaluation methods leverage synthetic data produced by automated evaluators, such as LLMs-as-judges, reducing variance but potentially introducing bias. Recent approaches have employed semi-supervised prediction-powered inference (PPI) to correct for the bias of autoevaluators. However, the use of autoevaluators may lead in practice to a degradation in sample efficiency compared to conventional methods using only real-world data. In this paper, we propose R-AutoEval+, a novel framework that provides finite-sample reliability guarantees on the model evaluation, while also ensuring an enhanced (or at least no worse) sample efficiency compared to conventional methods. The key innovation of R-AutoEval+ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data, reverting to conventional methods when the autoevaluator is insufficiently accurate. Experiments on the use of LLMs-as-judges for the optimization of quantization settings for the weights of an LLM, and for prompt design in LLMs confirm the reliability and efficiency of R-AutoEval+.
中文:R-AutoEval+框架通过动态调整对合成数据的依赖,在保证样本效率的同时提供可靠的AI模型评估,在LLM优化和提示设计的实验中优于传统方法。
English: The proposed R-AutoEval+ framework dynamically adjusts its reliance on synthetic data to ensure reliable AI model evaluation with guaranteed sample efficiency, outperforming conventional methods in experiments involving LLM optimization and prompt design.
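
Illustrative sketch (not from the paper): the prediction-powered inference idea mentioned in the abstract can be shown with the standard PPI mean estimator, which corrects the autoevaluator's average score on a large unlabeled pool by its average error on a small human-labeled set. This is not R-AutoEval+ itself; the function name, argument names, and toy scores are hypothetical.

```python
from statistics import fmean

def ppi_mean(judge_unlabeled, judge_labeled, human_labeled):
    """Standard PPI mean estimate (illustrative, not R-AutoEval+).

    judge_unlabeled: autoevaluator scores on inputs without human labels
    judge_labeled:   autoevaluator scores on the human-labeled inputs
    human_labeled:   human scores on those same inputs
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("labeled scores must be paired")
    # Rectifier: average gap between human and judge scores.
    rectifier = fmean(h - j for h, j in zip(human_labeled, judge_labeled))
    return fmean(judge_unlabeled) + rectifier

# Toy data (made up): the judge is systematically 0.1 too optimistic.
human = [0.6, 0.8, 0.7, 0.5]
judge_on_labeled = [0.7, 0.9, 0.8, 0.6]
judge_on_pool = [0.75, 0.85, 0.65, 0.95, 0.8]
estimate = ppi_mean(judge_on_pool, judge_on_labeled, human)
```

With these toy scores the raw judge average on the pool is pulled down by the measured optimism of the judge. An adaptive scheme such as the one the abstract describes would additionally decide how much to trust the judge term at all.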

Authors:Jihan Yao, Yushi Hu, Yujie Yi, Bin Han, Shangbin Feng, Guang Yang, Bingbing Wen, Ranjay Krishna, Lucy Lu Wang, Yulia Tsvetkov, Noah A. Smith, Banghua Zhu
Title: MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation
Abstract:
Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.
中文:MMMG是一个全面的多模态生成基准,旨在通过涵盖四种模态组合和49项任务来使自动评估与人类判断一致,验证显示其与人类评估的一致性达94.3%,并揭示了如GPT Image等模型在多模态推理和音频生成方面的改进空间。
English: MMMG is a comprehensive benchmark designed to align automated evaluation with human judgment across multimodal generation tasks, covering four modality combinations and 49 tasks to assess reasoning and controllability, with validation showing 94.3% agreement with human evaluation; benchmarking shows that even strong models such as GPT Image fall short on multimodal reasoning and interleaved generation, and that audio generation has considerable headroom.

Authors:Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, Minxin Du, Haibo Hu
Title: Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Abstract:
Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with task-level metrics like accuracy and perplexity. We demonstrate that these metrics are often misleading, as models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This phenomenon of reversibility suggests that information is merely suppressed, not genuinely erased. To address this critical evaluation gap, we introduce a representation-level analysis framework. Our toolkit comprises PCA-based similarity and shift, centered kernel alignment (CKA), and Fisher information, complemented by a summary metric, the mean PCA distance, to measure representational drift. Applying this framework across six unlearning methods, three data domains, and two LLMs, we identify four distinct forgetting regimes based on their reversibility and catastrophicity. Our analysis reveals that achieving the ideal state of irreversible, non-catastrophic forgetting is exceptionally challenging. By probing the limits of unlearning, we identify a case of seemingly irreversible, targeted forgetting, offering new insights for designing more robust erasure algorithms. Our findings expose a fundamental gap in current evaluation practices and establish a representation-level foundation for trustworthy unlearning.
中文: 当前大语言模型遗忘效果评估依赖的任务级指标存在误导性,因为模型可能只是表面遗忘而信息仍可恢复,这凸显了需要建立表征分析框架来区分可逆与不可逆遗忘的必要性。
English: Current task-level metrics for evaluating unlearning in LLMs can be misleading, as models may only superficially forget information that remains recoverable, prompting the need for a new representation-level analysis framework to distinguish between reversible and irreversible forgetting.
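
Illustrative sketch (not from the paper): one of the representation-level tools named in the abstract, centered kernel alignment, can be computed in its common linear form as the squared Frobenius norm of Y^T X divided by the product of the Frobenius norms of X^T X and Y^T Y, after centering each feature column. The code below assumes two activation matrices whose rows are the same examples before and after unlearning; the helper names and toy activations are hypothetical, and the paper's exact CKA variant may differ.

```python
import math

def _center_columns(m):
    """Subtract each column's mean so features are zero-centered."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def _cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a, b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between activation matrices (rows = same examples).

    Raises ZeroDivisionError if either matrix has no variance at all.
    """
    x, y = _center_columns(x), _center_columns(y)
    num = _cross_frobenius_sq(y, x)
    den = math.sqrt(_cross_frobenius_sq(x, x) * _cross_frobenius_sq(y, y))
    return num / den

# Toy activations (made up): four examples, two features each.
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
after = [[0.9, 0.1], [0.2, 0.8], [1.1, 0.9], [0.0, 0.1]]
similarity = linear_cka(before, after)
```

A value near 1 indicates that the two sets of representations are nearly the same up to rotation and scaling, which is the kind of evidence that would suggest information was suppressed rather than erased.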

Authors:Zihan Chen, Song Wang, Zhen Tan, Jundong Li, Cong Shen
Title: MAPLE: Many-Shot Adaptive Pseudo-Labeling for In-Context Learning
Abstract:
In-Context Learning (ICL) empowers Large Language Models (LLMs) to tackle diverse tasks by incorporating multiple input-output examples, known as demonstrations, into the input of LLMs. More recently, advancements in the expanded context windows of LLMs have led to many-shot ICL, which uses hundreds of demonstrations and outperforms few-shot ICL, which relies on fewer examples. However, this approach is often hindered by the high cost of obtaining large amounts of labeled data. To address this challenge, we propose Many-Shot Adaptive Pseudo-LabEling, namely MAPLE, a novel influence-based many-shot ICL framework that utilizes pseudo-labeled samples to compensate for the lack of label information. We first identify a subset of impactful unlabeled samples and perform pseudo-labeling on them by querying LLMs. These pseudo-labeled samples are then adaptively selected and tailored to each test query as input to improve the performance of many-shot ICL, without significant labeling costs. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework, showcasing its ability to enhance LLM adaptability and performance with limited labeled data.
中文:MAPLE框架通过基于影响力的伪标记技术对未标注样本进行处理,增强了大语言模型的多示例上下文学习能力,从而在有限标注成本下显著提升模型性能。
English: The MAPLE framework enhances many-shot in-context learning by using influence-based pseudo-labeling on selected unlabeled samples, enabling LLMs to achieve improved performance with minimal labeling costs.

Authors:Yuang Ai, Huaibo Huang, Tao Wu, Qihang Fan, Ran He
Title: Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention
Abstract:
Transformer-based models have made remarkable progress in image restoration (IR) tasks. However, the quadratic complexity of self-attention in Transformer hinders its applicability to high-resolution images. Existing methods mitigate this issue with sparse or window-based attention, yet inherently limit global context modeling. Linear attention, a variant of softmax attention, demonstrates promise in global context modeling while maintaining linear complexity, offering a potential solution to the above challenge. Despite its efficiency benefits, vanilla linear attention suffers from a significant performance drop in IR, largely due to the low-rank nature of its attention map. To counter this, we propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution. Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer. LAformer achieves effective global perception by integrating linear attention and channel attention, while also enhancing local fitting capabilities through a convolutional gated feed-forward network. Notably, LAformer eliminates hardware-inefficient operations such as softmax and window shifting, enabling efficient processing of high-resolution images. Extensive experiments across 7 IR tasks and 21 benchmarks demonstrate that LAformer outperforms SOTA methods and offers significant computational advantages.
中文: LAformer是一种高效的图像复原Transformer模型,通过整合秩增强线性注意力和通道注意力实现全局上下文建模,同时利用卷积门控前馈网络增强局部拟合能力,在多个基准测试中超越现有最优方法并显著提升计算效率。
English: LAformer is an efficient Transformer model for image restoration that overcomes the quadratic complexity of self-attention by integrating Rank Enhanced Linear Attention and channel attention for global context modeling, while using a convolutional gated feed-forward network to enhance local capabilities, achieving state-of-the-art performance across multiple benchmarks with significant computational efficiency.
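
Illustrative sketch (not from the paper): the linear-complexity property mentioned in the abstract comes from reordering the attention product, computing phi(K)^T V once and reusing it for every query instead of forming the N x N attention map. The code below is generic kernelized linear attention, not RELA or LAformer; the feature map elu(x) + 1 is a common choice assumed here, and all names and toy values are hypothetical.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1 (a common choice; an assumption here)."""
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(q, k, v):
    """Kernelized attention in O(N * d * d_v) time, without an N x N map."""
    d, dv = len(k[0]), len(v[0])
    kv = [[0.0] * dv for _ in range(d)]   # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d                     # sum_j phi(k_j)
    for kj, vj in zip(k, v):
        fk = [phi(x) for x in kj]
        for a in range(d):
            k_sum[a] += fk[a]
            for b in range(dv):
                kv[a][b] += fk[a] * vj[b]
    out = []
    for qi in q:
        fq = [phi(x) for x in qi]
        norm = sum(fq[a] * k_sum[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(dv)])
    return out

def quadratic_reference(q, k, v):
    """Same result computed through the explicit N x N attention weights."""
    out = []
    for qi in q:
        fq = [phi(x) for x in qi]
        w = [sum(a * phi(b) for a, b in zip(fq, kj)) for kj in k]
        total = sum(w)
        out.append([sum(wj * vj[b] for wj, vj in zip(w, v)) / total
                    for b in range(len(v[0]))])
    return out

# Toy inputs (made up): 2 queries, 3 keys/values, feature size 2.
q = [[0.5, -1.0], [1.5, 0.2]]
k = [[1.0, 0.0], [-0.5, 2.0], [0.3, 0.3]]
v = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
fast = linear_attention(q, k, v)
slow = quadratic_reference(q, k, v)
```

Both functions return the same values; only the order of multiplication differs. Because the implicit attention map factors through d-dimensional features, its rank is at most d, which is one way to see the low-rank limitation the abstract refers to.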

Authors:Yutao Zhu, Jiajie Jin, Hongjin Qian, Zheng Liu, Zhicheng Dou, Ji-Rong Wen
Title: Single LLM, Multiple Roles: A Unified Retrieval-Augmented Generation Framework Using Role-Specific Token Optimization
Abstract:
Existing studies have optimized retrieval-augmented generation (RAG) across various sub-tasks, such as query understanding and retrieval refinement, but integrating these optimizations into a unified framework remains challenging. To tackle this problem, this work proposes RoleRAG, a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization. RoleRAG comprises six modules, each handling a specific sub-task within the RAG process. Additionally, we introduce a query graph to represent the decomposition of the query, which can be dynamically resolved according to the decomposing state. All modules are driven by the same underlying LLM, distinguished by task-specific role tokens that are individually optimized. This design allows RoleRAG to dynamically activate different modules within a single LLM instance, thereby streamlining deployment and reducing resource consumption. Experimental results on five open-domain question-answering datasets demonstrate the effectiveness, generalizability, and flexibility of our framework.
中文: 本文提出RoleRAG统一框架,通过角色特定令牌优化将检索增强生成的多个子任务整合,实现在单一大型语言模型中的高效多任务处理,并在五个开放域问答数据集上验证了其有效性。
English: This paper introduces RoleRAG, a unified framework that integrates multiple retrieval-augmented generation optimizations through role-specific token optimization, enabling efficient multi-task processing within a single LLM while demonstrating strong performance across five question-answering datasets.

Authors:Qianxiong Xu, Lanyun Zhu, Xuanyi Liu, Guosheng Lin, Cheng Long, Ziyue Li, Rui Zhao
Title: Unlocking the Power of SAM 2 for Few-Shot Segmentation
Abstract:
Few-Shot Segmentation (FSS) aims to learn class-agnostic segmentation on few classes to segment arbitrary classes, but at the risk of overfitting. To address this, some methods use the well-learned knowledge of foundation models (e.g., SAM) to simplify the learning process. Recently, SAM 2 has extended SAM by supporting video segmentation, whose class-agnostic matching ability is useful to FSS. A simple idea is to encode support foreground (FG) features as memory, with which query FG features are matched and fused. Unfortunately, the FG objects in different frames of SAM 2's video data are always the same identity, while those in FSS are different identities, i.e., the matching step is incompatible. Therefore, we design Pseudo Prompt Generator to encode pseudo query memory, matching with query features in a compatible way. However, the memories can never be as accurate as the real ones, i.e., they are likely to contain incomplete query FG, and some unexpected query background (BG) features, leading to wrong segmentation. Hence, we further design Iterative Memory Refinement to fuse more query FG features into the memory, and devise a Support-Calibrated Memory Attention to suppress the unexpected query BG features in memory. Extensive experiments have been conducted on PASCAL-5$^i$ and COCO-20$^i$ to validate the effectiveness of our design, e.g., the 1-shot mIoU can be 4.2% better than the best baseline.
中文: 为解决少样本分割与SAM 2匹配机制的不兼容问题,本研究提出伪提示生成器与集成迭代内存优化及支持校准记忆注意力机制,在1-shot mIoU指标上较最佳基线显著提升4.2%的分割精度。
English: To address the incompatibility between Few-Shot Segmentation and SAM 2's matching mechanism, this study introduces a Pseudo Prompt Generator and Iterative Memory Refinement with Support-Calibrated Memory Attention, significantly improving segmentation accuracy by 4.2% in 1-shot mIoU over top baselines.

Authors:Mingrui Chen, Haogeng Liu, Hao Liang, Huaibo Huang, Wentao Zhang, Ran He
Title: Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning
Abstract:
In this work, we investigate how explicitly modeling prior information about problem difficulty shapes the effectiveness of reinforcement learning-based fine-tuning for multimodal reasoning. Our exploration comprises three perspectives. First, through offline data curation, we analyze the U-shaped difficulty distribution of two given datasets using the base model by multi-round sampling, and then filter out prompts that are either too simple or too difficult to provide meaningful gradients before performing the subsequent two-stage training. Second, we implement online advantage differentiation, computing group-wise empirical accuracy as a difficulty proxy to adaptively reweight advantage estimates, providing stronger learning signals for more challenging problems. Finally, we introduce difficulty hints as explicit prompts for more complex samples in the second training stage, encouraging the model to calibrate its reasoning depth and perform reflective validation checks. Our comprehensive approach demonstrates strong performance across various multimodal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.
中文摘要:本研究通过离线数据筛选、在线优势差分和难度提示三重策略,显式建模问题难度信息,有效提升了多模态推理的强化学习微调效果,仅用少量训练数据即在多个数学推理基准上取得显著性能提升。
English Summary: This study demonstrates that explicitly modeling problem difficulty through offline data curation, online advantage differentiation, and difficulty hints significantly enhances reinforcement learning fine-tuning for multimodal reasoning, achieving strong performance with minimal training data.
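
Illustrative sketch (not from the paper): the first two ideas in the abstract, filtering prompts by sampled accuracy and reweighting advantages by a difficulty proxy, can be pictured as follows. The code assumes binary rewards for a group of sampled responses per prompt and group-normalized advantages; the filter thresholds and the weight 1 + alpha * (1 - accuracy) are hypothetical choices, not the authors' formulas.

```python
from statistics import fmean, pstdev

def keep_prompt(rewards, low=0.0, high=1.0):
    """Offline filter: drop prompts the model always or never solves."""
    acc = fmean(rewards)
    return low < acc < high

def reweighted_advantages(rewards, alpha=1.0):
    """Group-normalized advantages scaled up for harder prompts.

    The weight 1 + alpha * (1 - accuracy) is a hypothetical choice.
    """
    acc = fmean(rewards)   # group-wise empirical accuracy as difficulty proxy
    std = pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)  # uniform group: no learning signal
    weight = 1.0 + alpha * (1.0 - acc)
    return [weight * (r - acc) / std for r in rewards]

easy = [1, 1, 1, 0]   # 75% of sampled responses correct
hard = [0, 0, 0, 1]   # 25% of sampled responses correct
adv_easy = reweighted_advantages(easy)
adv_hard = reweighted_advantages(hard)
```

Under this toy weighting, the single correct response on the hard prompt receives a larger advantage than any response on the easy prompt, while prompts with uniform rewards contribute nothing and are removed by the filter.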

Authors:Hao Fang, Kai Huang, Hao Ye, Chongtao Guo, Le Liang, Xiao Li, Shi Jin
Title: Power Allocation for Delay Optimization in Device-to-Device Networks: A Graph Reinforcement Learning Approach
Abstract:
The pursuit of rate maximization in wireless communication frequently encounters substantial challenges associated with user fairness. This paper addresses these challenges by exploring a novel power allocation approach for delay optimization, utilizing graph neural networks (GNNs)-based reinforcement learning (RL) in device-to-device (D2D) communication. The proposed approach incorporates not only channel state information but also factors such as packet delay, the number of backlogged packets, and the number of transmitted packets into the components of the state information. We adopt a centralized RL method, where a central controller collects and processes the state information. The central controller functions as an agent trained using the proximal policy optimization (PPO) algorithm. To better utilize topology information in the communication network and enhance the generalization of the proposed method, we embed GNN layers into both the actor and critic networks of the PPO algorithm. This integration allows for efficient parameter updates of GNNs and enables the state information to be parameterized as a low-dimensional embedding, which is leveraged by the agent to optimize power allocation strategies. Simulation results demonstrate that the proposed method effectively reduces average delay while ensuring user fairness, outperforms baseline methods, and exhibits scalability and generalization capability.
中文摘要:本文提出一种基于图神经网络强化学习的功率分配方法,通过整合信道状态、数据包延迟等状态信息并利用网络拓扑结构,在设备间通信中有效降低平均延迟同时保障用户公平性。
English Summary: This paper introduces a GNN-based reinforcement learning approach for optimizing power allocation in D2D communications, which effectively reduces average delay while maintaining user fairness by incorporating comprehensive state information and network topology.

Authors:Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, Ronghua Li
Title: Does Low Rank Adaptation Lead to Lower Robustness against Training-Time Attacks?
Abstract:
Low rank adaptation (LoRA) has emerged as a prominent technique for fine-tuning large language models (LLMs) thanks to its superb efficiency gains over previous methods. While extensive studies have examined the performance and structural properties of LoRA, its behavior under training-time attacks remains underexplored, posing significant security risks. In this paper, we theoretically investigate the security implications of LoRA's low-rank structure during fine-tuning, in the context of its robustness against data poisoning and backdoor attacks. We propose an analytical framework that models LoRA's training dynamics, employs the neural tangent kernel to simplify the analysis of the training process, and applies information theory to establish connections between LoRA's low-rank structure and its vulnerability to training-time attacks. Our analysis indicates that LoRA exhibits better robustness to backdoor attacks than full fine-tuning, but becomes more vulnerable to untargeted data poisoning due to its over-simplified information geometry. Extensive experimental evaluations have corroborated our theoretical findings.
Chinese: 理论分析与实验证实,LoRA因其简化的信息几何结构,在对抗后门攻击时表现出更强的鲁棒性,但对无目标数据投毒攻击更为脆弱。
English: LoRA demonstrates enhanced robustness against backdoor attacks but increased vulnerability to untargeted data poisoning due to its simplified information geometry, as confirmed by theoretical analysis and experiments.
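
Illustrative sketch (not from the paper): the low-rank structure the analysis refers to is the standard LoRA parameterization, in which a frozen weight W is adapted as W + (alpha / r) * B A with B of shape d_out x r and A of shape r x d_in. The code below only shows this update on toy matrices; it does not model the paper's attack analysis, and the helper names and values are hypothetical.

```python
def matmul(a, b):
    """Plain row-major matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B A, where r is the rank (rows of A)."""
    r = len(a)
    delta = matmul(b, a)          # rank of delta is at most r
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy example (made up): 3 x 3 frozen weight with a rank-1 update.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [[1.0], [0.0], [2.0]]         # d_out x r
a = [[0.5, 0.0, -1.0]]            # r x d_in
w_adapted = lora_effective_weight(w, b, a, alpha=2.0)
trainable = len(b) * len(b[0]) + len(a) * len(a[0])   # entries in B and A
```

Here only 6 entries are trainable instead of the 9 in the full matrix, and every possible update is confined to rank 1, which is the restriction on the update space that the abstract's robustness analysis builds on.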

Authors:Baozhu Huang, Cheng Chen, Xuanhe Hou, Junmin Huang, Zihan Wei, Hongying Luo, Lu Chen, Yongzhi Xu, Hejiao Luo, Changqi Qin, Ziqian Bi, Junhao Song, Tianyang Wang, ChiaXin Liang, Zizhong Yu, Han Wang, Xiaotian Sun, Junfeng Hao, Chunjie Tian
Title: Early Prediction of In-Hospital ICU Mortality Using Innovative First-Day Data: A Review
Abstract:
The intensive care unit (ICU) manages critically ill patients, many of whom face a high risk of mortality. Early and accurate prediction of in-hospital mortality within the first 24 hours of ICU admission is crucial for timely clinical interventions, resource optimization, and improved patient outcomes. Traditional scoring systems, while useful, often have limitations in predictive accuracy and adaptability. Objective: This review aims to systematically evaluate and benchmark innovative methodologies that leverage data available within the first day of ICU admission for predicting in-hospital mortality. We focus on advancements in machine learning, novel biomarker applications, and the integration of diverse data types.
中文: 本综述系统评估了利用ICU入院首日数据改进院内死亡预测的创新方法,重点关注机器学习技术、新型生物标志物应用及多源数据整合,以突破传统评分系统的局限性。
English: This review systematically evaluates innovative methods using first-day ICU data to enhance in-hospital mortality prediction, focusing on machine learning, novel biomarkers, and multi-source data integration to overcome traditional scoring limitations.

Authors:Jia Li, Nan Gao, Huaibo Huang, Ran He
Title: NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation
Abstract:
The diffusion model has provided a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation. Recently, topology and texture control are popular explorations, e.g., ControlNet, IP-Adapter, Ctrl-X, and DSG. These methods explicitly consider high-fidelity controllable editing based on external signals or diffusion feature manipulations. As for diversity, they directly choose different noise latents. However, the diffused noise is capable of implicitly representing the topological and textural manifold of the corresponding image. Moreover, it's an effective workbench to conduct the trade-off between content preservation and controllable variations. Previous T2I and I2I diffusion works do not explore the information within the compressed contextual latent. In this paper, we first propose a plug-and-play noise finetune NOFT module employed by Stable Diffusion to generate highly correlated and diverse images. We fine-tune seed noise or inverse noise through an optimal-transported (OT) information bottleneck (IB) with around only 14K trainable parameters and 10 minutes of training. Our test-time NOFT is good at producing high-fidelity image variations considering topology and texture alignments. Comprehensive experiments demonstrate that NOFT is a powerful general reimagine approach to efficiently fine-tune the 2D/3D AIGC assets with text or image guidance.
Chinese: 本文提出NOFT模块,通过最优传输信息瓶颈微调噪声,在Stable Diffusion中仅需少量参数和训练时间即可生成拓扑纹理对齐的多样化高保真图像,有效优化2D/3D AIGC资源。
English: The paper introduces NOFT, a plug-and-play module that fine-tunes noise in Stable Diffusion using an optimal-transported information bottleneck to efficiently generate diverse, high-fidelity images with aligned topology and texture, requiring minimal parameters and training time.

Authors:Renqi Chen, Haoyang Su, Shixiang Tang, Zhenfei Yin, Qi Wu, Hui Li, Ye Sun, Nanqing Dong, Wanli Ouyang, Philip Torr
Title: AI-Driven Automation Can Become the Foundation of Next-Era Science of Science Research
Abstract:
The Science of Science (SoS) explores the mechanisms underlying scientific discovery, and offers valuable insights for enhancing scientific efficiency and fostering innovation. Traditional approaches often rely on simplistic assumptions and basic statistical tools, such as linear regression and rule-based simulations, which struggle to capture the complexity and scale of modern research ecosystems. The advent of artificial intelligence (AI) presents a transformative opportunity for the next generation of SoS, enabling the automation of large-scale pattern discovery and uncovering insights previously unattainable. This paper offers a forward-looking perspective on the integration of Science of Science with AI for automated research pattern discovery and highlights key open challenges that could greatly benefit from AI. We outline the advantages of AI over traditional methods, discuss potential limitations, and propose pathways to overcome them. Additionally, we present a preliminary multi-agent system as an illustrative example to simulate research societies, showcasing AI's ability to replicate real-world research patterns and accelerate progress in Science of Science research.
Chinese: 本文探讨了人工智能如何通过自动化研究模式发现、克服传统方法局限来革新科学学,并提出了一个模拟科研社会的多智能体系统示例。
English: This paper discusses how artificial intelligence can revolutionize the Science of Science by enabling automated discovery of research patterns, overcoming the limitations of traditional methods, and presents a multi-agent system to simulate research societies.

Authors:Ruikun Li, Yan Lu, Shixiang Tang, Biqing Qi, Wanli Ouyang
Title: MLLM-based Discovery of Intrinsic Coordinates and Governing Equations from High-Dimensional Data
Abstract:
Discovering governing equations from scientific data is crucial for understanding the evolution of systems, and is typically framed as a search problem within a candidate equation space. However, the high-dimensional nature of dynamical systems leads to an exponentially expanding equation space, making the search process extremely challenging. The visual perception and pre-trained scientific knowledge of multimodal large language models (MLLM) hold promise for providing effective navigation in high-dimensional equation spaces. In this paper, we propose a zero-shot method based on MLLM for automatically discovering physical coordinates and governing equations from high-dimensional data. Specifically, we design a series of enhanced visual prompts for MLLM to enhance its spatial perception. In addition, MLLM's domain knowledge is employed to navigate the search process within the equation space. Quantitative and qualitative evaluations on two representative types of systems demonstrate that the proposed method effectively discovers the physical coordinates and equations from both simulated and real experimental data, with long-term extrapolation accuracy improved by approximately 26.96% compared to the baseline.
中文: 本文提出一种基于多模态大语言模型的零样本方法,通过增强视觉提示和领域知识从高维数据中自动发现物理坐标与支配方程,在长期外推精度上较基线提升约26.96%。
English: This paper introduces a zero-shot method using multimodal large language models (MLLMs) to automatically discover physical coordinates and governing equations from high-dimensional data, using enhanced visual prompts and the model's domain knowledge to guide the equation search and improving long-term extrapolation accuracy by approximately 26.96% over the baseline.

Authors:Xuannan Liu, Zekun Li, Zheqi He, Peipei Li, Shuhan Xia, Xing Cui, Huaibo Huang, Xi Yang, Ran He
Title: Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
Abstract:
The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
中文摘要:Video-SafetyBench作为首个全面评估大型视觉语言模型在视频文本攻击下安全性的基准,通过2264个测试样本和新颖评估指标,揭示了模型在视频诱导攻击中的显著脆弱性。
English Summary: Video-SafetyBench is introduced as the first comprehensive benchmark to evaluate Large Vision-Language Models' safety against video-text attacks, revealing significant vulnerabilities through 2,264 test pairs and a novel evaluation metric.

Authors:Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, György Fazekas
Title: Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
Abstract:
Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which can lead to unrealistic or biased results. We address this pitfall by introducing a Gaussian prior derived from a vocal preset dataset, DiffVox, over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations on vocal effects transfer on the MedleyDB dataset show significant improvements across metrics compared to baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The proposed calibration reduces parameter mean squared error by up to 33% and matches the reference style better. Subjective evaluations with 16 participants confirm our method's superiority, especially in limited data regimes. This work demonstrates how incorporating prior knowledge in inference time enhances audio effects transfer, paving the way for more effective and realistic audio processing systems.
中文: 本研究引入基于DiffVox数据集的先验高斯分布来改进ST-ITO音频风格迁移方法,通过降低参数误差和更准确匹配参考风格,在各项指标上显著超越了基线方法。
English: The study introduces a Gaussian prior from the DiffVox dataset to enhance ST-ITO for audio effects transfer, significantly improving realism and performance over baselines by reducing parameter error and better matching reference styles.
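
To make the maximum-a-posteriori reading concrete, the objective can be written out as below. This is a sketch under stated assumptions, not the paper's exact formulation: the symbols $g$ (effects chain applied to raw audio $x$ with parameters $\theta$), $f$ (style encoder), $d$ (embedding distance), $y$ (reference audio), $\alpha$ (likelihood scale), and the prior mean and covariance $\mu, \Sigma$ are notation introduced here, and the likelihood is assumed to be proportional to $\exp(-\alpha d)$.

```latex
\begin{align}
\hat{\theta}_{\mathrm{ITO}} &= \arg\min_{\theta}\; d\big(f(g(x;\theta)),\, f(y)\big), \\
\hat{\theta}_{\mathrm{MAP}} &= \arg\max_{\theta}\; \big[\log p(y \mid x, \theta) + \log \mathcal{N}(\theta;\, \mu, \Sigma)\big] \\
&= \arg\min_{\theta}\; \Big[\alpha\, d\big(f(g(x;\theta)),\, f(y)\big) + \tfrac{1}{2}(\theta-\mu)^{\top}\Sigma^{-1}(\theta-\mu)\Big].
\end{align}
```

Under these assumptions the prior acts as a quadratic penalty that pulls the estimate toward parameter settings that are common in the preset data, which is one way to read the reported reduction in unrealistic results.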

Authors:Luca Luceri, Tanishq Vijay Salkar, Ashwin Balasubramanian, Gabriela Pinto, Chenning Sun, Emilio Ferrara
Title: Coordinated Inauthentic Behavior on TikTok: Challenges and Opportunities for Detection in a Video-First Ecosystem
Abstract:
Detecting coordinated inauthentic behavior (CIB) is central to the study of online influence operations. However, most methods focus on text-centric platforms, leaving video-first ecosystems like TikTok largely unexplored. To address this gap, we develop and evaluate a computational framework for detecting CIB on TikTok, leveraging a network-based approach adapted to the platform's unique content and interaction structures. Building on existing approaches, we construct user similarity networks based on shared behaviors, including synchronized posting, repeated use of similar captions, multimedia content reuse, and hashtag sequence overlap, and apply graph pruning techniques to identify dense networks of likely coordinated accounts. Analyzing a dataset of 793K TikTok videos related to the 2024 U.S. Presidential Election, we uncover a range of coordinated activities, from synchronized amplification of political narratives to semi-automated content replication using AI-generated voiceovers and split-screen video formats. Our findings show that while traditional coordination indicators generalize well to TikTok, other signals, such as those based on textual similarity of video transcripts or Duet and Stitch interactions, prove ineffective, highlighting the platform's distinct content norms and interaction mechanics. This work provides the first empirical foundation for studying and detecting CIB on TikTok, paving the way for future research into influence operations in short-form video platforms.
中文: 本研究通过分析用户相似性网络,开发了一个检测TikTok上协同虚假行为的计算框架,揭示了传统指标的有效性及平台特有的内容和互动模式挑战。
English: This study introduces a computational framework for detecting coordinated inauthentic behavior on TikTok by analyzing user similarity networks, revealing both effective traditional indicators and platform-specific challenges in content and interaction patterns.
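
The similarity-network idea can be illustrated with a minimal sketch. It assumes a simplification that is not taken from the paper: similarity is the Jaccard overlap of each user's hashtag set (the paper uses several behavioral traces, including hashtag sequences), pruning is a single fixed edge-weight threshold, and all names and data below are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(user_tags, threshold):
    """Return {(u, v): weight} for user pairs whose similarity meets the threshold."""
    edges = {}
    for u, v in combinations(sorted(user_tags), 2):
        weight = jaccard(user_tags[u], user_tags[v])
        if weight >= threshold:
            edges[(u, v)] = weight
    return edges


def components(edges):
    """Group users linked by retained edges into connected components."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical toy data: user id -> set of hashtags used.
USER_TAGS = {
    "a": {"vote", "election", "usa"},
    "b": {"vote", "election", "usa"},
    "c": {"vote", "election", "news"},
    "d": {"cats", "cooking"},
}
EDGES = similarity_edges(USER_TAGS, threshold=0.5)
CLUSTERS = components(EDGES)
```

Raising the threshold prunes weaker edges first, so only the most tightly overlapping accounts remain connected; a real pipeline would pick the threshold from the observed weight distribution rather than a fixed constant.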

Authors:Ruichu Cai, Kaitao Zheng, Junxian Huang, Zijian Li, Zhengming Chen, Boyan Xu, Zhifeng Hao
Title: Causal View of Time Series Imputation: Some Identification Results on Missing Mechanism
Abstract:
Time series imputation is one of the most challenging problems and has broad applications in various fields like health care and the Internet of Things. Existing methods mainly aim to model the temporally latent dependencies and the generation process from the observed time series data. In real-world scenarios, different types of missing mechanisms, like MAR (Missing At Random) and MNAR (Missing Not At Random), can occur in time series data. However, existing methods often overlook the differences among these missing mechanisms and use a single model for time series imputation, which can easily lead to misleading results due to mechanism mismatch. In this paper, we propose a framework for the time series imputation problem by exploring Different Missing Mechanisms (DMM for short) and tailoring solutions accordingly. Specifically, we first analyze the data generation processes with temporal latent states and missing cause variables for different mechanisms. Subsequently, we model these generation processes via variational inference and estimate prior distributions of latent variables via a normalizing flow-based neural architecture. Furthermore, we establish identifiability results under the nonlinear independent component analysis framework to show that latent variables are identifiable. Experimental results show that our method surpasses existing time series imputation techniques across various datasets with different missing mechanisms, demonstrating its effectiveness in real-world applications.
中文摘要:本文提出了一种针对不同缺失机制(如随机缺失和非随机缺失)的时间序列填补框架,通过变分推理和基于标准化流的神经网络架构分别建模,实验证明该方法在多种数据集上优于现有技术。
English Summary: The paper introduces a framework for time series imputation that addresses different missing mechanisms (MAR and MNAR) through variational inference and normalizing flow-based architecture, outperforming existing methods by tailoring solutions to specific missing data patterns.
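
The distinction between the two mechanisms can be shown with a small simulation. This is an illustrative sketch only, not the paper's generative model: the function names, the cutoff rule, and the toy data are assumptions introduced here. Under MAR the chance of a gap depends on a fully observed covariate; under MNAR it depends on the value that goes missing.

```python
import random


def mar_mask(covariate, cutoff, prob, rng):
    """MAR: entry i may be missing only when the observed covariate[i] exceeds cutoff."""
    return [c > cutoff and rng.random() < prob for c in covariate]


def mnar_mask(values, cutoff, prob, rng):
    """MNAR: entry i may be missing only when the value itself exceeds cutoff."""
    return [v > cutoff and rng.random() < prob for v in values]


def apply_mask(values, mask):
    """Replace masked entries with None to mark them as missing."""
    return [None if missing else v for v, missing in zip(values, mask)]


# Hypothetical toy series and an always-observed covariate.
rng = random.Random(0)
values = [1, 5, 2, 8]
covariate = [10, 0, 10, 0]

# prob=1.0 makes the masks deterministic, which keeps the contrast easy to see.
mar_series = apply_mask(values, mar_mask(covariate, cutoff=5, prob=1.0, rng=rng))
mnar_series = apply_mask(values, mnar_mask(values, cutoff=4, prob=1.0, rng=rng))
```

In the MNAR series the large values are exactly the ones that disappear, so the observed data alone understate the true level; that is the kind of mismatch a single mechanism-agnostic imputer can get wrong.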

Authors:Yuxuan Zheng, Yihe Zhou, Feiyang Xu, Mingli Song, Shunyu Liu
Title: Bi-level Mean Field: Dynamic Grouping for Large-Scale MARL
Abstract:
Large-scale Multi-Agent Reinforcement Learning (MARL) often suffers from the curse of dimensionality, as the exponential growth in agent interactions significantly increases computational complexity and impedes learning efficiency. To mitigate this, existing efforts that rely on Mean Field (MF) simplify the interaction landscape by approximating neighboring agents as a single mean agent, thus reducing overall complexity to pairwise interactions. However, these MF methods inevitably fail to account for individual differences, leading to aggregation noise caused by inaccurate iterative updates during MF learning. In this paper, we propose a Bi-level Mean Field (BMF) method to capture agent diversity with dynamic grouping in large-scale MARL, which can alleviate aggregation noise via bi-level interaction. Specifically, BMF introduces a dynamic group assignment module, which employs a Variational AutoEncoder (VAE) to learn the representations of agents, facilitating their dynamic grouping over time. Furthermore, we propose a bi-level interaction module to model both inter- and intra-group interactions for effective neighboring aggregation. Experiments across various tasks demonstrate that the proposed BMF yields results superior to the state-of-the-art methods.
中文:提出的双层平均场方法通过变分自编码器实现智能体动态分组并建模双层交互,有效缓解大规模多智能体强化学习中的聚合噪声,实验表明其性能优于现有最优方法。
English: The proposed Bi-level Mean Field method addresses aggregation noise in large-scale MARL by dynamically grouping agents using VAE representations and modeling bi-level interactions, outperforming existing approaches in experiments.

Authors:Xingchen Li, LiDian Wang, Yu Sheng, ZhiPeng Tang, Haojie Ren, Guoliang You, YiFan Duan, Jianmin Ji, Yanyong Zhang
Title: ElectricSight: 3D Hazard Monitoring for Power Lines Using Low-Cost Sensors
Abstract:
Protecting power transmission lines from potential hazards involves critical tasks, one of which is the accurate measurement of distances between power lines and potential threats, such as large cranes. The difficulty is that current sensor-based methods struggle to balance accuracy and cost in distance measurement. A common practice is to install cameras on transmission towers, which, however, struggle to measure true 3D distances due to the lack of depth information. Although 3D lasers can provide accurate depth data, their high cost makes large-scale deployment impractical. To address this challenge, we present ElectricSight, a system designed for 3D distance measurement and monitoring of potential hazards to power transmission lines. This work's key innovations lie in both the overall system framework and a monocular depth estimation method. Specifically, the system framework combines real-time images with environmental point cloud priors, enabling cost-effective and precise 3D distance measurements. As a core component of the system, the monocular depth estimation method enhances the performance by integrating 3D point cloud data into image-based estimates, improving both the accuracy and reliability of the system. To assess ElectricSight's performance, we conducted tests with data from a real-world power transmission scenario. The experimental results demonstrate that ElectricSight achieves an average accuracy of 1.08 m for distance measurements and an early warning accuracy of 92%.
中文: ElectricSight系统通过融合实时图像与环境点云数据,实现了对电力传输线路潜在危险的经济高效三维距离监测,其测量平均精度达1.08米,预警准确率达92%。
English: ElectricSight is a cost-effective system that combines real-time images with environmental point clouds to enable accurate 3D distance measurement and hazard monitoring for power transmission lines, achieving 1.08 m average accuracy and 92% early warning accuracy.

Authors:Haoxiang Luo, Gang Sun, Yinqiu Liu, Dongcheng Zhao, Dusit Niyato, Hongfang Yu, Schahram Dustdar
Title: A Weighted Byzantine Fault Tolerance Consensus Driven Trusted Multiple Large Language Models Network
Abstract:
Large Language Models (LLMs) have achieved remarkable success across a wide range of applications. However, individual LLMs often produce inconsistent, biased, or hallucinated outputs due to limitations in their training corpora and model architectures. Recently, collaborative frameworks such as the Multi-LLM Network (MultiLLMN) have been introduced, enabling multiple LLMs to interact and jointly respond to user queries. Nevertheless, MultiLLMN architectures raise critical concerns regarding the reliability and security of the generated content, particularly in open environments where malicious or compromised LLMs may be present. Moreover, reliance on centralized coordination undermines system efficiency and introduces single points of failure. In this paper, we propose a novel Trusted MultiLLMN framework, driven by a Weighted Byzantine Fault Tolerance (WBFT) blockchain consensus mechanism, to ensure the reliability, security, and efficiency of multi-LLM collaboration. In WBFT, voting weights are adaptively assigned to each LLM based on its response quality and trustworthiness, incentivizing reliable behavior, and reducing the impact of malicious nodes. Extensive simulations demonstrate that WBFT significantly improves both consensus security and efficiency compared to classical and modern consensus mechanisms, particularly under wireless network conditions. Furthermore, our evaluations reveal that Trusted MultiLLMN supported by WBFT can deliver higher-quality and more credible responses than both single LLMs and conventional MultiLLMNs, thereby providing a promising path toward building robust, decentralized AI collaboration networks.
Chinese: 本文提出了一种可信多LLM网络框架,采用加权拜占庭容错区块链共识机制,通过根据可信度和响应质量自适应分配权重,提升了多LLM协作的可靠性、安全性和效率。
English: The paper introduces a Trusted MultiLLMN framework using a Weighted Byzantine Fault Tolerance blockchain consensus to enhance the reliability, security, and efficiency of multi-LLM collaboration by adaptively weighting LLMs based on trustworthiness and response quality.
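
A weighted vote of this kind can be sketched as follows. The code is a simplified illustration under assumptions that do not come from the paper: the weights are supplied as given (the paper derives them adaptively from response quality and trustworthiness), a decision requires strictly more than two thirds of the total weight, and the function and node names are hypothetical.

```python
from collections import defaultdict


def weighted_decision(votes, weights, quorum=2 / 3):
    """Return the choice backed by more than `quorum` of the total weight, else None.

    votes   -- mapping of node id to the response that node voted for
    weights -- mapping of node id to its non-negative voting weight
    """
    if not votes:
        return None
    total = sum(weights[node] for node in votes)
    tally = defaultdict(float)
    for node, choice in votes.items():
        tally[choice] += weights[node]
    best = max(tally, key=tally.get)
    return best if tally[best] > quorum * total else None


# Hypothetical round: four LLM nodes vote on which candidate response to accept.
votes = {"n1": "A", "n2": "A", "n3": "B", "n4": "A"}
equal = {"n1": 1, "n2": 1, "n3": 1, "n4": 1}
skewed = {"n1": 1, "n2": 1, "n3": 3, "n4": 1}

accepted = weighted_decision(votes, equal)
stalled = weighted_decision(votes, skewed)
```

With equal weights, response A holds three quarters of the weight and is accepted. In the skewed case neither response clears the threshold, so the round yields no decision, which shows why the weighting rule has to keep untrusted nodes from accumulating weight.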

Authors:Xiaoyu Xu, Minxin Du, Qingqing Ye, Haibo Hu
Title: OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models
Abstract:
Large language models (LLMs) trained over extensive corpora risk memorizing sensitive, copyrighted, or toxic content. To address this, we propose OBLIVIATE, a robust unlearning framework that removes targeted data while preserving model utility. The framework follows a structured process: extracting target tokens, building retain sets, and fine-tuning with a tailored loss function comprising three components: masking, distillation, and world fact. Using low-rank adapters (LoRA) ensures efficiency without compromising unlearning quality. We conduct experiments on multiple datasets, including Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics: forget quality (via a new document-level memorization score), model utility, and fluency. Results demonstrate its effectiveness in resisting membership inference attacks, minimizing the impact on retained data, and maintaining robustness across diverse scenarios.
中文:OBLIVIATE是一个强大的遗忘框架,能有效从大语言模型中移除目标敏感数据并保持模型实用性,实验证明其能成功抵御攻击并在多种数据集中维持性能。
English: OBLIVIATE is a robust unlearning framework that effectively removes targeted sensitive data from large language models while preserving model utility, as demonstrated by its success in resisting attacks and maintaining performance across various datasets.

Authors:Haoxiang Luo, Gang Sun, Yinqiu Liu, Dusit Niyato, Hongfang Yu, Mohammed Atiquzzaman, Schahram Dustdar
Title: A Trustworthy Multi-LLM Network: Challenges, Solutions, and A Use Case
Abstract:
Large Language Models (LLMs) demonstrate strong potential across a variety of tasks in communications and networking due to their advanced reasoning capabilities. However, because different LLMs have different model structures and are trained using distinct corpora and methods, they may offer varying optimization strategies for the same network issues. Moreover, the limitations of an individual LLM's training data, aggravated by the potential maliciousness of its hosting device, can result in responses with low confidence or even bias. To address these challenges, we propose a blockchain-enabled collaborative framework that connects multiple LLMs into a Trustworthy Multi-LLM Network (MultiLLMN). This architecture enables the cooperative evaluation and selection of the most reliable and high-quality responses to complex network optimization problems. Specifically, we begin by reviewing related work and highlighting the limitations of existing LLMs in collaboration and trust, emphasizing the need for trustworthiness in LLM-based systems. We then introduce the workflow and design of the proposed Trustworthy MultiLLMN framework. Given the severity of False Base Station (FBS) attacks in B5G and 6G communication systems and the difficulty of addressing such threats through traditional modeling techniques, we present FBS defense as a case study to empirically validate the effectiveness of our approach. Finally, we outline promising future research directions in this emerging area.
中文: 大语言模型在通信与网络领域潜力显著,但存在可靠性与偏见问题,为此提出基于区块链的可信多模型协作框架MultiLLMN,通过伪基站防御案例验证其提升响应质量的有效性,并展望未来研究方向。
English: Large Language Models (LLMs) show promise in communications and networking but face challenges in reliability and bias, leading to the proposal of a blockchain-based collaborative framework called Trustworthy MultiLLMN to ensure high-quality responses, validated through a case study on False Base Station defense in advanced networks.

Authors:Chenxi Liu, Shaowen Zhou, Qianxiong Xu, Hao Miao, Cheng Long, Ziyue Li, Rui Zhao
Title: Towards Cross-Modality Modeling for Time Series Analytics: A Survey in the LLM Era
Abstract:
The proliferation of edge devices has generated an unprecedented volume of time series data across different domains, motivating various well-customized methods. Recently, Large Language Models (LLMs) have emerged as a new paradigm for time series analytics by leveraging the shared sequential nature of textual data and time series. However, a fundamental cross-modality gap between time series and LLMs exists, as LLMs are pre-trained on textual corpora and are not inherently optimized for time series. Many recent proposals are designed to address this issue. In this survey, we provide an up-to-date overview of LLMs-based cross-modality modeling for time series analytics. We first introduce a taxonomy that classifies existing approaches into four groups based on the type of textual data employed for time series modeling. We then summarize key cross-modality strategies, e.g., alignment and fusion, and discuss their applications across a range of downstream tasks. Furthermore, we conduct experiments on multimodal datasets from different application domains to investigate effective combinations of textual data and cross-modality strategies for enhancing time series analytics. Finally, we suggest several promising directions for future research. This survey is designed for a range of professionals, researchers, and practitioners interested in LLM-based time series modeling.
中文摘要:本综述系统梳理了基于大语言模型的时间序列跨模态分析方法,通过分类现有策略、实验验证及未来展望,为解决时序数据与文本模态差异提供了最新研究框架。
English Summary: This survey provides an up-to-date overview of Large Language Models (LLMs) for time series analytics, addressing the cross-modality gap through classification of approaches, analysis of key strategies, experimental validation, and future research directions.

Authors:Chenxi Liu, Hao Miao, Qianxiong Xu, Shaowen Zhou, Cheng Long, Yan Zhao, Ziyue Li, Rui Zhao
Title: Efficient Multivariate Time Series Forecasting via Calibrated Language Models with Privileged Knowledge Distillation
Abstract:
Multivariate time series forecasting (MTSF) endeavors to predict future observations given historical data, playing a crucial role in time series data management systems. With advancements in large language models (LLMs), recent studies employ textual prompt tuning to infuse the knowledge of LLMs into MTSF. However, the deployment of LLMs often suffers from low efficiency during the inference phase. To address this problem, we introduce TimeKD, an efficient MTSF framework that leverages the calibrated language models and privileged knowledge distillation. TimeKD aims to generate high-quality future representations from the proposed cross-modality teacher model and cultivate an effective student model. The cross-modality teacher model adopts calibrated language models (CLMs) with ground truth prompts, motivated by the paradigm of Learning Under Privileged Information (LUPI). In addition, we design a subtractive cross attention (SCA) mechanism to refine these representations. To cultivate an effective student model, we propose an innovative privileged knowledge distillation (PKD) mechanism including correlation and feature distillation. PKD enables the student to replicate the teacher's behavior while minimizing their output discrepancy. Extensive experiments on real data offer insight into the effectiveness, efficiency, and scalability of the proposed TimeKD.
中文摘要:TimeKD是一种高效的多元时间序列预测框架,通过校准语言模型和特权知识蒸馏技术,在提升预测精度的同时解决了大型语言模型推理效率低下的问题。
English Summary: TimeKD is an efficient multivariate time series forecasting framework that uses calibrated language models and privileged knowledge distillation to enhance prediction accuracy while addressing the low inference efficiency of large language models.

Authors:Jie Yang, Yuwen Wang, Kaixuan Chen, Tongya Zheng, Yihe Zhou, Zhenbang Xiao, Ji Cao, Mingli Song, Shunyu Liu
Title: From GNNs to Trees: Multi-Granular Interpretability for Graph Neural Networks
Abstract:
Interpretable Graph Neural Networks (GNNs) aim to reveal the underlying reasoning behind model predictions, attributing their decisions to specific subgraphs that are informative. However, existing subgraph-based interpretable methods suffer from an overemphasis on local structure, potentially overlooking long-range dependencies within the entire graphs. Although recent efforts that rely on graph coarsening have proven beneficial for global interpretability, they inevitably reduce the graphs to a fixed granularity. Such an inflexible way can only capture graph connectivity at a specific level, whereas real-world graph tasks often exhibit relationships at varying granularities (e.g., relevant interactions in proteins span from functional groups, to amino acids, and up to protein domains). In this paper, we introduce a novel Tree-like Interpretable Framework (TIF) for graph classification, where plain GNNs are transformed into hierarchical trees, with each level featuring coarsened graphs of different granularity as tree nodes. Specifically, TIF iteratively adopts a graph coarsening module to compress original graphs (i.e., root nodes of trees) into increasingly coarser ones (i.e., child nodes of trees), while preserving diversity among tree nodes within different branches through a dedicated graph perturbation module. Finally, we propose an adaptive routing module to identify the most informative root-to-leaf paths, providing not only the final prediction but also the multi-granular interpretability for the decision-making process. Extensive experiments on the graph classification benchmarks with both synthetic and real-world datasets demonstrate the superiority of TIF in interpretability, while also delivering a competitive prediction performance akin to the state-of-the-art counterparts.
中文: 本文提出了一种树状可解释框架(TIF),通过将图神经网络转换为具有多粒度图粗化结构的层次树,能够自适应识别关键路径,在保持优异分类性能的同时实现决策过程的多粒度可解释性。
English: This paper introduces a Tree-like Interpretable Framework (TIF) that transforms GNNs into hierarchical trees with multi-granular graph coarsening, enabling adaptive identification of informative paths for both accurate predictions and multi-level interpretability in graph classification tasks.

Authors:Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen
Title: CodeV-R1: Reasoning-Enhanced Verilog Generation
Abstract:
Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing prior state-of-the-art by 12~20%, while matching or even exceeding the performance of 671B DeepSeek-R1. We will release our model, training pipeline, and dataset to facilitate research in EDA and LLM communities.
中文摘要:CodeV-R1框架通过可验证奖励的强化学习训练大语言模型,利用测试平台生成器、往返数据合成和两阶段训练流程解决验证环境和数据稀缺的挑战,在Verilog代码生成任务中实现了超越现有最优模型的性能表现。
English Summary: The CodeV-R1 framework trains large language models using reinforcement learning with verifiable reward to generate Verilog code from natural language, overcoming verification and data scarcity challenges through a testbench generator, round-trip data synthesis, and a two-stage training pipeline, achieving state-of-the-art performance on benchmarks.
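
For readers unfamiliar with the pass@1 figures quoted above, the sketch below shows the widely used unbiased pass@k estimator. The abstract does not say how its scores were computed, so treating this estimator as the one used is an assumption; the function name and the sample counts are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples passes.

    n -- number of samples generated for the problem
    c -- number of those samples that pass verification
    k -- evaluation budget (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 generations for one problem, 3 of them pass.
single_problem = pass_at_k(n=10, c=3, k=1)
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n; a benchmark score is then the mean of this value over all problems.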

Authors:Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen
Title: QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation
Abstract:
Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing prior state-of-the-art by 12~20%, while even exceeding the performance of 671B DeepSeek-R1 on RTLLM. We have released our model, training code, and dataset to facilitate research in EDA and LLM communities.
中文摘要:CodeV-R1框架通过可验证奖励的强化学习训练大语言模型,利用测试平台生成器、往返数据合成和两阶段训练流程解决验证环境和数据稀缺的挑战,在Verilog代码生成任务中实现了超越现有最优模型的性能表现。
English Summary: The CodeV-R1 framework trains large language models using reinforcement learning with verifiable reward to generate Verilog code from natural language, overcoming verification and data scarcity challenges through a testbench generator, round-trip data synthesis, and a two-stage training pipeline, achieving state-of-the-art performance on benchmarks.

Authors:Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang
Title: MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Abstract:
Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering valuable insights for advancing multi-image spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench .
中文:MMSI-Bench作为专为评估多图像空间智能设计的新基准,揭示了当前多模态大语言模型与人类表现之间的巨大差距,并通过系统化错误分析为后续研究提供了明确改进方向。
English: MMSI-Bench is a new benchmark designed to evaluate multi-image spatial reasoning in multimodal large language models, revealing a significant performance gap between current models and human capabilities while providing detailed error analysis for future improvements.

Authors:Aravind R. Krishnan, Thomas Z. Li, Lucas W. Remedios, Michael E. Kim, Chenyu Gao, Gaurav Rudravaram, Elyssa M. McMaster, Adam M. Saunders, Shunxing Bao, Kaiwen Xu, Lianrui Zuo, Kim L. Sandler, Fabien Maldonado, Yuankai Huo, Bennett A. Landman
Title: Multipath cycleGAN for harmonization of paired and unpaired low-dose lung computed tomography reconstruction kernels
Abstract:
Reconstruction kernels in computed tomography (CT) affect spatial resolution and noise characteristics, introducing systematic variability in quantitative imaging measurements such as emphysema quantification. Choosing an appropriate kernel is therefore essential for consistent quantitative analysis. We propose a multipath cycleGAN model for CT kernel harmonization, trained on a mixture of paired and unpaired data from a low-dose lung cancer screening cohort. The model features domain-specific encoders and decoders with a shared latent space and uses discriminators tailored for each domain. We train the model on 42 kernel combinations using 100 scans each from seven representative kernels in the National Lung Screening Trial (NLST) dataset. To evaluate performance, 240 scans from each kernel are harmonized to a reference soft kernel, and emphysema is quantified before and after harmonization. A general linear model assesses the impact of age, sex, smoking status, and kernel on emphysema. We also evaluate harmonization from soft kernels to a reference hard kernel. To assess anatomical consistency, we compare segmentations of lung vessels, muscle, and subcutaneous adipose tissue generated by TotalSegmentator between harmonized and original images. Our model is benchmarked against traditional and switchable cycleGANs. For paired kernels, our approach reduces bias in emphysema scores, as seen in Bland-Altman plots (p<0.05). For unpaired kernels, harmonization eliminates confounding differences in emphysema (p>0.05). High Dice scores confirm preservation of muscle and fat anatomy, while lung vessel overlap remains reasonable. Overall, our shared latent space multipath cycleGAN enables robust harmonization across paired and unpaired CT kernels, improving emphysema quantification and preserving anatomical fidelity.
中文: 本研究提出的多路径循环生成对抗网络模型能够有效协调CT重建核,在降低肺气肿定量偏差的同时,保持配对与非配对核场景下的解剖结构保真度。
English: The proposed multipath cycleGAN model effectively harmonizes CT reconstruction kernels, reducing bias in emphysema quantification while maintaining anatomical fidelity across both paired and unpaired kernel scenarios.
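
The figure of 42 kernel combinations is consistent with counting ordered source-to-target pairs among seven kernels (7 × 6 = 42), and anatomical consistency is reported through Dice scores. The following is a minimal Python sketch of both calculations, assuming ordered pairs are what the abstract counts; the kernel labels and the toy masks are hypothetical placeholders, not the NLST kernel names or real segmentations.

```python
from itertools import permutations

# Hypothetical kernel labels; the abstract only states that there are seven.
kernels = [f"kernel_{i}" for i in range(7)]

# Ordered (source, target) pairs with source != target.
pairs = list(permutations(kernels, 2))
num_pairs = len(pairs)  # 7 * 6 = 42


def dice(mask_a, mask_b):
    """Dice coefficient between two binary masks given as sets of voxel indices."""
    if not mask_a and not mask_b:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))


# Toy masks: voxel indices labeled as muscle before and after harmonization.
original = {1, 2, 3, 4}
harmonized = {2, 3, 4, 5}
score = dice(original, harmonized)  # 2 * 3 / (4 + 4) = 0.75
```

A Dice score of 1.0 means the two segmentations coincide exactly, and 0.0 means they share no voxels.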

Authors:Peiliang Gong, Yucheng Wang, Min Wu, Zhenghua Chen, Xiaoli Li, Daoqiang Zhang
Title: Temporal Restoration and Spatial Rewiring for Source-Free Multivariate Time Series Domain Adaptation
Abstract:
Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained model from an annotated source domain to an unlabelled target domain without accessing the source data, thereby preserving data privacy. While existing SFDA methods have proven effective in reducing reliance on source data, they struggle to perform well on multivariate time series (MTS) due to their failure to consider the intrinsic spatial correlations inherent in MTS data. These spatial correlations are crucial for accurately representing MTS data and preserving invariant information across domains. To address this challenge, we propose Temporal Restoration and Spatial Rewiring (TERSE), a novel and concise SFDA method tailored for MTS data. Specifically, TERSE comprises a customized spatial-temporal feature encoder designed to capture the underlying spatial-temporal characteristics, coupled with both temporal restoration and spatial rewiring tasks to reinstate latent representations of the temporally masked time series and the spatially masked correlated structures. During the target adaptation phase, the target encoder is guided to produce spatially and temporally consistent features with the source domain by leveraging the source pre-trained temporal restoration and spatial rewiring networks. Therefore, TERSE can effectively model and transfer spatial-temporal dependencies across domains, facilitating implicit feature alignment. In addition, as the first approach to simultaneously consider spatial-temporal consistency in MTS-SFDA, TERSE can also be integrated as a versatile plug-and-play module into established SFDA methods. Extensive experiments on three real-world time series datasets demonstrate the effectiveness and versatility of our approach.
中文摘要:本文提出TERSE方法,针对多元时间序列的无源域自适应问题,通过时间恢复和空间重连机制保持跨领域的时空一致性,解决了现有方法忽略空间相关性的局限。
English Summary: The paper introduces TERSE, a novel Source-Free Domain Adaptation method designed for multivariate time series data that addresses the limitations of existing approaches by incorporating temporal restoration and spatial rewiring to maintain spatial-temporal consistency across domains.

Authors:Yifan Li, Yuhang Chen, Anh Dao, Lichi Li, Zhongyi Cai, Zhen Tan, Tianlong Chen, Yu Kong
Title: IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios
Abstract:
Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety-critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real-world industrial applications. To bridge this gap, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. It also provides additional reasoning evaluation based on these categories. Specifically, it comprises 971 question-answer pairs generated from small warehouses and 373 pairs from large ones, incorporating scenarios with and without humans. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments. Benchmark and codes are available.
中文:IndustryEQA基准通过高保真仓库模拟和危险场景,结合六类丰富注释与推理评估,填补了工业安全关键环境中具身智能体评估的空白。
English: The IndustryEQA benchmark addresses the gap in evaluating embodied agents for industrial safety-critical scenarios by providing high-fidelity warehouse simulations with hazardous situations and comprehensive reasoning assessments.

Authors:Xiao Chen, Tai Wang, Quanyi Li, Tao Huang, Jiangmiao Pang, Tianfan Xue
Title: GLEAM: Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scenes
Abstract:
Generalizable active mapping in complex unknown environments remains a critical challenge for mobile robots. Existing methods, constrained by insufficient training data and conservative exploration strategies, exhibit limited generalizability across scenes with diverse layouts and complex connectivity. To enable scalable training and reliable evaluation, we introduce GLEAM-Bench, the first large-scale benchmark designed for generalizable active mapping with 1,152 diverse 3D scenes from synthetic and real-scan datasets. Building upon this foundation, we propose GLEAM, a unified generalizable exploration policy for active mapping. Its superior generalizability comes mainly from our semantic representations, long-term navigable goals, and randomized strategies. It significantly outperforms state-of-the-art methods, achieving 66.50% coverage (+9.49%) with efficient trajectories and improved mapping accuracy on 128 unseen complex scenes. Project page: https://xiao-chen.tech/gleam/.
中文摘要:提出的GLEAM框架通过语义表征和随机策略实现了通用性主动建图,在128个未见复杂场景中达到66.50%的覆盖率,显著优于现有方法。
English Summary: The proposed GLEAM framework introduces a unified generalizable exploration policy for active mapping, achieving 66.50% coverage with improved efficiency and accuracy on unseen complex scenes through semantic representations and novel strategies.

Authors:Fanheng Kong, Jingyuan Zhang, Hongzhi Zhang, Shi Feng, Daling Wang, Linhao Yu, Xingguang Ji, Yu Tian, Victoria W., Fuzheng Zhang
Title: TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos
Abstract:
Videos are unique in their integration of temporal elements, including camera, scene, action, and attribute, along with their dynamic relationships over time. However, existing benchmarks for video understanding often treat these properties separately or narrowly focus on specific aspects, overlooking the holistic nature of video content. To address this, we introduce TUNA, a temporal-oriented benchmark for fine-grained understanding on dense dynamic videos, with two complementary tasks: captioning and QA. Our TUNA features diverse video scenarios and dynamics, assisted by interpretable and robust evaluation criteria. We evaluate several leading models on our benchmark, providing fine-grained performance assessments across various dimensions. This evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion, offering valuable insights for improving video understanding models. The data and code are available at https://friedrichor.github.io/projects/TUNA.
Chinese: TUNA基准通过引入描述和问答任务,解决了视频中时间元素的整体整合问题,揭示了现有模型在动作描述有限和多主体理解不足等方面的关键挑战。
English: The TUNA benchmark addresses the holistic integration of temporal elements in videos by introducing captioning and QA tasks, revealing key challenges like limited action description and inadequate multi-subject understanding in current models.

Authors:Fanheng Kong, Jingyuan Zhang, Yahui Liu, Hongzhi Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Victoria W., Fuzheng Zhang, Guorui Zhou
Title: Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval
Abstract:
Multimodal information retrieval (MIR) faces inherent challenges due to the heterogeneity of data sources and the complexity of cross-modal alignment. While previous studies have identified modal gaps in feature spaces, a systematic approach to address these challenges remains unexplored. In this work, we introduce UNITE, a universal framework that tackles these challenges through two critical yet underexplored aspects: data curation and modality-aware training configurations. Our work provides the first comprehensive analysis of how modality-specific data properties influence downstream task performance across diverse scenarios. Moreover, we propose Modal-Aware Masked Contrastive Learning (MAMCL) to mitigate the competitive relationships among the instances of different modalities. Our framework achieves state-of-the-art results on multiple multimodal retrieval benchmarks, outperforming existing methods by notable margins. Through extensive experiments, we demonstrate that strategic modality curation and tailored training protocols are pivotal for robust cross-modal representation learning. This work not only advances MIR performance but also provides a foundational blueprint for future research in multimodal systems. Our project is available at https://friedrichor.github.io/projects/UNITE.
中文: 本文提出UNITE通用框架,通过数据筛选和模态感知训练配置解决多模态信息检索中的模态差异问题,采用模态感知掩码对比学习等创新方法,在多个基准测试中取得了领先性能。
English: This paper introduces UNITE, a universal framework addressing multimodal information retrieval challenges through data curation and modality-aware training, achieving state-of-the-art results by mitigating modal gaps with innovative strategies like Modal-Aware Masked Contrastive Learning.

Authors:Shi-Yu Tian, Zhi Zhou, Wei Dong, Ming Yang, Kun-Yang Yu, Zi-Jian Cheng, Lan-Zhe Guo, Yu-Feng Li
Title: Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights
Abstract:
Reasoning with tabular data holds increasing importance in modern applications, yet comprehensive evaluation methodologies for reasoning-intensive Table Question Answering (QA) tasks remain nascent. Existing research is constrained by two primary bottlenecks: 1) Reliance on costly manually annotated real-world data, which struggles to cover complex reasoning scenarios; 2) The heterogeneity of table structures hinders systematic analysis of the intrinsic mechanisms behind the underperformance of LLMs, especially in reasoning-intensive tasks. To address these issues, we propose an automated generation pipeline AutoT2T that transforms mathematical word problems into table-based reasoning tasks, eliminating the need for manual annotation. The pipeline can generate multiple variants of a table for the same reasoning problem, including noisy versions to support robustness evaluation. Based on this, we construct a new benchmark TabularGSM, which systematically spans a range of table complexities and trap problems. Experimental analyses through AutoT2T and TabularGSM reveal that the tight coupling between reasoning and retrieval or identification processes is a key factor underlying the failure of LLMs in complex Table QA tasks. This highlights the necessity for models to develop synergistic reasoning capabilities in order to perform effectively in complex Table QA tasks.
English Summary: AutoT2T is a neuro-symbolic framework that transforms math word problems into scalable tabular reasoning tasks, creating the TabularGSM benchmark which reveals that tabular structures increase reasoning difficulty due to combined retrieval-reasoning challenges and robustness issues in LLMs.

Authors:Shi-Yu Tian, Zhi Zhou, Wei Dong, Kun-Yang Yu, Ming Yang, Zi-Jian Cheng, Lan-Zhe Guo, Yu-Feng Li
Title: TabularGSM: Understanding the Limitations of LLMs in Tabular Math Reasoning
Abstract:
Mathematical reasoning has long been a key benchmark for evaluating large language models (LLMs). Although substantial progress has been made on math word problems, the need for reasoning over tabular data in real-world applications has been overlooked. For instance, applications such as business intelligence demand not only multi-step numerical reasoning with tables but also robustness to incomplete or inconsistent information. However, comprehensive evaluation in this area is severely limited, constrained by the reliance on manually collected tables that are difficult to scale and the lack of coverage for potential traps encountered in real-world scenarios. To address this problem, we propose AutoT2T, a neuro-symbolic framework that controllably transforms math word problems into scalable and verified tabular reasoning tasks, enabling the evaluation of both accuracy and robustness. Building on this pipeline, we develop TabularGSM, a benchmark comprising three progressively complex subsets and a trap subset, with two complementary evaluation settings. Our study reveals three key observations: (1) Tabular structure makes mathematical reasoning more challenging; (2) The difficulties stem from the joint effects of tabular retrieval and reasoning; (3) Reasoning robustness is another significant issue that needs to be addressed in existing LLMs. In-depth analyses are conducted for each observation to guide future research.
English Summary: AutoT2T is a neuro-symbolic framework that transforms math word problems into scalable tabular reasoning tasks, creating the TabularGSM benchmark which reveals that tabular structures increase reasoning difficulty due to combined retrieval-reasoning challenges and robustness issues in LLMs.

Authors:Chi Zhang, Luca Colagrande, Renzo Andri, Thomas Benz, Gamze Islamoglu, Alessandro Nadalini, Francesco Conti, Yawei Li, Luca Benini
Title: FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Efficient Multi-Head Attention on Tile-Based Many-PE Accelerators
Abstract:
Multi-Head Attention (MHA) is a critical computational kernel in transformer-based AI models. Emerging scalable tile-based accelerator architectures integrate increasing numbers of tightly-packed processing elements (PEs) with tensor units. MHA dataflow mapping is crucial for achieving high utilization of the available units. We propose FlatAttention, a new dataflow for MHA on tile-based many-PE accelerators, minimizing costly main memory (HBM) accesses by leveraging collective primitives integrated into the on-chip network fabric. FlatAttention achieves up to 89.3% utilization, and 4.1x performance speedup over FlashAttention-3 dataflow on tile-based accelerators whilst reducing HBM traffic by 16x. Through algorithm-architecture co-exploration, we identify an optimal configuration for a large scaled-out tile-based accelerator featuring a 32x32 tile mesh with 1024 TFLOPS @ FP16 peak performance, comparable to the state-of-the-art Nvidia H100 GPU. FlatAttention in this configuration achieves up to 1.3x higher utilization over FlashAttention-3 on the H100 GPU. Meanwhile, this tile-based accelerator configuration requires 40% less HBM bandwidth compared to the H100, enabling a 1.8x reduction in die size, estimated on the same technology node.
中文: FlatAttention是一种面向瓦片架构加速器的多头注意力创新数据流,可减少内存访问,相比FlashAttention-3实现最高4.1倍加速,同时将HBM通信量降低16倍。
English: FlatAttention is a novel dataflow for Multi-Head Attention on tile-based accelerators that minimizes memory accesses and achieves up to 4.1x speedup over FlashAttention-3 while reducing HBM traffic by 16x.

Authors:Ruichen Zhang, Rana Muhammad Shahroz Khan, Zhen Tan, Dawei Li, Song Wang, Tianlong Chen
Title: The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
Abstract:
Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark to systematically assess the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from method, model, and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B, 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization, and cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The dataset can be found at https://huggingface.co/datasets/rana-shahroz/DC-COT, while our code is shared at https://anonymous.4open.science/r/DC-COT-FF4C/.
中文: 本文提出了首个数据为中心的思维链蒸馏基准DC-CoT,通过多种师生模型配置系统评估数据操作对模型性能的影响,重点关注分布内与分布外泛化能力。
English: This paper introduces DC-CoT, the first comprehensive benchmark for evaluating data-centric distillation techniques in chain-of-thought reasoning, assessing their impact on student model performance across various teacher-student configurations and generalization scenarios.

Authors:Jiongran Wu, Jiahao Liu, Dongsheng Li, Guangping Zhang, Mingzhe Han, Hansu Gu, Peng Zhang, Li Shang, Tun Lu, Ning Gu
Title: Bidirectional Knowledge Distillation for Enhancing Sequential Recommendation with Large Language Models
Abstract:
Large language models (LLMs) have demonstrated exceptional performance in understanding and generating semantic patterns, making them promising candidates for sequential recommendation tasks. However, when combined with conventional recommendation models (CRMs), LLMs often face challenges related to high inference costs and static knowledge transfer methods. In this paper, we propose a novel mutual distillation framework, LLMD4Rec, that fosters dynamic and bidirectional knowledge exchange between LLM-centric and CRM-based recommendation systems. Unlike traditional unidirectional distillation methods, LLMD4Rec enables iterative optimization by alternately refining both models, enhancing the semantic understanding of CRMs and enriching LLMs with collaborative signals from user-item interactions. By leveraging sample-wise adaptive weighting and aligning output distributions, our approach eliminates the need for additional parameters while ensuring effective knowledge transfer. Extensive experiments on real-world datasets demonstrate that LLMD4Rec significantly improves recommendation accuracy across multiple benchmarks without increasing inference costs. This method provides a scalable and efficient solution for combining the strengths of both LLMs and CRMs in sequential recommendation systems.
Chinese: 提出的LLMD4Rec框架实现了大语言模型与传统推荐模型之间的动态双向知识蒸馏,通过迭代优化和自适应加权,在不增加推理成本的情况下显著提升了推荐准确性。
English: The proposed LLMD4Rec framework enables dynamic bidirectional knowledge distillation between large language models and conventional recommendation models, enhancing recommendation accuracy without increasing inference costs through iterative optimization and adaptive weighting.

Authors:Ömer Faruk Akgül, Feiyu Zhu, Yuxin Yang, Rajgopal Kannan, Viktor Prasanna
Title: RECIPE-TKG: From Sparse History to Structured Reasoning for LLM-based Temporal Knowledge Graph Completion
Abstract:
Temporal Knowledge Graphs (TKGs) represent dynamic facts as timestamped relations between entities. TKG completion involves forecasting missing or future links, requiring models to reason over time-evolving structure. While LLMs show promise for this task, existing approaches often overemphasize supervised fine-tuning and struggle particularly when historical evidence is limited or missing. We introduce RECIPE-TKG, a lightweight and data-efficient framework designed to improve accuracy and generalization in settings with sparse historical context. It combines (1) rule-based multi-hop retrieval for structurally diverse history, (2) contrastive fine-tuning of lightweight adapters to encode relational semantics, and (3) test-time semantic filtering to iteratively refine generations based on embedding similarity. Experiments on four TKG benchmarks show that RECIPE-TKG outperforms previous LLM-based approaches, achieving up to 30.6% relative improvement in Hits@10. Moreover, our proposed framework produces more semantically coherent predictions, even for the samples with limited historical context.
Chinese Summary: RECIPE-TKG是一种轻量级框架,通过结合基于规则的历史检索、对比微调和语义过滤,显著提升了时序知识图谱补全的性能,尤其在历史证据稀疏的场景下表现优异。
English Summary: RECIPE-TKG is a lightweight framework that enhances temporal knowledge graph completion by combining rule-based history retrieval, contrastive fine-tuning, and semantic filtering, achieving significant performance improvements especially in sparse historical contexts.
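
The reported gain is a relative improvement in Hits@10, that is, a ratio of the change to the baseline rather than a difference in percentage points. The following minimal Python sketch shows how Hits@k and a relative improvement are commonly computed; the function names, entity labels, and toy rankings are hypothetical and are not taken from the RECIPE-TKG code.

```python
def hits_at_k(ranked_predictions, gold_answers, k=10):
    """Fraction of queries whose gold entity appears in the top-k predictions."""
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if gold in preds[:k]
    )
    return hits / len(gold_answers)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, expressed as a fraction."""
    return (new - baseline) / baseline


# Toy rankings for four queries, using hypothetical entity names.
preds = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"], ["J", "K", "L"]]
gold = ["B", "Z", "G", "L"]

h2 = hits_at_k(preds, gold, k=2)     # queries 1 and 3 hit: 2 / 4
h3 = hits_at_k(preds, gold, k=3)     # queries 1, 3, and 4 hit: 3 / 4
gain = relative_improvement(h3, h2)  # (0.75 - 0.5) / 0.5
```

In this toy case the absolute difference is 25 percentage points while the relative improvement is 50%, which is why the two should not be conflated when reading the abstract's figure.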

Authors:Weiming Wu, Zi-kang Wang, Jin Ye, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Title: NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation
Abstract:
Obtaining large-scale, high-quality data with reasoning paths is crucial for improving the geometric reasoning capabilities of multi-modal large language models (MLLMs). However, existing data generation methods, whether based on predefined templates or constrained symbolic provers, inevitably face diversity and numerical generalization limitations. To address these limitations, we propose NeSyGeo, a novel neuro-symbolic framework for generating geometric reasoning data. First, we propose a domain-specific language grounded in the entity-relation-constraint paradigm to comprehensively represent all components of plane geometry, along with generative actions defined within this symbolic space. We then design a symbolic-visual-text pipeline that synthesizes symbolic sequences, maps them to corresponding visual and textual representations, and generates diverse question-answer (Q&A) pairs using large language models (LLMs). To the best of our knowledge, we are the first to propose a neuro-symbolic approach in generating multimodal reasoning data. Based on this framework, we construct NeSyGeo-CoT and NeSyGeo-Caption datasets, containing 100k samples, and release a new benchmark NeSyGeo-Test for evaluating geometric reasoning abilities in MLLMs. Experiments demonstrate that the proposal significantly and consistently improves the performance of multiple MLLMs under both reinforcement and supervised fine-tuning. With only 4k samples and two epochs of reinforcement fine-tuning, base models achieve improvements of up to +15.8% on MathVision, +8.4% on MathVerse, and +7.3% on GeoQA. Notably, a 4B model can be improved to outperform an 8B model from the same series on geometric reasoning tasks.
中文: NeSyGeo框架提出了一种神经符号方法,用于生成多样化的几何推理数据,通过少量训练即可显著提升多模态大语言模型在多个基准测试中的几何推理能力。
English: The NeSyGeo framework introduces a neuro-symbolic approach to generate diverse geometric reasoning data, significantly enhancing multi-modal language models' performance across benchmarks with minimal training.

Authors:Weiming Wu, Jin Ye, Zi-kang Wang, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Title: NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation
Abstract:
Obtaining large-scale, high-quality reasoning data is crucial for improving the geometric reasoning capabilities of multi-modal large language models (MLLMs). However, existing data generation methods, whether based on predefined templates or constrained symbolic provers, inevitably face diversity and numerical generalization limitations. To address these limitations, we propose NeSyGeo, a novel neuro-symbolic framework for generating geometric reasoning data. First, we propose a domain-specific language grounded in the entity-attributes-relations paradigm to comprehensively represent all components of plane geometry, along with generative actions defined within this symbolic space. We then design a symbolic-visual-text pipeline that synthesizes symbolic sequences, maps them to visual and textual representations, and generates reasoning paths with reverse search and forward validation. Based on this framework, we construct NeSyGeo-CoT and NeSyGeo-Caption datasets, containing 100k samples, and release a new benchmark NeSyGeo-Test for evaluating geometric reasoning abilities in MLLMs. Experiments demonstrate that the proposal significantly and consistently improves the performance of multiple MLLMs under both reinforcement and supervised fine-tuning. With only 4k samples and two epochs of reinforcement fine-tuning, base models achieve improvements of up to +15.8% on MathVision, +8.4% on MathVerse, and +7.3% on GeoQA. Notably, a 4B model can be improved to outperform an 8B model from the same series on geometric reasoning tasks.
中文: NeSyGeo框架提出了一种神经符号方法,用于生成多样化的几何推理数据,通过少量训练即可显著提升多模态大语言模型在多个基准测试中的几何推理能力。
English: The NeSyGeo framework introduces a neuro-symbolic approach to generate diverse geometric reasoning data, significantly enhancing multi-modal language models' performance across benchmarks with minimal training.

Authors:Zhenzhen Ren, GuoBiao Li, Sheng Li, Zhenxing Qian, Xinpeng Zhang
Title: CoTSRF: Utilize Chain of Thought as Stealthy and Robust Fingerprint of Large Language Models
Abstract:
Despite providing superior performance, open-source large language models (LLMs) are vulnerable to abusive usage. To address this issue, recent works propose LLM fingerprinting methods to identify the specific source LLMs behind suspect applications. However, these methods fail to provide stealthy and robust fingerprint verification. In this paper, we propose a novel LLM fingerprinting scheme, namely CoTSRF, which utilizes the Chain of Thought (CoT) as the fingerprint of an LLM. CoTSRF first collects the responses from the source LLM by querying it with crafted CoT queries. Then, it applies contrastive learning to train a CoT extractor that extracts the CoT feature (i.e., fingerprint) from the responses. Finally, CoTSRF conducts fingerprint verification by comparing the Kullback-Leibler divergence between the CoT features of the source and suspect LLMs against an empirical threshold. Various experiments have been conducted to demonstrate the advantage of our proposed CoTSRF for fingerprinting LLMs, particularly in stealthy and robust fingerprint verification.
Chinese: 本文提出CoTSRF,一种新颖的大语言模型指纹识别方案,通过对比学习训练思维链提取器,并利用KL散度比较进行指纹验证,实现了隐蔽且鲁棒的指纹识别。
English: This paper introduces CoTSRF, a novel LLM fingerprinting method that uses Chain of Thought as a covert and robust identifier by training a CoT extractor through contrastive learning and verifying fingerprints via KL divergence comparison.
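
The verification step described above compares a Kullback-Leibler divergence against an empirical threshold. The following minimal Python sketch illustrates that decision rule under stated assumptions: the CoT features are treated as normalized discrete probability vectors, a divergence at or below the threshold is taken to indicate the same source model, and the threshold value, feature vectors, and function names are all hypothetical rather than taken from CoTSRF.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )


def same_source(source_feat, suspect_feat, threshold):
    """Assumed rule: a small divergence suggests the suspect derives from the source."""
    return kl_divergence(source_feat, suspect_feat) <= threshold


# Hypothetical feature vectors, each summing to 1.
source = [0.7, 0.2, 0.1]
close_suspect = [0.68, 0.21, 0.11]
far_suspect = [0.1, 0.2, 0.7]
THRESHOLD = 0.05  # hypothetical; the abstract says the threshold is set empirically

match_close = same_source(source, close_suspect, THRESHOLD)
match_far = same_source(source, far_suspect, THRESHOLD)
```

Because KL divergence is asymmetric, the order of the two arguments matters; the abstract does not state which direction is used, so the source-to-suspect order here is only one possible choice.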

Authors:Zi-Jian Cheng, Zi-Yi Jia, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Title: Realistic Evaluation of TabPFN v2 in Open Environments
Abstract:
Tabular data, owing to its ubiquitous presence in real-world domains, has garnered significant attention in machine learning research. While tree-based models have long dominated tabular machine learning tasks, the recently proposed deep learning model TabPFN v2 has emerged, demonstrating unparalleled performance and scalability potential. Although extensive research has been conducted on TabPFN v2 to further improve performance, the majority of this research remains confined to closed environments, neglecting the challenges that frequently arise in open environments. This raises the question: Can TabPFN v2 maintain good performance in open environments? To this end, we conduct the first comprehensive evaluation of TabPFN v2's adaptability in open environments. We construct a unified evaluation framework covering various real-world challenges and assess the robustness of TabPFN v2 under open-environment scenarios using this framework. Empirical results demonstrate that TabPFN v2 shows significant limitations in open environments but is suitable for small-scale, covariate-shifted, and class-balanced tasks. Tree-based models remain the optimal choice for general tabular tasks in open environments. To facilitate future research on open-environment challenges, we advocate for open-environment tabular benchmarks, multi-metric evaluation, and universal modules to strengthen model robustness. We publicly release our evaluation framework at https://anonymous.4open.science/r/tabpfn-ood-4E65.
中文: TabPFN v2在开放环境中表现出明显局限性,但在小规模、协变量偏移和类别平衡任务中表现良好,而树模型仍是开放环境下通用表格任务的最佳选择。
English: TabPFN v2 shows limitations in open environments but excels in small-scale, covariate-shifted, and class-balanced tasks, while tree-based models remain superior for general tabular tasks under such conditions.

Authors:Nikolaos Chaidos, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Stamou
Title: SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval
Abstract:
Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision stemming from variable text encodings undermines retrieval reliability. To address these issues, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially advancing the state of the art in counterfactual image retrieval.
Chinese: 提出的SCENIR框架通过无监督图自编码器,基于语义场景图而非低级特征进行图像检索,不仅摆脱了对标注数据的依赖,还首次采用图编辑距离作为相似性评估标准,有效提升了检索性能。
English: The proposed SCENIR framework addresses biases in image retrieval by using an unsupervised Graph Autoencoder that prioritizes semantic scene graphs over low-level features, eliminating reliance on labeled data and introducing Graph Edit Distance for robust similarity evaluation.

Authors:Tencent Hunyuan Team, Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, Dian Jiao, Dong Du, Dong Wang, Feng Zhang, Fengzong Lian, Guanghui Xu, Guanwei Zhang, Hai Wang, Haipeng Luo, Han Hu, Huilin Xu, Jiajia Wu, Jianchen Zhu, Jianfeng Yan, Jiaqi Zhu, Jihong Zhang, Jinbao Xue, Jun Xia, Junqiang Zheng, Kai Liu, Kai Zhang, Kai Zheng, Kejiao Li, Keyao Wang, Lan Jiang, Lixin Liu, Lulu Wu, Mengyuan Huang, Peijie Yu, Peiqi Wang, Qian Wang, Qianbiao Xiang, Qibin Liu, Qingfeng Sun, Richard Guo, Ruobing Xie, Saiyong Yang, Shaohua Chen, Shihui Hu, Shuai Li, Shuaipeng Li, Shuang Chen, Suncong Zheng, Tao Yang, Tian Zhang, Tinghao Yu, Weidong Han, Weijie Liu, Weijin Zhou, Weikang Wang, Wesleye Chen, Xiao Feng, Xiaoqin Ren, Xingwu Sun, Xiong Kuang, Xuemeng Huang, Xun Cao, Yanfeng Chen, Yang Du, Zhen Yang, Yangyu Tao, Yaping Deng, Yi Shen, Yigeng Hong, Yiqi Chen, Yiqing Huang, Yuchi Deng, Yue Mao, Yulong Wang, Yuyuan Zeng, Zenan Xu, Zhanhui Kang, Zhe Zhao, ZhenXiang Yan, Zheng Fang, Zhichao Hu, Zhongzhi Chen, Zhuoyu Li, Zongwei Li, Alex Yan, Ande Liang, Baitong Liu, Beiping Pan, Bin Xing, Binghong Wu, Bingxin Qu, Bolin Ni, Boyu Wu, Chen Li, Cheng Jiang, Cheng Zhang, Chengjun Liu, Chengxu Yang, Chengzhong Xu, Chiyu Wang, Chong Zha, Daisy Yi, Di Wang, Fanyang Lu, Fei Chen, Feifei Liu, Feng Zheng, Guanghua Yu, Guiyang Li, Guohua Wang, Haisheng Lin, Han Liu, Han Wang, Hao Fei, Hao Lu, Haoqing Jiang, Haoran Sun, Haotian Zhu, Huangjin Dai, Huankui Chen, Huawen Feng, Huihui Cai, Huxin Peng, Jackson Lv, Jiacheng Shi, Jiahao Bu, Jianbo Li, Jianglu Hu, Jiangtao Guan, Jianing Xu, Jianwei Cai, Jiarong Zhang, Jiawei Song, Jie Jiang, Jie Liu, Jieneng Yang, Jihong Zhang, Jin lv, Jing Zhao, Jinjian Li, Jinxing Liu, Jun Zhao, Juntao Guo, Kai Wang, Kan Wu, Lei Fu, Lei He, Lei Wang, Li Liu, Liang Dong, Liya Zhan, Long Cheng, Long Xu, Mao Zheng, Meng Liu, Mengkang Hu, Nanli Chen, Peirui Chen, Peng He, Pengju Pan, Pengzhi Wei, Qi 
Yang, Qi Yi, Roberts Wang, Rongpeng Chen, Rui Sun, Rui Yang, Ruibin Chen, Ruixu Zhou, Shaofeng Zhang, Sheng Zhang, Shihao Xu, Shuaishuai Chang, Shulin Liu, SiQi Wang, Songjia Feng, Songling Yuan, Tao Zhang, Tianjiao Lang, Tongkai Li, Wei Deng, Wei Li, Weichao Wang, Weigang Zhang, Weixuan Sun, Wen Ouyang, Wenxiang Jiao, Wenzhi Sun, Wenzhuo Jia, Xiang Zhang, Xiangyu He, Xianshun Ren, XiaoYing Zhu, Xiaolong Guo, Xiaoxue Li, Xiaoyu Ma, Xican Lu, Xinhua Feng, Xinting Huang, Xinyu Guan, Xirui Li, Xu Zhang, Xudong Gao, Xun Luo, Xuxiang Qi, Yangkun Chen, Yangyu Tao, Yanling Xiao, Yantao Mai, Yanze Chen, Yao Ding, Yeting Yang, YiFan Song, Yifan Yang, Yijiao Zhu, Yinhe Wu, Yixian Liu, Yong Yang, Yuanjun Cai, Yuanlin Tu, Yue Zhang, Yufei Huang, Yuhang Zhou, Yuhao Jiang, Yuhong Liu, Yuhui Hu, Yujin Lin, Yun Yang, Yunhao Wang, Yusong Zhang, Zekun Wu, Zelong Zhang, Zhan Yu, Zhaoliang Yang, Zhe Zhao, Zheng Li, Zhenyu Huang, Zhiguang Liu, Zhijiang Xu, Zhiqing Kui, Zhiyin Zeng, Zhiyuan Xiong, Zhuo Han, Zifan Wu, Zigang Geng, Zilong Zhao, Ziyan Tang, Ziyuan Zhu, Zonglei Zhu, Zhijiang Xu
Title: Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
Abstract:
As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.
English Summary: Hunyuan-TurboS is a hybrid Transformer-Mamba MoE model that combines efficient long-sequence processing with superior contextual understanding, achieving top-tier performance on benchmarks while optimizing computational resources through adaptive reasoning mechanisms.

Authors:Wenhui Zhu, Xuanzhao Dong, Xin Li, Peijie Qiu, Xiwen Chen, Abolfazl Razi, Aris Sotiras, Yi Su, Yalin Wang
Title: Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models
Abstract:
Recently, reinforcement learning (RL)-based tuning has shifted the trajectory of Multimodal Large Language Models (MLLMs), particularly following the introduction of Group Relative Policy Optimization (GRPO). However, directly applying it to medical tasks remains challenging for achieving clinically grounded model behavior. Motivated by the need to align model responses with clinical expectations, we investigate four critical dimensions that affect the effectiveness of RL-based tuning in medical visual question answering (VQA): base model initialization strategy, the role of medical semantic alignment, the impact of length-based rewards on long-chain reasoning, and the influence of bias. We conduct extensive experiments to analyze these factors for medical MLLMs, providing new insights into domain-specific fine-tuning of such models. Our results also demonstrate that GRPO-based RL tuning consistently outperforms standard supervised fine-tuning (SFT) in both accuracy and reasoning quality.
Chinese Summary: 近期基于强化学习的调优方法(特别是GRPO)虽提升了多模态大语言模型,但在医疗应用中仍面临挑战;本研究确定了影响医疗视觉问答调优效果的四个关键因素,并证明GRPO在准确性和推理质量上均优于标准监督微调。
English Summary: Recent advances in RL-based tuning, particularly GRPO, have improved MLLMs but face challenges in medical applications; this study identifies four key factors for effective medical VQA tuning and shows GRPO's superiority over SFT in accuracy and reasoning.

Authors:Shitong Duan, Xiaoyuan Yi, Peng Zhang, Dongkuan Xu, Jing Yao, Tun Lu, Ning Gu, Xing Xie
Title: AdAEM: An Adaptively and Automated Extensible Measurement of LLMs' Value Difference
Abstract:
Assessing Large Language Models (LLMs)' underlying value differences enables comprehensive comparison of their misalignment, cultural adaptability, and biases. Nevertheless, current value measurement datasets face the informativeness challenge: with often outdated, contaminated, or generic test questions, they can only capture the shared value orientations among different LLMs, leading to saturated and thus uninformative results. To address this problem, we introduce AdAEM, a novel, self-extensible assessment framework for revealing LLMs' inclinations. Distinct from previous static benchmarks, AdAEM can automatically and adaptively generate and extend its test questions. This is achieved by probing the internal value boundaries of a diverse set of LLMs developed across cultures and time periods in an in-context optimization manner. The optimization process theoretically maximizes an information-theoretic objective to extract the latest or culturally controversial topics, providing more distinguishable and informative insights about models' value differences. In this way, AdAEM is able to co-evolve with the development of LLMs, consistently tracking their value dynamics. Using AdAEM, we generate 12,310 questions grounded in Schwartz Value Theory, conduct an extensive analysis to manifest our method's validity and effectiveness, and benchmark the values of 16 LLMs, laying the groundwork for better value research.
中文: AdAEM框架通过动态生成和扩展测试问题,聚焦于当前和文化争议性话题,揭示大型语言模型之间的价值差异,克服了传统数据集过时和泛化的局限,提供了更具区分度的见解。
English: The AdAEM framework is introduced to dynamically generate and extend test questions that reveal the value differences among Large Language Models by focusing on current and culturally controversial topics, overcoming the limitations of outdated and generic datasets to provide more distinguishable insights.

Authors:Shengkang Gu, Jiahao Liu, Dongsheng Li, Guangping Zhang, Mingzhe Han, Hansu Gu, Peng Zhang, Ning Gu, Li Shang, Tun Lu
Title: LLM-Based User Simulation for Low-Knowledge Shilling Attacks on Recommender Systems
Abstract:
Recommender systems (RS) are increasingly vulnerable to shilling attacks, where adversaries inject fake user profiles to manipulate system outputs. Traditional attack strategies often rely on simplistic heuristics, require access to internal RS data, and overlook the manipulation potential of textual reviews. In this work, we introduce Agent4SR, a novel framework that leverages Large Language Model (LLM)-based agents to perform low-knowledge, high-impact shilling attacks through both rating and review generation. Agent4SR simulates realistic user behavior by orchestrating adversarial interactions, selecting items, assigning ratings, and crafting reviews, while maintaining behavioral plausibility. Our design includes targeted profile construction, hybrid memory retrieval, and a review attack strategy that propagates target item features across unrelated reviews to amplify manipulation. Extensive experiments on multiple datasets and RS architectures demonstrate that Agent4SR outperforms existing low-knowledge baselines in both effectiveness and stealth. Our findings reveal a new class of emergent threats posed by LLM-driven agents, underscoring the urgent need for enhanced defenses in modern recommender systems.
中文: Agent4SR是一种基于大语言模型的创新框架,通过生成评分和评论执行低知识高影响力的托攻击,在效果和隐蔽性上优于现有方法,同时揭示了推荐系统面临的新型威胁。
English: Agent4SR is a novel LLM-based agent framework that executes low-knowledge, high-impact shilling attacks by generating both ratings and reviews, outperforming existing methods in effectiveness and stealth while highlighting new threats to recommender systems.

Authors:Song-Lin Lv, Rui Zhu, Yu-Feng Li, Lan-Zhe Guo
Title: Unlabeled Data or Pre-trained Model: Rethinking Semi-Supervised Learning and Pretrain-Finetuning
Abstract:
Semi-supervised learning (SSL) alleviates the cost of the data labeling process by exploiting unlabeled data, and has achieved promising results on various tasks such as image classification. Meanwhile, the Pretrain-Finetuning paradigm has garnered significant attention in recent years, and exploiting pre-trained models could also reduce the requirement of labeled data in downstream tasks. Therefore, a question naturally arises: When the labeled data is scarce in the target tasks, should we exploit unlabeled data or pre-trained models? To answer this question, we select pre-trained Vision-Language Models (VLMs) as representative pretrain-finetuning instances and propose Few-shot SSL, a framework that enables fair comparison between these two paradigms by controlling the amount of labeled data used. Extensive experiments across various settings demonstrate that pre-trained VLMs generally outperform SSL methods in nearly all cases, except when the data has low resolution or lacks clear semantic structure. Therefore, we encourage future SSL research to compare with pre-trained models and explore deeper integration, such as using pre-trained knowledge to enhance pseudo-labeling. To support future research, we release our unified reproduction and evaluation framework. Code is available at https://anonymous.4open.science/r/Rethinking-SSL-and-Pretrain-Finetuning-5566.
中文: 研究发现,在标注数据有限的情况下,预训练的视觉语言模型通常优于半监督学习方法,除非数据分辨率低或语义结构不清晰,因此建议未来的半监督学习研究应整合预训练模型以提升性能。
English: The study finds that pre-trained vision-language models generally outperform semi-supervised learning methods in scenarios with limited labeled data, except when dealing with low-resolution or semantically ambiguous data, suggesting future SSL research should integrate pre-trained models for enhanced performance.

Authors:Gamze İslamoğlu, Luca Bertaccini, Arpan Suravi Prasad, Francesco Conti, Angelo Garofalo, Luca Benini
Title: MXDOTP: A RISC-V ISA Extension for Enabling Microscaling (MX) Floating-Point Dot Products
Abstract:
Fast and energy-efficient low-bitwidth floating-point (FP) arithmetic is essential for Artificial Intelligence (AI) systems. Microscaling (MX) standardized formats have recently emerged as a promising alternative to baseline low-bitwidth FP formats, offering improved accuracy with a block-wise shared exponent scale combined with per-element values. However, efficiently executing the key linear algebra primitives for AI applications on MX formats requires specialized hardware support for the fundamental operators such as scaled dot product. In this work, we propose MXDOTP, the first RISC-V ISA extension for MX dot products, focusing on the 8-bit MXFP8 FP format. We extend the open-source Snitch RISC-V core with a dedicated MXFP8 dot product-accumulate unit, which fully consumes blocks of eight 8-bit operands packed into 64-bit inputs. To feed MXDOTP at full utilization with four operands per cycle, including block scales, we exploit Snitch's Stream Semantic Registers (SSRs), achieving up to 80% utilization with minimal impact on the Snitch core's architecture and no modification to the register file. Implemented in 12 nm FinFET, a cluster with eight MXDOTP-extended cores reaches up to 356 GFLOPS/W when computing MXFP8 matrix multiplications at 0.8 V, 1 GHz. Compared to a software baseline, where MX dot products are computed by type casting FP8 inputs to FP32 for higher accumulation precision and applying explicit block scaling, the cluster achieves 25x speedup and 12.5x better energy efficiency at a minimal 5.1% area increase.
中文: 本文提出MXDOTP,一种针对微缩放点积运算的RISC-V指令集扩展,通过专用硬件支持MXFP8格式计算,在仅增加5.1%芯片面积的情况下实现了25倍加速和12.5倍能效提升。
English: This paper introduces MXDOTP, a RISC-V ISA extension for microscaling (MX) dot products, which enhances AI systems by providing specialized hardware support for efficient MXFP8 arithmetic, achieving significant speed and energy efficiency improvements with minimal area overhead.
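
To make the block-scaled arithmetic in the abstract concrete, here is a minimal software sketch of a dot product over operands that share one power-of-two scale per block. It is not the MXDOTP hardware datapath or instruction semantics: Python floats stand in for the 8-bit elements, the function name `mx_dot` is hypothetical, and the block size of 8 simply mirrors the eight operands per 64-bit input mentioned in the abstract.

```python
def mx_dot(a_elems, a_exps, b_elems, b_exps, block_size=8):
    """Dot product of two block-scaled vectors.

    a_elems / b_elems: per-element values (floats standing in for FP8).
    a_exps / b_exps:   one shared integer exponent per block, so the real
                       value of an element is elem * 2**exp of its block.
    """
    if len(a_elems) != len(b_elems):
        raise ValueError("vectors must have the same length")
    if len(a_elems) % block_size != 0:
        raise ValueError("length must be a multiple of block_size")
    n_blocks = len(a_elems) // block_size
    if len(a_exps) != n_blocks or len(b_exps) != n_blocks:
        raise ValueError("need exactly one exponent per block")

    acc = 0.0  # higher-precision accumulator
    for blk in range(n_blocks):
        lo = blk * block_size
        hi = lo + block_size
        # Multiply-accumulate the unscaled elements of this block.
        partial = sum(x * y for x, y in zip(a_elems[lo:hi], b_elems[lo:hi]))
        # Apply both block scales once per block, not once per element.
        acc += partial * 2.0 ** (a_exps[blk] + b_exps[blk])
    return acc
```

Applying the two scales once per block rather than once per element is what makes the shared-exponent layout cheap to evaluate; the result equals the dot product of the fully dequantized vectors.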

Authors:Bo Yang, Hengwei Zhang, Jindong Wang, Yuchen Ren, Chenhao Lin, Chao Shen, Zhengyu Zhao
Title: Use as Many Surrogates as You Want: Selective Ensemble Attack to Unleash Transferability without Sacrificing Resource Efficiency
Abstract:
In surrogate ensemble attacks, using more surrogate models yields higher transferability but lower resource efficiency. This practical trade-off between transferability and efficiency has largely limited existing attacks, even though many pre-trained models are easily accessible online. In this paper, we argue that such a trade-off is caused by an unnecessary common assumption, i.e., all models should be identical across iterations. By lifting this assumption, we can use as many surrogates as we want to unleash transferability without sacrificing efficiency. Concretely, we propose Selective Ensemble Attack (SEA), which dynamically selects diverse models (from easily accessible pre-trained models) across iterations based on our new interpretation of decoupling within-iteration and cross-iteration model diversity. In this way, the number of within-iteration models is fixed for maintaining efficiency, while only cross-iteration model diversity is increased for higher transferability. Experiments on ImageNet demonstrate the superiority of SEA in various scenarios. For example, when dynamically selecting 4 from 20 accessible models, SEA yields 8.5% higher transferability than existing attacks under the same efficiency. The superiority of SEA also generalizes to real-world systems, such as commercial vision APIs and large vision-language models. Overall, SEA opens up the possibility of adaptively balancing transferability and efficiency according to specific resource requirements.
中文摘要:选择性集成攻击(SEA)通过跨迭代动态选择多样化模型,打破了替代模型集成攻击中迁移性与效率的固有平衡,能够在保持效率的同时显著提升迁移攻击效果。
English Summary: The Selective Ensemble Attack (SEA) overcomes the trade-off between transferability and efficiency in surrogate ensemble attacks by dynamically selecting diverse models across iterations, enabling higher transferability without sacrificing resource efficiency.
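
The idea of keeping the per-iteration model count fixed while varying which models are used can be sketched as a selection schedule. The abstract does not state SEA's actual selection criterion, so the sketch below substitutes plain uniform random sampling; the function name and parameters are hypothetical.

```python
import random


def selection_schedule(num_models, per_iter, num_iters, seed=0):
    """Pick `per_iter` distinct surrogate indices for each attack iteration.

    Uniform random sampling is an illustrative stand-in, not SEA's rule.
    The per-iteration cost stays constant because exactly `per_iter`
    models are used each time, while the pool covered across iterations
    can grow up to `num_models`.
    """
    if per_iter > num_models:
        raise ValueError("cannot select more models than are available")
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        sorted(rng.sample(range(num_models), per_iter))
        for _ in range(num_iters)
    ]
```

With `selection_schedule(20, 4, 10)`, every iteration touches four models, yet the union of models used over the ten iterations is typically much larger than four.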

Authors:Julian Tanke, Takashi Shibuya, Kengo Uchida, Koichi Saito, Yuki Mitsufuji
Title: Dyadic Mamba: Long-term Dyadic Human Motion Synthesis
Abstract:
Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this paper, we introduce Dyadic Mamba, a novel approach that leverages State-Space Models (SSMs) to generate high-quality dyadic human motion of arbitrary length. Our method employs a simple yet effective architecture that facilitates information flow between individual motion sequences through concatenation, eliminating the need for complex cross-attention mechanisms. We demonstrate that Dyadic Mamba achieves competitive performance on standard short-term benchmarks while significantly outperforming transformer-based approaches on longer sequences. Additionally, we propose a new benchmark for evaluating long-term motion synthesis quality, providing a standardized framework for future research. Our results demonstrate that SSM-based architectures offer a promising direction for addressing the challenging task of long-term dyadic human motion synthesis from text descriptions.
中文: 本文提出Dyadic Mamba方法,利用状态空间模型生成任意长度的高质量双人交互动作,在长序列上显著优于基于Transformer的方法,同时保持短期基准测试的竞争力。
English: This paper introduces Dyadic Mamba, a novel approach using State-Space Models to generate high-quality dyadic human motion of arbitrary length, which outperforms transformer-based methods on longer sequences while maintaining competitive performance on short-term benchmarks.

Authors:Wenzhe Cai, Jiaqi Peng, Yuqiang Yang, Yujian Zhang, Meng Wei, Hanqing Wang, Yilun Chen, Tai Wang, Jiangmiao Pang
Title: NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance
Abstract:
Learning navigation in dynamic open-world environments is an important yet challenging skill for robots. Most previous methods rely on precise localization and mapping or learn from expensive real-world demonstrations. In this paper, we propose the Navigation Diffusion Policy (NavDP), an end-to-end framework that is trained solely in simulation and can transfer zero-shot to different embodiments in diverse real-world environments. The key ingredient of NavDP's network is the combination of diffusion-based trajectory generation and a critic function for trajectory selection, which are conditioned on only local observation tokens encoded from a shared policy transformer. Given the privileged information of the global environment in simulation, we scale up the demonstrations of good quality to train the diffusion policy and formulate the critic value function targets with contrastive negative samples. Our demonstration generation approach achieves about 2,500 trajectories/GPU per day, 20× more efficient than real-world data collection, and results in a large-scale navigation dataset with 363.2 km of trajectories across 1244 scenes. Trained with this simulation dataset, NavDP achieves state-of-the-art performance and consistently outstanding generalization capability on quadruped, wheeled, and humanoid robots in diverse indoor and outdoor environments. In addition, we present a preliminary attempt at using Gaussian Splatting to perform in-domain real-to-sim fine-tuning to further bridge the sim-to-real gap. Experiments show that adding such real-to-sim data can improve the success rate by 30% without hurting its generalization capability.
中文摘要:本文提出NavDP框架,通过扩散轨迹生成与评估器结合的仿真训练方法,实现无需真实演示的零样本跨机器人导航,在多种现实环境中取得最优性能并保持卓越泛化能力。
English Summary: This paper introduces NavDP, a simulation-trained navigation framework that uses diffusion-based trajectory generation and a critic function for zero-shot transfer to various real-world robots, achieving state-of-the-art performance across diverse environments.

Authors:Xiwen Chen, Wenhui Zhu, Peijie Qiu, Hao Wang, Huayu Li, Zihan Li, Yalin Wang, Aristeidis Sotiras, Abolfazl Razi
Title: FIC-TSC: Learning Time Series Classification with Fisher Information Constraint
Abstract:
Analyzing time series data is crucial to a wide spectrum of applications, including economics, online marketplaces, and human healthcare. In particular, time series classification plays an indispensable role in segmenting different phases in stock markets, predicting customer behavior, and classifying worker actions and engagement levels. These aspects contribute significantly to the advancement of automated decision-making and system optimization in real-world applications. However, there is a broad consensus that time series data often suffers from domain shifts between training and test sets, which dramatically degrades the classification performance. Despite the success of (reversible) instance normalization in handling the domain shifts for time series regression tasks, its performance in classification is unsatisfactory. In this paper, we propose FIC-TSC, a training framework for time series classification that leverages Fisher information as the constraint. We theoretically and empirically show this is an efficient and effective solution to guide the model to converge toward flatter minima, which enhances its generalizability to distribution shifts. We rigorously evaluate our method on 30 UEA multivariate and 85 UCR univariate datasets. Our empirical results demonstrate the superiority of the proposed method over 14 recent state-of-the-art methods.
中文: 时间序列分类在金融和医疗等领域至关重要,但存在领域偏移问题影响性能,因此本文提出FIC-TSC框架,利用Fisher信息提升泛化能力,在基准数据集上取得了优于现有方法的效果。
English: Time series classification is vital for applications like finance and healthcare but suffers from domain shifts that reduce performance, so this paper introduces FIC-TSC, a framework using Fisher information to improve generalization and achieve state-of-the-art results on benchmark datasets.

Authors:Shuyao Cheng, Rui Zhang, Wenkai He, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Yifan Hao, Guanglin Xu, Yuanbo Wen, Ling Li, Qi Guo, Yunji Chen
Title: QiMeng-CPU-v2: Automated Superscalar Processor Design by Learning Data Dependencies
Abstract:
Automated processor design, which can significantly reduce human efforts and accelerate design cycles, has received considerable attention. While recent advancements have automatically designed single-cycle processors that execute one instruction per cycle, their performance cannot compete with modern superscalar processors that execute multiple instructions per cycle. Previous methods fail on superscalar processor design because they cannot address inter-instruction data dependencies, leading to inefficient sequential instruction execution. This paper proposes a novel approach to automatically designing superscalar processors using a hardware-friendly model called the Stateful Binary Speculation Diagram (State-BSD). We observe that processor parallelism can be enhanced through on-the-fly inter-instruction dependent data predictors, reusing the processor's internal states to learn the data dependency. To meet the challenge of both hardware-resource limitation and design functional correctness, State-BSD consists of two components: 1) a lightweight state-selector trained by the simulated annealing method to detect the most reusable processor states and store them in a small buffer; and 2) a highly precise state-speculator trained by the BSD expansion method to predict the inter-instruction dependent data using the selected states. It is the first work to achieve automated superscalar processor design, i.e., QiMeng-CPU-v2, which improves performance by about 380× over the state-of-the-art automated design and is comparable to human-designed superscalar processors such as ARM Cortex A53.
中文摘要:本文提出了一种利用状态二进制推测图(State-BSD)自动设计超标量处理器的新方法,通过状态选择器和推测器预测指令间数据依赖,实现了与ARM Cortex A53等人为设计处理器相当的性能。
English Summary: This paper introduces a novel automated design method for superscalar processors using State-BSD, which employs state-selection and speculation components to predict inter-instruction dependencies, achieving performance comparable to human-designed processors like ARM Cortex A53.

Authors:Jingyan Shen, Jiarui Yao, Rui Yang, Yifan Sun, Feng Luo, Rui Pan, Tong Zhang, Han Zhao
Title: MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning
Abstract:
Reward modeling is a key step in building safe foundation models when applying reinforcement learning from human feedback (RLHF) to align Large Language Models (LLMs). However, reward modeling based on the Bradley-Terry (BT) model assumes a global reward function, failing to capture the inherently diverse and heterogeneous human preferences. Hence, such oversimplification limits LLMs from supporting personalization and pluralistic alignment. Theoretically, we show that when human preferences follow a mixture distribution of diverse subgroups, a single BT model has an irreducible error. While existing solutions, such as multi-objective learning with fine-grained annotations, help address this issue, they are costly and constrained by predefined attributes, failing to fully capture the richness of human values. In this work, we introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets without requiring explicit fine-grained annotations. In the first stage, MiCRo introduces a context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts mixture weights based on the specific context to resolve ambiguity, allowing for efficient and scalable preference adaptation with minimal additional supervision. Experiments on multiple preference datasets demonstrate that MiCRo effectively captures diverse human preferences and significantly improves downstream personalization.
中文摘要:MiCRo框架通过两阶段方法解决奖励建模的局限性,利用上下文感知的混合建模和动态路由策略捕捉多样化人类偏好,无需细粒度标注即可显著提升个性化表现。
English Summary: The MiCRo framework addresses limitations in reward modeling by capturing diverse human preferences through a two-stage approach that uses context-aware mixture modeling and dynamic routing, enhancing personalization without requiring fine-grained annotations.
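
The contrast between a single Bradley-Terry (BT) model and a mixture over subgroups can be illustrated numerically. The sketch below assumes the standard BT form, where the probability of preferring one response is the sigmoid of the reward difference, and a simple weighted mixture of such models. The function names, reward values, and weights are illustrative and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bt_prob(reward_chosen, reward_rejected):
    """Bradley-Terry probability that 'chosen' is preferred."""
    return sigmoid(reward_chosen - reward_rejected)


def mixture_preference_prob(weights, chosen_rewards, rejected_rewards):
    """Preference probability under a weighted mixture of BT components.

    weights[k] is the (context-dependent) weight of subgroup k, and
    chosen_rewards[k] / rejected_rewards[k] are that subgroup's rewards.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * bt_prob(rc, rr)
        for w, rc, rr in zip(weights, chosen_rewards, rejected_rewards)
    )
```

With two subgroups holding opposite preferences, `mixture_preference_prob([0.5, 0.5], [1.0, -1.0], [-1.0, 1.0])` evaluates to 0.5 for the population as a whole, while shifting the weights to `[1.0, 0.0]` recovers the first subgroup's strong preference. Adapting the weights per context is the role the abstract assigns to routing.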

Authors:Yingjia Xu, Jinlin Wu, Zhen Chen, Daming Gao, Yang Yang, Zhen Lei, Min Cao
Title: SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking
Abstract:
Text-based person retrieval aims to identify a target individual from a gallery of images based on a natural language description. It presents a significant challenge due to the complexity of real-world scenes and the ambiguity of appearance-related descriptions. Existing methods primarily emphasize appearance-based cross-modal retrieval, often neglecting the contextual information embedded within the scene, which can offer valuable complementary insights for retrieval. To address this, we introduce SCENEPERSON-13W, a large-scale dataset featuring over 100,000 scenes with rich annotations covering both pedestrian appearance and environmental cues. Based on this, we propose SA-Person, a two-stage retrieval framework. In the first stage, it performs discriminative appearance grounding by aligning textual cues with pedestrian-specific regions. In the second stage, it introduces SceneRanker, a training-free, scene-aware re-ranking method leveraging multimodal large language models to jointly reason over pedestrian appearance and the global scene context. Experiments on SCENEPERSON-13W validate the effectiveness of our framework in challenging scene-level retrieval scenarios. The code and dataset will be made publicly available.
中文: 本文提出了包含丰富场景标注的大规模数据集SCENEPERSON-13W和SA-Person双阶段检索框架,该框架先对齐文本与行人外观特征,再通过SceneRanker联合推理外观和场景上下文,有效提升了基于文本的人物检索性能。
English: This paper introduces SCENEPERSON-13W, a large-scale dataset with rich scene annotations, and proposes SA-Person, a two-stage retrieval framework that first aligns text with pedestrian appearance and then uses SceneRanker to jointly reason over appearance and scene context for improved text-based person retrieval.

Authors:Shujian Yang, Shiyao Cui, Chuanrui Hu, Haicheng Wang, Tianwei Zhang, Minlie Huang, Jialiang Lu, Han Qiu
Title: Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings
Abstract:
Detecting toxic content using language models is important but challenging. While large language models (LLMs) have demonstrated strong performance in understanding Chinese, recent studies show that simple character substitutions in toxic Chinese text can easily confuse the state-of-the-art (SOTA) LLMs. In this paper, we highlight the multimodal nature of the Chinese language as a key challenge for deploying LLMs in toxic Chinese detection. First, we propose a taxonomy of 3 perturbation strategies and 8 specific approaches in toxic Chinese content. Then, we curate a dataset based on this taxonomy, and benchmark 9 SOTA LLMs (from both the US and China) to assess if they can detect perturbed toxic Chinese text. Additionally, we explore cost-effective enhancement solutions like in-context learning (ICL) and supervised fine-tuning (SFT). Our results reveal two important findings. (1) LLMs are less capable of detecting perturbed multimodal Chinese toxic content. (2) ICL or SFT with a small number of perturbed examples may cause the LLMs to "overcorrect": they misidentify much normal Chinese content as toxic.
中文摘要:大型语言模型在检测经过简单字符替换的有毒中文内容时表现不佳,而通过上下文学习或微调进行改进的尝试常导致过度矫正,将正常文本误判为有毒内容。
English Summary: Large language models struggle to detect toxic Chinese content altered by simple character substitutions, and attempts to improve detection through in-context learning or fine-tuning often lead to overcorrection where normal text is misclassified as toxic.

Authors:Boyuan Chen, Donghai Hong, Jiaming Ji, Jiacheng Zheng, Bowen Dong, Jiayi Zhou, Kaile Wang, Juntao Dai, Xuyao Wang, Wenqi Chen, Qirui Zheng, Wenxin Li, Sirui Han, Yike Guo, Yaodong Yang
Title: InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback
Abstract:
As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: What essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment -- not limited to language, but also involving multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support multi-turn, multimodal interaction. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through InterMT -- the first preference dataset for multi-turn multimodal interaction, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. InterMT captures human preferences at both global and local levels across nine sub-dimensions, and consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for the lack of capability for multi-modal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we introduce InterMT-Bench to assess the ability of MLLMs in assisting judges with multi-turn, multimodal tasks. We demonstrate the utility of InterMT through applications such as judge moderation and further reveal the multi-turn scaling law of judge models. We hope that open-sourcing our data can help facilitate further research on aligning current MLLMs to the next step. Our project website can be found at https://pku-intermt.github.io .
中文: 本研究提出了首个多轮多模态交互偏好数据集InterMT,旨在弥补当前多模态大模型在连续、交错多模态对话能力上的不足,通过人类反馈和专家标注引导模型向人类水平的交互智能对齐。
English: This study introduces InterMT, the first preference dataset for multi-turn multimodal interaction, to address the gap in current multimodal large models' ability to engage in continuous, interleaved multimodal exchanges, using human feedback and expert annotations to guide alignment toward human-level interactive capabilities.

Authors:Hadi Askari, Shivanshu Gupta, Fei Wang, Anshuman Chhabra, Muhao Chen
Title: LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions
Abstract:
Pretrained Large Language Models (LLMs) achieve strong performance across a wide range of tasks, yet exhibit substantial variability in the various layers' training quality with respect to specific downstream applications, limiting their downstream performance. It is therefore critical to estimate layer-wise training quality in a manner that accounts for both model architecture and training data. However, existing approaches predominantly rely on model-centric heuristics (such as spectral statistics, outlier detection, or uniform allocation) while overlooking the influence of data. To address these limitations, we propose LayerIF, a data-driven framework that leverages Influence Functions to quantify the training quality of individual layers in a principled and task-sensitive manner. By isolating each layer's gradients and measuring the sensitivity of the validation loss to training examples by computing layer-wise influences, we derive data-driven estimates of layer importance. Notably, our method produces task-specific layer importance estimates for the same LLM, revealing how layers specialize for different test-time evaluation tasks. We demonstrate the utility of our scores by leveraging them for two downstream applications: (a) expert allocation in LoRA-MoE architectures and (b) layer-wise sparsity distribution for LLM pruning. Experiments across multiple LLM architectures demonstrate that our model-agnostic, influence-guided allocation leads to consistent gains in task performance.
中文:LayerIF是一种数据驱动框架,利用影响函数评估大语言模型中各层的训练质量,通过计算层间影响实现任务特定的层重要性评估,从而提升专家分配和模型剪枝等下游应用性能。
English: LayerIF is a data-driven framework that uses Influence Functions to assess individual layer training quality in LLMs, enabling task-specific layer importance estimation for improved downstream applications like expert allocation and model pruning.

Authors:Junqi Zhao, Jinzheng Zhao, Haohe Liu, Yun Chen, Lu Han, Xubo Liu, Mark Plumbley, Wenwu Wang
Title: AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion
Abstract:
Diffusion models have significantly improved the quality and diversity of audio generation but are hindered by slow inference speed. Rectified flow enhances inference speed by learning straight-line ordinary differential equation (ODE) paths. However, this approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. To address the limitations of rectified flow while leveraging the advantages of advanced pre-trained diffusion models, this study integrates pre-trained models with the rectified diffusion method to improve the efficiency of text-to-audio (TTA) generation. Specifically, we propose AudioTurbo, which learns first-order ODE paths from deterministic noise sample pairs generated by a pre-trained TTA model. Experiments on the AudioCaps dataset demonstrate that our model, with only 10 sampling steps, outperforms prior models and reduces inference to 3 steps compared to a flow-matching-based acceleration model.
Chinese: 本研究提出AudioTurbo方法,将预训练扩散模型与修正流相结合,以提升文本到音频生成的效率,在AudioCaps数据集上仅需10个采样步骤即超越先前模型,并将推理步骤缩减至3步。
English: This study introduces AudioTurbo, a method that combines pre-trained diffusion models with rectified flow to enhance text-to-audio generation efficiency, achieving superior performance in just 10 sampling steps and reducing inference to as few as 3 steps on the AudioCaps dataset.
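
The link between straight-line ODE paths and low step counts can be illustrated with a minimal sketch. It is not the AudioTurbo model: the two-dimensional `noise` and `target` vectors and the idealized constant velocity field are assumptions made for illustration. When the path from noise to sample is perfectly straight, Euler integration reaches the same endpoint whether it takes 1 step or 10.

```python
def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical noise/sample pair; a real system would use audio latents.
noise = [0.0, 2.0]
target = [1.0, -1.0]

# Idealized straight path: the velocity is the constant vector target - noise.
straight_velocity = lambda x, t: [b - a for a, b in zip(noise, target)]

one_step = euler_sample(noise, straight_velocity, 1)
ten_steps = euler_sample(noise, straight_velocity, 10)
```

A learned velocity field is only approximately straight, so in practice a few steps are still needed; the sketch only shows why straighter paths tolerate coarser discretization.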

Authors:Prashant Bhat, Laurens Niesten, Elahe Arani, Bahram Zonooz
Title: Continual Learning Beyond Experience Rehearsal and Full Model Surrogates
Abstract:
Continual learning (CL) has remained a significant challenge for deep neural networks as learning new tasks erases previously acquired knowledge, either partially or completely. Existing solutions often rely on experience rehearsal or full model surrogates to mitigate this catastrophic forgetting (CF). While effective, these approaches introduce substantial memory and computational overhead, limiting their scalability and applicability in real-world scenarios. To address this, we propose SPARC, a scalable CL approach that eliminates the need for experience rehearsal and full-model surrogates. By effectively combining task-specific working memories and task-agnostic semantic memory for cross-task knowledge consolidation, SPARC achieves remarkable parameter efficiency, using only 6% of the parameters required by full-model surrogates. Despite its lightweight design, SPARC achieves superior performance on Seq-TinyImageNet and matches rehearsal-based methods on various CL benchmarks. Additionally, weight re-normalization in the classification layer mitigates task-specific biases, establishing SPARC as a practical and scalable solution for CL under stringent efficiency constraints.
Chinese: SPARC是一种可扩展的持续学习方法,无需经验回放或完整模型替代,仅使用6%的参数即可实现卓越性能,并通过权重重归一化减轻任务特定偏差,成为高效约束下的实用解决方案。
English: SPARC is a scalable continual learning approach that eliminates the need for experience rehearsal and full-model surrogates, achieving superior performance with only 6% of parameters while mitigating task-specific biases through weight re-normalization.

Authors:Stanley Yu, Vaidehi Bulusu, Oscar Yasunaga, Clayton Lau, Cole Blondin, Sean O'Brien, Kevin Zhu, Vasu Sharma
Title: From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Abstract:
Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. Prior work suggests that the truthfulness of simple propositions can be represented as a single linear direction in a model's internal activations, but this may not fully capture its underlying geometry. In this work, we extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. Our results are supported by three lines of evidence: (i) causal interventions reliably flip model responses to factual statements, (ii) learned cones generalize across model architectures, and (iii) cone-based interventions preserve unrelated model behavior. These findings reveal the richer, multidirectional structure governing simple true/false propositions in LLMs and highlight concept cones as a promising tool for probing abstract behaviors.
中文: 本研究将概念锥框架扩展至大语言模型的真实性表征,揭示了跨模型因果调节真实性的多维锥结构,并通过干预实验和跨架构泛化能力验证了其有效性。
English: This study extends the concept cone framework to truth representation in LLMs, revealing multi-dimensional cones that causally mediate truthfulness across models, supported by interventions and cross-architectural generalization.

Authors:Daniel Csizmadia, Andrei Codreanu, Victor Sim, Vighnesh Prabhu, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma
Title: Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation
Abstract:
We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. These semantically and spatially aligned global representations guide the training of a lightweight student model using a hybrid loss that combines contrastive learning and cosine similarity objectives. Despite being trained on only ~67,500 samples curated from MSCOCO, Flickr30k, and Conceptual Captions (just a fraction of CLIP's original dataset), DCLIP significantly improves image-text retrieval metrics (Recall@K, MAP), while retaining approximately 94% of CLIP's zero-shot classification performance. These results demonstrate that DCLIP effectively mitigates the trade-off between task specialization and generalization, offering a resource-efficient, domain-adaptive, and detail-sensitive solution for advanced vision-language tasks. Code available at https://anonymous.4open.science/r/DCLIP-B772/README.md.
中文: DCLIP是CLIP的优化版本,通过蒸馏框架提升图文检索能力,同时仅用少量数据即可保持强大的零样本分类性能。
English: DCLIP is a refined version of CLIP that boosts image-text retrieval performance through a distillation framework, maintaining strong zero-shot classification with minimal data usage.
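
A hybrid loss that combines a contrastive term with a cosine similarity term might look like the sketch below. The abstract does not give the formulation, so everything specific here is an assumption: the InfoNCE form of the contrastive term, the `temperature` value, the weighting `alpha`, and the choice to apply the cosine term between student and teacher image embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(image_embs, text_embs, temperature=0.07):
    """Image-to-text contrastive loss: image i should match text i."""
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(image_embs)

def cosine_distillation(student_embs, teacher_embs):
    """Mean (1 - cosine) between student and teacher embeddings."""
    pairs = list(zip(student_embs, teacher_embs))
    return sum(1.0 - cosine(s, t) for s, t in pairs) / len(pairs)

def hybrid_loss(student_img, student_txt, teacher_img, alpha=0.5):
    """Weighted sum of the two terms; alpha is a hypothetical mixing weight."""
    contrastive = info_nce(student_img, student_txt)
    distill = cosine_distillation(student_img, teacher_img)
    return alpha * contrastive + (1.0 - alpha) * distill

# Toy 2-D embeddings: matched pairs versus swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = hybrid_loss(images, [[1.0, 0.0], [0.0, 1.0]], images)
mismatched = hybrid_loss(images, [[0.0, 1.0], [1.0, 0.0]], images)
```

With matched pairs both terms are close to zero, while swapping the captions drives the contrastive term up, which is the behavior a retrieval objective should reward and penalize respectively.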

Authors:Ilias Diakonikolas, Giannis Iakovidis, Daniel M. Kane, Lisheng Ren
Title: Algorithms and SQ Lower Bounds for Robustly Learning Real-valued Multi-index Models
Abstract:
We study the complexity of learning real-valued Multi-Index Models (MIMs) under the Gaussian distribution. A $K$-MIM is a function $f:\mathbb{R}^d\to \mathbb{R}$ that depends only on the projection of its input onto a $K$-dimensional subspace. We give a general algorithm for PAC learning a broad class of MIMs with respect to the square loss, even in the presence of adversarial label noise. Moreover, we establish a nearly matching Statistical Query (SQ) lower bound, providing evidence that the complexity of our algorithm is qualitatively optimal as a function of the dimension. Specifically, we consider the class of bounded variation MIMs with the property that degree at most $m$ distinguishing moments exist with respect to projections onto any subspace. In the presence of adversarial label noise, the complexity of our learning algorithm is $d^{O(m)}2^{\mathrm{poly}(K/\epsilon)}$. For the realizable and independent noise settings, our algorithm incurs complexity $d^{O(m)}2^{\mathrm{poly}(K)}(1/\epsilon)^{O(K)}$. To complement our upper bound, we show that if for some subspace degree-$m$ distinguishing moments do not exist, then any SQ learner for the corresponding class of MIMs requires complexity $d^{\Omega(m)}$. As an application, we give the first efficient learner for the class of positive-homogeneous $L$-Lipschitz $K$-MIMs. The resulting algorithm has complexity $\mathrm{poly}(d) 2^{\mathrm{poly}(KL/\epsilon)}$. This gives a new PAC learning algorithm for Lipschitz homogeneous ReLU networks with complexity independent of the network size, removing the exponential dependence incurred in prior work.
中文: 本文提出了一种高效学习多索引模型的PAC算法,其统计查询下界近乎匹配,并实现了对Lipschitz齐次ReLU网络复杂度与网络规模无关的学习。
English: This paper presents an efficient PAC learning algorithm for multi-index models with nearly matching statistical query lower bounds, achieving complexity independent of network size for Lipschitz homogeneous ReLU networks.
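
The definition of a $K$-MIM can be made concrete with a small sketch. The dimensions ($d = 4$, $K = 2$), the subspace spanned by the first two coordinate axes, and the link function `g` are illustrative assumptions, not taken from the paper. The point is that $f(x) = g(Wx)$ does not change when the input moves in directions orthogonal to the subspace.

```python
import math

def project(W, x):
    """Return W x: the coordinates of x in the subspace spanned by the rows of W."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def make_mim(W, g):
    """Build f(x) = g(W x), so f depends on x only through its projection."""
    return lambda x: g(project(W, x))

# Hypothetical setup: d = 4, K = 2, subspace spanned by e1 and e2.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]
g = lambda z: math.tanh(z[0]) + max(0.0, z[1])  # arbitrary link function
f = make_mim(W, g)

x = [0.5, -1.0, 2.0, 3.0]
x_shifted = [0.5, -1.0, -7.0, 10.0]  # changed only outside the subspace
same_value = math.isclose(f(x), f(x_shifted))
```

A learner that recovers the subspace therefore only has to fit a function of $K$ variables rather than $d$, which is why the bounds separate the dependence on $d$ from the dependence on $K$.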

Authors:Qinzhuo Wu, Pengzhi Gao, Wei Liu, Jian Luan
Title: BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism
Abstract:
Graphical User Interface (GUI) agents have gained substantial attention due to their impressive capabilities to complete tasks through multiple interactions within GUI environments. However, existing agents primarily focus on enhancing the accuracy of individual actions and often lack effective mechanisms for detecting and recovering from errors. To address these shortcomings, we propose the BacktrackAgent, a robust framework that incorporates a backtracking mechanism to improve task completion efficiency. BacktrackAgent includes verifier, judger, and reflector components as modules for error detection and recovery, while also applying judgment rewards to further enhance the agent's performance. Additionally, we develop a training dataset specifically designed for the backtracking mechanism, which considers the outcome pages after action executions. Experimental results show that BacktrackAgent has achieved performance improvements in both task success rate and step accuracy on Mobile3M and Auto-UI benchmarks. Our data and code will be released upon acceptance.
中文: 提出的BacktrackAgent通过回溯机制结合验证器、判断器和反射器模块,增强了GUI代理的错误检测与恢复能力,在基准测试中显著提升了任务成功率和步骤准确性。
English: The proposed BacktrackAgent introduces a backtracking mechanism with verifier, judger, and reflector modules to enhance GUI agents' error detection and recovery, achieving improved task success rates and step accuracy on benchmarks.

Authors:Xiaobao Wei, Xiaoan Zhang, Hao Wang, Qingpo Wuwu, Ming Lu, Wenzhao Zheng, Shanghang Zhang
Title: OmniIndoor3D: Comprehensive Indoor 3D Reconstruction
Abstract:
We propose a novel framework for comprehensive indoor 3D reconstruction using Gaussian representations, called OmniIndoor3D. This framework enables accurate appearance, geometry, and panoptic reconstruction of diverse indoor scenes captured by a consumer-level RGB-D camera. Since 3DGS is primarily optimized for photorealistic rendering, it lacks the precise geometry critical for high-quality panoptic reconstruction. Therefore, OmniIndoor3D first combines multiple RGB-D images to create a coarse 3D reconstruction, which is then used to initialize the 3D Gaussians and guide the 3DGS training. To decouple the optimization conflict between appearance and geometry, we introduce a lightweight MLP that adjusts the geometric properties of 3D Gaussians. The introduced lightweight MLP serves as a low-pass filter for geometry reconstruction and significantly reduces noise in indoor scenes. To improve the distribution of Gaussian primitives, we propose a densification strategy guided by panoptic priors to encourage smoothness on planar surfaces. Through the joint optimization of appearance, geometry, and panoptic reconstruction, OmniIndoor3D provides comprehensive 3D indoor scene understanding, which facilitates accurate and robust robotic navigation. We perform thorough evaluations across multiple datasets, and OmniIndoor3D achieves state-of-the-art results in appearance, geometry, and panoptic reconstruction. We believe our work bridges a critical gap in indoor 3D reconstruction. The code will be released at: https://ucwxb.github.io/OmniIndoor3D/
中文:OmniIndoor3D是一种创新框架,利用高斯表示实现全面的室内三维重建,通过联合优化外观、几何和全景分割,为机器人导航提供精确可靠的场景理解。
English: OmniIndoor3D is a novel framework that uses Gaussian representations to achieve comprehensive indoor 3D reconstruction, integrating appearance, geometry, and panoptic reconstruction through joint optimization for superior robotic navigation.

Authors:Yi Wen, Yue Liu, Derong Xu, Huishi Luo, Pengyue Jia, Yiqing Wu, Siwei Wang, Ke Liang, Maolin Wang, Yiqi Wang, Fuzhen Zhuang, Xiangyu Zhao
Title: Measure Domain's Gap: A Similar Domain Selection Principle for Multi-Domain Recommendation
Abstract:
Multi-Domain Recommendation (MDR) achieves strong recommendation performance by effectively transferring information across different domains. Despite this success, most existing MDR methods adopt a single structure to transfer complex domain-shared knowledge. However, the beneficial transferring information should vary across different domains. When there is knowledge conflict between domains or a domain is of poor quality, unselectively leveraging information from all domains will lead to a serious Negative Transfer Problem (NTP). Therefore, how to effectively model the complex transfer relationships between domains to avoid NTP is still a direction worth exploring. To address these issues, we propose a simple and dynamic Similar Domain Selection Principle (SDSP) for multi-domain recommendation in this paper. SDSP presents the initial exploration of selecting suitable domain knowledge for each domain to alleviate NTP. Specifically, we propose a novel prototype-based domain distance measure to effectively model the complex relationships between domains. Thereafter, the proposed SDSP can dynamically find similar domains for each domain based on the supervised signals of the domain metrics and the unsupervised distance measure from the learned domain prototype. We emphasize that SDSP is a lightweight method that can be incorporated into existing MDR methods for better performance without introducing excessive time overhead. To the best of our knowledge, it is the first solution that can explicitly measure domain-level gaps and dynamically select appropriate domains in the MDR field. Extensive experiments on three datasets demonstrate the effectiveness of our proposed method.
中文: 本文提出了一种相似领域选择原则(SDSP),通过动态选择相关领域并测量领域间距离,有效缓解多领域推荐中的负迁移问题,且该方法轻量、易于集成现有模型,实验证明其有效性。
English: The paper introduces a Similar Domain Selection Principle (SDSP) to dynamically select relevant domains in multi-domain recommendation, mitigating negative transfer by measuring domain distances and integrating with existing methods for improved performance without significant overhead.
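
The unsupervised half of the selection step, a prototype-based distance between domains, could be sketched as follows. This is a simplified illustration, not the SDSP method: treating a prototype as the mean embedding of a domain, using Euclidean distance, and the toy domain names and vectors are all assumptions, and the supervised signal from domain metrics mentioned in the abstract is left out.

```python
import math

def prototype(embeddings):
    """Assumed prototype: the coordinate-wise mean of a domain's embeddings."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def distance(p, q):
    """Euclidean distance between two prototypes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def select_similar_domains(domain_embeddings, target, k=1):
    """Return the k domains whose prototypes lie closest to the target's."""
    protos = {name: prototype(embs) for name, embs in domain_embeddings.items()}
    others = [name for name in protos if name != target]
    others.sort(key=lambda name: distance(protos[target], protos[name]))
    return others[:k]

# Hypothetical domains with 2-D item embeddings.
domains = {
    "books": [[1.0, 0.0], [0.8, 0.2]],
    "movies": [[0.9, 0.1], [0.7, 0.3]],
    "groceries": [[0.0, 1.0], [0.1, 0.9]],
}
nearest = select_similar_domains(domains, "books", k=1)
```

Restricting transfer to the nearest domains is one way to keep a distant or low-quality domain from contributing, which is the negative-transfer scenario the abstract describes.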

Authors:Jiaming Ji, Sitong Fang, Wenjing Cao, Jiahao Li, Xuyao Wang, Juntao Dai, Chi-Min Chan, Sirui Han, Yike Guo, Yaodong Yang
Title: The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels
Abstract:
Reasoning models have recently attracted significant attention, especially for tasks that involve complex inference. Their strengths exemplify the System II paradigm (slow, structured thinking), contrasting with the System I (rapid, heuristic-driven). Yet, does slower reasoning necessarily lead to greater truthfulness? Our findings suggest otherwise. In this study, we present the first systematic investigation of distortions associated with System I and System II reasoning in multimodal contexts. We demonstrate that slower reasoning models, when presented with incomplete or misleading visual inputs, are more likely to fabricate plausible yet false details to support flawed reasoning -- a phenomenon we term the "Mirage of Multimodality". To examine this, we constructed a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. These prompts gradually increase in complexity, revealing a consistent pattern: slower reasoning models tend to employ depth-first thinking (delving deeper into incorrect premises), whereas faster chat models favor breadth-first inference, exhibiting greater caution under uncertainty. Our results highlight a critical vulnerability of slower reasoning models: although highly effective in structured domains such as mathematics, they become brittle when confronted with ambiguous multimodal inputs.
Chinese Summary: 研究表明,在面临误导性视觉输入时,速度较慢的推理模型更容易产生虚假细节,这种现象被称为"多模态幻象",凸显了其在模糊多模态场景下的脆弱性。
English Summary: This study reveals that slower reasoning models, despite their structured approach, are more prone to generating false details when faced with misleading visual inputs, a vulnerability termed the "Mirage of Multimodality."

Authors:Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen
Title: From Data to Modeling: Fully Open-vocabulary Scene Graph Generation
Abstract:
We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines--scene parser-based, LLM-based, and multimodal LLM-based--to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.
Chinese: OvSGTR是一种基于Transformer的开放词汇场景图生成框架,通过关系感知预训练和视觉概念保留机制突破固定类别限制,在多种场景下均实现了最先进的性能。
English: OvSGTR is a transformer-based framework for open-vocabulary scene graph generation that overcomes fixed-category limitations by leveraging relation-aware pre-training and visual-concept retention to achieve state-of-the-art performance across diverse scenarios.

Authors:Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
Title: Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI
Abstract:
This review presents a comprehensive analysis of two emerging paradigms in AI-assisted software development: vibe coding and agentic coding. While both leverage large language models (LLMs), they differ fundamentally in autonomy, architectural design, and the role of the developer. Vibe coding emphasizes intuitive, human-in-the-loop interaction through prompt-based, conversational workflows that support ideation, experimentation, and creative exploration. In contrast, agentic coding enables autonomous software development through goal-driven agents capable of planning, executing, testing, and iterating tasks with minimal human intervention. We propose a detailed taxonomy spanning conceptual foundations, execution models, feedback loops, safety mechanisms, debugging strategies, and real-world tool ecosystems. Through comparative workflow analysis and 20 detailed use cases, we illustrate how vibe systems thrive in early-stage prototyping and education, while agentic systems excel in enterprise-grade automation, codebase refactoring, and CI/CD integration. We further examine emerging trends in hybrid architectures, where natural language interfaces are coupled with autonomous execution pipelines. Finally, we articulate a future roadmap for agentic AI, outlining the infrastructure needed for trustworthy, explainable, and collaborative systems. Our findings suggest that successful AI software engineering will rely not on choosing one paradigm, but on harmonizing their strengths within a unified, human-centered development lifecycle.
中文: 本综述对比了两种AI驱动的软件开发方法:氛围编码通过对话式交互促进人类引导的创意探索,而智能体编码则能以最小人工干预实现自主任务执行,主张在统一开发生命周期中融合二者优势。
English: This review compares two AI-driven software development approaches: vibe coding, which fosters human-guided creativity through conversational interaction, and agentic coding, which enables autonomous task execution with minimal human input, advocating for their integration in a unified development lifecycle.

Authors:Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang, Jiabei He, Shiwan Zhao, Xiangyu Kong, Desheng Hu, Xinkang Xu, Xinhui Hu, Yong Qin
Title: RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval
Abstract:
The Contrastive Language-Audio Pretraining (CLAP) model has demonstrated excellent performance in general audio description-related tasks, such as audio retrieval. However, in the emerging field of emotional speaking style description (ESSD), cross-modal contrastive pretraining remains largely unexplored. In this paper, we propose a novel speech retrieval task called emotional speaking style retrieval (ESSR), and ESS-CLAP, an emotional speaking style CLAP model tailored for learning the relationship between speech and natural language descriptions. In addition, we propose relation-augmented CLAP (RA-CLAP) to address the limitation of traditional methods that assume a strict binary relationship between caption and audio. The model leverages self-distillation to learn the potential local matching relationships between speech and descriptions, thereby enhancing generalization ability. The experimental results validate the effectiveness of RA-CLAP, providing a valuable reference for ESSD.
中文: 本研究提出的RA-CLAP模型通过自蒸馏学习语音与文本间的局部匹配关系,在情感说话风格检索任务中展现出更强的泛化能力,为情感语音描述领域提供了有效解决方案。
English: The proposed RA-CLAP model advances emotional speech retrieval by learning nuanced cross-modal relationships through self-distillation, demonstrating superior generalization in emotional speaking style description tasks.

Authors:Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D Plumbley
Title: EnvSDD: Benchmarking Environmental Sound Deepfake Detection
Abstract:
Audio generation systems now create very realistic soundscapes that can enhance media production, but also pose potential risks. Several studies have examined deepfakes in speech or singing voice. However, environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real-world sounds. In addition, existing datasets for environmental sound deepfake detection are limited in scale and audio types. To address this gap, we introduce EnvSDD, the first large-scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio. The test set includes diverse conditions to evaluate the generalizability, such as unseen generation models and unseen datasets. We also propose an audio deepfake detection system, based on a pre-trained audio foundation model. Results on EnvSDD show that our proposed system outperforms the state-of-the-art systems from speech and singing domains.
中文: 音频生成系统能产生高度逼真的环境声音,这带来了检测难题,为此我们推出了EnvSDD大规模数据集和新型检测系统,其性能优于现有技术。
English: Audio generation systems produce highly realistic environmental sounds that pose detection challenges, prompting the introduction of EnvSDD, a large-scale dataset, and a new detection system that outperforms existing methods.

Authors:Jiaming Ji, Wenqi Chen, Kaile Wang, Donghai Hong, Sitong Fang, Boyuan Chen, Jiayi Zhou, Juntao Dai, Sirui Han, Yike Guo, Yaodong Yang
Title: Mitigating Deceptive Alignment via Self-Monitoring
Abstract:
Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment, i.e., situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post-hoc, leaving the model free to scheme during its internal reasoning. We ask: Can deception be intercepted while the model is thinking? We answer this question with CoT Monitor+, the first framework that embeds a Self-Monitor inside the CoT process itself. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals. To study deceptive alignment systematically, we introduce DeceptionBench, a five-category benchmark that probes covert alignment-faking, sycophancy, etc. We evaluate various LLMs and show that unrestricted CoT roughly aggravates the deceptive tendency. In contrast, CoT Monitor+ cuts deceptive behaviors by 43.8% on average while preserving task accuracy. Further, when the self-monitor signal replaces an external weak judge in RL fine-tuning, models exhibit substantially fewer obfuscated thoughts and retain transparency. Our project website can be found at cot-monitor-plus.github.io
中文: 现代大语言模型通过思维链推理可能加剧欺骗性对齐,但提出的CoT Monitor+框架在推理过程中嵌入自我监控机制来标记和抑制未对齐策略,在保持任务准确性的同时平均减少43.8%的欺骗行为。
English: Modern large language models can amplify deceptive alignment through chain-of-thought reasoning, but the proposed CoT Monitor+ framework embeds a self-monitor during reasoning to flag misaligned strategies, reducing deceptive behaviors by 43.8% while maintaining accuracy.

Authors:Jiayi Zhou, Jiaming Ji, Boyuan Chen, Jiapeng Sun, Wenqi Chen, Donghai Hong, Sirui Han, Yike Guo, Yaodong Yang
Title: Generative RLHF-V: Learning Principles from Multi-modal Human Preference
Abstract:
Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods, e.g., reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores; and RL optimization from grouped comparison, which enhances multi-modal RL scoring precision through grouped response comparison. Experimental results demonstrate that, besides out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, compared with only 5.3% for baseline RLHF. We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses. Our code and models can be found at https://generative-rlhf-v.github.io.
Chinese: 生成式RLHF-V是一种创新的对齐框架,将生成式奖励模型与多模态人类反馈强化学习相结合,通过奖励建模和分组比较优化的两阶段流程,在多模态大语言模型的七个基准测试中实现了18.1%的性能提升。
English: Generative RLHF-V is a novel alignment framework that integrates generative reward models with multi-modal reinforcement learning from human feedback, improving multi-modal large language models' performance by 18.1% across benchmarks through a two-stage pipeline of reward modeling and grouped comparison optimization.

Authors:Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang
Title: Hybrid Latent Reasoning via Reinforcement Learning
Abstract:
Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
中文: HRPO提出了一种基于强化学习的混合潜在推理方法,通过可学习的门控机制将离散标记与连续隐藏状态相结合,在知识和推理任务中优于先前方法,同时保持了模型的可解释性。
English: HRPO introduces a reinforcement learning-based hybrid latent reasoning method that combines discrete tokens with continuous hidden states through a learnable gating mechanism, outperforming previous approaches in knowledge and reasoning tasks while maintaining model interpretability.
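
The gating idea described in the abstract above can be illustrated with a minimal sketch. This is not the HRPO implementation: the exact gate parameterization is not given in the abstract, so the per-dimension sigmoid gate and the convex blend below are assumptions, and the names `gated_mix` and `gate_logits` are hypothetical. A large positive gate logit keeps the input close to the sampled token's embedding (as at the start of training), while lower logits let in more of the previous hidden state.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_mix(token_embedding, prev_hidden, gate_logits):
    """Blend a sampled token's embedding with the previous hidden state.

    Assumed form (not taken from the paper): for each dimension,
        out = g * token_embedding + (1 - g) * prev_hidden,  g = sigmoid(logit).
    """
    if not (len(token_embedding) == len(prev_hidden) == len(gate_logits)):
        raise ValueError("all inputs must have the same dimension")
    mixed = []
    for e, h, a in zip(token_embedding, prev_hidden, gate_logits):
        g = sigmoid(a)  # g near 1 -> mostly the discrete token embedding
        mixed.append(g * e + (1.0 - g) * h)
    return mixed


# Gate logits of 0 give an even blend of the two representations.
half_blend = gated_mix([1.0, 2.0], [3.0, 4.0], [0.0, 0.0])

# Large logits keep the input almost entirely token-based.
mostly_token = gated_mix([1.0, 2.0], [3.0, 4.0], [20.0, 20.0])
```

In a trainable model the gate logits would be learned parameters; initializing them high and letting them decrease is one way to realize the "predominantly token embeddings first" schedule the abstract mentions.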

Authors:Jesus Alvarez C, Daua D. Karajeanes, Ashley Celeste Prado, John Ruttan, Ivory Yang, Sean O'Brien, Vasu Sharma, Kevin Zhu
Title: Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language
Abstract:
The digital exclusion of endangered languages remains a critical challenge in NLP, limiting both linguistic research and revitalization efforts. This study introduces the first computational investigation of Comanche, an Uto-Aztecan language on the verge of extinction, demonstrating how minimal-cost, community-informed NLP interventions can support language preservation. We present a manually curated dataset of 412 phrases, a synthetic data generation pipeline, and an empirical evaluation of GPT-4o and GPT-4o-mini for language identification. Our experiments reveal that while LLMs struggle with Comanche in zero-shot settings, few-shot prompting significantly improves performance, achieving near-perfect accuracy with just five examples. Our findings highlight the potential of targeted NLP methodologies in low-resource contexts and emphasize that visibility is the first step toward inclusion. By establishing a foundation for Comanche in NLP, we advocate for computational approaches that prioritize accessibility, cultural sensitivity, and community engagement.
中文: 本研究首次对濒危的科曼奇语进行计算机处理探索,证明低成本、社区参与的NLP方法能通过针对性干预显著提升语言识别效果,为语言保护开辟了新途径。
English: This study pioneers computational methods for the endangered Comanche language, showing that minimal-cost, community-informed NLP techniques can significantly enhance language identification and preservation efforts through targeted interventions.

Authors:Florian Barthel, Wieland Morgenstern, Paul Hinzer, Anna Hilsmann, Peter Eisert
Title: CGS-GAN: 3D Consistent Gaussian Splatting GANs for High Resolution Human Head Synthesis
Abstract:
Recently, 3D GANs based on 3D Gaussian splatting have been proposed for high quality synthesis of human heads. However, existing methods stabilize training and enhance rendering quality from steep viewpoints by conditioning the random latent vector on the current camera position. This compromises 3D consistency, as we observe significant identity changes when re-synthesizing the 3D head with each camera shift. Conversely, fixing the camera to a single viewpoint yields high-quality renderings for that perspective but results in poor performance for novel views. Removing view-conditioning typically destabilizes GAN training, often causing the training to collapse. In response to these challenges, we introduce CGS-GAN, a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent synthesis of human heads without relying on view-conditioning. To ensure training stability, we introduce a multi-view regularization technique that enhances generator convergence with minimal computational overhead. Additionally, we adapt the conditional loss used in existing 3D Gaussian splatting GANs and propose a generator architecture designed to not only stabilize training but also facilitate efficient rendering and straightforward scaling, enabling output resolutions up to $2048^2$. To evaluate the capabilities of CGS-GAN, we curate a new dataset derived from FFHQ. This dataset enables very high resolutions, focuses on larger portions of the human head, reduces view-dependent artifacts for improved 3D consistency, and excludes images where subjects are obscured by hands or other objects. As a result, our approach achieves very high rendering quality, supported by competitive FID scores, while ensuring consistent 3D scene generation. Check out our project page here: https://fraunhoferhhi.github.io/cgs-gan/
中文:CGS-GAN提出了一种新颖的3D高斯泼溅生成对抗网络框架,无需视角条件即可实现稳定训练和高质量3D一致的人头合成,通过多视角正则化和优化的生成器架构,获得了卓越的渲染质量和具有竞争力的FID分数。
English: CGS-GAN introduces a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent human head synthesis without view-conditioning, achieving superior rendering quality and competitive FID scores through multi-view regularization and an optimized generator architecture.

Authors:Sirui Li, Linkai Peng, Zheyuan Zhang, Gorkem Durak, Ulas Bagci
Title: TAGS: 3D Tumor-Adaptive Guidance for SAM
Abstract:
Foundation models (FMs) such as CLIP and SAM have recently shown great promise in image segmentation tasks, yet their adaptation to 3D medical imaging, particularly for pathology detection and segmentation, remains underexplored. A critical challenge arises from the domain gap between natural images and medical volumes: existing FMs, pre-trained on 2D data, struggle to capture 3D anatomical context, limiting their utility in clinical applications like tumor segmentation. To address this, we propose an adaptation framework called TAGS: Tumor Adaptive Guidance for SAM, which unlocks 2D FMs for 3D medical tasks through multi-prompt fusion. By preserving most of the pre-trained weights, our approach enhances SAM's spatial feature extraction using CLIP's semantic insights and anatomy-specific prompts. Extensive experiments on three open-source tumor segmentation datasets show that our model surpasses the state-of-the-art medical image segmentation models (+46.88% over nnUNet), interactive segmentation frameworks, and other established medical FMs, including SAM-Med2D, SAM-Med3D, SegVol, Universal, 3D-Adapter, and SAM-B (at least +13% over them). This highlights the robustness and adaptability of our proposed framework across diverse medical segmentation tasks.
中文: TAGS框架通过多提示融合技术将CLIP和SAM等二维基础模型成功应用于三维医学影像,在多个肿瘤分割数据集中实现了超越现有最优方法的性能突破。
English: The TAGS framework adapts 2D foundation models like CLIP and SAM for 3D medical imaging by integrating multi-prompt fusion, achieving state-of-the-art performance in tumor segmentation across multiple datasets.

Authors:Chaoyang Wang, Xiangtai Li, Lu Qi, Xiaofan Lin, Jinbin Bai, Qianyu Zhou, Yunhai Tong
Title: Conditional Panoramic Image Generation via Masked Autoregressive Modeling
Abstract:
Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited for equirectangular projection (ERP) panoramas due to the violation of the identically and independently distributed (i.i.d.) Gaussian noise assumption caused by their spherical mapping. Second, these methods often treat text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) as separate tasks, relying on distinct architectures and task-specific data. In this work, we propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. PAR avoids the i.i.d. assumption constraint and integrates text and image conditioning into a cohesive architecture, enabling seamless generation across tasks. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence and propose a consistency alignment strategy to improve generation quality. Extensive experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks while showcasing promising scalability and generalization capabilities.
Chinese: 提出的全景自回归模型(PAR)通过掩码自回归建模统一了文本和图像条件生成,避免了扩散模型对全景图的局限性,并采用循环填充和一致性对齐策略来提升空间连贯性和生成质量。
English: The proposed Panoramic AutoRegressive model (PAR) overcomes limitations of diffusion models in panoramic generation by employing masked autoregressive modeling and unifying text and image conditioning, while introducing circular padding and consistency alignment to enhance spatial coherence and quality.
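
Circular padding, as mentioned in the abstract above, can be sketched as follows. This is an illustration only, not PAR's code: the abstract does not say where in the model the padding is applied, so padding along the horizontal (longitude) axis of an equirectangular image is an assumption, and `circular_pad_width` is a hypothetical helper operating on a plain list-of-rows image. The point is that the left and right edges of a panorama are neighbors, so each row is extended with pixels wrapped around from the opposite edge.

```python
def circular_pad_width(image, pad):
    """Pad each row of a 2D image by wrapping pixels around horizontally.

    `image` is a list of equal-length rows; `pad` columns from the right
    edge are prepended and `pad` columns from the left edge are appended.
    """
    if not image:
        return []
    width = len(image[0])
    if pad < 0 or pad > width:
        raise ValueError("pad must be between 0 and the image width")
    if pad == 0:
        # Avoid row[-0:], which would copy the whole row instead of nothing.
        return [list(row) for row in image]
    return [list(row[-pad:]) + list(row) + list(row[:pad]) for row in image]


# A 2x4 toy "panorama": after padding, column 4 sits next to column 1.
panorama = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
padded = circular_pad_width(panorama, 1)
```

With this padding, a convolution or attention window centered on the first column also sees the last column, which is what removes the seam at the wrap-around boundary.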

Authors:Giovanni Pollo, Mohamed Amine Hamdi, Matteo Risso, Lorenzo Ruotolo, Pietro Furbatto, Matteo Isoldi, Yukai Chen, Alessio Burrello, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari, Sara Vinco
Title: MEbots: Integrating a RISC-V Virtual Platform with a Robotic Simulator for Energy-aware Design
Abstract:
Virtual Platforms (VPs) enable early software validation of autonomous systems' electronics, reducing costs and time-to-market. While many VPs support both functional and non-functional simulation (e.g., timing, power), they lack the capability of simulating the environment in which the system operates. In contrast, robotics simulators lack accurate timing and power features. This twofold shortcoming limits the effectiveness of the design flow, as the designer cannot fully evaluate the features of the solution under development. This paper presents a novel, fully open-source framework bridging this gap by integrating a robotics simulator (Webots) with a VP for RISC-V-based systems (MESSY). The framework enables a holistic, mission-level, energy-aware co-simulation of electronics in their surrounding environment, streamlining the exploration of design configurations and advanced power management policies.
中文摘要:本文提出一种开源框架,通过整合机器人模拟器与虚拟平台,实现了对自主系统电子元件的整体性能与能耗协同仿真,有效解决了现有工具在环境交互和精确时序/功耗模拟方面的双重缺陷。
English Summary: This paper introduces an open-source framework that integrates a robotics simulator with a virtual platform to enable holistic, energy-aware co-simulation of autonomous systems, addressing the limitations of existing tools in simulating both environmental interactions and accurate timing/power features.

Authors:Xiaobei Yan, Yiming Li, Zhaoxin Fan, Han Qiu, Tianwei Zhang
Title: BitHydra: Towards Bit-flip Inference Cost Attack against Large Language Models
Abstract:
Large language models (LLMs) have shown impressive capabilities across a wide range of applications, but their ever-increasing size and resource demands make them vulnerable to inference cost attacks, where attackers induce victim LLMs to generate the longest possible output content. In this paper, we revisit existing inference cost attacks and reveal that these methods can hardly produce large-scale malicious effects since they are self-targeting, where attackers are also the users and therefore have to execute attacks solely through the inputs, whose generated content will be charged by LLMs and can only directly influence themselves. Motivated by these findings, this paper introduces a new type of inference cost attack (dubbed 'bit-flip inference cost attack') that targets the victim model itself rather than its inputs. Specifically, we design a simple yet effective method (dubbed 'BitHydra') to flip critical bits of model parameters. This process is guided by a loss function designed to suppress the end-of-sequence (EOS) token's probability, together with an efficient critical bit search algorithm, thus explicitly defining the attack objective and enabling effective optimization. We evaluate our method on 11 LLMs ranging from 1.5B to 14B parameters under both int8 and float16 settings. Experimental results demonstrate that with just 4 search samples and as few as 3 bit flips, BitHydra can force 100% of test prompts to reach the maximum generation length (e.g., 2048 tokens) on representative LLMs such as LLaMA3, highlighting its efficiency, scalability, and strong transferability across unseen inputs.
Chinese: 本文提出首个比特翻转攻击BitHydra,通过修改大语言模型权重持续增加所有用户的推理成本,该方法通过抑制序列结束标记实现,仅需少量比特翻转即可保持隐蔽性且能有效对抗防御措施。
English: This paper introduces BitHydra, the first bit-flip attack that modifies LLM weights to persistently increase inference costs for all users by suppressing end-of-sequence tokens, requiring minimal bit flips while remaining stealthy and effective against defenses.

Authors:Xiaobei Yan, Yiming Li, Hao Wang, Han Qiu, Tianwei Zhang
Title: BitHydra: Towards Bit-flip Inference Cost Attack against Large Language Models
Abstract:
Large language models (LLMs) are widely deployed, but their growing compute demands expose them to inference cost attacks that maximize output length. We reveal that prior attacks are fundamentally self-targeting because they rely on crafted inputs, so the added cost accrues to the attacker's own queries and scales poorly in practice. In this work, we introduce the first bit-flip inference cost attack that directly modifies model weights to induce persistent overhead for all users of a compromised LLM. Such attacks are stealthy yet realistic in practice: for instance, in shared MLaaS environments, co-located tenants can exploit hardware-level faults (e.g., Rowhammer) to flip memory bits storing model parameters. We instantiate this attack paradigm with BitHydra, which (1) minimizes a loss that suppresses the end-of-sequence token (i.e., EOS) and (2) employs an efficient yet effective critical-bit search focused on the EOS embedding vector, sharply reducing the search space while preserving benign-looking outputs. We evaluate across 11 LLMs (1.5B-14B) under int8 and float16, demonstrating that our method efficiently achieves scalable cost inflation with only a few bit flips, while remaining effective even against potential defenses.
Chinese: 本文提出首个比特翻转攻击BitHydra,通过修改大语言模型权重持续增加所有用户的推理成本,该方法通过抑制序列结束标记实现,仅需少量比特翻转即可保持隐蔽性且能有效对抗防御措施。
English: This paper introduces BitHydra, the first bit-flip attack that modifies LLM weights to persistently increase inference costs for all users by suppressing end-of-sequence tokens, requiring minimal bit flips while remaining stealthy and effective against defenses.

Authors:Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, Libo Qin
Title: Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
Abstract:
Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks, with multimodal chain-of-thought (MCoT) further enhancing performance and interpretability. Recent MCoT methods fall into two categories: (i) Textual-MCoT (T-MCoT), which takes multimodal input and produces textual output; and (ii) Interleaved-MCoT (I-MCoT), which generates interleaved image-text outputs. Despite advances in both approaches, the mechanisms driving these improvements are not fully understood. To fill this gap, we first reveal that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format, depending only on clarity and conciseness of expression. Furthermore, to explore visual thoughts systematically, we define four distinct forms of visual thought expressions and analyze them comprehensively. Our findings demonstrate that these forms differ in clarity and conciseness, yielding varying levels of MCoT improvement. Additionally, we explore the internal nature of visual thoughts, finding that they serve as intermediaries between the input image and the reasoning in deeper transformer layers, enabling more advanced visual information transmission. We hope that visual thoughts can inspire further breakthroughs in future MCoT research.
中文摘要:多模态思维链通过引入视觉思维将图像信息有效传递至推理过程,其表达的清晰度和简洁性决定了改进程度,同时视觉思维在Transformer深层中充当中间媒介,实现了更深入的视觉信息传输。
English Summary: MCoT enhances Large Vision-Language Models by introducing visual thoughts that effectively convey image information to the reasoning process, with their clarity and conciseness determining the degree of improvement, while also acting as intermediaries for deeper visual information transmission in transformer layers.

Authors:Laura-Sophia von Hirschhausen, Jannes S. Magnusson, Mykyta Kovalenko, Fredrik Boye, Tanay Rawat, Peter Eisert, Anna Hilsmann, Sebastian Pretzsch, Sebastian Bosse
Title: AppleGrowthVision: A large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchards
Abstract:
Deep learning has transformed computer vision for precision agriculture, yet apple orchard monitoring remains limited by dataset constraints: the lack of diverse, realistic datasets and the difficulty of annotating dense, heterogeneous scenes. Existing datasets overlook different growth stages and stereo imagery, both essential for realistic 3D modeling of orchards and tasks like fruit localization, yield estimation, and structural analysis. To address these gaps, we present AppleGrowthVision, a large-scale dataset comprising two subsets. The first includes 9,317 high resolution stereo images collected from a farm in Brandenburg (Germany), covering six agriculturally validated growth stages over a full growth cycle. The second subset consists of 1,125 densely annotated images from the same farm in Brandenburg and one in Pillnitz (Germany), containing a total of 31,084 apple labels. AppleGrowthVision provides stereo-image data with agriculturally validated growth stages, enabling precise phenological analysis and 3D reconstructions. Extending MinneApple with our data improves YOLOv8 performance by 7.69% in terms of F1-score, while adding it to MinneApple and MAD boosts Faster R-CNN F1-score by 31.06%. Additionally, six BBCH stages were predicted with over 95% accuracy using VGG16, ResNet152, DenseNet201, and MobileNetv2. AppleGrowthVision bridges the gap between agricultural science and computer vision by enabling the development of robust models for fruit detection, growth modeling, and 3D analysis in precision agriculture. Future work includes improving annotation, enhancing 3D reconstruction, and extending multimodal analysis across all growth stages.
中文: 深度学习推动了精准农业的发展,但苹果园监测因数据集受限而受阻;AppleGrowthVision通过大规模立体图像和标注数据填补了这一空白,提升了果实检测和生长建模的性能。
English: Deep learning has advanced precision agriculture, but apple orchard monitoring is hindered by limited datasets, which AppleGrowthVision addresses with a large-scale collection of stereo images and annotated data to improve fruit detection and growth modeling.
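
For readers less familiar with the F1-score used to report the detection gains above, here is a small worked sketch of the standard definition. The counts in the example are made up for illustration and are not results from the paper; `f1_score` is a hypothetical helper name.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        # No detections and no ground-truth objects: define F1 as 0 here.
        return 0.0
    return 2 * true_positives / denominator


# Illustrative counts only: 8 apples found correctly, 2 spurious boxes,
# 2 apples missed. Precision = recall = 0.8, so F1 is also 0.8.
example_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

Because F1 penalizes both missed fruit and spurious detections, it is a common single-number summary for dense detection tasks like the one described.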

Authors:Runchu Tian, Xueqiang Xu, Bowen Jin, SeongKu Kang, Jiawei Han
Title: CoRank: LLM-Based Compact Reranking with Document Features for Scientific Retrieval
Abstract:
Scientific retrieval is essential for advancing scientific knowledge discovery. Within this process, document reranking plays a critical role in refining first-stage retrieval results. However, standard LLM listwise reranking faces challenges in the scientific domain. First-stage retrieval is often suboptimal in the scientific domain, so relevant documents are ranked lower. Meanwhile, conventional listwise reranking places the full text of candidates into the context window, limiting the number of candidates that can be considered. As a result, many relevant documents are excluded before reranking, constraining overall retrieval performance. To address these challenges, we explore semantic-feature-based compact document representations (e.g., categories, sections, and keywords) and propose CoRank, a training-free, model-agnostic reranking framework for scientific retrieval. It presents a three-stage solution: (i) offline extraction of document features, (ii) coarse-grained reranking using these compact representations, and (iii) fine-grained reranking on full texts of the top candidates from (ii). This integrated process addresses suboptimal first-stage retrieval: Compact representations allow more documents to fit within the context window, improving candidate set coverage, while the final fine-grained ranking ensures a more accurate ordering. Experiments on 5 academic retrieval datasets show that CoRank significantly improves reranking performance across different LLM backbones (average nDCG@10 from 50.6 to 55.5). Overall, these results underscore the synergistic interaction between information extraction and information retrieval, demonstrating how structured semantic features can enhance reranking in the scientific domain.
Chinese: 为解决科学检索中传统LLM列表重排的局限,CoRank提出了一种无需训练、模型无关的重排框架,利用紧凑语义特征进行粗粒度重排以扩大候选范围,再对优选文档进行细粒度全文重排,在多个学术数据集上显著提升了检索性能。
English: To overcome the limitations of standard LLM listwise reranking in scientific retrieval, CoRank introduces a training-free framework that uses compact semantic features for coarse-grained reranking to expand candidate coverage, followed by fine-grained reranking on full texts, significantly improving performance across academic datasets.
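
The coarse-then-fine flow described in the abstract above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the CoRank implementation: the LLM listwise reranker is replaced by a toy word-overlap scorer (`overlap_score`, hypothetical), the document fields `features` and `full_text` are invented names, and feature extraction (stage i) is assumed to have already happened offline.

```python
def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words found in the text.

    Stands in for an LLM reranker purely for illustration.
    """
    return len(set(query.lower().split()) & set(text.lower().split()))


def two_stage_rerank(query, docs, coarse_k, score=overlap_score):
    """Rerank on compact features first, then on full text for the shortlist.

    `docs` is a list of dicts with keys "id", "features", and "full_text".
    Returns the ids of the shortlisted documents in their final order.
    """
    # Stage (ii): coarse pass over compact representations, which are short
    # enough that many candidates can be considered at once.
    coarse = sorted(docs, key=lambda d: score(query, d["features"]), reverse=True)
    shortlist = coarse[:coarse_k]
    # Stage (iii): fine pass over the full text of the shortlisted candidates.
    fine = sorted(shortlist, key=lambda d: score(query, d["full_text"]), reverse=True)
    return [d["id"] for d in fine]


docs = [
    {"id": "a", "features": "graph networks chemistry",
     "full_text": "graph networks for molecules"},
    {"id": "b", "features": "retrieval keywords",
     "full_text": "reranking for scientific retrieval"},
    {"id": "c", "features": "scientific retrieval reranking",
     "full_text": "a survey of scientific retrieval"},
]
ranking = two_stage_rerank("scientific retrieval reranking", docs, coarse_k=2)
```

In this toy run the coarse pass shortlists documents "c" and "b" from their features, and the fine pass then reorders them using the full text, which mirrors the division of labor the abstract describes.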

Authors:Jingyu Peng, Maolin Wang, Nan Wang, Xiangyu Zhao, Jiatong Li, Kai Zhang, Qi Liu
Title: Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression
Abstract:
Despite substantial advancements in aligning large language models (LLMs) with human values, current safety mechanisms remain susceptible to jailbreak attacks. We hypothesize that this vulnerability stems from distributional discrepancies between alignment-oriented prompts and malicious prompts. To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems. By converting harmful natural language prompts into formal logical expressions, LogiBreak exploits the distributional gap between alignment data and logic-based inputs, preserving the underlying semantic intent and readability while evading safety constraints. We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.
中文摘要:当前大语言模型的安全机制因对齐数据与恶意提示间的分布差异而存在越狱漏洞,LogiBreak通过将有害提示转化为逻辑表达式,在多语言环境中有效规避了安全防护。
English Summary: Current LLM safety mechanisms are vulnerable to jailbreak attacks due to distributional gaps between alignment data and malicious prompts, which LogiBreak exploits by converting harmful prompts into logical expressions to bypass safeguards across multiple languages.

Authors:Jingyu Peng, Maolin Wang, Nan Wang, Jiatong Li, Yuchen Li, Yuyang Ye, Wanyu Wang, Pengyue Jia, Kai Zhang, Xiangyu Zhao
Title: Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression
Abstract:
Despite substantial advancements in aligning large language models (LLMs) with human values, current safety mechanisms remain susceptible to jailbreak attacks. We hypothesize that this vulnerability stems from distributional discrepancies between alignment-oriented prompts and malicious prompts. To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems. By converting harmful natural language prompts into formal logical expressions, LogiBreak exploits the distributional gap between alignment data and logic-based inputs, preserving the underlying semantic intent and readability while evading safety constraints. We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.
中文摘要:当前大语言模型的安全机制因对齐数据与恶意提示间的分布差异而存在越狱漏洞,LogiBreak通过将有害提示转化为逻辑表达式,在多语言环境中有效规避了安全防护。
English Summary: Current LLM safety mechanisms are vulnerable to jailbreak attacks due to distributional gaps between alignment data and malicious prompts, which LogiBreak exploits by converting harmful prompts into logical expressions to bypass safeguards across multiple languages.

Authors:Sunghwan Kim, Dongjin Kang, Taeyoon Kwon, Hyungjoo Chae, Dongha Lee, Jinyoung Yeo
Title: Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization
Abstract:
Reward models (RMs) play a crucial role in reinforcement learning from human feedback (RLHF), aligning model behavior with human preferences. However, existing benchmarks for reward models show a weak correlation with the performance of optimized policies, suggesting that they fail to accurately assess the true capabilities of RMs. To bridge this gap, we explore several evaluation designs through the lens of reward overoptimization, a phenomenon that captures both how well the reward model aligns with human preferences and the dynamics of the learning signal it provides to the policy. The results highlight three key findings on how to construct a reliable benchmark: (i) it is important to minimize differences between chosen and rejected responses beyond correctness, (ii) evaluating reward models requires multiple comparisons across a wide range of chosen and rejected responses, and (iii) given that reward models encounter responses with diverse representations, responses should be sourced from a variety of models. However, we also observe that an extremely high correlation with the degree of overoptimization leads to comparatively lower correlation with certain downstream performance. Thus, when designing a benchmark, it is desirable to use the degree of overoptimization as a useful tool, rather than the end goal.
Chinese: 奖励模型在基于人类反馈的强化学习中至关重要,但现有基准无法准确评估其真实能力,需通过最小化响应差异、进行多样化比较和采用多模型响应来构建可靠基准,并将过度优化视为工具而非最终目标。
English: Reward models are vital for aligning AI with human preferences in RLHF, but current benchmarks poorly reflect their true effectiveness, requiring evaluations that minimize response differences, use diverse comparisons, and source varied model responses while treating overoptimization as a tool rather than a goal.

Authors:Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, Guangliang Cheng
Title: BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation
Abstract:
Advances in AI generative models facilitate super-realistic video synthesis, amplifying misinformation risks via social media and eroding trust in digital content. Several research works have explored new deepfake detection methods on AI-generated images to alleviate these risks. However, with the fast development of video generation models, such as Sora and WanX, there is currently a lack of large-scale, high-quality AI-generated video datasets for forgery detection. In addition, existing detection approaches predominantly treat the task as binary classification, lacking explainability in model decision-making and failing to provide actionable insights or guidance for the public. To address these challenges, we propose GenBuster-200K, a large-scale AI-generated video dataset featuring 200K high-resolution video clips, diverse latest generative techniques, and real-world scenes. We further introduce BusterX, a novel AI-generated video detection and explanation framework leveraging a multimodal large language model (MLLM) and reinforcement learning for authenticity determination and explainable rationale. To our knowledge, GenBuster-200K is the first large-scale, high-quality AI-generated video dataset that incorporates the latest generative techniques for real-world scenarios. BusterX is the first framework to integrate MLLM with reinforcement learning for explainable AI-generated video detection. Extensive comparisons with state-of-the-art methods and ablation studies validate the effectiveness and generalizability of BusterX. The code, models, and datasets will be released.
中文: 随着AI生成模型的快速发展,虚假信息风险加剧,为此推出了首个大规模高质量AI生成视频数据集GenBuster-200K和首个结合多模态大语言模型与强化学习的可解释检测框架BusterX,以提升检测效果和决策透明度。
English: The rapid advancement of AI generative models has escalated misinformation risks, prompting the creation of GenBuster-200K, a large-scale dataset for AI-generated video detection, and BusterX, an explainable framework using multimodal large language models and reinforcement learning to enhance detection accuracy and transparency.

Authors:Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Bowen Jin, Chetan Kumar Prasad, Sara Szymkuć, Bartosz A. Grzybowski, Ying Diao, Jiawei Han, Ge Liu, Hao Peng, Martin D. Burke, Heng Ji
Title: mCLM: A Function-Infused and Synthesis-Friendly Modular Chemical Language Model
Abstract:
Despite their ability to understand chemical knowledge and accurately generate sequential representations, large language models (LLMs) remain limited in their capacity to propose novel molecules with drug-like properties. In addition, the molecules that LLMs propose can often be challenging to make in the lab. To more effectively enable the discovery of functional small molecules, LLMs need to learn a molecular language. However, LLMs are currently limited by encoding molecules from atoms. In this paper, we argue that just like tokenizing texts into (sub-)word tokens instead of characters, molecules should be decomposed and reassembled at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model that tokenizes molecules into building blocks and learns a bilingual language model of both natural language descriptions of functions and molecule building blocks. By reasoning on such functional building blocks, mCLM is guaranteed to generate efficiently synthesizable molecules thanks to recent progress in block-based chemistry, while also improving the functions of molecules in a principled manner. In experiments on 430 FDA-approved drugs, we find mCLM capable of significantly improving 5 out of 6 chemical functions critical to determining drug potentials. More importantly, mCLM can reason on multiple functions and improve the FDA-rejected drugs ("fallen angels") over multiple iterations to greatly improve their shortcomings.
中文: 大型语言模型难以提出新颖且可合成的类药分子,而mCLM模型通过将分子分解为功能构建块,能够生成可合成分子并系统改善其化学功能。
English: Large language models struggle to propose novel, synthesizable drug-like molecules, but the mCLM model addresses this by tokenizing molecules into functional building blocks, enabling the generation of synthesizable molecules with improved chemical functions.

Authors:Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang
Title: SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
Abstract:
Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either require specialized sensors or fail to effectively exploit depth information for higher-order reasoning. To this end, we propose Spatial Sense and Reasoning (SSR), a novel framework that transforms raw depth data into structured, interpretable textual rationales. These textual rationales serve as meaningful intermediate representations to significantly enhance spatial reasoning capabilities. Additionally, we leverage knowledge distillation to compress the generated rationales into compact latent embeddings, which facilitate resource-efficient and plug-and-play integration into existing VLMs without retraining. To enable comprehensive evaluation, we introduce a new dataset named SSR-CoT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBench, a comprehensive multi-task benchmark. Extensive experiments on multiple benchmarks demonstrate that SSR substantially improves depth utilization and enhances spatial reasoning, thereby advancing VLMs toward more human-like multi-modal understanding. Our project page is at https://yliu-cs.github.io/SSR.
中文:提出的SSR框架将原始深度数据转化为结构化文本推理,通过知识蒸馏和新基准数据集显著增强了视觉语言模型的空间理解能力,无需重新训练即可实现卓越性能。
English: The proposed SSR framework transforms raw depth data into structured textual rationales to enhance spatial reasoning in Visual-Language Models, achieving superior performance through knowledge distillation and a new benchmark dataset without requiring model retraining.

Authors:Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, Bo An
Title: Enhance Mobile Agents Thinking Process Via Iterative Preference Learning
Abstract:
The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRM). To address the above problems, we propose Iterative Preference Learning (IPL), which constructs a CoaT-tree through iterative sampling, scores leaf nodes using a rule-based reward, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution, which leverages GPT-4o to generate diverse Q&A pairs based on real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard mobile GUI-agent benchmarks demonstrate that our agent MobileIPL outperforms strong baselines, including continual pretraining models such as OS-ATLAS and UI-TARS. It achieves state-of-the-art performance across all three benchmarks and shows strong generalization to out-of-domain scenarios.
中文摘要:提出的迭代偏好学习方法通过基于规则的奖励构建行动规划思维链树并采用思维级直接偏好优化,结合三阶段指令演进技术,在移动GUI智能体基准测试中实现了最优性能并展现出强大的泛化能力。
English Summary: The proposed Iterative Preference Learning (IPL) method enhances VLM-based mobile agents by constructing CoaT-trees with rule-based rewards and Thinking-level Direct Preference Optimization, achieving state-of-the-art performance on GUI-agent benchmarks through three-stage instruction evolution.

Authors:Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, Bo An
Title: MobileIPL: Enhancing Mobile Agents Thinking Process via Iterative Preference Learning
Abstract:
The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRM). To address the above problems, we propose Iterative Preference Learning (IPL), which constructs a CoaT-tree through iterative sampling, scores leaf nodes using a rule-based reward, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution, which leverages GPT-4o to generate diverse Q&A pairs based on real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard mobile GUI-agent benchmarks demonstrate that our agent MobileIPL outperforms strong baselines, including continual pretraining models such as OS-ATLAS and UI-TARS. It achieves state-of-the-art performance across all three benchmarks and shows strong generalization to out-of-domain scenarios.
中文摘要:提出的迭代偏好学习方法通过基于规则的奖励构建行动规划思维链树并采用思维级直接偏好优化,结合三阶段指令演进技术,在移动GUI智能体基准测试中实现了最优性能并展现出强大的泛化能力。
English Summary: The proposed Iterative Preference Learning (IPL) method enhances VLM-based mobile agents by constructing CoaT-trees with rule-based rewards and Thinking-level Direct Preference Optimization, achieving state-of-the-art performance on GUI-agent benchmarks through three-stage instruction evolution.
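
The tree-based feedback step described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the mean aggregation rule, and the sibling-comparison rule with a `margin` threshold are all assumptions made for illustration. It only shows how leaf rewards could flow upward through a tree and how (chosen, rejected) pairs could then be read off from sibling nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One reasoning step in a sampled tree (hypothetical structure)."""
    thought: str
    reward: Optional[float] = None  # leaves: set by a rule-based check
    children: List["Node"] = field(default_factory=list)


def backpropagate(node: Node) -> float:
    """Fill in values bottom-up; internal nodes take the mean of their
    children (an assumed aggregation rule, not taken from the paper)."""
    if not node.children:
        if node.reward is None:
            raise ValueError(f"leaf {node.thought!r} has no reward")
        return node.reward
    values = [backpropagate(child) for child in node.children]
    node.reward = sum(values) / len(values)
    return node.reward


def preference_pairs(node: Node, margin: float = 0.0) -> List[Tuple[str, str]]:
    """Collect (chosen, rejected) thoughts among siblings whose values
    differ by more than `margin`. Call backpropagate() first."""
    pairs: List[Tuple[str, str]] = []
    for better in node.children:
        for worse in node.children:
            if better.reward - worse.reward > margin:
                pairs.append((better.thought, worse.thought))
    for child in node.children:
        pairs.extend(preference_pairs(child, margin))
    return pairs
```

Pairing only siblings keeps the shared prefix identical, so each pair differs in exactly one step; whether the paper uses this exact pairing rule is not stated in the abstract.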

Authors:Prashant Shivaram Bhat, Shakib Yazdani, Elahe Arani, Bahram Zonooz
Title: Parameter Efficient Continual Learning with Dynamic Low-Rank Adaptation
Abstract:
Catastrophic forgetting has remained a critical challenge for deep neural networks in Continual Learning (CL) as it undermines consolidated knowledge when learning new tasks. Parameter-efficient fine-tuning CL techniques are gaining traction for their effectiveness in addressing catastrophic forgetting with a lightweight training schedule while avoiding degradation of consolidated knowledge in pre-trained models. However, low-rank adapters (LoRA) in these approaches are highly sensitive to rank selection, which can lead to sub-optimal resource allocation and performance. To this end, we introduce PEARL, a rehearsal-free CL framework that entails dynamic rank allocation for LoRA components during CL training. Specifically, PEARL leverages reference task weights and adaptively determines the rank of task-specific LoRA components based on the current task's proximity to reference task weights in parameter space. To demonstrate the versatility of PEARL, we evaluate it across three vision architectures (ResNet, Separable Convolutional Network and Vision Transformer) and a multitude of CL scenarios, and show that PEARL outperforms all considered baselines by a large margin.
中文:PEARL是一种无需回放的持续学习框架,通过基于任务相似性动态分配LoRA组件的秩来缓解灾难性遗忘,在多种视觉架构和场景中均展现出卓越性能。
English: PEARL is a novel rehearsal-free continual learning framework that dynamically allocates ranks for LoRA components based on task similarity to mitigate catastrophic forgetting, demonstrating superior performance across multiple vision architectures and scenarios.

Authors:Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, Bo An
Title: Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents
Abstract:
VLM-based mobile agents are increasingly popular due to their capabilities to interact with smartphone GUIs and XML-structured texts and to complete daily tasks. However, existing online benchmarks struggle with obtaining stable reward signals due to dynamic environmental changes. Offline benchmarks evaluate the agents through single-path trajectories, which stands in contrast to the inherently multi-solution characteristics of GUI tasks. Additionally, both types of benchmarks fail to assess whether mobile agents can handle noise or engage in proactive interactions, because their evaluations lack noisy apps and rely on overly complete instructions. To address these limitations, we use a slot-based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile-Bench-v2. Mobile-Bench-v2 includes a common task split, with offline multi-path evaluation to assess the agent's ability to obtain step rewards during task execution. It contains a noisy split based on pop-ups and ads apps, and a contaminated split named AITZ-Noise to form a realistic noisy environment. Furthermore, an ambiguous instruction split with preset Q&A interactions is released to evaluate the agent's proactive interaction capabilities. We conduct evaluations on these splits using the single-agent framework AppAgent-v1, the multi-agent framework Mobile-Agent-v2, as well as other mobile agents such as UI-Tars and OS-Atlas. Code and data are available at https://huggingface.co/datasets/xwk123/MobileBench-v2.
中文: 针对现有移动代理基准在动态环境、单一路径评估及缺乏噪声和主动交互测试方面的不足,Mobile-Bench-v2被提出作为一个更现实的基准,通过多路径评估、噪声环境和模糊指令来全面测试代理能力。
English: To overcome the limitations of existing mobile agent benchmarks, which struggle with dynamic environments, single-path evaluations, and lack of noise or proactive interaction assessments, Mobile-Bench-v2 is introduced as a more realistic benchmark featuring multi-path evaluation, noisy environments, and ambiguous instructions to test agent capabilities comprehensively.

Authors:Wei Zhao, Gongsheng Li, Zhefei Gong, Pengxiang Ding, Han Zhao, Donglin Wang
Title: Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions
Abstract:
Vision-Language-Action (VLA) models have recently become highly prominent in the field of robotics. Leveraging vision-language foundation models trained on large-scale internet data, the VLA model can generate robotic actions directly from visual observations and human instructions through a single end-to-end neural network. Despite their effectiveness, current VLA models usually accept only one form of human prompting, language instructions, which may constrain their applicability in open-ended human-robot interactions. For example, a user might expect the robot to retrieve an object shown in an image, follow an instruction written on the whiteboard, or imitate a behavior demonstrated in a video, rather than relying solely on language-based descriptions. To address this gap, we introduce OE-VLA, which explores the potential of VLA models for open-ended multimodal instructions. Extensive results demonstrate that our OE-VLA not only achieves comparable performance to traditional VLA models with linguistic input but also delivers impressive results across four additional categories of open-ended tasks. The proposed methodology could significantly expand the applications of VLA models across various everyday scenarios and facilitate human-robot interaction.
中文摘要:视觉-语言-动作模型虽能通过端到端网络将视觉观察与人类指令转化为机器人动作,但仅支持语言指令限制了其应用范围;新提出的OE-VLA模型实现了对多模态指令的开放处理,在保持语言任务性能的同时,显著拓展了人机交互场景的适用性。
English Summary: Vision-Language-Action (VLA) models are advancing robotics by enabling direct action generation from visual inputs and human instructions, but their limitation to language-only prompts restricts broader application; the proposed OE-VLA model overcomes this by effectively handling multimodal instructions, significantly enhancing versatility in human-robot interactions.

Authors:Wanying Dou, Gorkem Durak, Koushik Biswas, Ziliang Hong, Andrea Mia Bejar, Elif Keles, Kaan Akin, Sukru Mehmet Erturk, Alpay Medetalibeyoglu, Marc Sala, Alexander Misharin, Hatice Savas, Mary Salvatore, Sachin Jambawalikar, Drew Torigian, Jayaram K. Udupa, Ulas Bagci
Title: Predicting Risk of Pulmonary Fibrosis Formation in PASC Patients
Abstract:
While the acute phase of the COVID-19 pandemic has subsided, its long-term effects persist through Post-Acute Sequelae of COVID-19 (PASC), commonly known as Long COVID. There remains substantial uncertainty regarding both its duration and optimal management strategies. PASC manifests as a diverse array of persistent or newly emerging symptoms--ranging from fatigue, dyspnea, and neurologic impairments (e.g., brain fog), to cardiovascular, pulmonary, and musculoskeletal abnormalities--that extend beyond the acute infection phase. This heterogeneous presentation poses substantial challenges for clinical assessment, diagnosis, and treatment planning. In this paper, we focus on imaging findings that may suggest fibrotic damage in the lungs, a critical manifestation characterized by scarring of lung tissue, which can potentially affect long-term respiratory function in patients with PASC. This study introduces a novel multi-center chest CT analysis framework that combines deep learning and radiomics for fibrosis prediction. Our approach leverages convolutional neural networks (CNNs) and interpretable feature extraction, achieving 82.2% accuracy and 85.5% AUC in classification tasks. We demonstrate the effectiveness of Grad-CAM visualization and radiomics-based feature analysis in providing clinically relevant insights for PASC-related lung fibrosis prediction. Our findings highlight the potential of deep learning-driven computational methods for early detection and risk assessment of PASC-related lung fibrosis--presented for the first time in the literature.
中文: 本研究首次提出结合深度学习与影像组学的多中心胸部CT分析框架,用于预测新冠后遗症相关的肺纤维化,以82.2%的准确率展现了早期检测潜力,并通过可解释性分析提供临床见解。
English: This study introduces a novel multi-center chest CT analysis framework combining deep learning and radiomics to predict PASC-related lung fibrosis, achieving 82.2% accuracy and demonstrating potential for early detection through interpretable AI methods.

Authors:Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
Title: AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges
Abstract:
This study critically distinguishes between AI Agents and Agentic AI, offering a structured conceptual taxonomy, application mapping, and challenge analysis to clarify their divergent design philosophies and capabilities. We begin by outlining the search strategy and foundational definitions, characterizing AI Agents as modular systems driven by Large Language Models (LLMs) and Large Image Models (LIMs) for narrow, task-specific automation. Generative AI is positioned as a precursor, with AI Agents advancing through tool integration, prompt engineering, and reasoning enhancements. In contrast, Agentic AI systems represent a paradigmatic shift marked by multi-agent collaboration, dynamic task decomposition, persistent memory, and orchestrated autonomy. Through a sequential evaluation of architectural evolution, operational mechanisms, interaction styles, and autonomy levels, we present a comparative analysis across both paradigms. Application domains such as customer support, scheduling, and data summarization are contrasted with Agentic AI deployments in research automation, robotic coordination, and medical decision support. We further examine unique challenges in each paradigm, including hallucination, brittleness, emergent behavior, and coordination failure, and propose targeted solutions such as ReAct loops, RAG, orchestration layers, and causal modeling. This work aims to provide a definitive roadmap for developing robust, scalable, and explainable AI agent and Agentic AI-driven systems. Keywords: AI Agents, Agent-driven, Vision-Language-Models, Agentic AI Decision Support System, Agentic-AI Applications
Chinese: 本综述区分了AI智能体(基于生成式AI的模块化任务系统)与能动AI(强调多智能体协作与自主协调),同时梳理了它们的应用领域,并针对幻觉和脆弱性等挑战提出了如ReAct和RAG等解决方案。
English: This review differentiates AI Agents as modular, task-specific systems built on generative AI from Agentic AI, which emphasizes multi-agent collaboration and autonomous coordination, while mapping their applications and addressing challenges like hallucination and brittleness with solutions such as ReAct and RAG.

Authors:Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
Title: AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges
Abstract:
This review critically distinguishes between AI Agents and Agentic AI, offering a structured, conceptual taxonomy, application mapping, and analysis of opportunities and challenges to clarify their divergent design philosophies and capabilities. We begin by outlining the search strategy and foundational definitions, characterizing AI Agents as modular systems driven and enabled by LLMs and LIMs for task-specific automation. Generative AI is positioned as a precursor providing the foundation, with AI agents advancing through tool integration, prompt engineering, and reasoning enhancements. We then characterize Agentic AI systems, which, in contrast to AI Agents, represent a paradigm shift marked by multi-agent collaboration, dynamic task decomposition, persistent memory, and coordinated autonomy. Through a chronological evaluation of architectural evolution, operational mechanisms, interaction styles, and autonomy levels, we present a comparative analysis across both AI agents and agentic AI paradigms. Application domains enabled by AI Agents such as customer support, scheduling, and data summarization are then contrasted with Agentic AI deployments in research automation, robotic coordination, and medical decision support. We further examine unique challenges in each paradigm including hallucination, brittleness, emergent behavior, and coordination failure, and propose targeted solutions such as ReAct loops, retrieval-augmented generation (RAG), automation coordination layers, and causal modeling. This work aims to provide a roadmap for developing robust, scalable, and explainable AI-driven systems.
Chinese: 本综述区分了AI智能体(基于生成式AI的模块化任务系统)与能动AI(强调多智能体协作与自主协调),同时梳理了它们的应用领域,并针对幻觉和脆弱性等挑战提出了如ReAct和RAG等解决方案。
English: This review differentiates AI Agents as modular, task-specific systems built on generative AI from Agentic AI, which emphasizes multi-agent collaboration and autonomous coordination, while mapping their applications and addressing challenges like hallucination and brittleness with solutions such as ReAct and RAG.

Authors:Qianru Zhang, Honggang Wen, Wei Yuan, Crystal Chen, Menglin Yang, Siu-Ming Yiu, Hongzhi Yin
Title: HMamba: Hyperbolic Mamba for Sequential Recommendation
Abstract:
Sequential recommendation systems have become a cornerstone of personalized services, adept at modeling the temporal evolution of user preferences by capturing dynamic interaction sequences. Existing approaches predominantly rely on traditional models, including RNNs and Transformers. Despite their success in local pattern recognition, Transformer-based methods suffer from quadratic computational complexity and a tendency toward superficial attention patterns, limiting their ability to infer enduring preference hierarchies in sequential recommendation data. Recent advances in Mamba-based sequential models introduce linear-time efficiency but remain constrained by Euclidean geometry, failing to leverage the intrinsic hyperbolic structure of recommendation data. To bridge this gap, we propose Hyperbolic Mamba, a novel architecture that unifies the efficiency of Mamba's selective state space mechanism with hyperbolic geometry's hierarchical representational power. Our framework introduces (1) a hyperbolic selective state space that maintains curvature-aware sequence modeling and (2) stabilized Riemannian operations to enable scalable training. Experiments across four benchmarks demonstrate that Hyperbolic Mamba achieves 3-11% improvement while retaining Mamba's linear-time efficiency, enabling real-world deployment. This work establishes a new paradigm for efficient, hierarchy-aware sequential modeling.
中文: Hyperbolic Mamba是一种新颖的序列推荐架构,它将Mamba的线性时间效率与双曲几何相结合,能更好地建模层次化用户偏好,在保持计算可扩展性的同时实现了显著的性能提升。
English: Hyperbolic Mamba is a novel sequential recommendation architecture that combines Mamba's linear-time efficiency with hyperbolic geometry to better model hierarchical user preferences, achieving significant performance gains while maintaining computational scalability.

Authors:Zhi Da Soh, Yang Bai, Kai Yu, Yang Zhou, Xiaofeng Lei, Sahil Thakur, Zann Lee, Lee Ching Linette Phang, Qingsheng Peng, Can Can Xue, Rachel Shujuan Chong, Quan V. Hoang, Lavanya Raghavan, Yih Chung Tham, Charumathi Sabanayagam, Wei-Chi Wu, Ming-Chih Ho, Jiangnan He, Preeti Gupta, Ecosse Lamoureux, Seang Mei Saw, Vinay Nangia, Songhomitra Panda-Jonas, Jie Xu, Ya Xing Wang, Xinxing Xu, Jost B. Jonas, Tien Yin Wong, Rick Siow Mong Goh, Yong Liu, Ching-Yu Cheng
Title: An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care
Abstract:
Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptation, we fine-tuned our VFMs to detect ocular and systemic diseases, differentiate ocular disease severity, and identify common ocular signs. The model achieved 100% accuracy in routing fundus images to appropriate VFMs, which achieved ≥ 82.2% accuracy in disease detection, ≥ 89% in severity differentiation, and ≥ 76% in sign identification. Meta-EyeFM was 11% to 43% more accurate than Gemini-1.5-flash and ChatGPT-4o LMMs in detecting various eye diseases and comparable to an ophthalmologist. This system offers enhanced usability and diagnostic performance, making it a valuable decision support tool for primary eye care or an online LLM for fundus evaluation.
中文: Meta-EyeFM是一个多功能基础模型,结合了大型语言模型和视觉基础模型,通过路由机制实现眼科疾病评估,在疾病检测、严重程度区分和体征识别方面准确率高,优于其他模型并与眼科医生水平相当。
English: Meta-EyeFM is a multi-function foundation model combining LLM and VFMs with a routing mechanism for ocular disease assessment, achieving high accuracy in detection, severity differentiation, and sign identification, outperforming other models and matching ophthalmologist performance.

Authors:Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, Hongchao Lu, Donglin Wang
Title: ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning
Abstract:
Vision-Language-Action (VLA) models have shown great potential in general robotic decision-making tasks via imitation learning. However, the variable quality of training data often constrains the performance of these models. On the other hand, offline Reinforcement Learning (RL) excels at learning robust policy models from mixed-quality data. In this paper, we introduce Reinforced robot GPT (ReinboT), a novel end-to-end VLA model that integrates the RL principle of maximizing cumulative reward. ReinboT achieves a deeper understanding of the data quality distribution by predicting dense returns that capture the nuances of manipulation tasks. The dense return prediction capability enables the robot to generate more robust decision-making actions, oriented towards maximizing future benefits. Extensive experiments show that ReinboT achieves state-of-the-art performance on the CALVIN mixed-quality dataset and exhibits superior few-shot learning and out-of-distribution generalization capabilities in real-world tasks.
Chinese: ReinboT是一种新型的视觉-语言-动作模型,通过集成强化学习原理预测密集回报,使机器人能够通过最大化未来奖励实现稳健决策,并在混合质量数据集上取得最先进的性能。
English: ReinboT is a novel Vision-Language-Action model that integrates reinforcement learning principles to predict dense returns, enabling robust robotic decision-making by maximizing future rewards and achieving state-of-the-art performance on mixed-quality datasets.

Authors:Wenqiang Wang, Siyuan Liang, Yangshijie Zhang, Xiaojun Jia, Hao Lin, Xiaochun Cao
Title: No Query, No Access
Abstract:
Textual adversarial attacks mislead NLP models, including Large Language Models (LLMs), by subtly modifying text. While effective, existing attacks often require knowledge of the victim model, extensive queries, or access to training data, limiting real-world feasibility. To overcome these constraints, we introduce the Victim Data-based Adversarial Attack (VDBA), which operates using only victim texts. To avoid accessing the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods as a foundation for developing substitute models. To address the low attack success rate (ASR) due to insufficient information feedback, we propose a hierarchical substitute model design, which generates substitute models to mitigate the failure of a single substitute model at the decision boundary. Concurrently, we use diverse adversarial example generation, employing various attack methods to generate and select the adversarial example with better similarity and attack effectiveness. Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08% while reducing attack queries to 0. More importantly, we discover that VDBA poses a significant threat to LLMs such as Qwen2 and the GPT family, and achieves the highest ASR of 45.99% even without access to the API, confirming that advanced NLP models still face serious security risks. Our code can be found at https://anonymous.4open.science/r/VDBA-Victim-Data-based-Adversarial-Attack-36EC/
中文: VDBA方法提出了一种仅利用受害者文本和分层替代模型设计的对抗性攻击,无需查询模型即可显著提高攻击成功率,对先进大语言模型构成严重威胁。
English: The VDBA method introduces a novel adversarial attack that uses only victim texts and a hierarchical substitute model design to significantly enhance attack success rates without requiring model queries, posing a serious threat to advanced LLMs.

Authors:Ranjan Sapkota, Yang Cao, Konstantinos I. Roumeliotis, Manoj Karkee
Title: Vision-Language-Action Models: Concepts, Progress, Applications and Challenges
Abstract:
Vision-Language-Action (VLA) models mark a transformative advancement in artificial intelligence, aiming to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational review presents a comprehensive synthesis of recent advancements in Vision-Language-Action models, systematically organized across five thematic pillars that structure the landscape of this rapidly evolving field. We begin by establishing the conceptual foundations of VLA systems, tracing their evolution from cross-modal learning architectures to generalist agents that tightly integrate vision-language models (VLMs), action planners, and hierarchical controllers. Our methodology adopts a rigorous literature review framework, covering over 80 VLA models published in the past three years. Key progress areas include architectural innovations, parameter-efficient training strategies, and real-time inference accelerations. We explore diverse application domains such as humanoid robotics, autonomous vehicles, medical and industrial robotics, precision agriculture, and augmented reality navigation. The review further addresses major challenges across real-time control, multimodal action representation, system scalability, generalization to unseen tasks, and ethical deployment risks. Drawing from the state-of-the-art, we propose targeted solutions including agentic AI adaptation, cross-embodiment generalization, and unified neuro-symbolic planning. In our forward-looking discussion, we outline a future roadmap where VLA models, VLMs, and agentic AI converge to power socially aligned, adaptive, and general-purpose embodied agents. This work serves as a foundational reference for advancing intelligent, real-world robotics and artificial general intelligence. Keywords: Vision-language-action, Agentic AI, AI Agents, Vision-language Models
中文摘要:视觉-语言-行动模型作为人工智能领域的突破性进展,将感知、语言理解与具身行动整合为统一框架,虽在机器人学和自主系统应用中成果显著,但仍需克服实时控制与伦理部署等关键挑战。
English Summary: Vision-Language-Action models represent a transformative AI advancement that integrates perception, language, and action into unified systems, with applications spanning robotics and autonomous systems while facing challenges in real-time control and ethical deployment.

Authors:Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, Han Zhao, Siteng Huang, Donglin Wang
Title: OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation
Abstract:
Dual-system VLA (Vision-Language-Action) architectures have become a hot topic in embodied intelligence research, but there is a lack of sufficient open-source work for further performance analysis and optimization. To address this problem, this paper summarizes and compares the structural designs of existing dual-system architectures and conducts systematic empirical evaluations of their core design elements. Ultimately, it provides a low-cost open-source model for further exploration. The project will continue to be updated with more experimental conclusions and open-source models with improved performance. Project page: https://openhelix-robot.github.io/.
中文: 本文针对双系统VLA架构开源资源不足的问题,通过总结结构设计、开展实证评估,并提供一个低成本开源模型以供进一步探索。
English: This paper addresses the lack of open-source dual-system VLA architectures by summarizing their structural designs, conducting empirical evaluations, and providing a low-cost open-source model for further exploration.

Authors:Maolin Wang, Sheng Zhang, Ruocheng Guo, Wanyu Wang, Xuetao Wei, Zitao Liu, Hongzhi Yin, Yi Chang, Xiangyu Zhao
Title: STAR-Rec: Making Peace with Length Variance and Pattern Diversity in Sequential Recommendation
Abstract:
Recent deep sequential recommendation models often struggle to effectively model key characteristics of user behaviors, particularly in handling sequence length variations and capturing diverse interaction patterns. We propose STAR-Rec, a novel architecture that synergistically combines preference-aware attention and state-space modeling through a sequence-level mixture-of-experts framework. STAR-Rec addresses these challenges by: (1) employing preference-aware attention to capture both inherently similar item relationships and diverse preferences, (2) utilizing state-space modeling to efficiently process variable-length sequences with linear complexity, and (3) incorporating a mixture-of-experts component that adaptively routes different behavioral patterns to specialized experts, handling both focused category-specific browsing and diverse category exploration patterns. We theoretically demonstrate how the state space model and attention mechanisms can be naturally unified in recommendation scenarios, where SSM captures temporal dynamics through state compression while attention models both similar and diverse item relationships. Extensive experiments on four real-world datasets demonstrate that STAR-Rec consistently outperforms state-of-the-art sequential recommendation methods, particularly in scenarios involving diverse user behaviors and varying sequence lengths.
中文摘要:STAR-Rec是一种新颖的序列推荐架构,通过专家混合框架协同结合偏好感知注意力与状态空间建模,能有效处理变长序列和多样化交互模式,在多样用户行为和不同序列长度场景下显著优于现有方法。
English Summary: STAR-Rec is a novel sequential recommendation architecture that combines preference-aware attention and state-space modeling through a mixture-of-experts framework to effectively handle variable sequence lengths and diverse interaction patterns, demonstrating superior performance over existing methods.
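
The linear-complexity claim for state-space modeling can be illustrated with a generic scalar recurrence. This is a minimal sketch, not STAR-Rec's architecture: the function name `ssm_scan` and the coefficients `a` and `b` are assumptions chosen for illustration.

```python
from typing import List


def ssm_scan(inputs: List[float], a: float = 0.5, b: float = 1.0) -> List[float]:
    """Run the scalar recurrence h_t = a * h_{t-1} + b * x_t over a sequence.

    The loop touches each input once, so the cost grows linearly with the
    sequence length. The values of `a` and `b` are illustrative assumptions.
    """
    state = 0.0
    states = []
    for x in inputs:
        state = a * state + b * x  # compress the history into one state
        states.append(state)
    return states


short_states = ssm_scan([1.0, 1.0, 1.0])
long_states = ssm_scan([1.0] * 1000)
print(short_states)
print(len(long_states))
```

The same function handles a 3-step and a 1000-step sequence without padding or truncation, which is the property the abstract attributes to the state-space component.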

Authors:Albérick Euraste Djiré, Abdoul Kader Kaboré, Earl T. Barr, Jacques Klein, Tegawendé F. Bissyandé
Title: Memorization or Interpolation ? Detecting LLM Memorization through Input Perturbation Analysis
Abstract:
While Large Language Models (LLMs) achieve remarkable performance through training on massive datasets, they can exhibit concerning behaviors such as verbatim reproduction of training data rather than true generalization. This memorization phenomenon raises significant concerns about data privacy, intellectual property rights, and the reliability of model evaluations. This paper introduces PEARL, a novel approach for detecting memorization in LLMs. PEARL assesses how sensitive an LLM's performance is to input perturbations, enabling memorization detection without requiring access to the model's internals. We investigate how input perturbations affect the consistency of outputs, enabling us to distinguish between true generalization and memorization. Our findings, following extensive experiments on the Pythia open model, provide a robust framework for identifying when the model simply regurgitates learned information. Applied to the GPT-4o models, the PEARL framework not only identified cases of memorization of classic texts from the Bible or common code from HumanEval but also demonstrated that it can provide supporting evidence that some data, such as New York Times news articles, were likely part of the training data of a given model.
中文摘要:PEARL框架通过分析输入扰动下输出的一致性来检测大语言模型中的记忆现象,无需访问模型内部即可识别逐字复现训练数据的行为,并在Pythia和GPT-4o模型的实验中得到验证。
English Summary: The PEARL framework detects memorization in LLMs by analyzing output consistency under input perturbations, identifying verbatim data reproduction without accessing model internals, as validated through experiments on Pythia and GPT-4o models.
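
The idea of measuring sensitivity to input perturbations can be sketched with toy scorers. This is a hedged illustration only, not PEARL's actual procedure: the leave-one-word-out perturbation, the `sensitivity` definition, and the two stand-in "models" (`memorizer`, `generalizer`) are all assumptions.

```python
from statistics import mean
from typing import Callable, List

PROMPT = "in the beginning was the word"


def leave_one_word_out(text: str) -> List[str]:
    """Return one perturbed variant per word, each with that word removed."""
    words = text.split()
    return [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]


def sensitivity(score: Callable[[str], float], text: str) -> float:
    """Score on the original input minus the mean score on perturbed inputs."""
    variants = leave_one_word_out(text)
    return score(text) - mean(score(v) for v in variants)


def memorizer(text: str) -> float:
    """Toy stand-in: succeeds only on the exact string it has seen."""
    return 1.0 if text == PROMPT else 0.0


def generalizer(text: str) -> float:
    """Toy stand-in: degrades gracefully as key words go missing."""
    keywords = {"beginning", "word"}
    return len(keywords & set(text.split())) / len(keywords)


print(sensitivity(memorizer, PROMPT))
print(sensitivity(generalizer, PROMPT))
```

Under these assumptions the memorizing scorer collapses on every perturbed input, while the generalizing scorer loses only a small fraction of its score, which is the kind of contrast a perturbation-based detector looks for.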

Authors:Muna Numan Said, Aarib Zaidi, Rabia Usman, Sonia Okon, Praneeth Medepalli, Kevin Zhu, Vasu Sharma, Sean O'Brien
Title: Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models
Abstract:
The transformative potential of text-to-image (T2I) models hinges on their ability to synthesize culturally diverse, photorealistic images from textual prompts. However, these models often perpetuate cultural biases embedded within their training data, leading to systemic misrepresentations. This paper benchmarks the Component Inclusion Score (CIS), a metric designed to evaluate the fidelity of image generation across cultural contexts. Through extensive analysis involving 2,400 images, we quantify biases in terms of compositional fragility and contextual misalignment, revealing significant performance gaps between Western and non-Western cultural prompts. Our findings underscore the impact of data imbalance, attention entropy, and embedding superposition on model fairness. By benchmarking models like Stable Diffusion with CIS, we provide insights into architectural and data-centric interventions for enhancing cultural inclusivity in AI-generated imagery. This work advances the field by offering a comprehensive tool for diagnosing and mitigating biases in T2I generation, advocating for more equitable AI systems.
中文: 本文提出组件包含评分(CIS)作为评估文本到图像模型文化偏见的基准,揭示了西方与非西方提示词之间的显著差异,并为实现更公平的AI图像生成提出了改进方案。
English: This paper introduces the Component Inclusion Score (CIS) to benchmark cultural biases in text-to-image models, revealing significant disparities between Western and non-Western prompts and proposing interventions for more equitable AI-generated imagery.

Authors:Shaun Baek, Shaun Esua-Mensah, Cyrus Tsui, Sejan Vigneswaralingam, Abdullah Alali, Michael Lu, Vasu Sharma, Sean O'Brien, Kevin Zhu
Title: Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning
Abstract:
Large Language Models (LLMs) are primarily trained on high-resource natural languages, limiting their effectiveness in low-resource settings and in tasks requiring deep logical reasoning. This research introduces Rosetta-PL, a benchmark designed to evaluate LLMs' logical reasoning and generalization capabilities in a controlled environment. We construct Rosetta-PL by translating a dataset of logical propositions from Lean into a custom logical language, which is then used to fine-tune an LLM (e.g., GPT-4o). Our experiments analyze the impact of the size of the dataset and the translation methodology on the performance of the model. Our results indicate that preserving logical relationships in the translation process significantly boosts precision, with accuracy plateauing beyond roughly 20,000 training samples. These insights provide valuable guidelines for optimizing LLM training in formal reasoning tasks and improving performance in various low-resource language applications.
中文: 本研究推出Rosetta-PL基准,通过将逻辑命题翻译成自定义语言来评估大语言模型的逻辑推理能力,结果表明翻译过程中保持逻辑关系能显著提升精确度,且训练样本超过约2万后准确率趋于稳定。
English: This research introduces Rosetta-PL, a benchmark that evaluates LLMs' logical reasoning by translating logical propositions into a custom language, showing that preserving logical relationships in translation enhances precision and accuracy plateaus after about 20,000 training samples.

Authors:Sheng Zhang, Qin Liu, Naoto Usuyama, Cliff Wong, Tristan Naumann, Hoifung Poon
Title: Exploring Scaling Laws for EHR Foundation Models
Abstract:
The emergence of scaling laws has profoundly shaped the development of large language models (LLMs), enabling predictable performance gains through systematic increases in model size, dataset volume, and compute. Yet, these principles remain largely unexplored in the context of electronic health records (EHRs) -- a rich, sequential, and globally abundant data source that differs structurally from natural language. In this work, we present the first empirical investigation of scaling laws for EHR foundation models. By training transformer architectures on patient timeline data from the MIMIC-IV database across varying model sizes and compute budgets, we identify consistent scaling patterns, including parabolic IsoFLOPs curves and power-law relationships between compute, model parameters, data size, and clinical utility. These findings demonstrate that EHR models exhibit scaling behavior analogous to LLMs, offering predictive insights into resource-efficient training strategies. Our results lay the groundwork for developing powerful EHR foundation models capable of transforming clinical prediction tasks and advancing personalized healthcare.
中文总结:本研究首次证实电子健康记录基础模型存在扩展定律,表明模型规模和计算资源的系统增加能够产生类似大语言模型的性能提升规律,为开发临床预测模型奠定基础。
English Summary: This study establishes the first empirical evidence of scaling laws for electronic health record (EHR) foundation models, demonstrating that systematic increases in model size and compute yield predictable performance gains analogous to large language models.
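
A parabolic IsoFLOPs curve is typically read by locating its minimum: for a fixed compute budget, loss is plotted against (log) model size and the vertex marks the compute-optimal size. The sketch below fits a parabola through three points; the data are synthetic placeholders, not results from the paper, and `parabola_vertex` is a hypothetical helper name.

```python
from typing import Tuple

Point = Tuple[float, float]


def parabola_vertex(p1: Point, p2: Point, p3: Point) -> float:
    """Return the x-coordinate of the vertex of the parabola through 3 points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    num = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    den = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    return x2 - 0.5 * num / den


# SYNTHETIC (log10 parameter count, loss) points at one fixed compute budget.
# They lie on loss = (x - 3)^2 + 2 and are for illustration only.
points = [(1.0, 6.0), (2.0, 3.0), (4.0, 3.0)]

best_log_params = parabola_vertex(*points)
best_params = 10 ** best_log_params
print(best_log_params, best_params)
```

Repeating this for several compute budgets and plotting the optimal size against compute is one common way to estimate a power-law relationship, though the paper's exact fitting procedure is not stated in the abstract.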

Authors:Yubin Kim, Zhiyuan Hu, Hyewon Jeong, Eugene Park, Shuyue Stella Li, Chanwoo Park, Shiyun Xiong, MingYu Lu, Hyeonhoon Lee, Xin Liu, Daniel McDuff, Cynthia Breazeal, Samir Tulebaev, Hae Won Park
Title: BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum
Abstract:
Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnosis reasoning), LLMs often struggle with proactive engagement, like unprompted identification of critical missing information or risks. We introduce BehaviorBench, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Our BehaviorBench experiments reveal LLMs' inconsistent proactivity. To address this, we propose BehaviorSFT, a novel training strategy using behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from 95.0% to 96.5% for Qwen2.5-7B-Ins). Crucially, blind clinician evaluations confirmed BehaviorSFT-trained agents exhibit more realistic clinical behavior, striking a superior balance between helpful proactivity (e.g., timely, relevant suggestions) and necessary restraint (e.g., avoiding over-intervention) versus standard fine-tuning or explicitly instructed agents.
中文: BehaviorBench评估了大语言模型在临床任务中主动性不一致的问题,而BehaviorSFT通过行为标记调节模型,使其表现出更平衡、更贴近实际的临床行为。
English: BehaviorBench evaluates LLMs' inconsistent proactivity in clinical tasks, while BehaviorSFT enhances their performance by conditioning models with behavioral tokens for balanced and realistic clinical behavior.

Authors:Hao Li, He Cao, Bin Feng, Yanjun Shao, Xiangru Tang, Zhiyuan Yan, Li Yuan, Yonghong Tian, Yu Li
Title: Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations
Abstract:
While large language models (LLMs) with Chain-of-Thought (CoT) reasoning excel in mathematics and coding, their potential for systematic reasoning in chemistry, a domain demanding rigorous structural analysis for real-world tasks like drug design and reaction engineering, remains untapped. Current benchmarks focus on simple knowledge retrieval, neglecting step-by-step reasoning required for complex tasks such as molecular optimization and reaction prediction. To address this, we introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations, including addition, deletion, and substitution, to formalize chemical problem-solving into transparent, step-by-step workflows. By treating molecular transformations as modular "chemical operations", the framework enables slow-thinking reasoning, mirroring the logic of mathematical proofs while grounding solutions in real-world chemical constraints. We evaluate models on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction. These tasks mirror real-world challenges while providing structured evaluability. By providing annotated datasets, a reasoning taxonomy, and baseline evaluations, ChemCoTBench bridges the gap between abstract reasoning methods and practical chemical discovery, establishing a foundation for advancing LLMs as tools for AI-driven scientific innovation.
中文: 本文提出ChemCoTBench推理框架,通过将分子结构分析与算术化操作相结合实现分步化学问题求解,弥补了大语言模型在分子优化和反应预测等实际化学任务中系统性推理能力的不足。
English: This paper introduces ChemCoTBench, a reasoning framework that bridges molecular structure analysis with arithmetic-inspired operations to enable step-by-step chemical problem-solving, addressing the gap in LLMs' systematic reasoning for real-world chemistry tasks like molecular optimization and reaction prediction.

Authors:Zhongwei Zhang, Fuchen Long, Zhaofan Qiu, Yingwei Pan, Wu Liu, Ting Yao, Tao Mei
Title: MotionPro: A Precise Motion Controller for Image-to-Video Generation
Abstract:
Animating images with interactive motion control has garnered popularity for image-to-video (I2V) generation. Modern approaches typically rely on large Gaussian kernels to extend motion trajectories as conditions without explicitly defining the movement region, leading to coarse motion control and failing to disentangle object and camera motion. To alleviate these issues, we present MotionPro, a precise motion controller that leverages region-wise trajectories and a motion mask to regulate fine-grained motion synthesis and identify the target motion category (i.e., object or camera motion), respectively. Technically, MotionPro first estimates the flow maps on each training video via a tracking model, and then samples the region-wise trajectories to simulate the inference scenario. Instead of extending flow through large Gaussian kernels, our region-wise trajectory approach enables more precise control by directly utilizing trajectories within local regions, thereby effectively characterizing fine-grained movements. A motion mask is simultaneously derived from the predicted flow maps to capture the holistic motion dynamics of the movement regions. To pursue natural motion control, MotionPro further strengthens video denoising by incorporating both the region-wise trajectories and the motion mask through feature modulation. We also construct a benchmark, MC-Bench, with 1.1K user-annotated image-trajectory pairs, for the evaluation of both fine-grained and object-level I2V motion control. Extensive experiments conducted on WebVid-10M and MC-Bench demonstrate the effectiveness of MotionPro. Please refer to our project page for more results: https://zhw-zhang.github.io/MotionPro-page/.
中文: MotionPro通过引入区域轨迹和运动掩码的精确运动控制方法,实现了细粒度运动合成并能区分物体与相机运动,相比依赖粗糙高斯核的传统方法具有显著优势。
English: MotionPro introduces a precise motion control method for image-to-video generation by utilizing region-wise trajectories and motion masks to achieve fine-grained motion synthesis and distinguish between object and camera movement, outperforming previous approaches that relied on coarse Gaussian kernels.

Authors:Jerry Yao-Chieh Hu, Xiwen Zhang, Maojiang Su, Zhao Song, Han Liu
Title: Minimalist Softmax Attention Provably Learns Constrained Boolean Functions
Abstract:
We study the computational limits of learning $k$-bit Boolean functions (specifically, $\mathrm{AND}$, $\mathrm{OR}$, and their noisy variants), using a minimalist single-head softmax-attention mechanism, where $k=\Theta(d)$ relevant bits are selected from $d$ inputs. We show that these simple $\mathrm{AND}$ and $\mathrm{OR}$ functions are unsolvable with a single-head softmax-attention mechanism alone. However, with teacher forcing, the same minimalist attention is capable of solving them. These findings offer two key insights: Architecturally, solving these Boolean tasks requires only minimalist attention, without deep Transformer blocks or FFNs. Methodologically, one gradient descent update with supervision suffices and replaces the multi-step Chain-of-Thought (CoT) reasoning scheme of [Kim and Suzuki, ICLR 2025] for solving Boolean problems. Together, the bounds expose a fundamental gap between what this minimal architecture achieves under ideal supervision and what is provably impossible under standard training.
中文: 研究表明,单头软注意力机制本身无法学习基本布尔函数,但在教师引导下却能成功解决,这表明此类任务仅需极简注意力架构,且监督梯度下降可替代复杂推理方案。
English: This study demonstrates that while single-head softmax-attention alone cannot learn basic Boolean functions, it succeeds with teacher forcing, revealing that minimalist attention architectures suffice for these tasks and that supervised gradient descent can replace complex reasoning schemes.
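
The target functions in this setting can be written out concretely. The sketch below only defines $k$-bit AND and OR over a chosen subset of $d$ input bits and enumerates their truth tables; it does not reproduce the learning results. The toy sizes ($d=4$, $k=2$) and the helper names `make_and` and `make_or` are assumptions.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Bits = Tuple[int, ...]


def make_and(relevant: Sequence[int]) -> Callable[[Bits], int]:
    """AND over the relevant positions; all other bits are ignored."""
    return lambda bits: int(all(bits[i] for i in relevant))


def make_or(relevant: Sequence[int]) -> Callable[[Bits], int]:
    """OR over the relevant positions; all other bits are ignored."""
    return lambda bits: int(any(bits[i] for i in relevant))


d = 4              # total number of input bits (toy size)
relevant = (0, 2)  # the k = 2 relevant positions, unknown to the learner
inputs = list(product((0, 1), repeat=d))

and_true = sum(make_and(relevant)(x) for x in inputs)
or_true = sum(make_or(relevant)(x) for x in inputs)
print(and_true, or_true)
```

The learner's difficulty is identifying which positions are relevant; the functions themselves are trivial once that subset is known.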

Authors:Yu Xia, Rui Zhong, Hao Gu, Wei Yang, Chi Lu, Peng Jiang, Kun Gai
Title: Hierarchical Tree Search-based User Lifelong Behavior Modeling on Large Language Model
Abstract:
Large Language Models (LLMs) have garnered significant attention in Recommendation Systems (RS) due to their extensive world knowledge and robust reasoning capabilities. However, a critical challenge lies in enabling LLMs to effectively comprehend and extract insights from massive user behaviors. Current approaches that directly leverage LLMs for user interest learning face limitations in handling long sequential behaviors, effectively extracting interest, and applying interest in practical scenarios. To address these issues, we propose a Hierarchical Tree Search-based User Lifelong Behavior Modeling framework (HiT-LBM). HiT-LBM integrates Chunked User Behavior Extraction (CUBE) and Hierarchical Tree Search for Interest (HTS) to capture the diverse interests and interest evolution of users. CUBE divides user lifelong behaviors into multiple chunks and learns the interest and interest evolution within each chunk in a cascading manner. HTS generates candidate interests through hierarchical expansion and searches for the optimal interest with a process rating model to ensure information gain for each behavior chunk. Additionally, we design Temporal-Aware Interest Fusion (TIF) to integrate interests from multiple behavior chunks, constructing a comprehensive representation of user lifelong interests. The representation can be embedded into any recommendation model to enhance performance. Extensive experiments demonstrate the effectiveness of our approach, showing that it surpasses state-of-the-art methods.
Chinese Summary: 提出的HiT-LBM框架通过分块行为提取和分层树搜索来建模用户兴趣演化,解决了大语言模型在处理长序列用户行为时的局限性,实验证明其性能优于现有最先进方法。
English Summary: The proposed HiT-LBM framework addresses LLMs' limitations in processing long user behavior sequences by employing chunked behavior extraction and hierarchical tree search to model evolving user interests, with experimental results confirming its superiority over existing methods.

Authors:Zhuoheng Gao, Yihao Li, Jiyao Zhang, Rui Zhao, Tong Wu, Hao Tang, Zhaofei Yu, Hao Dong, Guozhang Chen, Tiejun Huang
Title: SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams
Abstract:
Conventional frame-based cameras often struggle with stereo depth estimation in rapidly changing scenes. In contrast, bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. However, existing methods lack specialized stereo algorithms and benchmarks tailored to the spike data. To address this gap, we propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. To benchmark our approach, we introduce a large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations. SpikeStereoNet outperforms existing methods on both datasets by leveraging spike streams' ability to capture subtle edges and intensity shifts in challenging regions such as textureless surfaces and extreme lighting conditions. Furthermore, our framework exhibits strong data efficiency, maintaining high accuracy even with substantially reduced training data. The source code and datasets will be publicly available.
中文摘要:SpikeStereoNet是首个直接从原始脉冲流估计立体深度的仿脑框架,通过利用脉冲数据的高时间分辨率处理无纹理表面和极端光照等挑战性场景,性能优于现有方法。
English Summary: SpikeStereoNet is the first brain-inspired framework that directly estimates stereo depth from raw spike streams, outperforming existing methods by leveraging spike data's high temporal resolution to handle challenging scenarios like textureless surfaces and extreme lighting.

Authors:Zonglin Yang, Wanhao Liu, Ben Gao, Yujie Liu, Wei Li, Tong Xie, Lidong Bing, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
Title: MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search
Abstract:
Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring, thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.
中文: 本研究提出了细粒度科学假设发现的新任务,将其构建为组合优化问题,并设计了一种分层搜索方法,通过利用大语言模型的内部启发式机制生成详细且可实验验证的假设,在专业标注数据集上显著优于现有基线。
English: This study introduces a novel task of fine-grained scientific hypothesis discovery, framing it as a combinatorial optimization problem and proposing a hierarchical search method that outperforms baselines by leveraging LLMs' internal heuristics to generate detailed, experimentally actionable hypotheses.

Authors:Yongheng Zhang, Xu Liu, Ruoxi Zhou, Qiguang Chen, Hao Fei, Wenpeng Lu, Libo Qin
Title: CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
Abstract:
Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance the large-scale deployment in real-world applications. Nevertheless, the current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in the joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, which can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource to assess LLMs in joint cross-lingual and cross-modal scenarios.
Chinese: 研究大型语言模型在跨语言和跨模态联合场景中的幻觉问题对实际应用至关重要,为此引入的CCHall基准测试表明现有模型在此方面仍面临挑战,可作为重要评估资源。
English: Investigating hallucinations in large language models across joint cross-lingual and cross-modal scenarios is crucial for real-world deployment, leading to the creation of the CCHall benchmark, which reveals current models' struggles and serves as a valuable assessment tool.

Authors:Guofeng Mei, Bin Ren, Juan Liu, Luigi Riz, Xiaoshui Huang, Xu Zheng, Yongshun Gong, Ming-Hsuan Yang, Nicu Sebe, Fabio Poiesi
Title: Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding
Abstract:
Vision-language models like CLIP can offer a promising foundation for 3D scene understanding when extended with 3D tokenizers. However, standard approaches, such as k-nearest neighbor or radius-based tokenization, struggle with cross-domain generalization due to sensitivity to dataset-specific spatial scales. We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. We show that combining superpoint-based grouping with coordinate scale normalization consistently outperforms conventional methods through extensive experimental analysis. Specifically, we introduce S4Token, a tokenization pipeline that produces semantically-informed tokens regardless of scene scale. Our tokenizer is trained without annotations using masked point modeling and clustering-based objectives, along with cross-modal distillation to align 3D tokens with 2D multi-view image features. For dense prediction tasks, we propose a superpoint-level feature propagation module to recover point-level detail from sparse tokens.
Chinese: 作者提出了S4Token这一通用3D标记器,通过结合超点分组与坐标归一化实现尺度不变的表示学习,在无需标注的情况下通过跨模态蒸馏对齐3D标记与2D图像特征,其性能优于传统方法。
English: The authors introduce S4Token, a universal 3D tokenizer that achieves scale-invariant representation learning by combining superpoint grouping with coordinate normalization, outperforming traditional methods and aligning 3D tokens with 2D image features through cross-modal distillation.
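
Coordinate scale normalization can be sketched as centering a point cloud and dividing by its largest radius, so that scenes differing only in scale map to the same coordinates. This is one common normalization, assumed here for illustration; the abstract does not specify S4Token's exact formula, and `normalize_scale` is a hypothetical name.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def normalize_scale(points: List[Point]) -> List[Point]:
    """Center points at their centroid and scale them into the unit sphere."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    centered = [tuple(p[i] - centroid[i] for i in range(3)) for p in points]
    radius = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    if radius == 0.0:
        return centered  # all points coincide; nothing to scale
    return [tuple(c / radius for c in p) for p in centered]


small = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
large = [(10.0 * x, 10.0 * y, 10.0 * z) for x, y, z in small]

print(normalize_scale(small))
print(normalize_scale(large))
```

Both inputs produce the same normalized coordinates, which is why a tokenizer operating after such a step is less sensitive to dataset-specific spatial scales than a fixed-radius or k-nearest-neighbor grouping on raw coordinates.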

Authors:Wanhao Liu, Zonglin Yang, Jue Wang, Lidong Bing, Di Zhang, Dongzhan Zhou, Yuqiang Li, Houqiang Li, Erik Cambria, Wanli Ouyang
Title: MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback
Abstract:
Hypothesis ranking is a crucial component of automated scientific discovery, particularly in natural sciences where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on large language models' internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging due to the impracticality of repeatedly conducting real experiments in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.
Chinese: 本文提出实验引导的假设排序方法,通过先前实验结果优化候选假设优先级,针对预实验方法的不足开发了基于领域知识的模拟器并在化学数据上验证,该方法通过聚类和模拟反馈显著优于现有基线。
English: This paper introduces experiment-guided hypothesis ranking to prioritize candidates using prior experimental results, addressing the limitations of pre-experiment methods by developing a domain-informed simulator validated on chemistry data, which outperforms existing baselines through clustering and simulated feedback.
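
The simulator idea, performance as a function of similarity to a ground-truth hypothesis plus noise, can be sketched as follows. Everything specific here is an assumption: Jaccard similarity over component sets, Gaussian noise, clamping to [0, 1], and the hypothetical component names. The paper's three domain-informed assumptions are not stated in the abstract and are not modeled.

```python
import random
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap in [0, 1]; used here as a stand-in similarity measure."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def simulate_outcome(hypothesis: Set[str], ground_truth: Set[str],
                     noise_std: float, rng: random.Random) -> float:
    """Simulated experimental result: similarity plus noise, clamped to [0, 1]."""
    score = jaccard(hypothesis, ground_truth) + rng.gauss(0.0, noise_std)
    return min(1.0, max(0.0, score))


# Hypothetical component names, for illustration only.
ground_truth = {"catalyst_A", "solvent_B", "ligand_C"}
candidates = {
    "h1": {"catalyst_A", "solvent_B", "ligand_C"},
    "h2": {"catalyst_A", "solvent_B"},
    "h3": {"additive_D"},
}

rng = random.Random(0)
noise_free = {name: simulate_outcome(h, ground_truth, 0.0, rng)
              for name, h in candidates.items()}
ranked = sorted(noise_free, key=noise_free.get, reverse=True)
print(ranked)
```

With `noise_std` above zero, repeated calls give different outcomes for the same hypothesis, which is what makes feedback from tested candidates informative but imperfect for ranking the untested ones.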

Authors:Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang
Title: HydraRAG: Structured Cross-Source Enhanced Large Language Model Reasoning
Abstract:
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Current hybrid RAG systems retrieve evidence from both knowledge graphs (KGs) and text documents to support LLM reasoning. However, they face challenges such as handling multi-hop reasoning, multi-entity questions, multi-source verification, and effective graph utilization. To address these limitations, we present HydraRAG, a training-free framework that unifies graph topology, document semantics, and source reliability to support deep, faithful reasoning in LLMs. HydraRAG handles multi-hop and multi-entity problems through agent-driven exploration that combines structured and unstructured retrieval, increasing both diversity and precision of evidence. To tackle multi-source verification, HydraRAG uses tri-factor cross-source verification (source trustworthiness assessment, cross-source corroboration, and entity-path alignment) to balance topic relevance with cross-modal agreement. By leveraging graph structure, HydraRAG fuses heterogeneous sources, guides efficient exploration, and prunes noise early. Comprehensive experiments on seven benchmark datasets show that HydraRAG achieves overall state-of-the-art results on all benchmarks with GPT-3.5-Turbo, outperforming the strong hybrid baseline ToG-2 by an average of 20.3% and up to 30.1%. Furthermore, HydraRAG enables smaller models (e.g., Llama-3.1-8B) to achieve reasoning performance comparable to that of GPT-4-Turbo. The source code is available at https://stevetantan.github.io/HydraRAG/.
中文: HydraRAG是一个无需训练的新框架,通过统一图拓扑、文档语义和来源可靠性来解决多跳推理和多实体问题,在多个基准测试中均取得了最优性能。
English: HydraRAG is a training-free framework that enhances large language models by unifying graph topology, document semantics, and source reliability to solve multi-hop reasoning and multi-entity problems while achieving state-of-the-art results across benchmarks.

Authors:Anna Ivagnes, Giovanni Stabile, Gianluigi Rozza
Title: Data-driven Closure Strategies for Parametrized Reduced Order Models via Deep Operator Networks
Abstract:
In this paper, we propose an equation-based parametric Reduced Order Model (ROM), whose accuracy is improved with data-driven terms added to the reduced equations. These additions aim to reintroduce contributions that standard reduced-order approaches do not take into account. In particular, we focus on a Proper Orthogonal Decomposition (POD)-based formulation, and our goal is to build a closure or correction model that reintroduces the contribution of the discarded modes. The approach has been investigated in previous works, and the goal of this manuscript is to extend the model to a parametric setting using machine learning procedures, in particular deep operator networks. Specifically, we model the closure terms through a deep operator network that takes as input the reduced variables and the parameters of the problem. We tested the methods on three test cases with different behaviors: the periodic turbulent flow past a circular cylinder, the unsteady turbulent flow in a channel-driven cavity, and the geometrically-parametrized backstep flow. The performance of the machine learning-enhanced ROM is studied in depth in different modal regimes, and it considerably improves the pressure and velocity accuracy with respect to the standard POD-Galerkin approach.
中文: 本文提出了一种结合数据驱动项的参数化降阶模型,通过深度算子网络重建被忽略模态的贡献,在三种湍流测试中相比传统方法显著提升了压力和速度的精度。
English: This paper introduces a parametric reduced order model enhanced with data-driven closure terms, using deep operator networks to improve accuracy by reintroducing discarded mode contributions, which significantly outperforms standard methods in three turbulent flow test cases.

Authors:Dong Li, Wenqi Zhong, Wei Yu, Yingwei Pan, Dingwen Zhang, Ting Yao, Junwei Han, Tao Mei
Title: Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction
Abstract:
Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments conducted on the VITON-HD, VVT and ViViD datasets demonstrate the superiority of our DPIDM against the baseline methods. Notably, DPIDM achieves a VFID score of 0.506 on the VVT dataset, a 60.5% improvement over the state-of-the-art GPD-VVTO approach.
中文: 提出的动态姿态交互扩散模型(DPIDM)通过分层注意力机制建模时空姿态交互,解决了视频虚拟试穿中的时序不一致问题,并在基准数据集上实现了最先进的性能。
English: The proposed Dynamic Pose Interaction Diffusion Models (DPIDM) framework addresses temporal inconsistencies in video virtual try-on by modeling spatiotemporal pose interactions through hierarchical attention mechanisms and achieves state-of-the-art performance on benchmark datasets.

Authors:Yurui Qian, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei
Title: Creatively Upscaling Images with Global-Regional Priors
Abstract:
Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1,024 X 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe of tuning-free image upscaling that pivots on global-regional priors derived from given global prompt and estimated regional prompts via Multimodal LLM. Technically, the low-frequency component of low-resolution image is recognized as global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between global prompt and each region during regional denoising, leading to regional attention prior that alleviates object repetition issue. The estimated regional prompts containing rich descriptive details further act as regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., 4,096 X 4,096 and 8,192 X 8,192) with higher visual fidelity and more creative regional details.
中文: C-Upscale是一种无需调整的图像放大方法,利用全局与区域提示先验,在生成超高分辨率图像时保持语义一致性并增强区域细节的创造性。
English: C-Upscale is a tuning-free image upscaling method that leverages global-regional priors from prompts to generate ultra-high-resolution images while preserving semantic consistency and enhancing creative details.

Authors:Jianfeng Deng, Qingfeng Chen, Debo Cheng, Jiuyong Li, Lin Liu, Shichao Zhang
Title: A Novel Generative Model with Causality Constraint for Mitigating Biases in Recommender Systems
Abstract:
Accurately predicting counterfactual user feedback is essential for building effective recommender systems. However, latent confounding bias can obscure the true causal relationship between user feedback and item exposure, ultimately degrading recommendation performance. Existing causal debiasing approaches often rely on strong assumptions (such as the availability of instrumental variables (IVs) or strong correlations between latent confounders and proxy variables) that are rarely satisfied in real-world scenarios. To address these limitations, we propose a novel generative framework called Latent Causality Constraints for Debiasing representation learning in Recommender Systems (LCDR). Specifically, LCDR leverages an identifiable Variational Autoencoder (iVAE) as a causal constraint to align the latent representations learned by a standard Variational Autoencoder (VAE) through a unified loss function. This alignment allows the model to leverage even weak or noisy proxy variables to recover latent confounders effectively. The resulting representations are then used to improve recommendation performance. Extensive experiments on three real-world datasets demonstrate that LCDR consistently outperforms existing methods in both mitigating bias and improving recommendation accuracy.
Chinese Summary: 本文提出LCDR这一新型生成框架,通过可识别变分自编码器从弱代理变量中恢复潜在混杂因子,有效缓解偏差并在真实数据集上持续提升推荐性能。
English Summary: The paper introduces LCDR, a novel generative framework that uses an identifiable Variational Autoencoder to recover latent confounders from weak proxy variables, effectively mitigating bias and enhancing recommendation accuracy across real-world datasets.

Authors:Miao Yu, Liang Lin, Guibin Zhang, Xinfeng Li, Junfeng Fang, Ningyu Zhang, Kun Wang, Yang Wang
Title: UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models
Abstract:
Large language models require iterative updates to address challenges such as knowledge conflicts and outdated information (e.g., incorrect, private, or illegal contents). Machine unlearning provides a systematic methodology for targeted knowledge removal from trained models, enabling elimination of sensitive information influences. However, mainstream fine-tuning-based unlearning methods often fail to balance unlearning efficacy and model ability, frequently resulting in catastrophic model collapse under extensive knowledge removal. Meanwhile, in-context unlearning, which relies solely on contextual prompting without modifying the model's intrinsic mechanisms, suffers from limited generalizability and struggles to achieve true unlearning. In this work, we introduce UniErase, a novel unlearning paradigm that employs a learnable parametric suffix (unlearning token) to steer language models toward targeted forgetting behaviors. UniErase operates through two key phases: (I) an optimization stage that binds desired unlearning outputs to the model's autoregressive probability distribution via token optimization, followed by (II) a lightweight model editing phase that activates the learned token to probabilistically induce the specified forgetting objective. Serving as a new research direction for token learning to induce unlearning targets, UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under fictitious and real-world knowledge settings. Remarkably, on the TOFU benchmark, UniErase, modifying only around 3.66% of the LLM parameters, outperforms the previous forgetting SOTA baseline by around 4.01 times for model ability with even better unlearning efficacy. Similarly, UniErase, maintaining more ability, also surpasses the previous retaining SOTA by 35.96% for unlearning efficacy, showing dual top-tier performance in the current unlearning domain.
中文摘要:UniErase通过优化的遗忘令牌和轻量级编辑,提出了一种精准平衡的遗忘框架,在知识消除与模型能力保留方面均显著优于现有方法。
English Summary: UniErase introduces a precise and balanced unlearning framework using optimized tokens and lightweight edits, significantly outperforming existing methods in both knowledge removal and model ability retention.

Authors:Miao Yu, Liang Lin, Guibin Zhang, Xinfeng Li, Junfeng Fang, Xingrui Yu, Ivor Tsang, Ningyu Zhang, Kun Wang, Yang Wang
Title: UniErase: Towards Balanced and Precise Unlearning in Language Models
Abstract:
Large language models (LLMs) require iterative updates to address the outdated information problem, where LLM unlearning offers an approach for selective removal. However, mainstream unlearning methods primarily rely on fine-tuning techniques, which often lack precision in targeted unlearning and struggle to balance unlearning efficacy with general ability under massive and sequential settings. To bridge this gap, in this work, we introduce UniErase, a novel unlearning framework that demonstrates precision and balanced performances between knowledge unlearning and ability retaining. We first propose the Unlearning Token, which is optimized to steer LLMs toward a forgetting space. To achieve concrete unlearning behaviors, we further introduce the lightweight Unlearning Edit to efficiently associate the unlearning targets with this meta-token. Serving as a new unlearning paradigm via editing, UniErase achieves outstanding performances across batch, sequential, and precise unlearning tasks under fictitious and real-world knowledge scenarios. On the TOFU benchmark, compared with 8 baselines, UniErase, modifying only ~3.66% of the LLM parameters, outperforms the previous best-forgetting baseline by ~4.01× for model ability with even higher unlearning efficacy. Similarly, UniErase, with better ability retention, also surpasses the previous best-retaining method by 35.96% for unlearning efficacy, showing balanced and dual top-tier performances in the current unlearning community.
中文摘要:UniErase通过优化的遗忘令牌和轻量级编辑,提出了一种精准平衡的遗忘框架,在知识消除与模型能力保留方面均显著优于现有方法。
English Summary: UniErase introduces a precise and balanced unlearning framework using optimized tokens and lightweight edits, significantly outperforming existing methods in both knowledge removal and model ability retention.

Authors:Peng Wang, Ruihan Tao, Qiguang Chen, Mengkang Hu, Libo Qin
Title: X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System
Abstract:
Recently, large language model (LLM)-based agents have achieved significant success in interactive environments, attracting considerable academic and industrial attention. Despite these advancements, current research predominantly focuses on English scenarios. In reality, there are over 7,000 languages worldwide, all of which demand access to comparable agentic services. Nevertheless, the development of language agents remains inadequate for meeting the diverse requirements of multilingual agentic applications. To fill this gap, we introduce X-WebAgentBench, a novel multilingual agent benchmark in an interactive web environment, which evaluates the planning and interaction performance of language agents across multiple languages, thereby contributing to the advancement of global agent intelligence. Additionally, we assess the performance of various LLMs and cross-lingual alignment methods, examining their effectiveness in enhancing agents. Our findings reveal that even advanced models like GPT-4o, when combined with cross-lingual techniques, fail to achieve satisfactory results. We hope that X-WebAgentBench can serve as a valuable benchmark for multilingual agent scenarios in real-world applications.
中文: 本文提出X-WebAgentBench多语言基准,用于评估网络环境中的语言智能体,发现即便是GPT-4o等先进模型结合跨语言技术,仍难以满足多样化的多语言应用需求。
English: This paper introduces X-WebAgentBench, a multilingual benchmark for evaluating language agents in web environments, revealing that even advanced models like GPT-4o with cross-lingual methods still underperform in meeting diverse multilingual requirements.

Authors:Yuchen Li, Chaoran Feng, Zhenyu Tang, Kaiyuan Deng, Wangbo Yu, Yonghong Tian, Li Yuan
Title: GS2E: Gaussian Splatting is an Effective Data Generator for Event Stream Generation
Abstract:
We introduce GS2E (Gaussian Splatting to Event), a large-scale synthetic event dataset for high-fidelity event vision tasks, captured from real-world sparse multi-view RGB images. Existing event datasets are often synthesized from dense RGB videos, which typically lack viewpoint diversity and geometric consistency, or depend on expensive, difficult-to-scale hardware setups. GS2E overcomes these limitations by first reconstructing photorealistic static scenes using 3D Gaussian Splatting, and subsequently employing a novel, physically-informed event simulation pipeline. This pipeline generally integrates adaptive trajectory interpolation with physically-consistent event contrast threshold modeling. Such an approach yields temporally dense and geometrically consistent event streams under diverse motion and lighting conditions, while ensuring strong alignment with underlying scene structures. Experimental results on event-based 3D reconstruction demonstrate GS2E's superior generalization capabilities and its practical value as a benchmark for advancing event vision research.
中文: GS2E是一个基于真实世界稀疏多视角RGB图像、通过3D高斯泼溅和新型事件模拟流程构建的大规模合成事件数据集,它克服了现有数据集的局限性,能在多样化条件下提供几何一致的事件流,有力推进事件视觉研究发展。
English: GS2E is a large-scale synthetic event dataset created from real-world sparse multi-view RGB images using 3D Gaussian Splatting and a novel event simulation pipeline, overcoming limitations of existing datasets by providing geometrically consistent event streams under diverse conditions for advancing event vision research.

Authors:Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Title: MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision
Abstract:
Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference. We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design. MAS-ZERO employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic agent composition and problem decomposition through meta-feedback on solvability and completeness. Experiments across math, graduate-level QA, and software engineering benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that MAS-ZERO outperforms both manual and automatic MAS baselines, achieving a 7.44% average accuracy improvement over the next strongest baseline while maintaining cost-efficiency. These findings underscore the promise of meta-level self-evolved design for creating effective and adaptive MAS.
中文: MAS-ZERO 是一个自我演进的推理时框架,能动态设计多代理系统,无需人工设定角色或验证集,在多项基准测试中实现了卓越性能。
English: MAS-ZERO is a self-evolved framework that dynamically designs multi-agent systems at inference time, eliminating the need for manual roles or validation sets and achieving superior performance across various benchmarks.

Authors:Yu Zhang, Wenxiang Guo, Changhao Pan, Dongyu Yao, Zhiyuan Zhu, Ziyue Jiang, Yuhan Wang, Tao Jin, Zhou Zhao
Title: TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
Abstract:
Customizable multilingual zero-shot singing voice synthesis (SVS) has various potential applications in music composition and short video dubbing. However, existing SVS models overly depend on phoneme and note boundary annotations, limiting their robustness in zero-shot scenarios and producing poor transitions between phonemes and notes. Moreover, they also lack effective multi-level style control via diverse prompts. To overcome these challenges, we introduce TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts. TCSinger 2 mainly includes three key modules: 1) Blurred Boundary Content (BBC) Encoder, predicts duration, extends content embedding, and applies masking to the boundaries to enable smooth transitions. 2) Custom Audio Encoder, uses contrastive learning to extract aligned representations from singing, speech, and textual prompts. 3) Flow-based Custom Transformer, leverages Cus-MOE, with F0 supervision, enhancing both the synthesis quality and style modeling of the generated singing voice. Experimental results show that TCSinger 2 outperforms baseline models in both subjective and objective metrics across multiple related tasks. Singing voice samples are available at https://aaronz345.github.io/TCSinger2Demo/.
中文: TCSinger 2 是一种多语言零样本歌声合成模型,通过三个关键模块解决了边界标注和风格控制的局限性,在平滑过渡和多层次风格建模上表现卓越。
English: TCSinger 2 is a multilingual zero-shot singing voice synthesis model that overcomes limitations in boundary annotations and style control through three innovative modules, achieving superior performance in smooth transitions and multi-level style modeling.

Authors:Huihao Jing, Haoran Li, Wenbin Hu, Qi Hu, Heli Xu, Tianshu Chu, Peizhao Hu, Yangqiu Song
Title: MCIP: Protecting MCP Safety via Model Contextual Integrity Protocol
Abstract:
As Model Context Protocol (MCP) introduces an easy-to-use ecosystem for users and developers, it also brings underexplored safety risks. Its decentralized architecture, which separates clients and servers, poses unique challenges for systematic safety analysis. This paper proposes a novel framework to enhance MCP safety. Guided by the MAESTRO framework, we first analyze the missing safety mechanisms in MCP, and based on this analysis, we propose the Model Contextual Integrity Protocol (MCIP), a refined version of MCP that addresses these gaps. Next, we develop a fine-grained taxonomy that captures a diverse range of unsafe behaviors observed in MCP scenarios. Building on this taxonomy, we develop benchmark and training data that support the evaluation and improvement of LLMs' capabilities in identifying safety risks within MCP interactions. Leveraging the proposed benchmark and training data, we conduct extensive experiments on state-of-the-art LLMs. The results highlight LLMs' vulnerabilities in MCP interactions and demonstrate that our approach substantially improves their safety performance.
中文: 本文针对模型上下文协议(MCP)的安全风险,提出了模型上下文完整性协议(MCIP),建立了不安全行为分类体系,并通过基准测试显著提升了大型语言模型在MCP交互中的安全性能。
English: This paper introduces a framework to address safety risks in the Model Context Protocol (MCP) by proposing the Model Contextual Integrity Protocol (MCIP), developing a taxonomy for unsafe behaviors, and creating benchmarks that enhance LLMs' safety performance in MCP interactions.

Authors:Wenbin Hu, Haoran Li, Huihao Jing, Qi Hu, Ziqian Zeng, Sirui Han, Heli Xu, Tianshu Chu, Peizhao Hu, Yangqiu Song
Title: Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning
Abstract:
While Large Language Models (LLMs) exhibit remarkable capabilities, they also introduce significant safety and privacy risks. Current mitigation strategies often fail to preserve contextual reasoning capabilities in risky scenarios. Instead, they rely heavily on sensitive pattern matching to protect LLMs, which limits the scope. Furthermore, they overlook established safety and privacy standards, leading to systemic risks for legal compliance. To address these gaps, we formulate safety and privacy issues into contextualized compliance problems following the Contextual Integrity (CI) theory. Under the CI framework, we align our model with three critical regulatory standards: GDPR, EU AI Act, and HIPAA. Specifically, we employ reinforcement learning (RL) with a rule-based reward to incentivize contextual reasoning capabilities while enhancing compliance with safety and privacy norms. Through extensive experiments, we demonstrate that our method not only significantly enhances legal compliance (achieving a +8.58% accuracy improvement in safety/privacy benchmarks) but also further improves general reasoning capability. For OpenThinker-7B, a strong reasoning model that significantly outperforms its base model Qwen2.5-7B-Instruct across diverse subjects, our method enhances its general reasoning capabilities, with +2.05% and +8.98% accuracy improvement on the MMLU and LegalBench benchmark, respectively.
中文摘要:本研究通过强化学习将情境合规性融入大型语言模型的安全与隐私保护中,在显著提升法律遵从性的同时进一步增强了模型的通用推理能力。
English Summary: This study addresses safety and privacy risks in Large Language Models by integrating contextual compliance through reinforcement learning, achieving significant improvements in both legal adherence and general reasoning capabilities.

Authors:Huimin Xu, Xin Mao, Feng-Lin Li, Xiaobao Wu, Wang Chen, Wei Zhang, Anh Tuan Luu
Title: SCOPE: Compress Mathematical Reasoning Steps for Efficient Automated Process Annotation
Abstract:
Process Reward Models (PRMs) have demonstrated promising results in mathematical reasoning, but existing process annotation approaches, whether through human annotations or Monte Carlo simulations, remain computationally expensive. In this paper, we introduce Step COmpression for Process Estimation (SCOPE), a novel compression-based approach that significantly reduces annotation costs. We first translate natural language reasoning steps into code and normalize them through Abstract Syntax Tree, then merge equivalent steps to construct a prefix tree. Unlike simulation-based methods that waste numerous samples on estimation, SCOPE leverages a compression-based prefix tree where each root-to-leaf path serves as a training sample, reducing the complexity from $O(NMK)$ to $O(N)$. We construct a large-scale dataset containing 196K samples with only 5% of the computational resources required by previous methods. Empirical results demonstrate that PRMs trained on our dataset consistently outperform existing automated annotation approaches on both Best-of-N strategy and ProcessBench.
中文: 本文提出SCOPE,一种基于压缩的方法,通过将推理步骤转化为代码、用抽象语法树规范化并合并等效步骤构建前缀树,显著降低了过程奖励模型的标注成本,在计算资源大幅减少的情况下性能优于现有方法。
English: The paper introduces SCOPE, a compression-based method that reduces annotation costs for Process Reward Models by converting reasoning steps into code, normalizing them via Abstract Syntax Trees, and merging equivalent steps into a prefix tree, achieving significant computational savings and outperforming existing approaches.
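
The merging idea described in the SCOPE abstract can be illustrated with a minimal sketch. It assumes the reasoning steps have already been translated into Python code (the translation step is not shown), uses `ast.dump` as a stand-in for the paper's AST normalization, and stores per-node correctness counts as a hypothetical way to label steps. The names `normalize`, `Node`, `build_prefix_tree`, and `count_nodes` are illustrative and do not come from the paper.

```python
import ast


def normalize(step_code: str) -> str:
    """Canonical form of one step: parsing to an AST removes formatting differences."""
    return ast.dump(ast.parse(step_code))


class Node:
    """One merged reasoning step in the prefix tree."""

    def __init__(self):
        self.children = {}  # normalized step -> child Node
        self.count = 0      # solutions passing through this step
        self.correct = 0    # of those, how many ended with a correct answer


def build_prefix_tree(solutions):
    """Insert each (steps, is_correct) solution; equivalent prefixes share nodes."""
    root = Node()
    for steps, is_correct in solutions:
        node = root
        for step in steps:
            node = node.children.setdefault(normalize(step), Node())
            node.count += 1
            node.correct += int(is_correct)
    return root


def count_nodes(node):
    """Number of distinct merged steps below this node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Three sampled solutions; the first two differ only in whitespace.
solutions = [
    (["x = 3 + 4", "y = x * 2"], True),
    (["x = 3+4", "y = x*2"], True),
    (["x=3+4", "y = x + 2"], False),
]
root = build_prefix_tree(solutions)
print(count_nodes(root))  # six raw steps collapse into three merged nodes
```

In this sketch the tree is built in a single pass over the sampled solutions, and each node's `correct / count` ratio could serve as a step-level score without extra rollouts; whether SCOPE derives its labels this way is an assumption, not something the abstract states.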

Authors:Hongru Wang, Deng Cai, Wanjun Zhong, Shijue Huang, Jeff Z. Pan, Zeming Liu, Kam-Fai Wong
Title: Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
Abstract:
Inference-time scaling, which significantly enhances the performance of Large Language Models (LLMs) in complex reasoning tasks by increasing the length of Chain-of-Thought (CoT), has attracted much attention. These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition, such as reflection and decomposition, and are difficult to create and acquire. In this work, we introduce the Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) on how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model's initial performance but also ensures more stable and consistent improvements in subsequent iterations. Our proposed SRLM achieves an average absolute improvement of more than +2.5 points across five reasoning tasks: MMLU, GSM8K, ARC-C, HellaSwag, and BBH on two backbone models. Moreover, it brings larger improvements with more sampling during inference, such as an absolute +7.89 average improvement with 64 samples, revealing the in-depth, diverse and creative reasoning paths in SRLM against the strong baseline.
中文:自推理语言模型(SRLM)通过自训练生成长思维链数据,有效提升大语言模型的推理能力,在多项任务中实现显著性能提升,且推理时采样次数越多改进越大。
English: The Self-Reasoning Language Model (SRLM) enhances reasoning in LLMs by generating longer Chain-of-Thought data through self-training, achieving significant performance gains across multiple tasks and improving further with increased sampling during inference.

Authors:Guangke Chen, Fu Song, Zhe Zhao, Xiaojun Jia, Yang Liu, Yanchen Qiao, Weizhe Zhang
Title: AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models
Abstract:
Jailbreak attacks on large audio-language models (LALMs) have been studied recently, but they achieve suboptimal effectiveness, applicability, and practicability, particularly because they assume that the adversary can fully manipulate user prompts. In this work, we first conduct an extensive experiment showing that advanced text jailbreak attacks cannot be easily ported to end-to-end LALMs via text-to-speech (TTS) techniques. We then propose AudioJailbreak, a novel audio jailbreak attack, featuring (1) asynchrony: the jailbreak audio does not need to align with user prompts in the time axis by crafting suffixal jailbreak audios; (2) universality: a single jailbreak perturbation is effective for different prompts by incorporating multiple prompts into perturbation generation; (3) stealthiness: the malicious intent of jailbreak audios will not raise the awareness of victims by proposing various intent concealment strategies; and (4) over-the-air robustness: the jailbreak audios remain effective when being played over the air by incorporating the reverberation distortion effect with room impulse response into the generation of the perturbations. In contrast, all prior audio jailbreak attacks cannot offer asynchrony, universality, stealthiness, or over-the-air robustness. Moreover, AudioJailbreak is also applicable to an adversary who cannot fully manipulate user prompts, and thus covers a much broader attack scenario. Extensive experiments on the largest set of LALMs evaluated thus far demonstrate the high effectiveness of AudioJailbreak. We highlight that our work peeks into the security implications of audio jailbreak attacks against LALMs, and realistically fosters improving their security robustness. The implementation and audio samples are available at our website https://audiojailbreak.github.io/AudioJailbreak.
中文: 本研究提出AudioJailbreak这一新型音频越狱攻击方法,通过异步性、通用性、隐蔽性和空中传播鲁棒性克服了现有方法的局限,在多种大型音频语言模型中展现出高效攻击能力。
English: This study introduces AudioJailbreak, a novel audio jailbreak attack that overcomes limitations of prior methods by offering asynchrony, universality, stealthiness, and over-the-air robustness, demonstrating high effectiveness across multiple large audio-language models.

Authors:Pengyue Jia, Seongheon Park, Song Gao, Xiangyu Zhao, Yixuan Li
Title: GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization
Abstract:
Worldwide image geolocalization, the task of predicting GPS coordinates from images taken anywhere on Earth, poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates. In this paper, we propose GeoRanker, a distance-aware ranking framework that leverages large vision-language models to jointly encode query-candidate interactions and predict geographic proximity. In addition, we introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships. To support this, we curate GeoRanking, the first dataset explicitly designed for geographic ranking tasks with multimodal candidate information. GeoRanker achieves state-of-the-art results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly outperforming current best methods.
中文: GeoRanker提出了一种距离感知的排序框架,利用视觉语言模型编码查询与候选之间的交互并预测地理邻近性,通过创新的多阶距离损失和专用数据集,在基准测试中取得了最先进的成果。
English: GeoRanker introduces a distance-aware ranking framework using vision-language models to encode query-candidate interactions and predict geographic proximity, achieving state-of-the-art results on benchmarks through a novel multi-order distance loss and a dedicated dataset.

Authors:Yubin Kim, Taehan Kim, Wonjune Kang, Eugene Park, Joonsik Yoon, Dongjae Lee, Xin Liu, Daniel McDuff, Hyeonhoon Lee, Cynthia Breazeal, Hae Won Park
Title: VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation
Abstract:
Vocal health plays a crucial role in people's lives, significantly impacting their communicative abilities and interactions. However, despite the global prevalence of voice disorders, many lack access to convenient diagnosis and treatment. This paper introduces VocalAgent, an audio large language model (LLM) that addresses these challenges through vocal health diagnosis. We leverage Qwen-Audio-Chat fine-tuned on three datasets collected in-situ from hospital patients, and present a multifaceted evaluation framework encompassing a safety assessment to mitigate diagnostic biases, cross-lingual performance analysis, and modality ablation studies. VocalAgent demonstrates superior accuracy on voice disorder classification compared to state-of-the-art baselines. Its LLM-based method offers a scalable solution for broader adoption of health diagnostics, while underscoring the importance of ethical and technical validation.
中文: 本文提出VocalAgent音频大语言模型,通过多维度安全评估与性能验证实现精准的声带疾病诊断,为改善全球嗓音健康诊疗可及性提供可扩展的解决方案。
English: This paper presents VocalAgent, an audio large language model that provides accurate and scalable vocal health diagnostics, validated through comprehensive safety and performance evaluations to address global access issues in voice disorder care.

Authors:Austin Xu, Yilun Zhou, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Title: J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization
Abstract:
To keep pace with the rapid development of large language models (LLMs), model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in evaluating relatively simple domains, like chat quality, but struggle in reasoning-intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, respectively, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.
中文: 为解决大语言模型作为评判者在推理密集型领域的不足,本研究提出EIS-GRPO算法和ReasoningJudgeBench基准,训练的J4R评判模型在性能上超越现有模型,展现出更强的鲁棒性和精确度。
English: To address the limitations of LLM-as-judge models in reasoning-intensive domains, this study introduces the EIS-GRPO algorithm and ReasoningJudgeBench benchmark, training the J4R judge to outperform existing models with enhanced robustness and accuracy.
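
The positional bias mentioned in the abstract can be made concrete with a small consistency check. The sketch below is not EIS-GRPO and not the paper's evaluation protocol; it only assumes a hypothetical pairwise judge, a callable that returns `"first"` or `"second"`, and tests whether its preferred response survives swapping the presentation order. Both toy judges are invented for illustration.

```python
def swap_consistent(judge, question, resp_a, resp_b):
    """Query the judge in both orders and report whether it picks the
    same underlying response each time. Returns (consistent, winner)."""
    verdict_ab = judge(question, resp_a, resp_b)
    verdict_ba = judge(question, resp_b, resp_a)
    winner_ab = resp_a if verdict_ab == "first" else resp_b
    winner_ba = resp_b if verdict_ba == "first" else resp_a
    if winner_ab == winner_ba:
        return True, winner_ab
    return False, None


def always_first(question, first, second):
    """Toy judge with maximal positional bias: ignores content entirely."""
    return "first"


def longer_wins(question, first, second):
    """Toy content-based judge: prefers the longer response."""
    return "first" if len(first) >= len(second) else "second"


question = "What is 17 * 3?"
short_answer = "51"
long_answer = "17 * 3 = 51, since 17 * 3 = 30 + 21."

biased = swap_consistent(always_first, question, short_answer, long_answer)
unbiased = swap_consistent(longer_wins, question, short_answer, long_answer)
```

A judge that fails this check on many pairs is responding to position rather than content, which is the failure mode the abstract says the training method targets.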

Authors:Hongru Wang, Wenyu Huang, Yufei Wang, Yuanhao Xi, Jianqiao Lu, Huan Zhang, Nan Hu, Zeming Liu, Jeff Z. Pan, Kam-Fai Wong
Title: Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges
Abstract:
Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fill this gap, we propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions considering the whole life cycle of tool use, across six key tasks in three stages: 1) tool creation; 2) tool utilization: tool awareness, tool selection, tool execution; and 3) role-consistent response: response generation and role play. Furthermore, we build VirtualMobile, an embodied virtual mobile evaluation environment to simulate API calls and assess the robustness of the created APIs (we use "tools" and "APIs" interchangeably in this paper, as there are no significant differences between them here). Taking advantage of these artifacts, we conduct a comprehensive evaluation of 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that existing state-of-the-art LLMs still struggle to use tools over long horizons.
中文: 本文提出了DialogTool,一个用于评估多轮对话中状态化工具交互的数据集,涵盖工具创建、使用和角色一致响应三个阶段,并构建了VirtualMobile虚拟环境测试API鲁棒性,发现现有LLM在长序列工具使用中仍表现不佳。
English: This paper introduces DialogTool, a multi-turn dialogue dataset for evaluating stateful tool interactions across six tasks in three stages, and VirtualMobile, a virtual environment to test API robustness, revealing that current LLMs still struggle with long-horizon tool use.

Authors:Yongchang Gao, Meiling Jin, Zhaofei Yu, Tiejun Huang, Guozhang Chen
Title: SPKLIP: Aligning Spike Video Streams with Natural Language
Abstract:
Spike cameras offer unique sensing capabilities but their sparse, asynchronous output challenges semantic understanding, especially for Spike Video-Language Alignment (Spike-VLA) where models like CLIP underperform due to modality mismatch. We introduce SPKLIP, the first architecture specifically for Spike-VLA. SPKLIP employs a hierarchical spike feature extractor that adaptively models multi-scale temporal dynamics in event streams, and uses spike-text contrastive learning to directly align spike video with language, enabling effective few-shot learning. A full-spiking visual encoder variant, integrating SNN components into our pipeline, demonstrates enhanced energy efficiency. Experiments show state-of-the-art performance on benchmark spike datasets and strong few-shot generalization on a newly contributed real-world dataset. SPKLIP's energy efficiency highlights its potential for neuromorphic deployment, advancing event-based multimodal research. The source code and dataset are available at [link removed for anonymity].
Chinese: SPKLIP是首个专为脉冲视频-语言对齐设计的架构,采用分层脉冲特征提取器和脉冲-文本对比学习,在脉冲数据集上实现了最先进的性能与能效表现。
English: SPKLIP is the first architecture designed for Spike Video-Language Alignment, featuring a hierarchical spike feature extractor and spike-text contrastive learning to achieve state-of-the-art performance and energy efficiency on spike datasets.
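
Contrastive alignment between a video embedding and a text embedding can be sketched with a CLIP-style symmetric loss. This is a generic illustration under stated assumptions, not SPKLIP's actual objective or encoder: the embeddings are hand-written toy vectors, and the function names and the temperature value are hypothetical.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_z - row[i]
    return loss / len(logits)


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings."""
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    # Cosine similarity matrix scaled by temperature: rows = videos, cols = texts.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Matched pairs give a loss near zero; mismatched pairs give a large loss.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is averaged over both directions (video-to-text and text-to-video) so that neither modality is treated as the sole query side.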

Authors:Chongyang Tan, Ruoqi Wen, Rongpeng Li, Zhifeng Zhao, Ekram Hossain, Honggang Zhang
Title: Tool-Aided Evolutionary LLM for Generative Policy Toward Efficient Resource Management in Wireless Federated Learning
Abstract:
Federated Learning (FL) enables distributed model training across edge devices in a privacy-friendly manner. However, its efficiency heavily depends on effective device selection and high-dimensional resource allocation in dynamic and heterogeneous wireless environments. Conventional methods demand a confluence of domain-specific expertise, extensive hyperparameter tuning, and/or heavy interaction cost. This paper proposes a Tool-aided Evolutionary Large Language Model (T-ELLM) framework to generate a qualified policy for device selection in a wireless FL environment. Unlike conventional optimization methods, T-ELLM leverages natural language-based scenario prompts to enhance generalization across varying network conditions. The framework decouples the joint optimization problem mathematically, enabling tractable learning of device selection policies while delegating resource allocation to convex optimization tools. To improve adaptability, T-ELLM integrates a sample-efficient, model-based virtual learning environment that captures the relationship between device selection and learning performance, facilitating subsequent group relative policy optimization. This concerted approach reduces reliance on real-world interactions, minimizing communication overhead while maintaining high-fidelity decision-making. Theoretical analysis proves that the discrepancy between virtual and real environments is bounded, ensuring the advantage function learned in the virtual environment maintains a provably small deviation from real-world conditions. Experimental results demonstrate that T-ELLM outperforms benchmark methods in energy efficiency and exhibits robust adaptability to environmental changes.
中文: 本文提出了一种工具辅助进化大语言模型(T-ELLM)框架,通过自然语言场景提示和数学解耦方法,为无线联邦学习环境生成高效的设备选择策略,在降低交互成本的同时保持优异的性能表现和环境适应性。
English: The paper introduces a Tool-aided Evolutionary Large Language Model (T-ELLM) framework that uses natural language prompts and mathematical decoupling to efficiently generate device selection policies for federated learning in wireless environments, reducing interaction costs while maintaining high performance and adaptability.

Authors:Zhiyuan Chang, Mingyang Li, Xiaojun Jia, Junjie Wang, Yuekai Huang, Ziyou Jiang, Yang Liu, Qing Wang
Title: One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems
Abstract:
Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) have shown improved performance in generating accurate responses. However, the dependence on external knowledge bases introduces potential security vulnerabilities, particularly when these knowledge bases are publicly accessible and modifiable. While previous studies have exposed knowledge poisoning risks in RAG systems, existing attack methods suffer from critical limitations: they either require injecting multiple poisoned documents (resulting in poor stealthiness) or can only function effectively on simplistic queries (limiting real-world applicability). This paper reveals a more realistic knowledge poisoning attack against RAG systems that succeeds by poisoning only a single document while remaining effective for complex multi-hop questions involving intricate relationships between multiple elements. Our proposed AuthChain addresses three challenges to ensure the poisoned document is reliably retrieved and trusted by the LLM, even against large knowledge bases and the LLM's own knowledge. Extensive experiments across six popular LLMs demonstrate that AuthChain achieves significantly higher attack success rates while maintaining superior stealthiness against RAG defense mechanisms compared to state-of-the-art baselines.
中文: 本文提出AuthChain攻击方法,通过仅污染单个文档即可有效攻击RAG系统,即使在处理复杂多跳问题时仍保持高攻击成功率,并在六大主流大语言模型上验证了其优于现有方法的隐蔽性和攻击效果。
English: This paper introduces AuthChain, a stealthy knowledge poisoning attack that compromises RAG systems by poisoning just one document yet remains effective against complex multi-hop queries, outperforming existing methods in both success rate and undetectability across multiple LLMs.

Authors:Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić
Title: Visual Planning: Let's Think Only with Images
Abstract:
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
Chinese Summary: 本文提出视觉规划新范式,通过纯视觉表征进行推理,并借助强化学习框架VPRL在视觉导航任务中验证了其优于纯文本推理方法的性能。
English Summary: The paper introduces Visual Planning, a paradigm that uses purely visual representations for reasoning instead of text, and demonstrates its superiority over text-based methods in tasks like visual navigation through a reinforcement learning framework called VPRL.

Authors:Revanth Gangi Reddy, Tarun Suresh, JaeHyeok Doo, Ye Liu, Xuan Phi Nguyen, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Heng Ji, Shafiq Joty
Title: SweRank: Software Issue Localization with Code Ranking
Abstract:
Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi-step reasoning and reliance on closed-source LLMs. Alternatively, traditional code ranking models, typically optimized for query-to-code or code-to-code retrieval, struggle with the verbose and failure-descriptive nature of issue localization queries. To bridge this gap, we introduce SweRank, an efficient and effective retrieve-and-rerank framework for software issue localization. To facilitate training, we construct SweLoc, a large-scale dataset curated from public GitHub repositories, featuring real-world issue descriptions paired with corresponding code modifications. Empirical results on SWE-Bench-Lite and LocBench show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems using closed-source LLMs like Claude-3.5. Further, we demonstrate SweLoc's utility in enhancing various existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.
中文摘要:SweRank是一种高效的检索重排框架,通过使用新构建的SweLoc数据集,在软件问题定位任务中实现了最先进的性能,超越了传统排序模型和基于闭源大语言模型的昂贵代理系统。
English Summary: SweRank is an efficient retrieve-and-rerank framework that achieves state-of-the-art performance in software issue localization by using the newly created SweLoc dataset, outperforming both traditional ranking models and costly agent-based systems.

Authors:Peihao Wang, Yuehao Wang, Dilin Wang, Sreyas Mohan, Zhiwen Fan, Lemeng Wu, Ruisi Cai, Yu-Ying Yeh, Zhangyang Wang, Qiang Liu, Rakesh Ranjan
Title: Steepest Descent Density Control for Compact 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time, high-resolution novel view synthesis. By representing scenes as a mixture of Gaussian primitives, 3DGS leverages GPU rasterization pipelines for efficient rendering and reconstruction. To optimize scene coverage and capture fine details, 3DGS employs a densification algorithm to generate additional points. However, this process often leads to redundant point clouds, resulting in excessive memory usage, slower performance, and substantial storage demands, posing significant challenges for deployment on resource-constrained devices. To address this limitation, we propose a theoretical framework that demystifies and improves density control in 3DGS. Our analysis reveals that splitting is crucial for escaping saddle points. Through an optimization-theoretic approach, we establish the necessary conditions for densification, determine the minimal number of offspring Gaussians, identify the optimal parameter update direction, and provide an analytical solution for normalizing offspring opacity. Building on these insights, we introduce SteepGS, incorporating steepest density control, a principled strategy that minimizes loss while maintaining a compact point cloud. SteepGS achieves a ~50% reduction in Gaussian points without compromising rendering quality, significantly enhancing both efficiency and scalability.
中文摘要:3D高斯泼溅技术虽能实现实时视图合成,但存在点云冗余问题;SteepGS通过优化密度控制策略,在保持渲染质量的同时将高斯点数量减少约50%。
English Summary: 3D Gaussian Splatting enables real-time view synthesis but suffers from redundant points, which SteepGS addresses by optimizing density control to halve the Gaussian count while preserving quality.

Authors:Qianchu Liu, Sheng Zhang, Guanghui Qin, Timothy Ossowski, Yu Gu, Ying Jin, Sid Kiblawi, Sam Preston, Mu Wei, Paul Vozila, Tristan Naumann, Hoifung Poon
Title: X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
Abstract:
Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet, most existing open-source research concentrates on training text-only reasoning models, with evaluations limited to mainly mathematical and general-domain tasks. Therefore, it remains unclear how to effectively extend reasoning capabilities beyond text input and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: General-domain text-based post-training can enable such strong generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chain-of-thoughts, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks. Additionally, we find that X-Reasoner's performance in specialized domains can be further enhanced through continued training on domain-specific text-only data. Building upon this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves a new state of the art on numerous text-only and multimodal medical benchmarks.
中文摘要:本文提出的X-Reasoner模型通过通用领域文本后训练实现了跨模态和跨领域的泛化推理能力,在多项基准测试中超越了现有最优模型。
English Summary: This paper introduces X-Reasoner, a vision-language model that demonstrates generalizable reasoning across modalities and domains through general-domain text-based post-training, outperforming state-of-the-art models on various benchmarks.

Authors:Yiyuan Yang, Guodong Long, Tianyi Zhou, Qinghua Lu, Shanshan Ye, Jing Jiang
Title: Federated Adapter on Foundation Models: An Out-Of-Distribution Approach
Abstract:
As foundation models gain prominence, Federated Foundation Models (FedFM) have emerged as a privacy-preserving approach to collaboratively fine-tune models in federated learning (FL) frameworks using distributed datasets across clients. A key challenge for FedFM, given the versatile nature of foundation models, is addressing out-of-distribution (OOD) generalization, where unseen tasks or clients may exhibit distribution shifts leading to suboptimal performance. Although numerous studies have explored OOD generalization in conventional FL, these methods are inadequate for FedFM due to the challenges posed by large parameter scales and increased data heterogeneity. To address these, we propose FedOA, which employs adapter-based parameter-efficient fine-tuning methods for efficacy and introduces personalized adapters with feature distance-based regularization to align distributions and guarantee OOD generalization for each client. Theoretically, we demonstrate that the conventional aggregated global model in FedFM inherently retains OOD generalization capabilities, and our proposed method enhances the personalized model's OOD generalization through regularization informed by the global model, with proven convergence under general non-convex settings. Empirically, the effectiveness of the proposed method is validated on benchmark datasets across various NLP tasks.
中文总结:FedOA通过个性化适配器和基于特征距离的正则化方法,有效提升联邦基础模型在客户端数据异构情况下的分布外泛化能力,同时保障隐私安全。
English Summary: FedOA is a novel approach that enhances out-of-distribution generalization in Federated Foundation Models through personalized adapters and feature distance regularization, effectively addressing data heterogeneity across clients while maintaining privacy.

Authors:Kai Li, Conggai Li, Xin Yuan, Shenghong Li, Sai Zou, Syed Sohail Ahmed, Wei Ni, Dusit Niyato, Abbas Jamalipour, Falko Dressler, Ozgur B. Akan
Title: Zero-Trust Foundation Models: A New Paradigm for Secure and Collaborative Artificial Intelligence for Internet of Things
Abstract:
This paper focuses on Zero-Trust Foundation Models (ZTFMs), a novel paradigm that embeds zero-trust security principles into the lifecycle of foundation models (FMs) for Internet of Things (IoT) systems. By integrating core tenets, such as continuous verification, least privilege access (LPA), data confidentiality, and behavioral analytics into the design, training, and deployment of FMs, ZTFMs can enable secure, privacy-preserving AI across distributed, heterogeneous, and potentially adversarial IoT environments. We present the first structured synthesis of ZTFMs, identifying their potential to transform conventional trust-based IoT architectures into resilient, self-defending ecosystems. Moreover, we propose a comprehensive technical framework, incorporating federated learning (FL), blockchain-based identity management, micro-segmentation, and trusted execution environments (TEEs) to support decentralized, verifiable intelligence at the network edge. In addition, we investigate emerging security threats unique to ZTFM-enabled systems and evaluate countermeasures, such as anomaly detection, adversarial training, and secure aggregation. Through this analysis, we highlight key open research challenges in terms of scalability, secure orchestration, interpretable threat attribution, and dynamic trust calibration. This survey lays a foundational roadmap for secure, intelligent, and trustworthy IoT infrastructures powered by FMs.
中文: 本文提出零信任基础模型(ZTFMs),将零信任安全原则融入物联网基础模型生命周期,通过持续验证和去中心化框架构建弹性自防御生态系统。
English: This paper introduces Zero-Trust Foundation Models (ZTFMs), integrating zero-trust security into foundation models for IoT systems to create resilient, self-defending ecosystems through continuous verification and decentralized frameworks.

Authors:Hongcheng Guo, Zheyong Xie, Shaosheng Cao, Boyang Wang, Weiting Liu, Anjie Le, Lei Li, Zhoujun Li
Title: SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services
Abstract:
With the increasing integration of visual and textual content in Social Networking Services (SNS), evaluating the multimodal capabilities of Large Language Models (LLMs) is crucial for enhancing user experience, content understanding, and platform intelligence. Existing benchmarks primarily focus on text-centric tasks, lacking coverage of the multimodal contexts prevalent in modern SNS ecosystems. In this paper, we introduce SNS-Bench-VL, a comprehensive multimodal benchmark designed to assess the performance of Vision-Language LLMs in real-world social media scenarios. SNS-Bench-VL incorporates images and text across 8 multimodal tasks, including note comprehension, user engagement analysis, information retrieval, and personalized recommendation. It comprises 4,001 carefully curated multimodal question-answer pairs, covering single-choice, multiple-choice, and open-ended tasks. We evaluate over 25 state-of-the-art multimodal LLMs, analyzing their performance across tasks. Our findings highlight persistent challenges in multimodal social context comprehension. We hope SNS-Bench-VL will inspire future research towards robust, context-aware, and human-aligned multimodal intelligence for next-generation social networking services.
中文: 本文提出了SNS-Bench-VL这一综合多模态基准,通过在八个社交网络任务中评估25个先进视觉语言大模型,揭示了多模态社交语境理解仍存在持续挑战。
English: This paper introduces SNS-Bench-VL, a comprehensive multimodal benchmark for evaluating vision-language LLMs in real-world social media contexts across eight tasks, revealing persistent challenges in multimodal social context comprehension despite testing over 25 advanced models.

Authors:Amon Lahr, Johannes Köhler, Anna Scampicchio, Melanie N. Zeilinger
Title: Optimal kernel regression bounds under energy-bounded noise
Abstract:
Non-conservative uncertainty bounds are key for both assessing an estimation algorithm's accuracy and in view of downstream tasks, such as its deployment in safety-critical contexts. In this paper, we derive a tight, non-asymptotic uncertainty bound for kernel-based estimation, which can also handle correlated noise sequences. Its computation relies on a mild norm-boundedness assumption on the unknown function and the noise, returning the worst-case function realization within the hypothesis class at an arbitrary query input location. The value of this function is shown to be given in terms of the posterior mean and covariance of a Gaussian process for an optimal choice of the measurement noise covariance. By rigorously analyzing the proposed approach and comparing it with other results in the literature, we show its effectiveness in returning tight and easy-to-compute bounds for kernel-based estimates.
中文摘要:本文提出了一种紧密的非渐近不确定性界限,适用于核基估计并处理相关噪声,通过温和假设确保计算高效性,为安全关键应用提供可靠估计保障。
English Summary: This paper introduces a tight, non-asymptotic uncertainty bound for kernel-based estimation that accommodates correlated noise, derived under mild assumptions to ensure computational efficiency and reliability for safety-critical applications.
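
For reference, the Gaussian process posterior quantities that the abstract refers to have the following standard form. The notation is an assumption made here, not taken from the paper: $k$ is the kernel, $x_1,\dots,x_n$ are the training inputs with measurements $y$, $K_{ij} = k(x_i, x_j)$, $k_X(x) = [k(x_1, x), \dots, k(x_n, x)]^\top$, and $\Sigma$ is the measurement noise covariance. The paper's optimal choice of $\Sigma$ and the way the bound combines these two quantities are not reproduced here.

```latex
\begin{align}
  \mu(x)      &= k_X(x)^\top \left(K + \Sigma\right)^{-1} y, \\
  \sigma^2(x) &= k(x, x) - k_X(x)^\top \left(K + \Sigma\right)^{-1} k_X(x).
\end{align}
```

A non-diagonal $\Sigma$ is what allows correlated noise sequences to enter these expressions.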

Authors:Zeyi Liao, Jaylen Jones, Linxi Jiang, Eric Fosler-Lussier, Yu Su, Zhiqiang Lin, Huan Sun
Title: RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Abstract:
Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection. Current evaluations of this threat either lack support for realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces. To address this, we propose RedTeamCUA, an adversarial testing framework featuring a novel hybrid sandbox that integrates a VM-based OS environment with Docker-based web platforms. Our sandbox supports key features tailored for red teaming, such as flexible adversarial scenario configuration, and a setting that decouples adversarial evaluation from navigational limitations of CUAs by initializing tests directly at the point of an adversarial injection. Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities. Benchmarking current frontier CUAs identifies significant vulnerabilities: Claude 3.7 Sonnet | CUA demonstrates an ASR of 42.9%, while Operator, the most secure CUA evaluated, still exhibits an ASR of 7.6%. Notably, CUAs often attempt to execute adversarial tasks, with an Attempt Rate as high as 92.5%, although they fail to complete them due to capability limitations. Nevertheless, we observe concerning ASRs of up to 50% in realistic end-to-end settings, with the recently released frontier Claude 4 Opus | CUA showing an alarming ASR of 48%, demonstrating that indirect prompt injection presents tangible risks for even advanced CUAs despite their capabilities and safeguards. Overall, RedTeamCUA provides an essential framework for advancing realistic, controlled, and systematic analysis of CUA vulnerabilities, highlighting the urgent need for robust defenses to indirect prompt injection prior to real-world deployment.
中文摘要:RedTeamCUA框架通过混合沙盒环境对计算机使用代理进行系统性安全测试,发现在间接提示注入攻击下现有系统存在严重漏洞,现实场景中攻击成功率高达50%,凸显了部署前强化防御的紧迫性。
English Summary: The RedTeamCUA framework introduces a hybrid sandbox for systematically testing computer-use agents against indirect prompt injection attacks, revealing significant vulnerabilities in current systems with attack success rates reaching up to 50% in realistic scenarios.
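
The gap between Attempt Rate and attack success rate (ASR) reported in the abstract can be illustrated with a short calculation. This is a minimal sketch under assumed definitions, with both rates computed as fractions of all test cases; the benchmark's exact definitions may differ. The record fields `attempted` and `succeeded` and the sample data are hypothetical.

```python
def injection_metrics(records):
    """Compute Attempt Rate and ASR over a list of per-test records.

    Each record is a dict with two booleans:
      attempted: the agent tried to carry out the injected task
      succeeded: the injected task was actually completed
    """
    total = len(records)
    if total == 0:
        raise ValueError("no records to score")
    attempted = sum(1 for r in records if r["attempted"])
    succeeded = sum(1 for r in records if r["succeeded"])
    return {"attempt_rate": attempted / total, "asr": succeeded / total}


# Hypothetical run: 8 tests, 6 attempts, of which only 2 complete the attack.
records = (
    [{"attempted": True, "succeeded": True}] * 2
    + [{"attempted": True, "succeeded": False}] * 4
    + [{"attempted": False, "succeeded": False}] * 2
)
metrics = injection_metrics(records)
```

Reporting both numbers separates an agent's willingness to follow an injection from its ability to finish the injected task, which is why a high Attempt Rate can coexist with a lower ASR.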

Authors:Peiyuan Zhi, Peiyang Li, Jianqin Yin, Baoxiong Jia, Siyuan Huang
Title: Learning Unified Force and Position Control for Legged Loco-Manipulation
Abstract:
Robotic loco-manipulation tasks often involve contact-rich interactions with the environment, requiring the joint modeling of contact force and robot position. However, recent visuomotor policies often focus solely on learning position or force control, overlooking their co-learning. In this work, we propose the first unified policy for legged robots that jointly models force and position control learned without reliance on force sensors. By simulating diverse combinations of position and force commands alongside external disturbance forces, we use reinforcement learning to learn a policy that estimates forces from historical robot states and compensates for them through position and velocity adjustments. This policy enables a wide range of manipulation behaviors under varying force and position inputs, including position tracking, force application, force tracking, and compliant interactions. Furthermore, we demonstrate that the learned policy enhances trajectory-based imitation learning pipelines by incorporating essential contact information through its force estimation module, achieving approximately 39.5% higher success rates across four challenging contact-rich manipulation tasks compared to position-control policies. Extensive experiments on both a quadrupedal manipulator and a humanoid robot validate the versatility and robustness of the proposed policy across diverse scenarios.
中文: 本研究首次提出了一种无需力传感器的足式机器人统一策略,通过强化学习联合建模力与位置控制,实现了多样化操作行为,并在接触密集型任务中将模仿学习成功率提升了约39.5%。
English: This study introduces the first unified policy for legged robots that jointly models force and position control without force sensors, using reinforcement learning to enable versatile manipulation behaviors and improve imitation learning success rates by 39.5% in contact-rich tasks.

Authors:Peiyuan Zhi, Peiyang Li, Jianqin Yin, Baoxiong Jia, Siyuan Huang
Title: Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation
Abstract:
Robotic loco-manipulation tasks often involve contact-rich interactions with the environment, requiring the joint modeling of contact force and robot position. However, recent visuomotor policies often focus solely on learning position or force control, overlooking their co-learning. In this work, we propose the first unified policy for legged robots that jointly models force and position control learned without reliance on force sensors. By simulating diverse combinations of position and force commands alongside external disturbance forces, we use reinforcement learning to learn a policy that estimates forces from historical robot states and compensates for them through position and velocity adjustments. This policy enables a wide range of manipulation behaviors under varying force and position inputs, including position tracking, force application, force tracking, and compliant interactions. Furthermore, we demonstrate that the learned policy enhances trajectory-based imitation learning pipelines by incorporating essential contact information through its force estimation module, achieving approximately 39.5% higher success rates across four challenging contact-rich manipulation tasks compared to position-control policies. Extensive experiments on both a quadrupedal manipulator and a humanoid robot validate the versatility and robustness of the proposed policy across diverse scenarios.
中文: 本研究首次提出了一种无需力传感器的足式机器人统一策略,通过强化学习联合建模力与位置控制,实现了多样化操作行为,并在接触密集型任务中将模仿学习成功率提升了约39.5%。
English: This study introduces the first unified policy for legged robots that jointly models force and position control without force sensors, using reinforcement learning to enable versatile manipulation behaviors and improve imitation learning success rates by 39.5% in contact-rich tasks.

Authors:Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer
Title: Tracing and Reversing Rank-One Model Edits
Abstract:
Knowledge editing methods (KEs) are a cost-effective way to update the factual content of large language models (LLMs), but they pose a dual-use risk. While KEs are beneficial for updating outdated or incorrect information, they can be exploited maliciously to implant misinformation or bias. In order to defend against these types of malicious manipulation, we need robust techniques that can reliably detect, interpret, and mitigate adversarial edits. This work investigates the traceability and reversibility of knowledge edits, focusing on the widely used Rank-One Model Editing (ROME) method. We first show that ROME introduces distinctive distributional patterns in the edited weight matrices, which can serve as effective signals for locating the edited weights. Second, we show that these altered weights can reliably be used to predict the edited factual relation, enabling partial reconstruction of the modified fact. Building on this, we propose a method to infer the edited object entity directly from the modified weights, without access to the editing prompt, achieving over 95% accuracy. Finally, we demonstrate that ROME edits can be reversed, recovering the model's original outputs with at least 80% accuracy. Our findings highlight the feasibility of detecting, tracing, and reversing edits based on the edited weights, offering a robust framework for safeguarding LLMs against adversarial manipulations.
中文摘要:知识编辑方法虽能高效更新大语言模型的事实内容,但存在被滥用的风险,本研究通过分析权重矩阵的独特分布模式,实现了对恶意编辑的检测、追踪和逆转,为模型防护提供了可靠框架。
English Summary: Knowledge editing methods offer efficient updates for large language models but carry dual-use risks, prompting the development of techniques to detect, trace, and reverse adversarial edits by analyzing distinctive patterns in modified weights with high accuracy.

Authors:Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan
Title: VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Abstract:
The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth sensors for geometry capture or utilize off-the-shelf algorithms for pre-constructing 3D maps, thereby limiting their scalability, especially with prevalent monocular video inputs and for time-sensitive applications. In this work, we introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding. Leveraging our Spatial-Visual-View Fusion and over 200K curated 3D reconstructive instruction tuning question-answer (QA) pairs, VLM-3R effectively aligns real-world spatial context with language instructions. This enables monocular 3D spatial assistance and embodied reasoning. To facilitate the evaluation of temporal reasoning, we introduce the Vision-Spatial-Temporal Intelligence benchmark, featuring over 138.6K QA pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments demonstrate that our model, VLM-3R, not only facilitates robust visual-spatial reasoning but also enables the understanding of temporal 3D context changes, excelling in both accuracy and scalability.
中文:VLM-3R框架通过结合三维重建指令微调和几何编码器,实现了从单目视频中进行强大的视觉空间推理,在时间性三维上下文理解方面兼具准确性和可扩展性。
English: The VLM-3R framework integrates 3D reconstructive instruction tuning with a geometry encoder to enable robust visual-spatial reasoning from monocular videos, excelling in both accuracy and scalability for temporal 3D context understanding.

Authors:Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari
Title: Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models
Abstract:
Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs) by enabling audio-based human interactions. However, recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety datasets specifically targeting LALMs are notably absent. Meanwhile, defence measures based on Supervised Fine-tuning (SFT) struggle to improve safety while avoiding over-rejection, significantly compromising helpfulness. In this work, we propose an unsupervised safety fine-tuning strategy as a remedy that reshapes the model's representation space to enhance the safety alignment of existing LALMs while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs, demonstrate that our approach significantly improves LALM safety under three modality input conditions (audio-text, text-only, and audio-only) while increasing the over-rejection rate by only 0.88% on average. Warning: this paper contains harmful examples.
中文: 大型音频语言模型因安全对齐不足而易受有害查询攻击,本文提出一种无监督微调策略,能在多种输入模态下显著提升安全性,同时将过度拒绝率控制在极低水平。
English: Large Audio Language Models (LALMs) are vulnerable to harmful queries due to insufficient safety measures, but this work proposes an unsupervised fine-tuning strategy that significantly enhances safety alignment while minimizing over-rejection across multiple input modalities.

Authors:Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, Katia Sycara
Title: InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning
Abstract:
Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.
中文摘要:本文提出InstructPart基准,用于评估视觉语言模型在部件级物体理解方面的能力,揭示了现有模型的不足,并通过微调方法实现了性能翻倍的基线改进。
English Summary: This paper introduces InstructPart, a benchmark for evaluating part-level object understanding in vision-language models, revealing current models' limitations and proposing a fine-tuned baseline that doubles performance.

Authors:Yuezhou Ma, Haixu Wu, Hang Zhou, Huikun Weng, Jianmin Wang, Mingsheng Long
Title: PhySense: Sensor Placement Optimization for Accurate Physics Sensing
Abstract:
Physics sensing plays a central role in many scientific and engineering domains, which inherently involves two coupled tasks: reconstructing dense physical fields from sparse observations and optimizing scattered sensor placements to observe maximum information. While deep learning has made rapid advances in sparse-data reconstruction, existing methods generally omit optimization of sensor placements, leaving the mutual enhancement between reconstruction and placement on the shelf. To change this suboptimal practice, we propose PhySense, a synergistic two-stage framework that learns to jointly reconstruct physical fields and to optimize sensor placements, both aiming for accurate physics sensing. The first stage involves a flow-based generative model enhanced by cross-attention to adaptively fuse sparse observations. Leveraging the reconstruction feedback, the second stage performs sensor placement via projected gradient descent to satisfy spatial constraints. We further prove that the learning objectives of the two stages are consistent with classical variance-minimization principles, providing theoretical guarantees. Extensive experiments across three challenging benchmarks, especially a 3D geometry dataset, indicate PhySense achieves state-of-the-art physics sensing accuracy and discovers informative sensor placements previously unconsidered.
Chinese: PhySense 是一个两阶段协同框架,通过稀疏观测重建物理场并优化传感器布局以最大化信息获取,在多个基准测试中实现了最先进的精度。
English: PhySense is a two-stage framework that synergistically reconstructs physical fields from sparse observations and optimizes sensor placements to maximize information capture, achieving state-of-the-art accuracy across multiple benchmarks.
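The second stage described above relies on projected gradient descent to keep sensor locations inside the allowed region. A minimal sketch of that generic technique, assuming a unit-box constraint and a toy quadratic objective (the objective, the target point, and all function names are illustrative assumptions, not the PhySense losses):

```python
def project_to_box(point, low=0.0, high=1.0):
    """Project a point onto the box [low, high]^d by clamping each coordinate."""
    return [min(max(c, low), high) for c in point]


def projected_gradient_descent(grad_fn, x0, step=0.1, iters=200):
    """Alternate a gradient step with a projection back onto the feasible set."""
    x = project_to_box(x0)
    for _ in range(iters):
        g = grad_fn(x)
        x = project_to_box([xi - step * gi for xi, gi in zip(x, g)])
    return x


# Toy objective ||x - target||^2 whose unconstrained minimum lies outside the box.
target = [1.3, 0.5]


def toy_grad(x):
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]


sensor = projected_gradient_descent(toy_grad, [0.5, 0.5])
print(sensor)  # the first coordinate is clamped at the box boundary
```

In the paper's setting the gradient would presumably come from the reconstruction model's feedback rather than from a closed-form toy loss; only the step-then-project loop is the point here.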

Authors:Zhihua Liu, Lei Tong, Xilin He, Che Liu, Rossella Arcucci, Chen Jin, Huiyu Zhou
Title: BOTM: Echocardiography Segmentation via Bi-directional Optimal Token Matching
Abstract:
Existing echocardiography segmentation methods often suffer from anatomical inconsistency caused by shape variation, partial observation, and region ambiguity with similar intensity across 2D echocardiographic sequences, resulting in false-positive segmentation with anatomically defective structures in challenging low signal-to-noise ratio conditions. To provide a strong anatomical guarantee across different echocardiographic frames, we propose a novel segmentation framework named BOTM (Bi-directional Optimal Token Matching) that performs echocardiography segmentation and optimal anatomy transportation simultaneously. Given paired echocardiographic images, BOTM learns to match two sets of discrete image tokens by finding optimal correspondences from a novel anatomical transportation perspective. We further extend the token matching into a bi-directional cross-transport attention proxy to regulate the preserved anatomical consistency within the cardiac cyclic deformation in the temporal domain. Extensive experimental results show that BOTM can generate stable and accurate segmentation outcomes (e.g. -1.917 HD on CAMUS2H LV, +1.9% Dice on TED), and provide a better matching interpretation with an anatomical consistency guarantee.
中文摘要:BOTM框架通过双向最优令牌匹配,在超声心动图序列中同时执行分割与解剖结构最优传输,有效解决了因形状变异和区域模糊导致的解剖不一致问题,在低信噪比条件下仍能保持分割结果的稳定性与准确性。
English Summary: The BOTM framework addresses anatomical inconsistency in echocardiography segmentation by performing simultaneous segmentation and optimal anatomy transportation through bi-directional token matching, ensuring stable and accurate results even in challenging conditions.

Authors:Che Liu, Haozhe Wang, Jiazhen Pan, Zhongwei Wan, Yong Dai, Fangzhen Lin, Wenjia Bai, Daniel Rueckert, Rossella Arcucci
Title: Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL
Abstract:
Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.
中文: AlphaMed研究表明,仅通过基于规则的强化学习在公开多选题数据集上训练,无需监督微调或思维链数据即可激发医疗大语言模型的推理能力,在六个医疗基准测试中达到最优性能。
English: AlphaMed demonstrates that reinforcement learning with minimalist rule-based rewards on public multiple-choice QA datasets can effectively induce reasoning capabilities in medical LLMs without supervised fine-tuning or chain-of-thought data, achieving state-of-the-art performance across six medical benchmarks.

Authors:Anjie Le, Henan Liu, Yue Wang, Zhenyu Liu, Rongkun Zhu, Taohan Weng, Jinze Yu, Boyang Wang, Yalun Wu, Kaiwen Yan, Quanlin Sun, Meirui Jiang, Jialun Pei, Siya Liu, Haoyun Zheng, Zhoujun Li, Alison Noble, Jacques Souquet, Xiaoqing Guo, Manxi Lin, Hongcheng Guo
Title: U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding
Abstract:
Ultrasound is a widely used imaging modality critical to global healthcare, yet its interpretation remains challenging because image quality varies with operators, noise, and anatomical structures. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 20 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.
中文摘要:U2-BENCH作为首个全面评估大语言视觉模型在超声影像理解能力的基准测试,发现模型在图像分类方面表现优异,但在空间推理和临床语言生成方面仍存在持续挑战。
English Summary: U2-BENCH is the first comprehensive benchmark evaluating large vision-language models on ultrasound imaging, revealing strong classification capabilities but persistent challenges in spatial reasoning and clinical language generation across diverse medical tasks.

Authors:Yilun Liu, Chunguang Zhao, Xinhua Yang, Hongyong Zeng, Shimin Tao, Weibin Meng, Minggui He, Chang Su, Yan Yu, Hongxia Ma, Li Zhang, Daimeng Wei, Hao Yang
Title: MIDB: Multilingual Instruction Data Booster for Enhancing Multilingual Instruction Synthesis
Abstract:
Despite doubts about data quality, instruction synthesis has been widely applied to instruction tuning (IT) of LLMs as an economical and rapid alternative. Recent endeavors focus on improving data quality for synthesized instruction pairs in English and have facilitated IT of English-centric LLMs. However, data quality issues in multilingual synthesized instruction pairs are even more severe, since the common synthesizing practice is to translate English synthesized data into other languages using machine translation (MT). Besides the known content errors in the English synthesized data, multilingual synthesized instruction data are further exposed to defects introduced by MT and suffer from insufficient localization to the target languages. In this paper, we propose MIDB, a Multilingual Instruction Data Booster that automatically addresses the quality issues in multilingual synthesized data. MIDB is trained on around 36.8k revision examples across 16 languages produced by human linguistic experts, and can thereby boost low-quality data by addressing content errors and MT defects and by improving localization in the synthesized data. Both automatic and human evaluation indicate that MIDB not only steadily improved instruction data quality in 16 languages, but also significantly enhanced the instruction-following and cultural-understanding abilities of multilingual LLMs fine-tuned on MIDB-boosted data.
中文摘要:MIDB作为一种多语言指令数据增强器,通过修正内容错误、机器翻译缺陷并提升本地化程度,显著提升了16种语言的合成指令数据质量,从而有效增强了多语言大模型的指令遵循和文化理解能力。
English Summary: MIDB is a multilingual instruction data booster that enhances the quality of synthesized instruction data across 16 languages by correcting content errors, machine translation defects, and improving localization, thereby significantly improving multilingual LLMs' instruction-following and cultural understanding abilities.

Authors:Kuicai Dong, Yujing Chang, Shijie Huang, Yasheng Wang, Ruiming Tang, Yong Liu
Title: Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
Abstract:
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings reveal that advanced proprietary VLMs outperform open-source alternatives. They also show moderate advantages when using multimodal inputs over text-only inputs, while open-source alternatives show significant performance degradation. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at https://mmdocrag.github.io/MMDocRAG/.
Chinese: MMDocRAG提出了包含4,055个专家标注问答对的综合基准及创新评估指标,通过大规模实验发现专有视觉语言模型在跨模态证据处理上显著优于开源模型,为多模态文档问答系统提供了重要改进方向。
English: MMDocRAG introduces a comprehensive benchmark with 4,055 QA pairs and novel metrics to address DocVQA's limitations in multimodal evidence handling, revealing through extensive testing that proprietary models outperform open-source alternatives and benefit more from visual inputs.

Authors:Ce Zhang, Zifu Wan, Simon Stepputtis, Katia Sycara, Yaqi Xie
Title: Spectral-Aware Global Fusion for RGB-Thermal Semantic Segmentation
Abstract:
Semantic segmentation relying solely on RGB data often struggles in challenging conditions such as low illumination and obscured views, limiting its reliability in critical applications like autonomous driving. To address this, integrating additional thermal radiation data with RGB images demonstrates enhanced performance and robustness. However, how to effectively reconcile the modality discrepancies and fuse the RGB and thermal features remains a well-known challenge. In this work, we address this challenge from a novel spectral perspective. We observe that the multi-modal features can be categorized into two spectral components: low-frequency features that provide broad scene context, including color variations and smooth areas, and high-frequency features that capture modality-specific details such as edges and textures. Inspired by this, we propose the Spectral-aware Global Fusion Network (SGFNet) to effectively enhance and fuse the multi-modal features by explicitly modeling the interactions between the high-frequency, modality-specific features. Our experimental results demonstrate that SGFNet outperforms the state-of-the-art methods on the MFNet and PST900 datasets.
Chinese: 提出的频谱感知全局融合网络(SGFNet)通过建模高频交互有效整合RGB与热辐射特征,在基准数据集上实现了最先进的性能表现。
English: The proposed Spectral-aware Global Fusion Network (SGFNet) effectively integrates RGB and thermal features by modeling their high-frequency interactions, achieving state-of-the-art performance on benchmark datasets.

Authors:Yuanlin Chu, Bo Wang, Xiang Liu, Hong Chen, Aiwei Liu, Xuming Hu
Title: SSR: Speculative Parallel Scaling Reasoning in Test-time
Abstract:
Large language models (LLMs) have achieved impressive results on multi-step mathematical reasoning, yet at the cost of high computational overhead. This challenge is particularly acute for test-time scaling methods such as parallel decoding, which increase answer diversity but scale poorly in efficiency. To address this efficiency-accuracy trade-off, we propose SSR (Speculative Parallel Scaling Reasoning), a training-free framework that leverages a key insight: by introducing speculative decoding at the step level, we can accelerate reasoning without sacrificing correctness. SSR integrates two components: a Selective Parallel Module (SPM) that identifies a small set of promising reasoning strategies via model-internal scoring, and Step-level Speculative Decoding (SSD), which enables efficient draft-target collaboration for fine-grained reasoning acceleration. Experiments on three mathematical benchmarks (AIME 2024, MATH-500, and LiveMathBench) demonstrate that SSR achieves strong gains over baselines. For instance, on LiveMathBench, SSR improves pass@1 accuracy by 13.84% while reducing computation to 80.5% of the baseline FLOPs. On MATH-500, SSR reduces compute to only 30% with no loss in accuracy.
中文摘要:SSR是一种无需训练的框架,通过推测解码和选择性并行模块,在降低计算成本的同时提升了大型语言模型在数学推理任务中的准确率。
English Summary: SSR is a training-free framework that enhances the efficiency of large language models in mathematical reasoning by using speculative decoding and selective parallel modules, achieving higher accuracy with reduced computational costs.
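The abstract does not give the acceptance rule used by Step-level Speculative Decoding, so the following is only a sketch of the general idea under a stated assumption: a cheap draft model proposes each reasoning step, the target model scores it, and the target rewrites the step when the score falls below a threshold. All callables, the threshold, and the toy data are hypothetical.

```python
def step_level_speculative_decode(draft_step, target_step, score_step,
                                  max_steps=8, threshold=0.5):
    """Build a reasoning chain step by step, preferring cheap draft steps.

    draft_step(steps)        -> proposed next step, or None when finished
    score_step(steps, cand)  -> target-model score for the candidate in [0, 1]
    target_step(steps)       -> next step produced by the (expensive) target
    """
    steps, accepted = [], 0
    for _ in range(max_steps):
        candidate = draft_step(steps)
        if candidate is None:
            break
        if score_step(steps, candidate) >= threshold:
            steps.append(candidate)           # keep the cheap draft step
            accepted += 1
        else:
            steps.append(target_step(steps))  # fall back to the target model
    return steps, accepted


# Toy stand-ins for the two models: the draft gets the second step wrong.
reference = ["let x = 3", "x + 4 = 7", "answer: 7"]
draft_out = ["let x = 3", "x + 4 = 8", "answer: 7"]


def draft(steps):
    return draft_out[len(steps)] if len(steps) < len(draft_out) else None


def target(steps):
    return reference[len(steps)]


def score(steps, cand):
    return 1.0 if cand == reference[len(steps)] else 0.0


steps, accepted = step_level_speculative_decode(draft, target, score)
print(steps, accepted)
```

The saving in such a scheme comes from the accepted steps, which never require the target model to generate text, only to score it.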

Authors:Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang
Title: AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
Abstract:
Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: (i) Structured Data Generation, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.
中文:AgentThink是一个创新框架,通过将思维链推理与智能体式工具调用相结合,显著提升了自动驾驶模型的性能,使推理得分提高53.91%、答案准确率提升33.54%,并展现出强大的泛化能力。
English: AgentThink is a novel framework that integrates Chain-of-Thought reasoning with agent-style tool invocation to enhance autonomous driving models, significantly improving reasoning scores by 53.91% and answer accuracy by 33.54% while demonstrating robust generalization capabilities.

Authors:Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, Wenhu Chen
Title: General-Reasoner: Advancing LLM Reasoning Across All Domains
Abstract:
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). In particular, the "Zero" reinforcement learning introduced by Deepseek-R1-Zero enables direct RL training of base LLMs without relying on an intermediate supervised fine-tuning stage. Despite these advancements, current work on LLM reasoning mainly focuses on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations and data is scarcer. In this paper, we propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Our key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers curated by web crawling, covering a wide range of disciplines; and (2) developing a generative model-based answer verifier, which replaces traditional rule-based verification with the capability of chain-of-thought and context-awareness. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics. Our comprehensive evaluation across these 12 benchmarks (e.g. MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH and MATH AMC) demonstrates that General-Reasoner outperforms existing baseline methods, achieving robust and generalizable reasoning performance while maintaining superior effectiveness in mathematical reasoning tasks.
中文摘要:General-Reasoner通过构建大规模多领域数据集和生成式答案验证器,在12个基准测试中展现出卓越的跨领域推理能力,同时保持数学推理优势。
English Summary: The General-Reasoner paradigm enhances LLM reasoning across diverse domains by constructing a large-scale dataset and implementing a generative model-based answer verifier, achieving superior performance across 12 benchmarks while maintaining mathematical reasoning effectiveness.

Authors:Wentao Ma, Weiming Ren, Yiming Jia, Zhuofeng Li, Ping Nie, Ge Zhang, Wenhu Chen
Title: VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Abstract:
Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sobering lesson for existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated due to the possibility of guessing the correct answer. Second, a significant portion of questions in these benchmarks have strong priors that allow models to answer directly without even reading the input video. For example, Gemini-1.5-Pro can achieve over 50% accuracy given a random frame from a long video on Video-MME. We also observe that increasing the number of frames does not necessarily lead to improvement on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs' long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark containing open-ended short-answer questions, which truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we arrive at the following findings: (1) video LMMs show drastic performance drops (>25%) on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared to other MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain.
Chinese Summary: 现有长视频理解基准因过度依赖选择题而存在缺陷,允许猜测和利用先验知识,因此提出采用开放式问题的VideoEval-Pro新基准,能更真实评估多模态模型对长视频的实际理解能力。
English Summary: Current long video understanding benchmarks are flawed due to their reliance on multiple-choice questions that allow guessing and prior knowledge exploitation, prompting the creation of VideoEval-Pro with open-ended questions for more accurate evaluation of multimodal models' true comprehension abilities.
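
The guessing inflation described in this abstract can be illustrated with simple arithmetic. The sketch below assumes a k-option MCQ on which a model guesses uniformly at random whenever it does not know the answer; the function names and the classic correction-for-guessing formula are illustrative assumptions, not taken from the paper.

```python
def expected_mcq_accuracy(p_known: float, num_options: int) -> float:
    """Expected MCQ accuracy when a model truly knows a fraction `p_known`
    of the answers and guesses uniformly at random on the rest."""
    return p_known + (1.0 - p_known) / num_options


def chance_corrected_accuracy(observed: float, num_options: int) -> float:
    """Invert the formula above (the classic correction for guessing)."""
    chance = 1.0 / num_options
    return (observed - chance) / (1.0 - chance)
```

Under these assumptions, a model that genuinely knows 40% of the answers to four-option questions would be expected to score about 55%, which is one way to see why open-ended scoring can look much harsher than MCQ scoring.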

Authors:Zhanglin Wu, Daimeng Wei, Xiaoyu Chen, Hengchao Shang, Jiaxin Guo, Zongyao Li, Yuanchang Luo, Jinlong Yang, Zhiqiang Rao, Hao Yang
Title: Combining the Best of Both Worlds: A Method for Hybrid NMT and LLM Translation
Abstract:
Large language models (LLMs) show promising performance in a variety of downstream tasks, such as machine translation (MT). However, using LLMs for translation suffers from high computational costs and significant latency. Based on our evaluation, in most cases, translations produced by LLMs are comparable to those generated by neural machine translation (NMT) systems. Only in particular scenarios do LLM and NMT models show their respective advantages. As a result, integrating NMT and LLM for translation and using the LLM only when necessary seems to be a sound solution. A scheduling policy that optimizes the translation result while ensuring fast speed and as little LLM usage as possible is therefore required. We compare several scheduling policies and propose a novel and straightforward decider that leverages source sentence features. We conduct extensive experiments on multilingual test sets, and the results show that we can achieve optimal translation performance with minimal LLM usage, demonstrating the effectiveness of our decider.
Chinese: 大语言模型在翻译质量上与神经机器翻译系统相当但成本更高,因此提出了一种基于源语句特征的新调度策略,以在最小化大语言模型使用的同时优化翻译性能。
English: Large language models (LLMs) offer comparable translation quality to neural machine translation (NMT) systems but with higher costs, so a novel scheduling policy using source sentence features is proposed to optimize performance while minimizing LLM usage.

Authors:Dennis Frauen, Valentyn Melnychuk, Jonas Schweisthal, Mihaela van der Schaar, Stefan Feuerriegel
Title: Treatment Effect Estimation for Optimal Decision-Making
Abstract:
Decision-making across various fields, such as medicine, heavily relies on conditional average treatment effects (CATEs). Practitioners commonly make decisions by checking whether the estimated CATE is positive, even though the decision-making performance of modern CATE estimators is poorly understood from a theoretical perspective. In this paper, we study optimal decision-making based on two-stage CATE estimators (e.g., DR-learner), which are considered state-of-the-art and widely used in practice. We prove that, while such estimators may be optimal for estimating CATE, they can be suboptimal when used for decision-making. Intuitively, this occurs because such estimators prioritize CATE accuracy in regions far away from the decision boundary, which is ultimately irrelevant to decision-making. As a remedy, we propose a novel two-stage learning objective that retargets the CATE to balance CATE estimation error and decision performance. We then propose a neural method that optimizes an adaptively-smoothed approximation of our learning objective. Finally, we confirm the effectiveness of our method both empirically and theoretically. In sum, our work is the first to show how two-stage CATE estimators can be adapted for optimal decision-making.
Chinese: 本研究揭示了两阶段条件平均处理效应(CATE)估计器虽在估计精度上表现优异,但在决策应用中可能欠佳,为此提出了新的重定向学习目标与神经网络方法,从理论和实证层面显著提升了决策效能。
English: This study reveals that while two-stage conditional average treatment effect (CATE) estimators excel in estimation accuracy, they may underperform in decision-making contexts, leading to the development of a novel retargeted learning objective and neural method that theoretically and empirically enhances decision performance.
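
The intuition in this abstract, that estimation accuracy and decision quality can diverge, can be seen in a toy calculation. The sketch below uses the sign rule the abstract mentions (treat when the estimated CATE is positive); the numeric values and both "estimators" are hypothetical and are not results from the paper.

```python
def mean_squared_error(estimates, truth):
    """Average squared estimation error of the CATE."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)


def decision_error_rate(estimates, truth):
    """Fraction of units where the sign-based decision (treat if CATE > 0)
    differs from the decision made under the true CATE."""
    wrong = sum((e > 0) != (t > 0) for e, t in zip(estimates, truth))
    return wrong / len(truth)


# Hypothetical toy values: two units near the decision boundary, two far away.
TRUE_CATE = [0.05, -0.05, 2.0, -2.0]
EST_A = [-0.05, 0.05, 2.0, -2.0]  # accurate far away, wrong sign near 0
EST_B = [0.10, -0.10, 1.0, -1.0]  # inaccurate far away, correct signs
```

Here `EST_A` has the lower mean squared error but gets half of the treatment decisions wrong, while `EST_B` has a much larger error yet makes every decision correctly.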

Authors:Dennis Frauen, Maresa Schröder, Konstantin Hess, Stefan Feuerriegel
Title: Orthogonal Survival Learners for Estimating Heterogeneous Treatment Effects from Time-to-Event Data
Abstract:
Estimating heterogeneous treatment effects (HTEs) is crucial for personalized decision-making. However, this task is challenging in survival analysis, which includes time-to-event data with censored outcomes (e.g., due to study dropout). In this paper, we propose a toolbox of novel orthogonal survival learners to estimate HTEs from time-to-event data under censoring. Our learners have three main advantages: (i) we show that learners from our toolbox are guaranteed to be orthogonal and thus come with favorable theoretical properties; (ii) our toolbox allows for incorporating a custom weighting function, which can lead to robustness against different types of low overlap, and (iii) our learners are model-agnostic (i.e., they can be combined with arbitrary machine learning models). We instantiate the learners from our toolbox using several weighting functions and, as a result, propose various neural orthogonal survival learners. Some of these coincide with existing survival learners (including survival versions of the DR- and R-learner), while others are novel and further robust w.r.t. low overlap regimes specific to the survival setting (i.e., survival overlap and censoring overlap). We then empirically verify the effectiveness of our learners for HTE estimation in different low-overlap regimes through numerical experiments. In sum, we provide practitioners with a large toolbox of learners that can be used for randomized and observational studies with censored time-to-event data.
Chinese: 本文提出了一套正交生存学习器工具箱,用于从含删失的生存数据中估计异质处理效应,具有理论保证、对低重叠的鲁棒性及模型无关的灵活性。
English: This paper introduces a toolbox of orthogonal survival learners for estimating heterogeneous treatment effects from censored time-to-event data, offering theoretical guarantees, robustness to low overlap, and model-agnostic flexibility.

Authors:Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long
Title: FlashBias: Fast Computation of Attention with Bias
Abstract:
The attention mechanism has emerged as a foundational module of modern deep learning models and has enabled many milestones in various domains. Moreover, FlashAttention, with its IO-aware speedup, resolves the efficiency issue of standard attention, further promoting its practicality. Beyond canonical attention, attention with bias is also widely used, such as relative position bias in vision and language models and pair representation bias in AlphaFold. In these works, prior knowledge is introduced as an additive bias term on the attention weights to guide the learning process, which has proven essential for model performance. Surprisingly, despite the common usage of attention with bias, targeted efficiency optimization for it is still absent, which seriously hinders its wide application in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias, based on low-rank compressed sensing theory, which provides fast and exact computation for many widely used attention biases and a fast and accurate approximation for biases in a general formalization. FlashBias can take full advantage of the highly optimized matrix multiplication operations in modern GPUs, achieving a 1.5× speedup for AlphaFold and over 2× speedup for attention with bias in vision and language models without loss of accuracy.
Chinese: FlashBias针对带偏置的注意力机制提出高效优化方案,基于低秩压缩感知理论,在AlphaFold及视觉语言模型中实现1.5倍至2倍加速且保持精度无损。
English: FlashBias introduces an efficient optimization method for attention mechanisms with bias, leveraging low-rank compressed sensing to achieve significant speedups in models like AlphaFold and vision-language applications without sacrificing accuracy.
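
One way a low-rank bias can be folded into plain matrix multiplication is sketched below. It assumes the bias factorizes as B = U Vᵀ, so that QKᵀ/√d + B equals the product of the augmented matrices [Q/√d | U] and [K | V]ᵀ. Whether this matches FlashBias's exact construction is an assumption; all function names are illustrative, and the pure-Python code is meant only to check the identity, not to be fast.

```python
import math
import random


def matmul_t(a, b):
    """Return a @ b^T for row-major lists of lists."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def softmax_rows(m):
    """Numerically stable row-wise softmax."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def attention_weights_with_bias(q, k, bias):
    """Reference path: softmax(Q K^T / sqrt(d) + B) with a dense bias B."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul_t(q, k)
    return softmax_rows([[s * scale + b for s, b in zip(srow, brow)]
                         for srow, brow in zip(scores, bias)])


def attention_weights_low_rank(q, k, u, v):
    """Fused path: concatenate the bias factors onto scaled Q and onto K,
    so a single bias-free product yields the same scores."""
    scale = 1.0 / math.sqrt(len(q[0]))
    q_aug = [[x * scale for x in qr] + ur for qr, ur in zip(q, u)]
    k_aug = [kr + vr for kr, vr in zip(k, v)]
    return softmax_rows(matmul_t(q_aug, k_aug))


def max_difference(n=4, d=8, r=2, seed=0):
    """Largest absolute gap between the dense and fused attention weights."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

    q, k, u, v = rand(n, d), rand(n, d), rand(n, r), rand(n, r)
    dense = attention_weights_with_bias(q, k, matmul_t(u, v))
    fused = attention_weights_low_rank(q, k, u, v)
    return max(abs(a - b) for ra, rb in zip(dense, fused) for a, b in zip(ra, rb))
```

The fused path never materializes the n × n bias matrix, which is the property that would let a bias-free attention kernel be reused when the bias has low rank.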

Authors:Filippo Olimpieri, Noemi Giustini, Andrea Lacava, Salvatore D'Oro, Tommaso Melodia, Francesca Cuomo
Title: LibIQ: Toward Real-Time Spectrum Classification in O-RAN dApps
Abstract:
The O-RAN architecture is transforming cellular networks by adopting RAN softwarization and disaggregation concepts to enable data-driven monitoring and control of the network. Such management is enabled by RICs, which facilitate near-real-time and non-real-time network control through xApps and rApps. However, they face limitations, including latency overhead in data exchange between the RAN and RIC, which restricts real-time monitoring, and the inability to access user-plane data due to privacy and security constraints, which hinders use cases like beamforming and spectrum classification. In this paper, we leverage the dApps concept to enable real-time RF spectrum classification with LibIQ, a novel library for RF signals that facilitates efficient spectrum monitoring and signal classification by providing functionalities to read I/Q samples as time series, create datasets, and visualize time-series data through plots and spectrograms. Thanks to LibIQ, I/Q samples can be efficiently processed to detect external RF signals, which are subsequently classified using a CNN inside the library. To achieve accurate spectrum analysis, we created an extensive dataset of time-series-based I/Q samples, representing distinct signal types captured using a custom dApp running on a 5G deployment over the Colosseum network emulator and an OTA testbed. We evaluate our model by deploying LibIQ in heterogeneous scenarios with varying center frequencies, time windows, and external RF signals. In real-time analysis, the model classifies the processed I/Q samples, achieving an average accuracy of approximately 97.8% in identifying signal types across all scenarios. We pledge to release both LibIQ and the dataset created as a publicly available framework upon acceptance.
Chinese: 本文提出LibIQ库,通过使用CNN处理I/Q样本,在O-RAN架构中实现实时射频频谱分类,在多种场景下达到97.8%的平均准确率,解决了现有RIC系统的延迟和数据隐私等限制问题。
English: This paper introduces LibIQ, a library that enables real-time RF spectrum classification within the O-RAN architecture by processing I/Q samples with a CNN, achieving 97.8% accuracy across various scenarios and addressing limitations like latency and data privacy in existing RIC systems.

Authors:Yifan Wu, Lutao Yan, Yizhang Zhu, Yinan Mei, Jiannan Wang, Nan Tang, Yuyu Luo
Title: Boosting Text-to-Chart Retrieval through Training with Synthesized Semantic Insights
Abstract:
Charts are crucial for data analysis and decision-making. Text-to-chart retrieval systems have become increasingly important for Business Intelligence (BI), where users need to find relevant charts that match their analytical needs. These needs can be categorized into precise queries that are well-specified and fuzzy queries that are more exploratory -- both require understanding the semantics and context of the charts. However, existing text-to-chart retrieval solutions often fail to capture the semantic content and contextual information of charts, primarily due to the lack of comprehensive metadata (or semantic insights). To address this limitation, we propose a training data development pipeline that automatically synthesizes hierarchical semantic insights for charts, covering visual patterns (visual-oriented), statistical properties (statistics-oriented), and practical applications (task-oriented), which produces 207,498 semantic insights for 69,166 charts. Based on these, we train a CLIP-based model named ChartFinder to learn better representations of charts for text-to-chart retrieval. Our method leverages rich semantic insights during the training phase to develop a model that understands both visual and semantic aspects of charts. To evaluate text-to-chart retrieval performance, we curate the first benchmark for this task, CRBench, with 21,862 charts and 326 text queries from real-world BI applications, with ground-truth labels verified by crowd workers. Experiments show that ChartFinder significantly outperforms existing methods in text-to-chart retrieval tasks across various settings. For precise queries, ChartFinder achieves up to 66.9% NDCG@10, which is 11.58% higher than state-of-the-art models. In fuzzy query tasks, our method also demonstrates consistent improvements, with an average increase of 5% across nearly all metrics.
Chinese: ChartFinder作为一种基于CLIP的模型,通过自动生成的层次化语义洞察进行训练,在CRBench基准测试中显著优于现有方法,能够有效应对精确和模糊查询的图表检索需求。
English: ChartFinder, a CLIP-based model trained with automatically synthesized hierarchical semantic insights, significantly outperforms existing methods in text-to-chart retrieval for both precise and fuzzy queries, as demonstrated on the CRBench benchmark.
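
For readers unfamiliar with the NDCG@10 metric reported in this abstract, a minimal sketch of one common definition follows. It assumes linear gain with a log2 rank discount and computes the ideal ordering only from the relevance labels of the returned list; the paper's exact variant and the CRBench evaluation code are not given in the abstract, so both choices are assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the best possible ordering.

    `relevances` lists the relevance label of each retrieved chart,
    in the order the system ranked them.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that places its only relevant chart first scores 1.0, while placing it second scores 1 / log2(3), roughly 0.63.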

Authors:Libo Huang, Zhulin An, Chuanguang Yang, Boyu Diao, Fei Wang, Yan Zeng, Zhifeng Hao, Yongjun Xu
Title: PrePrompt: Predictive prompting for class incremental learning
Abstract:
Class Incremental Learning (CIL) based on pre-trained models offers a promising direction for open-world continual learning. Existing methods typically rely on correlation-based strategies, where an image's classification feature is used as a query to retrieve the most related key prompts and select the corresponding value prompts for training. However, these approaches face an inherent limitation: fitting the entire feature space of all tasks with only a few trainable prompts is fundamentally challenging. We propose Predictive Prompting (PrePrompt), a novel CIL framework that circumvents correlation-based limitations by leveraging pre-trained models' natural classification ability to predict task-specific prompts. Specifically, PrePrompt decomposes CIL into a two-stage prediction framework: task-specific prompt prediction followed by label prediction. While theoretically appealing, this framework risks bias toward recent classes due to missing historical data for older classifier calibration. PrePrompt then mitigates this by incorporating feature translation, dynamically balancing stability and plasticity. Experiments across multiple benchmarks demonstrate PrePrompt's superiority over state-of-the-art prompt-based CIL methods. Code is available at github.com/libo-huang/preprompt.
Chinese: PrePrompt提出了一种预测提示框架,通过利用预训练模型预测任务特定提示并结合特征转换来平衡稳定性与可塑性,从而克服了基于相关性的类增量学习方法的局限性。
English: PrePrompt introduces a predictive prompting framework that overcomes the limitations of correlation-based methods in class incremental learning by using pre-trained models to predict task-specific prompts and incorporating feature translation to balance stability and plasticity.

Authors:Chang Zong, Yueting Zhuang, Jian Shao, Weiming Lu
Title: Structural-Temporal Coupling Anomaly Detection with Dynamic Graph Transformer
Abstract:
Detecting anomalous edges in dynamic graphs is an important task in many applications over evolving triple-based data, such as social networks, transaction management, and epidemiology. A major challenge with this task is the absence of structural-temporal coupling information, which decreases the ability of the representation to distinguish anomalies from normal instances. Existing methods focus on handling independent structural and temporal features with embedding models, which ignore the deep interaction between these two types of information. In this paper, we propose a structural-temporal coupling anomaly detection architecture with a dynamic graph transformer model. Specifically, we introduce structural and temporal features from two integration levels to provide anomaly-aware graph evolutionary patterns. Then, a dynamic graph transformer enhanced by two-dimensional positional encoding is implemented to capture both discrimination and contextual consistency signals. Extensive experiments on six datasets demonstrate that our method outperforms current state-of-the-art models. Finally, a case study illustrates the strength of our method when applied to a real-world task.
Chinese: 本文提出了一种结构-时序耦合的异常检测架构,采用动态图变换器模型捕捉综合演化模式,在动态图中检测异常边的任务上优于现有方法。
English: This paper introduces a structural-temporal coupling anomaly detection architecture using a dynamic graph transformer to capture integrated evolutionary patterns, which outperforms existing methods in detecting anomalous edges in dynamic graphs.

Authors:Manish Prajapat, Johannes Köhler, Amon Lahr, Andreas Krause, Melanie N. Zeilinger
Title: Finite-Sample-Based Reachability for Safe Control with Gaussian Process Dynamics
Abstract:
Gaussian Process (GP) regression is shown to be effective for learning unknown dynamics, enabling efficient and safety-aware control strategies across diverse applications. However, existing GP-based model predictive control (GP-MPC) methods either rely on approximations, thus lacking guarantees, or are overly conservative, which limits their practical utility. To close this gap, we present a sampling-based framework that efficiently propagates the model's epistemic uncertainty while avoiding conservatism. We establish a novel sample complexity result that enables the construction of a reachable set using a finite number of dynamics functions sampled from the GP posterior. Building on this, we design a sampling-based GP-MPC scheme that is recursively feasible and guarantees closed-loop safety and stability with high probability. Finally, we showcase the effectiveness of our method on two numerical examples, highlighting accurate reachable set over-approximation and safe closed-loop performance.
Chinese: 本文提出了一种基于采样的高斯过程模型预测控制框架,能有效传播认知不确定性,确保递归可行性、闭环安全性和高概率稳定性,并通过数值算例验证了其有效性。
English: This paper introduces a sampling-based Gaussian Process Model Predictive Control (GP-MPC) framework that efficiently propagates epistemic uncertainty and ensures recursive feasibility, closed-loop safety, and stability with high probability, validated through numerical examples.

Authors:Xu Huang, Weiwen Liu, Xingshan Zeng, Yuefeng Huang, Xinlong Hao, Yuxian Wang, Yirong Zeng, Chuhan Wu, Yasheng Wang, Ruiming Tang, Defu Lian
Title: ToolACE-DEV: Self-Improving Tool Learning via Decomposition and EVolution
Abstract:
The tool-using capability of large language models (LLMs) enables them to access up-to-date external information and handle complex tasks. Current approaches to enhancing this capability primarily rely on distilling advanced models through data synthesis. However, this method incurs significant costs associated with advanced model usage and often results in data compatibility issues, caused by the large discrepancy in knowledge scope between the advanced model and the target model. To address these challenges, we propose ToolACE-DEV, a self-improving framework for tool learning. First, we decompose the tool-learning objective into sub-tasks that enhance basic tool-making and tool-using abilities. Then, we introduce a self-evolving paradigm that allows lightweight models to self-improve, reducing reliance on advanced LLMs. Extensive experiments validate the effectiveness of our approach across models of varying scales and architectures.
Chinese: 提出的ToolACE-DEV框架通过任务分解和自我进化范式,使轻量级语言模型能够自主提升工具制造与使用能力,在解决数据兼容性问题的同时减少对昂贵高级模型的依赖。
English: The proposed ToolACE-DEV framework enables lightweight language models to self-improve their tool-making and tool-using abilities through task decomposition and a self-evolving paradigm, reducing dependence on costly advanced models while addressing data compatibility issues.

Authors:Xiaotian Lin, Yanlin Qi, Yizhang Zhu, Themis Palpanas, Chengliang Chai, Nan Tang, Yuyu Luo
Title: LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning
Abstract:
Instruction tuning has emerged as a critical paradigm for improving the capabilities and alignment of large language models (LLMs). However, existing iterative model-aware data selection methods incur significant computational overhead, as they rely on repeatedly performing full-dataset model inference to estimate sample utility for subsequent training iterations, creating a fundamental efficiency bottleneck. In this paper, we propose LEAD, an efficient iterative data selection framework that accurately estimates sample utility entirely within the standard training loop, eliminating the need for costly additional model inference. At its core, LEAD introduces Instance-Level Dynamic Uncertainty (IDU), a theoretically grounded utility function combining instantaneous training loss, gradient-based approximation of loss changes, and exponential smoothing of historical loss signals. To further scale efficiently to large datasets, LEAD employs a two-stage, coarse-to-fine selection strategy, adaptively prioritizing informative clusters through a multi-armed bandit mechanism, followed by precise fine-grained selection of high-utility samples using IDU. Extensive experiments across four diverse benchmarks show that LEAD significantly outperforms state-of-the-art methods, improving average model performance by 6.1%-10.8% while using only 2.5% of the training data and reducing overall training time by 5-10x.
Chinese: LEAD框架通过实例级动态不确定性估计和两阶段筛选策略,在仅使用少量训练数据的情况下显著提升模型性能,同时大幅降低训练时间。
English: The proposed LEAD framework efficiently selects high-utility training data through instance-level dynamic uncertainty estimation and a two-stage selection strategy, significantly boosting model performance while drastically reducing data usage and training time.

Authors:Ming Liu, Siyuan Liang, Koushik Howlader, Liwen Wang, Dacheng Tao, Wensheng Zhang
Title: Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving
Abstract:
Vision-Language Models (VLMs) have been integrated into autonomous driving systems to enhance reasoning capabilities through tasks such as Visual Question Answering (VQA). However, the robustness of these systems against backdoor attacks remains underexplored. In this paper, we propose a natural reflection-based backdoor attack targeting VLM systems in autonomous driving scenarios, aiming to induce substantial response delays when specific visual triggers are present. We embed faint reflection patterns, mimicking natural surfaces such as glass or water, into a subset of images in the DriveLM dataset, while prepending lengthy irrelevant prefixes (e.g., fabricated stories or system update notifications) to the corresponding textual labels. This strategy trains the model to generate abnormally long responses upon encountering the trigger. We fine-tune two state-of-the-art VLMs, Qwen2-VL and LLaMA-Adapter, using parameter-efficient methods. Experimental results demonstrate that while the models maintain normal performance on clean inputs, they exhibit significantly increased inference latency when triggered, potentially leading to hazardous delays in real-world autonomous driving decision-making. Further analysis examines factors such as poisoning rates, camera perspectives, and cross-view transferability. Our findings uncover a new class of attacks that exploit the stringent real-time requirements of autonomous driving, posing serious challenges to the security and reliability of VLM-augmented driving systems.
Chinese: 本研究针对自动驾驶中的视觉语言模型提出一种基于自然反射的后门攻击,通过嵌入视觉触发器和篡改文本标签,使模型在保持正常输入性能的同时,遭遇触发时产生显著响应延迟,揭示了实时自动驾驶系统的新型安全威胁。
English: This study introduces a natural reflection-based backdoor attack on Vision-Language Models in autonomous driving, where embedded visual triggers and manipulated text labels cause significant response delays during inference while maintaining normal performance on clean inputs.

Authors:Xi Xiao, Yunbei Zhang, Thanh-Huy Nguyen, Ba-Thinh Lam, Janet Wang, Lin Zhao, Jihun Hamm, Tianyang Wang, Xingjian Li, Xiao Wang, Hao Xu, Tianming Liu, Min Xu
Title: Describe Anything in Medical Images
Abstract:
Localized image captioning has made significant progress with models like the Describe Anything Model (DAM), which can generate detailed region-specific descriptions without explicit region-text supervision. However, such capabilities have yet to be widely applied to specialized domains like medical imaging, where diagnostic interpretation relies on subtle regional findings rather than global understanding. To bridge this gap, we propose MedDAM, the first comprehensive framework leveraging large vision-language models for region-specific captioning in medical images. MedDAM employs medical expert-designed prompts tailored to specific imaging modalities and establishes a robust evaluation benchmark comprising a customized assessment protocol, data pre-processing pipeline, and specialized QA template library. This benchmark evaluates both MedDAM and other adaptable large vision-language models, focusing on clinical factuality through attribute-level verification tasks, thereby circumventing the absence of ground-truth region-caption pairs in medical datasets. Extensive experiments on the VinDr-CXR, LIDC-IDRI, and SkinCon datasets demonstrate MedDAM's superiority over leading peers (including GPT-4o, Claude 3.7 Sonnet, LLaMA-3.2 Vision, Qwen2.5-VL, GPT4RoI, and OMG-LLaVA) in the task, revealing the importance of region-level semantic alignment in medical image understanding and establishing MedDAM as a promising foundation for clinical vision-language integration.
Chinese: MedDAM提出了首个利用大型视觉语言模型进行医学图像区域特定描述的综合框架,通过专家设计的提示和专门评估基准,在临床事实性和语义对齐方面展现出优于主流模型的性能。
English: MedDAM introduces the first comprehensive framework for region-specific captioning in medical images by leveraging large vision-language models with expert-designed prompts and a specialized evaluation benchmark, demonstrating superior performance over leading models in clinical factuality and semantic alignment.

Authors:Henan Sun, Xunkai Li, Lei Zhu, Junyi Han, Guang Zeng, Ronghua Li, Guoren Wang
Title: Rethinking Graph Out-Of-Distribution Generalization: A Learnable Random Walk Perspective
Abstract:
Out-Of-Distribution (OOD) generalization has gained increasing attention in machine learning on graphs, as graph neural networks (GNNs) often exhibit performance degradation under distribution shifts. Existing graph OOD methods tend to follow the basic ideas of invariant risk minimization and structural causal models, interpreting the invariant knowledge across datasets under various distribution shifts as graph topology or graph spectrum. However, these interpretations may be inconsistent with real-world scenarios, as neither invariant topology nor invariant spectrum is assured. In this paper, we advocate the learnable random walk (LRW) perspective as the instantiation of invariant knowledge, and propose LRW-OOD to realize graph OOD generalization learning. Instead of employing a fixed probability transition matrix (i.e., the degree-normalized adjacency matrix), we parameterize the transition matrix with an LRW-sampler and a path encoder. Furthermore, we propose a kernel density estimation (KDE)-based mutual information (MI) loss to generate random walk sequences that adhere to OOD principles. Extensive experiments demonstrate that our model can effectively enhance graph OOD generalization under various types of distribution shifts and yield a significant accuracy improvement of 3.87% over state-of-the-art graph OOD generalization baselines.
Chinese: 本文提出LRW-OOD方法,通过可学习随机游走和互信息损失来实现图数据的分布外泛化,在不同分布偏移下显著提升模型性能,相比现有最优方法准确率提高3.87%。
English: This paper introduces LRW-OOD, a novel approach for graph Out-Of-Distribution generalization that utilizes learnable random walks and a mutual information loss to enhance model performance under distribution shifts, achieving a 3.87% accuracy improvement over existing methods.
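
For context, the fixed baseline that this abstract contrasts against, a random walk driven by the degree-normalized adjacency matrix, can be sketched as follows. This is not LRW-OOD's learned sampler; the function names and the dense list-of-lists graph representation are illustrative assumptions.

```python
import random


def transition_matrix(adj):
    """Row-normalize an adjacency matrix: P[i][j] = A[i][j] / deg(i)."""
    rows = []
    for row in adj:
        deg = sum(row)
        rows.append([a / deg if deg else 0.0 for a in row])
    return rows


def random_walk(adj, start, length, seed=0):
    """Sample a walk of up to `length` steps under the fixed matrix P."""
    rng = random.Random(seed)
    p = transition_matrix(adj)
    walk = [start]
    for _ in range(length):
        probs = p[walk[-1]]
        if sum(probs) == 0:  # isolated node: the walk cannot continue
            break
        walk.append(rng.choices(range(len(probs)), weights=probs)[0])
    return walk
```

A learnable variant would replace the fixed rows of `P` with probabilities produced by a trained model, which is the role the abstract assigns to the LRW-sampler and path encoder.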

Authors:Tianzhe Xiao, Yichen Li, Yu Zhou, Yining Qi, Yi Liu, Wei Wang, Haozhao Wang, Yi Wang, Ruixuan Li
Title: FedRE: Robust and Effective Federated Learning with Privacy Preference
Abstract:
Although Federated Learning (FL) employs gradient aggregation at the server for distributed training to prevent the privacy leakage of raw data, private information can still be divulged through the analysis of gradients uploaded by clients. Substantial efforts have been made to integrate local differential privacy (LDP) into the system to achieve a strict privacy guarantee. However, existing methods fail to take practical issues into account: they perturb every sample with the same mechanism, while each client may have its own privacy preferences regarding privacy-sensitive information (PSI), which is not uniformly distributed across the raw data. In such a case, excessive privacy protection for privacy-insensitive information can introduce unnecessary noise, which may degrade model performance. In this work, we study the PSI within data and develop FedRE, which can simultaneously achieve robustness and effectiveness benefits with LDP protection. More specifically, we first define PSI with regard to the privacy preferences of each client. Then, we optimize the LDP by allocating less privacy budget to gradients with higher PSI in a layer-wise manner, thus providing a stricter privacy guarantee for PSI. Furthermore, to mitigate the performance degradation caused by LDP, we design a parameter aggregation mechanism based on the distribution of the perturbed information. We conducted experiments on text tamper detection with the T-SROIE and DocTamper datasets, and FedRE achieves competitive performance compared to state-of-the-art methods.
Chinese Summary: 联邦学习中的本地差分隐私方法因统一添加噪声会导致模型性能下降,为此提出的FedRE框架根据隐私敏感信息分层分配隐私预算,并通过基于扰动信息分布的参数聚合机制,在增强保护的同时保持模型竞争力。
English Summary: Federated Learning with local differential privacy risks model degradation from uniform noise application, so FedRE is proposed to allocate privacy budgets based on privacy-sensitive information and uses distribution-aware aggregation to maintain performance while enhancing protection.

Authors:Jiankai Tang, Kegang Wang, Yingke Ding, Jiatong Ji, Zeyu Wang, Xiyuxing Zhang, Ping Chen, Yuanchun Shi, Yuntao Wang
Title: A Dataset and Toolkit for Multiparameter Cardiovascular Physiology Sensing on Rings
Abstract:
Smart rings offer a convenient way to continuously and unobtrusively monitor cardiovascular physiological signals. However, a gap remains between the ring hardware and reliable methods for estimating cardiovascular parameters, partly due to the lack of publicly available datasets and standardized analysis tools. In this work, we present $τ$-Ring, the first open-source ring-based dataset designed for cardiovascular physiological sensing. The dataset comprises photoplethysmography signals (infrared and red channels) and 3-axis accelerometer data collected from two rings (reflective and transmissive optical paths), with 28.21 hours of raw data from 34 subjects across seven activities. $τ$-Ring encompasses both stationary and motion scenarios, as well as stimulus-evoked abnormal physiological states, annotated with four ground-truth labels: heart rate, respiratory rate, oxygen saturation, and blood pressure. Using our proposed RingTool toolkit, we evaluated three widely-used physics-based methods and four cutting-edge deep learning approaches. Our results show superior performance compared to commercial rings, achieving best MAE values of 5.18 BPM for heart rate, 2.98 BPM for respiratory rate, 3.22% for oxygen saturation, and 13.33/7.56 mmHg for systolic/diastolic blood pressure estimation. The open-sourced dataset and toolkit aim to foster further research and community-driven advances in ring-based cardiovascular health sensing.
中文摘要:智能戒指为持续监测心血管信号提供了便捷方式,但在可靠参数估计方面仍存在挑战;本研究推出了首个开源戒指数据集及工具包,在关键健康指标测量上展现出优于商用设备的性能。
English Summary: Smart rings provide a convenient means for continuous cardiovascular monitoring, yet face challenges in reliable parameter estimation due to limited datasets and tools; this work introduces the first open-source ring-based dataset and toolkit, demonstrating superior performance over commercial devices in measuring key health metrics.

Authors:Amber Batool, Faryal Batool, Roohan Ahmed Khan, Muhammad Ahsan Mustafa, Aleksey Fedoseev, Dzmitry Tsetserukou
Title: NMPC-Lander: Nonlinear MPC with Barrier Function for UAV Landing on a Mobile Platform
Abstract:
Quadcopters are versatile aerial robots gaining popularity in numerous critical applications. However, their operational effectiveness is constrained by limited battery life and restricted flight range. To address these challenges, autonomous drone landing on stationary or mobile charging and battery-swapping stations has become an essential capability. In this study, we present NMPC-Lander, a novel control architecture that integrates Nonlinear Model Predictive Control (NMPC) with Control Barrier Functions (CBF) to achieve precise and safe autonomous landing on both static and dynamic platforms. Our approach employs NMPC for accurate trajectory tracking and landing, while simultaneously incorporating CBF to ensure collision avoidance with static obstacles. Experimental evaluations on the real hardware demonstrate high precision in landing scenarios, with an average final position error of 9.0 cm and 11 cm for stationary and mobile platforms, respectively. Notably, NMPC-Lander outperforms the B-spline combined with the A* planning method by nearly threefold in terms of position tracking, underscoring its superior robustness and practical effectiveness.
中文: 本研究提出NMPC-Lander控制框架,通过结合非线性模型预测控制与控制屏障函数,实现了无人机在静态和动态平台上的精准安全自主降落,实际测试中展现出卓越的精度与鲁棒性。
English: This study introduces NMPC-Lander, a control framework combining Nonlinear Model Predictive Control and Control Barrier Functions to enable precise, safe autonomous drone landing on static and dynamic platforms, demonstrating superior accuracy and robustness in real-world tests.
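As a rough illustration of how a control barrier function can act as a collision-avoidance constraint inside a predictive controller, the sketch below checks a standard discrete-time CBF condition, h(x_next) >= (1 - gamma) * h(x_now), for a circular obstacle. The barrier shape, the value of `gamma`, and the function names are assumptions chosen for the example; the abstract does not specify the formulation used in NMPC-Lander.

```python
def barrier(position, obstacle_center, obstacle_radius):
    """h(x) >= 0 exactly when the 2D position lies on or outside the circular obstacle."""
    dx = position[0] - obstacle_center[0]
    dy = position[1] - obstacle_center[1]
    return dx * dx + dy * dy - obstacle_radius ** 2

def cbf_satisfied(current, nxt, obstacle_center, obstacle_radius, gamma=0.2):
    """Discrete-time CBF condition: the barrier value may shrink by at most a factor (1 - gamma) per step."""
    h_now = barrier(current, obstacle_center, obstacle_radius)
    h_next = barrier(nxt, obstacle_center, obstacle_radius)
    return h_next >= (1.0 - gamma) * h_now

obstacle, radius = (0.0, 0.0), 1.0
slow_approach = cbf_satisfied((3.0, 0.0), (2.9, 0.0), obstacle, radius)  # small step toward the obstacle
fast_approach = cbf_satisfied((3.0, 0.0), (1.5, 0.0), obstacle, radius)  # large step toward the obstacle
```

In an NMPC setting a condition of this kind would typically be imposed at every step of the prediction horizon, so the optimizer can still approach an obstacle but only at a bounded rate.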

Authors:Haoqi Yang, Luohe Shi, Qiwei Li, Zuchao Li, Ping Wang, Bo Du, Mengjia Shen, Hai Zhao
Title: Faster MoE LLM Inference for Extremely Large Models
Abstract:
Sparse Mixture of Experts (MoE) large language models (LLMs) are gradually becoming the mainstream approach for ultra-large-scale models. Existing optimization efforts for MoE models have focused primarily on coarse-grained MoE architectures. With the emergence of DeepSeek models, fine-grained MoE models are gaining popularity, yet research on them remains limited. We therefore examine their efficiency dynamics under different service loads. Additionally, fine-grained models allow deployers to reduce the number of routed experts, both the activated count and the total count, raising the question of how this reduction affects the trade-off between MoE efficiency and performance. Our findings indicate that while deploying MoE models presents greater challenges, it also offers significant optimization opportunities. Reducing the number of activated experts can lead to substantial efficiency improvements in certain scenarios, with only minor performance degradation. Reducing the total number of experts provides limited efficiency gains but results in severe performance degradation. Our method can increase throughput by at least 10% without any performance degradation. Overall, we conclude that MoE inference optimization remains an area with substantial potential for exploration and improvement.
中文摘要:稀疏专家混合模型在部署时虽面临挑战,但通过减少激活专家数量可在保证性能的同时显著提升效率,而削减专家总数则收效甚微且会导致性能严重下降。
English Summary: Sparse Mixture of Experts (MoE) models present both deployment challenges and significant optimization opportunities, where reducing activated experts can substantially improve efficiency with minimal performance loss while cutting total experts yields limited gains but severe degradation.
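To make "reducing the number of activated experts" concrete, here is a minimal top-k routing sketch in which `k` is the activated-expert count. It uses one common variant (softmax over the selected experts only); real models differ in gating function and normalization, and the function name `route_top_k` is hypothetical rather than taken from the paper.

```python
import math

def route_top_k(gate_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts of one token."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exp_scores = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exp_scores.values())
    return {i: s / total for i, s in exp_scores.items()}

logits = [2.0, 0.5, 1.0, -1.0]       # gate scores for four routed experts
two_active = route_top_k(logits, 2)  # experts 0 and 2 are evaluated
one_active = route_top_k(logits, 1)  # only expert 0 is evaluated
```

Lowering `k` at deployment time cuts the expert computation per token roughly in proportion, which is the efficiency lever the abstract discusses; the accuracy cost depends on the model and has to be measured.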

Authors:Jun Fang, Yanuo Zhou, Ka I Chan, Jiajin Li, Zeyi Sun, Zhengnan Li, Zicong Fu, Hongjing Piao, Haodong Xu, Yuanchun Shi, Yuntao Wang
Title: A Review of Behavioral Closed-Loop Paradigm from Sensing to Intervention for Ingestion Health
Abstract:
Ingestive behavior plays a critical role in health, yet many existing interventions remain limited to static guidance or manual self-tracking. With the increasing integration of sensors, context-aware computing, and perceptual computing, recent systems have begun to support closed-loop interventions that dynamically sense user behavior and provide feedback during or around ingestion episodes. In this survey, we review 136 studies that leverage sensor-enabled or interaction-mediated approaches to influence ingestive behavior. We propose a behavioral closed-loop paradigm rooted in context-aware computing and inspired by HCI behavior change frameworks, comprising four components: target behaviors, sensing modalities, reasoning and intervention strategies. A taxonomy of sensing and intervention modalities is presented, organized along human- and environment-based dimensions. Our analysis also examines evaluation methods and design trends across different modality-behavior pairings. This review reveals prevailing patterns and critical gaps, offering design insights for future adaptive and context-aware ingestion health interventions.
中文:随着传感器和情境感知计算的发展,闭环干预系统能够动态监测并影响摄食行为,本文通过分析136项研究提出分类框架并揭示未来健康技术的设计方向。
English: Recent advances in sensor and context-aware computing enable closed-loop interventions that dynamically monitor and influence ingestive behavior, with this survey analyzing 136 studies to propose a taxonomy and reveal design insights for future health technologies.

Authors:Huangyue Yu, Baoxiong Jia, Yixin Chen, Yandan Yang, Puhao Li, Rongpeng Su, Jiaxin Li, Qing Li, Wei Liang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang
Title: MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans
Abstract:
Embodied AI (EAI) research requires high-quality, diverse 3D scenes to effectively support skill acquisition, sim-to-real transfer, and generalization. Achieving these quality standards, however, necessitates the precise replication of real-world object diversity. Existing datasets demonstrate that this process heavily relies on artist-driven designs, which demand substantial human effort and present significant scalability challenges. To scalably produce realistic and interactive 3D scenes, we first present MetaScenes, a large-scale, simulatable 3D scene dataset constructed from real-world scans, which includes 15366 objects spanning 831 fine-grained categories. Then, we introduce Scan2Sim, a robust multi-modal alignment model, which enables the automated, high-quality replacement of assets, thereby eliminating the reliance on artist-driven designs for scaling 3D scenes. We further propose two benchmarks to evaluate MetaScenes: a detailed scene synthesis task focused on small item layouts for robotic manipulation and a domain transfer task in vision-and-language navigation (VLN) to validate cross-domain transfer. Results confirm MetaScenes' potential to enhance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research. Project website: https://meta-scenes.github.io/.
中文摘要:MetaScenes通过基于真实扫描的大规模可模拟3D场景数据集和Scan2Sim自动资产替换模型,解决了艺术家驱动设计的可扩展性难题,为具身AI提供了增强泛化能力和仿真到现实应用的新可能。
English Summary: MetaScenes introduces a large-scale, simulatable 3D scene dataset from real-world scans and Scan2Sim, an automated asset replacement model, to overcome scalability challenges in artist-driven designs and enhance Embodied AI through improved generalization and sim-to-real applications.

Authors:Wei Liu, Zhongyu Niu, Lang Gao, Zhiying Deng, Jun Wang, Haozhao Wang, Ruixuan Li
Title: Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets
Abstract:
This study investigates the self-rationalization framework constructed with a cooperative game, where a generator initially extracts the most informative segment from raw input, and a subsequent predictor utilizes the selected subset for its input. The generator and predictor are trained collaboratively to maximize prediction accuracy. In this paper, we first uncover a potential caveat: such a cooperative game could unintentionally introduce a sampling bias during rationale extraction. Specifically, the generator might inadvertently create an incorrect correlation between the selected rationale candidate and the label, even when they are semantically unrelated in the original dataset. Subsequently, we elucidate the origins of this bias using both detailed theoretical analysis and empirical evidence. Our findings suggest a direction for inspecting these correlations through attacks, based on which we further introduce an instruction to prevent the predictor from learning the correlations. Through experiments on six text classification datasets and two graph classification datasets using three network architectures (GRUs, BERT, and GCN), we show that our method not only significantly outperforms recent rationalization methods, but also achieves comparable or even better results than a representative LLM (llama3.1-8b-instruct).
中文摘要:本研究揭示了合作式合理化可能因在理由选择中产生伪相关而引入抽样偏差,并提出了解决方法,其性能显著优于现有技术且达到了领先大语言模型的水平。
English Summary: This study reveals that cooperative rationalization can introduce sampling bias by creating false correlations between selected rationales and labels, and proposes a method to mitigate this issue, achieving superior performance over existing techniques and matching a leading large language model.

Authors:Han Yang, Chuanguang Yang, Qiuli Wang, Zhulin An, Weilun Feng, Libo Huang, Yongjun Xu
Title: Multi-party Collaborative Attention Control for Image Customization
Abstract:
The rapid advancement of diffusion models has increased the need for customized image generation. However, current customization methods face several limitations: 1) they typically accept either image or text conditions alone; 2) customization in complex visual scenarios often leads to subject leakage or confusion; 3) image-conditioned outputs tend to suffer from inconsistent backgrounds; and 4) they incur high computational costs. To address these issues, this paper introduces Multi-party Collaborative Attention Control (MCA-Ctrl), a tuning-free method that enables high-quality image customization using both text and complex visual conditions. Specifically, MCA-Ctrl leverages two key operations within the self-attention layer to coordinate multiple parallel diffusion processes and guide the target image generation. This approach allows MCA-Ctrl to capture the content and appearance of specific subjects while maintaining semantic consistency with the conditional input. Additionally, to mitigate subject leakage and confusion issues common in complex visual scenarios, we introduce a Subject Localization Module that extracts precise subject and editable image layers based on user instructions. Extensive quantitative and human evaluation experiments show that MCA-Ctrl outperforms existing methods in zero-shot image customization, effectively resolving the mentioned issues.
中文摘要:本文提出MCA-Ctrl方法,通过多主体协同注意力控制和主题定位模块,在无需调参的情况下实现文本与复杂视觉条件的图像定制,有效解决了主体泄露和背景不一致等问题。
English Summary: This paper introduces MCA-Ctrl, a tuning-free method that enables high-quality image customization using both text and complex visual conditions while addressing common issues like subject leakage and inconsistent backgrounds through collaborative attention control and subject localization.

Authors:Loc X. Nguyen, Sheikh Salman Hassan, Yu Min Park, Yan Kyaw Tun, Zhu Han, Choong Seon Hong
Title: SemSpaceFL: A Collaborative Hierarchical Federated Learning Framework for Semantic Communication in 6G LEO Satellites
Abstract:
The advent of the sixth-generation (6G) wireless networks, enhanced by artificial intelligence, promises ubiquitous connectivity through Low Earth Orbit (LEO) satellites. These satellites are capable of collecting vast amounts of geographically diverse and real-time data, which can be immensely valuable for training intelligent models. However, limited inter-satellite communication and data privacy constraints hinder data collection on a single server for training. Therefore, we propose SemSpaceFL, a novel hierarchical federated learning (HFL) framework for LEO satellite networks, with integrated semantic communication capabilities. Our framework introduces a two-tier aggregation architecture where satellite models are first aggregated at regional gateways before final consolidation at a cloud server, which explicitly accounts for satellite mobility patterns and energy constraints. The key innovation lies in our novel aggregation approach, which dynamically adjusts the contribution of each satellite based on its trajectory and association with different gateways, which ensures stable model convergence despite the highly dynamic nature of LEO constellations. To further enhance communication efficiency, we incorporate semantic encoding-decoding techniques trained through the proposed HFL framework, which enables intelligent data compression while maintaining signal integrity. Our experimental results demonstrate that the proposed aggregation strategy achieves superior performance and faster convergence compared to existing benchmarks, while effectively managing the challenges of satellite mobility and energy limitations in dynamic LEO networks.
中文摘要:提出的SemSpaceFL框架为6G低轨卫星网络引入了集成语义通信的分层联邦学习系统,通过考虑卫星移动性和能量约束的动态聚合策略,实现了优越性能和更快收敛速度。
English Summary: The proposed SemSpaceFL framework introduces a hierarchical federated learning system with semantic communication for 6G-enabled LEO satellite networks, featuring dynamic aggregation that accounts for satellite mobility and energy constraints to achieve superior performance and faster convergence.
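The two-tier aggregation described in the abstract (satellites to regional gateways, then gateways to a cloud server) can be sketched as two rounds of weighted averaging. The weights below are placeholders: the paper derives each satellite's contribution from its trajectory and gateway association, whereas this sketch simply takes a hypothetical per-satellite weight as input and represents a model as a flat list of floats.

```python
def weighted_average(models, weights):
    """Element-wise weighted mean of equally sized parameter vectors."""
    total = sum(weights)
    dim = len(models[0])
    return [sum(w * m[i] for m, w in zip(models, weights)) / total for i in range(dim)]

def hierarchical_aggregate(gateways):
    """gateways: one list per gateway, each holding (model_vector, weight) pairs for its satellites."""
    regional_models, regional_weights = [], []
    for members in gateways:
        models = [m for m, _ in members]
        weights = [w for _, w in members]
        regional_models.append(weighted_average(models, weights))  # tier 1: gateway level
        regional_weights.append(sum(weights))
    return weighted_average(regional_models, regional_weights)     # tier 2: cloud level

# Gateway A serves two satellites, gateway B serves one with a larger (assumed) weight.
global_model = hierarchical_aggregate([
    [([1.0], 1.0), ([3.0], 1.0)],
    [([5.0], 2.0)],
])
```

Here the regional models are [2.0] and [5.0], each carrying total weight 2.0, so the cloud-level result is [3.5].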

Authors:Zexin Fu, Riccardo Tedeschi, Gianmarco Ottavi, Nils Wistoff, César Fuguet, Davide Rossi, Luca Benini
Title: Ramping Up Open-Source RISC-V Cores: Assessing the Energy Efficiency of Superscalar, Out-of-Order Execution
Abstract:
Open-source RISC-V cores are increasingly demanded in domains like automotive and space, where achieving high instructions per cycle (IPC) through superscalar and out-of-order (OoO) execution is crucial. However, high-performance open-source RISC-V cores face adoption challenges: some (e.g. BOOM, Xiangshan) are developed in Chisel with limited support from industrial electronic design automation (EDA) tools. Others, like the XuanTie C910 core, use proprietary interfaces and protocols, including non-standard AXI protocol extensions, interrupts, and debug support. In this work, we present a modified version of the OoO C910 core to achieve full RISC-V standard compliance in its debug, interrupt, and memory interfaces. We also introduce CVA6S+, an enhanced version of the dual-issue, industry-supported open-source CVA6 core. CVA6S+ achieves a 34.4% performance improvement over the CVA6 core. We conduct a detailed performance, area, power, and energy analysis on the superscalar out-of-order C910, the superscalar in-order CVA6S+, and the vanilla, single-issue in-order CVA6, all implemented in a 22nm technology and integrated into Cheshire, an open-source modular SoC. We examine the performance and efficiency of different microarchitectures using the same ISA, SoC, and implementation with identical technology, tools, and methodologies. The area and performance rankings of CVA6, CVA6S+, and C910 follow expected trends: compared to the scalar CVA6, CVA6S+ shows an area increase of 6% and an IPC improvement of 34.4%, while C910 exhibits a 75% increase in area and a 119.5% improvement in IPC. However, efficiency analysis reveals that CVA6S+ leads in area efficiency (GOPS/mm²), while the C910 is highly competitive in energy efficiency (GOPS/W). This challenges the common belief that high performance in superscalar and out-of-order cores inherently comes at a significant cost in area and energy efficiency.
中文: 本研究通过改进C910核心并增强CVA6S+核心,分析表明高性能超标量和乱序执行的RISC-V核心在保持效率方面具有竞争力,挑战了高性能必然以面积和能效为代价的传统认知。
English: This work presents a modified, fully RISC-V-compliant C910 core and an enhanced CVA6S+ core, demonstrating through analysis that high-performance superscalar and out-of-order RISC-V cores can achieve competitive efficiency, challenging the assumption that such performance inherently sacrifices area and energy efficiency.
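The area and IPC figures quoted in the abstract allow a quick back-of-the-envelope check of the area-efficiency ranking. The sketch below computes IPC per unit area relative to the scalar CVA6. This is only a proxy: GOPS/mm² also depends on clock frequency, which the abstract does not report, so equal frequency is an assumption of this calculation, and the function name is hypothetical.

```python
def relative_ipc_per_area(ipc_gain_pct, area_increase_pct):
    """IPC-per-area ratio versus the baseline core (baseline = 1.0)."""
    return (1 + ipc_gain_pct / 100) / (1 + area_increase_pct / 100)

cva6s_plus = relative_ipc_per_area(34.4, 6.0)   # 1.344 / 1.06, about 1.268
c910 = relative_ipc_per_area(119.5, 75.0)       # 2.195 / 1.75, about 1.254

print(f"CVA6S+: {cva6s_plus:.3f}x, C910: {c910:.3f}x IPC per area vs. CVA6")
```

Under that assumption both enhanced cores deliver roughly 25 to 27 percent more IPC per unit area than CVA6, with CVA6S+ slightly ahead, which is in line with the abstract's statement that CVA6S+ leads in area efficiency.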

Authors:Han Bao, Qinying Wang, Zhi Chen, Qingming Li, Xuhong Zhang, Changjiang Li, Zonghui Wang, Shouling Ji, Wenzhi Chen
Title: VModA: An Effective Framework for Adaptive NSFW Image Moderation
Abstract:
Not Safe/Suitable for Work (NSFW) content is rampant on social networks and poses serious harm to citizens, especially minors. Current detection methods mainly rely on deep learning-based image recognition and classification. However, NSFW images are now presented in increasingly sophisticated ways, often using image details and complex semantics to obscure their true nature or attract more views. Although still understandable to humans, these images often evade existing detection methods, posing a significant threat. Further complicating the issue, varying regulations across platforms and regions create additional challenges for effective moderation, leading to detection bias and reduced accuracy. To address this, we propose VModA, a general and effective framework that adapts to diverse moderation rules and handles complex, semantically rich NSFW content across categories. Experimental results show that VModA significantly outperforms existing methods, achieving up to a 54.3% accuracy improvement across NSFW types, including those with complex semantics. Further experiments demonstrate that our method exhibits strong adaptability across categories, scenarios, and base VLMs. We also identified inconsistent and controversial label samples in public NSFW benchmark datasets, re-annotated them, and submitted corrections to the original maintainers. Two datasets have confirmed the updates so far. Additionally, we evaluate VModA in real-world scenarios to demonstrate its practical effectiveness.
中文:VModA是一种通用有效的框架,能适应不同审核规则并处理语义复杂的NSFW内容,在各类场景中准确率最高提升54.3%,同时展现出强大的跨类别适应能力。
English: VModA is a versatile framework that effectively adapts to diverse moderation rules and detects complex NSFW content with sophisticated semantics, achieving up to 54.3% higher accuracy and demonstrating strong adaptability across various scenarios and datasets.

Authors:Yao Guo, Yang Ai, Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Zhen-Hua Ling
Title: Vision-Integrated High-Quality Neural Speech Coding
Abstract:
This paper proposes a novel vision-integrated neural speech codec (VNSC), which aims to enhance speech coding quality by leveraging visual modality information. In VNSC, the image analysis-synthesis module extracts visual features from lip images, while the feature fusion module facilitates interaction between the image analysis-synthesis module and the speech coding module, transmitting visual information to assist the speech coding process. Depending on whether visual information is available during the inference stage, the feature fusion module integrates visual features into the speech coding module using either explicit integration or implicit distillation strategies. Experimental results confirm that integrating visual information effectively improves the quality of the decoded speech and enhances the noise robustness of the neural speech codec, without increasing the bitrate.
中文摘要:本文提出了一种新颖的视觉集成神经语音编解码器,通过利用唇部图像的视觉信息来提升语音编码质量,在不增加比特率的情况下增强了抗噪能力。
English Summary: This paper introduces a vision-integrated neural speech codec that enhances speech quality and noise robustness by incorporating visual information from lip images, without increasing the bitrate.

Authors:Jianlin Ye, Savvas Papaioannou, Panayiotis Kolios
Title: VLM-RRT: Vision Language Model Guided RRT Search for Autonomous UAV Navigation
Abstract:
Path planning is a fundamental capability of autonomous Unmanned Aerial Vehicles (UAVs), enabling them to efficiently navigate toward a target region or explore complex environments while avoiding obstacles. Traditional path-planning methods, such as Rapidly-exploring Random Trees (RRT), have proven effective but often encounter significant challenges. These include high search space complexity, suboptimal path quality, and slow convergence, issues that are particularly problematic in high-stakes applications like disaster response, where rapid and efficient planning is critical. To address these limitations and enhance path-planning efficiency, we propose Vision Language Model RRT (VLM-RRT), a hybrid approach that integrates the pattern recognition capabilities of Vision Language Models (VLMs) with the path-planning strengths of RRT. By leveraging VLMs to provide initial directional guidance based on environmental snapshots, our method biases sampling toward regions more likely to contain feasible paths, significantly improving sampling efficiency and path quality. Extensive quantitative and qualitative experiments with various state-of-the-art VLMs demonstrate the effectiveness of this proposed approach.
中文: 提出的VLM-RRT方法将视觉语言模型与快速探索随机树相结合,通过环境快照提供方向引导来改进无人机路径规划,在复杂场景中显著提高了采样效率和路径质量。
English: The proposed VLM-RRT method integrates Vision Language Models with Rapidly-exploring Random Trees to enhance UAV path planning by using environmental snapshots for directional guidance, significantly improving sampling efficiency and path quality in complex scenarios.
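The idea of biasing RRT sampling toward a suggested direction can be sketched as follows. The heading is assumed to come from some external source (in the paper, a VLM looking at an environment snapshot); here it is just a number passed in. The cone-shaped sampling region, the rejection loop, and the names `biased_sample`, `spread_rad`, and `bias_prob` are illustrative assumptions, not the paper's procedure.

```python
import math
import random

def biased_sample(bounds, origin, heading_rad, spread_rad, bias_prob, rng):
    """Draw one 2D sample: with probability bias_prob from a cone around the heading, else uniformly."""
    (xmin, xmax), (ymin, ymax) = bounds
    if rng.random() < bias_prob:
        max_radius = math.hypot(xmax - xmin, ymax - ymin)
        for _ in range(100):  # rejection sampling keeps cone samples inside the map
            angle = heading_rad + rng.uniform(-spread_rad, spread_rad)
            radius = rng.uniform(0.0, max_radius)
            x = origin[0] + radius * math.cos(angle)
            y = origin[1] + radius * math.sin(angle)
            if xmin <= x <= xmax and ymin <= y <= ymax:
                return (x, y)
    # Uniform fallback preserves the exploration behaviour of plain RRT.
    return (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))

rng = random.Random(42)
bounds = ((0.0, 10.0), (0.0, 10.0))
guided = [biased_sample(bounds, (0.0, 0.0), math.pi / 4, 0.2, 1.0, rng) for _ in range(200)]
```

Keeping `bias_prob` below 1.0 in practice leaves a share of uniform samples, so the planner can still find a path when the suggested direction is wrong.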

Authors:Shuolin Xu, Siming Zheng, Ziyi Wang, HC Yu, Jinwei Chen, Huaqi Zhang, Bo Li, Peng-Tao Jiang
Title: HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions
Abstract:
Recent advances in diffusion models have significantly improved conditional video generation, particularly in the pose-guided human image animation task. Although existing methods are capable of generating high-fidelity and time-consistent animation sequences in regular motions and static scenes, there are still obvious limitations when facing complex human body motions (Hypermotion) that contain highly dynamic, non-standard motions, and a high-quality benchmark for evaluating complex human motion animations is still lacking. To address this challenge, we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips for evaluating and improving pose-guided human image animation models under complex human motion conditions. Furthermore, we propose a simple yet powerful DiT-based video generation baseline and design spatial low-frequency enhanced RoPE, a novel module that selectively enhances low-frequency spatial feature modeling by introducing learnable frequency scaling. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Extensive experiments demonstrate the effectiveness of our dataset and proposed approach in advancing the generation quality of complex human motion image animations. Code and dataset will be made publicly available.
中文: 针对现有扩散模型在复杂人体动作生成上的不足,我们推出了Open-HyperMotionX数据集与HyperMotionX评测基准,并提出结合空间低频增强RoPE的DiT基线方法,显著提升了高动态序列的结构稳定性和外观一致性。
English: Recent diffusion models struggle with complex human motions, so we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench along with a DiT-based baseline enhanced by spatial low-frequency RoPE to significantly improve animation quality in dynamic sequences.

Authors:Jixin Zhao, Shangchen Zhou, Zhouxia Wang, Peiqing Yang, Chen Change Loy
Title: ObjectClear: Complete Object Removal via Object-Effect Attention
Abstract:
Object removal requires eliminating not only the target object but also its effects, such as shadows and reflections. However, diffusion-based inpainting methods often produce artifacts, hallucinate content, alter background, and struggle to remove object effects accurately. To address this challenge, we introduce a new dataset for OBject-Effect Removal, named OBER, which provides paired images with and without object effects, along with precise masks for both objects and their associated visual artifacts. The dataset comprises high-quality captured and simulated data, covering diverse object categories and complex multi-object scenes. Building on OBER, we propose a novel framework, ObjectClear, which incorporates an object-effect attention mechanism to guide the model toward the foreground removal regions by learning attention masks, effectively decoupling foreground removal from background reconstruction. Furthermore, the predicted attention map enables an attention-guided fusion strategy during inference, greatly preserving background details. Extensive experiments demonstrate that ObjectClear outperforms existing methods, achieving improved object-effect removal quality and background fidelity, especially in complex scenarios.
中文摘要:本文提出OBER数据集和ObjectClear框架,通过物体效应注意力机制有效分离前景移除与背景重建,在复杂场景中显著提升了物体效应消除质量与背景保真度。
English Summary: This paper introduces the OBER dataset and ObjectClear framework to address limitations in diffusion-based inpainting methods, specifically targeting the removal of object effects like shadows and reflections while preserving background details through an innovative attention mechanism.

Authors:Jiaxi Yang, Mengqi Zhang, Yiqiao Jin, Hao Chen, Qingsong Wen, Lu Lin, Yi He, Weijie Xu, James Evans, Jindong Wang
Title: Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems
Abstract:
Large Language Model-based Multi-Agent Systems (MASs) have emerged as a powerful paradigm for tackling complex tasks through collaborative intelligence. Nevertheless, the question of how agents should be structurally organized for optimal cooperation remains largely unexplored. In this position paper, we aim to gently redirect the focus of the MAS research community toward this critical dimension: developing topology-aware MASs for specific tasks. Specifically, the system consists of three core components - agents, communication links, and communication patterns - that collectively shape its coordination performance and efficiency. To this end, we introduce a systematic, three-stage framework: agent selection, structure profiling, and topology synthesis. Each stage would trigger new research opportunities in areas such as language models, reinforcement learning, graph learning, and generative modeling; together, they could unleash the full potential of MASs in complicated real-world applications. Then, we discuss the potential challenges and opportunities in the evaluation of multiple systems. We hope our perspective and framework can offer critical new insights in the era of agentic AI.
中文摘要:本立场论文主张通过构建任务感知的多智能体系统拓扑结构,提出包含智能体选择、结构分析和拓扑合成的三阶段框架,以解决复杂任务中协作结构优化的关键问题。
English Summary: This position paper advocates for developing topology-aware multi-agent systems by proposing a three-stage framework—agent selection, structure profiling, and topology synthesis—to optimize collaboration and address structural organization challenges in complex tasks.

Authors:Tianjun Gu, Linfeng Li, Xuhong Wang, Chenghua Gong, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan
Title: DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation
Abstract:
Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM) based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from discrete observations, unstructured memory representations, and insufficient task understanding leading to navigation failures. We propose DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation), a novel cognitive-inspired framework consisting of Ventral and Dorsal Streams that mimics human navigation capabilities. The Dorsal Stream implements the Hierarchical Semantic-Spatial Fusion and Topology Map to handle spatiotemporal discontinuities, while the Ventral Stream combines RAG-VLM and Policy-VLM to improve decision-making. Our approach also develops Nav-Ensurance to ensure navigation safety and efficiency. We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL) metrics, significantly outperforming existing methods. We also introduce a new evaluation metric (AORI) to assess navigation intelligence better. Comprehensive experiments demonstrate DORAEMON's effectiveness in zero-shot autonomous navigation without requiring prior map building or pre-training.
中文: DORAEMON是一种受认知启发的双流导航框架,模拟人类导航能力,在无需预建地图或训练的情况下实现了零样本自主导航的最优性能。
English: DORAEMON is a cognitive-inspired framework that mimics human navigation through dual-stream processing, achieving state-of-the-art zero-shot navigation performance without prior maps or training.
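For readers unfamiliar with the SPL metric mentioned in the abstract, the sketch below follows its commonly used definition in embodied navigation: each successful episode contributes the ratio of the shortest-path length to the longer of the agent's path and the shortest path, and failed episodes contribute zero. The tuple layout and the function name `spl` are choices made for this example; the AORI metric introduced by the paper is not covered, since the abstract does not define it.

```python
def spl(episodes):
    """episodes: iterable of (success, shortest_path_length, agent_path_length)."""
    episodes = list(episodes)
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)  # 1.0 only for an optimal-length path
    return total / len(episodes)

score = spl([
    (True, 10.0, 10.0),   # success along the shortest path -> contributes 1.0
    (True, 10.0, 20.0),   # success with a path twice as long -> contributes 0.5
    (False, 10.0, 5.0),   # failure -> contributes 0.0
])
```

The three episodes give (1.0 + 0.5 + 0.0) / 3 = 0.5, while the plain success rate for the same episodes would be 2/3, which shows how SPL penalizes inefficient paths.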

Authors:Tianjun Gu, Linfeng Li, Xuhong Wang, Chenghua Gong, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan
Title: DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation
Abstract:
Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM) based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from discrete observations, unstructured memory representations, and insufficient task understanding leading to navigation failures. We propose DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation), a novel cognitive-inspired framework consisting of Ventral and Dorsal Streams that mimics human navigation capabilities. The Dorsal Stream implements the Hierarchical Semantic-Spatial Fusion and Topology Map to handle spatiotemporal discontinuities, while the Ventral Stream combines RAG-VLM and Policy-VLM to improve decision-making. Our approach also develops Nav-Ensurance to ensure navigation safety and efficiency. We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL) metrics, significantly outperforming existing methods. We also introduce a new evaluation metric (AORI) to assess navigation intelligence better. Comprehensive experiments demonstrate DORAEMON's effectiveness in zero-shot autonomous navigation without requiring prior map building or pre-training.
中文: DORAEMON是一种受认知启发的双流导航框架,模拟人类导航能力,在无需预建地图或训练的情况下实现了零样本自主导航的最优性能。
English: DORAEMON is a cognitive-inspired framework that mimics human navigation through dual-stream processing, achieving state-of-the-art zero-shot navigation performance without prior maps or training.
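
The abstract reports success rate (SR) and success weighted by path length (SPL). As a reading aid, the sketch below computes both using the standard SPL definition from the embodied-navigation literature; it is not code from the DORAEMON authors, the function and argument names are hypothetical, and the newly proposed AORI metric is not covered because the abstract does not define it.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    if not successes:
        raise ValueError("need at least one episode")
    return sum(successes) / len(successes)


def spl(successes: Sequence[bool],
        shortest: Sequence[float],
        taken: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length and p is the length of the path actually taken.
    Failed episodes contribute 0.
    """
    if not (len(successes) == len(shortest) == len(taken)):
        raise ValueError("all inputs must have the same length")
    if not successes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ok, l, p in zip(successes, shortest, taken):
        if l <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if ok:
            total += l / max(p, l)
    return total / len(successes)
```

An agent that succeeds by a path twice as long as the optimum earns only half credit for that episode, which is why SPL is reported alongside SR.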

Authors:Yang Yang, Siming Zheng, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang
Title: Any-to-Bokeh: One-Step Video Bokeh via Multi-Plane Image Guided Diffusion
Abstract:
Recent advances in diffusion based editing models have enabled realistic camera simulation and image-based bokeh, but video bokeh remains largely unexplored. Existing video editing models cannot explicitly control focus planes or adjust bokeh intensity, limiting their applicability for controllable optical effects. Moreover, naively extending image-based bokeh methods to video often results in temporal flickering and unsatisfactory edge blur transitions due to the lack of temporal modeling and generalization capability. To address these challenges, we propose a novel one-step video bokeh framework that converts arbitrary input videos into temporally coherent, depth-aware bokeh effects. Our method leverages a multi-plane image (MPI) representation constructed through a progressively widening depth sampling function, providing explicit geometric guidance for depth-dependent blur synthesis. By conditioning a single-step video diffusion model on MPI layers and utilizing the strong 3D priors from pre-trained models such as Stable Video Diffusion, our approach achieves realistic and consistent bokeh effects across diverse scenes. Additionally, we introduce a progressive training strategy to enhance temporal consistency, depth robustness, and detail preservation. Extensive experiments demonstrate that our method produces high-quality, controllable bokeh effects and achieves state-of-the-art performance on multiple evaluation benchmarks.
中文摘要:本文提出了一种单步扩散框架,通过采用多平面图像表示和渐进式训练策略,实现了时间连贯且深度感知的视频背景虚化渲染,在合成与真实场景基准测试中均优于现有方法。
English Summary: This paper introduces a one-step diffusion framework that achieves temporally coherent and depth-aware video bokeh rendering by leveraging multi-plane image representations and progressive training, outperforming existing methods in both synthetic and real-world benchmarks.

Authors:Yang Yang, Siming Zheng, Qirui Yang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang
Title: Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model
Abstract:
Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion for video bokeh remains unexplored. Existing image-based methods are plagued by temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability for controllable video bokeh. In this work, we propose a one-step diffusion framework for generating temporally coherent, depth-aware video bokeh rendering. The framework employs a multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, thereby enabling it to exploit strong 3D priors from pretrained backbones. To further enhance temporal stability, depth robustness, and detail preservation, we introduce a progressive training strategy. Experiments on synthetic and real-world benchmarks demonstrate superior temporal coherence, spatial accuracy, and controllability, outperforming prior baselines. This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects. Code will be made publicly available.
中文摘要:本文提出了一种单步扩散框架,通过采用多平面图像表示和渐进式训练策略,实现了时间连贯且深度感知的视频背景虚化渲染,在合成与真实场景基准测试中均优于现有方法。
English Summary: This paper introduces a one-step diffusion framework that achieves temporally coherent and depth-aware video bokeh rendering by leveraging multi-plane image representations and progressive training, outperforming existing methods in both synthetic and real-world benchmarks.

Authors:Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, Peng-Tao Jiang
Title: MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on
Abstract:
Video Virtual Try-On (VVT) aims to simulate the natural appearance of garments across consecutive video frames, capturing their dynamic variations and interactions with human body motion. However, current VVT methods still face challenges in terms of spatiotemporal consistency and garment content preservation. First, they use diffusion models based on the U-Net, which are limited in their expressive capability and struggle to reconstruct complex details. Second, they adopt a separative modeling approach for spatial and temporal attention, which hinders the effective capture of structural relationships and dynamic consistency across frames. Third, their expression of garment details remains insufficient, affecting the realism and stability of the overall synthesized results, especially during human motion. To address the above challenges, we propose MagicTryOn, a video virtual try-on framework built upon the large-scale video diffusion Transformer. We replace the U-Net architecture with a diffusion Transformer and combine full self-attention to jointly model the spatiotemporal consistency of videos. We design a coarse-to-fine garment preservation strategy. The coarse strategy integrates garment tokens during the embedding stage, while the fine strategy incorporates multiple garment-based conditions, such as semantics, textures, and contour lines during the denoising stage. Moreover, we introduce a mask-aware loss to further optimize garment region fidelity. Extensive experiments on both image and video try-on datasets demonstrate that our method outperforms existing SOTA methods in comprehensive evaluations and generalizes to in-the-wild scenarios.
中文摘要:MagicTryOn是一种基于扩散变换器的框架,通过分解服装先验保留细节,并利用时空位置编码增强时序一致性,在保持实时性能的同时显著提升了视频虚拟试穿的服装保真度和稳定性。
English Summary: MagicTryOn is a diffusion-transformer framework that enhances video virtual try-on by preserving garment details through decomposed priors and ensuring temporal consistency with spatiotemporal positional embeddings, achieving real-time performance without compromising quality.

Authors:Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, Peng-Tao Jiang
Title: MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on
Abstract:
Video Virtual Try-On (VVT) aims to synthesize garments that appear natural across consecutive video frames, capturing both their dynamics and interactions with human motion. Despite recent progress, existing VVT methods still suffer from inadequate garment fidelity and limited spatiotemporal consistency. The reasons are: (1) under-exploitation of garment information, with limited garment cues being injected, resulting in weaker fine-detail fidelity; and (2) a lack of spatiotemporal modeling, which hampers cross-frame identity consistency and causes temporal jitter and appearance drift. In this paper, we present MagicTryOn, a diffusion-transformer based framework for garment-preserving video virtual try-on. To preserve fine-grained garment details, we propose a fine-grained garment-preservation strategy that disentangles garment cues and injects these decomposed priors into the denoising process. To improve temporal garment consistency and suppress jitter, we introduce a garment-aware spatiotemporal rotary positional embedding (RoPE) that extends RoPE within full self-attention, using spatiotemporal relative positions to modulate garment tokens. We further impose a mask-aware loss during training to enhance fidelity within garment regions. Moreover, we adopt distribution-matching distillation to compress the sampling trajectory to four steps, enabling real-time inference without degrading garment fidelity. Extensive quantitative and qualitative experiments demonstrate that MagicTryOn outperforms existing methods, delivering superior garment-detail fidelity and temporal stability in unconstrained settings.
中文摘要:MagicTryOn是一种基于扩散变换器的框架,通过分解服装先验保留细节,并利用时空位置编码增强时序一致性,在保持实时性能的同时显著提升了视频虚拟试穿的服装保真度和稳定性。
English Summary: MagicTryOn is a diffusion-transformer framework that enhances video virtual try-on by preserving garment details through decomposed priors and ensuring temporal consistency with spatiotemporal positional embeddings, achieving real-time performance without compromising quality.

Authors:Liuhan Chen, Xiaodong Cun, Xiaoyu Li, Xianyi He, Shenghai Yuan, Jie Chen, Ying Shan, Li Yuan
Title: EF-VI: Enhancing End-Frame Injection for Video Inbetweening
Abstract:
Video inbetweening aims to synthesize intermediate video sequences conditioned on the given start and end frames. Current state-of-the-art methods primarily extend large-scale pre-trained Image-to-Video Diffusion Models (I2V-DMs) by incorporating the end-frame condition via direct fine-tuning or temporally bidirectional sampling. However, the former results in a weak end-frame constraint, while the latter inevitably disrupts the input representation of video frames, leading to suboptimal performance. To improve the end-frame constraint while avoiding disruption of the input representation, we propose a novel video inbetweening framework specific to recent and more powerful transformer-based I2V-DMs, termed EF-VI. It efficiently strengthens the end-frame constraint by utilizing an enhanced injection. This is based on our proposed well-designed lightweight module, termed EF-Net, which encodes only the end frame and expands it into temporally adaptive frame-wise features injected into the I2V-DM. Extensive experiments demonstrate the superiority of our EF-VI compared with other baselines.
中文:提出的EF-VI框架通过引入轻量级模块EF-Net,将结束帧扩展为时序自适应特征注入基于变换器的扩散模型,从而强化结束帧约束以提升视频插帧性能,实验证明其优于现有基线方法。
English: The proposed EF-VI framework enhances video inbetweening by introducing EF-Net, a lightweight module that strengthens end-frame constraints through adaptive feature injection into transformer-based diffusion models, outperforming existing methods.

Authors:Valentin Carl, Trever Schirmer, Niklas Kowallik, Joshua Adamek, Tobias Pfandzelter, Sergio Lucia, David Bermbach
Title: Multi-Event Triggers for Serverless Computing
Abstract:
Function-as-a-Service (FaaS) is an event-driven serverless cloud computing model in which small, stateless functions are invoked in response to events, such as HTTP requests, new database entries, or messages. Current FaaS platforms assume that each function invocation corresponds to a single event. However, from an application perspective, it is desirable to invoke functions in response to a collection of events of different types or only with every $n$-th event. To implement this today, a function would need additional state management, e.g., in a database, and custom logic to determine whether its trigger condition is fulfilled and the actual application code should run. In such an implementation, most function invocations would be rendered essentially useless, leading to unnecessarily high resource usage, latency, and cost for applications. In this paper, we introduce multi-event triggers, through which complex conditions for function invocations can be specified. Specifically, we introduce abstractions for invoking functions based on a set of $n$ events and joins of multiple events of different types. This enables application developers to define intricate conditions for function invocations, workflow steps, and complex event processing. Our evaluation with a proof-of-concept prototype shows that this reduces event-invocation latency by 62.5% in an incident detection use-case and that our system can handle more than 300,000 requests per second on limited hardware, which is sufficient load for implementation in large FaaS platforms.
中文: 本文提出面向函数即服务(FaaS)平台的多事件触发器,通过设定复杂调用条件将事件调用延迟降低62.5%,在有限硬件上实现每秒30万次请求处理,同时避免冗余函数执行。
English: This paper introduces multi-event triggers for Function-as-a-Service (FaaS) platforms, enabling complex invocation conditions that reduce latency by 62.5% and handle over 300,000 requests per second while eliminating unnecessary function executions.
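
The two abstractions named in the abstract (invoking on a set of $n$ events, and joining events of different types) can be illustrated with a small in-process sketch. This is only an illustration of the idea, not the authors' prototype or its API: the class names `CountTrigger` and `JoinTrigger`, the `type` field on events, and the first-in-first-out matching policy are all assumptions.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List, Sequence

Event = Dict[str, Any]
Handler = Callable[[List[Event]], None]


class CountTrigger:
    """Invoke the handler once for every batch of n events."""

    def __init__(self, n: int, handler: Handler) -> None:
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.handler = handler
        self._buffer: List[Event] = []

    def on_event(self, event: Event) -> None:
        self._buffer.append(event)
        if len(self._buffer) == self.n:
            # Hand over the full batch and start a fresh buffer.
            batch, self._buffer = self._buffer, []
            self.handler(batch)


class JoinTrigger:
    """Invoke the handler once one event of every required type is queued."""

    def __init__(self, types: Sequence[str], handler: Handler) -> None:
        self.types = list(types)
        self.handler = handler
        self._queues: Dict[str, Deque[Event]] = {t: deque() for t in self.types}

    def on_event(self, event: Event) -> None:
        queue = self._queues.get(event["type"])
        if queue is None:
            return  # event type is not part of this join
        queue.append(event)
        if all(self._queues[t] for t in self.types):
            # Consume the oldest event of each type (FIFO matching).
            self.handler([self._queues[t].popleft() for t in self.types])
```

In both cases the function body runs only when the condition holds, so the application code no longer needs its own state store to decide whether an invocation is useful.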

Authors:Yu Yan, Sheng Sun, Zhifei Zheng, Ziji Hao, Teli Liu, Min Liu
Title: PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing
Abstract:
To construct responsible and secure AI applications, harmful information data is widely utilized for adversarial testing and the development of safeguards. Existing studies mainly leverage Large Language Models (LLMs) to synthesize data to obtain high-quality task datasets at scale, thereby avoiding costly human annotation. However, limited by the safety alignment mechanisms of LLMs, the synthesis of harmful data still faces challenges in generation reliability and content diversity. In this study, we propose a novel harmful information synthesis framework, PoisonSwarm, which applies the model crowdsourcing strategy to generate diverse harmful data while maintaining a high success rate. Specifically, we generate abundant benign data as the base templates in a counterfactual manner. Subsequently, we decompose each base template into multiple semantic units and perform unit-by-unit toxification and final refinement through dynamic model switching, thus ensuring the success of synthesis. Experimental results demonstrate that PoisonSwarm achieves state-of-the-art performance in synthesizing different categories of harmful data with high scalability and diversity.
中文摘要:PoisonSwarm是一种创新框架,通过采用模型众包策略和动态逐单元毒化技术,在保持高成功率的同时有效生成多样化的有害信息数据。
English Summary: PoisonSwarm is a novel framework that enhances harmful information synthesis by employing model crowdsourcing and dynamic unit-by-unit toxification to achieve high diversity and success rates in generating adversarial data.

Authors:Xin Zhou, Kisub Kim, Ting Zhang, Martin Weyssow, Luis F. Gomes, Guang Yang, David Lo
Title: An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks
Abstract:
Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, other existing automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SWE-Judge, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SWE-Judge first defines five distinct evaluation strategies, each implemented as an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges to produce a final correctness score through ensembling. We evaluate SWE-Judge across a diverse set of software engineering (SE) benchmarks, including CoNaLa, Card2Code, HumanEval-X, APPS, APR-Assess, and Summary-Assess. These benchmarks span three SE tasks: code generation, automated program repair, and code summarization. Experimental results demonstrate that SWE-Judge consistently achieves a higher correlation with human judgments, with improvements ranging from 5.9% to 183.8% over existing automatic metrics. Furthermore, SWE-Judge reaches agreement levels with human annotators that are comparable to inter-annotator agreement in code generation and program repair tasks. These findings underscore SWE-Judge's potential as a scalable and reliable alternative to human evaluation.
中文: SWE-Judge是一种新颖的集成评估指标,能准确评估大语言模型生成的软件工件的正确性,与人工评估具有显著更高的相关性,为人工评估提供了可扩展的替代方案。
English: SWE-Judge is a novel ensemble evaluation metric that accurately assesses the correctness of software artifacts generated by LLMs, demonstrating significantly higher correlation with human judgments and offering a scalable alternative to manual evaluation.
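
The abstract describes independent judges whose scores are ensembled after a team-selection step. The sketch below shows one simple way such a selection could work, under loudly stated assumptions: each judge has already scored a small human-annotated set, teams are compared by thresholded agreement with the human labels, and the search is exhaustive. None of these choices is taken from the paper, and the function name `select_team` is hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Tuple


def select_team(judge_scores: Dict[str, List[float]],
                human: List[float],
                threshold: float = 0.5) -> Tuple[Tuple[str, ...], float]:
    """Return the judge subset whose averaged score best matches human labels.

    judge_scores maps a judge name to its scores on the annotated samples;
    human holds the human correctness labels for the same samples.
    Agreement is the fraction of samples where the thresholded ensemble
    score matches the thresholded human label (an assumed criterion).
    """
    names = sorted(judge_scores)
    best_team: Tuple[str, ...] = ()
    best_agreement = -1.0
    for size in range(1, len(names) + 1):
        for team in combinations(names, size):
            matches = 0
            for i, label in enumerate(human):
                ensemble = mean(judge_scores[j][i] for j in team)
                matches += (ensemble >= threshold) == (label >= threshold)
            agreement = matches / len(human)
            # Strict comparison keeps the smallest, earliest team on ties.
            if agreement > best_agreement:
                best_team, best_agreement = team, agreement
    return best_team, best_agreement
```

With five judges there are only 31 non-empty subsets, so exhaustive search is cheap; a larger judge pool would need a greedy or heuristic search instead.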

Authors:Di Yu, Changze Lv, Xin Du, Linshan Jiang, Wentao Tong, Zhenyu Liao, Xiaoqing Zheng, Shuiguang Deng
Title: ECC-SNN: Cost-Effective Edge-Cloud Collaboration for Spiking Neural Networks
Abstract:
Most edge-cloud collaboration frameworks rely on the substantial computational and storage capabilities of cloud-based artificial neural networks (ANNs). However, this reliance results in significant communication overhead between edge devices and the cloud and high computational energy consumption, especially when applied to resource-constrained edge devices. To address these challenges, we propose ECC-SNN, a novel edge-cloud collaboration framework incorporating energy-efficient spiking neural networks (SNNs) to offload more computational workload from the cloud to the edge, thereby improving cost-effectiveness and reducing reliance on the cloud. ECC-SNN employs a joint training approach that integrates ANN and SNN models, enabling edge devices to leverage knowledge from cloud models for enhanced performance while reducing energy consumption and processing latency. Furthermore, ECC-SNN features an on-device incremental learning algorithm that enables edge models to continuously adapt to dynamic environments, reducing the communication overhead and resource consumption associated with frequent cloud update requests. Extensive experimental results on four datasets demonstrate that ECC-SNN improves accuracy by 4.15%, reduces average energy consumption by 79.4%, and lowers average processing latency by 39.1%.
Chinese: 提出的ECC-SNN框架采用节能脉冲神经网络将计算任务从云端转移至边缘设备,在提升准确率4.15%的同时,显著降低能耗79.4%与处理延迟39.1%。
English: The proposed ECC-SNN framework utilizes energy-efficient spiking neural networks to shift computational tasks from the cloud to edge devices, enhancing accuracy by 4.15% while cutting energy use by 79.4% and latency by 39.1%.

Authors:Lujian Yao, Siming Zheng, Xinbin Yuan, Zhuoxuan Cai, Pu Wu, Jinwei Chen, Bo Li, Peng-Tao Jiang
Title: Photography Perspective Composition: Towards Aesthetic Perspective Recommendation
Abstract:
Traditional photography composition approaches are dominated by 2D cropping-based methods. However, these methods fall short when scenes contain poorly arranged subjects. Professional photographers often employ perspective adjustment as a form of 3D recomposition, modifying the projected 2D relationships between subjects while maintaining their actual spatial positions to achieve better compositional balance. Inspired by this artistic practice, we propose photography perspective composition (PPC), extending beyond traditional cropping-based methods. However, implementing PPC faces significant challenges: the scarcity of perspective transformation datasets and undefined assessment criteria for perspective quality. To address these challenges, we present three key contributions: (1) An automated framework for building PPC datasets from expert photographs. (2) A video generation approach that demonstrates the transformation process from suboptimal to optimal perspectives. (3) A perspective quality assessment (PQA) model constructed based on human performance. Our approach is concise and requires no additional prompt instructions or camera trajectories, helping and guiding ordinary users to enhance their composition skills.
中文摘要:本文提出摄影视角构图(PPC)方法,突破传统二维裁剪局限,通过调整主体投影关系保持空间位置实现三维重构,并建立了自动化数据集构建、视角转换演示和基于人类评估的质量评价体系。
English Summary: This paper introduces Photography Perspective Composition (PPC), a novel 3D recomposition technique that overcomes limitations of traditional 2D cropping by adjusting subject projections while preserving spatial relationships, supported by automated dataset creation, transformation visualization, and human-based quality assessment.

Authors:Lujian Yao, Siming Zheng, Xinbin Yuan, Zhuoxuan Cai, Pu Wu, Jinwei Chen, Bo Li, Peng-Tao Jiang
Title: Photography Perspective Composition: Towards Aesthetic Perspective Recommendation
Abstract:
Traditional photography composition approaches are dominated by 2D cropping-based methods. However, these methods fall short when scenes contain poorly arranged subjects. Professional photographers often employ perspective adjustment as a form of 3D recomposition, modifying the projected 2D relationships between subjects while maintaining their actual spatial positions to achieve better compositional balance. Inspired by this artistic practice, we propose photography perspective composition (PPC), extending beyond traditional cropping-based methods. However, implementing PPC faces significant challenges: the scarcity of perspective transformation datasets and undefined assessment criteria for perspective quality. To address these challenges, we present three key contributions: (1) An automated framework for building PPC datasets from expert photographs. (2) A video generation approach that demonstrates the transformation process from less favorable to aesthetically enhanced perspectives. (3) A perspective quality assessment (PQA) model constructed based on human performance. Our approach is concise and requires no additional prompt instructions or camera trajectories, helping and guiding ordinary users to enhance their composition skills.
中文摘要:本文提出摄影视角构图(PPC)方法,突破传统二维裁剪局限,通过调整主体投影关系保持空间位置实现三维重构,并建立了自动化数据集构建、视角转换演示和基于人类评估的质量评价体系。
English Summary: This paper introduces Photography Perspective Composition (PPC), a novel 3D recomposition technique that overcomes limitations of traditional 2D cropping by adjusting subject projections while preserving spatial relationships, supported by automated dataset creation, transformation visualization, and human-based quality assessment.

Authors:Yunlong Tang, Pinxin Liu, Mingqian Feng, Zhangyun Tan, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu
Title: MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
Abstract:
Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/
中文: 本文提出首个系统评估多模态大语言模型视角几何理解能力的基准MMPerspective,通过三个维度的10项任务揭示了模型虽具备表层感知能力,但在组合推理和空间一致性方面存在显著局限。
English: This paper introduces MMPerspective, the first benchmark to systematically evaluate multimodal large language models' understanding of perspective geometry through 10 tasks across three dimensions, revealing significant limitations in compositional reasoning and spatial consistency despite surface-level perceptual competence.

Authors:Yolo Yunlong Tang, Pinxin Liu, Mingqian Feng, Zhangyun Tan, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu
Title: MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
Abstract:
Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/
中文: 本文提出首个系统评估多模态大语言模型视角几何理解能力的基准MMPerspective,通过三个维度的10项任务揭示了模型虽具备表层感知能力,但在组合推理和空间一致性方面存在显著局限。
English: This paper introduces MMPerspective, the first benchmark to systematically evaluate multimodal large language models' understanding of perspective geometry through 10 tasks across three dimensions, revealing significant limitations in compositional reasoning and spatial consistency despite surface-level perceptual competence.

Authors:Guang Yang, Yu Zhou, Xiang Chen, Wei Zheng, Xing Hu, Xin Zhou, David Lo, Taolue Chen
Title: CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation
Abstract:
Trustworthy evaluation methods for code snippets play a crucial role in neural code generation. Traditional methods, which either rely on reference solutions or require executable test cases, have inherent limitations in flexibility and scalability. The recent LLM-as-Judge methodology offers a promising alternative by directly evaluating functional consistency between the problem description and the generated code. To systematically understand the landscape of these LLM-as-Judge methods, we conduct a comprehensive empirical study across three diverse datasets. Our investigation reveals the pros and cons of two categories of LLM-as-Judge methods: the methods based on general foundation models can achieve good performance but require complex prompts and lack explainability, while the methods based on reasoning foundation models provide better explainability with simpler prompts but demand substantial computational resources due to their large parameter sizes. To address these limitations, we propose CODE-DITING, a novel code evaluation method that balances accuracy, efficiency and explainability. We develop a data distillation framework that effectively transfers reasoning capabilities from DeepSeek-R1 671B to our CODE-DITING 1.5B and 7B models, significantly enhancing evaluation explainability and reducing the computational cost. With the majority vote strategy in the inference process, CODE-DITING 1.5B outperforms all models of the same parameter magnitude and achieves performance normally seen in models with five times as many parameters. CODE-DITING 7B surpasses GPT-4o and DeepSeek-V3 671B, even though it uses only 1% of the parameters of these large models. Further experiments show that CODE-DITING is robust to preference leakage and can serve as a promising alternative for code evaluation.
中文: 提出的CODE-DITING方法通过将推理能力蒸馏到小型模型中,在代码评估中有效平衡了准确性、效率和可解释性,以极少的计算资源超越了大型模型的性能。
English: The proposed CODE-DITING method effectively balances accuracy, efficiency, and explainability in code evaluation by distilling reasoning capabilities into smaller models, outperforming larger counterparts while using minimal computational resources.
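
The abstract mentions a majority vote strategy at inference time. A minimal sketch of majority voting over several sampled judge verdicts follows; the verdict labels, the function name, and the tie-breaking rule (earliest sampled verdict among the tied ones) are assumptions, not details given in the abstract.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(verdicts: Sequence[Hashable]) -> Hashable:
    """Return the most frequent verdict among several sampled judgments.

    Ties are broken in favor of the verdict that appears first in the
    sampling order (an assumed, deterministic rule).
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    top = max(counts.values())
    for verdict in verdicts:
        if counts[verdict] == top:
            return verdict
    raise AssertionError("unreachable: some verdict always has the top count")
```

Sampling an odd number of verdicts for a binary decision avoids ties altogether, which is a common practical choice.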

Authors:Zhengkang Guo, Wenhao Liu, Mingchen Xie, Jingwen Xu, Zisu Huang, Muzhao Tian, Jianhan Xu, Muling Wu, Xiaohua Wang, Changze Lv, He-Da Wang, Hu Yao, Xiaoqing Zheng, Xuanjing Huang
Title: RECAST: Strengthening LLMs' Complex Instruction Following with Constraint-Verifiable Data
Abstract:
Large language models (LLMs) are increasingly expected to tackle complex tasks, driven by their expanding applications and users' growing proficiency in crafting sophisticated prompts. However, as the number of explicitly stated requirements increases (particularly more than 10 constraints), LLMs often struggle to accurately follow such complex instructions. To address this challenge, we propose RECAST, a novel framework for synthesizing datasets where each example incorporates far more constraints than those in existing benchmarks. These constraints are extracted from real-world prompt-response pairs to ensure practical relevance. RECAST enables automatic verification of constraint satisfaction via rule-based validators for quantitative constraints and LLM-based validators for qualitative ones. Using this framework, we construct RECAST-30K, a large-scale, high-quality dataset comprising 30k instances spanning 15 constraint types. Experimental results demonstrate that models fine-tuned on RECAST-30K show substantial improvements in following complex instructions. Moreover, the verifiability provided by RECAST enables the design of reward functions for reinforcement learning, which further boosts model performance on complex and challenging tasks.
中文: RECAST框架高效构建包含大量约束的大规模数据集,旨在提升大语言模型遵循复杂指令的能力,实验证明其能显著增强模型性能且不影响通用性。
English: The RECAST framework efficiently creates large-scale datasets with numerous constraints to enhance large language models' ability to follow complex instructions, significantly improving their performance without compromising general capabilities.

Authors:Zhengkang Guo, Wenhao Liu, Mingchen Xie, Jingwen Xu, Zisu Huang, Muzhao Tian, Jianhan Xu, Yuanzhe Shen, Qi Qian, Muling Wu, Xiaohua Wang, Changze Lv, He-Da Wang, Hu Yao, Xiaoqing Zheng, Xuanjing Huang
Title: RECAST: Expanding the Boundaries of LLMs' Complex Instruction Following with Multi-Constraint Data
Abstract:
Large language models (LLMs) are increasingly expected to tackle complex tasks, driven by their expanding applications and users' growing proficiency in crafting sophisticated prompts. However, as the number of explicitly stated requirements increases (particularly more than 10 constraints), LLMs often struggle to accurately follow such complex instructions, which limits their applicability in complex real-world scenarios. To the best of our knowledge, existing datasets do not exceed 10 constraints per instance. To address this challenge, we propose RECAST, an efficient and scalable framework for synthesizing datasets where each example incorporates far more constraints than those in existing benchmarks, aiming to challenge and extend the boundaries of models' ability to follow complex instructions. These constraints are extracted from real-world prompt-response pairs to ensure practical relevance. Using this framework, we construct RECAST-30K, a large-scale, high-quality dataset comprising 30k instances spanning 19 constraint types. Experimental results demonstrate that models finetuned on RECAST-30K substantially improve in following complex instructions while maintaining their general capabilities without degradation. Moreover, RECAST enables automatic verification of constraint satisfaction via rule-based validators for quantitative constraints and LLM-based validators for qualitative ones; the verifiability provided by RECAST enables the design of reward functions for reinforcement learning, which further boosts model performance on complex and challenging tasks.
中文: RECAST框架高效构建包含大量约束的大规模数据集,旨在提升大语言模型遵循复杂指令的能力,实验证明其能显著增强模型性能且不影响通用性。
English: The RECAST framework efficiently creates large-scale datasets with numerous constraints to enhance large language models' ability to follow complex instructions, significantly improving their performance without compromising general capabilities.
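
The rule-based validation and reward design mentioned in the abstract can be illustrated with a minimal sketch. The validator names (`max_words`, `must_include`, `bullet_count`) and the reward definition (fraction of satisfied constraints) are hypothetical choices made for illustration, not taken from RECAST; the LLM-based validators for qualitative constraints are left out.

```python
import re

# Hypothetical rule-based validators for quantitative constraints.
# Each factory returns a function mapping a response string to True/False.

def max_words(limit):
    """Constraint: the response has at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit

def must_include(keyword):
    """Constraint: the response mentions `keyword` (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()

def bullet_count(n):
    """Constraint: the response has exactly `n` lines starting with '- ' or '* '."""
    return lambda text: len(re.findall(r"^\s*[-*] ", text, flags=re.MULTILINE)) == n

def constraint_reward(text, validators):
    """Assumed reward: the fraction of constraints the response satisfies."""
    results = [bool(check(text)) for check in validators]
    return sum(results) / len(results) if results else 0.0

response = "- fast\n- cheap\nBoth options use Python."
validators = [max_words(10), must_include("python"), bullet_count(3)]
reward = constraint_reward(response, validators)  # two of three constraints hold
```

A scalar of this kind could serve as a reinforcement learning reward; how RECAST actually weights or combines constraints is not stated in the abstract.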

Authors:Zhenglin Huang, Tianxiao Li, Xiangtai Li, Haiquan Wen, Yiwei He, Jiangning Zhang, Hao Fei, Xi Yang, Xiaowei Huang, Bei Peng, Guangliang Cheng
Title: So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection
Abstract:
Recent advances in AI-powered generative models have enabled the creation of increasingly realistic synthetic images, posing significant risks to information integrity and public trust on social media platforms. While robust detection frameworks and diverse, large-scale datasets are essential to mitigate these risks, existing academic efforts remain limited in scope: current datasets lack the diversity, scale, and realism required for social media contexts, while detection methods struggle with generalization to unseen generative technologies. To bridge this gap, we introduce So-Fake-Set, a comprehensive social media-oriented dataset with over 2 million high-quality images, diverse generative sources, and photorealistic imagery synthesized using 35 state-of-the-art generative models. To rigorously evaluate cross-domain robustness, we establish a novel and large-scale (100K) out-of-domain benchmark (So-Fake-OOD) featuring synthetic imagery from commercial models explicitly excluded from the training distribution, creating a realistic testbed for evaluating real-world performance. Leveraging these resources, we present So-Fake-R1, an advanced vision-language framework that employs reinforcement learning for highly accurate forgery detection, precise localization, and explainable inference through interpretable visual rationales. Extensive experiments show that So-Fake-R1 outperforms the second-best method, with a 1.3% gain in detection accuracy and a 4.5% increase in localization IoU. By integrating a scalable dataset, a challenging OOD benchmark, and an advanced detection framework, this work establishes a new foundation for social media-centric forgery detection research. The code, models, and datasets will be released publicly.
中文: 本研究推出了大规模合成图像数据集So-Fake-Set和视觉语言检测框架So-Fake-R1,在社交媒体AI伪造内容的识别与定位方面展现出卓越性能。
English: This research introduces So-Fake-Set, a large-scale dataset of synthetic images, and So-Fake-R1, a vision-language detection framework that demonstrates superior performance in identifying and localizing AI-generated forgeries on social media.

Authors:Xiao Chen, Sihang Zhou, Ke Liang, Xiaoyu Sun, Xinwang Liu
Title: Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster
Abstract:
Chain-of-thought (CoT) distillation allows a large language model (LLM) to guide a small language model (SLM) in reasoning tasks. Existing methods train the SLM to learn the long rationale in one iteration, resulting in two issues: 1) Long rationales lead to a large token-level batch size during training, making gradients of core reasoning tokens (i.e., tokens that directly affect the correctness of subsequent reasoning) over-smoothed as they contribute a tiny fraction of the rationale. As a result, the SLM converges to sharp minima where it fails to grasp the reasoning logic. 2) The response is slow, as the SLM must generate a long rationale before reaching the answer. Therefore, we propose chunk-wise training (CWT), which uses a heuristic search to divide the rationale into internally semantically coherent chunks and focuses the SLM on learning from only one chunk per iteration. In this way, CWT naturally isolates non-reasoning chunks that do not involve the core reasoning token (e.g., summary and transitional chunks) from the SLM learning for reasoning chunks, increasing the fraction of core reasoning tokens in the corresponding iteration. Based on CWT, skip-thinking training (STT) is proposed. STT makes the SLM automatically skip non-reasoning medium chunks to reach the answer, improving reasoning speed while maintaining accuracy. We validate our approach on a variety of SLMs and multiple reasoning tasks.
中文: 现有思维链蒸馏方法因一次性训练小型语言模型学习长推理链导致梯度平滑和响应缓慢,而提出的分块训练将推理分解为语义连贯的块进行专注学习,跳跃思维训练则通过跳过非推理块在保证准确性的同时提升推理速度。
English: Existing chain-of-thought distillation methods cause gradient smoothing and slow responses by training small language models on long rationales at once, but the proposed chunk-wise training divides reasoning into focused chunks while skip-thinking training accelerates answers by skipping non-essential parts.

Authors:Kerem Oktar, Katherine M. Collins, Jose Hernandez-Orallo, Diane Coyle, Stephen Cave, Adrian Weller, Ilia Sucholutsky
Title: Identifying, Evaluating, and Mitigating Risks of AI Thought Partnerships
Abstract:
Artificial Intelligence (AI) systems have historically been used as tools that execute narrowly defined tasks. Yet recent advances in AI have unlocked possibilities for a new class of models that genuinely collaborate with humans in complex reasoning, from conceptualizing problems to brainstorming solutions. Such AI thought partners enable novel forms of collaboration and extended cognition, yet they also pose major risks, including and going beyond those of typical AI tools and agents. In this commentary, we systematically identify risks of AI thought partners through a novel framework that identifies risks at multiple levels of analysis, including Real-time, Individual, and Societal risks arising from collaborative cognition (RISc). We leverage this framework to propose concrete metrics for risk evaluation, and finally suggest specific mitigation strategies for developers and policymakers. As AI thought partners continue to proliferate, these strategies can help prevent major harms and ensure that humans actively benefit from productive thought partnerships.
Chinese: 人工智能的最新进展催生了能与人类协作思考的伙伴,虽提升认知能力却带来多层次风险,需通过系统化评估和缓解策略确保人机协作的安全与效益。
English: Recent AI advances enable collaborative thought partners that enhance human reasoning but introduce multi-level risks, requiring systematic evaluation and mitigation strategies to ensure safe and beneficial human-AI collaboration.

Authors:Junchi Yao, Jianhua Xu, Tianyu Xin, Ziyi Wang, Shenzhe Zhu, Shu Yang, Di Wang
Title: Is Your LLM-Based Multi-Agent a Reliable Real-World Planner? Exploring Fraud Detection in Travel Planning
Abstract:
The rise of Large Language Model-based Multi-Agent Planning has leveraged advanced frameworks to enable autonomous and collaborative task execution. Some systems rely on platforms like review sites and social media, which are prone to fraudulent information, such as fake reviews or misleading descriptions. This reliance poses risks, potentially causing financial losses and harming user experiences. To evaluate the risk of planning systems in real-world applications, we introduce WandaPlan, an evaluation environment mirroring real-world data and injected with deceptive content. We assess system performance across three fraud cases: Misinformation Fraud, Team-Coordinated Multi-Person Fraud, and Level-Escalating Multi-Round Fraud. We reveal significant weaknesses in existing frameworks that prioritize task efficiency over data authenticity. At the same time, we validate WandaPlan's generalizability, capable of assessing the risks of real-world open-source planning frameworks. To mitigate the risk of fraud, we propose integrating an anti-fraud agent, providing a solution for reliable planning.
Chinese Summary: 本研究提出了WandaPlan评估环境,揭示了大语言模型多智能体规划系统在面对虚假信息时的脆弱性,并通过引入反欺诈代理来提升系统可靠性。
English Summary: The study introduces WandaPlan, an evaluation environment that exposes vulnerabilities in Large Language Model-based Multi-Agent Planning systems to deceptive content, and proposes integrating an anti-fraud agent to enhance reliability.

Authors:Zhiyuan Wu, Sheng Sun, Yuwei Wang, Min Liu, Bo Gao, Jinda Lu, Zheming Yang, Tian Wen
Title: Recursive Offloading for LLM Serving in Multi-tier Networks
Abstract:
Heterogeneous device-edge-cloud computing infrastructures have become widely adopted in telecommunication operators and Wide Area Networks (WANs), offering multi-tier computational support for emerging intelligent services. With the rapid proliferation of Large Language Model (LLM) services, efficiently coordinating inference tasks and reducing communication overhead within these multi-tier network architectures becomes a critical deployment challenge. Existing LLM serving paradigms exhibit significant limitations: on-device deployment supports only lightweight LLMs due to hardware constraints, while cloud-centric deployment suffers from resource congestion and considerable prompt communication overhead caused by frequent service requests during peak periods. Although the model-cascading-based inference strategy adapts better to multi-tier networks, its reliance on fine-grained, manually adjusted thresholds makes it less responsive to dynamic network conditions and varying task complexities. To address these challenges, we propose RecServe, a recursive offloading framework tailored for LLM serving in multi-tier networks. RecServe integrates a task-specific hierarchical confidence evaluation mechanism that guides offloading decisions based on inferred task complexity in progressively scaled LLMs across device, edge, and cloud tiers. To further enable intelligent task routing across tiers, RecServe employs a sliding-window-based dynamic offloading strategy with quantile interpolation, enabling real-time tracking of historical confidence distributions and adaptive offloading threshold adjustments. Experiments on eight datasets demonstrate that RecServe outperforms CasServe in both service quality and communication efficiency, and reduces the communication burden by over 50% compared to centralized cloud-based serving.
中文摘要:RecServe是一种面向多级网络的递归卸载框架,通过分层置信度评估与动态阈值调整优化大语言模型服务,在保证服务质量的同时显著降低了通信负担。
English Summary: RecServe is a recursive offloading framework that optimizes LLM serving in multi-tier networks through hierarchical confidence evaluation and dynamic threshold adjustments, significantly reducing communication overhead while maintaining service quality.
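
The sliding-window strategy with quantile interpolation described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not RecServe's implementation: the class name `SlidingQuantileThreshold`, the window size, the quantile `q`, the default threshold, and the rule "offload when confidence falls below the threshold" are all hypothetical.

```python
from collections import deque

class SlidingQuantileThreshold:
    """Tracks recent confidence scores; the q-quantile of the window is the threshold."""

    def __init__(self, window_size=100, q=0.3, default=0.5):
        self.window = deque(maxlen=window_size)  # oldest scores drop out automatically
        self.q = q
        self.default = default  # used until the first score arrives

    def update(self, confidence):
        self.window.append(confidence)

    def threshold(self):
        if not self.window:
            return self.default
        xs = sorted(self.window)
        pos = self.q * (len(xs) - 1)   # fractional rank of the quantile
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        frac = pos - lo
        # Linear interpolation between the two neighbouring order statistics.
        return xs[lo] + frac * (xs[hi] - xs[lo])

    def should_offload(self, confidence):
        """Assumed rule: send the task to the next tier if confidence is below threshold."""
        return confidence < self.threshold()

tracker = SlidingQuantileThreshold(window_size=4, q=0.5)
for score in [0.1, 0.2, 0.3, 0.4, 0.5]:
    tracker.update(score)  # the window now holds 0.2, 0.3, 0.4, 0.5
```

Because the threshold follows the recent confidence distribution, it moves with the workload rather than relying on a manually tuned constant, which is the limitation the abstract attributes to model cascading.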

Authors:Jun Rao, Xuebo Liu, Hexuan Deng, Zepeng Lin, Zixiong Yu, Jiansheng Wei, Xiaojun Meng, Min Zhang
Title: Dynamic Sampling that Adapts: Iterative DPO for Self-Aware Mathematical Reasoning
Abstract:
In the realm of data selection for reasoning tasks, existing approaches predominantly rely on externally predefined static metrics such as difficulty and diversity, which are often designed for supervised fine-tuning (SFT) and lack adaptability to continuous training processes. A critical limitation of these methods is their inability to dynamically align with the evolving capabilities of models during online training, a gap that becomes increasingly pronounced with the rise of dynamic training paradigms and online reinforcement learning (RL) frameworks (e.g., R1 models). To address this, we introduce SAI-DPO, an algorithm that dynamically selects training data by continuously assessing a model's stage-specific reasoning abilities across different training phases. By integrating real-time model performance feedback, SAI-DPO adaptively adjusts data selection to the evolving strengths and weaknesses of the model, thus enhancing both data utilization efficiency and final task performance. Extensive experiments on three state-of-the-art models and eight mathematical reasoning benchmarks, including challenging competition-level datasets (e.g., AIME24 and AMC23), demonstrate that SAI-DPO achieves an average performance boost of up to 21.3 percentage points, with particularly notable improvements of 10 and 15 points on AIME24 and AMC23, respectively. These results highlight the superiority of dynamic, model-adaptive data selection over static, externally defined strategies in advancing reasoning.
Chinese: 现有推理任务的数据选择方法依赖难度和多样性等静态指标,无法适应在线训练中模型能力的动态变化,为此我们提出了SAI-DPO算法,通过实时性能反馈动态选择数据以提高效率和任务表现,实验显示其性能提升高达21.3个百分点。
English: Current data selection methods for reasoning tasks rely on static metrics like difficulty and diversity, which fail to adapt to evolving model capabilities during online training, prompting the introduction of SAI-DPO, an algorithm that dynamically selects data based on real-time performance feedback to enhance efficiency and task outcomes, achieving up to a 21.3 percentage point performance boost in experiments.

Authors:Zhiyuan Xu, Bohan Li, Huan-ang Gao, Mingju Gao, Yong Chen, Ming Liu, Chenxu Yan, Hang Zhao, Shuo Feng, Hao Zhao
Title: Challenger: Affordable Adversarial Driving Video Generation
Abstract:
Generating photorealistic driving videos has seen significant progress recently, but current methods largely focus on ordinary, non-adversarial scenarios. Meanwhile, efforts to generate adversarial driving scenarios often operate on abstract trajectory or BEV representations, falling short of delivering realistic sensor data that can truly stress-test autonomous driving (AD) systems. In this work, we introduce Challenger, a framework that produces physically plausible yet photorealistic adversarial driving videos. Generating such videos poses a fundamental challenge: it requires jointly optimizing over the space of traffic interactions and high-fidelity sensor observations. Challenger makes this affordable through two techniques: (1) a physics-aware multi-round trajectory refinement process that narrows down candidate adversarial maneuvers, and (2) a tailored trajectory scoring function that encourages realistic yet adversarial behavior while maintaining compatibility with downstream video synthesis. As tested on the nuScenes dataset, Challenger generates a diverse range of aggressive driving scenarios (including cut-ins, sudden lane changes, tailgating, and blind spot intrusions) and renders them into multiview photorealistic videos. Extensive evaluations show that these scenarios significantly increase the collision rate of state-of-the-art end-to-end AD models (UniAD, VAD, SparseDrive, and DiffusionDrive), and importantly, adversarial behaviors discovered for one model often transfer to others.
中文:Challenger框架通过轨迹优化和评分技术生成逼真的对抗性驾驶视频,有效提升了自动驾驶模型的碰撞率,并展现出跨模型的攻击迁移性。
English: The Challenger framework generates photorealistic adversarial driving videos by optimizing traffic interactions and sensor data through trajectory refinement and scoring, significantly increasing collision rates in autonomous driving models.

Authors:Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
Title: Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
Abstract:
Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the outcome accuracies and formats as rewards to update the Qwen2.5-VL model, enabling further refinement of the model's search and reasoning strategy without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark that requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating the more efficient deployment of VLMs in practical applications.
中文: 本文提出的Chain-of-Focus方法通过两阶段训练使视觉语言模型能基于视觉线索自适应聚焦关键图像区域,在多个基准测试中显著提升了多模态推理性能。
English: This paper introduces a Chain-of-Focus (CoF) method that enhances vision language models' multimodal reasoning by adaptively focusing on key image regions through a two-stage training pipeline, achieving significant performance improvements across multiple benchmarks.

Authors:Eduarda Caldeira, Jan Niklas Kolf, Naser Damer, Fadi Boutros
Title: DiffProb: Data Pruning for Face Recognition
Abstract:
Face recognition models have made substantial progress due to advances in deep learning and the availability of large-scale datasets. However, reliance on massive annotated datasets introduces challenges related to training computational cost and data storage, as well as potential privacy concerns regarding managing large face datasets. This paper presents DiffProb, the first data pruning approach for the application of face recognition. DiffProb assesses the prediction probabilities of training samples within each identity and prunes the ones with identical or close prediction probability values, as they are likely reinforcing the same decision boundaries and thus contribute minimal new information. We further enhance this process with an auxiliary cleaning mechanism to eliminate mislabeled and label-flipped samples, boosting data quality with minimal loss. Extensive experiments on CASIA-WebFace with different pruning ratios and multiple benchmarks, including LFW, CFP-FP, and IJB-C, demonstrate that DiffProb can prune up to 50% of the dataset while maintaining or even, in some settings, improving the verification accuracies. Additionally, we demonstrate DiffProb's robustness across different architectures and loss functions. Our method significantly reduces training cost and data volume, enabling efficient face recognition training and reducing the reliance on massive datasets and their demanding management.
中文: 本文提出DiffProb,一种面向人脸识别的数据剪枝方法,通过剔除预测概率相似的冗余训练样本并加入清理机制去除错误标记数据,能在削减高达50%数据集的同时保持甚至提升多个基准测试的准确率。
English: This paper introduces DiffProb, a novel data pruning method for face recognition that selectively removes redundant training samples with similar prediction probabilities and incorporates a cleaning mechanism to eliminate mislabeled data, achieving up to 50% dataset reduction while maintaining or improving accuracy across multiple benchmarks.
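
The pruning idea in the abstract (drop samples whose prediction probabilities within an identity are identical or close) can be sketched as below. This is a hedged illustration, not the DiffProb algorithm: the function name `prune_close_probabilities`, the tolerance `eps`, and the greedy "keep a sample only if it differs from the last kept one by more than `eps`" rule are assumptions, and the auxiliary cleaning step is omitted.

```python
from collections import defaultdict

def prune_close_probabilities(samples, eps=0.01):
    """samples: iterable of (sample_id, identity, prob), where prob is the model's
    predicted probability for the sample's true identity.
    Returns the sorted list of sample ids that are kept."""
    by_identity = defaultdict(list)
    for sample_id, identity, prob in samples:
        by_identity[identity].append((prob, sample_id))

    kept = []
    for items in by_identity.values():
        items.sort()  # ascending probability within one identity
        last_kept_prob = None
        for prob, sample_id in items:
            # Keep the sample only if it is not within eps of the last kept one.
            if last_kept_prob is None or prob - last_kept_prob > eps:
                kept.append(sample_id)
                last_kept_prob = prob
    return sorted(kept)

samples = [
    ("a", "id1", 0.900),
    ("b", "id1", 0.905),  # within eps of "a", so pruned
    ("c", "id1", 0.950),
    ("d", "id2", 0.900),  # different identity, compared only within id2
]
kept = prune_close_probabilities(samples, eps=0.01)
```

Comparisons are made only within an identity, matching the abstract's description; the pruning ratio in this sketch is controlled indirectly through `eps`.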

Authors:Zhengyi Li, Yue Guan, Kang Yang, Yu Feng, Ning Liu, Yu Yu, Jingwen Leng, Minyi Guo
Title: An Efficient Private GPT Never Autoregressively Decodes
Abstract:
The wide deployment of the generative pre-trained transformer (GPT) has raised privacy concerns for both clients and servers. While cryptographic primitives can be employed for secure GPT inference to protect the privacy of both parties, they introduce considerable performance overhead. To accelerate secure inference, this study proposes a public decoding and secure verification approach that utilizes public GPT models, motivated by the observation that securely decoding one and multiple tokens takes a similar latency. The client uses the public model to generate a set of tokens, which are then securely verified by the private model for acceptance. The efficiency of our approach depends on the acceptance ratio of tokens proposed by the public model, which we improve from two aspects: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using knowledge distillation. Our approach improves the efficiency of secure decoding while maintaining the same level of privacy and generation quality as standard secure decoding. Experiments demonstrate a 2.1× to 6.0× speedup compared to standard decoding across three pairs of public-private models and different network conditions.
Chinese: 本研究提出了一种公共解码与安全验证方法,通过客户端使用公共模型生成令牌并由私有模型安全验证,在保持隐私和生成质量的同时,显著提升了安全推理的效率。
English: This study introduces a public decoding and secure verification method to accelerate secure GPT inference by allowing clients to generate tokens with a public model and then securely verify them with a private model, achieving significant speed improvements while maintaining privacy and quality.
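
The draft-then-verify control flow described in the abstract can be illustrated in plaintext. The sketch below swaps in a simple greedy acceptance rule (accept the longest prefix on which the private model agrees, then emit the private model's own token) in place of the paper's private sampling protocol, and it involves no cryptography at all; `verify_draft` and `private_next_token` are hypothetical names.

```python
def verify_draft(prefix, draft, private_next_token):
    """Check draft tokens proposed by a public model against a private model.

    prefix: tokens generated so far.
    draft: tokens proposed by the public model.
    private_next_token: callable mapping a context (list of tokens) to the
        private model's next token. In a secure setting all draft positions
        would be checked in one protected pass; here they are checked in a loop.
    Returns the tokens to append to the output.
    """
    accepted = []
    context = list(prefix)
    for tok in draft:
        expected = private_next_token(context)
        if tok != expected:
            # First disagreement: take the private model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)
    return accepted

# Toy private model that always continues the fixed sequence a, b, c, d.
target = ["a", "b", "c", "d"]
private_model = lambda ctx: target[len(ctx)]

partial = verify_draft([], ["a", "b", "x", "d"], private_model)
full = verify_draft([], ["a", "b"], private_model)
```

The more draft tokens are accepted per verification round, the fewer secure rounds are needed, which is why the abstract ties efficiency to the acceptance ratio.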

Authors:Haijun Li, Tianqi Shi, Zifu Shang, Yuxuan Han, Xueyu Zhao, Hao Wang, Yu Qian, Zhiqiang Qian, Linlong Xu, Minghao Wu, Chenyang Lyu, Longyue Wang, Gongbo Tang, Weihua Luo, Zhao Xu, Kaifu Zhang
Title: TransBench: Benchmarking Machine Translation for Industrial-Scale Applications
Abstract:
Machine translation (MT) has become indispensable for cross-border communication in globalized industries like e-commerce, finance, and legal services, with recent advancements in large language models (LLMs) significantly enhancing translation quality. However, applying general-purpose MT models to industrial scenarios reveals critical limitations due to domain-specific terminology, cultural nuances, and stylistic conventions absent in generic benchmarks. Existing evaluation frameworks inadequately assess performance in specialized contexts, creating a gap between academic benchmarks and real-world efficacy. To address this, we propose a three-level translation capability framework: (1) Basic Linguistic Competence, (2) Domain-Specific Proficiency, and (3) Cultural Adaptation, emphasizing the need for holistic evaluation across these dimensions. We introduce TransBench, a benchmark tailored for industrial MT, initially targeting international e-commerce with 17,000 professionally translated sentences spanning 4 main scenarios and 33 language pairs. TransBench integrates traditional metrics (BLEU, TER) with Marco-MOS, a domain-specific evaluation model, and provides guidelines for reproducible benchmark construction. Our contributions include: (1) a structured framework for industrial MT evaluation, (2) the first publicly available benchmark for e-commerce translation, (3) novel metrics probing multi-level translation quality, and (4) open-sourced evaluation tools. This work bridges the evaluation gap, enabling researchers and practitioners to systematically assess and enhance MT systems for industry-specific needs.
中文: 机器翻译虽因大语言模型进步而提升,但通用模型在行业应用中存在术语和文化适配等局限,为此我们推出TransBench这一工业级基准,通过多层次评估框架和专用指标填补实际效能与学术标准间的鸿沟。
English: Recent advances in machine translation have improved quality, but general models struggle with domain-specific needs, leading to the development of TransBench, a specialized benchmark for industrial applications that integrates multi-level evaluation metrics and tools.

Authors:Mingliang Zhai, Zhi Gao, Yuwei Wu, Yunde Jia
Title: Memory-Centric Embodied Question Answer
Abstract:
Embodied Question Answering (EQA) requires agents to autonomously explore and understand the environment to answer context-dependent questions. Existing frameworks typically center around the planner, which guides the stopping module, memory module, and answering module for reasoning. In this paper, we propose a memory-centric EQA framework named MemoryEQA. Unlike planner-centric EQA models where the memory module cannot fully interact with other modules, MemoryEQA flexibly feeds memory information into all modules, thereby enhancing efficiency and accuracy in handling complex tasks, such as those involving multiple targets across different regions. Specifically, we establish a multi-modal hierarchical memory mechanism, which is divided into global memory that stores language-enhanced scene maps, and local memory that retains historical observations and state information. When performing EQA tasks, the multi-modal large language model is leveraged to convert memory information into the required input formats for injection into different modules. To evaluate EQA models' memory capabilities, we constructed the MT-HM3D dataset based on HM3D, comprising 1,587 question-answer pairs involving multiple targets across various regions, which requires agents to maintain memory of exploration-acquired target information. Experimental results on HM-EQA, MT-HM3D, and OpenEQA demonstrate the effectiveness of our framework, where a 19.8% performance gain on MT-HM3D compared to the baseline model further underscores memory capability's pivotal role in resolving complex tasks.
中文: 本文提出以记忆为中心的MemoryEQA框架,通过将记忆信息灵活注入所有模块来提升具身问答的效率和准确性,在复杂任务上的显著性能提升验证了其有效性。
English: This paper introduces MemoryEQA, a memory-centric framework for Embodied Question Answering that enhances efficiency and accuracy by flexibly feeding memory information into all modules, validated by significant performance gains on complex tasks.

Authors:Ye-Xin Lu, Hui-Peng Du, Fei Liu, Yang Ai, Zhen-Hua Ling
Title: Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising
Abstract:
Large language model (LLM) based zero-shot text-to-speech (TTS) methods tend to preserve the acoustic environment of the audio prompt, leading to degradation in synthesized speech quality when the audio prompt contains noise. In this paper, we propose a novel neural codec-based speech denoiser and integrate it with the advanced LLM-based TTS model, LauraTTS, to achieve noise-robust zero-shot TTS. The proposed codec denoiser consists of an audio codec, a token denoiser, and an embedding refiner. The token denoiser predicts the first two groups of clean acoustic tokens from the noisy ones, which can serve as the acoustic prompt for LauraTTS to synthesize high-quality personalized speech or be converted to clean speech waveforms through the embedding refiner and codec decoder. Experimental results show that our proposed codec denoiser outperforms state-of-the-art speech enhancement (SE) methods, and the proposed noise-robust LauraTTS surpasses the approach using additional SE models.
Chinese Summary: 本研究提出了一种基于神经编解码器的语音去噪器,并与LauraTTS模型结合实现抗噪声的零样本语音合成,实验证明其性能优于现有语音增强方法。
English Summary: The study introduces a neural codec-based speech denoiser integrated with LauraTTS to address noise degradation in zero-shot text-to-speech systems, demonstrating superior performance over existing methods.

Authors:Tian Wen, Sheng Sun, Yuwei Wang, Peiyan Chen, Zhiyuan Wu, Min Liu, Bo Gao
Title: SVAFD: A Secure and Verifiable Co-Aggregation Protocol for Federated Distillation
Abstract:
Secure Aggregation (SA) is an indispensable component of Federated Learning (FL) that concentrates on privacy preservation while allowing for robust aggregation. However, most SA designs rely heavily on the unrealistic assumption of homogeneous model architectures. Federated Distillation (FD), which aggregates locally computed logits instead of model parameters, introduces a promising alternative for cooperative training in heterogeneous model settings. Nevertheless, we recognize two major challenges in implementing SA for FD. (i) Prior SA designs encourage a dominant server, which is solely responsible for collecting, aggregating and distributing. Such central authority makes it easier for the server to forge aggregation proofs or collude to bypass the claimed security guarantees; (ii) Existing SA designs, tailored for FL models, overlook the intrinsic properties of logits, making them unsuitable for FD. To address these challenges, we propose SVAFD, the first SA protocol that is specifically designed for FD. At a high level, SVAFD incorporates two innovations: (i) a multilateral co-aggregation method that redefines the responsibilities of clients and server. Clients autonomously evaluate and aggregate logit shares locally with a lightweight coding scheme, while the server handles ciphertext decoding and performs the task of generating verification proofs; (ii) a quality-aware knowledge filtration method that facilitates biased logits exclusion against poisoning attacks. Moreover, SVAFD is resilient to stragglers and colluding clients, making it well-suited for dynamic networks in real-world applications. We have implemented the SVAFD prototype over four emerging FD architectures and evaluated it against poisoning and inference attacks. Results demonstrate that SVAFD improves model accuracy, making it a significant step forward in secure and verifiable aggregation for heterogeneous FL systems.
中文: SVAFD是首个专为联邦蒸馏设计的安全聚合协议,通过多方协同聚合和质量感知知识过滤机制,有效解决了异构模型场景下的安全漏洞与概率输出兼容性问题。
English: SVAFD is the first secure aggregation protocol designed specifically for Federated Distillation, introducing multilateral co-aggregation and quality-aware knowledge filtration to address security vulnerabilities and logit incompatibility in heterogeneous model settings.

Authors:Trever Schirmer, Valentin Carl, Nils Höller, Tobias Pfandzelter, David Bermbach
Title: Minos: Exploiting Cloud Performance Variation with Function-as-a-Service Instance Selection
Abstract:
Serverless Function-as-a-Service (FaaS) is a popular cloud paradigm to quickly and cheaply implement complex applications. Because the function instances that cloud providers start to execute user code run on shared infrastructure, their performance can vary. From a user perspective, slower instances not only take longer to complete, but also increase cost due to the pay-per-use model of FaaS services where execution duration is billed with microsecond accuracy. In this paper, we present Minos, a system to take advantage of this performance variation by intentionally terminating instances that are slow. Fast instances are not terminated, so that they can be re-used for subsequent invocations. One use case is data processing and machine learning workflows, which often download files as a first step, during which Minos can run a short benchmark. The main part of the function is executed only if the benchmark passes. Otherwise, the request is re-queued and the instance crashes itself, so that the platform has to assign the request to another (potentially faster) instance. In our experiments, this leads to a speedup of up to 13% in the resource-intensive part of a data processing workflow, resulting in up to 4% faster overall performance (and consequently 4% cheaper prices). Longer and more complex workflows lead to increased savings, as the pool of fast instances is re-used more often. For platforms exhibiting this behavior, users get better performance and save money by wasting more of the platform's resources.
Chinese: Minos系统通过主动终止运行缓慢的函数实例并重用快速实例,提升了无服务器计算的性能并降低了成本,在数据处理工作流中实现了高达13%的加速和4%的整体节省。
English: Minos is a system that improves performance and reduces costs in serverless computing by terminating slow function instances and reusing faster ones, achieving up to 13% speedup in data processing workflows and 4% overall savings.
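The benchmark-then-crash behaviour described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the `Instance` class, the `dispatch` function, and the `threshold_ms` parameter are hypothetical names, the "platform" is a plain Python list, and the benchmark runs before the main work rather than in parallel with a file download as the abstract describes.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    benchmark_ms: float   # hypothetical: time this instance needs for the short benchmark
    passed: bool = False  # set once the instance has passed the benchmark


class SlowInstance(Exception):
    """Raised when an instance fails the benchmark and should terminate itself."""


def handle(instance, request, threshold_ms, main_work):
    # The benchmark is only evaluated on the first invocation of an instance.
    if not instance.passed:
        if instance.benchmark_ms > threshold_ms:
            raise SlowInstance()
        instance.passed = True
    return main_work(request)


def dispatch(request, fresh_instances, warm_pool, threshold_ms, main_work):
    """Toy platform: prefer a warm instance, else start fresh ones until one passes."""
    if warm_pool:
        return handle(warm_pool[0], request, threshold_ms, main_work)
    while fresh_instances:
        instance = fresh_instances.pop(0)
        try:
            result = handle(instance, request, threshold_ms, main_work)
        except SlowInstance:
            continue  # the request is re-queued; the slow instance is discarded
        warm_pool.append(instance)  # fast instance stays available for re-use
        return result
    raise RuntimeError("no instance passed the benchmark")
```

In this sketch, a slow instance costs one extra attempt, while every later request is served by an instance that already passed, which mirrors why longer workflows benefit more from the pool of fast instances.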

Authors:Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, Bo Li
Title: Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning
Abstract:
Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet, grounding these instructions to precise interface elements remains challenging, especially in complex, high-resolution, professional environments. Traditional supervised finetuning (SFT) methods often require large volumes of diverse data and exhibit weak generalization. To overcome these limitations, we introduce a reinforcement learning (RL) based framework that incorporates three core strategies: (1) seed data curation to ensure high-quality training samples, (2) a dense policy gradient that provides continuous feedback based on prediction accuracy, and (3) a self-evolutionary reinforcement finetuning mechanism that iteratively refines the model using attention maps. With only 3k training samples, our 7B-parameter model achieves state-of-the-art results among similarly sized models on three grounding benchmarks. Notably, it attains 47.3% accuracy on the ScreenSpot-Pro dataset, outperforming much larger models, such as UI-TARS-72B, by a margin of 24.2%. These findings underscore the effectiveness of RL-based approaches in enhancing GUI agent performance, particularly in high-resolution, complex environments.
中文: 图形用户界面代理在执行用户指令方面取得进展,但精确定位界面元素仍具挑战,新提出的强化学习框架通过种子数据筛选、密集策略梯度和自进化微调机制,仅用少量样本即在多个基准测试中达到顶尖性能。
English: GUI agents have advanced in executing user commands but struggle with precise element grounding, which a new reinforcement learning framework addresses using curated data, dense policy gradients, and self-evolutionary finetuning to achieve state-of-the-art results with minimal training samples.

Authors:Kui Jiang, Jing Cao, Zhaocheng Yu, Junjun Jiang, Jingchun Zhou
Title: Always Clear Depth: Robust Monocular Depth Estimation under Adverse Weather
Abstract:
Monocular depth estimation is critical for applications such as autonomous driving and scene reconstruction. While existing methods perform well under normal scenarios, their performance declines in adverse weather, due to challenging domain shifts and difficulties in extracting scene information. To address this issue, we present a robust monocular depth estimation method called ACDepth from the perspective of high-quality training data generation and domain adaptation. Specifically, we introduce a one-step diffusion model for generating samples that simulate adverse weather conditions, constructing a multi-tuple degradation dataset during training. To ensure the quality of the generated degradation samples, we employ LoRA adapters to fine-tune the generation weights of the diffusion model. Additionally, we integrate circular consistency loss and adversarial training to guarantee the fidelity and naturalness of the scene contents. Furthermore, we elaborate on a multi-granularity knowledge distillation strategy (MKD) that encourages the student network to absorb knowledge from both the teacher model and pretrained Depth Anything V2. This strategy guides the student model in learning degradation-agnostic scene information from various degradation inputs. In particular, we introduce an ordinal guidance distillation mechanism (OGD) that encourages the network to focus on uncertain regions through differential ranking, leading to a more precise depth estimation. Experimental results demonstrate that our ACDepth surpasses md4all-DD by 2.50% for night scenes and 2.61% for rainy scenes on the nuScenes dataset in terms of the absRel metric.
中文: ACDepth是一种鲁棒的单目深度估计方法,通过扩散模型生成高质量退化训练数据,并采用多粒度知识蒸馏策略学习与退化无关的场景信息,从而提升恶劣天气下的性能表现。
English: ACDepth is a robust monocular depth estimation method that enhances performance in adverse weather by generating high-quality degraded training data through a diffusion model and employing multi-granularity knowledge distillation to learn degradation-agnostic scene information.

Authors:Chang Liu, Huan Yan, Hongjie Sui, Haomin Wen, Yuan Yuan, Yuyang Han, Hongsen Liao, Xuetao Ding, Jinghua Hao, Yong Li
Title: MRGRP: Empowering Courier Route Prediction in Food Delivery Service with Multi-Relational Graph
Abstract:
Instant food delivery has become one of the most popular web services worldwide due to its convenience in daily life. A fundamental challenge is accurately predicting courier routes to optimize task dispatch and improve delivery efficiency. This enhances satisfaction for couriers and users and increases platform profitability. The current heuristic prediction method uses only limited human-selected task features and ignores couriers' preferences, causing suboptimal results. Additionally, existing learning-based methods do not fully capture the diverse factors influencing courier decisions or the complex relationships among them. To address this, we propose a Multi-Relational Graph-based Route Prediction (MRGRP) method that models fine-grained correlations among tasks affecting courier decisions for accurate prediction. We encode spatial and temporal proximity, along with pickup-delivery relationships, into a multi-relational graph and design a GraphFormer architecture to capture these complex connections. We also introduce a route decoder that leverages courier information and dynamic distance and time contexts for prediction, using existing route solutions as references to improve outcomes. Experiments show our model achieves state-of-the-art route prediction on offline data from cities of various sizes. Deployed on the Meituan Turing platform, it surpasses the current heuristic algorithm, reaching a high route prediction accuracy of 0.819, essential for courier and user satisfaction in instant food delivery.
Chinese: 本文提出基于多关系图的路线预测方法(MRGRP),通过GraphFormer架构精准建模骑手决策因素,在美团平台实现0.819的路线预测准确率,显著提升了即时配送效率。
English: This paper introduces a Multi-Relational Graph-based Route Prediction (MRGRP) method that accurately models courier decision factors using a GraphFormer architecture, achieving state-of-the-art prediction accuracy of 0.819 on the Meituan platform to optimize delivery efficiency.

Authors:Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu
Title: Parallel Scaling Law for Language Models
Abstract:
It is commonly believed that scaling language models must incur a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce a third and more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference time. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that using $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.
中文摘要:ParScale提出了一种创新的并行扩展方法,通过在训练和推理阶段增加并行计算来提升模型性能,相比传统参数扩展方法,能以显著减少的内存占用和延迟实现更优的效率。
English Summary: ParScale introduces a novel parallel scaling method that enhances model performance by increasing parallel computation during training and inference, offering superior efficiency with significantly reduced memory and latency compared to traditional parameter scaling.
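The transform, forward, and aggregate pattern from the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `parscale_forward`, the `transforms` list, and the `score` function are hypothetical stand-ins, vectors are plain Python lists, and the softmax-weighted average is only one assumed way to aggregate the $P$ outputs dynamically.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parscale_forward(x, model, transforms, score):
    """Run one shared model on P transformed inputs and mix the P outputs."""
    # P forward passes that all reuse the same model parameters.
    outputs = [model(t(x)) for t in transforms]
    # Aggregation weights depend on the outputs themselves (assumed scoring function).
    weights = softmax([score(o) for o in outputs])
    dim = len(outputs[0])
    return [sum(w * o[i] for w, o in zip(weights, outputs)) for i in range(dim)]
```

With a constant `score`, the weights are uniform and the result is the plain average of the $P$ stream outputs. In a real system the $P$ passes would be batched so they execute in parallel.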

Authors:Tiancong Cheng, Ying Zhang, Yuxuan Liang, Roger Zimmermann, Zhiwen Yu, Bin Guo
Title: JointDistill: Adaptive Multi-Task Distillation for Joint Depth Estimation and Scene Segmentation
Abstract:
Depth estimation and scene segmentation are two important tasks in intelligent transportation systems. Jointly modeling these two tasks reduces both storage and training requirements. This work explores how multi-task distillation can be used to improve such unified modeling. While existing solutions transfer multiple teachers' knowledge in a static way, we propose a self-adaptive distillation method that can dynamically adjust the knowledge amount from each teacher according to the student's current learning ability. Furthermore, as multiple teachers exist, the student's gradient update direction in the distillation is more prone to be erroneous, and knowledge forgetting may occur. To avoid this, we propose a knowledge trajectory to record the most essential information that a model has learnt in the past, based on which a trajectory-based distillation loss is designed to guide the student to follow the learning curve similarly in a cost-effective way. We evaluate our method on multiple benchmarking datasets including Cityscapes and NYU-v2. Compared to the state-of-the-art solutions, our method achieves a clear improvement. The code is provided in the supplementary materials.
Chinese: 本文提出了一种自适应蒸馏方法,可根据学生当前学习能力动态调整教师知识传递,并设计知识轨迹机制防止遗忘,在深度估计与场景分割联合任务中实现了更优性能。
English: This paper introduces a self-adaptive distillation method that dynamically adjusts teacher knowledge transfer and a knowledge trajectory mechanism to prevent forgetting, achieving superior performance in joint depth estimation and scene segmentation tasks.

Authors:Zheng-Xin Yong, M. Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Genta Indra Winata, Julia Kreutzer, Stephen H. Bach, Alham Fikri Aji
Title: Crosslingual Reasoning through Test-Time Scaling
Abstract:
Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potential, study the mechanisms and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.
中文摘要:以英语为中心的推理语言模型通过扩展推理计算可实现跨语言的数学推理泛化,在高资源语言中表现更优,但在低资源语言和领域外情境中仍有局限。
English summary: English-centric reasoning language models can generalize mathematical reasoning across languages through inference scaling and exhibit better performance in high-resource languages, though they struggle with low-resource languages and out-of-domain contexts.

Authors:Kaixin Wang, Tianlin Li, Xiaoyu Zhang, Chong Wang, Weisong Sun, Yang Liu, Bin Shi
Title: Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents
Abstract:
Code large language models (CodeLLMs) and agents have shown great promise in tackling complex software engineering tasks. Compared to traditional software engineering methods, CodeLLMs and agents offer stronger abilities, and can flexibly process inputs and outputs in both natural language and code. Benchmarking plays a crucial role in evaluating the capabilities of CodeLLMs and agents, guiding their development and deployment. However, despite their growing significance, there remains a lack of comprehensive reviews of benchmarks for CodeLLMs and agents. To bridge this gap, this paper provides a comprehensive review of existing benchmarks for CodeLLMs and agents, studying and analyzing 181 benchmarks from 461 relevant papers, covering the different phases of the software development life cycle (SDLC). Our findings reveal a notable imbalance in the coverage of current benchmarks, with approximately 60% focused on the software development phase in SDLC, while requirements engineering and software design phases receive minimal attention at only 5% and 3%, respectively. Additionally, Python emerges as the dominant programming language across the reviewed benchmarks. Finally, this paper highlights the challenges of current research and proposes future directions, aiming to narrow the gap between the theoretical capabilities of CodeLLMs and agents and their application in real-world scenarios.
中文: 代码大语言模型和智能体在软件工程任务中展现出比传统方法更强的能力,但现有基准测试覆盖严重失衡,约60%集中于开发阶段,而需求工程和软件设计阶段仅占5%和3%。
English: Code large language models and agents demonstrate superior capabilities in software engineering tasks compared to traditional methods, but current benchmarks show a significant imbalance with heavy focus on development phases while neglecting requirements engineering and design.

Authors:Hendrik Möller, Hanna Schön, Alina Dima, Benjamin Keinert-Weth, Robert Graf, Matan Atad, Johannes Paetzold, Friederike Jungmann, Rickmer Braren, Florian Kofler, Bjoern Menze, Daniel Rueckert, Jan S. Kirschke
Title: Automated Thoracolumbar Stump Rib Detection and Analysis in a Large CT Cohort
Abstract:
Thoracolumbar stump ribs are one of the essential indicators of thoracolumbar transitional vertebrae or enumeration anomalies. While some studies manually assess these anomalies and describe the ribs qualitatively, this study aims to automate thoracolumbar stump rib detection and analyze their morphology quantitatively. To this end, we train a high-resolution deep-learning model for rib segmentation and show significant improvements compared to existing models (Dice score 0.997 vs. 0.779, p-value < 0.01). In addition, we use an iterative algorithm and piece-wise linear interpolation to assess the length of the ribs, showing a success rate of 98.2%. When analyzing morphological features, we show that stump ribs articulate more posteriorly at the vertebrae (-19.2 ± 3.8 vs. -13.8 ± 2.5, p-value < 0.01), are thinner (260.6 ± 103.4 vs. 563.6 ± 127.1, p-value < 0.01), and are oriented more downwards and sideways within the first centimeters in contrast to full-length ribs. We show that with partially visible ribs, these features can achieve an F1-score of 0.84 in differentiating stump ribs from regular ones. We publish the model weights and masks for public use.
Chinese: 本研究开发了一种高分辨率深度学习模型,用于自动检测和定量分析胸腰椎残端肋骨,不仅展现了卓越的分割性能,还揭示了其与全长肋骨的关键形态差异,并公开了模型资源供公众使用。
English: This study develops a high-resolution deep-learning model to automatically detect and quantitatively analyze thoracolumbar stump ribs, demonstrating superior segmentation performance and identifying key morphological differences from full-length ribs, with model resources made publicly available.
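The idea of measuring rib length by piece-wise linear interpolation can be sketched as the length of a polyline through sampled centerline points. This is a hedged illustration only: the abstract does not specify how the points are obtained by the iterative algorithm, so `polyline_length` and the assumption of an ordered list of 3D points are hypothetical.

```python
import math


def polyline_length(points):
    """Sum the straight-line distances between consecutive 3D points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))
```

For points given in millimetres the result is in millimetres. A denser sampling of the centerline brings the polyline length closer to the true curve length.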

Authors:Minghe Wang, Alexandra Kapp, Trever Schirmer, Tobias Pfandzelter, David Bermbach
Title: Exploring Influence Factors on LLM Suitability for No-Code Development of End User IoT Applications
Abstract:
With the increasing popularity of IoT applications, end users demand more personalized and intuitive functionality. A major obstacle to this, however, is that custom IoT functionality today still requires at least some coding skills. To address this, no-code development platforms have been proposed as a solution for empowering non-technical users to create applications. However, such platforms still require a certain level of technical expertise for structuring process steps or defining event-action relations. The advent of LLMs can further enhance no-code platforms by enabling natural language-based interaction, automation of complex tasks, and dynamic code generation. By allowing users to describe their requirements in natural language, LLMs can significantly streamline no-code development. As LLMs vary in performance, architecture, training data used, and the use cases they target, it is still unclear which models are best suited and which factors determine this fit. In particular, no-code development of IoT applications by non-technical users will place completely different demands on LLMs than, e.g., code generation for more open-ended applications or for supporting professional developers. In this paper, we explore the factors influencing the suitability of LLMs for no-code development of IoT applications. We also examine the role of input prompt language on accuracy and quality of generated applications as well as the influence of LLM training data. By conducting comprehensive experiments with a range of LLMs, we provide valuable insights for optimizing LLM-powered no-code platforms, guiding the selection of suitable LLMs and their effective application. Our findings contribute to improving the accessibility, efficiency, and user experience of no-code IoT development, ultimately enabling broader adoption of IoT technologies among non-expert users.
中文: 本文通过分析模型选择、提示语言影响和训练数据等因素,研究大型语言模型如何优化无代码物联网开发平台,从而提升非技术用户的使用体验和普及度。
English: This paper investigates how large language models can enhance no-code IoT development by analyzing factors like model selection, prompt language impact, and training data to improve platform accessibility for non-technical users.

Authors:Riccardo Tedeschi, Gianmarco Ottavi, Côme Allart, Nils Wistoff, Zexin Fu, Filippo Grillotti, Fabio De Ambroggi, Elio Guidetti, Jean-Baptiste Rigaud, Olivier Potin, Jean Roch Coulon, César Fuguet, Luca Benini, Davide Rossi
Title: CVA6S+: A Superscalar RISC-V Core with High-Throughput Memory Architecture
Abstract:
Open-source RISC-V cores are increasingly adopted in high-end embedded domains such as automotive, where maximizing instructions per cycle (IPC) is becoming critical. Building on the industry-supported open-source CVA6 core and its superscalar variant, CVA6S, we introduce CVA6S+, an enhanced version incorporating improved branch prediction, register renaming and enhanced operand forwarding. These optimizations enable CVA6S+ to achieve a 43.5% performance improvement over the scalar configuration and 10.9% over CVA6S, with an area overhead of just 9.30% over the scalar core (CVA6). Furthermore, we integrate CVA6S+ with the OpenHW Core-V High-Performance L1 Dcache (HPDCache) and report a 74.1% bandwidth improvement over the legacy CVA6 cache subsystem.
中文: 增强版CVA6S+ RISC-V核心通过优化分支预测、寄存器重命名和操作数转发技术,在仅增加9.30%面积开销下实现较基础版本43.5%的性能提升,并配合高性能缓存将带宽提升74.1%。
English: The enhanced CVA6S+ RISC-V core, featuring improved branch prediction, register renaming, and operand forwarding, achieves a 43.5% performance gain over the scalar CVA6 with only 9.30% area overhead and boosts bandwidth by 74.1% when integrated with HPDCache.

Authors:Yanning Hou, Sihang Zhou, Ke Liang, Lingyuan Meng, Xiaoshu Chen, Ke Xu, Siwei Wang, Xinwang Liu, Jian Huang
Title: Soft Reasoning Paths for Knowledge Graph Completion
Abstract:
Reasoning paths are reliable information in knowledge graph completion (KGC) in which algorithms can find strong clues of the actual relation between entities. However, in real-world applications, it is difficult to guarantee that computationally affordable paths exist toward all candidate entities. According to our observation, the prediction accuracy drops significantly when paths are absent. To make the proposed algorithm more stable against the missing path circumstances, we introduce soft reasoning paths. Concretely, a specific learnable latent path embedding is concatenated to each relation to help better model the characteristics of the corresponding paths. The combination of the relation and the corresponding learnable embedding is termed a soft path in our paper. By aligning the soft paths with the reasoning paths, a learnable embedding is guided to learn a generalized path representation of the corresponding relation. In addition, we introduce a hierarchical ranking strategy to make full use of information about the entity, relation, path, and soft path to help improve both the efficiency and accuracy of the model. Extensive experimental results illustrate that our algorithm outperforms the compared state-of-the-art algorithms by a notable margin. The code will be made publicly available after the paper is officially accepted.
中文摘要:本文提出软推理路径和可学习嵌入,以在传统推理路径缺失时增强知识图谱补全的稳定性,并采用分层排序策略显著提升模型的准确性和效率。
English Summary: The paper introduces soft reasoning paths with learnable embeddings to enhance knowledge graph completion stability when traditional reasoning paths are missing, and employs a hierarchical ranking strategy to significantly improve model accuracy and efficiency.

Authors:Liam Boyle, Nicolas Baumann, Paviththiren Sivasothilingam, Michele Magno, Luca Benini
Title: RobotxR1: Enabling Embodied Robotic Intelligence on Large Language Models through Closed-Loop Reinforcement Learning
Abstract:
Future robotic systems operating in real-world environments will require on-board embodied intelligence without continuous cloud connection, balancing capabilities with constraints on computational power and memory. This work presents an extension of the R1-Zero approach, which enables the usage of low parameter-count Large Language Models (LLMs) in the robotic domain. The R1-Zero approach was originally developed to enable mathematical reasoning in LLMs using static datasets. We extend it to the robotics domain through integration in a closed-loop Reinforcement Learning (RL) framework. This extension enhances reasoning in Embodied Artificial Intelligence (Embodied AI) settings without relying solely on distillation of large models through Supervised Fine-Tuning (SFT). We show that small-scale LLMs can achieve effective reasoning performance by learning through closed-loop interaction with their environment, which enables tasks that previously required significantly larger models. In an autonomous driving setting, a performance gain of 20.2 percentage points over the SFT-based baseline is observed with a Qwen2.5-1.5B model. Using the proposed training procedure, Qwen2.5-3B achieves a 63.3% control adaptability score, surpassing the 58.5% obtained by the much larger, cloud-bound GPT-4o. These results highlight that practical, on-board deployment of small LLMs is not only feasible but can outperform larger models if trained through environmental feedback, underscoring the importance of an interactive learning framework for robotic Embodied AI, one grounded in practical experience rather than static supervision.
中文: 本研究扩展了R1-Zero方法,通过闭环强化学习使小型大语言模型在具身人工智能环境中实现有效推理,证明经过环境交互训练的本地部署模型性能可超越云端大型模型。
English: This work extends the R1-Zero approach to enable small-scale LLMs to perform effective reasoning in embodied AI settings through closed-loop reinforcement learning, demonstrating that on-board deployment can outperform larger cloud-based models when trained with environmental interaction.

Authors:Bu Jin, Weize Li, Baihan Yang, Zhenxin Zhu, Junpeng Jiang, Huan-ang Gao, Haiyang Sun, Kun Zhan, Hengtong Hu, Xueyang Zhang, Peng Jia, Hao Zhao
Title: PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth
Abstract:
Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
中文: 自动驾驶系统的最新进展凸显了世界模型的重要性,但精确的相机姿态控制仍是挑战;轻量级PosePilot框架通过自监督深度和姿态估计增强了可控性,实现了更优的视点合成效果。
English: Recent advances in autonomous driving systems emphasize the importance of world models, yet precise camera pose control remains a challenge, addressed by the lightweight PosePilot framework that enhances controllability through self-supervised depth and pose estimation for improved viewpoint synthesis.

Authors:Thilina Pathirage Don, Aneta Neumann, Frank Neumann
Title: Weighted-Scenario Optimisation for the Chance Constrained Travelling Thief Problem
Abstract:
The chance constrained travelling thief problem (chance constrained TTP) has been introduced as a stochastic variation of the classical travelling thief problem (TTP) in an attempt to embody the effect of uncertainty in the problem definition. In this work, we characterise the chance constrained TTP using a limited number of weighted scenarios. Each scenario represents a similar TTP instance, differing slightly in the weight profile of the items and associated with a certain probability of occurrence. Collectively, the weighted scenarios represent a relaxed form of a stochastic TTP instance where the objective is to maximise the expected benefit while satisfying the knapsack constraint with a larger probability. We incorporate a set of evolutionary algorithms and heuristic procedures developed for the classical TTP, and formulate adaptations that apply to the weighted scenario-based representation of the problem. The analysis focuses on the performance of the algorithms on different settings and examines the impact of uncertainty on the quality of the solutions.
中文: 机会约束旅行窃贼问题通过采用具有不同物品权重和概率的加权场景,将不确定性引入经典问题,旨在利用改进的进化算法和启发式方法,在更高概率满足背包约束的同时最大化期望收益。
English: The chance constrained travelling thief problem introduces uncertainty into the classical version by using weighted scenarios with varying item weights and probabilities, aiming to maximize expected benefit while meeting knapsack constraints with higher probability through adapted evolutionary algorithms and heuristics.
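The weighted-scenario view described above can be made concrete with a small evaluation routine. This is a sketch under assumptions, not the authors' formulation: `evaluate`, the `(probability, weights)` scenario layout, the confidence level `alpha`, and the `benefit` callable are hypothetical names, and the sketch averages the benefit over all scenarios, including those in which the knapsack is overloaded.

```python
def evaluate(selection, scenarios, capacity, alpha, benefit):
    """Return (expected benefit, chance constraint satisfied?) for a packing plan.

    selection: indices of the packed items
    scenarios: list of (probability, item_weights) pairs whose probabilities sum to 1
    alpha:     required probability of respecting the knapsack capacity
    benefit:   hypothetical objective, called as benefit(selection, item_weights)
    """
    feasible_prob = 0.0
    expected = 0.0
    for prob, weights in scenarios:
        load = sum(weights[i] for i in selection)
        if load <= capacity:
            feasible_prob += prob  # scenario respects the knapsack constraint
        expected += prob * benefit(selection, weights)
    return expected, feasible_prob >= alpha
```

An evolutionary algorithm or heuristic could call such a routine as its fitness function and reject or penalise plans for which the second return value is `False`.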

Authors:Xin He, Xumeng Han, Longhui Wei, Lingxi Xie, Qi Tian
Title: Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts
Abstract:
Multimodal large language models (MLLMs) require a nuanced interpretation of complex image information, typically leveraging a vision encoder to perceive various visual scenarios. However, relying solely on a single vision encoder to handle diverse task domains proves difficult and inevitably leads to conflicts. Recent work enhances data perception by directly integrating multiple domain-specific vision encoders, yet this structure adds complexity and limits the potential for joint optimization. In this paper, we introduce Mixpert, an efficient mixture-of-vision-experts architecture that inherits the joint learning advantages from a single vision encoder while being restructured into a multi-expert paradigm for task-specific fine-tuning across different visual tasks. Additionally, we design a dynamic routing mechanism that allocates input images to the most suitable visual expert. Mixpert effectively alleviates domain conflicts encountered by a single vision encoder in multi-task learning with minimal additional computational cost, making it more efficient than multiple encoders. Furthermore, Mixpert integrates seamlessly into any MLLM, with experimental results demonstrating substantial performance gains across various tasks.
中文摘要:Mixpert是一种高效的视觉专家混合架构,通过动态路由机制将输入图像分配给最合适的视觉专家,有效缓解了多任务学习中单视觉编码器的领域冲突问题,并以最小计算成本显著提升各项任务性能。
English Summary: Mixpert is an efficient mixture-of-vision-experts architecture that dynamically routes images to specialized experts, resolving domain conflicts in multimodal language models while maintaining computational efficiency and boosting task performance.

Authors:Xiaodong Ji, Hailin Zhang, Fangcheng Fu, Bin Cui
Title: SALE : Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
Abstract:
Many advanced Large Language Model (LLM) applications require long-context processing, but the self-attention module becomes a bottleneck during the prefilling stage of inference due to its quadratic time complexity with respect to sequence length. Existing sparse attention methods accelerate attention computation by skipping less significant regions of the attention map. However, these approaches typically perform coarse-grained inspection of the attention map, rendering considerable loss in model accuracy. In this paper, we propose SALE, a fine-grained sparse attention method that accelerates the long-context prefilling stage of LLM with negligible loss in model accuracy. SALE achieves fast and accurate fine-grained attention weight estimation through 4-bit quantized query-key products, followed by block-sparse attention to accelerate prefilling computations. For importance evaluation for query-key pairs, we adopt our Relative Attention Score metric, which offers significantly higher efficiency within our framework. We implement a custom CUDA kernel optimized for our approach for hardware efficiency, reducing the additional overhead to approximately 11% of the full attention latency. Notably, SALE requires no parameter training and can be seamlessly integrated into existing systems with trivial code modifications. Experiments on long-context benchmarks demonstrate that our method outperforms existing approaches in accuracy-efficiency trade-offs, achieving at least 3.36x speedups on Llama-3.1-8B for sequences longer than 64K while maintaining model quality.
中文: SALE提出了一种细粒度稀疏注意力方法,通过4位量化和块稀疏计算来加速大语言模型的长上下文预填充阶段,在保持模型精度的同时实现超过3.36倍的加速效果,且无需重新训练模型参数。
English: SALE introduces a fine-grained sparse attention method using 4-bit quantization and block-sparse computation to accelerate LLM long-context prefilling with minimal accuracy loss, achieving over 3.36x speedup for sequences beyond 64K tokens without requiring retraining.
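
The estimate-then-skip idea in the SALE abstract can be sketched in plain Python: quantize queries and keys to 4-bit integers, use the cheap integer products to estimate attention scores, and keep only the key blocks that look important. This is an illustration under stated assumptions, not the paper's kernel: the function names, the symmetric [-7, 7] quantizer, and the `margin` rule (a stand-in for the paper's Relative Attention Score, which the abstract does not define) are all hypothetical.

```python
def quantize_int4(vec):
    """Symmetric per-vector quantization to integers in [-7, 7] (assumed scheme)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale


def estimated_score(q, k):
    """Approximate the dot product q.k from 4-bit integer products."""
    q_int, q_scale = quantize_int4(q)
    k_int, k_scale = quantize_int4(k)
    return sum(a * b for a, b in zip(q_int, k_int)) * q_scale * k_scale


def select_key_blocks(q, keys, block_size, margin):
    """Return indices of key blocks whose best estimated score is within
    `margin` of the best score overall. The margin rule is a placeholder
    for the paper's importance metric, not a reproduction of it."""
    scores = [estimated_score(q, k) for k in keys]
    best = max(scores)
    kept = []
    for start in range(0, len(keys), block_size):
        if max(scores[start:start + block_size]) >= best - margin:
            kept.append(start // block_size)
    return kept


# Toy data: the first block of keys aligns with the query, the second opposes it.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.8, 0.2]]
kept_blocks = select_key_blocks(query, keys, block_size=2, margin=0.5)
```

Exact attention would then be computed only over the kept blocks, which is where the block-sparse speedup comes from.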

Authors:Chao Huang, Yuesheng Ma, Junxuan Huang, Susan Liang, Yunlong Tang, Jing Bi, Wenqiang Liu, Nima Mesgarani, Chenliang Xu
Title: ZeroSep: Separate Anything in Audio with Zero Training
Abstract:
Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
中文摘要:ZeroSep是一种创新的零样本音频源分离方法,通过将预训练的文本引导音频扩散模型重新用于分离任务,无需特定训练即可利用文本条件引导去噪过程,在多个基准测试中展现出卓越性能。
English Summary: ZeroSep is a novel zero-shot audio source separation method that repurposes pre-trained text-guided audio diffusion models without task-specific training, achieving strong performance by inverting mixed audio into latent space and using text conditioning to guide denoising.

Authors:Zenghui Yuan, Yangming Xu, Jiawen Shi, Pan Zhou, Lichao Sun
Title: Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models
Abstract:
Model merging for Large Language Models (LLMs) directly fuses the parameters of different models finetuned on various tasks, creating a unified model for multi-domain tasks. However, due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it. Once a victim user merges it with any other models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives, effectiveness and utility, and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. Additionally, we show that the attack remains effective even when merging real-world models. Moreover, our attack demonstrates robustness against two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time defense (Fine-pruning).
中文: 本文提出Merge Hijacking攻击方法,针对大语言模型融合过程植入后门,能在保持模型多任务性能的同时实现有效攻击,并证实其对多种防御措施和实际应用场景均具有突破能力。
English: This paper introduces Merge Hijacking, a novel backdoor attack on LLM model merging that embeds a backdoor into merged models while preserving their utility across tasks; it remains effective when merging real-world models and is robust against two inference-time defenses and one training-time defense.

Authors:Zhejian Yang, Yongchao Chen, Xueyang Zhou, Jiangyue Yan, Dingjie Song, Yinuo Liu, Yuting Li, Yu Zhang, Pan Zhou, Hechang Chen, Lichao Sun
Title: Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents
Abstract:
Long-horizon robotic manipulation poses significant challenges for autonomous systems, requiring extended reasoning, precise execution, and robust error recovery across complex sequential tasks. Current approaches, whether based on static planning or end-to-end visuomotor policies, suffer from error accumulation and lack effective verification mechanisms during execution, limiting their reliability in real-world scenarios. We present Agentic Robot, a brain-inspired framework that addresses these limitations through Standardized Action Procedure (SAP)--a novel coordination protocol governing component interactions throughout manipulation tasks. Drawing inspiration from Standardized Operating Procedures (SOPs) in human organizations, SAP establishes structured workflows for planning, execution, and verification phases. Our architecture comprises three specialized components: (1) a large reasoning model that decomposes high-level instructions into semantically coherent subgoals, (2) a vision-language-action executor that generates continuous control commands from real-time visual inputs, and (3) a temporal verifier that enables autonomous progression and error recovery through introspective assessment. This SAP-driven closed-loop design supports dynamic self-verification without external supervision. On the LIBERO benchmark, Agentic Robot achieves state-of-the-art performance with an average success rate of 79.6%, outperforming SpatialVLA by 6.1% and OpenVLA by 7.4% on long-horizon tasks. These results demonstrate that SAP-driven coordination between specialized components enhances both performance and interpretability in sequential manipulation, suggesting significant potential for reliable autonomous systems. Project Github: https://agentic-robot.github.io.
Chinese: Agentic Robot框架采用受大脑启发的标准化行动规程(SAP),通过协调规划、执行与验证三个专业组件解决长周期机器人操作中的误差累积问题,在LIBERO基准测试中以79.6%的成功率刷新性能纪录。
English: The Agentic Robot framework introduces a brain-inspired Standardized Action Procedure (SAP) that coordinates specialized components for planning, execution, and verification to overcome error accumulation in long-horizon robotic manipulation, achieving state-of-the-art 79.6% success on the LIBERO benchmark.

Authors:Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
Title: The Warmup Dilemma: How Learning Rate Strategies Impact Speech-to-Text Model Convergence
Abstract:
Training large-scale models presents challenges not only in terms of resource requirements but also in terms of their convergence. For this reason, the learning rate (LR) is often decreased when the size of a model is increased. Such a simple solution is not enough in the case of speech-to-text (S2T) trainings, where evolved and more complex variants of the Transformer architecture -- e.g., Conformer or Branchformer -- are used in light of their better performance. As a workaround, OWSM designed a double linear warmup of the LR, increasing it to a very small value in the first phase before updating it to a higher value in the second phase. While this solution worked well in practice, it was not compared with alternative solutions, nor was the impact on the final performance of different LR warmup schedules studied. This paper fills this gap, revealing that i) large-scale S2T trainings demand a sub-exponential LR warmup, and ii) a higher LR in the warmup phase accelerates initial convergence, but it does not boost final performance.
Chinese: 本研究发现大规模语音转文本训练需要采用次指数学习率预热,其中较高的初始学习率虽能加速早期收敛,但无法提升最终性能。
English: This study finds that large-scale speech-to-text training requires a sub-exponential learning rate warmup, where a higher initial rate speeds up early convergence but does not enhance final performance.
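
The double linear warmup that the abstract attributes to OWSM can be pictured as a piecewise-linear schedule. The sketch below is only an illustration of that shape: the function name, the step counts, and the learning-rate values are assumptions chosen for readability, not settings from the paper.

```python
def double_linear_warmup(step, phase1_steps, phase2_steps, low_lr, peak_lr):
    """Two-phase linear LR warmup (illustrative values only).

    Phase 1 ramps linearly from 0 to low_lr; phase 2 ramps linearly from
    low_lr to peak_lr. After warmup the LR is simply held at peak_lr here;
    a real schedule would normally decay it afterwards.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step <= phase1_steps:
        return low_lr * step / phase1_steps
    if step <= phase1_steps + phase2_steps:
        progress = (step - phase1_steps) / phase2_steps
        return low_lr + (peak_lr - low_lr) * progress
    return peak_lr


# Hypothetical settings: 10 steps to reach 1e-5, then 10 more to reach 1e-3.
schedule = [double_linear_warmup(s, 10, 10, 1e-5, 1e-3) for s in (0, 5, 10, 15, 20, 30)]
```

Comparing alternative warmup shapes, as the paper does, amounts to swapping the two linear ramps for a different monotone function of `step`.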

Authors:Chongjie Si, Zhiyi Shi, Yadao Wang, Xiaokang Yang, Susanto Rahardja, Wei Shen
Title: MAP: Revisiting Weight Decomposition for Low-Rank Adaptation
Abstract:
The rapid development of large language models has revolutionized natural language processing, but their fine-tuning remains computationally expensive, hindering broad deployment. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, have emerged as solutions. Recent work like DoRA attempts to further decompose weight adaptation into direction and magnitude components. However, existing formulations often define direction heuristically at the column level, lacking a principled geometric foundation. In this paper, we propose MAP, a novel framework that reformulates weight matrices as high-dimensional vectors and decouples their adaptation into direction and magnitude in a rigorous manner. MAP normalizes the pre-trained weights, learns a directional update, and introduces two scalar coefficients to independently scale the magnitude of the base and update vectors. This design enables more interpretable and flexible adaptation, and can be seamlessly integrated into existing PEFT methods. Extensive experiments show that MAP significantly improves performance when coupled with existing methods, offering a simple yet powerful enhancement to existing PEFT methods. Given the universality and simplicity of MAP, we hope it can serve as a default setting for designing future PEFT methods.
中文: 提出的MAP框架通过向量归一化和独立缩放,将权重适应严格解耦为方向和幅度分量,显著提升了现有参数高效微调方法的性能与可解释性。
English: The proposed MAP framework rigorously decouples weight adaptation into direction and magnitude components through vector normalization and independent scaling, significantly enhancing existing parameter-efficient fine-tuning methods with improved performance and interpretability.
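
The decomposition described in the MAP abstract (flatten the weights into a vector, normalize, and scale base and update with two scalars) can be sketched as follows. This is a minimal reading of the abstract rather than the authors' formulation: the function names are hypothetical, and normalizing the update vector as well as the base vector is an assumption of the sketch.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flattened weight vector."""
    return math.sqrt(sum(x * x for x in vec))


def map_style_update(w0, delta, alpha, beta, eps=1e-12):
    """Combine a unit-norm base direction with a unit-norm update direction.

    w0 and delta are weight matrices flattened into plain lists. alpha and
    beta are the two scalar magnitudes for the base and the update.
    Normalizing delta too is an assumption made for this illustration.
    """
    n0 = l2_norm(w0) + eps
    nd = l2_norm(delta) + eps
    return [alpha * a / n0 + beta * d / nd for a, d in zip(w0, delta)]


# Toy example: base [3, 4] has norm 5, update [0, 2] has norm 2.
adapted = map_style_update([3.0, 4.0], [0.0, 2.0], alpha=5.0, beta=1.0)
```

With `alpha` equal to the original norm and `beta` set to zero, the base weights are recovered, which is the sense in which the two magnitudes are controlled independently.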

Authors:Susan Liang, Dejan Markovic, Israel D. Gebru, Steven Krenn, Todd Keebler, Jacob Sandakly, Frank Yu, Samuel Hassel, Chenliang Xu, Alexander Richard
Title: BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models
Abstract:
Binaural rendering aims to synthesize binaural audio that mimics natural hearing based on a mono audio and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow matching based streaming binaural speech synthesis framework called BinauralFlow. We consider binaural rendering to be a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, we design a causal U-Net architecture that estimates the current audio frame solely based on past information to tailor generative models for streaming inference. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over SOTA approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a $42\%$ confusion rate.
中文: 提出的BinauralFlow框架采用流匹配和因果U-Net架构,实现了近乎媲美真实录音的高质量流式双耳音频合成,有效解决了渲染质量和实时推理的关键难题。
English: The proposed BinauralFlow framework uses flow matching and a causal U-Net architecture to achieve high-quality, streaming binaural audio synthesis that is nearly indistinguishable from real recordings, addressing key challenges in rendering quality and real-time inference.

Authors:Sara Papi, Marco Gaido, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
Title: FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian
Abstract:
The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature--with inaccessible training data and code--poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.
中文:FAMA作为首个开放科学语音基础模型系列,基于超过15万小时开源数据和1.6万小时清洗数据集开发,在保持竞争优势的同时实现8倍加速,并通过开源许可发布全部资源以推动语音技术的透明发展。
English: The FAMA family of open science speech foundation models, trained on over 150,000 hours of open-source data and a new 16,000-hour cleaned dataset, achieves competitive performance with up to 8 times faster speed while releasing all artifacts under open-source licenses to advance transparency in speech technology.

Authors:Marcus J. Vroemen, Yuqian Chen, Yui Lo, Tengfei Xue, Weidong Cai, Fan Zhang, Josien P. W. Pluim, Lauren J. O'Donnell
Title: DeepMultiConnectome: Deep Multi-Task Prediction of Structural Connectomes Directly from Diffusion MRI Tractography
Abstract:
Diffusion MRI (dMRI) tractography enables in vivo mapping of brain structural connections, but traditional connectome generation is time-consuming and requires gray matter parcellation, posing challenges for large-scale studies. We introduce DeepMultiConnectome, a deep-learning model that predicts structural connectomes directly from tractography, bypassing the need for gray matter parcellation while supporting multiple parcellation schemes. Using a point-cloud-based neural network with multi-task learning, the model classifies streamlines according to their connected regions across two parcellation schemes, sharing a learned representation. We train and validate DeepMultiConnectome on tractography from the Human Connectome Project Young Adult dataset ($n = 1000$), labeled with 84-region and 164-region gray matter parcellation schemes. DeepMultiConnectome predicts multiple structural connectomes from a whole-brain tractogram containing 3 million streamlines in approximately 40 seconds. DeepMultiConnectome is evaluated by comparing predicted connectomes with traditional connectomes generated using the conventional method of labeling streamlines using a gray matter parcellation. The predicted connectomes are highly correlated with traditionally generated connectomes ($r = 0.992$ for an 84-region scheme; $r = 0.986$ for a 164-region scheme) and largely preserve network properties. A test-retest analysis of DeepMultiConnectome demonstrates reproducibility comparable to traditionally generated connectomes. The predicted connectomes perform similarly to traditionally generated connectomes in predicting age and cognitive function. Overall, DeepMultiConnectome provides a scalable, fast model for generating subject-specific connectomes across multiple parcellation schemes.
中文: DeepMultiConnectome是一种深度学习模型,可直接从纤维束成像快速预测结构连接组,无需灰质分区即可支持多种分区方案,其准确性和效率与传统方法相当。
English: DeepMultiConnectome is a deep-learning model that rapidly predicts structural connectomes directly from tractography, eliminating the need for gray matter parcellation while supporting multiple parcellation schemes, with accuracy comparable to traditionally generated connectomes.
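
Once each streamline has been classified by the pair of regions it connects, assembling a connectome is a counting step. The sketch below shows only that step, assuming the per-streamline region pairs are already available (in the paper they come from the neural network); the function name and the toy labels are hypothetical.

```python
def connectome_from_labels(region_pairs, n_regions):
    """Build a symmetric streamline-count matrix from per-streamline labels.

    region_pairs holds one (region_a, region_b) tuple per streamline, with
    zero-based region indices. Entry [a][b] counts streamlines linking a and b.
    """
    matrix = [[0] * n_regions for _ in range(n_regions)]
    for a, b in region_pairs:
        matrix[a][b] += 1
        if a != b:
            matrix[b][a] += 1  # keep the matrix symmetric
    return matrix


# Three toy streamlines over three regions; the order within a pair is irrelevant.
toy_connectome = connectome_from_labels([(0, 1), (1, 0), (2, 2)], n_regions=3)
```

Running the same routine once per parcellation scheme yields one matrix per scheme from a single set of streamlines.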

Authors:Yongcan Yu, Yanbo Wang, Ran He, Jian Liang
Title: Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models
Abstract:
While (multimodal) large language models (LLMs) have attracted widespread attention due to their exceptional capabilities, they remain vulnerable to jailbreak attacks. Various defense methods have been proposed against jailbreak attacks; however, they are often tailored to specific types of jailbreak attacks, limiting their effectiveness against diverse adversarial strategies. For instance, rephrasing-based defenses are effective against text adversarial jailbreaks but fail to counteract image-based attacks. To overcome these limitations, we propose a universal defense framework, termed Test-time IMmunization (TIM), which can adaptively defend against various jailbreak attacks in a self-evolving way. Specifically, TIM initially trains a gist token for efficient detection, which it subsequently applies to detect jailbreak activities during inference. When jailbreak attempts are identified, TIM implements safety fine-tuning using the detected jailbreak instructions paired with refusal answers. Furthermore, to mitigate potential performance degradation in the detector caused by parameter updates during safety fine-tuning, we decouple the fine-tuning process from the detection module. Extensive experiments on both LLMs and multimodal LLMs demonstrate the efficacy of TIM.
中文: 尽管大型语言模型能力卓越,却易受越狱攻击,为此我们提出了TIM这一通用防御框架,它能自适应地检测并抵御多种威胁,通过自我进化的安全微调来应对攻击,同时确保检测性能不受影响。
English: Despite their advanced capabilities, large language models are susceptible to jailbreak attacks, prompting the development of TIM, a universal defense framework that adaptively detects and mitigates diverse threats through self-evolving safety fine-tuning while maintaining detection performance.

Authors:Esra Adiyeke, Tianqi Liu, Venkata Sai Dheeraj Naganaboina, Han Li, Tyler J. Loftus, Yuanfang Ren, Benjamin Shickel, Matthew M. Ruppert, Karandeep Singh, Ruogu Fang, Parisa Rashidi, Azra Bihorac, Tezcan Ozrazgat-Baslanti
Title: Learning optimal treatment strategies for intraoperative hypotension using deep reinforcement learning
Abstract:
Traditional methods of surgical decision making heavily rely on human experience and prompt actions, which are variable. A data-driven system generating treatment recommendations based on patient states can be a substantial asset in perioperative decision-making, as in cases of intraoperative hypotension, for which suboptimal management is associated with acute kidney injury (AKI), a common and morbid postoperative complication. We developed a Reinforcement Learning (RL) model to recommend the optimal dose of intravenous (IV) fluid and vasopressors during surgery to avoid intraoperative hypotension and postoperative AKI. We retrospectively analyzed 50,021 surgeries from 42,547 adult patients who underwent major surgery at a quaternary care hospital between June 2014 and September 2020. Of these, 34,186 surgeries were used for model training and 15,835 surgeries were reserved for testing. We developed a Deep Q-Networks based RL model using 16 variables including intraoperative physiologic time series, total dose of IV fluid and vasopressors extracted for every 15-minute epoch. The model replicated 69% of physicians' decisions for the dosage of vasopressors and proposed higher or lower dosage of vasopressors than received in 10% and 21% of the treatments, respectively. In terms of IV fluids, the model's recommendations were within 0.05 ml/kg/15 min of the actual dose in 41% of the cases, with higher or lower doses recommended for 27% and 32% of the treatments, respectively. The model resulted in a higher estimated policy value compared to the physicians' actual treatments, as well as random and zero-drug policies. AKI prevalence was the lowest in patients receiving medication dosages that aligned with the model's decisions. Our findings suggest that implementation of the model's policy has the potential to reduce postoperative AKI and improve other outcomes driven by intraoperative hypotension.
中文摘要:研究开发了一种强化学习模型,用于在手术期间推荐最佳静脉输液和血管加压药物剂量,相比传统临床决策,该模型能更有效管理术中低血压,有望降低术后急性肾损伤的发生率。
English Summary: A Reinforcement Learning model was developed to recommend optimal intravenous fluid and vasopressor doses during surgery, showing potential to reduce postoperative acute kidney injury by improving intraoperative hypotension management compared to traditional physician decisions.

Authors:Chongjie Si, Yidan Cui, Fuchao Yang, Xiaokang Yang, Wei Shen
Title: Revisiting Sparsity Constraint Under High-Rank Property in Partial Multi-Label Learning
Abstract:
Partial Multi-Label Learning (PML) extends the multi-label learning paradigm to scenarios where each sample is associated with a candidate label set containing both ground-truth labels and noisy labels. Existing PML methods commonly rely on two assumptions: sparsity of the noise label matrix and low-rankness of the ground-truth label matrix. However, these assumptions are inherently conflicting and impractical for real-world scenarios, where the true label matrix is typically full-rank or close to full-rank. To address these limitations, we demonstrate that the sparsity constraint contributes to the high-rank property of the predicted label matrix. Based on this, we propose a novel method Schirn, which introduces a sparsity constraint on the noise label matrix while enforcing a high-rank property on the predicted label matrix. Extensive experiments demonstrate the superior performance of Schirn compared to state-of-the-art methods, validating its effectiveness in tackling real-world PML challenges.
中文:提出的Schirn方法通过约束噪声标签的稀疏性同时强制预测标签矩阵保持高秩特性,解决了部分多标签学习中的关键局限,在实际应用中展现出优越性能。
English: The proposed Schirn method addresses limitations in Partial Multi-Label Learning by applying sparsity constraints to noise labels while enforcing high-rank properties on predicted labels, demonstrating superior performance in real-world scenarios.

Authors:Sirui Xia, Aili Chen, Xintao Wang, Tinghui Zhu, Yikai Zhang, Jiangjie Chen, Yanghua Xiao
Title: Can LLMs Learn to Map the World from Local Descriptions?
Abstract:
Recent advances in Large Language Models (LLMs) have demonstrated strong capabilities in tasks such as code and mathematics. However, their potential to internalize structured spatial knowledge remains underexplored. This study investigates whether LLMs, grounded in locally relative human observations, can construct coherent global spatial cognition by integrating fragmented relational descriptions. We focus on two core aspects of spatial cognition: spatial perception, where models infer consistent global layouts from local positional relationships, and spatial navigation, where models learn road connectivity from trajectory data and plan optimal paths between unconnected locations. Experiments conducted in a simulated urban environment demonstrate that LLMs not only generalize to unseen spatial relationships between points of interest (POIs) but also exhibit latent representations aligned with real-world spatial distributions. Furthermore, LLMs can learn road connectivity from trajectory descriptions, enabling accurate path planning and dynamic spatial awareness during navigation.
中文摘要:大型语言模型能够从碎片化的局部观察中构建连贯的全局空间认知,在完成空间感知与导航任务的同时,还能泛化至未见过的空间关系并展现出与现实世界一致的空间表征能力。
English Summary: Large language models demonstrate the ability to construct coherent global spatial cognition from fragmented local observations, effectively performing spatial perception and navigation tasks while generalizing to unseen relationships and exhibiting real-world aligned representations.

Authors:Zhengliang Shi, Lingyong Yan, Dawei Yin, Suzan Verberne, Maarten de Rijke, Zhaochun Ren
Title: Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers
Abstract:
Large language models (LLMs) have been widely integrated into information retrieval to advance traditional techniques. However, effectively enabling LLMs to seek accurate knowledge in complex tasks remains a challenge due to the complexity of multi-hop queries as well as the irrelevant retrieved content. To address these limitations, we propose EXSEARCH, an agentic search framework, where the LLM learns to retrieve useful information as the reasoning unfolds through a self-incentivized process. At each step, the LLM decides what to retrieve (thinking), triggers an external retriever (search), and extracts fine-grained evidence (recording) to support next-step reasoning. To equip the LLM with this capability, EXSEARCH adopts a Generalized Expectation-Maximization algorithm. In the E-step, the LLM generates multiple search trajectories and assigns an importance weight to each; the M-step trains the LLM on them with a re-weighted loss function. This creates a self-incentivized loop, where the LLM iteratively learns from its own generated data, progressively improving itself for search. We further theoretically analyze this training process, establishing convergence guarantees. Extensive experiments on four knowledge-intensive benchmarks show that EXSEARCH substantially outperforms baselines, e.g., +7.8% improvement on exact match score. Motivated by these promising results, we introduce EXSEARCH-Zoo, an extension of our method to broader scenarios, to facilitate future work.
中文: EXSEARCH是一种代理搜索框架,通过自激励学习过程使大语言模型能够迭代检索和优化信息,在复杂知识任务中显著提升性能表现。
English: EXSEARCH is an agentic search framework that enables large language models to iteratively retrieve and refine information through a self-incentivized learning process, significantly improving performance on complex knowledge tasks.
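
The M-step described in the EXSEARCH abstract trains on sampled trajectories with a re-weighted loss. A minimal sketch of such a loss is shown below, assuming each trajectory already has a log-likelihood under the model and an importance weight from the E-step. How the paper computes those weights is not stated in the abstract, so they are plain inputs here, and the function name is hypothetical.

```python
def reweighted_nll(log_probs, weights):
    """Importance-weighted negative log-likelihood over sampled trajectories.

    log_probs[i] is the model's log-likelihood of trajectory i and weights[i]
    its non-negative importance weight. Weights are normalized to sum to 1,
    so higher-weighted trajectories dominate the training signal.
    """
    if len(log_probs) != len(weights):
        raise ValueError("log_probs and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return -sum(w / total * lp for w, lp in zip(weights, log_probs))


# Two toy trajectories: the first is three times as important as the second.
loss = reweighted_nll(log_probs=[-1.0, -3.0], weights=[3.0, 1.0])
```

Alternating between sampling and weighting trajectories and minimizing this loss gives the iterative loop the abstract calls self-incentivized.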

Authors:Hao Wu, Yuan Gao, Ruiqi Shu, Kun Wang, Ruijian Gou, Chuhan Wu, Xinliang Liu, Juncai He, Shuhao Cao, Junfeng Fang, Xingjian Shi, Feng Tao, Qi Song, Shengxuan Ji, Yanfei Xiang, Yuze Sun, Jiahao Li, Fan Xu, Huanshuo Dong, Haixin Wang, Fan Zhang, Penghao Zhao, Xian Wu, Qingsong Wen, Deliang Chen, Xiaomeng Huang
Title: Advanced long-term earth system forecasting by learning the small-scale nature
Abstract:
Reliable long-term forecast of Earth system dynamics is heavily hampered by instabilities in current AI models during extended autoregressive simulations. These failures often originate from inherent spectral bias, leading to inadequate representation of critical high-frequency, small-scale processes and subsequent uncontrolled error amplification. We present Triton, an AI framework designed to address this fundamental challenge. Inspired by increasing grids to explicitly resolve small scales in numerical models, Triton employs a hierarchical architecture processing information across multiple resolutions to mitigate spectral bias and explicitly model cross-scale dynamics. We demonstrate Triton's superior performance on challenging forecast tasks, achieving stable year-long global temperature forecasts, skillful Kuroshio eddy predictions up to 120 days, and high-fidelity turbulence simulations preserving fine-scale structures, all without external forcing, significantly surpassing baseline AI models in long-term stability and accuracy. By effectively suppressing high-frequency error accumulation, Triton offers a promising pathway towards trustworthy AI-driven simulation for climate and earth system science.
中文: Triton人工智能框架通过分层多分辨率架构解决频谱偏差问题,实现了对全球气温、黑潮涡旋和湍流的高精度长期稳定预测,显著提升了地球系统模拟的可信度。
English: Triton is an AI framework that overcomes spectral bias in autoregressive forecasting by using a hierarchical multi-resolution architecture, enabling stable long-term predictions of climate phenomena and turbulence with superior accuracy.

Authors:Jiajun Zhu, Ye Liu, Meikai Bao, Kai Zhang, Yanghai Zhang, Qi Liu
Title: Self-Reflective Planning with Knowledge Graphs: Enhancing LLM Reasoning Reliability for Question Answering
Abstract:
Recently, large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, yet they remain prone to hallucinations when reasoning with insufficient internal knowledge. While integrating LLMs with knowledge graphs (KGs) provides access to structured, verifiable information, existing approaches often generate incomplete or factually inconsistent reasoning paths. To this end, we propose Self-Reflective Planning (SRP), a framework that synergizes LLMs with KGs through iterative, reference-guided reasoning. Specifically, given a question and topic entities, SRP first searches for references to guide planning and reflection. In the planning process, it checks initial relations and generates a reasoning path. After retrieving knowledge from KGs through a reasoning path, it implements iterative reflection by judging the retrieval result and editing the reasoning path until the answer is correctly retrieved. Extensive experiments on three public datasets demonstrate that SRP surpasses various strong baselines and further underscore its reliable reasoning ability.
中文:提出的自反思规划框架通过迭代式、参考引导的推理将大语言模型与知识图谱协同,有效减少幻觉并增强事实一致性,在多个数据集上超越现有强基线模型。
English: The proposed Self-Reflective Planning framework synergizes large language models with knowledge graphs through iterative, reference-guided reasoning to mitigate hallucinations and enhance factual consistency, outperforming strong baselines across multiple datasets.

Authors:Xin Ma, Yaohui Wang, Xinyuan Chen, Tien-Tsin Wong, Cunjian Chen
Title: Training-free Stylized Text-to-Image Generation with Fast Inference
Abstract:
Although diffusion models exhibit impressive generative capabilities, existing methods for stylized image generation based on these models often require textual inversion or fine-tuning with style images, which is time-consuming and limits the practical applicability of large-scale diffusion models. To address these challenges, we propose a novel stylized image generation method, termed OmniPainter, that leverages a pre-trained large-scale diffusion model without requiring fine-tuning or any additional optimization. Specifically, we exploit the self-consistency property of latent consistency models to extract the representative style statistics from reference style images to guide the stylization process. We then introduce the norm mixture of self-attention, which enables the model to query the most relevant style patterns from these statistics for the intermediate output content features. This mechanism also ensures that the stylized results align closely with the distribution of the reference style images. Our qualitative and quantitative experimental results demonstrate that the proposed method outperforms state-of-the-art approaches.
中文摘要:OmniPainter方法无需微调预训练扩散模型,通过提取参考图像风格特征并采用新型自注意力机制,实现了高效且高质量的图像风格化生成。
English Summary: The proposed OmniPainter method enables efficient stylized image generation using pre-trained diffusion models without fine-tuning, by extracting style statistics from reference images and employing a novel self-attention mechanism to produce high-quality stylized outputs.

Authors:Hao Wu, Yuan Gao, Ruiqi Shu, Zean Han, Fan Xu, Zhihong Zhu, Qingsong Wen, Xian Wu, Kun Wang, Xiaomeng Huang
Title: Turb-L1: Achieving Long-term Turbulence Tracing By Tackling Spectral Bias
Abstract:
Accurately predicting the long-term evolution of turbulence is crucial for advancing scientific understanding and optimizing engineering applications. However, existing deep learning methods face significant bottlenecks in long-term autoregressive prediction, which exhibit excessive smoothing and fail to accurately track complex fluid dynamics. Our extensive experimental and spectral analysis of prevailing methods provides an interpretable explanation for this shortcoming, identifying Spectral Bias as the core obstacle. Concretely, spectral bias is the inherent tendency of models to favor low-frequency, smooth features while overlooking critical high-frequency details during training, thus reducing fidelity and causing physical distortions in long-term predictions. Building on this insight, we propose Turb-L1, an innovative turbulence prediction method, which utilizes a Hierarchical Dynamics Synthesis mechanism within a multi-grid architecture to explicitly overcome spectral bias. It accurately captures cross-scale interactions and preserves the fidelity of high-frequency dynamics, enabling reliable long-term tracking of turbulence evolution. Extensive experiments on the 2D turbulence benchmark show that Turb-L1 demonstrates excellent performance: (I) In long-term predictions, it reduces Mean Squared Error (MSE) by $80.3\%$ and increases Structural Similarity (SSIM) by over $9\times$ compared to the SOTA baseline, significantly improving prediction fidelity. (II) It effectively overcomes spectral bias, accurately reproducing the full enstrophy spectrum and maintaining physical realism in high-wavenumber regions, thus avoiding the spectral distortions or spurious energy accumulation seen in other methods.
中文摘要:本研究揭示了谱偏差是深度学习模型在长期湍流预测中的核心障碍,并提出Turb-L1多网格方法,通过显式克服谱偏差实现了对湍流演化更精确的长期跟踪和物理保真度。
English Summary: The study identifies Spectral Bias as the key limitation in deep learning models for long-term turbulence prediction and introduces Turb-L1, a multi-grid method that overcomes this bias to achieve superior accuracy and physical fidelity in tracking turbulent evolution.

Authors:Fanqi Yan, Huy Nguyen, Dung Le, Pedram Akbarian, Nhat Ho, Alessandro Rinaldo
Title: On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts
Abstract:
The softmax-contaminated mixture of experts (MoE) model is deployed when a large-scale pre-trained model, which plays the role of a fixed expert, is fine-tuned for learning downstream tasks by including a new contamination part, or prompt, functioning as a new, trainable expert. Despite its popularity and relevance, the theoretical properties of the softmax-contaminated MoE have remained unexplored in the literature. In this paper, we study the convergence rates of the maximum likelihood estimator of gating and prompt parameters in order to gain insights into the statistical properties and potential challenges of fine-tuning with a new prompt. We find that the estimability of these parameters is compromised when the prompt acquires overlapping knowledge with the pre-trained model, in a sense that we make precise by formulating a novel analytic notion of distinguishability. Under distinguishability of the pre-trained and prompt models, we derive minimax optimal estimation rates for all the gating and prompt parameters. By contrast, when the distinguishability condition is violated, these estimation rates become significantly slower due to their dependence on the prompt convergence rate to the pre-trained model. Finally, we empirically corroborate our theoretical findings through several numerical experiments.
中文: 本研究探讨了softmax污染混合专家模型中门控和提示参数的最大似然估计收敛速率,发现参数可估计性关键取决于预训练模型与提示模型之间的可区分性,在可区分条件下可获得最优收敛速率,而条件不满足时收敛显著变慢。
English: This study investigates the convergence rates of maximum likelihood estimators for gating and prompt parameters in softmax-contaminated mixture of experts models, revealing that parameter estimability depends critically on the distinguishability between pre-trained and prompt models, with optimal rates achieved under distinguishability and slower rates when this condition fails.

Authors:Yizhou Xu, Florent Krzakala, Lenka Zdeborová
Title: Learning with Restricted Boltzmann Machines: Asymptotics of AMP and GD in High Dimensions
Abstract:
The Restricted Boltzmann Machine (RBM) is one of the simplest generative neural networks capable of learning input distributions. Despite its simplicity, the analysis of its performance in learning from the training data is only well understood in cases that essentially reduce to singular value decomposition of the data. Here, we consider the limit of a large dimension of the input space and a constant number of hidden units. In this limit, we simplify the standard RBM training objective into a form that is equivalent to the multi-index model with non-separable regularization. This opens a path to analyze training of the RBM using methods that are established for multi-index models, such as Approximate Message Passing (AMP) and its state evolution, and the analysis of Gradient Descent (GD) via the dynamical mean-field theory. We then give rigorous asymptotics of the training dynamics of RBM on data generated by the spiked covariance model as a prototype of a structure suitable for unsupervised learning. We show in particular that RBM reaches the optimal computational weak recovery threshold, aligning with the BBP transition, in the spiked covariance model.
中文: 本研究在高维条件下简化了受限玻尔兹曼机的训练目标,使其可通过多索引模型方法进行分析,并证明其在尖峰协方差模型中达到与BBP转变一致的最优性能。
English: The study simplifies the Restricted Boltzmann Machine's training objective in high-dimensional settings, enabling analysis through multi-index model methods and demonstrating its optimal performance matching the BBP transition in the spiked covariance model.

Authors:Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, Florent Krzakala
Title: The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks
Abstract:
We study the high-dimensional asymptotics of empirical risk minimization (ERM) in over-parametrized two-layer neural networks with quadratic activations trained on synthetic data. We derive sharp asymptotics for both training and test errors by mapping the $\ell_2$-regularized learning problem to a convex matrix sensing task with nuclear norm penalization. This reveals that capacity control in such networks emerges from a low-rank structure in the learned feature maps. Our results characterize the global minima of the loss and yield precise generalization thresholds, showing how the width of the target function governs learnability. This analysis bridges and extends ideas from spin-glass methods, matrix factorization, and convex optimization and emphasizes the deep link between low-rank matrix sensing and learning in quadratic neural networks.
中文: 本研究通过将过参数化二次激活神经网络映射为凸矩阵感知问题,揭示了网络能力控制源于特征映射的低秩结构,并建立了精确的泛化阈值,阐明了目标函数宽度对可学习性的决定性作用。
English: This research analyzes the high-dimensional behavior of over-parametrized two-layer neural networks with quadratic activations, revealing that their capacity control stems from low-rank structures in feature maps and establishing precise generalization thresholds through connections to convex matrix sensing.

Authors:Xueyang Zhou, Weidong Wang, Lin Lu, Jiawen Shi, Guiyao Tie, Yongtian Xu, Lixing Chen, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
Title: SafeAgent: Safeguarding LLM Agents via an Automated Risk Simulator
Abstract:
Large Language Model (LLM)-based agents are increasingly deployed in real-world applications such as digital assistants, autonomous customer service, and decision-support systems, where their ability to interact in multi-turn, tool-augmented environments makes them indispensable. However, ensuring the safety of these agents remains a significant challenge due to the diverse and complex risks arising from dynamic user interactions, external tool usage, and the potential for unintended harmful behaviors. To address this critical issue, we propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. Concretely, 1) we introduce an open and extensible threat model, OTS, which formalizes how unsafe behaviors emerge from the interplay of user instructions, interaction contexts, and agent actions. This enables precise modeling of safety risks across diverse scenarios. 2) We develop a fully automated data generation pipeline that simulates unsafe user behaviors, applies self-reflective reasoning to generate safe responses, and constructs a large-scale, diverse, and high-quality safety training dataset, eliminating the need for hazardous real-world data collection. To evaluate the effectiveness of our framework, we design comprehensive experiments on both synthetic and real-world safety benchmarks. Results demonstrate that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks, validating the generalization ability of our learned safety strategies. These results highlight the practical advancement and scalability of AutoSafe in building safer LLM-based agents for real-world deployment. We have released the project page at https://auto-safe.github.io/.
中文: AutoSafe作为首个自动化合成数据生成框架,系统性地提升了基于大语言模型的智能体安全性,实验显示其安全评分平均提高45%,并在真实任务中表现出卓越的泛化能力。
English: AutoSafe is a pioneering framework that automatically generates synthetic data to systematically enhance the safety of LLM-based agents, significantly improving safety scores by 45% on average and demonstrating strong generalization in real-world applications.

Authors:Jingtong Gao, Ling Pan, Yejing Wang, Rui Zhong, Chi Lu, Qingpeng Cai, Peng Jiang, Xiangyu Zhao
Title: Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration
Abstract:
Reinforcement learning (RL) has emerged as a pivotal method for improving the reasoning capabilities of Large Language Models (LLMs). However, prevalent RL approaches such as Proximal Policy Optimization (PPO) and Group-Regularized Policy Optimization (GRPO) face critical limitations due to their reliance on sparse outcome-based rewards and inadequate mechanisms for incentivizing exploration. These limitations result in inefficient guidance for multi-step reasoning processes. Specifically, sparse reward signals fail to deliver effective or sufficient feedback, particularly for challenging problems. Furthermore, such reward structures induce systematic biases that prioritize exploitation of familiar trajectories over novel solution discovery. These shortcomings critically hinder performance in complex reasoning tasks, which inherently demand iterative refinement across intermediate steps. To address these challenges, we propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR), a novel method designed to both deliver dense rewards and amplify exploration in the RL-based training paradigm. i-MENTOR introduces three key innovations: trajectory-aware exploration rewards that mitigate bias in token-level strategies while maintaining computational efficiency; dynamic reward scaling to stabilize exploration and exploitation in large action spaces; and advantage-preserving reward implementation that maintains advantage distribution integrity while incorporating exploratory guidance. Experiments across three public datasets demonstrate i-MENTOR's effectiveness with a 22.39% improvement on the difficult dataset Countdown-4.
中文摘要:强化学习方法在提升大语言模型推理能力时,常因稀疏奖励和探索机制不足而效率低下;提出的i-MENTOR方法通过密集奖励和增强探索机制有效解决这些问题,实现了显著的性能提升。
English Summary: Reinforcement Learning methods for Large Language Models often struggle with sparse rewards and limited exploration, leading to inefficiencies in reasoning tasks; the proposed i-MENTOR method addresses these issues by providing dense rewards and enhanced exploration mechanisms, achieving significant performance improvements.

Authors:Jingtong Gao, Ling Pan, Yejing Wang, Rui Zhong, Chi Lu, Qingpeng Cai, Peng Jiang, Xiangyu Zhao
Title: Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration
Abstract:
Reinforcement Learning (RL) has emerged as a pivotal method for improving the reasoning capabilities of Large Language Models (LLMs). However, prevalent RL approaches such as Proximal Policy Optimization (PPO) and Group-Regularized Policy Optimization (GRPO) face critical limitations due to their reliance on sparse outcome-based rewards and inadequate mechanisms for incentivizing exploration. These limitations result in inefficient guidance for reasoning. Specifically, sparse rewards fail to deliver sufficient feedback, particularly for challenging problems. Furthermore, such rewards induce systematic biases that prioritize exploitation of familiar trajectories over novel solution discovery. These shortcomings critically hinder performance in complex reasoning tasks, which inherently demand iterative refinement across intermediate steps. To address these challenges, we propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR), a method designed to deliver dense rewards and amplify exploration in the RL-based paradigm. i-MENTOR introduces three innovations: trajectory-aware exploration rewards that mitigate bias in token-level strategies while maintaining computational efficiency; error-conditioned reward allocation to ensure efficient exploration on challenging samples while intrinsically stabilizing training; and advantage-preserving integration that maintains advantage distribution integrity while incorporating exploratory guidance. Experiments across 4 public datasets demonstrate i-MENTOR's effectiveness, achieving a 22.23% improvement on AIME 2024.
中文摘要:强化学习方法在提升大语言模型推理能力时,常因稀疏奖励和探索机制不足而效率低下;提出的i-MENTOR方法通过密集奖励和增强探索机制有效解决这些问题,实现了显著的性能提升。
English Summary: Reinforcement Learning methods for Large Language Models often struggle with sparse rewards and limited exploration, leading to inefficiencies in reasoning tasks; the proposed i-MENTOR method addresses these issues by providing dense rewards and enhanced exploration mechanisms, achieving significant performance improvements.

Authors:Xiangqi Wang, Yue Huang, Yanbo Wang, Xiaonan Luo, Kehan Guo, Yujun Zhou, Xiangliang Zhang
Title: AdaReasoner: Adaptive Reasoning Enables More Flexible Thinking in Large Language Models
Abstract:
LLMs often need effective configurations, like temperature and reasoning steps, to handle tasks requiring sophisticated reasoning and problem-solving, ranging from joke generation to mathematical reasoning. Existing prompting approaches usually adopt general-purpose, fixed configurations that work 'well enough' across tasks but seldom achieve task-specific optimality. To address this gap, we introduce AdaReasoner, an LLM-agnostic plugin designed for any LLM to automate adaptive reasoning configurations for tasks requiring different types of thinking. AdaReasoner is trained using a reinforcement learning (RL) framework, combining a factorized action space with a targeted exploration strategy, along with a pretrained reward model to optimize the policy model for reasoning configurations with only a few-shot guide. AdaReasoner is backed by theoretical guarantees and experiments demonstrating fast convergence and a sublinear policy gap. Across six different LLMs and a variety of reasoning tasks, it consistently outperforms standard baselines, preserves out-of-distribution robustness, and yields gains on knowledge-intensive tasks through tailored prompts.
中文摘要:AdaReasoner是一款基于强化学习的自适应推理配置插件,能自动优化大语言模型在不同任务中的推理参数,在提升各类任务性能的同时保持出色的泛化能力。
English Summary: AdaReasoner is an adaptive plugin that uses reinforcement learning to automatically optimize reasoning configurations for LLMs, enhancing performance across diverse tasks while maintaining robustness.

Authors:Haoyang Zhang, Hexin Liu, Xiangyu Zhang, Qiquan Zhang, Yuchen Hu, Junqi Zhao, Fei Tian, Xuerui Yang, Leibny Paola Garcia, Eng Siong Chng
Title: Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English
Abstract:
The speech tokenizer plays a crucial role in recent speech tasks, generally serving as a bridge between speech signals and language models. While low-frame-rate codecs are widely employed as speech tokenizers, the impact of frame rates on speech tokens remains underexplored. In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. We encode speech at different frame rates and evaluate the resulting semantic tokens in the speech recognition task. Our findings reveal that frame rate variations influence speech tokenization differently for each language, highlighting the interplay between frame rates, phonetic density, and language-specific acoustic features. The results provide insights into optimizing frame rate selection for speech tokenizers, with implications for automatic speech recognition, text-to-speech, and other speech-related applications.
中文: 本研究探讨不同帧率对汉语和英语语音标记化的影响,发现由于语音密度和声学特征的差异,帧率变化对两种语言产生不同作用,为语音识别与合成等应用的帧率优化提供了指导。
English: The study examines how varying frame rates affect speech tokenization in Mandarin and English, revealing language-specific impacts due to differences in phonetic density and acoustic features, which informs optimal frame rate selection for speech applications like recognition and synthesis.

Authors:Xueyang Zhou, Guiyao Tie, Guowen Zhang, Hechang Wang, Pan Zhou, Lichao Sun
Title: BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization
Abstract:
Vision-Language-Action (VLA) models have advanced robotic control by enabling end-to-end decision-making directly from multimodal inputs. However, their tightly coupled architectures expose novel security vulnerabilities. Unlike traditional adversarial perturbations, backdoor attacks represent a stealthier, persistent, and practically significant threat, particularly under the emerging Training-as-a-Service paradigm, but remain largely unexplored in the context of VLA models. To address this gap, we propose BadVLA, a backdoor attack method based on Objective-Decoupled Optimization, which for the first time exposes the backdoor vulnerabilities of VLA models. Specifically, it consists of a two-stage process: (1) explicit feature-space separation to isolate trigger representations from benign inputs, and (2) conditional control deviations that activate only in the presence of the trigger, while preserving clean-task performance. Empirical results on multiple VLA benchmarks demonstrate that BadVLA consistently achieves near-100% attack success rates with minimal impact on clean task accuracy. Further analyses confirm its robustness against common input perturbations, task transfers, and model fine-tuning, underscoring critical security vulnerabilities in current VLA deployments. Our work offers the first systematic investigation of backdoor vulnerabilities in VLA models, highlighting an urgent need for secure and trustworthy embodied model design practices. We have released the project page at https://badvla-project.github.io/.
中文: BadVLA通过目标解耦优化方法首次揭示了视觉-语言-动作模型的后门安全漏洞,在保持正常任务性能的同时实现了接近100%的攻击成功率。
English: BadVLA introduces a novel backdoor attack method using Objective-Decoupled Optimization to expose critical security vulnerabilities in Vision-Language-Action models, achieving near-perfect attack success while maintaining clean task performance.

Authors:Guiyao Tie, Xueyang Zhou, Tianhe Gu, Ruihang Zhang, Chaoran Hu, Sizhe Zhang, Mengqu Sun, Yan Zhang, Pan Zhou, Lichao Sun
Title: MMLU-Reason: Benchmarking Multi-Task Multi-modal Language Understanding and Reasoning
Abstract:
Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite their promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood and lack standardized evaluation benchmarks. Existing work focuses primarily on perception or final answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce MMLU-Reason, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. MMLU-Reason comprises 1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands and 2) a modular Reasoning Trace Evaluation Pipeline (RTEP) for assessing reasoning quality beyond accuracy through metrics like relevance, consistency, and structured error annotations. Empirical results show that MLLMs-T overall outperform non-thinking counterparts, but even top models like Claude-3.7-Sonnet and Gemini-2.5 Pro suffer from reasoning pathologies such as inconsistency and overthinking. This benchmark reveals persistent gaps between accuracy and reasoning quality and provides an actionable evaluation pipeline for future model development. Overall, MMLU-Reason offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.
中文: MMLU-Reason基准旨在严格评估具有显性思维的多模态推理能力,结果表明尽管带思维轨迹的多模态大模型表现更优,但仍存在推理不一致性及准确性与推理质量之间的差距。
English: The MMLU-Reason benchmark is introduced to rigorously evaluate multi-modal reasoning with explicit thinking, revealing that while MLLMs with thinking traces outperform others, they still suffer from reasoning inconsistencies and gaps between accuracy and reasoning quality.

Authors:Chongjie Si, Yidan Cui, Fuchao Yang, Xiaokang Yang, Wei Shen
Title: Why Can Accurate Models Be Learned from Inaccurate Annotations?
Abstract:
Learning from inaccurate annotations has gained significant attention due to the high cost of precise labeling. However, despite the presence of erroneous labels, models trained on noisy data often retain the ability to make accurate predictions. This intriguing phenomenon raises a fundamental yet largely unexplored question: why can models still extract correct label information from inaccurate annotations? In this paper, we conduct a comprehensive investigation into this issue. By analyzing weight matrices from both empirical and theoretical perspectives, we find that label inaccuracy primarily accumulates noise in lower singular components and subtly perturbs the principal subspace. Within a certain range, the principal subspaces of weights trained on inaccurate labels remain largely aligned with those learned from clean labels, preserving essential task-relevant information. We formally prove that the angles of principal subspaces exhibit minimal deviation under moderate label inaccuracy, explaining why models can still generalize effectively. Building on these insights, we propose LIP, a lightweight plug-in designed to help classifiers retain principal subspace information while mitigating noise induced by label inaccuracy. Extensive experiments on tasks with various inaccuracy conditions demonstrate that LIP consistently enhances the performance of existing algorithms. We hope our findings can offer valuable theoretical and practical insights for understanding model robustness under inaccurate supervision.
中文: 模型之所以能从有误差的标注中有效学习,是因为标签错误主要影响低阶奇异分量而保持主成分子空间的对齐性,基于此研发的LIP轻量插件通过维护关键子空间信息,显著提升了现有算法在噪声标注下的性能表现。
English: Models can still generalize effectively from inaccurate labels because label errors mainly affect lower singular components while preserving the principal subspace alignment, leading to the development of LIP, a lightweight plug-in that enhances classifier performance by maintaining this critical information.

Authors:Yuhang Wang, Youhe Jiang, Bin Cui, Fangcheng Fu
Title: Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately
Abstract:
Recent advances in test-time scaling suggest that Large Language Models (LLMs) can gain better capabilities by generating Chain-of-Thought reasoning (analogous to human thinking) to respond to a given request, and meanwhile exploring more reasoning branches (i.e., generating multiple responses and ensembling them) can improve the final output quality. However, when incorporating the two scaling dimensions, we find that the system efficiency is dampened significantly for two reasons. Firstly, the time cost to generate the final output increases substantially as many reasoning branches would be trapped in the over-thinking dilemma, producing excessively long responses. Secondly, generating multiple reasoning branches for each request increases memory consumption, which is unsuitable for LLM serving since we can only batch a limited number of requests to process simultaneously. To address this, we present SART, a serving framework for efficient and accurate LLM reasoning. The essential idea is to manage the thinking to be short and right, rather than long. For one thing, we devise a redundant sampling with early stopping approach based on empirical observations and theoretical analysis, which increases the likelihood of obtaining short-thinking responses when sampling reasoning branches. For another, we propose to dynamically prune low-quality branches so that only right-thinking branches are maintained, reducing the memory consumption and allowing us to batch more requests. Experimental results demonstrate that SART not only improves the accuracy of LLM reasoning but also enhances the serving efficiency, outperforming existing methods by up to 28.2 times and on average 15.7 times in terms of efficiency when achieving the same level of accuracy.
中文:近期研究提出SART框架,通过生成简短优质思维链并剪枝低质量分支,显著提升大语言模型推理效率与准确性,服务性能优化高达28.2倍。
English: Recent research introduces SART, a framework that enhances LLM reasoning efficiency by generating shorter, high-quality thought chains and pruning low-quality branches, significantly improving both accuracy and serving performance.
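The two ideas named in the abstract, redundant sampling with early stopping and pruning of low-quality branches, can be illustrated with a small offline simulation. This is a simplified sketch under stated assumptions, not SART itself: the function `serve_request`, the per-branch quality score, the threshold `min_score`, and the majority-vote aggregation are all hypothetical, and pruning is applied up front here rather than dynamically during generation.

```python
from collections import Counter

def serve_request(branches, m, min_score):
    """Simulate one request.

    branches: list of (length_in_tokens, quality_score, answer) tuples,
              one per sampled reasoning branch (more than m are sampled).
    m:        number of finished branches to wait for before stopping.
    min_score: hypothetical quality threshold used for pruning.
    """
    # Pruning: drop branches whose (hypothetical) quality score is too low.
    kept = [b for b in branches if b[1] >= min_score]
    # Early stopping: shorter branches finish first, so the first m to
    # finish are the m shortest; the remaining branches are abandoned.
    finished = sorted(kept, key=lambda b: b[0])[:m]
    # Aggregate the finished branches by majority vote (an assumption).
    votes = Counter(answer for _, _, answer in finished)
    answer = votes.most_common(1)[0][0] if votes else None
    # Latency is governed by the longest branch we actually waited for.
    latency = max((length for length, _, _ in finished), default=0)
    return answer, latency

# Made-up branches: (tokens, score, answer).
example = [(120, 0.9, "42"), (900, 0.8, "42"), (150, 0.2, "7"),
           (200, 0.7, "42"), (300, 0.6, "41")]
result = serve_request(example, m=3, min_score=0.5)
```

In this made-up example the 900-token branch never has to finish: the request completes after the 300-token branch, which is the intuition behind preferring short and right thinking over long thinking.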

Authors:Tonglong Wei, Yan Lin, Zeyu Zhou, Haomin Wen, Jilin Hu, Shengnan Guo, Youfang Lin, Gao Cong, Huaiyu Wan
Title: TransferTraj: A Vehicle Trajectory Learning Model for Region and Task Transferability
Abstract:
Vehicle GPS trajectories provide valuable movement information that supports various downstream tasks and applications. A desirable trajectory learning model should be able to transfer across regions and tasks without retraining, avoiding the need to maintain multiple specialized models and subpar performance with limited training data. However, each region has its unique spatial features and contexts, which are reflected in vehicle movement patterns and difficult to generalize. Additionally, transferring across different tasks faces technical challenges due to the varying input-output structures required for each task. Existing efforts towards transferability primarily involve learning embedding vectors for trajectories, which perform poorly in region transfer and require retraining of prediction modules for task transfer. To address these challenges, we propose TransferTraj, a vehicle GPS trajectory learning model that excels in both region and task transferability. For region transferability, we introduce RTTE as the main learnable module within TransferTraj. It integrates spatial, temporal, POI, and road network modalities of trajectories to effectively manage variations in spatial context distribution across regions. It also introduces a TRIE module for incorporating relative information of spatial features and a spatial context MoE module for handling movement patterns in diverse contexts. For task transferability, we propose a task-transferable input-output scheme that unifies the input-output structure of different tasks into the masking and recovery of modalities and trajectory points. This approach allows TransferTraj to be pre-trained once and transferred to different tasks without retraining. Extensive experiments on three real-world vehicle trajectory datasets under task transfer, zero-shot, and few-shot region transfer validate TransferTraj's effectiveness.
中文: TransferTraj是一种车辆GPS轨迹学习模型,通过整合多种轨迹模态和统一输入输出方案,实现了卓越的跨区域和跨任务迁移能力,无需重新训练即可适应不同场景。
English: TransferTraj is a vehicle GPS trajectory learning model designed for superior cross-region and cross-task transferability, integrating multiple trajectory modalities and a unified input-output scheme to eliminate the need for retraining.

Authors:Xukai Liu, Ye Liu, Shiwen Wu, Yanghai Zhang, Yihao Yuan, Kai Zhang, Qi Liu
Title: Know3-RAG: A Knowledge-aware RAG Framework with Adaptive Retrieval, Generation, and Filtering
Abstract:
Recent advances in large language models (LLMs) have led to impressive progress in natural language generation, yet their tendency to produce hallucinated or unsubstantiated content remains a critical concern. To improve factual reliability, Retrieval-Augmented Generation (RAG) integrates external knowledge during inference. However, existing RAG systems face two major limitations: (1) unreliable adaptive control due to limited external knowledge supervision, and (2) hallucinations caused by inaccurate or irrelevant references. To address these issues, we propose Know3-RAG, a knowledge-aware RAG framework that leverages structured knowledge from knowledge graphs (KGs) to guide three core stages of the RAG process, including retrieval, generation, and filtering. Specifically, we introduce a knowledge-aware adaptive retrieval module that employs KG embedding to assess the confidence of the generated answer and determine retrieval necessity, a knowledge-enhanced reference generation strategy that enriches queries with KG-derived entities to improve generated reference relevance, and a knowledge-driven reference filtering mechanism that ensures semantic alignment and factual accuracy of references. Experiments on multiple open-domain QA benchmarks demonstrate that Know3-RAG consistently outperforms strong baselines, significantly reducing hallucinations and enhancing answer reliability.
Chinese: Know3-RAG框架通过利用知识图谱中的结构化知识来指导检索、生成和过滤过程,显著减少了开放域问答任务中的幻觉问题,并提升了答案的可靠性。
English: The Know3-RAG framework enhances Retrieval-Augmented Generation by integrating structured knowledge from knowledge graphs to guide retrieval, generation, and filtering processes, effectively reducing hallucinations and improving answer reliability in open-domain QA tasks.

Authors:Md Mehrab Tanjim, Yeonjun In, Xiang Chen, Victor S. Bursztyn, Ryan A. Rossi, Sungchul Kim, Guang-Jie Ren, Vaishnavi Muppala, Shun Jiang, Yongsung Kim, Chanyoung Park
Title: Disambiguation in Conversational Question Answering in the Era of LLMs and Agents: A Survey
Abstract:
Ambiguity remains a fundamental challenge in Natural Language Processing (NLP) due to the inherent complexity and flexibility of human language. With the advent of Large Language Models (LLMs), addressing ambiguity has become even more critical due to their expanded capabilities and applications. In the context of Conversational Question Answering (CQA), this paper explores the definition, forms, and implications of ambiguity for language-driven systems, particularly in the context of LLMs. We define key terms and concepts, categorize various disambiguation approaches enabled by LLMs, and provide a comparative analysis of their advantages and disadvantages. We also explore publicly available datasets for benchmarking ambiguity detection and resolution techniques and highlight their relevance for ongoing research. Finally, we identify open problems and future research directions, especially in agentic settings, proposing areas for further investigation. By offering a comprehensive review of current research on ambiguities and disambiguation with LLMs, we aim to contribute to the development of more robust and reliable LLM-based systems.
中文: 本文探讨了自然语言处理中歧义问题的挑战,特别是在大型语言模型用于对话问答时,通过定义核心概念、分类消歧方法并指出未来研究方向,旨在提升基于大模型的系统鲁棒性。
English: This paper examines the challenges of ambiguity in Natural Language Processing, particularly for Large Language Models in Conversational Question Answering, by defining key concepts, categorizing disambiguation methods, and identifying future research directions to enhance system reliability.

Authors:Liwen Wang, Wenxuan Wang, Shuai Wang, Zongjie Li, Zhenlan Ji, Zongyi Lyu, Daoyuan Wu, Shing-Chi Cheung
Title: IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems
Abstract:
The rapid advancement of Large Language Models (LLMs) has led to the emergence of Multi-Agent Systems (MAS) to perform complex tasks through collaboration. However, the intricate nature of MAS, including their architecture and agent interactions, raises significant concerns regarding intellectual property (IP) protection. In this paper, we introduce MASLEAK, a novel attack framework designed to extract sensitive information from MAS applications. MASLEAK targets a practical, black-box setting, where the adversary has no prior knowledge of the MAS architecture or agent configurations. The adversary can only interact with the MAS through its public API, submitting attack query $q$ and observing outputs from the final agent. Inspired by how computer worms propagate and infect vulnerable network hosts, MASLEAK carefully crafts adversarial query $q$ to elicit, propagate, and retain responses from each MAS agent that reveal a full set of proprietary components, including the number of agents, system topology, system prompts, task instructions, and tool usages. We construct the first synthetic dataset of MAS applications with 810 applications and also evaluate MASLEAK against real-world MAS applications, including Coze and CrewAI. MASLEAK achieves high accuracy in extracting MAS IP, with an average attack success rate of 87% for system prompts and task instructions, and 92% for system architecture in most cases. We conclude by discussing the implications of our findings and the potential defenses.
中文: 本文提出MASLEAK攻击框架,通过精心设计的对抗性查询从多智能体系统中提取敏感知识产权,能够以最高92%的成功率揭露系统架构、提示词和配置信息。
English: This paper introduces MASLEAK, a black-box attack framework that extracts sensitive intellectual property from Multi-Agent Systems by crafting adversarial queries to reveal system architecture, prompts, and configurations, achieving up to 92% success rates in experiments.

Authors:Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Title: ExpertSteer: Intervening in LLMs through Expert Knowledge
Abstract:
Large Language Models (LLMs) exhibit remarkable capabilities across various tasks, yet guiding them to follow desired behaviours during inference remains a significant challenge. Activation steering offers a promising method to control the generation process of LLMs by modifying their internal activations. However, existing methods commonly intervene in the model's behaviour using steering vectors generated by the model itself, which constrains their effectiveness to that specific model and excludes the possibility of leveraging powerful external expert models for steering. To address these limitations, we propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors, enabling intervention in any LLM. ExpertSteer transfers the knowledge from an expert model to a target LLM through a cohesive four-step process: first aligning representation dimensions with auto-encoders to enable cross-model transfer, then identifying intervention layer pairs based on mutual information analysis, next generating steering vectors from the expert model using Recursive Feature Machines, and finally applying these vectors on the identified layers during inference to selectively guide the target LLM without updating model parameters. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains. Experiments demonstrate that ExpertSteer significantly outperforms established baselines across diverse tasks at minimal cost.
中文: ExpertSteer是一种创新方法,通过利用外部专家模型生成的引导向量来控制任何大型语言模型,无需更新模型参数即可显著提升多项任务的性能表现。
English: ExpertSteer is a novel method that enables control over any large language model by using steering vectors generated from external expert models, significantly improving performance across various tasks without updating model parameters.

Authors:Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Title: HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models
Abstract:
Fine-tuning large language models (LLMs) on a mixture of diverse datasets poses challenges due to data imbalance and heterogeneity. Existing methods often address these issues across datasets (globally) but overlook the imbalance and heterogeneity within individual datasets (locally), which limits their effectiveness. We introduce Hierarchical Balancing Optimization (HBO), a novel method that enables LLMs to autonomously adjust data allocation during fine-tuning both across datasets (globally) and within each individual dataset (locally). HBO employs a bilevel optimization strategy with two types of actors: a Global Actor, which balances data sampling across different subsets of the training mixture, and several Local Actors, which optimize data usage within each subset based on difficulty levels. These actors are guided by reward functions derived from the LLM's training state, which measure learning progress and relative performance improvement. We evaluate HBO on three LLM backbones across nine diverse tasks in multilingual and multitask setups. Results show that HBO consistently outperforms existing baselines, achieving significant accuracy gains. Our in-depth analysis further demonstrates that both the global actor and local actors of HBO effectively adjust data usage during fine-tuning. HBO provides a comprehensive solution to the challenges of data imbalance and heterogeneity in LLM fine-tuning, enabling more effective training across diverse datasets.
中文: HBO提出了一种分层优化方法,使大语言模型能够在微调过程中自主实现全局跨数据集和局部数据集内部的数据分配平衡,始终优于现有基线并显著提升准确率。
English: HBO introduces a hierarchical optimization method that enables LLMs to autonomously balance data allocation both globally across datasets and locally within each dataset during fine-tuning, consistently outperforming existing baselines with significant accuracy gains.

Authors:Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, Haifeng Wang
Title: Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation
Abstract:
Considering the inherent limitations of parametric knowledge in large language models (LLMs), retrieval-augmented generation (RAG) is widely employed to expand their knowledge scope. Since RAG has shown promise in knowledge-intensive tasks like open-domain question answering, its broader application to complex tasks and intelligent assistants has further advanced its utility. Despite this progress, the underlying knowledge utilization mechanisms of LLM-based RAG remain underexplored. In this paper, we present a systematic investigation of the intrinsic mechanisms by which LLMs integrate internal (parametric) and external (retrieved) knowledge in RAG scenarios. Specifically, we employ knowledge stream analysis at the macroscopic level, and investigate the function of individual modules at the microscopic level. Drawing on knowledge streaming analyses, we decompose the knowledge utilization process into four distinct stages within LLM layers: knowledge refinement, knowledge elicitation, knowledge expression, and knowledge contestation. We further demonstrate that the relevance of passages guides the streaming of knowledge through these stages. At the module level, we introduce a new method, knowledge activation probability entropy (KAPE) for neuron identification associated with either internal or external knowledge. By selectively deactivating these neurons, we achieve targeted shifts in the LLM's reliance on one knowledge source over the other. Moreover, we discern complementary roles for multi-head attention and multi-layer perceptron layers during knowledge formation. These insights offer a foundation for improving interpretability and reliability in retrieval-augmented LLMs, paving the way for more robust and transparent generative solutions in knowledge-intensive domains.
中文: 本文系统研究了大型语言模型在检索增强生成中如何整合内部与外部知识,揭示了知识处理的四个阶段并引入神经元分析方法,为提升模型可解释性和可靠性奠定基础。
English: This paper systematically investigates how large language models integrate internal and external knowledge in retrieval-augmented generation, revealing four knowledge processing stages and introducing neuron analysis methods to enhance model interpretability and reliability.

Authors:Yanbo Dai, Zhenlan Ji, Zongjie Li, Shuai Wang
Title: EAMET: Robust Massive Model Editing via Embedding Alignment Optimization
Abstract:
Model editing techniques are essential for efficiently updating knowledge in large language models (LLMs). However, the effectiveness of existing approaches degrades in massive editing scenarios, particularly when evaluated with practical metrics. Their robustness is also limited in context-rich settings or when editing multiple facts of the same subject simultaneously. We attribute these failures to the embedding misalignment among knowledge items, which undermines editing reliability at scale. To address this, we propose EAMET (Embedding Alignment Model Editing in Transformers), which aligns the space of key and residual embeddings. Extensive experiments across six LLMs and three datasets demonstrate that EAMET consistently outperforms existing methods, achieving about 90% editing efficacy when editing 10k facts. Codes and datasets are publicly available at https://ybdai7.github.io/eamet-page/.
Chinese: 针对大规模编辑场景下模型编辑技术效果下降的问题,提出了EAMET方法,通过对齐嵌入空间显著提升了编辑效能和鲁棒性,在多项实验中实现约90%的编辑成功率。
English: Model editing techniques for updating knowledge in large language models face challenges in massive editing scenarios, leading to the proposed EAMET method that aligns embeddings to enhance efficacy and robustness, achieving about 90% effectiveness in extensive experiments.

Authors:Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo
Title: On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating
Abstract:
Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.
中文: 本研究通过理论分析和实验验证,深入探讨了DeepSeekMoE的共享专家策略和归一化Sigmoid门控机制,从收敛性分析和路由器行为等方面揭示了其在样本效率上的优势。
English: This study provides a theoretical and empirical analysis of DeepSeekMoE's shared expert strategy and normalized sigmoid gating, demonstrating their enhanced sample efficiency and router behavior through convergence analysis and experiments on synthetic and real-world datasets.

Authors:Suhan Guo, Jiahong Deng, Mengjun Yi, Furao Shen, Jian Zhao
Title: SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models
Abstract:
Attention-based architectures have achieved superior performance in multivariate time series forecasting but are computationally expensive. Techniques such as patching and adaptive masking have been developed to reduce their sizes and latencies. In this work, we propose a structured pruning method, SPAT (Sensitivity Pruner for Attention), which selectively removes redundant attention mechanisms and yields highly effective models. Different from previous approaches, SPAT aims to remove the entire attention module, which reduces the risk of overfitting and enables speed-up without demanding specialized hardware. We propose a dynamic sensitivity metric, Sensitivity Enhanced Normalized Dispersion (SEND), that measures the importance of each attention module during the pre-training phase. Experiments on multivariate datasets demonstrate that SPAT-pruned models achieve reductions of 2.842% in MSE, 1.996% in MAE, and 35.274% in FLOPs. Furthermore, SPAT-pruned models outperform existing lightweight, Mamba-based and LLM-based SOTA methods in both standard and zero-shot inference, highlighting the importance of retaining only the most effective attention mechanisms. We have made our code publicly available at https://anonymous.4open.science/r/SPAT-6042.
中文: 提出的SPAT方法通过结构化剪枝去除时间序列预测模型中冗余的注意力机制,在不依赖专用硬件的情况下显著提升了模型性能与计算效率。
English: The proposed SPAT method uses structured pruning to remove redundant attention mechanisms in time series forecasting models, achieving significant performance improvements and computational efficiency without specialized hardware.

Authors:Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, Jianye Hao
Title: From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
Abstract:
Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validated FSD's capabilities in both "seeing" and "doing," achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed more challenging benchmark VABench. We also verified zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real robot settings. Experimental results show that FSD achieves 40.6% success rate in SimplerEnv and 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.
中文摘要:FSD模型通过空间关系推理生成中间表征,在仿真和真实机器人任务中分别实现40.6%和72%的成功率,以30%的优势超越基线方法,显著提升了零样本操作性能。
English Summary: The FSD model introduces spatial relationship reasoning to generate intermediate representations, achieving superior zero-shot robotic manipulation performance with 40.6% success in simulation and 72% in real-world tasks, outperforming baselines by 30%.

Authors:Mengjun Yi, Hanwen Zhang, Hui Dou, Jian Zhao, Furao Shen
Title: CacheFL: Privacy-Preserving and Efficient Federated Cache Model Fine-Tuning for Vision-Language Models
Abstract:
Large pre-trained Vision-Language Models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), have exhibited remarkable zero-shot performance across various image classification tasks. Fine-tuning these models on domain-specific datasets further enhances their effectiveness for downstream applications. However, fine-tuning in cloud environments raises significant concerns regarding data security and privacy. Federated Learning (FL) offers a decentralized solution by enabling model training across local clients without centralizing sensitive data, but the high communication and computation costs of transmitting full pre-trained models during training limit its scalability. Additionally, non-Independent and Identically Distributed (non-IID) data across local clients can negatively impact model convergence and performance. To address these challenges, we propose CacheFL, a novel federated learning method that replaces traditional full model fine-tuning with lightweight cache model fine-tuning. The cache model is initialized using a class-balanced dataset generated by a generative pre-trained model, effectively mitigating the impact of non-IID data. This cache model is then distributed to local clients for fine-tuning, and the updated parameters from each client are aggregated on the server and redistributed. With the updated cache model, the classification performance of CLIP is improved after just a few epochs. By limiting the training and communication to the cache model, CacheFL significantly reduces resource demands while ensuring data privacy and security. Extensive experiments conducted on ImageNet and 10 additional datasets demonstrate that CacheFL outperforms traditional approaches in terms of classification accuracy, resource efficiency, and privacy preservation.
Chinese: CacheFL提出了一种新颖的联邦学习方法,通过生成式预训练模型初始化轻量级缓存模型进行微调,而非完整预训练视觉语言模型,有效降低通信计算成本、缓解非独立同分布数据影响,在保障数据隐私的同时显著提升多数据集分类性能。
English: CacheFL introduces a novel federated learning approach that fine-tunes lightweight cache models instead of full pre-trained VLMs, reducing communication and computation costs while mitigating non-IID data effects through generative initialization and ensuring data privacy and improved classification performance across diverse datasets.

Authors:Tharindu Fernando, Clinton Fookes, Sridha Sridharan, Simon Denman
Title: Cross-Branch Orthogonality for Improved Generalization in Face Deepfake Detection
Abstract:
Remarkable advancements in generative AI technology have given rise to a spectrum of novel deepfake categories with unprecedented leaps in their realism, and deepfakes are increasingly becoming a nuisance to law enforcement authorities and the general public. In particular, we observe alarming levels of confusion, deception, and loss of faith regarding multimedia content within society caused by face deepfakes, and existing deepfake detectors are struggling to keep up with the pace of improvements in deepfake generation. This is primarily due to their reliance on specific forgery artifacts, which limits their ability to generalise and detect novel deepfake types. To combat the spread of malicious face deepfakes, this paper proposes a new strategy that leverages coarse-to-fine spatial information, semantic information, and their interactions while ensuring feature distinctiveness and reducing the redundancy of the modelled features. A novel feature orthogonality-based disentanglement strategy is introduced to ensure branch-level and cross-branch feature disentanglement, which allows us to integrate multiple feature vectors without adding complexity to the feature space or compromising generalisation. Comprehensive experiments on three public benchmarks: FaceForensics++, Celeb-DF, and the Deepfake Detection Challenge (DFDC) show that these design choices enable the proposed approach to outperform current state-of-the-art methods by 5% on the Celeb-DF dataset and 7% on the DFDC dataset in a cross-dataset evaluation setting.
中文: 生成式AI的显著进步催生了高度逼真的深度伪造技术,现有检测器难以应对;本文提出一种基于特征正交解耦的新策略,利用空间和语义信息,在跨数据集测试中显著优于现有最优方法。
English: Generative AI advancements have led to highly realistic deepfakes that challenge existing detectors, prompting this paper to introduce a novel feature disentanglement strategy using spatial and semantic information to significantly outperform current methods in cross-dataset evaluations.

Authors:Changxi Chi, Jun Xia, Jingbo Zhou, Jiabei Cheng, Chang Yu, Stan Z. Li
Title: GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype
Abstract:
Predicting genetic perturbations enables the identification of potentially crucial genes prior to wet-lab experiments, significantly improving overall experimental efficiency. Since genes are the foundation of cellular life, building gene regulatory networks (GRNs) is essential to understand and predict the effects of genetic perturbations. However, current methods fail to fully leverage gene-related information, and rely solely on simple evaluation metrics to construct coarse-grained GRNs. More importantly, they ignore functional differences between biotypes, limiting the ability to capture potential gene interactions. In this work, we leverage a pre-trained large language model and a DNA sequence model to extract features from gene descriptions and DNA sequence data, respectively, which serve as the initialization for gene representations. Additionally, we introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes, while capturing implicit gene relationships through graph structure learning (GSL). We propose GRAPE, a heterogeneous graph neural network (HGNN) that leverages gene representations initialized with features from descriptions and sequences, models the distinct roles of genes with different biotypes, and dynamically refines the GRN through GSL. The results on publicly available datasets show that our method achieves state-of-the-art performance.
中文: 本研究提出GRAPE,一种异构图神经网络,通过整合大型语言模型和DNA序列模型提取的基因特征,结合生物型特异性角色,并利用图结构学习动态优化基因调控网络,在遗传扰动预测中实现了最先进的性能。
English: This study introduces GRAPE, a heterogeneous graph neural network that enhances genetic perturbation prediction by integrating gene features from large language and DNA sequence models, incorporating biotype-specific roles, and dynamically refining gene regulatory networks through graph structure learning, achieving state-of-the-art results.

Authors:Zhou Wu, Junyi An, Baile Xu, Furao Shen, Jian Zhao
Title: Physics-inspired Energy Transition Neural Network for Sequence Learning
Abstract:
Recently, the superior performance of Transformers has made them a more robust and scalable solution for sequence modeling than traditional recurrent neural networks (RNNs). However, the effectiveness of Transformers in capturing long-term dependencies is primarily attributed to their comprehensive pair-modeling process rather than inherent inductive biases toward sequence semantics. In this study, we explore the capabilities of pure RNNs and reassess their long-term learning mechanisms. Inspired by energy transition models in physics that track energy changes over time, we propose an effective recurrent structure called the "Physics-inspired Energy Transition Neural Network" (PETNN). We demonstrate that PETNN's memory mechanism effectively stores information over long-term dependencies. Experimental results indicate that PETNN outperforms Transformer-based methods across various sequence tasks. Furthermore, owing to its recurrent nature, PETNN exhibits significantly lower complexity. Our study presents an optimal foundational recurrent architecture and highlights the potential for developing effective recurrent neural networks in fields currently dominated by Transformers.
Chinese: 本研究提出了一种名为“物理启发的能量转换神经网络”(PETNN)的循环结构,它能有效捕捉长期依赖关系,在多种序列任务中表现优于基于Transformer的方法,且复杂度显著更低,为当前主导模型提供了有前景的替代方案。
English: The study introduces the Physics-inspired Energy Transition Neural Network (PETNN), a recurrent structure that effectively captures long-term dependencies and outperforms Transformer-based methods in various sequence tasks with lower complexity, offering a promising alternative to current dominant models.

Authors:Shaheer Mohamed, Tharindu Fernando, Sridha Sridharan, Peyman Moghadam, Clinton Fookes
Title: Dual-Domain Masked Image Modeling: A Self-Supervised Pretraining Strategy Using Spatial and Frequency Domain Masking for Hyperspectral Data
Abstract:
Hyperspectral images (HSIs) capture rich spectral signatures that reveal vital material properties, offering broad applicability across various domains. However, the scarcity of labeled HSI data limits the full potential of deep learning, especially for transformer-based architectures that require large-scale training. To address this constraint, we propose Spatial-Frequency Masked Image Modeling (SFMIM), a self-supervised pretraining strategy for hyperspectral data that utilizes the large portion of unlabeled data. Our method introduces a novel dual-domain masking mechanism that operates in both spatial and frequency domains. The input HSI cube is initially divided into non-overlapping patches along the spatial dimension, with each patch comprising the entire spectrum of its corresponding spatial location. In spatial masking, we randomly mask selected patches and train the model to reconstruct the masked inputs using the visible patches. Concurrently, in frequency masking, we remove portions of the frequency components of the input spectra and predict the missing frequencies. By learning to reconstruct these masked components, the transformer-based encoder captures higher-order spectral-spatial correlations. We evaluate our approach on three publicly available HSI classification benchmarks and demonstrate that it achieves state-of-the-art performance. Notably, our model shows rapid convergence during fine-tuning, highlighting the efficiency of our pretraining strategy.
Chinese: 针对高光谱图像标注数据稀缺的问题,我们提出空间-频率掩码图像建模(SFMIM),这种自监督预训练方法通过在空间和频率域同时进行掩码与重建,使Transformer模型能够学习光谱-空间关联,在分类任务中实现最优性能并具有快速微调收敛特性。
English: To overcome the scarcity of labeled hyperspectral image data, we propose Spatial-Frequency Masked Image Modeling (SFMIM), a self-supervised pretraining method that masks and reconstructs data in both spatial and frequency domains, enabling transformers to capture spectral-spatial correlations and achieve state-of-the-art classification performance with rapid fine-tuning convergence.

Authors:Xiang Xu, Ruotong Li, Mengjun Yi, Baile Xu, Furao Shen, Jian Zhao
Title: Interactive Instance Annotation with Siamese Networks
Abstract:
Annotating instance masks is time-consuming and labor-intensive. A promising solution is to predict contours using a deep learning model and then allow users to refine them. However, most existing methods focus on in-domain scenarios, limiting their effectiveness for cross-domain annotation tasks. In this paper, we propose SiamAnno, a framework inspired by the use of Siamese networks in object tracking. SiamAnno leverages one-shot learning to annotate previously unseen objects by taking a bounding box as input and predicting object boundaries, which can then be adjusted by annotators. Trained on one dataset and tested on another without fine-tuning, SiamAnno achieves state-of-the-art (SOTA) performance across multiple datasets, demonstrating its ability to handle domain and environment shifts in cross-domain tasks. We also provide more comprehensive results compared to previous work, establishing a strong baseline for future research. To our knowledge, SiamAnno is the first model to explore Siamese architecture for instance annotation.
中文:SiamAnno是一种基于孪生网络的新型框架,通过一次性学习实现跨域实例标注,无需微调即可预测物体边界并达到最先进的性能。
English: SiamAnno is a novel framework that utilizes Siamese networks for one-shot instance annotation, enabling cross-domain object boundary prediction with state-of-the-art performance without requiring fine-tuning.

Authors:Jerry Junyang Cheung, Shiyao Shen, Yuchen Zhuang, Yinghao Li, Rampi Ramprasad, Chao Zhang
Title: MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge
Abstract:
Despite recent advances in large language models (LLMs) for materials science, there is a lack of benchmarks for evaluating their domain-specific knowledge and complex reasoning abilities. To bridge this gap, we introduce MSQA, a comprehensive evaluation benchmark of 1,757 graduate-level materials science questions in two formats: detailed explanatory responses and binary True/False assessments. MSQA distinctively challenges LLMs by requiring both precise factual knowledge and multi-step reasoning across seven materials science sub-fields, such as structure-property relationships, synthesis processes, and computational modeling. Through experiments with 10 state-of-the-art LLMs, we identify significant gaps in current LLM performance. While API-based proprietary LLMs achieve up to 84.5% accuracy, open-source (OSS) LLMs peak around 60.5%, and domain-specific LLMs often underperform significantly due to overfitting and distributional shifts. MSQA is the first benchmark to jointly evaluate the factual knowledge and reasoning capabilities that LLMs need for advanced materials science.
中文: MSQA作为首个综合评估基准,包含1,757道材料科学研究生级别问题,用于测试大语言模型在七个子领域的知识掌握与多步推理能力,实验发现专有模型准确率最高达84.5%,而开源模型和领域专用模型因过拟合与数据分布偏移表现欠佳。
English: MSQA is introduced as the first comprehensive benchmark with 1,757 graduate-level materials science questions to evaluate LLMs' factual knowledge and complex reasoning across seven sub-fields, revealing significant performance gaps where proprietary models reach 84.5% accuracy while open-source and domain-specific models lag due to overfitting and distribution shifts.

Authors:Yichen Feng, Zhangchen Xu, Fengqing Jiang, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
Title: VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL
Abstract:
Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale and well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic training dataset for visual logical reasoning. To tackle the challenge of image synthesis with grounded answers, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates code for grounded image synthesis, from which puzzle samples are assembled. Experiments demonstrate that VLMs trained using GRPO on VisualSphinx benefit from the logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also benefit other reasoning tasks such as algebraic reasoning, arithmetic reasoning and geometry reasoning.
中文:VisualSphinx首创大规模合成视觉逻辑推理数据集,通过规则到图像的合成流程增强视觉语言模型的逻辑一致性,并提升其在代数、算术和几何推理等多任务中的表现。
English: VisualSphinx introduces a large-scale synthetic dataset for visual logical reasoning, enhancing VLMs' performance through a rule-to-image synthesis pipeline and improving their capabilities across various reasoning tasks.

Authors:Ye Cheng, Minghui Xu, Yue Zhang, Kun Li, Hao Wu, Yechao Zhang, Shaoyong Guo, Wangjie Qiu, Dongxiao Yu, Xiuzhen Cheng
Title: Say What You Mean: Natural Language Access Control with Large Language Models for Internet of Things
Abstract:
Access control in the Internet of Things (IoT) is becoming increasingly complex, as policies must account for dynamic and contextual factors such as time, location, user behavior, and environmental conditions. However, existing platforms either offer only coarse-grained controls or rely on rigid rule matching, making them ill-suited for semantically rich or ambiguous access scenarios. Moreover, the policy authoring process remains fragmented: domain experts describe requirements in natural language, but developers must manually translate them into code, introducing semantic gaps and potential misconfiguration. In this work, we present LACE, the Language-based Access Control Engine, a hybrid framework that leverages large language models (LLMs) to bridge the gap between human intent and machine-enforceable logic. LACE combines prompt-guided policy generation, retrieval-augmented reasoning, and formal validation to support expressive, interpretable, and verifiable access control. It enables users to specify policies in natural language, automatically translates them into structured rules, validates semantic correctness, and makes access decisions using a hybrid LLM-rule-based engine. We evaluate LACE in smart home environments through extensive experiments. LACE achieves 100% correctness in verified policy generation and up to 88% decision accuracy with 0.79 F1-score using DeepSeek-V3, outperforming baselines such as GPT-3.5 and Gemini. The system also demonstrates strong scalability under increasing policy volume and request concurrency. Our results highlight LACE's potential to enable secure, flexible, and user-friendly access control across real-world IoT platforms.
中文: 本文提出LACE语言访问控制引擎,通过大语言模型将自然语言策略转化为可验证规则,在物联网环境中实现了高精度策略生成与卓越扩展性。
English: This paper introduces LACE, a language-based access control engine that uses large language models to translate natural language policies into verifiable rules, achieving high accuracy and scalability in IoT environments.

Authors:Lei Yu, Yechao Zhang, Ziqi Zhou, Yang Wu, Wei Wan, Minghui Li, Shengshan Hu, Pei Xiaobing, Jing Wang
Title: Spa-VLM: Stealthy Poisoning Attacks on RAG-based VLM
Abstract:
With the rapid development of Vision-Language Models (VLMs), significant progress has been made in Visual Question Answering (VQA) tasks. However, existing VLMs often generate inaccurate answers due to a lack of up-to-date knowledge. To address this issue, recent research has introduced Retrieval-Augmented Generation (RAG) techniques, commonly used in Large Language Models (LLMs), into VLMs, incorporating external multi-modal knowledge to enhance the accuracy and practicality of VLM systems. Nevertheless, RAG in LLMs may be susceptible to data poisoning attacks, and RAG-based VLMs may face the same threat. This paper first reveals the vulnerabilities of the RAG-based large model under poisoning attack, showing that existing single-modal RAG poisoning attacks have a 100% failure rate in multi-modal RAG scenarios. To address this gap, we propose Spa-VLM (Stealthy Poisoning Attack on RAG-based VLM), a new paradigm for poisoning attacks on large models. We carefully craft malicious multi-modal knowledge entries, including adversarial images and misleading text, which are then injected into the RAG's knowledge base. When users access the VLM service, the system may generate misleading outputs. We evaluate Spa-VLM on two Wikipedia datasets and across two different RAGs. Results demonstrate that our method achieves highly stealthy poisoning, with the attack success rate exceeding 0.8 after injecting just 5 malicious entries into knowledge bases with 100K and 2M entries, outperforming state-of-the-art poisoning attacks designed for RAG-based LLMs. Additionally, we evaluated several defense mechanisms, all of which ultimately proved ineffective against Spa-VLM, underscoring the effectiveness and robustness of our attack.
中文: 本文提出Spa-VLM,一种针对基于检索增强生成的视觉语言模型的隐蔽投毒攻击,通过精心制作恶意多模态知识条目误导系统输出,在攻击成功率和防御抵抗性方面均表现出显著优势。
English: This paper introduces Spa-VLM, a stealthy poisoning attack on RAG-based Vision-Language Models that crafts malicious multi-modal entries to mislead outputs, demonstrating high attack success rates and robustness against existing defenses.

Authors:Song Wang, Gongfan Fang, Lingdong Kong, Xiangtai Li, Jianyun Xu, Sheng Yang, Qiang Li, Jianke Zhu, Xinchao Wang
Title: PixelThink: Towards Efficient Chain-of-Pixel Reasoning
Abstract:
Existing reasoning segmentation approaches typically fine-tune multimodal large language models (MLLMs) using image-text pairs and corresponding mask labels. However, they exhibit limited generalization to out-of-distribution scenarios without an explicit reasoning process. Although recent efforts leverage reinforcement learning through group-relative policy optimization (GRPO) to enhance reasoning ability, they often suffer from overthinking - producing uniformly verbose reasoning chains irrespective of task complexity. This results in elevated computational costs and limited control over reasoning quality. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to compress reasoning length in accordance with scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics designed to assess segmentation accuracy, reasoning quality, and efficiency jointly. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance. Our work contributes novel perspectives towards efficient and interpretable multimodal understanding. The code and model will be publicly available.
中文摘要:提出的PixelThink方法通过结合任务难度和模型不确定性来自适应调控多模态分割中的推理长度,在提升效率与性能的同时引入新基准进行全面评估。
English Summary: The proposed PixelThink method adaptively regulates reasoning length in multimodal segmentation by combining task difficulty and model uncertainty, improving both efficiency and performance while introducing a new benchmark for comprehensive evaluation.

Authors:Changyi Lin, Yuxin Ray Song, Boda Huo, Mingyang Yu, Yikai Wang, Shiqi Liu, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Jie Tan, Yiyue Luo, Ding Zhao
Title: LocoTouch: Learning Dynamic Quadrupedal Transport with Tactile Sensing
Abstract:
Quadrupedal robots have demonstrated remarkable agility and robustness in traversing complex terrains. However, they struggle with dynamic object interactions, where contact must be precisely sensed and controlled. To bridge this gap, we present LocoTouch, a system that equips quadrupedal robots with tactile sensing to address a particularly challenging task in this category: long-distance transport of unsecured cylindrical objects, which typically requires custom mounting or fastening mechanisms to maintain stability. For efficient large-area tactile sensing, we design a high-density distributed tactile sensor that covers the entire back of the robot. To effectively leverage tactile feedback for robot control, we develop a simulation environment with high-fidelity tactile signals, and train tactile-aware transport policies using a two-stage learning pipeline. Furthermore, we design a novel reward function to promote robust, symmetric, and frequency-adaptive locomotion gaits. After training in simulation, LocoTouch transfers zero-shot to the real world, reliably transporting a wide range of unsecured cylindrical objects with diverse sizes, weights, and surface properties. Moreover, it remains robust over long distances, on uneven terrain, and under severe perturbations.
中文: LocoTouch系统通过为四足机器人配备高密度触觉传感器和两阶段学习策略,实现了对多种无固定圆柱物体的零样本实时搬运,并在不同地形和干扰下展现出卓越的鲁棒性。
English: LocoTouch equips quadrupedal robots with high-density tactile sensors and a two-stage learning pipeline to enable zero-shot real-world transport of diverse unsecured cylindrical objects, demonstrating robustness across varying terrains and disturbances.

Authors:Angtian Wang, Haibin Huang, Jacob Zhiyuan Fang, Yiding Yang, Chongyang Ma
Title: ATI: Any Trajectory Instruction for Controllable Video Generation
Abstract:
We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion using trajectory-based inputs. In contrast to prior methods that address these motion types through separate modules or task-specific designs, our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models via a lightweight motion injector. Users can specify keypoints and their motion paths to control localized deformations, entire object motion, virtual camera dynamics, or combinations of these. The injected trajectory signals guide the generative process to produce temporally consistent and semantically aligned motion sequences. Our framework demonstrates superior performance across multiple video motion control tasks, including stylized motion effects (e.g., motion brushes), dynamic viewpoint changes, and precise local motion manipulation. Experiments show that our method provides significantly better controllability and visual quality compared to prior approaches and commercial solutions, while remaining broadly compatible with various state-of-the-art video generation backbones. Project page: https://anytraj.github.io/.
中文: 我们提出了一种统一的视频生成运动控制框架,通过轨迹输入无缝整合相机运动、物体平移和局部精细运动,在多种任务中展现出卓越的可控性和视觉质量。
English: We introduce a unified motion control framework for video generation that integrates camera movement, object translation, and local motion through trajectory inputs, achieving superior controllability and visual quality across various tasks.

Authors:Yuchen Zhuang, Di Jin, Jiaao Chen, Wenqi Shi, Hanrui Wang, Chao Zhang
Title: WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning
Abstract:
Large language model (LLM)-empowered web agents enable the automation of complex, real-time web navigation tasks in enterprise environments. However, existing web agents relying on supervised fine-tuning (SFT) often struggle with generalization and robustness due to insufficient reasoning capabilities when handling the inherently dynamic nature of web interactions. In this study, we introduce WorkForceAgent-R1, an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework designed explicitly to enhance single-step reasoning and planning for business-oriented web navigation tasks. We employ a structured reward function that evaluates both adherence to output formats and correctness of actions, enabling WorkForceAgent-R1 to implicitly learn robust intermediate reasoning without explicit annotations or extensive expert demonstrations. Extensive experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines by 10.26-16.59%, achieving competitive performance relative to proprietary LLM-based agents (gpt-4o) in workplace-oriented web navigation tasks.
中文摘要:基于大语言模型的网络代理在动态网页环境中常缺乏泛化能力,而新推出的WorkForceAgent-R1通过基于规则的强化学习训练,显著优于监督微调方法,有效提升了商务网页导航任务中的推理与规划能力。
English Summary: Large language model-based web agents often lack generalization in dynamic web environments, but the newly introduced WorkForceAgent-R1, trained with rule-based reinforcement learning, significantly outperforms supervised fine-tuning methods by enhancing reasoning and planning for business web tasks.
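Illustrative sketch (not the authors' code): the abstract describes a structured reward that scores both output-format adherence and action correctness. A minimal version of such a rule-based reward might look like the following. The `<think>`/`<action>` tag format, the function name `structured_reward`, and the weights 0.2 and 0.8 are all assumptions made for illustration; the abstract does not specify them.

```python
import re

# Hypothetical output format: a reasoning span followed by exactly one action span.
_PATTERN = re.compile(
    r"\s*<think>(.+?)</think>\s*<action>(.+?)</action>\s*", re.DOTALL
)


def structured_reward(output, gold_action, format_weight=0.2, action_weight=0.8):
    """Rule-based reward: partial credit for a well-formed output,
    full credit only if the extracted action also matches the gold action.
    The weights are illustrative assumptions, not values from the paper."""
    match = _PATTERN.fullmatch(output)
    if match is None:
        # Malformed output: no reward, the action is not even inspected.
        return 0.0
    reward = format_weight
    predicted_action = match.group(2).strip()
    if predicted_action == gold_action.strip():
        reward += action_weight
    return reward


if __name__ == "__main__":
    good = "<think>The submit button is labelled a12.</think><action>click('a12')</action>"
    print(structured_reward(good, "click('a12')"))
    print(structured_reward(good, "click('b7')"))
    print(structured_reward("click('a12')", "click('a12')"))
```

Gating the action check on the format check is one possible design choice: it keeps the reward purely rule-based and requires no annotated reasoning traces, consistent with the abstract's description.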

Authors:Zhiyuan Li, Yi Chang, Yuan Wu
Title: THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models
Abstract:
Large reasoning models (LRMs) have achieved impressive performance in complex tasks, often outperforming conventional large language models (LLMs). However, the prevalent issue of overthinking severely limits their computational efficiency. Overthinking occurs when models generate excessive and redundant tokens that contribute little to accurate outcomes, especially in simple tasks, resulting in a significant waste of computational resources. To systematically investigate this issue, we introduce Think-Bench, a benchmark designed to evaluate the reasoning efficiency of LRMs. We also propose novel efficiency metrics and conduct a comprehensive evaluation of various LRMs across multiple dimensions, including the reasoning process, outcome quality, and chain-of-thought (CoT) characteristics. Our analysis reveals that most LRMs exhibit overthinking in handling easy questions, generating unnecessarily lengthy reasoning chains. While many LRMs demonstrate high CoT quality, several suffer from low efficiency. We hope that Think-Bench can serve as a robust foundation for advancing research into LRMs.
Chinese: 大型推理模型常出现过度思考问题,产生冗余标记而降低计算效率,为此我们引入Think-Bench基准,从多维度评估并推动其推理效率的研究进展。
English: Large reasoning models often exhibit overthinking, generating redundant tokens that reduce computational efficiency, prompting the introduction of Think-Bench to evaluate and improve their reasoning efficiency across various dimensions.

Authors:Fengqing Jiang, Fengbo Ma, Zhangchen Xu, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bo Li, Xianyan Chen, Zhen Xiang, Radha Poovendran
Title: SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge
Abstract:
Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically focus either on instructions requiring minimal knowledge comprehension (e.g., "tell me how to build a bomb") or utilize prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety when handling knowledge-intensive, hazardous scenarios. To address this critical gap, we introduce SOSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models within a unified evaluation framework using our SOSBench. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 79.1% for Deepseek-R1 and 47.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.
Chinese: 大型语言模型在复杂任务中表现出色,但在应对知识密集型高风险科学场景时的安全性评估不足,为此我们推出了SOSBench基准测试,发现即使声称安全对齐的先进模型仍存在高比例的有害内容输出,揭示了严重的安全隐患。
English: Large language models show strong capabilities in complex tasks but remain inadequately tested for safety in knowledge-intensive, high-risk scientific scenarios, prompting the introduction of SOSBench, a comprehensive benchmark that reveals alarmingly high rates of harmful responses from advanced models despite their safety claims.

Authors:Yuchen Zhuang, Aaron Trinh, Rushi Qiang, Haotian Sun, Chao Zhang, Hanjun Dai, Bo Dai
Title: Towards Better Instruction Following Retrieval Models
Abstract:
Modern information retrieval (IR) models, trained exclusively on standard pairs, struggle to effectively interpret and follow explicit user instructions. We introduce InF-IR, a large-scale, high-quality training corpus tailored for enhancing retrieval models in Instruction-Following IR. InF-IR expands traditional training pairs into over 38,000 expressive triplets as positive samples. In particular, for each positive triplet, we generate two additional hard negative examples by poisoning both instructions and queries; these are then rigorously validated by an advanced reasoning model (o3-mini) to ensure semantic plausibility while maintaining instructional incorrectness. Unlike existing corpora that primarily support computationally intensive reranking tasks for decoder-only language models, the highly contrastive positive-negative triplets in InF-IR further enable efficient representation learning for smaller encoder-only models, facilitating direct embedding-based retrieval. Using this corpus, we train InF-Embed, an instruction-aware Embedding model optimized through contrastive learning and instruction-query attention mechanisms to align retrieval outcomes precisely with user intents. Extensive experiments across five instruction-based retrieval benchmarks demonstrate that InF-Embed significantly surpasses competitive baselines by 8.1% in p-MRR, measuring the instruction-following capabilities.
Chinese: InF-IR是一个大规模训练语料库,通过将传统检索对扩展为包含困难负样本的指令化三元组,专门提升信息检索中的指令遵循能力,基于此训练的InF-Embed模型在五项基准测试中以p-MRR指标显著领先基线8.1%。
English: InF-IR is a large-scale training corpus designed to improve instruction-following in information retrieval by expanding traditional pairs into expressive triplets with hard negatives, enabling the development of InF-Embed, which significantly outperforms baselines by 8.1% in p-MRR across benchmarks.
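Illustrative sketch (not the authors' code): the abstract says each positive triplet is paired with two hard negatives and that InF-Embed is trained with contrastive learning. A generic InfoNCE-style loss over one positive and its hard negatives could be written as below. The choice of InfoNCE, the role assignment (the passage embedding as anchor, the correct instruction-query embedding as positive, the two poisoned variants as negatives), the cosine similarity, and the temperature of 0.1 are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log softmax probability of the positive among
    {positive} + negatives, using temperature-scaled cosine similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    passage = [1.0, 0.0]
    correct_pair = [0.9, 0.1]                       # hypothetical embedding
    poisoned_pairs = [[0.1, 0.9], [0.2, 0.8]]       # two hard negatives
    print(info_nce(passage, correct_pair, poisoned_pairs))
```

The loss is small when the anchor is far more similar to the positive than to either negative, and equals log 3 when all three candidates are equally similar.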

Authors:Xuanwen Ding, Chengjun Pan, Zejun Li, Jiwen Zhang, Siyuan Wang, Zhongyu Wei
Title: AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs
Abstract:
Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring efforts. To address this escalating cost, we introduce AutoJudger, an agent-driven framework for efficient and adaptive benchmarking of MLLMs. AutoJudger employs Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model's real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism to ensure that selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses: on MMT-Bench, AutoJudger uses only 4% of the data to achieve over 90% ranking accuracy relative to the full benchmark evaluation.
Chinese: AutoJudger采用自适应框架,结合项目反应理论和自主评估代理动态筛选最具信息量的测试题目,在多模态基准测试中将评估成本降低高达96%,同时保持超过90%排名准确率。
English: AutoJudger is an adaptive framework that uses Item Response Theory and an autonomous agent to dynamically select the most informative test questions, reducing evaluation costs by up to 96% while maintaining over 90% ranking accuracy on multimodal benchmarks.
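Illustrative sketch (not the authors' code): the abstract combines IRT difficulty estimates with adaptive question selection. The snippet below shows the classical computerized-adaptive-testing version of that idea under a two-parameter logistic (2PL) IRT model, where the next question is the one with the highest Fisher information at the current ability estimate. AutoJudger itself uses an autonomous agent with retrieval and memory for selection; the maximum-information rule, the 2PL model, the function names, and the learning rate here are assumptions substituted for illustration.

```python
import math


def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, items, asked):
    """Return the index of the not-yet-asked item that is most informative
    at the current ability estimate. items is a list of (a, b) pairs."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))


def update_ability(theta, a, b, correct, lr=0.5):
    """One gradient-ascent step on the log-likelihood of the observed
    response; the gradient for a 2PL item is a * (y - p)."""
    y = 1.0 if correct else 0.0
    return theta + lr * a * (y - p_correct(theta, a, b))


if __name__ == "__main__":
    items = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]  # hypothetical (a, b) pairs
    theta, asked = 0.0, set()
    first = select_next_item(theta, items, asked)
    asked.add(first)
    theta = update_ability(theta, *items[first], correct=True)
    print(first, theta)
```

Items whose difficulty is close to the current ability estimate carry the most information, which is why a small, well-chosen subset of questions can approximate a ranking obtained from the full benchmark.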

Authors:Yi-Cheng Lin, Kang-Chieh Chen, Zhe-Yan Li, Tzu-Heng Wu, Tzu-Hsuan Wu, Kuan-Yu Chen, Hung-yi Lee, Yun-Nung Chen
Title: Creativity in LLM-based Multi-Agent Systems: A Survey
Abstract:
Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. While existing surveys provide comprehensive overviews of MAS infrastructures, they largely overlook the dimension of creativity, including how novel outputs are generated and evaluated, how creativity informs agent personas, and how creative workflows are coordinated. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks. This survey offers a structured framework and roadmap for advancing the development, evaluation, and standardization of creative MAS.
中文: 本综述首次聚焦于大语言模型驱动的多智能体系统中的创造力问题,通过提出智能体分类体系、生成技术及核心挑战,为推进创造性人工智能协作提供了结构化框架与发展路线图。
English: This survey is the first to focus on creativity in large language model-driven multi-agent systems, presenting a taxonomy of agent design, generation techniques, and key challenges to provide a framework for advancing creative AI collaboration.

Authors:Wooseong Yang, Weizhi Zhang, Yuqing Liu, Yuwei Han, Yu Wang, Junhyun Lee, Philip S. Yu
Title: Cold-Start Recommendation with Knowledge-Guided Retrieval-Augmented Generation
Abstract:
Cold-start items remain a persistent challenge in recommender systems due to their lack of historical user interactions, which collaborative models rely on. While recent zero-shot methods leverage large language models (LLMs) to address this, they often struggle with sparse metadata and hallucinated or incomplete knowledge. We propose ColdRAG, a retrieval-augmented generation approach that builds a domain-specific knowledge graph dynamically to enhance LLM-based recommendation in cold-start scenarios, without requiring task-specific fine-tuning. ColdRAG begins by converting structured item attributes into rich natural-language profiles, from which it extracts entities and relationships to construct a unified knowledge graph capturing item semantics. Given a user's interaction history, it scores edges in the graph using an LLM, retrieves candidate items with supporting evidence, and prompts the LLM to rank them. By enabling multi-hop reasoning over this graph, ColdRAG grounds recommendations in verifiable evidence, reducing hallucinations and strengthening semantic connections. Experiments on three public benchmarks demonstrate that ColdRAG surpasses existing zero-shot baselines in both Recall and NDCG. This framework offers a practical solution to cold-start recommendation by combining knowledge-graph reasoning with retrieval-augmented LLM generation.
中文:ColdRAG提出了一种检索增强生成框架,通过动态构建领域知识图谱来改进冷启动商品的零样本推荐,无需任务特定微调即可在召回率和NDCG指标上超越现有方法。
English: ColdRAG introduces a retrieval-augmented generation framework that dynamically constructs a domain-specific knowledge graph to enhance zero-shot recommendations for cold-start items, outperforming existing methods in recall and NDCG metrics without task-specific fine-tuning.

Authors:Jiabo Ma, Yingxue Xu, Fengtao Zhou, Yihui Wang, Cheng Jin, Zhengrui Guo, Jianfeng Wu, On Ki Tang, Huajun Zhou, Xi Wang, Luyang Luo, Zhengyu Zhang, Du Cai, Zizhao Gao, Wei Wang, Yueping Liu, Jiankun He, Jing Cui, Zhenhui Li, Jing Zhang, Feng Gao, Xiuming Zhang, Li Liang, Ronald Cheong Kin Chan, Zhe Wang, Hao Chen
Title: PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology
Abstract:
The emergence of pathology foundation models (PFMs) has revolutionized computational histopathology, enabling highly accurate, generalized whole-slide image analysis for improved cancer diagnosis and prognosis assessment. While these models show remarkable potential across cancer diagnostics and prognostics, their clinical translation faces critical challenges including variability in the optimal model across cancer types, potential data leakage in evaluation, and lack of standardized benchmarks. Without rigorous, unbiased evaluation, even the most advanced PFMs risk remaining confined to research settings, delaying their life-saving applications. Existing benchmarking efforts remain limited by narrow cancer-type focus, potential pretraining data overlaps, or incomplete task coverage. We present PathBench, the first comprehensive benchmark addressing these gaps through: multi-center in-house datasets spanning common cancers with rigorous leakage prevention, evaluation across the full clinical spectrum from diagnosis to prognosis, and an automated leaderboard system for continuous model assessment. Our framework incorporates large-scale data, enabling objective comparison of PFMs while reflecting real-world clinical complexity. All evaluation data comes from private medical providers, with strict exclusion of any pretraining usage to avoid data leakage risks. We have collected 15,888 WSIs from 8,549 patients across 10 hospitals, encompassing over 64 diagnosis and prognosis tasks. Currently, our evaluation of 19 PFMs shows that Virchow2 and H-Optimus-1 are the most effective models overall. This work provides researchers with a robust platform for model development and offers clinicians actionable insights into PFM performance across diverse clinical scenarios, ultimately accelerating the translation of these transformative technologies into routine pathology practice.
中文: 病理学基础模型通过实现高精度的全切片图像分析革新了计算病理学,尽管其临床应用面临癌症类型间模型差异和数据泄露风险等挑战。
English: Pathology foundation models have transformed computational histopathology by enabling highly accurate whole-slide image analysis for cancer diagnosis and prognosis, though their clinical adoption faces challenges such as model variability across cancer types and data leakage risks.

Authors:Yuetai Li, Zhangchen Xu, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Xiang Yue, Radha Poovendran
Title: Temporal Sampling for Forgotten Reasoning in LLMs
Abstract:
Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: models often forget how to solve problems they previously answered correctly during training. We term this phenomenon temporal forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks. To address this gap, we introduce Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling, and leads to substantial improvements in reasoning performance, with gains of 4 to 19 points in Pass@k and consistent gains in Majority@k across several benchmarks. We further extend our method to LoRA-adapted models, demonstrating that storing only adapter weights across checkpoints achieves similar benefits with minimal storage cost. By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and rethink how we evaluate LLMs.
Chinese: 微调大型语言模型可能导致时间性遗忘,使其忘记已掌握的解题能力,但通过时间采样这一解码策略,利用多个训练检查点恢复被遗忘的解决方案,无需重新训练即可显著提升推理性能。
English: Fine-tuning large language models can cause temporal forgetting, where they lose previously learned problem-solving abilities, but this issue is effectively mitigated by Temporal Sampling, a decoding strategy that leverages multiple training checkpoints to recover forgotten solutions and significantly boost reasoning performance without additional training.
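Illustrative sketch (not the authors' code): the abstract describes Temporal Sampling as drawing outputs from multiple checkpoints along the training trajectory rather than from the final checkpoint alone. One simple way to spread a budget of k samples is round-robin allocation, shown below. The round-robin rule, the `Sampler` callable type, and the toy checkpoints are assumptions for illustration; the abstract does not state how samples are distributed.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a checkpoint is any callable that maps a prompt
# to one sampled answer string.
Sampler = Callable[[str], str]


def temporal_sample(prompt: str, checkpoints: Sequence[Sampler], k: int) -> List[str]:
    """Draw k answers by cycling through the given checkpoints in order
    (list the most recent checkpoint first so it receives any extra sample)."""
    return [checkpoints[i % len(checkpoints)](prompt) for i in range(k)]


def pass_at_k(answers: Sequence[str], reference: str) -> bool:
    """A problem counts as solved if any of the k sampled answers is correct."""
    return any(answer == reference for answer in answers)


if __name__ == "__main__":
    # Toy stand-ins: the final checkpoint has "forgotten" the right answer,
    # while an earlier checkpoint still produces it.
    final_ckpt = lambda prompt: "41"
    earlier_ckpt = lambda prompt: "42"

    baseline = temporal_sample("6 * 7 = ?", [final_ckpt], k=4)
    temporal = temporal_sample("6 * 7 = ?", [final_ckpt, earlier_ckpt], k=4)
    print(pass_at_k(baseline, "42"), pass_at_k(temporal, "42"))
```

The total number of samples stays at k in both settings, so the comparison reflects only where the samples come from, not extra compute.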

Authors:Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Krishna Kalyan, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer
Title: EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition
Abstract:
Effective human-AI interaction relies on AI's ability to accurately perceive and interpret human emotions. Current benchmarks for vision and vision-language models are severely limited, offering a narrow emotional spectrum that overlooks nuanced states (e.g., bitterness, intoxication) and fails to distinguish subtle differences between related feelings (e.g., shame vs. embarrassment). Existing datasets also often use uncontrolled imagery with occluded faces and lack demographic diversity, risking significant bias. To address these critical gaps, we introduce EmoNet Face, a comprehensive benchmark suite. EmoNet Face features: (1) A novel 40-category emotion taxonomy, meticulously derived from foundational research to capture finer details of human emotional experiences. (2) Three large-scale, AI-generated datasets (EmoNet HQ, Binary, and Big) with explicit, full-face expressions and controlled demographic balance across ethnicity, age, and gender. (3) Rigorous, multi-expert annotations for training and high-fidelity evaluation. (4) EmpathicInsight-Face, a model we built that achieves human-expert-level performance on our benchmark. The publicly released EmoNet Face suite - taxonomy, datasets, and model - provides a robust foundation for developing and evaluating AI systems with a deeper understanding of human emotions.
Chinese: EmoNet Face 通过引入包含40种情感的分类体系、平衡的人口统计数据集以及达到人类专家水平的模型,解决了当前情感识别基准的局限性,推动了人机交互的发展。
English: EmoNet Face introduces a comprehensive 40-category emotion taxonomy, balanced demographic datasets, and a human-expert-level model to address current limitations in emotion recognition benchmarks and advance human-AI interaction.

Authors:Pooneh Mousavi, Yingzhi Wang, Mirco Ravanelli, Cem Subakan
Title: ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs
Abstract:
Large Language Models (LLMs) are increasingly used in Spoken Language Understanding (SLU), where effective multimodal learning depends on the alignment between audio and text. Despite various fusion methods, no standard metric exists to assess this alignment. This work introduces ALAS (Automatic Latent Alignment Score), a metric that evaluates alignment by measuring correlations between audio and text representations across transformer layers. Experiments on Spoken Question Answering and Emotion Recognition show that ALAS captures meaningful patterns across tasks and layers.
中文: 本文提出了ALAS这一新指标,通过分析跨Transformer层的相关性来评估多模态学习中音频与文本的对齐程度,并在口语问答和情感识别等任务中验证了其能有效捕捉跨任务和跨层的有意义模式。
English: This paper introduces ALAS, a novel metric for evaluating audio-text alignment in multimodal learning by analyzing correlations across transformer layers, demonstrating its effectiveness in capturing meaningful patterns across tasks like Spoken Question Answering and Emotion Recognition.
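
The layer-wise idea in the abstract can be illustrated with a minimal sketch. This is not the ALAS definition, which the abstract does not spell out. It assumes each layer has already been pooled to one audio vector and one text vector of equal length, and it uses plain Pearson correlation as a stand-in measure. The names `pearson` and `layerwise_correlation` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def layerwise_correlation(audio_layers, text_layers):
    """Return one correlation score per layer.

    audio_layers / text_layers: lists (one entry per layer) of pooled
    representation vectors. The pooling itself is assumed, not shown.
    """
    return [pearson(a, t) for a, t in zip(audio_layers, text_layers)]
```

Plotting the returned scores against layer index would give a per-layer alignment profile of the kind the abstract describes.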

Authors:Mohammed D. Belgoumri, Mohamed Reda Bouadjenek, Hakim Hacid, Imran Razzak, Sunil Aryal
Title: Navigating loss manifolds via rigid body dynamics: A promising avenue for robustness and generalisation
Abstract:
Training large neural networks through gradient-based optimization requires navigating high-dimensional loss landscapes, which often exhibit pathological geometry, leading to undesirable training dynamics. In particular, poor generalization frequently results from convergence to sharp minima that are highly sensitive to input perturbations, causing the model to overfit the training data while failing to generalize to unseen examples. Furthermore, these optimization procedures typically display strong dependence on the fine structure of the loss landscape, leading to unstable training dynamics, due to the fractal-like nature of the loss surface. In this work, we propose an alternative optimizer that simultaneously reduces this dependence, and avoids sharp minima, thereby improving generalization. This is achieved by simulating the motion of the center of a ball rolling on the loss landscape. The degree to which our optimizer departs from the standard gradient descent is controlled by a hyperparameter, representing the radius of the ball. Changing this hyperparameter allows for probing the loss landscape at different scales, making it a valuable tool for understanding its geometry.
中文摘要:本文提出一种新颖的优化器,通过模拟球体在损失曲面上的滚动来降低对精细结构的敏感性并避开尖锐极小值,利用代表球体半径的可调超参数提升模型泛化能力。
English Summary: This paper introduces a novel optimizer that simulates a ball rolling on the loss landscape to reduce sensitivity to its fine structure and avoid sharp minima, thereby enhancing generalization through a tunable hyperparameter representing the ball's radius.

Authors:Jan Held, Renaud Vandeghen, Adrien Deliege, Abdullah Hamdi, Silvio Giancola, Anthony Cioppa, Andrea Vedaldi, Bernard Ghanem, Andrea Tagliasacchi, Marc Van Droogenbroeck
Title: Triangle Splatting for Real-Time Radiance Field Rendering
Abstract:
The field of computer graphics was revolutionized by models such as Neural Radiance Fields and 3D Gaussian Splatting, displacing triangles as the dominant representation for photogrammetry. In this paper, we argue for a triangle comeback. We develop a differentiable renderer that directly optimizes triangles via end-to-end gradients. We achieve this by rendering each triangle as differentiable splats, combining the efficiency of triangles with the adaptive density of representations based on independent primitives. Compared to popular 2D and 3D Gaussian Splatting methods, our approach achieves higher visual fidelity, faster convergence, and increased rendering throughput. On the Mip-NeRF360 dataset, our method outperforms concurrent non-volumetric primitives in visual fidelity and achieves higher perceptual quality than the state-of-the-art Zip-NeRF on indoor scenes. Triangles are simple, compatible with standard graphics stacks and GPU hardware, and highly efficient: for the Garden scene, we achieve over 2,400 FPS at 1280x720 resolution using an off-the-shelf mesh renderer. These results highlight the efficiency and effectiveness of triangle-based representations for high-quality novel view synthesis. Triangles bring us closer to mesh-based optimization by combining classical computer graphics with modern differentiable rendering frameworks. The project page is https://trianglesplatting.github.io/
中文: 本文提出了一种通过端到端梯度优化三角形的可微分渲染器,相比现有方法实现了更高的视觉保真度、更快的收敛速度和渲染性能,复兴了基于三角形的高质量新视角合成表示。
English: This paper introduces a differentiable renderer that optimizes triangles through end-to-end gradients, achieving higher visual fidelity, faster convergence, and superior rendering performance compared to current methods, revitalizing triangle-based representations for high-quality novel view synthesis.

Authors:Pooneh Mousavi, Shubham Gupta, Cem Subakan, Mirco Ravanelli
Title: LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs
Abstract:
Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.
Chinese: LiSTEN框架通过可学习的键值对动态提示选择策略,使大型语言模型适应音频任务,以更少参数实现优异性能,并通过提示分析增强可解释性。
English: The LiSTEN framework adapts large language models to audio tasks using dynamic prompt selection with learnable key-value pairs, achieving competitive performance with fewer parameters and enhanced interpretability through prompt analysis.

Authors:Yuting Huang, Ziquan Fang, Zhihao Zeng, Lu Chen, Yunjun Gao
Title: Causal Spatio-Temporal Prediction: An Effective and Efficient Multi-Modal Approach
Abstract:
Spatio-temporal prediction plays a crucial role in intelligent transportation, weather forecasting, and urban planning. While integrating multi-modal data has shown potential for enhancing prediction accuracy, key challenges persist: (i) inadequate fusion of multi-modal information, (ii) confounding factors that obscure causal relations, and (iii) high computational complexity of prediction models. To address these challenges, we propose E^2-CSTP, an Effective and Efficient Causal multi-modal Spatio-Temporal Prediction framework. E^2-CSTP leverages cross-modal attention and gating mechanisms to effectively integrate multi-modal data. Building on this, we design a dual-branch causal inference approach: the primary branch focuses on spatio-temporal prediction, while the auxiliary branch mitigates bias by modeling additional modalities and applying causal interventions to uncover true causal dependencies. To improve model efficiency, we integrate GCN with the Mamba architecture for accelerated spatio-temporal encoding. Extensive experiments on 4 real-world datasets show that E^2-CSTP significantly outperforms 9 state-of-the-art methods, achieving up to 9.66% improvements in accuracy as well as 17.37%-56.11% reductions in computational overhead.
中文: E²-CSTP框架通过跨模态注意力与门控机制融合多模态数据,采用双分支因果推断消除混杂因素,结合GCN-Mamba提升计算效率,在精度和效率上均显著优于现有方法。
English: The proposed E²-CSTP framework effectively enhances spatio-temporal prediction by integrating multi-modal data through cross-modal attention and gating mechanisms, employing causal inference to address confounding factors, and optimizing computational efficiency with GCN-Mamba integration, achieving superior accuracy and reduced overhead.

Authors:Qi Zhang, Shouqing Yang, Lirong Gao, Hao Chen, Xiaomeng Hu, Jinglei Chen, Jiexiang Wang, Sheng Guo, Bo Zheng, Haobo Wang, Junbo Zhao
Title: LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities in reasoning with the emergence of reasoning models like OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating reasoning capabilities into the realm of retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches, while the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module to mitigate the unawareness of intermediate reasoning steps in outcome-level supervision without additional annotation. Grounded on this, we propose Learning to Think-and-Search (LeTS), a novel framework that hybridizes stepwise process reward and outcome-based reward to current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of LeTS across various RAG benchmarks. In addition, these results reveal the potential of process- and outcome-level reward hybridization in boosting LLMs' reasoning ability via RL under other scenarios. The code will be released soon.
中文摘要:学习式思考与搜索(LeTS)框架将逐步的过程级奖励与结果级奖励融合到现有面向检索增强生成的强化学习方法中,以监督中间的思考与搜索步骤,实验表明其在多个RAG基准上具有良好的泛化能力和推理效率。
English Summary: The Learning to Think-and-Search (LeTS) framework hybridizes stepwise process rewards with outcome-based rewards in current RL methods for retrieval-augmented generation, supervising intermediate think-and-search steps without additional annotation and demonstrating generalization and inference efficiency across RAG benchmarks.

Authors:Yuanshao Zhu, James Jianqiao Yu, Xiangyu Zhao, Xiao Han, Qidong Liu, Xuetao Wei, Yuxuan Liang
Title: Learning Generalized and Flexible Trajectory Models from Omni-Semantic Supervision
Abstract:
The widespread adoption of mobile devices and data collection technologies has led to an exponential increase in trajectory data, presenting significant challenges in spatio-temporal data mining, particularly for efficient and accurate trajectory retrieval. However, existing methods for trajectory retrieval face notable limitations, including inefficiencies in large-scale data, lack of support for condition-based queries, and reliance on trajectory similarity measures. To address the above challenges, we propose OmniTraj, a generalized and flexible omni-semantic trajectory retrieval framework that integrates four complementary modalities or semantics -- raw trajectories, topology, road segments, and regions -- into a unified system. Unlike traditional approaches that are limited to computing and processing trajectories as a single modality, OmniTraj designs dedicated encoders for each modality, which are embedded and fused into a shared representation space. This design enables OmniTraj to support accurate and flexible queries based on any individual modality or combination thereof, overcoming the rigidity of traditional similarity-based methods. Extensive experiments on two real-world datasets demonstrate the effectiveness of OmniTraj in handling large-scale data, providing flexible, multi-modality queries, and supporting downstream tasks and applications.
中文:OmniTraj框架通过整合四种语义模态到统一系统中,解决了轨迹检索的局限性,支持灵活的多模态查询,并在大规模数据处理中展现出高效性。
English: The OmniTraj framework addresses limitations in trajectory retrieval by integrating four semantic modalities into a unified system, enabling flexible, multi-modal queries and demonstrating effectiveness in large-scale data handling.

Authors:Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Xiaojun Wu, Honghao Liu, Hui Xiong, Jian Guo
Title: Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning
Abstract:
A practical approach to activate long chain-of-thought reasoning ability in pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection still remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of the emergence of rethinking behaviors like self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate question difficulty and jointly incorporates a heuristic based on reasoning trace length through a weighted ranking scheme to prioritize high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning an LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and the open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight the scalability in varying data size, efficiency during inference, and its adaptability to other instruction pools with minimal cost.
中文: Select2Reason通过量化问题难度并结合推理长度启发式方法,从大规模指令集中筛选高质量长思维链样本,仅需10%数据微调即可在数学基准测试中达到或超越全数据训练效果。
English: Select2Reason is an efficient framework that selects high-quality long-chain-of-thought instructions by evaluating question difficulty and reasoning length, enabling models fine-tuned on just 10% of data to match or exceed full-data performance across multiple mathematical benchmarks.
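
The weighted ranking described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: difficulty scores and trace lengths are taken as given, both signals are converted to ranks in [0, 1] before mixing, and the names `rank_normalize`, `select_top_fraction`, `alpha`, and `fraction` are hypothetical.

```python
def rank_normalize(values):
    """Map each value to its rank scaled to [0, 1] (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    denom = max(len(values) - 1, 1)
    for r, i in enumerate(order):
        ranks[i] = r / denom
    return ranks

def select_top_fraction(difficulty, trace_len, alpha=0.5, fraction=0.1):
    """Return indices of the highest-scoring examples.

    score = alpha * difficulty_rank + (1 - alpha) * length_rank
    alpha and fraction are assumed hyperparameters for this sketch.
    """
    d = rank_normalize(difficulty)
    l = rank_normalize(trace_len)
    scores = [alpha * di + (1 - alpha) * li for di, li in zip(d, l)]
    k = max(1, int(len(scores) * fraction))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

With `fraction=0.1` the function keeps the top 10% of the pool, mirroring the 10% subset reported in the abstract.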

Authors:Xiaoxue Cheng, Junyi Li, Zhenduo Zhang, Xinyu Tang, Wayne Xin Zhao, Xinyu Kong, Zhiqiang Zhang
Title: Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning
Abstract:
Large reasoning models (LRMs) have demonstrated strong performance on complex reasoning tasks, but often suffer from overthinking, generating redundant content regardless of task difficulty. Inspired by the dual process theory in cognitive science, we propose Adaptive Cognition Policy Optimization (ACPO), a reinforcement learning framework that enables LRMs to achieve efficient reasoning through adaptive cognitive allocation and dynamic system switch. ACPO incorporates two key components: (1) introducing system-aware reasoning tokens to explicitly represent the thinking modes thereby making the model's cognitive process transparent, and (2) integrating online difficulty estimation and token length budget to guide adaptive system switch and reasoning during reinforcement learning. To this end, we propose a two-stage training strategy. The first stage begins with supervised fine-tuning to cold start the model, enabling it to generate reasoning paths with explicit thinking modes. In the second stage, we apply ACPO to further enhance adaptive system switch for difficulty-aware reasoning. Experimental results demonstrate that ACPO effectively reduces redundant reasoning while adaptively adjusting cognitive allocation based on task complexity, achieving efficient hybrid reasoning.
中文: 大型推理模型常存在过度思考问题,而提出的自适应认知策略优化(ACPO)框架通过基于任务难度自适应分配认知资源和动态切换思维模式,实现了高效推理。
English: Large reasoning models often overthink, but the proposed Adaptive Cognition Policy Optimization (ACPO) framework enables efficient reasoning by adaptively allocating cognitive resources and dynamically switching thinking modes based on task difficulty.

Authors:Liang-Yeh Shen, Shi-Xin Fang, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee
Title: Meta-PerSER: Few-Shot Listener Personalized Speech Emotion Recognition via Meta-learning
Abstract:
This paper introduces Meta-PerSER, a novel meta-learning framework that personalizes Speech Emotion Recognition (SER) by adapting to each listener's unique way of interpreting emotion. Conventional SER systems rely on aggregated annotations, which often overlook individual subtleties and lead to inconsistent predictions. In contrast, Meta-PerSER leverages a Model-Agnostic Meta-Learning (MAML) approach enhanced with Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates, enabling rapid adaptation with only a few labeled examples. By integrating robust representations from pre-trained self-supervised models, our framework first captures general emotional cues and then fine-tunes itself to personal annotation styles. Experiments on the IEMOCAP corpus demonstrate that Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios, highlighting its promise for personalized emotion recognition.
中文:Meta-PerSER是一种新颖的元学习框架,通过结合增强型MAML方法和预训练模型,针对每位听者的独特情感解读方式进行个性化语音情感识别,在IEMOCAP数据集上显著优于基线方法。
English: Meta-PerSER is a meta-learning framework that personalizes Speech Emotion Recognition by adapting to individual listeners' interpretation styles using enhanced MAML techniques and pre-trained models, achieving superior performance over baselines on the IEMOCAP corpus.

Authors:Shujun Liu, Siyuan Wang, Zejun Li, Jianxiang Wang, Cheng Zeng, Zhongyu Wei
Title: OViP: Online Vision-Language Preference Learning
Abstract:
Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. While recent approaches advance multi-modal Direct Preference Optimization (DPO) to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that fail to reflect actual model errors, limiting training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model's own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the trade-off between hallucination suppression and expressiveness. Experiments on hallucination and general benchmarks demonstrate that OViP effectively reduces hallucinations while preserving core multi-modal capabilities.
中文: 本研究提出的在线视觉语言偏好学习(OViP)框架通过动态构建模型自身幻觉输出的对比训练数据,并利用扩散模型合成负样本图像实现实时监督,在保持核心多模态能力的同时显著缓解了视觉语言模型的幻觉问题。
English: The proposed Online Vision-language Preference Learning (OViP) framework dynamically generates contrastive training data from the model's own hallucinations and synthesizes negative images to provide real-time supervision, effectively reducing visual-text misalignment while maintaining multimodal performance.

Authors:Yu-Xiang Luo, Yi-Cheng Lin, Ming-To Chuang, Jia-Hung Chen, I-Ning Tsai, Pei Xing Kiew, Yueh-Hsuan Huang, Chien-Feng Liu, Yu-Chen Chen, Bo-Han Feng, Wenze Ren, Hung-yi Lee
Title: ToxicTone: A Mandarin Audio Dataset Annotated for Toxicity and Toxic Utterance Tonality
Abstract:
Despite extensive research on toxic speech detection in text, a critical gap remains in handling spoken Mandarin audio. The lack of annotated datasets that capture the unique prosodic cues and culturally specific expressions in Mandarin leaves spoken toxicity underexplored. To address this, we introduce ToxicTone -- the largest public dataset of its kind -- featuring detailed annotations that distinguish both forms of toxicity (e.g., profanity, bullying) and sources of toxicity (e.g., anger, sarcasm, dismissiveness). Our data, sourced from diverse real-world audio and organized into 13 topical categories, mirrors authentic communication scenarios. We also propose a multimodal detection framework that integrates acoustic, linguistic, and emotional features using state-of-the-art speech and emotion encoders. Extensive experiments show our approach outperforms text-only and baseline models, underscoring the essential role of speech-specific cues in revealing hidden toxic expressions.
Chinese: 本研究推出了ToxicTone,这是最大的公开普通话语音毒性检测数据集,包含细致标注,并采用融合声学、语言和情感特征的多模态框架,其性能优于纯文本模型。
English: This study introduces ToxicTone, the largest public dataset for detecting toxic speech in Mandarin audio, featuring detailed annotations and a multimodal framework that integrates acoustic, linguistic, and emotional features to outperform text-only models.

Authors:Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee
Title: Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning Approach
Abstract:
While subgroup disparities and performance bias are increasingly studied in computational research, fairness in categorical Speech Emotion Recognition (SER) remains underexplored. Existing methods often rely on explicit demographic labels, which are difficult to obtain due to privacy concerns. To address this limitation, we introduce an Implicit Demography Inference (IDI) module that leverages pseudo-labeling from a pre-trained model and unsupervised learning using k-means clustering to mitigate bias in SER. Our experiments show that pseudo-labeling IDI reduces subgroup disparities, improving fairness metrics by over 28% with less than a 2% decrease in SER accuracy. Also, the unsupervised IDI yields more than a 4.6% improvement in fairness metrics with a drop of less than 3.6% in SER performance. Further analyses reveal that the unsupervised IDI consistently mitigates race and age disparities, demonstrating its potential when explicit demographic information is unavailable.
Chinese: 提出的隐式人口统计推断模块无需显式人口统计标签即可缓解语音情感识别中的子群差异:伪标签方案使公平性指标提升超过28%,识别准确率下降不到2%;基于k-means的无监督方案使公平性指标提升超过4.6%,识别性能下降不到3.6%。
English: The proposed Implicit Demography Inference module mitigates subgroup disparities in speech emotion recognition without explicit demographic labels: its pseudo-labeling variant improves fairness metrics by over 28% with less than a 2% drop in accuracy, and its unsupervised k-means variant improves them by more than 4.6% with a drop of less than 3.6%.

Authors:Zaid Abdullah, Mario R. Camana, Abuzar B. M. Adam, Chandan K. Sheemar, Eva Lagunas, Symeon Chatzinotas
Title: Swarm Intelligence Optimization of Multi-RIS Aided MmWave Beamspace MIMO
Abstract:
We investigate the performance of a multiple reconfigurable intelligent surface (RIS)-aided millimeter wave (mmWave) beamspace multiple-input multiple-output (MIMO) system with multiple users (UEs). We focus on a challenging scenario in which the direct links between the base station (BS) and all UEs are blocked, and communication is facilitated only via RISs. The maximum ratio transmission (MRT) is utilized for data precoding, while a low-complexity algorithm based on particle swarm optimization (PSO) is designed to jointly perform beam selection, power allocation, and RIS profile configuration. The proposed optimization approach demonstrates positive trade-offs between the complexity (in terms of running time) and the achievable sum rate. In addition, our results demonstrate that due to the sparsity of beamspace channels, increasing the number of unit cells (UCs) at RISs can lead to higher achievable rates than activating a larger number of beams at the MIMO BS.
中文: 本研究探讨了在基站与用户直接链路受阻时完全依赖可重构智能表面的毫米波波束空间多输入多输出系统,采用最大比传输预编码和基于粒子群优化的低复杂度算法联合优化波束选择、功率分配与RIS配置,在复杂度与速率间取得良好平衡,并证明增加RIS单元比激活更多基站波束更能提升系统速率。
English: This study explores a multi-RIS-assisted mmWave beamspace MIMO system where communication relies solely on RISs due to blocked direct links, employing MRT precoding and a PSO-based algorithm to optimize beam selection, power allocation, and RIS configuration, achieving a favorable complexity-rate trade-off and showing that increasing RIS unit cells enhances rates more effectively than adding beams at the BS.

Authors:Yingli Shen, Wen Lai, Shuo Wang, Ge Gao, Kangyang Luo, Alexander Fraser, Maosong Sun
Title: From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora
Abstract:
Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multiway parallel data consistently outperform those trained on unaligned multilingual data.
Chinese: 多语平行数据,如新推出的TED2025语料库,通过确保跨语言一致性显著提升大语言模型的多语言性能,在各项基准测试中均优于未对齐数据。
English: Multi-way parallel data, such as the introduced TED2025 corpus, significantly enhances large language models' multilingual performance by ensuring cross-lingual consistency, outperforming unaligned data across benchmarks.

Authors:Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, Keerthiram Murugesan, Yu Wang, Lifang He, Jianfeng Gao, Lichao Sun, Yanfang Ye
Title: EfficientLLM: Efficiency in Large Language Models
Abstract:
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency-throughput balance, and carbon cost. Evaluating over 100 model-technique pairs (0.5B-72B parameters), we derive three core insights: (i) Efficiency involves quantifiable trade-offs: no single method is universally optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by 40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5% accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers optimal memory-latency trade-offs for constrained devices, MLA achieves lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA efficiency only beyond 14B parameters. (iii) Techniques generalize across modalities: we extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides essential guidance for researchers and engineers navigating the efficiency-performance landscape of next-generation foundation models.
中文: 大型语言模型因参数和上下文窗口增长面临高昂成本,为此推出EfficientLLM基准,系统评估预训练架构、微调方法和推理优化的效率技术,揭示量化权衡与任务相关性规律,并验证技术在多模态模型中的可迁移性。
English: Large Language Models face prohibitive costs due to increasing parameters and context windows, prompting the introduction of EfficientLLM—a comprehensive benchmark evaluating efficiency techniques across architecture pretraining, fine-tuning, and inference, revealing quantifiable trade-offs and task-dependent optima while demonstrating cross-modal generalization.

Authors:Fabian Ritter-Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, Jeremy H. M Wong, Eng Siong Chng, Nancy F. Chen, Hung-yi Lee
Title: Distilling a speech and music encoder with task arithmetic
Abstract:
Despite the progress in self-supervised learning (SSL) for speech and music, existing models treat these domains separately, limiting their capacity for unified audio understanding. A unified model is desirable for applications that require general representations, e.g. audio large language models. Nonetheless, directly training a general model for speech and music is computationally expensive. Knowledge Distillation of teacher ensembles may be a natural solution, but we posit that decoupling the distillation of the speech and music SSL models allows for more flexibility. Thus, we propose to learn distilled task vectors and then linearly interpolate them to form a unified speech+music model. This strategy enables flexible domain emphasis through adjustable weights and is also simpler to train. Experiments on speech and music benchmarks demonstrate that our method yields superior overall performance compared to ensemble distillation.
中文: 本研究提出通过学习和插值蒸馏任务向量来构建统一的语音与音乐音频模型,该方法能灵活调整领域重点,并在性能上优于集成蒸馏。
English: This study introduces a method to create a unified audio model for speech and music by learning and interpolating distilled task vectors, offering flexible domain emphasis and improved performance over ensemble distillation.
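
The interpolation step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a task vector is the difference between a distilled student's weights and a shared base initialization, that weights are stored as a dict of flat float lists, and that a single weight `w` trades off speech against music. The names `task_vector` and `merge` are hypothetical.

```python
def task_vector(distilled, base):
    """Task vector = distilled weights minus base weights, per parameter."""
    return {k: [d - b for d, b in zip(distilled[k], base[k])] for k in base}

def merge(base, tau_speech, tau_music, w):
    """Unified weights = base + w * tau_speech + (1 - w) * tau_music.

    w in [0, 1] is an assumed knob: w=1 recovers the speech student,
    w=0 recovers the music student.
    """
    return {
        k: [b + w * s + (1 - w) * m
            for b, s, m in zip(base[k], tau_speech[k], tau_music[k])]
        for k in base
    }
```

Because the two task vectors are learned separately, changing the domain emphasis only requires re-running `merge` with a different `w`, with no further training.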

Authors:Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib
Title: AdaDim: Dimensionality Adaptation for SSL Representational Dynamics
Abstract:
A key factor in effective Self-Supervised learning (SSL) is preventing dimensional collapse, which is where higher-dimensional representation spaces span a lower-dimensional subspace. Therefore, SSL optimization strategies involve guiding a model to produce representations ($R$) with a higher dimensionality. Dimensionality is either optimized through a dimension-contrastive approach that encourages feature decorrelation or through a sample-contrastive method that promotes a uniform spread of sample representations. Both families of SSL algorithms also utilize a projection head that maps $R$ into a lower-dimensional embedding space $Z$. Recent work has characterized the projection head as a filter of irrelevant features from the SSL objective by reducing mutual information, $I(R;Z)$. Therefore, the current literature's view is that a good SSL representation space should have a high $H(R)$ and a low $I(R;Z)$. However, this view of the problem is lacking in terms of an understanding of the underlying training dynamics that influences both terms, as well as how the values of $H(R)$ and $I(R;Z)$ arrived at the end of training reflect the downstream performance of an SSL model. We address both gaps in the literature by demonstrating that increases in $H(R)$ due to feature decorrelation at the start of training lead to a higher $I(R;Z)$, while increases in $H(R)$ due to samples distributing uniformly in a high-dimensional space at the end of training cause $I(R;Z)$ to plateau or decrease. Furthermore, our analysis shows that the best performing SSL models do not have the highest $H(R)$ nor the lowest $I(R;Z)$, but arrive at an optimal intermediate point for both. We develop a method called AdaDim to exploit these observed training dynamics by adaptively weighting between losses based on feature decorrelation and uniform sample spread.
中文摘要:有效的自监督学习需要在表示空间的高特征多样性与投影头的信息过滤之间取得平衡,我们的自适应训练策略AdaDim无需复杂技术即可实现这一目标。
English Summary: Effective self-supervised learning requires balancing high feature diversity in the representation space with controlled information filtering through the projection head, which our adaptive training strategy AdaDim achieves without costly techniques.

Authors:Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib
Title: AdaDim: Dimensionality Adaptation for SSL Representational Dynamics
Abstract:
A key factor in effective self-supervised learning (SSL) is preventing dimensional collapse, where higher-dimensional representation spaces ($R$) span a lower-dimensional subspace. Therefore, SSL optimization strategies involve guiding a model to produce $R$ with a higher dimensionality ($H(R)$) through objectives that encourage decorrelation of features or sample uniformity in $R$. A higher $H(R)$ indicates that $R$ has greater feature diversity, which is useful for generalization to downstream tasks. Alongside dimensionality optimization, SSL algorithms also utilize a projection head that maps $R$ into an embedding space $Z$. Recent work has characterized the projection head as a filter of noisy or irrelevant features from the SSL objective by reducing the mutual information $I(R;Z)$. Therefore, the current literature's view is that a good SSL representation space should have a high $H(R)$ and a low $I(R;Z)$. However, this view of SSL lacks an understanding of the underlying training dynamics that influence the relationship between both terms. Our analysis shows that the best performing SSL models have neither the highest $H(R)$ nor the lowest $I(R;Z)$, but effectively arrive at a balance between both. To take advantage of this analysis, we introduce AdaDim, a training strategy that leverages SSL training dynamics by adaptively balancing between increasing $H(R)$ through feature decorrelation and sample uniformity as well as gradual regularization of $I(R;Z)$ as training progresses. We show performance improvements of up to 3% over common SSL baselines despite our method not utilizing expensive techniques such as queues, clustering, predictor networks, or student-teacher architectures.
中文摘要:有效的自监督学习需要在表示空间的高特征多样性与投影头的信息过滤之间取得平衡,我们的自适应训练策略AdaDim无需复杂技术即可实现这一目标。
English Summary: Effective self-supervised learning requires balancing high feature diversity in the representation space with controlled information filtering through the projection head, which our adaptive training strategy AdaDim achieves without costly techniques.

Authors:Tianxiong Zhong, Xingye Tian, Boyuan Jiang, Xuebo Wang, Xin Tao, Pengfei Wan, Zhiwei Zhang
Title: VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption
Abstract:
Modern video generation frameworks based on Latent Diffusion Models suffer from inefficiencies in tokenization due to the Frame-Proportional Information Assumption. Existing tokenizers provide fixed temporal compression rates, causing the computational cost of the diffusion model to scale linearly with the frame rate. The paper proposes the Duration-Proportional Information Assumption: the upper bound on the information capacity of a video is proportional to the duration rather than the number of frames. Based on this insight, the paper introduces VFRTok, a Transformer-based video tokenizer, that enables variable frame rate encoding and decoding through asymmetric frame rate training between the encoder and decoder. Furthermore, the paper proposes Partial Rotary Position Embeddings (RoPE) to decouple position and content modeling, which groups correlated patches into unified tokens. The Partial RoPE effectively improves content-awareness, enhancing the video generation capability. Benefiting from the compact and continuous spatio-temporal representation, VFRTok achieves competitive reconstruction quality and state-of-the-art generation fidelity while using only 1/8 tokens compared to existing tokenizers.
中文摘要:本文提出基于Transformer的视频分词器VFRTok,采用时长比例信息假设实现可变帧率编码,仅需1/8的令牌数量即可保持卓越的视频生成质量,同时引入部分旋转位置编码提升内容感知能力。
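English Summary: The paper introduces VFRTok, a Transformer-based video tokenizer that adopts the Duration-Proportional Information Assumption to enable variable frame rate encoding and decoding, using only 1/8 of the tokens required by existing tokenizers while maintaining competitive reconstruction quality and state-of-the-art generation fidelity; it also proposes Partial RoPE to improve content-awareness.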
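
The contrast between the two assumptions in the abstract above can be illustrated with a small token-count calculation. This is only a sketch: the function names, the temporal stride, and the per-frame and per-second token budgets are hypothetical values chosen for illustration, not figures from the paper.

```python
def frame_proportional_tokens(duration_s, fps, temporal_stride, tokens_per_latent_frame):
    """Token count when a tokenizer uses a fixed temporal compression rate.

    All parameters are illustrative assumptions, not values from the paper.
    """
    frames = int(duration_s * fps)
    latent_frames = frames // temporal_stride  # fixed temporal compression
    return latent_frames * tokens_per_latent_frame


def duration_proportional_tokens(duration_s, tokens_per_second):
    """Token count when the budget depends only on clip duration (assumed budget)."""
    return int(duration_s * tokens_per_second)


if __name__ == "__main__":
    for fps in (30, 60):
        fixed = frame_proportional_tokens(2, fps, temporal_stride=4, tokens_per_latent_frame=256)
        by_duration = duration_proportional_tokens(2, tokens_per_second=512)
        print(fps, fixed, by_duration)
```

Under these assumed settings, doubling the frame rate doubles the frame-proportional count, while the duration-proportional count is unchanged for the same clip length.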

Authors:Bonan Li, Zicheng Zhang, Songhua Liu, Weihao Yu, Xinchao Wang
Title: Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning
Abstract:
Visual instruction tuning aims to enable large language models to comprehend the visual world, with a pivotal challenge lying in establishing an effective vision-to-language projection. However, existing methods often grapple with the intractable trade-off between accuracy and efficiency. In this paper, we present LLaVA-Meteor, a novel approach designed to break this deadlock, equipped with a novel Top-Down Compression paradigm that strategically compresses visual tokens without compromising core information. Specifically, we construct a trainable Flash Global Fusion module based on efficient selective state space operators, which aligns the feature space while enabling each token to perceive holistic visual context and instruction preference at low cost. Furthermore, a local-to-single scanning manner is employed to effectively capture local dependencies, thereby enhancing the model's capability in vision modeling. To alleviate computational overhead, we explore a Visual-Native Selection mechanism that independently assesses token significance by both the visual and native experts, followed by aggregation to retain the most critical subset. Extensive experiments show that our approach reduces visual tokens by 75-95% while achieving comparable or superior performance across 12 benchmarks, significantly improving efficiency.
中文摘要:LLaVA-Meteor采用新型自上而下压缩范式,通过Flash全局融合模块和视觉-原生选择机制,在保持核心视觉信息的同时将视觉标记减少75-95%,在12个基准测试中实现相当或更优性能,成功解决了视觉指令调整中精度与效率的权衡难题。
English Summary: LLaVA-Meteor introduces a Top-Down Compression paradigm with Flash Global Fusion and Visual-Native Selection to drastically reduce visual tokens by 75-95% while maintaining or enhancing performance across benchmarks, effectively balancing accuracy and efficiency in visual instruction tuning.

Authors:Tairan Fu, Miguel González, Javier Conde, Elena Merino-Gómez, Pedro Reviriego
Title: Have Multimodal Large Language Models (MLLMs) Really Learned to Tell the Time on Analog Clocks?
Abstract:
Multimodal Large Language Models which can answer complex questions on an image struggle to tell the time on analog clocks. This is probably due to the lack of images with clocks at different times in their training set. In this work we explore this issue with one of the latest MLLMs: GPT-4.1 to understand why MLLMs fail to tell the time and whether fine-tuning can solve the problem. The results show how models are making progress in reading the time on analog clocks. But have they really learned to do it, or have they only learned patterns in their training datasets? In this work we put the models to the test with different clocks to illustrate the limitations of MLLMs to abstract and generalize.
中文: 多模态大语言模型因训练数据不足而难以识别模拟时钟时间,尽管微调有所改进,但其真正的泛化能力仍然有限。
English: Multimodal Large Language Models struggle to tell time on analog clocks due to insufficient training data, and while fine-tuning shows progress, their ability to truly generalize remains limited.

Authors:Brandon Smith, Mohamed Reda Bouadjenek, Tahsin Alamgir Kheya, Phillip Dawson, Sunil Aryal
Title: A Comprehensive Analysis of Large Language Model Outputs: Similarity, Diversity, and Bias
Abstract:
Large Language Models (LLMs) represent a major step toward artificial general intelligence, significantly advancing our ability to interact with technology. While LLMs perform well on Natural Language Processing tasks, such as translation, generation, code writing, and summarization, questions remain about their output similarity, variability, and ethical implications. For instance, how similar are texts generated by the same model? How does this compare across different models? And which models best uphold ethical standards? To investigate, we used 5,000 prompts spanning diverse tasks like generation, explanation, and rewriting. This resulted in approximately 3 million texts from 12 LLMs, including proprietary and open-source systems from OpenAI, Google, Microsoft, Meta, and Mistral. Key findings include: (1) outputs from the same LLM are more similar to each other than to human-written texts; (2) models like WizardLM-2-8x22b generate highly similar outputs, while GPT-4 produces more varied responses; (3) LLM writing styles differ significantly, with Llama 3 and Mistral showing higher similarity, and GPT-4 standing out for distinctiveness; (4) differences in vocabulary and tone underscore the linguistic uniqueness of LLM-generated content; (5) some LLMs demonstrate greater gender balance and reduced bias. These results offer new insights into the behavior and diversity of LLM outputs, helping guide future development and ethical evaluation.
中文: 大语言模型在输出上表现出显著的内部相似性,其中GPT-4更具多样性,同时展现出独特的语言风格和不同的伦理表现,为未来发展和评估提供了重要参考。
English: Large Language Models (LLMs) demonstrate significant output similarity within models, with GPT-4 showing greater variability, while exhibiting distinct linguistic styles and varying ethical performance, providing insights for future development and evaluation.

Authors:Jorge Quesada, Chen Zhou, Prithwijit Chowdhury, Mohammad Alotaibi, Ahmad Mustafa, Yusufjon Kumakov, Mohit Prabhushankar, Ghassan AlRegib
Title: A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior
Abstract:
Machine learning has taken a critical role in seismic interpretation workflows, especially in fault delineation tasks. However, despite the recent proliferation of pretrained models and synthetic datasets, the field still lacks a systematic understanding of the generalizability limits of these models across seismic data representing diverse geologic, acquisition and processing settings. Distributional shifts between data sources, limitations in fine-tuning strategies and labeled data accessibility, and inconsistent evaluation protocols all remain major roadblocks to deploying reliable models in real-world exploration. In this paper, we present the first large-scale benchmarking study explicitly designed to provide guidelines for domain shift strategies in seismic interpretation. Our benchmark spans over 200 combinations of model architectures, datasets and training strategies, across three datasets (synthetic and real) including FaultSeg3D, CRACKS, and Thebe. We systematically assess pretraining, fine-tuning, and joint training under varying domain shifts. Our analysis shows that common fine-tuning practices can lead to catastrophic forgetting, especially when source and target datasets are disjoint, and that larger models such as Segformer are more robust than smaller architectures. We also find that domain adaptation methods outperform fine-tuning when shifts are large, yet underperform when domains are similar. Finally, we complement segmentation metrics with a novel analysis based on fault characteristic descriptors, revealing how models absorb structural biases from training datasets. Overall, we establish a robust experimental baseline that provides insights into tradeoffs in current fault delineation workflows and highlights directions for building more generalizable and interpretable models.
中文: 本研究首次对地震断层解释的机器学习模型进行大规模基准测试,发现微调常导致灾难性遗忘,而领域自适应方法在数据差异大时表现优异但在相似领域效果欠佳,为提升跨地质环境模型泛化能力提供了重要指导。
English: This study conducts the first large-scale benchmarking analysis of machine learning models for seismic fault interpretation, revealing that fine-tuning often causes catastrophic forgetting while domain adaptation excels with significant data shifts but underperforms with similar domains, establishing key guidelines for improving model generalizability across diverse geological settings.

Authors:Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar V K, Rongzhi Zhang, Changhao Li, Ian Shu-Hei Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai
Title: MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
Abstract:
We introduce MLE-Dojo, a Gym-style framework for systematically training (via reinforcement learning), evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo's flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.
中文: MLE-Dojo 是一个基于200多个真实Kaggle挑战构建的交互式框架,支持通过强化学习迭代训练和评估自主大语言模型代理,尽管现有模型在长周期问题解决上仍有局限,但其灵活架构促进了工程任务的可扩展性与可复现性。
English: MLE-Dojo is a Gym-style framework designed for reinforcement learning and iterative improvement of autonomous LLM agents using real-world Kaggle challenges, enabling interactive experimentation and structured feedback while highlighting current models' limitations in long-horizon problem-solving.

Authors:Almoatssimbillah Saifaldawla, Eva Lagunas, Flor Ortiz, Abuzar B. M. Adam, Symeon Chatzinotas
Title: SmartUT: Receive Beamforming for Spectral Coexistence of NGSO Satellite Systems
Abstract:
In this paper, we investigate downlink co-frequency interference (CFI) mitigation in co-existing non-geostationary satellite orbit (NGSO) systems. Traditional mitigation techniques, such as Zero-forcing (ZF), place a null toward the directions of arrival (DOAs) of the interfering signals, but they suffer from high computational complexity due to matrix inversions and require knowledge of the channel state information (CSI). Furthermore, adaptive beamformers, such as sample matrix inversion (SMI)-based minimum variance, perform poorly when the available snapshots are limited. We propose a Mamba-based beamformer (MambaBF) that leverages an unsupervised deep learning (DL) approach and can be deployed on the user terminal (UT) antenna array to assist downlink beamforming and CFI mitigation, using only a limited number of available array snapshots as input and without CSI knowledge. Simulation results demonstrate that MambaBF consistently outperforms conventional beamforming techniques in mitigating interference and maximizing the signal-to-interference-plus-noise ratio (SINR), particularly under challenging conditions characterized by low SINR, limited snapshots, and imperfect CSI.
中文摘要:本文提出MambaBF波束成形器,这种基于无监督深度学习的方案能在非静止轨道卫星系统中仅利用有限快照且无需信道状态信息即可有效抑制下行同频干扰,在低信干噪比和有限样本等挑战性条件下性能优于传统方法。
English Summary: This paper introduces MambaBF, an unsupervised deep learning-based beamformer that effectively mitigates downlink co-frequency interference in NGSO systems using limited snapshots and without requiring channel state information, outperforming traditional methods under challenging conditions.

Authors:Zhihao Zeng, Ziquan Fang, Wei Shao, Lu Chen, Yunjun Gao
Title: FedTDP: A Privacy-Preserving and Unified Framework for Trajectory Data Preparation via Federated Learning
Abstract:
Trajectory data, which capture the movement patterns of people and vehicles over time and space, are crucial for applications like traffic optimization and urban planning. However, issues such as noise and incompleteness often compromise data quality, leading to inaccurate trajectory analyses and limiting the potential of these applications. While Trajectory Data Preparation (TDP) can enhance data quality, existing methods suffer from two key limitations: (i) they do not address data privacy concerns, particularly in federated settings where trajectory data sharing is prohibited, and (ii) they typically design task-specific models that lack generalizability across diverse TDP scenarios. To overcome these challenges, we propose FedTDP, a privacy-preserving and unified framework that leverages the capabilities of Large Language Models (LLMs) for TDP in federated environments. Specifically, we: (i) design a trajectory privacy autoencoder to secure data transmission and protect privacy, (ii) introduce a trajectory knowledge enhancer to improve model learning of TDP-related knowledge, enabling the development of TDP-oriented LLMs, and (iii) propose federated parallel optimization to enhance training efficiency by reducing data transmission and enabling parallel model training. Experiments on 6 real datasets and 10 mainstream TDP tasks demonstrate that FedTDP consistently outperforms 13 state-of-the-art baselines.
中文摘要:FedTDP框架通过联邦学习环境中的大语言模型,解决了轨迹数据准备中的隐私保护和通用性问题,在多个数据集和任务上展现出优越性能。
English Summary: The FedTDP framework addresses privacy and generalizability issues in trajectory data preparation by using large language models in federated settings, demonstrating superior performance across multiple datasets and tasks.

Authors:Hao Xu, Arbind Agrahari Baniya, Sam Well, Mohamed Reda Bouadjenek, Richard Dazeley, Sunil Aryal
Title: Action Spotting and Precise Event Detection in Sports: Datasets, Methods, and Challenges
Abstract:
Video event detection is central to modern sports analytics, enabling automated understanding of key moments for performance evaluation, content creation, and tactical feedback. While deep learning has significantly advanced tasks like Temporal Action Localization (TAL), Action Spotting (AS), and Precise Event Spotting (PES), existing surveys often overlook the fine-grained temporal demands and domain-specific challenges posed by sports. This survey first provides a clear conceptual distinction between TAL, AS, and PES, then introduces a methods-based taxonomy covering recent deep learning approaches for AS and PES, including feature-based pipelines, end-to-end architectures, and multimodal strategies. We further review benchmark datasets and evaluation protocols, identifying critical limitations such as reliance on broadcast-quality footage and lenient multi-label metrics that hinder real-world deployment. Finally, we outline open challenges and future directions toward more temporally precise, generalizable, and practical event spotting in sports video analysis.
中文摘要:本综述区分了体育视频分析中的时序动作定位、动作识别和精确事件识别,回顾了深度学习方法与数据集,指出了依赖广播素材等局限性,并提出了提升时间精度和实用性的未来研究方向。
English Summary: This survey distinguishes between temporal action localization, action spotting, and precise event spotting in sports video analysis, reviews deep learning methods and datasets, and identifies limitations like broadcast dependency while outlining future directions for improved temporal precision and practicality.

Authors:Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, Hongsheng Li
Title: WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
Abstract:
LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute tests on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks, Bolt.diy, OpenHands, and Aider, using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of this training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model.
中文: 本文提出了WebGen-Bench这一新颖基准,用于评估基于大语言模型的智能体从零创建多文件网站代码库的能力,最佳模型仅获得27.8%的通过率,同时证明通过特定训练数据可将性能提升至38.2%。
English: This paper introduces WebGen-Bench, a challenging benchmark for evaluating LLM-based agents' ability to generate multi-file website codebases, where the top-performing model achieved only 27.8% accuracy, and demonstrates that training on curated data can improve performance to 38.2%.

Authors:Aurora Rofena, Arianna Manchia, Claudia Lucia Piccolo, Bruno Beomonte Zobel, Paolo Soda, Valerio Guarrasi
Title: Lesion-Aware Generative Artificial Intelligence for Virtual Contrast-Enhanced Mammography in Breast Cancer
Abstract:
Contrast-Enhanced Spectral Mammography (CESM) is a dual-energy mammographic technique that improves lesion visibility through the administration of an iodinated contrast agent. It acquires both a low-energy image, comparable to standard mammography, and a high-energy image, which are then combined to produce a dual-energy subtracted image highlighting lesion contrast enhancement. While CESM offers superior diagnostic accuracy compared to standard mammography, its use entails higher radiation exposure and potential side effects associated with the contrast medium. To address these limitations, we propose Seg-CycleGAN, a generative deep learning framework for Virtual Contrast Enhancement in CESM. The model synthesizes high-fidelity dual-energy subtracted images from low-energy images, leveraging lesion segmentation maps to guide the generative process and improve lesion reconstruction. Building upon the standard CycleGAN architecture, Seg-CycleGAN introduces localized loss terms focused on lesion areas, enhancing the synthesis of diagnostically relevant regions. Experiments on the CESM@UCBM dataset demonstrate that Seg-CycleGAN outperforms the baseline in terms of PSNR and SSIM, while maintaining competitive MSE and VIF. Qualitative evaluations further confirm improved lesion fidelity in the generated images. These results suggest that segmentation-aware generative models offer a viable pathway toward contrast-free CESM alternatives.
中文: Seg-CycleGAN是一种基于病灶分割的生成式深度学习框架,能够从低能影像合成高质量的双能减影图像,在保持诊断性能的同时实现无对比剂的增强乳腺摄影。
English: Seg-CycleGAN is a deep learning framework that synthesizes virtual contrast-enhanced (dual-energy subtracted) mammography images from low-energy images, using lesion segmentation maps to improve lesion reconstruction and offering a path toward avoiding the added radiation exposure and contrast-agent side effects of CESM.

Authors:Mohammadali Mohammadi, Le-Nam Tran, Zahra Mobini, Hien Quoc Ngo, Michail Matthaiou
Title: Cell-Free Massive MIMO-Assisted SWIPT for IoT Networks
Abstract:
This paper studies cell-free massive multiple-input multiple-output (CF-mMIMO) systems that underpin simultaneous wireless information and power transfer (SWIPT) for separate information users (IUs) and energy users (EUs) in Internet of Things (IoT) networks. We propose a joint access point (AP) operation mode selection and power control design, wherein certain APs are designated for energy transmission to EUs, while others are dedicated to information transmission to IUs. The performance of the system, from both a spectral efficiency (SE) and energy efficiency (EE) perspective, is comprehensively analyzed. Specifically, we formulate two mixed-integer nonconvex optimization problems for maximizing the average sum-SE and EE, under realistic power consumption models and constraints on the minimum SE for individual IUs, the minimum harvested energy (HE) for individual EUs, and the maximum transmit power at each AP. The challenging optimization problems are solved using successive convex approximation (SCA) techniques. The proposed framework design is further applied to the average sum-HE maximization and energy harvesting fairness problems. Our numerical results demonstrate that the proposed joint AP operation mode selection and power control algorithm can achieve EE performance gains of up to $4$-fold and $5$-fold over random AP operation mode selection, with and without power control respectively.
中文: 本文针对支持无线信息和能量同步传输的无蜂窝大规模MIMO系统,提出了联合接入点工作模式选择与功率控制方案,通过连续凸近似优化实现了能效的显著提升。
English: This paper proposes a joint access point operation mode selection and power control design for cell-free massive MIMO systems supporting simultaneous wireless information and power transfer, achieving significant energy efficiency gains through successive convex approximation optimization.

Authors:Valerio Guarrasi, Klara Mogensen, Sara Tassinari, Sara Qvarlander, Paolo Soda
Title: Timing Is Everything: Finding the Optimal Fusion Points in Multimodal Medical Imaging
Abstract:
Multimodal deep learning harnesses diverse imaging modalities, such as MRI sequences, to enhance diagnostic accuracy in medical imaging. A key challenge is determining the optimal timing for integrating these modalities, specifically, identifying the network layers where fusion modules should be inserted. Current approaches often rely on manual tuning or exhaustive search, which are computationally expensive without any guarantee of converging to optimal results. We propose a sequential forward search algorithm that incrementally activates and evaluates candidate fusion modules at different layers of a multimodal network. At each step, the algorithm retrains from previously learned weights and compares validation loss to identify the best-performing configuration. This process systematically reduces the search space, enabling efficient identification of the optimal fusion timing without exhaustively testing all possible module placements. The approach is validated on two multimodal MRI datasets, each addressing different classification tasks. Our algorithm consistently identified configurations that outperformed unimodal baselines, late fusion, and a brute-force ensemble of all potential fusion placements. These architectures demonstrated superior accuracy, F-score, and specificity while maintaining competitive or improved AUC values. Furthermore, the sequential nature of the search significantly reduced computational overhead, making the optimization process more practical. By systematically determining the optimal timing to fuse imaging modalities, our method advances multimodal deep learning for medical imaging. It provides an efficient and robust framework for fusion optimization, paving the way for improved clinical decision-making and more adaptable, scalable architectures in medical AI applications.
中文: 所提出的顺序前向搜索算法通过逐步评估候选融合模块,有效确定多模态网络中的最佳融合时机,在显著降低计算成本的同时,在医学影像任务中超越了现有方法的性能表现。
English: The proposed sequential forward search algorithm efficiently identifies the optimal fusion timing in multimodal networks by incrementally evaluating candidate modules, significantly reducing computational costs while outperforming existing methods in medical imaging tasks.
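
The sequential forward search described in the abstract above can be sketched as a greedy loop over candidate fusion layers. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_validation_loss` is a hypothetical stand-in for "retrain from previously learned weights and measure validation loss", its numbers are made up, and the stopping rule (stop when no single addition lowers the loss) is an assumption.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def sequential_forward_search(
    candidates: Iterable[int],
    evaluate: Callable[[FrozenSet[int]], float],
) -> Tuple[FrozenSet[int], float]:
    """Greedily grow the set of active fusion layers while validation loss improves."""
    remaining = set(candidates)
    active: FrozenSet[int] = frozenset()
    best_loss = evaluate(active)  # baseline: no fusion module active
    while remaining:
        # Score every configuration that adds exactly one more fusion module.
        # sorted() makes ties deterministic (the lowest layer index wins).
        trials = [(evaluate(active | {layer}), layer) for layer in sorted(remaining)]
        trial_loss, chosen = min(trials)
        if trial_loss >= best_loss:
            break  # assumed stopping rule: no single addition helps
        active = active | {chosen}
        best_loss = trial_loss
        remaining.remove(chosen)
    return active, best_loss


def toy_validation_loss(active: FrozenSet[int]) -> float:
    """Hypothetical stand-in for retraining and validating one configuration."""
    gains = {1: 0.05, 2: 0.20, 3: 0.10, 4: 0.00}  # made-up per-layer benefit
    penalty = 0.08 * max(0, len(active) - 2)  # made-up cost for too many modules
    return 1.0 - sum(gains[layer] for layer in active) + penalty


if __name__ == "__main__":
    layers, loss = sequential_forward_search([1, 2, 3, 4], toy_validation_loss)
    print(sorted(layers), round(loss, 2))
```

With four candidate layers, this loop evaluates at most 4 + 3 + 2 + 1 configurations plus the baseline, compared with 2^4 for an exhaustive search over all placements, which is the kind of search-space reduction the abstract refers to.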

Authors:David Nazareno Campo, Javier Conde, Álvaro Alonso, Gabriel Huecas, Joaquín Salvachúa, Pedro Reviriego
Title: Real-time Spatial Retrieval Augmented Generation for Urban Environments
Abstract:
The proliferation of Generative Artificial Intelligence (AI), especially Large Language Models, presents transformative opportunities for urban applications through Urban Foundation Models. However, base models face limitations, as they only contain the knowledge available at the time of training, and updating them is both time-consuming and costly. Retrieval Augmented Generation (RAG) has emerged in the literature as the preferred approach for injecting contextual information into Foundation Models. It prevails over techniques such as fine-tuning, which are less effective in dynamic, real-time scenarios like those found in urban environments. However, traditional RAG architectures, based on semantic databases, knowledge graphs, structured data, or AI-powered web searches, do not fully meet the demands of urban contexts. Urban environments are complex systems characterized by large volumes of interconnected data, frequent updates, real-time processing requirements, security needs, and strong links to the physical world. This work proposes a real-time spatial RAG architecture that defines the necessary components for the effective integration of generative AI into cities, leveraging temporal and spatial filtering capabilities through linked data. The proposed architecture is implemented using FIWARE, an ecosystem of software components to develop smart city solutions and digital twins. The design and implementation are demonstrated through the use case of a tourism assistant in the city of Madrid. The use case serves to validate the correct integration of Foundation Models through the proposed RAG architecture.
中文: 本文提出了一种基于FIWARE的实时空间RAG架构,通过整合时空数据提升城市基础模型性能,并以马德里旅游助手案例验证了其有效性。
English: This paper introduces a real-time spatial RAG architecture using FIWARE to enhance urban foundation models by integrating temporal and spatial data, validated through a Madrid tourism assistant case study.
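The temporal and spatial filtering mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's FIWARE implementation: the record layout, the `filter_context` helper, the radius, and the freshness window are all assumptions chosen for illustration. The idea shown is only that candidate context is narrowed to nearby, recently updated entities before it is handed to the model.

```python
import math
from datetime import datetime, timedelta

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def filter_context(entities, lat, lon, now, radius_km=1.0, max_age=timedelta(hours=1)):
    """Keep entities that are both nearby and recently updated (hypothetical helper)."""
    return [
        e for e in entities
        if haversine_km(lat, lon, e["lat"], e["lon"]) <= radius_km
        and now - e["updated"] <= max_age
    ]

# Hypothetical records; names and coordinates are illustrative only.
now = datetime(2025, 6, 1, 12, 0)
entities = [
    {"name": "museum_queue", "lat": 40.4138, "lon": -3.6921, "updated": now - timedelta(minutes=5)},
    {"name": "stale_event", "lat": 40.4140, "lon": -3.6925, "updated": now - timedelta(days=2)},
    {"name": "far_away", "lat": 41.3874, "lon": 2.1686, "updated": now - timedelta(minutes=1)},
]
nearby = filter_context(entities, 40.4139, -3.6922, now)
# Text that would be prepended to the user's question as retrieved context.
prompt_context = "\n".join(e["name"] for e in nearby)
```

Only the first record survives both filters: the second is close but stale, and the third is fresh but hundreds of kilometres away.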

Authors:Alice Natalina Caragliano, Claudia Tacconi, Carlo Greco, Lorenzo Nibid, Edy Ippolito, Michele Fiore, Giuseppe Perrone, Sara Ramella, Paolo Soda, Valerio Guarrasi
Title: Multimodal Doctor-in-the-Loop: A Clinically-Guided Explainable Framework for Predicting Pathological Response in Non-Small Cell Lung Cancer
Abstract:
This study proposes a novel approach combining Multimodal Deep Learning with intrinsic eXplainable Artificial Intelligence techniques to predict pathological response in non-small cell lung cancer patients undergoing neoadjuvant therapy. Due to the limitations of existing radiomics and unimodal deep learning approaches, we introduce an intermediate fusion strategy that integrates imaging and clinical data, enabling efficient interaction between data modalities. The proposed Multimodal Doctor-in-the-Loop method further enhances clinical relevance by embedding clinicians' domain knowledge directly into the training process, guiding the model's focus gradually from broader lung regions to specific lesions. Results demonstrate improved predictive accuracy and explainability, providing insights into optimal data integration strategies for clinical applications.
中文摘要:本研究提出一种结合多模态深度学习与可解释人工智能的新方法,通过影像与临床数据的中层融合策略,并引入医生参与环节,显著提升了非小细胞肺癌新辅助治疗疗效预测的准确性与可解释性。
English Summary: This research introduces a multimodal deep learning framework with explainable AI that integrates imaging and clinical data through an intermediate fusion strategy, enhanced by clinician input, to more accurately predict treatment response in lung cancer patients while improving interpretability.

Authors:Elena Mulero Ayllón, Massimiliano Mantegna, Linlin Shen, Paolo Soda, Valerio Guarrasi, Matteo Tortora
Title: Can Foundation Models Really Segment Tumors? A Benchmarking Odyssey in Lung CT Imaging
Abstract:
Accurate lung tumor segmentation is crucial for improving diagnosis, treatment planning, and patient outcomes in oncology. However, the complexity of tumor morphology, size, and location poses significant challenges for automated segmentation. This study presents a comprehensive benchmarking analysis of deep learning-based segmentation models, comparing traditional architectures such as U-Net and DeepLabV3, self-configuring models like nnUNet, and foundation models like MedSAM and MedSAM 2. Evaluating performance across two lung tumor segmentation datasets, we assess segmentation accuracy and computational efficiency under various learning paradigms, including few-shot learning and fine-tuning. The results reveal that while traditional models struggle with tumor delineation, foundation models, particularly MedSAM 2, outperform them in both accuracy and computational efficiency. These findings underscore the potential of foundation models for lung tumor segmentation, highlighting their applicability in improving clinical workflows and patient outcomes.
中文摘要:本研究对肺部肿瘤分割的深度学习模型进行基准测试,发现基础模型如MedSAM 2在准确性和效率上均优于传统架构,凸显了其临床应用潜力。
English Summary: This study benchmarks deep learning models for lung tumor segmentation, finding that foundation models like MedSAM 2 surpass traditional architectures in accuracy and efficiency, highlighting their clinical potential.

Authors:Marco Salmè, Rosa Sicilia, Paolo Soda, Valerio Guarrasi
Title: Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages
Abstract:
The integration of artificial intelligence in healthcare has opened new horizons for improving medical diagnostics and patient care. However, challenges persist in developing systems capable of generating accurate and contextually relevant radiology reports, particularly in low-resource languages. In this study, we present a comprehensive benchmark to evaluate the performance of instruction-tuned Vision-Language Models (VLMs) in the specialized task of radiology report generation across three low-resource languages: Italian, German, and Spanish. Employing the LLaVA architectural framework, we conducted a systematic evaluation of pre-trained models utilizing general datasets, domain-specific datasets, and low-resource language-specific datasets. In light of the unavailability of models that possess prior knowledge of both the medical domain and low-resource languages, we analyzed various adaptations to determine the most effective approach for these contexts. The results revealed that language-specific models substantially outperformed both general and domain-specific models in generating radiology reports, emphasizing the critical role of linguistic adaptation. Additionally, models fine-tuned with medical terminology exhibited enhanced performance across all languages compared to models with generic knowledge, highlighting the importance of domain-specific training. We also explored the influence of the temperature parameter on the coherence of report generation, providing insights for optimal model settings. Our findings highlight the importance of tailored language and domain-specific training for improving the quality and accuracy of radiological reports in multilingual settings. This research not only advances our understanding of VLMs adaptability in healthcare but also points to significant avenues for future investigations into model tuning and language-specific adaptations.
中文摘要:本研究评估了指令调优视觉语言模型在意大利语、德语和西班牙语放射学报告生成中的表现,发现结合医学术语训练的语言专用模型显著优于通用模型,同时探索了温度参数对报告连贯性的优化作用。
English Summary: This study benchmarks instruction-tuned Vision-Language Models for radiology report generation in Italian, German, and Spanish, finding that language-specific models with medical terminology training significantly outperform general models while also analyzing optimal temperature settings for report coherence.

Authors:Daniele Molino, Francesco di Feola, Linlin Shen, Paolo Soda, Valerio Guarrasi
Title: Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation
Abstract:
Generative models have revolutionized Artificial Intelligence (AI), particularly in multimodal applications. However, adapting these models to the medical domain poses unique challenges due to the complexity of medical data and the stringent need for clinical accuracy. In this work, we introduce a framework specifically designed for multimodal medical data generation. By enabling the generation of multi-view chest X-rays and their associated clinical report, it bridges the gap between general-purpose vision-language models and the specialized requirements of healthcare. Leveraging the MIMIC-CXR dataset, the proposed framework shows superior performance in generating high-fidelity images and semantically coherent reports. Our quantitative evaluation reveals significant results in terms of FID and BLEU scores, showcasing the quality of the generated data. Notably, our framework achieves comparable or even superior performance compared to real data on downstream disease classification tasks, underlining its potential as a tool for medical research and diagnostics. This study highlights the importance of domain-specific adaptations in enhancing the relevance and utility of generative models for clinical applications, paving the way for future advancements in synthetic multimodal medical data generation.
中文摘要:本研究提出一个专门用于生成多模态医学数据的框架,在生成高保真胸部X光片和临床连贯报告方面表现卓越,其生成数据在疾病分类任务中甚至可与真实数据相媲美。
English Summary: This study introduces a specialized framework for generating multimodal medical data, demonstrating superior performance in producing high-fidelity chest X-rays and clinically coherent reports that rival real data in diagnostic applications.

Authors:Shengjie Ma, Xuhui Jiang, Chengjin Xu, Cehao Yang, Liyu Zhang, Jian Guo
Title: Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models
Abstract:
Large Language Models (LLMs) have achieved remarkable success but remain data-inefficient, especially when learning from small, specialized corpora with limited and proprietary data. Existing synthetic data generation methods for continue pre-training focus on intra-document content and overlook cross-document knowledge associations, limiting content diversity and depth. We propose Synthetic-on-Graph (SoG), a synthetic data generation framework that incorporates cross-document knowledge associations for efficient corpus expansion. SoG constructs a context graph by extracting entities and concepts from the original corpus, representing cross-document associations, and employing a graph walk strategy for knowledge-associated sampling. This enhances synthetic data diversity and coherence, enabling models to learn complex knowledge structures and handle rare knowledge. To further improve the quality of synthetic data, we integrate two complementary strategies, Chain-of-Thought (CoT) and Contrastive Clarifying (CC), to enhance both reasoning capability and discriminative power. Extensive experiments demonstrate that SoG surpasses state-of-the-art (SOTA) methods on multi-hop and domain-specific question answering, while achieving competitive performance on long-context reading comprehension. These results highlight the superior generalization ability of SoG. Our work advances the paradigm of synthetic data generation and offers practical solutions for efficient knowledge acquisition in LLMs, particularly for downstream tasks and domains with limited training data.
中文摘要:合成图谱(SoG)框架通过整合跨文档知识关联和质量提升策略,改进了大型语言模型的合成数据生成,在专业任务和泛化能力上展现出卓越性能。
English Summary: The Synthetic-on-Graph (SoG) framework enhances synthetic data generation for Large Language Models by incorporating cross-document knowledge associations and quality improvement strategies, demonstrating superior performance in specialized tasks and generalization.
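A minimal sketch of the context-graph idea described above, under stated assumptions: documents are linked when they share an extracted entity, and a random walk over those links samples a chain of related documents that could seed one synthetic example. The function names, the toy entity sets, and the plain uniform walk are illustrative assumptions, not the authors' sampling strategy.

```python
import random
from collections import defaultdict

def build_context_graph(doc_entities):
    """Link documents that share at least one entity."""
    by_entity = defaultdict(set)
    for doc, ents in doc_entities.items():
        for ent in ents:
            by_entity[ent].add(doc)
    graph = defaultdict(set)
    for docs in by_entity.values():
        for d in docs:
            graph[d] |= docs - {d}
    return graph

def graph_walk(graph, start, steps, rng):
    """Random walk that collects a chain of related, not-yet-visited documents."""
    path = [start]
    current = start
    for _ in range(steps):
        neighbours = sorted(graph[current] - set(path))
        if not neighbours:
            break  # dead end: no unvisited related document
        current = rng.choice(neighbours)
        path.append(current)
    return path

# Hypothetical corpus: each document maps to the entities extracted from it.
doc_entities = {
    "doc_a": {"insulin", "glucose"},
    "doc_b": {"glucose", "liver"},
    "doc_c": {"liver", "bile"},
    "doc_d": {"quasar"},
}
graph = build_context_graph(doc_entities)
path = graph_walk(graph, "doc_a", steps=3, rng=random.Random(0))
```

Here the walk links `doc_a` to `doc_c` through `doc_b`, even though `doc_a` and `doc_c` share no entity directly, which is the kind of cross-document association the abstract refers to.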

Authors:Xiaorui Wu, Xiaofeng Mao, Fei Li, Xin Zhang, Xuanhong Li, Chong Teng, Donghong Ji, Zhuang Li
Title: TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
Abstract:
Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework to systematically measure the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We further introduce TRIDENT, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311 examples, and TRIDENT-Edge, with 18,773 examples. Fine-tuning Llama 3.1-8B on TRIDENT-Edge demonstrates substantial improvements, achieving an average 14.29% reduction in Harm Score, and a 20% decrease in Attack Success Rate compared to the best-performing baseline model fine-tuned on the WildBreak dataset.
中文摘要:大型语言模型存在生成有害内容的安全隐患,为此我们提出TRIDENT自动化流程,从词汇多样性、恶意意图和越狱策略三个维度构建全面的安全对齐数据集,微调实验证明其能显著降低模型危害分数和攻击成功率。
English Summary: Large Language Models face safety risks from harmful content generation, prompting the development of TRIDENT—an automated pipeline creating comprehensive safety alignment datasets across lexical diversity, malicious intent, and jailbreak tactics, which significantly enhances model safety when fine-tuned.

Authors:Wenxuan Shi, Haochen Tan, Chuqiao Kuang, Xiaoguang Li, Xiaozhe Ren, Chen Zhang, Hanting Chen, Yasheng Wang, Lifeng Shang, Fisher Yu, Yunhe Wang
Title: Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning
Abstract:
Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing methods rely on static prompting rules or training with Wikipedia-based corpora and retrieval environments, limiting adaptability to the real-world web environment where ambiguity, conflicting evidence, and noise are prevalent. These constrained training settings hinder LLMs from learning to dynamically decide when and where to search, and how to adjust search depth and frequency based on informational demands. We define this missing capacity as Search Intensity Scaling (SIS): the emergent skill to intensify search efforts under ambiguous or conflicting conditions, rather than settling on overconfident, under-verified answers. To study SIS, we introduce WebPuzzle, the first dataset designed to foster information-seeking behavior in open-world internet environments. WebPuzzle consists of 24K training instances and 275 test questions spanning both wiki-based and open-web queries. Building on this dataset, we propose DeepDiver, a Reinforcement Learning (RL) framework that promotes SIS by encouraging adaptive search policies through exploration under a real-world open-web environment. Experimental results show that Pangu-7B-Reasoner empowered by DeepDiver achieves performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's training curriculum from cold-start supervised fine-tuning to a carefully designed RL phase, and show that its SIS capability generalizes from closed-form QA to open-ended tasks such as long-form writing. Our contributions advance adaptive information seeking in LLMs and provide a valuable benchmark and dataset for future research.
中文: 大语言模型在开放网络环境中因静态训练方法难以实现自适应信息检索,为此引入WebPuzzle数据集和DeepDiver强化学习框架,通过动态调整搜索强度来提升模型处理模糊冲突信息的能力。
English: Large language models struggle with adaptive information seeking in real-world web environments due to static training methods, prompting the development of WebPuzzle dataset and DeepDiver framework to enhance their ability to scale search intensity under ambiguous conditions.

Authors:Weihao Mao, Yang Lu, Yanqing Xu, Bo Ai, Octavia A. Dobre, Dusit Niyato
Title: Multi-Waveguide Pinching Antennas for ISAC
Abstract:
Recently, a novel flexible-antenna technology, called pinching antennas, has attracted growing academic interest. By inserting discrete dielectric materials, pinching antennas can be activated at arbitrary points along waveguides, allowing for flexible customization of large-scale path loss. This paper investigates a multi-waveguide pinching-antenna integrated sensing and communications (ISAC) system, where transmit pinching antennas (TPAs) and receive pinching antennas (RPAs) coordinate to simultaneously detect one potential target and serve one downlink user. We formulate a communication rate maximization problem subject to a radar signal-to-noise ratio (SNR) requirement, a transmit power budget, and the allowable movement region of the TPAs, by jointly optimizing TPA locations and transmit beamforming design. To address the non-convexity of the problem, we propose a novel fine-tuning approximation method to reformulate it into a tractable form, followed by a successive convex approximation (SCA)-based algorithm to obtain the solution efficiently. Extensive simulations validate both the system design and the proposed algorithm. Results show that the proposed method achieves near-optimal performance compared with the computationally intensive exhaustive search-based benchmark, and pinching-antenna ISAC systems exhibit a distinct communication-sensing trade-off compared with conventional systems.
中文摘要:本文研究多波导夹持天线通感一体化系统,通过联合优化发射天线位置和波束成形设计,提出精细调节近似方法和连续凸逼近算法,在满足感知信噪比要求的同时实现通信速率最大化,验证了该系统相比传统方案具有更显著的通感性能折衷。
English Summary: This paper proposes a multi-waveguide pinching-antenna ISAC system that coordinates transmit and receive antennas to simultaneously detect targets and serve users, introducing a fine-tuning approximation method and SCA-based algorithm to efficiently solve the joint optimization problem while demonstrating superior performance and a distinct communication-sensing trade-off.

Authors:Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, Hao Tan
Title: Test-Time Training Done Right
Abstract:
Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference. This fast weight, akin to recurrent states in RNNs, stores temporary memories of past tokens in the current sequence. Existing TTT methods struggled to show effectiveness in handling long-context data, due to their inefficiency on modern GPUs. The TTT layers in many of these approaches operate with extremely low FLOPs utilization (often <5%) because they deliberately apply small online minibatch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small minibatch implies fine-grained block-wise causal dependencies in the data, unsuitable for data beyond 1D ordered sequences, like sets or N-dimensional grids such as images or videos. In contrast, we pursue the opposite direction by using an extremely large chunk update, ranging from 2K to 1M tokens across tasks of varying modalities, which we refer to as Large Chunk Test-Time Training (LaCT). It improves hardware utilization by orders of magnitude, and more importantly, facilitates scaling of nonlinear state size (up to 40% of model parameters), hence substantially improving state capacity, all without requiring cumbersome and error-prone kernel implementations. It also allows easy integration of sophisticated optimizers, e.g., Muon, for online updates. We validate our approach across diverse modalities and tasks, including novel view synthesis with image sets, language models, and auto-regressive video diffusion. Our approach can scale up to a 14B-parameter AR video diffusion model on sequences up to 56K tokens. In our longest sequence experiment, we perform novel view synthesis with 1 million context length. We hope this work will inspire and accelerate new research in the field of long-context modeling and test-time training. Website: https://tianyuanzhang.com/projects/ttt-done-right
中文: 测试时训练(TTT)通过在推理过程中调整模型权重来处理上下文依赖,但现有方法因小批量更新而对长上下文数据效率低下,而我们的大块测试时训练(LaCT)采用大块更新,显著提高了硬件利用率、状态容量和跨任务可扩展性,无需复杂实现。
English: Test-Time Training (TTT) adapts model weights during inference to handle context dependencies, but existing methods are inefficient for long-context data due to small minibatch updates, whereas our Large Chunk Test-Time Training (LaCT) uses large chunk updates to improve hardware utilization, state capacity, and scalability across diverse tasks without complex implementations.

Authors:Mohamad Chehade, Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Dinesh Manocha, Hao Zhu, Amrit Singh Bedi
Title: Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time
Abstract:
Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign, an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.
中文摘要:SITAlign提出了一种推理时框架,通过最大化主要目标同时满足次要标准的阈值约束,有效弥补了人类偏好对齐中的不足,并展现出优于现有方法的性能表现。
English Summary: SITAlign introduces an inference-time framework that maximizes a primary objective while satisfying threshold constraints on secondary criteria, effectively bridging the gap in human preference alignment by incorporating satisficing strategies and demonstrating superior performance over existing methods.
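The satisficing rule itself (maximize a primary objective subject to thresholds on secondary criteria) can be sketched at the level of whole candidate responses. This is an illustration of the decision rule only, not SITAlign's decoding procedure: the candidate list, the score fields, the threshold value, and the fallback when no candidate is feasible are all assumptions.

```python
def satisficing_select(candidates, threshold):
    """Pick the most helpful candidate among those whose harmlessness meets
    the threshold; if none qualifies, fall back to the most harmless one."""
    feasible = [c for c in candidates if c["harmlessness"] >= threshold]
    if feasible:
        return max(feasible, key=lambda c: c["helpfulness"])
    return max(candidates, key=lambda c: c["harmlessness"])

# Hypothetical reward-model scores for three sampled responses.
candidates = [
    {"text": "A", "helpfulness": 0.9, "harmlessness": 0.2},
    {"text": "B", "helpfulness": 0.7, "harmlessness": 0.8},
    {"text": "C", "helpfulness": 0.5, "harmlessness": 0.95},
]
choice = satisficing_select(candidates, threshold=0.6)
```

With a threshold of 0.6, response A is excluded despite being the most helpful, and B is chosen over C because harmlessness only needs to be good enough rather than maximal. A weighted-sum objective could instead pick A or C depending on the weights, which is the contrast the abstract draws with multi-objective optimization.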

Authors:Xiaorui Wu, Xiaofeng Mao, Xin Zhang, Fei Li, Chong Teng, Yuxiang Peng, Li Zheng, Donghong Ji, Zhuang Li
Title: EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions
Abstract:
Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries triggering unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, like manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions consistently eliciting confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm exploring the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize the evidence lower bound on LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with a 140.41% higher average refusal triggering rate across 9 LLMs, 34.86% greater lexical diversity, and 40.03% improved LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training. LLAMA3.1-8B-INSTRUCT fine-tuned with supervision on EVOREFUSE-ALIGN achieves up to 14.31% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals that models trigger over-refusals by overly focusing on sensitive keywords while ignoring broader context.
中文摘要:大语言模型常因保守的安全对齐过度拒绝无害指令,为此我们提出EVOREFUSE进化优化方法,能生成多样化的伪恶意指令,在保持安全性的同时有效评估并减少过度拒绝问题。
English Summary: Large language models often over-refuse harmless queries due to conservative safety alignment, so we developed EVOREFUSE, an evolutionary prompt optimization method that generates diverse pseudo-malicious instructions to better evaluate and reduce unnecessary refusals while maintaining safety.

Authors:Youjun Chen, Xurong Xie, Haoning Xu, Mengzhe Geng, Guinan Li, Chengxi Deng, Huimeng Wang, Shujie Hu, Xunying Liu
Title: Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition
Abstract:
This paper presents a novel end-to-end LLM-empowered explainable speech emotion recognition (SER) approach. Fine-grained speech emotion descriptor (SED) features, e.g., pitch, tone and emphasis, are disentangled from HuBERT SSL representations via alternating LLM fine-tuning on joint SER-SED prediction and ASR tasks. VAE compressed HuBERT features derived via Information Bottleneck (IB) are used to adjust feature granularity. Experiments on the IEMOCAP and MELD benchmarks demonstrate that our approach consistently outperforms comparable LLaMA-based SER baselines, including those using either (a) alternating multi-task fine-tuning alone or (b) feature disentanglement only. Statistically significant increases in SER unweighted accuracy of up to 4.0% and 3.7% absolute (5.4% and 6.6% relative) are obtained. More importantly, emotion descriptors offer further explainability for SER.
中文: 本文提出一种端到端的LLM赋能可解释语音情感识别方法,通过细粒度特征解耦与联合优化,在提升识别准确率的同时利用情感描述符增强模型可解释性。
English: This paper introduces an end-to-end LLM-based explainable speech emotion recognition method that enhances accuracy and interpretability by jointly optimizing emotion recognition and descriptor prediction through fine-grained feature disentanglement and compression.

Authors:Mingchen Shao, Xinfa Zhu, Chengyou Wang, Bingshen Mu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie
Title: Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR
Abstract:
Despite remarkable achievements, automatic speech recognition (ASR) in low-resource scenarios still faces two challenges: high-quality data scarcity and high computational demands. This paper proposes EThai-ASR, the first to apply large language models (LLMs) to Thai ASR and create an efficient LLM-based ASR system. EThai-ASR comprises a speech encoder, a connection module and a Thai LLM decoder. To address the data scarcity and obtain a powerful speech encoder, EThai-ASR introduces a self-evolving data refinement strategy to refine weak labels, yielding an enhanced speech encoder. Moreover, we propose a pluggable sequence compression module used in the connection module with three modes designed to reduce the sequence length, thus decreasing computational demands while maintaining decent performance. Extensive experiments demonstrate that EThai-ASR has achieved state-of-the-art accuracy in multiple datasets. We release our refined text transcripts to promote further research.
中文: 本文提出首个基于大语言模型的泰语语音识别系统EThai-ASR,通过自演进数据优化策略解决数据稀缺问题,并采用可插拔序列压缩模块降低计算需求,在多个数据集上实现了最优性能。
English: This paper introduces EThai-ASR, the first LLM-based Thai speech recognition system that tackles data scarcity through self-evolving data refinement and reduces computational demands with a pluggable sequence compression module, achieving state-of-the-art results.

Authors:Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal
Title: EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
Abstract:
Recent approaches to 3D camera control in video diffusion models (VDMs) often create anchor videos to guide diffusion models as a structured prior by rendering from estimated point clouds following annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos. Moreover, the requirement for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment, eliminates the need for camera trajectory annotations, and thus can be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor video guidance in visible regions to pretrained VDMs, with less than 1% of backbone model parameters. By combining the proposed anchor video data and ControlNet module, EPiC achieves efficient training with substantially fewer parameters, training steps, and less data, without requiring modifications to the diffusion model backbone typically needed to mitigate rendering misalignments. Although trained on masking-based anchor videos, our method generalizes robustly to anchor videos made with point clouds during inference, enabling precise 3D-informed camera control. EPiC achieves SOTA performance on RealEstate10K and MiraData for the I2V camera control task, demonstrating precise and robust camera control ability both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.
Chinese: EPiC提出了一种无需昂贵相机轨迹标注的高效框架,通过基于首帧可见性对源视频进行掩码来生成高质量锚点视频,并结合轻量级Anchor-ControlNet模块,在视频扩散模型中实现精确的3D相机控制,以最少的资源达到了最先进的性能。
English: EPiC introduces an efficient framework that generates high-quality anchor videos without costly camera annotations by masking source videos and integrates a lightweight Anchor-ControlNet module to enable precise 3D camera control in video diffusion models, achieving state-of-the-art performance with minimal resources.

Authors:Zhaoqing Li, Haoning Xu, Zengrui Jin, Lingwei Meng, Tianzi Wang, Huimeng Wang, Youjun Chen, Mingyu Cui, Shujie Hu, Xunying Liu
Title: Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision
Abstract:
Model compression has become an emerging need as the sizes of modern speech systems rapidly increase. In this paper, we study model weight quantization, which directly reduces the memory footprint to accommodate computationally resource-constrained applications. We propose novel approaches to perform extremely low-bit (i.e., 2-bit and 1-bit) quantization of Conformer automatic speech recognition systems using multiple precision model co-training, stochastic precision, and tensor-wise learnable scaling factors to alleviate quantization-incurred performance loss. The proposed methods can achieve performance-lossless 2-bit and 1-bit quantization of Conformer ASR systems trained with the 300-hr Switchboard and 960-hr LibriSpeech corpora. Maximum overall performance-lossless compression ratios of 16.2 and 16.6 times are achieved without a statistically significant increase in the word error rate (WER) over the full precision baseline systems, respectively.
Chinese: 本文提出了实现Conformer自动语音识别系统无损性能的2位和1位量化新方法,在无显著词错误率增加的情况下,压缩比最高可达16.6倍。
English: This paper introduces novel methods for achieving performance-lossless 2-bit and 1-bit quantization of Conformer ASR systems, enabling compression ratios up to 16.6 times without significant word error rate increases.
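As a rough illustration of what 1-bit weight quantization with a tensor-wise scaling factor means, the sketch below stores each weight as a sign and keeps one shared scale per tensor. It makes a simplifying assumption: the scale is set in closed form to the mean absolute weight, a common choice in binary-network work, whereas the paper learns its scaling factors and adds co-training and stochastic precision, none of which is shown here.

```python
def binarize(weights):
    """Quantize a flat weight tensor to 1 bit per weight plus one shared scale.

    Assumption: the scale is the mean absolute value, not a learned parameter.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs

def dequantize(scale, signs):
    """Reconstruct approximate weights from the signs and the shared scale."""
    return [scale * s for s in signs]

# Toy tensor; values are illustrative only.
weights = [0.5, -1.5, 1.0, -1.0]
scale, signs = binarize(weights)
approx = dequantize(scale, signs)
```

Every reconstructed weight has the same magnitude, so all information about relative magnitudes within the tensor is lost. That loss is the quantization-incurred degradation the training techniques in the abstract are meant to offset.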

Authors:Adeela Islam, Stefano Fiorini, Stuart James, Pietro Morerio, Alessio Del Bue
Title: ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction
Abstract:
The task of reassembly is a significant challenge across multiple domains, including archaeology, genomics, and molecular docking, requiring the precise placement and orientation of elements to reconstruct an original structure. In this work, we address key limitations in state-of-the-art Deep Learning methods for reassembly, namely i) scalability; ii) multimodality; and iii) real-world applicability: beyond square or simple geometric shapes, realistic and complex erosion, or other real-world problems. We propose ReassembleNet, a method that reduces complexity by representing each input piece as a set of contour keypoints and learning to select the most informative ones using techniques inspired by Graph Neural Network pooling. ReassembleNet effectively lowers computational complexity while enabling the integration of features from multiple modalities, including both geometric and texture data. The model is further enhanced through pretraining on a semi-synthetic dataset. We then apply diffusion-based pose estimation to recover the original structure. We improve on prior methods by 55% and 86% for RMSE Rotation and Translation, respectively.
中文摘要:本研究提出ReassembleNet方法,通过轮廓关键点和图神经网络解决重组任务中的可扩展性、多模态及实际应用限制,在旋转和平移精度上分别提升55%和86%。
English Summary: This study introduces ReassembleNet, a deep learning method that overcomes scalability, multimodality, and real-world applicability limitations in reassembly tasks by using contour keypoints and graph neural networks, achieving 55% and 86% improvements in rotation and translation accuracy.

Authors:Kai Chen, Taihang Zhen, Hewei Wang, Kailai Liu, Xinfeng Li, Jing Huo, Tianpei Yang, Jinfeng Xu, Wei Dong, Yang Gao
Title: MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems
Abstract:
As large language models (LLMs) are increasingly deployed in healthcare, ensuring their safety, particularly within collaborative multi-agent configurations, is paramount. In this paper we introduce MedSentry, a benchmark comprising 5 000 adversarial medical prompts spanning 25 threat categories with 100 subthemes. Coupled with this dataset, we develop an end-to-end attack-defense evaluation pipeline to systematically analyze how four representative multi-agent topologies (Layers, SharedPool, Centralized, and Decentralized) withstand attacks from 'dark-personality' agents. Our findings reveal critical differences in how these architectures handle information contamination and maintain robust decision-making, exposing their underlying vulnerability mechanisms. For instance, SharedPool's open information sharing makes it highly susceptible, whereas Decentralized architectures exhibit greater resilience thanks to inherent redundancy and isolation. To mitigate these risks, we propose a personality-scale detection and correction mechanism that identifies and rehabilitates malicious agents, restoring system safety to near-baseline levels. MedSentry thus furnishes both a rigorous evaluation framework and practical defense strategies that guide the design of safer LLM-based multi-agent systems in medical domains.
中文摘要:MedSentry提出了一套基准测试和评估体系,用于检验医疗领域多智能体大语言模型系统的安全性,揭示了不同架构的脆弱性并提出检测修正机制以提升防护能力。
English Summary: MedSentry introduces a benchmark and evaluation pipeline to assess the safety of multi-agent LLM systems in healthcare, revealing vulnerabilities across different topologies and proposing a detection-correction mechanism for enhanced security.

Authors:Jingze Ding, Zijian Zhou, Xiaodan Shao, Bingli Jiao, Rui Zhang
Title: Polarforming for Wireless Networks: Opportunities and Challenges
Abstract:
Polarforming emerges as a promising technique for manipulating the polarization of electromagnetic (EM) waves by shaping the polarization of an antenna into a desired state. By dynamically adjusting antenna polarization, polarforming enables real-time polarization matching or mismatching with received EM waves, thereby leveraging polarization degrees of freedom (DoFs) to enhance wireless communication performance. In this article, we first present an overview of the fundamental principles and design approaches underlying the polarforming technique. We then analyze the key advantages of polarforming, including hardware cost reduction, depolarization mitigation, channel adaptation, signal power enhancement, and interference suppression. Furthermore, we explore promising applications of polarforming for next-generation wireless networks. Numerical case studies demonstrate the substantial performance gains of polarforming over conventional fixed-polarization antenna (FPA) systems, along with a discussion of implementation challenges to motivate future research.
中文摘要:极化成形是一种通过动态调整天线极化状态来优化无线通信性能的新兴技术,它能够增强信号功率、抑制干扰并改善信道适应性,相比传统固定极化天线系统展现出显著优势。
English Summary: Polarforming is an innovative technique that dynamically adjusts antenna polarization to optimize wireless communication by enhancing signal power, reducing interference, and improving channel adaptation, as demonstrated through performance gains over traditional systems.
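
The polarization matching and mismatching mentioned in the abstract can be illustrated with the standard polarization match factor between two Jones vectors. The following is a minimal sketch, not the article's design method: the function name `polarization_match` and the example vectors are illustrative assumptions, and the circular vectors are named only as opposite-handed because handedness labels depend on convention.

```python
import math


def polarization_match(antenna, wave):
    """Return |<antenna, wave>|^2 / (|antenna|^2 |wave|^2), a value in [0, 1].

    Both arguments are 2-element Jones vectors (horizontal, vertical components).
    1.0 means perfectly matched polarization, 0.0 means orthogonal polarization.
    """
    inner = sum(complex(a).conjugate() * complex(w) for a, w in zip(antenna, wave))
    norm_a = sum(abs(complex(a)) ** 2 for a in antenna)
    norm_w = sum(abs(complex(w)) ** 2 for w in wave)
    return abs(inner) ** 2 / (norm_a * norm_w)


# Illustrative polarization states (hypothetical example values).
horizontal = [1, 0]
vertical = [0, 1]
diagonal = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # 45-degree linear
circ_a = [1, -1j]  # one circular handedness
circ_b = [1, 1j]   # the opposite circular handedness

print(polarization_match(horizontal, vertical))  # orthogonal linear states
print(polarization_match(horizontal, diagonal))  # partially matched
print(polarization_match(circ_a, circ_b))        # opposite-handed circular states
```

A fixed-polarization antenna is stuck with whatever factor the channel produces, whereas an antenna that can reshape its polarization state could steer this factor toward 1 for the desired signal or toward 0 for an interferer.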

Authors:Haixin Wang, Ruoyan Li, Fred Xu, Fang Sun, Kaiqiao Han, Zijie Huang, Guancheng Wan, Ching Chang, Xiao Luo, Wei Wang, Yizhou Sun
Title: FD-Bench: A Modular and Fair Benchmark for Data-driven Fluid Simulation
Abstract:
Data-driven modeling of fluid dynamics has advanced rapidly with neural PDE solvers, yet a fair and strong benchmark remains fragmented due to the absence of unified PDE datasets and standardized evaluation protocols. Although architectural innovations are abundant, fair assessment is further impeded by the lack of clear disentanglement between spatial, temporal and loss modules. In this paper, we introduce FD-Bench, the first fair, modular, comprehensive and reproducible benchmark for data-driven fluid simulation. FD-Bench systematically evaluates 85 baseline models across 10 representative flow scenarios under a unified experimental setup. It provides four key contributions: (1) a modular design enabling fair comparisons across spatial, temporal, and loss function modules; (2) the first systematic framework for direct comparison with traditional numerical solvers; (3) fine-grained generalization analysis across resolutions, initial conditions, and temporal windows; and (4) a user-friendly, extensible codebase to support future research. Through rigorous empirical studies, FD-Bench establishes the most comprehensive leaderboard to date, resolving long-standing issues in reproducibility and comparability, and laying a foundation for robust evaluation of future data-driven fluid models. The code is open-sourced at https://anonymous.4open.science/r/FD-Bench-15BC.
中文: FD-Bench推出了首个公平、模块化且全面的数据驱动流体模拟基准,通过系统评估10种流动场景下的85个模型,解决了可复现性问题并建立了稳健的评估体系。
English: FD-Bench introduces the first fair, modular, and comprehensive benchmark for data-driven fluid simulation, systematically evaluating 85 models across 10 flow scenarios to resolve reproducibility issues and establish a robust evaluation framework.

Authors:Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang
Title: Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
Abstract:
Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs -- the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency and demonstrate that high levels are achievable (0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions include detailing the benefits of prioritizing consistency; providing theoretical grounding and synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and extending these findings to real-world LLM data, where high feature consistency strongly correlates with the semantic similarity of learned feature explanations. We call for a community-wide shift towards systematically measuring feature consistency to foster robust cumulative progress in MI.
中文: 本立场文件主张在稀疏自编码器中优先考虑特征一致性,以确保在不同训练运行中可靠地收敛到等效特征集,提出配对字典平均相关系数作为实用指标,并证明其在合成和真实大语言模型数据中均能实现高一致性和语义相似性。
English: This position paper advocates for prioritizing feature consistency in Sparse Autoencoders to ensure reliable convergence to equivalent feature sets across training runs, proposing the Pairwise Dictionary Mean Correlation Coefficient as a practical metric and demonstrating its effectiveness in achieving high consistency and semantic similarity in both synthetic and real-world LLM data.
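
To make the consistency metric concrete, here is a minimal sketch of a pairwise mean correlation coefficient between learned dictionaries. It assumes that similarity is the absolute cosine similarity between feature vectors and that features are matched one-to-one by an optimal assignment; the paper's exact definition of PW-MCC may differ. The names `mcc` and `pairwise_mcc` are hypothetical, and the brute-force search over permutations is only practical for tiny toy dictionaries.

```python
import itertools
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mcc(dict_a, dict_b):
    """Mean absolute cosine similarity under the best one-to-one feature matching.

    Assumes both dictionaries hold the same number of features.
    """
    k = len(dict_a)
    sim = [[abs(cosine(a, b)) for b in dict_b] for a in dict_a]
    best_total = max(
        sum(sim[i][perm[i]] for i in range(k))
        for perm in itertools.permutations(range(k))
    )
    return best_total / k


def pairwise_mcc(dictionaries):
    """Average MCC over all unordered pairs of training runs."""
    pairs = list(itertools.combinations(dictionaries, 2))
    return sum(mcc(a, b) for a, b in pairs) / len(pairs)


# Toy dictionaries from three hypothetical training runs.
run1 = [[1.0, 0.0], [0.0, 1.0]]
run2 = [[0.0, 2.0], [-3.0, 0.0]]  # run1 permuted, rescaled, and sign-flipped
run3 = [[1.0, 1.0], [1.0, -1.0]]  # a rotated feature basis

print(mcc(run1, run2))
print(pairwise_mcc([run1, run2, run3]))
```

Because the score is invariant to permutation, scaling, and sign, `run1` and `run2` count as the same feature set, while the rotated basis in `run3` lowers the average.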

Authors:Ziyi Zhang, Li Shen, Deheng Ye, Yong Luo, Huangxuan Zhao, Lefei Zhang
Title: Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning
Abstract:
Text-to-multiview (T2MV) generation, which produces coherent multiview images from a single text prompt, remains computationally intensive, while accelerated T2MV methods using few-step diffusion models often sacrifice image fidelity and view consistency. To address this, we propose a novel reinforcement learning (RL) finetuning framework tailored for few-step T2MV diffusion models to jointly optimize per-view fidelity and cross-view consistency. Specifically, we first reformulate T2MV denoising across all views as a single unified Markov decision process, enabling multiview-aware policy optimization driven by a joint-view reward objective. Next, we introduce ZMV-Sampling, a test-time T2MV sampling technique that adds an inversion-denoising pass to reinforce both viewpoint and text conditioning, resulting in improved T2MV generation at the cost of inference time. To internalize its performance gains into the base sampling policy, we develop MV-ZigAL, a novel policy optimization strategy that uses reward advantages of ZMV-Sampling over standard sampling as learning signals for policy updates. Finally, noting that the joint-view reward objective under-optimizes per-view fidelity but naively optimizing single-view metrics neglects cross-view alignment, we reframe RL finetuning for T2MV diffusion models as a constrained optimization problem that maximizes per-view fidelity subject to an explicit joint-view constraint, thereby enabling more efficient and balanced policy updates. By integrating this constrained optimization paradigm with MV-ZigAL, we establish our complete RL finetuning framework, referred to as MVC-ZigAL, which effectively refines the few-step T2MV diffusion baseline in both fidelity and consistency while preserving its few-step efficiency.
Chinese: 本研究提出了一种名为MVC-ZigAL的强化学习微调框架,通过约束策略优化和创新的采样技术,优化少步文本到多视图扩散模型,在保持效率的同时显著提升图像质量和视图一致性。
English: This study introduces a reinforcement learning finetuning framework called MVC-ZigAL that optimizes few-step text-to-multiview diffusion models to enhance both image fidelity and view consistency through constrained policy optimization and a novel sampling technique.

Authors:Juntong Wang, Xiyuan Wang, Muhan Zhang
Title: OCN: Effectively Utilizing Higher-Order Common Neighbors for Better Link Prediction
Abstract:
Common Neighbors (CNs) and their higher-order variants are important pairwise features widely used in state-of-the-art link prediction methods. However, existing methods often struggle with the repetition across different orders of CNs and fail to fully leverage their potential. We identify that these limitations stem from two key issues: redundancy and over-smoothing in high-order common neighbors. To address these challenges, we design orthogonalization to eliminate redundancy between different-order CNs and normalization to mitigate over-smoothing. By combining these two techniques, we propose Orthogonal Common Neighbor (OCN), a novel approach that significantly outperforms the strongest baselines by an average of 7.7% on popular link prediction benchmarks. A thorough theoretical analysis is provided to support our method. Ablation studies also verify the effectiveness of our orthogonalization and normalization techniques.
Chinese: 提出的正交共同邻居(OCN)方法通过正交化和归一化技术解决了共同邻居特征中的冗余和平滑过度问题,在链接预测基准测试中平均优于最强基线7.7%。
English: The proposed Orthogonal Common Neighbor (OCN) method addresses redundancy and over-smoothing in common neighbor features through orthogonalization and normalization, achieving a 7.7% average improvement over top baselines in link prediction benchmarks.
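
The idea of removing redundancy between different-order common-neighbor features can be sketched with plain Gram-Schmidt orthogonalization. This is an illustration under stated assumptions, not the OCN method itself: the feature columns `cn1` and `cn2` are made-up counts over four candidate links, the function `gram_schmidt` is a generic helper, and scaling each column to unit length is only a stand-in for the paper's normalization.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(columns, eps=1e-12):
    """Orthonormalize feature columns in order (modified Gram-Schmidt).

    Each later column keeps only the part not already explained by earlier ones.
    A column that is fully redundant becomes a zero vector.
    """
    basis = []
    for col in columns:
        residual = [float(x) for x in col]
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * bi for r, bi in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > eps:
            basis.append([r / norm for r in residual])
        else:
            basis.append([0.0] * len(residual))
    return basis


# Hypothetical counts for four candidate links (values are illustrative only).
cn1 = [2, 1, 0, 1]  # first-order common neighbors
cn2 = [3, 2, 1, 1]  # second-order common neighbors, correlated with cn1

ortho = gram_schmidt([cn1, cn2])
print(dot(ortho[0], ortho[1]))  # close to zero: the overlap has been removed
```

After this step the second column carries only the signal that the first-order counts do not already provide, which is the kind of redundancy the abstract describes.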

Authors:Zizhang Li, Hong-Xing Yu, Wei Liu, Yin Yang, Charles Herrmann, Gordon Wetzstein, Jiajun Wu
Title: WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions
Abstract:
WonderPlay is a novel framework integrating physics simulation with video generation for generating action-conditioned dynamic 3D scenes from a single image. While prior works are restricted to rigid body or simple elastic dynamics, WonderPlay features a hybrid generative simulator to synthesize a wide range of 3D dynamics. The hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, which subsequently conditions a video generator to produce a video with finer, more realistic motion. The generated video is then used to update the simulated dynamic 3D scene, closing the loop between the physics solver and the video generator. This approach enables intuitive user control to be combined with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results demonstrate that WonderPlay enables users to interact with various scenes of diverse content, including cloth, sand, snow, liquid, smoke, elastic, and rigid bodies -- all using a single image input. Code will be made public. Project website: https://kyleleey.github.io/WonderPlay/
中文: WonderPlay是一种创新框架,通过整合物理模拟与视频生成技术,能够从单张图像创建动态3D场景,实现直观的用户控制并处理包括布料、液体在内的多种材质。
English: WonderPlay is an innovative framework that combines physics simulation and video generation to create dynamic 3D scenes from a single image, allowing intuitive user control and handling diverse materials like cloth and liquids.

Authors:Congren Dai, Huichi Zhou, Jiahao Huang, Zhenxuan Zhang, Fanwen Wang, Guang Yang, Fei Ye
Title: Dynamic Dual Buffer with Divide-and-Conquer Strategy for Online Continual Learning
Abstract:
Online Continual Learning (OCL) presents a complex learning environment in which new data arrives in a batch-to-batch online format, and the risk of catastrophic forgetting can significantly impair model efficacy. In this study, we address OCL by introducing an innovative memory framework that incorporates a short-term memory system to retain dynamic information and a long-term memory system to archive enduring knowledge. Specifically, the long-term memory system comprises a collection of sub-memory buffers, each linked to a cluster prototype and designed to retain data samples from distinct categories. We propose a novel $K$-means-based sample selection method to identify cluster prototypes for each encountered category. To safeguard essential and critical samples, we introduce a novel memory optimisation strategy that selectively retains samples in the appropriate sub-memory buffer by evaluating each cluster prototype against incoming samples through an optimal transportation mechanism. This approach specifically promotes each sub-memory buffer to retain data samples that exhibit significant discrepancies from the corresponding cluster prototype, thereby ensuring the preservation of semantically rich information. In addition, we propose a novel Divide-and-Conquer (DAC) approach that formulates the memory updating as an optimisation problem and divides it into several subproblems. As a result, the proposed DAC approach can solve these subproblems separately and thus can significantly reduce computations of the proposed memory updating process. We conduct a series of experiments across standard and imbalanced learning settings, and the empirical findings indicate that the proposed memory framework achieves state-of-the-art performance in both learning contexts.
中文: 本研究提出了一种创新的在线持续学习双记忆框架,结合短期与长期记忆系统,采用新型K均值样本选择方法和分治优化策略,有效防止灾难性遗忘,并在多种学习场景中实现了最先进的性能。
English: This study introduces an innovative dual-memory framework for Online Continual Learning, combining short-term and long-term systems with a novel K-means sample selection method and a Divide-and-Conquer optimization strategy to effectively prevent catastrophic forgetting while achieving state-of-the-art performance across various learning scenarios.

Authors:Manuel Lecha, Andrea Cavallo, Francesca Dominici, Ran Levi, Alessio Del Bue, Elvin Isufi, Pietro Morerio, Claudio Battiloro
Title: Directed Semi-Simplicial Learning with Applications to Brain Activity Decoding
Abstract:
Graph Neural Networks (GNNs) excel at learning from pairwise interactions but often overlook multi-way and hierarchical relationships. Topological Deep Learning (TDL) addresses this limitation by leveraging combinatorial topological spaces. However, existing TDL models are restricted to undirected settings and fail to capture the higher-order directed patterns prevalent in many complex systems, e.g., brain networks, where such interactions are both abundant and functionally significant. To fill this gap, we introduce Semi-Simplicial Neural Networks (SSNs), a principled class of TDL models that operate on semi-simplicial sets -- combinatorial structures that encode directed higher-order motifs and their directional relationships. To enhance scalability, we propose Routing-SSNs, which dynamically select the most informative relations in a learnable manner. We prove that SSNs are strictly more expressive than standard graph and TDL models. We then introduce a new principled framework for brain dynamics representation learning, grounded in the ability of SSNs to provably recover topological descriptors shown to successfully characterize brain activity. Empirically, SSNs achieve state-of-the-art performance on brain dynamics classification tasks, outperforming the second-best model by up to 27%, and message passing GNNs by up to 50% in accuracy. Our results highlight the potential of principled topological models for learning from structured brain data, establishing a unique real-world case study for TDL. We also test SSNs on standard node classification and edge regression tasks, showing competitive performance. We will make the code and data publicly available.
中文: 本文提出的半单纯神经网络通过捕捉大脑网络等复杂系统中的有向高阶交互,在拓扑深度学习框架下展现出更强的表达能力,并在脑动力学分类任务中实现了最佳性能。
English: This paper introduces Semi-Simplicial Neural Networks (SSNs), a novel topological deep learning approach that captures directed higher-order interactions in complex systems like brain networks, demonstrating superior expressiveness and achieving state-of-the-art performance in brain dynamics classification.

Authors:Seth Siriya, Julian D. Schiller, Victor G. Lopez, Matthias A. Müller
Title: Sufficient Conditions for Detectability of Approximately Discretized Nonlinear Systems
Abstract:
In many sampled-data applications, observers are designed based on approximately discretized models of continuous-time systems, where usually only the discretized system is analyzed in terms of its detectability. In this paper, we show that if the continuous-time system satisfies certain linear matrix inequality (LMI) conditions, and the sampling period of the discretization scheme is sufficiently small, then the whole family of discretized systems (parameterized by the sampling period) satisfies analogous discrete-time LMI conditions that imply detectability. Our results are applicable to general discretization schemes, as long as they produce approximate models whose linearizations are in some sense consistent with the linearizations of the continuous-time ones. We explicitly show that the Euler and second-order Runge-Kutta methods satisfy this condition. A batch-reactor system example is provided to highlight the usefulness of our results from a practical perspective.
中文摘要:本文证明,只要连续时间系统满足特定线性矩阵不等式条件且采样周期足够小,其离散化模型就能保持可检测性,适用于包括欧拉法和龙格-库塔法在内的通用离散化方法,并通过反应器实例验证了实际应用价值。
English Summary: This paper demonstrates that if a continuous-time system meets specific linear matrix inequality conditions and uses sufficiently small sampling periods, its discretized models will maintain detectability across general discretization methods, with validation provided through Euler and Runge-Kutta methods and a practical reactor example.
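
The two discretization schemes named in the abstract can be written as one-step maps parameterized by the sampling period. The sketch below is illustrative only: the scalar test system dx/dt = -x is a made-up example rather than the paper's batch reactor, and the midpoint rule is used as one common second-order Runge-Kutta variant.

```python
import math


def euler_step(f, x, h):
    """Approximate discretization x+ = x + h * f(x)."""
    return [xi + h * fi for xi, fi in zip(x, f(x))]


def rk2_step(f, x, h):
    """Second-order Runge-Kutta (midpoint rule): x+ = x + h * f(x + h/2 * f(x))."""
    k1 = f(x)
    midpoint = [xi + 0.5 * h * ki for xi, ki in zip(x, k1)]
    k2 = f(midpoint)
    return [xi + h * ki for xi, ki in zip(x, k2)]


def f(x):
    """Hypothetical continuous-time dynamics dx/dt = -x."""
    return [-x[0]]


h = 0.1  # sampling period
x0 = [1.0]
exact = math.exp(-h)  # exact solution of dx/dt = -x after time h

euler_error = abs(euler_step(f, x0, h)[0] - exact)
rk2_error = abs(rk2_step(f, x0, h)[0] - exact)
print(euler_error, rk2_error)
```

Both maps form a family indexed by `h`, and their one-step error shrinks as `h` decreases, which is the setting in which the abstract's small-sampling-period result applies.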

Authors:Shahin Hakemi, Naveed Akhtar, Ghulam Mubashar Hassan, Ajmal Mian
Title: Deeper Diffusion Models Amplify Bias
Abstract:
Despite the impressive performance of generative Diffusion Models (DMs), their internal working is still not well understood, which is potentially problematic. This paper focuses on exploring the important notion of bias-variance tradeoff in diffusion models. Providing a systematic foundation for this exploration, it establishes that at one extreme the diffusion models may amplify the inherent bias in the training data and, on the other, they may compromise the presumed privacy of the training samples. Our exploration aligns with the memorization-generalization understanding of the generative models, but it also expands further along this spectrum beyond "generalization", revealing the risk of bias amplification in deeper models. Building on the insights, we also introduce a training-free method to improve output quality in text-to-image and image-to-image generation. By progressively encouraging temporary high variance in the generation process with partial bypassing of the mid-block's contribution in the denoising process of DMs, our method consistently improves generative image quality with zero training cost. Our claims are validated both theoretically and empirically.
中文摘要:本文探讨扩散模型中的偏差-方差权衡问题,揭示其可能放大训练数据固有偏差或损害数据隐私,并通过理论与实验验证了这些发现。
English Summary: This paper investigates the bias-variance tradeoff in diffusion models, revealing that they can either amplify training data bias or compromise data privacy, with theoretical and empirical validation.

Authors:Zhongpai Gao, Meng Zheng, Benjamin Planche, Anwesa Choudhuri, Terrence Chen, Ziyan Wu
Title: Render-FM: A Foundation Model for Real-time Photorealistic Volumetric Rendering
Abstract:
Volumetric rendering of Computed Tomography (CT) scans is crucial for visualizing complex 3D anatomical structures in medical imaging. Current high-fidelity approaches, especially neural rendering techniques, require time-consuming per-scene optimization, limiting clinical applicability due to computational demands and poor generalizability. We propose Render-FM, a novel foundation model for direct, real-time volumetric rendering of CT scans. Render-FM employs an encoder-decoder architecture that directly regresses 6D Gaussian Splatting (6DGS) parameters from CT volumes, eliminating per-scan optimization through large-scale pre-training on diverse medical data. By integrating robust feature extraction with the expressive power of 6DGS, our approach efficiently generates high-quality, real-time interactive 3D visualizations across diverse clinical CT data. Experiments demonstrate that Render-FM achieves visual fidelity comparable or superior to specialized per-scan methods while drastically reducing preparation time from nearly an hour to seconds for a single inference step. This advancement enables seamless integration into real-time surgical planning and diagnostic workflows. The project page is: https://gaozhongpai.github.io/renderfm/.
中文: Render-FM是一种基础模型,通过直接从CT体积回归6D高斯泼溅参数,实现实时高保真体积渲染,无需逐场景优化,并将准备时间从近一小时缩短至数秒。
English: Render-FM is a foundation model that enables real-time, high-fidelity volumetric rendering of CT scans by directly regressing 6D Gaussian Splatting parameters, eliminating per-scene optimization and reducing preparation time from nearly an hour to seconds.

Authors:Suhao Yu, Haojin Wang, Juncheng Wu, Cihang Xie, Yuyin Zhou
Title: MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning
Abstract:
Existing medical VQA benchmarks mostly focus on single-image analysis, yet clinicians almost always compare a series of images before reaching a diagnosis. To better approximate this workflow, we introduce MedFrameQA -- the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. To build MedFrameQA both at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, and 2) a multiple-stage filtering strategy, including model-based and manual review, to preserve data clarity, difficulty, and medical relevance. The resulting dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images. We comprehensively benchmark ten advanced Multimodal LLMs -- both proprietary and open source, with and without explicit reasoning modules -- on MedFrameQA. The evaluation reveals that all models perform poorly, with most accuracies below 50%, and that accuracy fluctuates as the number of images per question increases. Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities. We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.
中文摘要:MedFrameQA作为首个评估医学视觉问答中多图像推理能力的基准,揭示了现有模型表现不佳(准确率大多低于50%),且在跨图像证据整合方面存在显著缺陷。
English Summary: The MedFrameQA benchmark is introduced as the first to evaluate multi-image reasoning in medical VQA, revealing that current models perform poorly, with most accuracies below 50%, and struggle with evidence integration across image sequences.

Authors:Qirui Jiao, Daoyuan Chen, Yilun Huang, Xika Lin, Ying Shen, Yaliang Li
Title: DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?
Abstract:
While recent text-to-image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, their performance significantly degrades when confronted with long, detail-intensive prompts required in professional applications. We present DetailMaster, the first comprehensive benchmark specifically designed to evaluate T2I models' systematic abilities to handle extended textual inputs that contain complex compositional requirements. Our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Explicit Spatial/Interactive Relationships. The benchmark comprises long and detail-rich prompts averaging 284.89 tokens, with high quality validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations: state-of-the-art models achieve merely ~50% accuracy in key dimensions like attribute binding and spatial reasoning, while all models show progressive performance degradation as prompt length increases. Our analysis highlights systemic failures in structural comprehension and detail overload handling, motivating future research into architectures with enhanced compositional reasoning. We open-source the dataset, data curation code, and evaluation tools to advance detail-rich T2I generation and enable broad applications that would otherwise be infeasible due to the lack of a dedicated benchmark.
中文: DetailMaster是首个专门评估文本到图像模型处理长而复杂提示语能力的综合基准,揭示了在属性绑定和空间推理等关键维度上,随着提示语长度增加,模型性能显著下降的问题。
English: DetailMaster is the first comprehensive benchmark designed to evaluate text-to-image models' ability to handle long, detail-rich prompts, revealing significant performance degradation in key dimensions like attribute binding and spatial reasoning as prompt length increases.

Authors:Xiaodan Shao, Rui Zhang, Jihong Park, Tony Q. S. Quek, Robert Schober, Xuemin Shen
Title: Directional Sparsity Based Statistical Channel Estimation for 6D Movable Antenna Communications
Abstract:
Six-dimensional movable antenna (6DMA) is an innovative and transformative technology to improve wireless network capacity by adjusting the 3D positions and 3D rotations of antennas/surfaces (sub-arrays) based on the channel spatial distribution. For optimization of the antenna positions and rotations, the acquisition of statistical channel state information (CSI) is essential for 6DMA systems. In this paper, we unveil for the first time a new directional sparsity property of the 6DMA channels between the base station (BS) and the distributed users, where each user has significant channel gains only with a (small) subset of 6DMA position-rotation pairs, which can receive direct/reflected signals from the user. By exploiting this property, a covariance-based algorithm is proposed for estimating the statistical CSI in terms of the average channel power at a small number of 6DMA positions and rotations. Based on such limited channel power estimation, the average channel powers for all possible 6DMA positions and rotations in the BS movement region are reconstructed by further estimating the multi-path average power and direction-of-arrival (DOA) vectors of all users. Simulation results show that the proposed directional sparsity-based algorithm can achieve higher channel power estimation accuracy than existing benchmark schemes, while requiring a lower pilot overhead.
Chinese Summary: 六维可移动天线(6DMA)技术通过优化天线位置与旋转提升无线容量,本文提出的基于方向稀疏性的算法相比现有方案能以更低导频开销实现更高精度的信道功率估计。
English Summary: Six-dimensional movable antenna (6DMA) technology enhances wireless capacity through optimized antenna positioning and rotation, with a proposed directional sparsity-based algorithm achieving higher channel power estimation accuracy and lower pilot overhead than existing methods.

Authors:Cheng Qian, Hongyi Du, Hongru Wang, Xiusi Chen, Yuji Zhang, Avirup Sil, Chengxiang Zhai, Kathleen McKeown, Heng Ji
Title: ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges
Abstract:
Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect the complexity of real-world problems, which demand open-ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce ModelingBench, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench also supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. We also present ModelingAgent, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. To evaluate outputs, we further propose ModelingJudge, an expert-in-the-loop system leveraging LLMs as domain-specialized judges assessing solutions from multiple expert perspectives. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real-world problem-solving in open-ended, interdisciplinary modeling challenges.
中文: 本文提出了ModelingBench基准测试,用于评估需要跨学科推理的现实数学建模问题,并开发了ModelingAgent多智能体框架,其通过专家循环评估系统ModelingJudge生成的解决方案优于基线,达到专家水平。
English: This paper introduces ModelingBench, a benchmark for real-world mathematical modeling problems requiring interdisciplinary reasoning, and ModelingAgent, a multi-agent framework that outperforms baselines by producing expert-level solutions evaluated through the expert-in-the-loop ModelingJudge system.

Authors:Jiafeng Liang, Shixin Jiang, Xuan Dong, Ning Wang, Zheng Chu, Hui Su, Jinlan Fu, Ming Liu, See-Kiong Ng, Bing Qin
Title: Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency
Abstract:
Large Multimodal Models (LMMs) have recently demonstrated impressive performance on general video comprehension benchmarks. Nevertheless, for broader applications, the robustness of their temporal analysis capability needs to be thoroughly investigated, yet it has been predominantly ignored. Motivated by this, we propose a novel temporal robustness benchmark (TemRobBench), which introduces temporal inconsistency perturbations separately at the visual and textual modalities to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they exhibit over-reliance on prior knowledge and textual context in adversarial environments, while ignoring the actual temporal dynamics in the video. To mitigate this issue, we design panoramic direct preference optimization (PanoDPO), which encourages LMMs to incorporate both visual and linguistic feature preferences simultaneously. Experimental results show that PanoDPO can effectively enhance the model's robustness and reliability in temporal analysis.
中文: 本研究提出TemRobBench评估大型多模态模型的时间鲁棒性,发现其过度依赖先验知识而忽略视频动态,并通过PanoDPO方法整合视觉与语言特征来有效提升模型可靠性。
English: This study introduces TemRobBench to evaluate the temporal robustness of Large Multimodal Models (LMMs) and finds they over-rely on prior knowledge while neglecting video dynamics, proposing PanoDPO to enhance their reliability by integrating visual and linguistic features.

Authors:Jiangxia Cao, Pengbo Xu, Yin Cheng, Kaiwei Guo, Jian Tang, Shijun Wang, Dewei Leng, Shuang Yang, Zhaojie Liu, Yanan Niu, Guorui Zhou, Kun Gai
Title: Pantheon: Personalized Multi-objective Ensemble Sort via Iterative Pareto Policy Optimization
Abstract:
In this paper, we provide our milestone ensemble sort work and the first-hand practical experience, Pantheon, which transforms ensemble sorting from a "human-curated art" to a "machine-optimized science". Compared with formulation-based ensemble sort, our Pantheon has the following advantages: (1) Personalized Joint Training: our Pantheon is jointly trained with the real-time ranking model, which could capture ever-changing user personalized interests accurately. (2) Representation inheritance: instead of the highly compressed Pxtrs, our Pantheon utilizes the fine-grained hidden-states as model input, which could benefit from the Ranking model to enhance our model complexity. Meanwhile, to reach a balanced multi-objective ensemble sort, we further devise an iterative Pareto policy optimization (IPPO) strategy to consider the multiple objectives at the same time. To our knowledge, this paper is the first work to replace the entire formulation-based ensemble sort in industry RecSys, which was fully deployed at Kuaishou live-streaming services, serving 400 Million users daily.
中文: 本文提出Pantheon系统,通过个性化联合训练和表征继承将集成排序从人工策展转变为机器优化,并已全面部署于快手直播服务,每日服务四亿用户。
English: This paper introduces Pantheon, a machine-optimized ensemble sorting system that replaces human-curated methods with personalized joint training and representation inheritance, fully deployed in Kuaishou's live-streaming services for 400 million daily users.

Authors:Zhi Su, Yuman Gao, Emily Lukas, Yunfei Li, Jiaze Cai, Faris Tulbah, Fei Gao, Chao Yu, Zhongyu Li, Yi Wu, Koushil Sreenath
Title: Toward Real-World Cooperative and Competitive Soccer with Quadrupedal Robot Teams
Abstract:
Achieving coordinated teamwork among legged robots requires both fine-grained locomotion control and long-horizon strategic decision-making. Robot soccer offers a compelling testbed for this challenge, combining dynamic, competitive, and multi-agent interactions. In this work, we present a hierarchical multi-agent reinforcement learning (MARL) framework that enables fully autonomous and decentralized quadruped robot soccer. First, a set of highly dynamic low-level skills is trained for legged locomotion and ball manipulation, such as walking, dribbling, and kicking. On top of these, a high-level strategic planning policy is trained with Multi-Agent Proximal Policy Optimization (MAPPO) via Fictitious Self-Play (FSP). This learning framework allows agents to adapt to diverse opponent strategies and gives rise to sophisticated team behaviors, including coordinated passing, interception, and dynamic role allocation. With an extensive ablation study, the proposed learning method shows significant advantages in the cooperative and competitive multi-agent soccer game. We deploy the learned policies to real quadruped robots relying solely on onboard proprioception and decentralized localization, with the resulting system supporting autonomous robot-robot and robot-human soccer matches on indoor and outdoor soccer courts.
中文: 本研究提出了一种分层多智能体强化学习框架,通过结合底层运动技能与高层策略规划,实现了四足机器人自主进行足球比赛的能力,在真实环境中展现出有效的协调性与适应性。
English: This study introduces a hierarchical multi-agent reinforcement learning framework that enables autonomous quadruped robots to play soccer by combining low-level locomotion skills with high-level strategic planning, demonstrating effective coordination and adaptability in real-world environments.

Authors:Jinhe Bi, Danqi Yan, Yifan Wang, Wenke Huang, Haokun Chen, Guancheng Wan, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, Yunpu Ma
Title: CoT-Kinetics: A Theoretical Modeling Assessing LRM Reasoning Process
Abstract:
Recent Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models by learning to reason, exhibiting promising performance in solving complex tasks. LRMs solve tasks that require complex reasoning by explicitly generating reasoning trajectories together with answers. Nevertheless, judging the quality of such an output answer is not easy, because considering only the correctness of the answer is not enough: the soundness of the reasoning trajectory matters as well. Logically, if the soundness of the reasoning part is poor, then even if the answer is correct, the confidence in the derived answer should be low. Existing methods do consider jointly assessing the overall output answer by taking into account the reasoning part; however, their capability is still not satisfactory, as the causal relationship between the reasoning and the concluded answer cannot be properly reflected. In this paper, inspired by classical mechanics, we present a novel approach towards establishing a CoT-Kinetics energy equation. Specifically, our CoT-Kinetics energy equation formulates the token state transformation process, which is regulated by the LRM's internal transformer layers, as akin to particle kinetics governed in a mechanical field. Our CoT-Kinetics energy assigns a scalar score to evaluate specifically the soundness of the reasoning phase, telling how confident the derived answer could be given the evaluated reasoning. As such, the LRM's overall output quality can be accurately measured, rather than reduced to a coarse judgment (e.g., correct or incorrect).
Chinese: 近期的大推理模型通过生成推理轨迹提升了复杂任务解决能力,但现有评估方法无法充分衡量推理过程的合理性,导致对答案可靠性的判断不准确。本文受经典力学启发,提出了一种CoT-Kinetics能量方程,通过将标记状态转换建模为粒子动力学,专门评估推理阶段的合理性以精确衡量模型输出质量。
English: Recent Large Reasoning Models enhance complex task-solving by generating reasoning trajectories, but current evaluation methods inadequately assess the soundness of these reasoning paths, leading to unreliable confidence in derived answers. This paper introduces a CoT-Kinetics energy equation inspired by classical mechanics to accurately evaluate reasoning soundness and overall output quality by modeling token state transformations as particle dynamics.

Authors:Xuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
Title: Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy
Abstract:
Recent advances in neural audio codec-based speech generation (CoSG) models have produced remarkably realistic audio deepfakes. We refer to deepfake speech generated by CoSG systems as codec-based deepfake, or CodecFake. Although existing anti-spoofing research on CodecFake predominantly focuses on verifying the authenticity of audio samples, almost no attention was given to tracing the CoSG used in generating these deepfakes. In CodecFake generation, processes such as speech-to-unit encoding, discrete unit modeling, and unit-to-speech decoding are fundamentally based on neural audio codecs. Motivated by this, we introduce source tracing for CodecFake via neural audio codec taxonomy, which dissects neural audio codecs to trace CoSG. Our experimental results on the CodecFake+ dataset provide promising initial evidence for the feasibility of CodecFake source tracing while also highlighting several challenges that warrant further investigation.
中文: 本文提出通过解析神经音频编解码器组件来追踪基于编解码器的深度伪造语音来源的新方法,初步验证了其可行性并指出了未来研究需解决的关键挑战。
English: This paper introduces a novel approach for tracing the source of codec-based deepfake speech by analyzing neural audio codec components, demonstrating initial feasibility while identifying key challenges for future research.

Authors:Yinzhe Wu, Jiahao Huang, Fanwen Wang, Mengze Gao, Congyu Liao, Guang Yang, Kawin Setsompop
Title: Enhancing Diffusion-Weighted Images (DWI) for Diffusion MRI: Is it Enough without Non-Diffusion-Weighted B=0 Reference?
Abstract:
Diffusion MRI (dMRI) is essential for studying brain microstructure, but high-resolution imaging remains challenging due to the inherent trade-offs between acquisition time and signal-to-noise ratio (SNR). Conventional methods often optimize only the diffusion-weighted images (DWIs) without considering their relationship with the non-diffusion-weighted (b=0) reference images. However, calculating diffusion metrics, such as the apparent diffusion coefficient (ADC) and diffusion tensor with its derived metrics like fractional anisotropy (FA) and mean diffusivity (MD), relies on the ratio between each DWI and the b=0 image, which is crucial for clinical observation and diagnostics. In this study, we demonstrate that solely enhancing DWIs using a conventional pixel-wise mean squared error (MSE) loss is insufficient, as the error in ratio between generated DWIs and b=0 diverges. We propose a novel ratio loss, defined as the MSE loss between the predicted and ground-truth log of DWI/b=0 ratios. Our results show that incorporating the ratio loss significantly improves the convergence of this ratio error, achieving lower ratio MSE and slightly enhancing the peak signal-to-noise ratio (PSNR) of generated DWIs. This leads to improved dMRI super-resolution and better preservation of b=0 ratio-based features for the derivation of diffusion metrics.
中文: 本研究提出了一种新颖的比率损失函数,通过更好地保持扩散加权图像与参考图像之间的关键关系,改进了扩散MRI超分辨率技术,提高了衍生扩散指标的准确性。
English: This study introduces a novel ratio loss function that improves diffusion MRI super-resolution by better preserving the critical relationship between diffusion-weighted and reference images, enhancing the accuracy of derived diffusion metrics.
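
The ratio loss described in the abstract above (the MSE between predicted and ground-truth logs of the DWI/b=0 ratio) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the flat-list image representation, the `eps` stabilizer, and the choice to pass a separate predicted b=0 image are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

EPS = 1e-6  # assumed stabilizer to avoid log(0); not specified in the abstract


def log_ratio(dwi: Sequence[float], b0: Sequence[float], eps: float = EPS) -> List[float]:
    """Per-pixel log(DWI / b0) for two images given as flat intensity lists."""
    return [math.log((d + eps) / (b + eps)) for d, b in zip(dwi, b0)]


def ratio_loss(
    pred_dwi: Sequence[float],
    pred_b0: Sequence[float],
    true_dwi: Sequence[float],
    true_b0: Sequence[float],
    eps: float = EPS,
) -> float:
    """Mean squared error between predicted and ground-truth log ratios."""
    pred = log_ratio(pred_dwi, pred_b0, eps)
    true = log_ratio(true_dwi, true_b0, eps)
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)
```

Because the loss compares ratios rather than raw intensities, a prediction that doubles a DWI pixel relative to its b=0 reference is penalized by (log 2)^2 regardless of the absolute signal level, which is the property a plain pixel-wise MSE lacks.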

Authors:Muleilan Pei, Jiayao Shan, Peiliang Li, Jieqi Shi, Jing Huo, Yang Gao, Shaojie Shen
Title: SEPT: Standard-Definition Map Enhanced Scene Perception and Topology Reasoning for Autonomous Driving
Abstract:
Online scene perception and topology reasoning are critical for autonomous vehicles to understand their driving environments, particularly for mapless driving systems that endeavor to reduce reliance on costly High-Definition (HD) maps. However, recent advances in online scene understanding still face limitations, especially in long-range or occluded scenarios, due to the inherent constraints of onboard sensors. To address this challenge, we propose a Standard-Definition (SD) Map Enhanced scene Perception and Topology reasoning (SEPT) framework, which explores how to effectively incorporate the SD map as prior knowledge into existing perception and reasoning pipelines. Specifically, we introduce a novel hybrid feature fusion strategy that combines SD maps with Bird's-Eye-View (BEV) features, considering both rasterized and vectorized representations, while mitigating potential misalignment between SD maps and BEV feature spaces. Additionally, we leverage the SD map characteristics to design an auxiliary intersection-aware keypoint detection task, which further enhances the overall scene understanding performance. Experimental results on the large-scale OpenLane-V2 dataset demonstrate that by effectively integrating SD map priors, our framework significantly improves both scene perception and topology reasoning, outperforming existing methods by a substantial margin.
中文: 本研究提出的SEPT框架通过混合特征融合策略和交叉路口感知关键点检测,将标准精度地图作为先验知识有效集成到感知推理流程中,在OpenLane-V2数据集上显著提升了场景感知与拓扑推理性能。
English: The proposed SEPT framework enhances autonomous vehicle scene perception and topology reasoning by integrating Standard-Definition maps as prior knowledge through hybrid feature fusion and intersection-aware keypoint detection, achieving state-of-the-art performance on the OpenLane-V2 dataset.

Authors:Sangmin Lee, Eunpil Park, Angel Canelo, Hyunhee Park, Youngjo Kim, Hyung-Ju Chun, Xin Jin, Chongyi Li, Chun-Le Guo, Radu Timofte, Qi Wu, Tianheng Qiu, Yuchun Dong, Shenglin Ding, Guanghua Pan, Weiyu Zhou, Tao Hu, Yixu Feng, Duwei Dai, Yu Cao, Peng Wu, Wei Dong, Yanning Zhang, Qingsen Yan, Simon J. Larsen, Ruixuan Jiang, Senyan Xu, Xingbo Wang, Xin Lu, Marcos V. Conde, Javier Abad-Hernandez, Alvaro García-Lara, Daniel Feijoo, Alvaro García, Zeyu Xiao, Zhuoyuan Li
Title: NTIRE 2025 Challenge on Efficient Burst HDR and Restoration: Datasets, Methods, and Results
Abstract:
This paper reviews the NTIRE 2025 Efficient Burst HDR and Restoration Challenge, which aims to advance efficient multi-frame high dynamic range (HDR) and restoration techniques. The challenge is based on a novel RAW multi-frame fusion dataset, comprising nine noisy and misaligned RAW frames with various exposure levels per scene. Participants were tasked with developing solutions capable of effectively fusing these frames while adhering to strict efficiency constraints: fewer than 30 million model parameters and a computational budget under 4.0 trillion FLOPs. A total of 217 participants registered, with six teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 43.22 dB, showcasing the potential of novel methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers and practitioners in efficient burst HDR and restoration.
中文: NTIRE 2025挑战赛通过新型RAW数据集推动高效多帧HDR与图像复原技术发展,顶尖方案在严格计算限制下实现了卓越性能。
English: The NTIRE 2025 Challenge advanced efficient multi-frame HDR and restoration techniques using a novel RAW dataset, with top solutions achieving high performance under strict computational constraints.

Authors:Zehan Wang, Siyu Chen, Lihe Yang, Jialei Wang, Ziang Zhang, Hengshuang Zhao, Zhou Zhao
Title: Depth Anything with Any Prior
Abstract:
This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth sources. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets, matching or even surpassing previous task-specific methods. More importantly, it performs well on challenging, unseen mixed priors and enables test-time improvements by switching prediction models, providing a flexible accuracy-efficiency trade-off while evolving with advancements in MDE models.
中文摘要:Prior Depth Anything框架通过粗细结合的流程,将不完整但精确的度量深度信息与完整但相对的深度预测相结合,生成高精度稠密深度图,在七大现实数据集上展现出卓越的零样本泛化能力,其性能媲美甚至超越特定任务方法。
English Summary: The Prior Depth Anything framework integrates precise but incomplete metric depth data with complete relative depth predictions to produce highly accurate and detailed metric depth maps through a coarse-to-fine pipeline, demonstrating superior zero-shot generalization across multiple tasks and datasets.

Authors:Zehan Wang, Ke Lei, Chen Zhu, Jiawei Huang, Sashuai Zhou, Luping Liu, Xize Cheng, Shengpeng Ji, Zhenhui Ye, Tao Jin, Zhou Zhao
Title: T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback
Abstract:
Text-to-audio (T2A) generation has achieved remarkable progress in generating a variety of audio outputs from language prompts. However, current state-of-the-art T2A models still struggle to satisfy human preferences for prompt-following and acoustic quality when generating complex multi-event audio. To improve the performance of the model in these high-level applications, we propose to enhance the basic capabilities of the model with AI feedback learning. First, we introduce fine-grained AI audio scoring pipelines to: 1) verify whether each event in the text prompt is present in the audio (Event Occurrence Score), 2) detect deviations in event sequences from the language description (Event Sequence Score), and 3) assess the overall acoustic and harmonic quality of the generated audio (Acoustic&Harmonic Quality). We evaluate these three automatic scoring pipelines and find that they correlate significantly better with human preferences than other evaluation metrics. This highlights their value as both feedback signals and evaluation metrics. Utilizing our robust scoring pipelines, we construct a large audio preference dataset, T2A-FeedBack, which contains 41k prompts and 249k audios, each accompanied by detailed scores. Moreover, we introduce T2A-EpicBench, a benchmark that focuses on long captions, multi-events, and story-telling scenarios, aiming to evaluate the advanced capabilities of T2A models. Finally, we demonstrate how T2A-FeedBack can enhance a current state-of-the-art audio model. With simple preference tuning, the audio generation model exhibits significant improvements in both simple (AudioCaps test set) and complex (T2A-EpicBench) scenarios.
中文: 针对文本转音频模型在处理复杂多事件音频时的不足,本研究提出采用AI反馈学习,通过细粒度评分流程和大规模偏好数据集,显著提升了模型在提示跟随和音质方面的表现。
English: To address the limitations of text-to-audio models in handling complex multi-event audio, this study introduces AI feedback learning with fine-grained scoring pipelines and a large preference dataset, significantly improving model performance in prompt-following and acoustic quality.

Authors:Ningyuan Yang, Jiaxuan Gao, Feng Gao, Yi Wu, Chao Yu
Title: Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps
Abstract:
Diffusion policies, widely adopted in decision-making scenarios such as robotics, gaming and autonomous driving, are capable of learning diverse skills from demonstration data due to their high representation power. However, the sub-optimal and limited coverage of demonstration data could lead to diffusion policies that generate sub-optimal trajectories and even catastrophic failures. While reinforcement learning (RL)-based fine-tuning has emerged as a promising solution to address these limitations, existing approaches struggle to effectively adapt Proximal Policy Optimization (PPO) to diffusion models. This challenge stems from the computational intractability of action likelihood estimation during the denoising process, which leads to complicated optimization objectives. In our experiments starting from randomly initialized policies, we find that online tuning of Diffusion Policies demonstrates much lower sample efficiency compared to directly applying PPO on MLP policies (MLP+PPO). To address these challenges, we introduce NCDPO, a novel framework that reformulates Diffusion Policy as a noise-conditioned deterministic policy. By treating each denoising step as a differentiable transformation conditioned on pre-sampled noise, NCDPO enables tractable likelihood evaluation and gradient backpropagation through all diffusion timesteps. Our experiments demonstrate that NCDPO achieves sample efficiency comparable to MLP+PPO when training from scratch, outperforming existing methods in both sample efficiency and final performance across diverse benchmarks, including continuous robot control and multi-agent game scenarios. Furthermore, our experimental results show that our method is robust to the number of denoising timesteps in the Diffusion Policy.
Chinese: 扩散策略虽能从演示数据中学习多样化技能,但在强化学习微调时存在样本效率低和优化困难的问题;NCDPO框架通过将其重构为噪声条件确定性策略,实现了与MLP+PPO相当的样本效率,并在各类基准测试中取得更优性能。
English: Diffusion policies, while powerful for learning diverse skills from demonstrations, face challenges in sample efficiency and optimization when fine-tuned with reinforcement learning, which the proposed NCDPO framework addresses by reformulating them as noise-conditioned deterministic policies to achieve comparable efficiency and superior performance.

Authors:Liwen Wu, Sai Bi, Zexiang Xu, Hao Tan, Kai Zhang, Fujun Luan, Haolin Lu, Ravi Ramamoorthi
Title: Neural BRDF Importance Sampling by Reparameterization
Abstract:
Neural bidirectional reflectance distribution functions (BRDFs) have emerged as popular material representations for enhancing realism in physically-based rendering. Yet their importance sampling remains a significant challenge. In this paper, we introduce a reparameterization-based formulation of neural BRDF importance sampling that seamlessly integrates into the standard rendering pipeline with precise generation of BRDF samples. The reparameterization-based formulation transfers the distribution learning task to a problem of identifying BRDF integral substitutions. In contrast to previous methods that rely on invertible networks and multi-step inference to reconstruct BRDF distributions, our model removes these constraints, which offers greater flexibility and efficiency. Our variance and performance analysis demonstrates that our reparameterization method achieves the best variance reduction in neural BRDF renderings while maintaining high inference speeds compared to existing baselines.
Chinese: 本文提出了一种基于重参数化的神经BRDF重要性采样方法,通过消除对可逆网络的依赖,提高了渲染的灵活性和效率,在实现最佳方差减少的同时保持了高速推理性能。
English: This paper introduces a reparameterization-based method for neural BRDF importance sampling that enhances rendering efficiency and flexibility by eliminating the need for invertible networks, achieving superior variance reduction and fast inference speeds.

Authors:Xiaodan Shao, Rui Zhang, Haibo Zhou, Qijun Jiang, Conghao Zhou, Weihua Zhuang, Xuemin Shen
Title: Polarforming Antenna Enhanced Sensing and Communication: Modeling and Optimization
Abstract:
In this paper, we propose a novel polarforming antenna (PA) to achieve cost-effective wireless sensing and communication. Specifically, the PA can enable polarforming to adaptively control the antenna's polarization electrically as well as tune its position/rotation mechanically, so as to effectively exploit polarization and spatial diversity to reconfigure wireless channels for improving sensing and communication performance. We study a PA-enhanced integrated sensing and communication (ISAC) system that utilizes user location sensing to facilitate communication between a PA-equipped base station (BS) and PA-equipped users. First, we model the PA channel in terms of transceiver antenna polarforming vectors and antenna positions/rotations. We then propose a two-timescale ISAC protocol, where in the slow timescale, user localization is first performed, followed by the optimization of the BS antennas' positions and rotations based on the sensed user locations; subsequently, in the fast timescale, transceiver polarforming is adapted to cater to the instantaneous channel state information (CSI), with the optimized BS antennas' positions and rotations. We propose a new polarforming-based user localization method that uses a structured time-domain pattern of pilot-polarforming vectors to extract the common stable components in the PA channel across different polarizations based on the parallel factor (PARAFAC) tensor model. Moreover, we maximize the achievable average sum-rate of users by jointly optimizing the fast-timescale transceiver polarforming, including phase shifts and amplitude variations, along with the slow-timescale antenna rotations and positions at the BS. Simulation results validate the effectiveness of the polarforming-based localization algorithm and demonstrate the performance advantages of polarforming, antenna placement, and their joint design.
中文: 本文提出了一种极化成形天线,通过自适应调控天线极化和位置来提升无线感知与通信性能,并通过联合优化框架和仿真验证了其有效性。
English: This paper introduces a polarforming antenna that adaptively controls polarization and position to enhance wireless sensing and communication performance, validated through a joint optimization framework and simulations.

Authors:Tongxu Luo, Wenyu Du, Jiaxi Bi, Stephen Chung, Zhengyang Tang, Hao Yang, Min Zhang, Benyou Wang
Title: Learning from Peers in Reasoning Models
Abstract:
Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, every tokens, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP achieves nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals LeaP's robust error correction by timely peer insights, showing strong error tolerance and handling varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/ .
中文: 大型推理模型虽能自我纠错,但在初始推理不佳时易陷入“前缀主导陷阱”,为此提出的“向同伴学习”(LeaP)方法通过路径间实时共享推理见解突破此局限,在多个数学基准测试中实现显著性能提升。
English: Large Reasoning Models can self-correct errors but struggle when starting with poor reasoning, a phenomenon called the "Prefix Dominance Trap" that the proposed Learning from Peers (LeaP) method addresses by enabling reasoning paths to share insights during inference, achieving significant performance gains across multiple benchmarks.

Authors:Shuai Wang, Harrisen Scells, Bevan Koopman, Guido Zuccon
Title: Reassessing Large Language Model Boolean Query Generation for Systematic Reviews
Abstract:
Systematic reviews are comprehensive literature reviews that address highly focused research questions and represent the highest form of evidence in medicine. A critical step in this process is the development of complex Boolean queries to retrieve relevant literature. Given the difficulty of manually constructing these queries, recent efforts have explored Large Language Models (LLMs) to assist in their formulation. One of the first studies, Wang et al., investigated ChatGPT for this task, followed by Staudinger et al., which evaluated multiple LLMs in a reproducibility study. However, the latter overlooked several key aspects of the original work, including (i) validation of generated queries, (ii) output formatting constraints, and (iii) selection of examples for chain-of-thought (Guided) prompting. As a result, its findings diverged significantly from the original study. In this work, we systematically reproduce both studies while addressing these overlooked factors. Our results show that query effectiveness varies significantly across models and prompt designs, with guided query formulation benefiting from well-chosen seed studies. Overall, prompt design and model selection are key drivers of successful query formulation. Our findings provide a clearer understanding of LLMs' potential in Boolean query generation and highlight the importance of model- and prompt-specific optimisations. The complex nature of systematic reviews adds to challenges in both developing and reproducing methods but also highlights the importance of reproducibility studies in this domain.
中文: 大型语言模型可辅助生成系统综述的布尔查询,但其效果取决于模型选择和提示设计,需针对性优化才能确保文献检索的准确性。
English: Large Language Models can assist in generating Boolean queries for systematic reviews, but their effectiveness depends on model selection and prompt design, requiring careful optimization to ensure accurate literature retrieval.

Authors:Hongyu Rui, Yinzhe Wu, Fanwen Wang, Jiahao Huang, Liutao Yang, Zi Wang, Guang Yang
Title: Decoupling Multi-Contrast Super-Resolution: Pairing Unpaired Synthesis with Implicit Representations
Abstract:
Magnetic Resonance Imaging (MRI) is critical for clinical diagnostics but is often limited by long acquisition times and low signal-to-noise ratios, especially in modalities like diffusion and functional MRI. The multi-contrast nature of MRI presents a valuable opportunity for cross-modal enhancement, where high-resolution (HR) modalities can serve as references to boost the quality of their low-resolution (LR) counterparts, motivating the development of Multi-Contrast Super-Resolution (MCSR) techniques. Prior work has shown that leveraging complementary contrasts can improve SR performance; however, effective feature extraction and fusion across modalities with varying resolutions remains a major challenge. Moreover, existing MCSR methods often assume fixed resolution settings and all require large, perfectly paired training datasets, conditions rarely met in real-world clinical environments. To address these challenges, we propose a novel Modular Multi-Contrast Super-Resolution (MCSR) framework that eliminates the need for paired training data and supports arbitrary upscaling. Our method decouples the MCSR task into two stages: (1) Unpaired Cross-Modal Synthesis (U-CMS), which translates a high-resolution reference modality into a synthesized version of the target contrast, and (2) Unsupervised Super-Resolution (U-SR), which reconstructs the final output using implicit neural representations (INRs) conditioned on spatial coordinates. This design enables scale-agnostic and anatomically faithful reconstruction by bridging unpaired cross-modal synthesis with unsupervised resolution enhancement. Experiments show that our method achieves superior performance at 4x and 8x upscaling, with improved fidelity and anatomical consistency over existing baselines. Our framework demonstrates strong potential for scalable, subject-specific, and data-efficient MCSR in real-world clinical settings.
中文: 本研究提出了一种模块化多对比度超分辨率框架,通过将无配对跨模态合成与无监督超分辨率相结合,克服了需要配对训练数据和固定分辨率的限制,在临床磁共振成像应用中实现了卓越的放大性能和解剖结构保真度。
English: This study introduces a Modular Multi-Contrast Super-Resolution framework that overcomes the limitations of requiring paired training data and fixed resolutions by combining unpaired cross-modal synthesis with unsupervised super-resolution, achieving superior upscaling performance and anatomical fidelity in clinical MRI applications.

Authors:Katrin Hänsel, Luca Maria Aiello, Daniele Quercia, Rossano Schifanella, Krisztian Zsolt Varga, Linus W. Dietz, Marios Constantinides
Title: The Experience of Running: Recommending Routes Using Sensory Mapping in Urban Environments
Abstract:
Depending on the route, runners may experience frustration, freedom, or fulfilment. However, finding routes that are conducive to the psychological experience of running remains an unresolved task in the literature. In a mixed-method study, we interviewed 7 runners to identify themes contributing to running experience, and quantitatively examined these themes in an online survey with 387 runners. Using Principal Component Analysis on the survey responses, we developed a short experience sampling questionnaire that captures the three most important dimensions of running experience: performance & achievement, environment, and mind & social connectedness. Using path preferences obtained from the online survey, we clustered them into two types of routes: scenic (associated with nature and greenery) and urban (characterized by the presence of people); and developed a routing engine for path recommendations. We discuss challenges faced in developing the routing engine, and provide guidelines to integrate it into mobile and wearable running apps.
中文摘要:本研究通过混合方法确定了跑步体验的三大关键维度——表现与成就、环境和心理社交联系,并开发了一个路径推荐引擎,根据跑者偏好提供风景型或城市型路线建议。
English Summary: This study identifies three key dimensions of running experience—performance & achievement, environment, and mind & social connectedness—through a mixed-method approach and develops a routing engine that recommends scenic or urban paths based on runner preferences.

Authors:Yongqiang Zhang, Mustafa A. Kishk, Mohamed-Slim Alouini
Title: High Altitude Platform-Based Caching and Multicasting for Rural Connectivity
Abstract:
Providing efficient and reliable content delivery in rural areas remains a significant challenge due to the lack of communication infrastructure. To bridge the digital divide, this paper investigates the potential of leveraging multiple high-altitude platforms (HAPs) for energy-efficient content delivery in wide rural regions. Each caching-enabled HAP is equipped with both Free-Space Optical (FSO) transceivers for backhaul links and Radio Frequency (RF) antenna arrays for access links. To further enhance network efficiency, we consider a network coding-based multicasting scheme, where different types of content are treated as distinct multicast sessions. With the objective of minimizing long-term power cost, we propose a hierarchical framework that integrates deep reinforcement learning (DRL) and convex optimization to jointly optimize dynamic caching strategies and resource allocation across the network. Simulation results demonstrate that our approach significantly reduces power cost compared to several baseline approaches, providing a practical solution for improving rural connectivity.
中文摘要:本文提出了一种结合深度强化学习与凸优化的分层框架,通过采用网络编码组播的多高空平台系统,显著降低了农村地区高效内容交付的长期能耗成本。
English Summary: This paper proposes a hierarchical framework combining deep reinforcement learning and convex optimization to minimize power costs for energy-efficient content delivery in rural areas using multiple high-altitude platforms with network coding-based multicasting.

Authors:Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, Georgios Pavlakos
Title: RayZer: A Self-supervised Large View Synthesis Model
Abstract:
We present RayZer, a self-supervised multi-view 3D Vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates novel view synthesis performance comparable or even superior to "oracle" methods that rely on pose annotations in both training and testing. Project: https://hwjiang1510.github.io/RayZer/
中文: RayZer是一种无需3D监督的自监督三维视觉模型,能够从无位姿图像中重建场景并合成新视角,其性能可与依赖位姿标注的方法相媲美。
English: RayZer is a self-supervised 3D vision model that reconstructs scenes and synthesizes novel views from unposed images without 3D supervision, achieving performance comparable to pose-dependent methods.

Authors:Shuo Tong, Shangde Gao, Ke Liu, Zihang Huang, Hongxia Xu, Haochao Ying, Jian Wu
Title: Uncertainty-Aware Multi-Expert Knowledge Distillation for Imbalanced Disease Grading
Abstract:
Automatic disease image grading is a significant application of artificial intelligence for healthcare, enabling faster and more accurate patient assessments. However, domain shifts, which are exacerbated by data imbalance, introduce bias into the model, posing deployment difficulties in clinical applications. To address the problem, we propose a novel Uncertainty-aware Multi-experts Knowledge Distillation (UMKD) framework to transfer knowledge from multiple expert models to a single student model. Specifically, to extract discriminative features, UMKD decouples task-agnostic and task-specific features with shallow and compact feature alignment in the feature space. At the output space, an uncertainty-aware decoupled distillation (UDD) mechanism dynamically adjusts knowledge transfer weights based on expert model uncertainties, ensuring robust and reliable distillation. Additionally, UMKD also tackles the problems of model architecture heterogeneity and distribution discrepancies between source and target domains, which are inadequately tackled by previous KD approaches. Extensive experiments on histology prostate grading (SICAPv2) and fundus image grading (APTOS) demonstrate that UMKD achieves a new state-of-the-art in both source-imbalanced and target-imbalanced scenarios, offering a robust and practical solution for real-world disease image grading.
中文: 提出的不确定性感知多专家知识蒸馏(UMKD)框架通过特征解耦和不确定性感知蒸馏动态传递知识,有效解决了疾病图像分级中的领域偏移和数据不平衡问题,在医学数据集上实现了最优性能。
English: The proposed Uncertainty-aware Multi-experts Knowledge Distillation (UMKD) framework effectively addresses domain shift and data imbalance in disease image grading by dynamically transferring knowledge through feature decoupling and uncertainty-aware distillation, achieving state-of-the-art performance on medical datasets.

Authors:Zixuan Chen, Junhui Yin, Yangtao Chen, Jing Huo, Pinzhuo Tian, Jieqi Shi, Yiwen Hou, Yinchuan Li, Yang Gao
Title: DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation
Abstract:
Generalizing language-conditioned multi-task imitation learning (IL) models to novel long-horizon 3D manipulation tasks remains a significant challenge. To address this, we propose DeCo (Task Decomposition and Skill Composition), a model-agnostic framework compatible with various multi-task IL models, designed to enhance their zero-shot generalization to novel, compositional, long-horizon 3D manipulation tasks. DeCo first decomposes IL demonstrations into a set of modular atomic tasks based on the physical interaction between the gripper and objects, and constructs an atomic training dataset that enables models to learn a diverse set of reusable atomic skills during imitation learning. At inference time, DeCo leverages a vision-language model (VLM) to parse high-level instructions for novel long-horizon tasks, retrieve the relevant atomic skills, and dynamically schedule their execution; a spatially-aware skill-chaining module then ensures smooth, collision-free transitions between sequential skills. We evaluate DeCo in simulation using DeCoBench, a benchmark specifically designed to assess zero-shot generalization of multi-task IL models in compositional long-horizon 3D manipulation. Across three representative multi-task IL models (RVT-2, 3DDA, and ARP), DeCo achieves success rate improvements of 66.67%, 21.53%, and 57.92%, respectively, on 12 novel compositional tasks. Moreover, in real-world experiments, a DeCo-enhanced model trained on only 6 atomic tasks successfully completes 9 novel long-horizon tasks, yielding an average success rate improvement of 53.33% over the base multi-task IL model. Video demonstrations are available at: https://deco226.github.io.
中文: DeCo框架通过将任务分解为可复用的原子技能并利用视觉语言模型动态组合这些技能,显著提升了多任务模仿学习模型在仿真和真实场景中对新型长程三维操作任务的零样本泛化能力。
English: The DeCo framework enhances multi-task imitation learning models by decomposing tasks into reusable atomic skills and dynamically composing them using vision-language models, achieving significant improvements in zero-shot generalization for novel long-horizon 3D manipulation tasks in both simulation and real-world experiments.

Authors:Hanting Wang, Tao Jin, Wang Lin, Shulei Wang, Hai Huang, Shengpeng Ji, Zhou Zhao
Title: IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models
Abstract:
Bridge models in image restoration construct a diffusion process from degraded to clear images. However, existing methods typically require training a bridge model from scratch for each specific type of degradation, resulting in high computational costs and limited performance. This work aims to efficiently leverage pretrained generative priors within existing image restoration bridges to eliminate this requirement. The main challenge is that standard generative models are typically designed for a diffusion process that starts from pure noise, while restoration tasks begin with a low-quality image, resulting in a mismatch in the state distributions between the two processes. To address this challenge, we propose a transition equation that bridges two diffusion processes with the same endpoint distribution. Based on this, we introduce the IRBridge framework, which enables the direct utilization of generative models within image restoration bridges, offering a more flexible and adaptable approach to image restoration. Extensive experiments on six image restoration tasks demonstrate that IRBridge efficiently integrates generative priors, resulting in improved robustness and generalization performance. Code will be available at GitHub.
中文: 本文提出IRBridge框架,通过弥合扩散过程之间的差异,将预训练生成模型直接应用于图像修复桥梁,无需针对每种退化类型单独训练,从而在多种修复任务中提升了鲁棒性和泛化性能。
English: This paper introduces IRBridge, a framework that integrates pretrained generative models into image restoration bridges by aligning diffusion processes, eliminating the need for task-specific training and enhancing performance across various restoration tasks.

Authors:Jinlu Zhang, Yixin Chen, Zan Wang, Jie Yang, Yizhou Wang, Siyuan Huang
Title: InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing
Abstract:
3D human-aware generation has made significant progress in recent years. However, existing methods still struggle with generating novel Human Object Interaction (HOI) from text, particularly for open-set objects. We identify three main challenges of this task: precise human-object relation reasoning, affordance parsing for any object, and detailed human interaction pose synthesis aligning description and object geometry. In this work, we propose a novel zero-shot 3D HOI generation framework without training on specific datasets, leveraging the knowledge from large-scale pre-trained models. Specifically, the human-object relations are inferred from large language models (LLMs) to initialize object properties and guide the optimization process. Then we utilize a pre-trained 2D image diffusion model to parse unseen objects and extract contact points, avoiding the limitations imposed by existing 3D asset knowledge. The initial human pose is generated by sampling multiple hypotheses through multi-view SDS based on the input text and object geometry. Finally, we introduce a detailed optimization to generate fine-grained, precise, and natural interaction, enforcing realistic 3D contact between the 3D object and the involved body parts, including hands in grasping. This is achieved by distilling human-level feedback from LLMs to capture detailed human-object relations from the text instruction. Extensive experiments validate the effectiveness of our approach compared to prior works, particularly in terms of the fine-grained nature of interactions and the ability to handle open-set 3D objects.
中文摘要:本文提出了一种零样本三维人-物交互生成框架,通过利用预训练模型克服开放集对象在关系推理、功能解析和姿态合成方面的挑战,无需针对特定数据集进行训练。
English Summary: This paper introduces a zero-shot 3D Human-Object Interaction generation framework that leverages pre-trained models to overcome challenges in reasoning human-object relations, parsing object affordances, and synthesizing detailed interaction poses for open-set objects without dataset-specific training.

Authors:Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Philip Torr, Chang Xu
Title: Revisiting Uncertainty Estimation and Calibration of Large Language Models
Abstract:
As large language models (LLMs) are increasingly deployed in high-stakes applications, robust uncertainty estimation is essential for ensuring the safe and trustworthy deployment of LLMs. We present the most comprehensive study to date of uncertainty estimation in LLMs, evaluating 80 models spanning open- and closed-source families, dense and Mixture-of-Experts (MoE) architectures, reasoning and non-reasoning modes, quantization variants and parameter scales from 0.6B to 671B. Focusing on three representative black-box single-pass methods, including token probability-based uncertainty (TPU), numerical verbal uncertainty (NVU), and linguistic verbal uncertainty (LVU), we systematically evaluate uncertainty calibration and selective classification using the challenging MMLU-Pro benchmark, which covers both reasoning-intensive and knowledge-based tasks. Our results show that LVU consistently outperforms TPU and NVU, offering stronger calibration and discrimination while being more interpretable. We also find that high accuracy does not imply reliable uncertainty, and that model scale, post-training, reasoning ability and quantization all influence estimation performance. Notably, LLMs exhibit better uncertainty estimates on reasoning tasks than on knowledge-heavy ones, and good calibration does not necessarily translate to effective error ranking. These findings highlight the need for multi-perspective evaluation and position LVU as a practical tool for improving the reliability of LLMs in real-world settings.
中文: 这项全面研究表明,在评估的80个大语言模型中,语言表达不确定性方法在校准和区分能力上表现最优,同时发现模型高准确率并不保证可靠的置信度估计,且模型规模、推理能力和量化处理都会影响不确定性评估效果。
English: This comprehensive study evaluates 80 large language models and demonstrates that linguistic verbal uncertainty (LVU) consistently outperforms other methods in calibration and discrimination, while revealing that high model accuracy doesn't guarantee reliable uncertainty estimation across different model scales and task types.
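
To make the calibration part of the evaluation above concrete, here is a minimal sketch of expected calibration error (ECE), a widely used calibration metric. The abstract does not say which calibration metric the authors report, so the choice of ECE, the equal-width binning, and the function name are assumptions for illustration only.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    Illustrative helper (hypothetical name); confidences are expected in [0, 1].
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Each bin accumulates [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)

    total = len(confidences)
    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum / count
        accuracy = n_correct / count
        ece += (count / total) * abs(accuracy - avg_conf)
    return ece
```

A model that states 90% confidence on four answers but gets only three right has a gap of 0.15 between confidence and accuracy in that bin, which is what the function returns for that input. This also shows why high accuracy and good calibration are separate properties.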

Authors:Zexi Liu, Jingyi Chai, Xinyu Zhu, Shuo Tang, Rui Ye, Bo Zhang, Lei Bai, Siheng Chen
Title: ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering
Abstract:
The emergence of large language model (LLM)-based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, most existing approaches rely heavily on manual prompt engineering, failing to adapt and optimize based on diverse experimental experiences. Focusing on this, for the first time, we explore the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration-enriched fine-tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step-wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML-specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML-Agent, driven by a 7B-sized Qwen-2.5 LLM for autonomous ML. Remarkably, despite being trained on merely 9 ML tasks, our 7B-sized ML-Agent outperforms the 671B-sized DeepSeek-R1 agent. Furthermore, it achieves continuous performance improvements and demonstrates exceptional cross-task generalization capabilities.
中文摘要:本研究提出了一种基于学习的智能机器学习框架,通过强化学习训练7B参数的LLM智能体,使其在少量任务训练后即能超越超大规模模型,并展现出卓越的跨任务泛化能力。
English Summary: This study introduces a novel learning-based agentic ML framework that trains a 7B-sized LLM agent through reinforcement learning, enabling it to outperform significantly larger models and demonstrate superior cross-task generalization despite minimal training data.

Authors:Shujie HU, Xurong Xie, Mengzhe Geng, Jiajun Deng, Huimeng Wang, Guinan Li, Chengxi Deng, Tianzi Wang, Mingyu Cui, Helen Meng, Xunying Liu
Title: On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition
Abstract:
This paper proposes a novel MoE-based speaker adaptation framework for foundation-model-based dysarthric speech recognition. This approach enables zero-shot adaptation and real-time processing while incorporating domain knowledge. Speech impairment severity and gender conditioned adapter experts are dynamically combined using on-the-fly predicted speaker-dependent routing parameters. KL-divergence is used to further enforce diversity among experts and their generalization to unseen speakers. Experimental results on the UASpeech corpus suggest that on-the-fly MoE-based adaptation produces statistically significant WER reductions of up to 1.34% absolute (6.36% relative) over the unadapted baseline HuBERT/WavLM models. Consistent WER reductions of up to 2.55% absolute (11.44% relative) and RTF speedups of up to 7 times are obtained over batch-mode adaptation across varying speaker-level data quantities. The lowest published WER of 16.35% (46.77% on very low intelligibility) is obtained.
中文: 本文提出了一种基于MoE的构音障碍语音识别新框架,实现了零样本实时自适应,相比基线模型显著降低了词错误率并提升了计算效率。
English: This paper introduces a novel MoE-based framework for zero-shot, real-time adaptation in dysarthric speech recognition, achieving significant WER reductions and computational efficiency over baseline models.
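
The absolute and relative WER figures quoted above are linked by simple arithmetic: relative reduction equals absolute reduction divided by the baseline WER. The sketch below shows that relationship. The helper names are hypothetical, and any baseline backed out this way is only inferred from the rounded figures in the abstract; it is not a number the authors report.

```python
def relative_reduction_pct(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER reduction in percent, given two WERs in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer * 100.0


def implied_baseline_wer(absolute_reduction: float, relative_pct: float) -> float:
    """Baseline WER (percent) implied by an absolute/relative reduction pair."""
    if relative_pct <= 0:
        raise ValueError("relative reduction must be positive")
    return absolute_reduction / (relative_pct / 100.0)


if __name__ == "__main__":
    # Pairs quoted in the abstract: (absolute %, relative %).
    for absolute, relative in [(1.34, 6.36), (2.55, 11.44)]:
        baseline = implied_baseline_wer(absolute, relative)
        print(f"{absolute}% absolute / {relative}% relative "
              f"implies a baseline near {baseline:.1f}% WER")
```

Because both published figures are rounded, the implied baselines are approximate and should be read as a sanity check rather than as reported results.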

Authors:Haicheng Liao, Zhenning Li, Guohui Zhang, Keqiang Li, Chengzhong Xu
Title: Towards Human-Like Trajectory Prediction for Autonomous Driving: A Behavior-Centric Approach
Abstract:
Predicting the trajectories of vehicles is crucial for the development of autonomous driving (AD) systems, particularly in complex and dynamic traffic environments. In this study, we introduce HiT (Human-like Trajectory Prediction), a novel model designed to enhance trajectory prediction by incorporating behavior-aware modules and dynamic centrality measures. Unlike traditional methods that primarily rely on static graph structures, HiT leverages a dynamic framework that accounts for both direct and indirect interactions among traffic participants. This allows the model to capture the subtle yet significant influences of surrounding vehicles, enabling more accurate and human-like predictions. To evaluate HiT's performance, we conducted extensive experiments using diverse and challenging real-world datasets, including NGSIM, HighD, RounD, ApolloScape, and MoCAD++. The results demonstrate that HiT consistently outperforms other top models across multiple metrics, particularly excelling in scenarios involving aggressive driving behaviors. This research presents a significant step forward in trajectory prediction, offering a more reliable and interpretable approach for enhancing the safety and efficiency of fully autonomous driving systems.
中文: HiT模型通过行为感知模块和动态中心性度量,提升了自动驾驶中车辆轨迹预测的准确性,尤其在复杂交通环境下表现卓越,显著优于现有先进模型。
English: The HiT model improves vehicle trajectory prediction in autonomous driving by using behavior-aware modules and dynamic centrality to capture complex interactions, outperforming other models in various real-world scenarios.

Authors:Jocelyn Shen, Akhila Yerukola, Xuhui Zhou, Cynthia Breazeal, Maarten Sap, Hae Won Park
Title: Words Like Knives: Backstory-Personalized Modeling and Detection of Violent Communication
Abstract:
Conversational breakdowns in close relationships are deeply shaped by personal histories and emotional context, yet most NLP research treats conflict detection as a general task, overlooking the relational dynamics that influence how messages are perceived. In this work, we leverage nonviolent communication (NVC) theory to evaluate LLMs in detecting conversational breakdowns and assessing how relationship backstory influences both human and model perception of conflicts. Given the sensitivity and scarcity of real-world datasets featuring conflict between familiar social partners with rich personal backstories, we contribute the PersonaConflicts Corpus, a dataset of N=5,772 naturalistic simulated dialogues spanning diverse conflict scenarios between friends, family members, and romantic partners. Through a controlled human study, we annotate a subset of dialogues and obtain fine-grained labels of communication breakdown types on individual turns, and assess the impact of backstory on human and model perception of conflict in conversation. We find that the polarity of relationship backstories significantly shifted human perception of communication breakdowns and impressions of the social partners, yet models struggle to meaningfully leverage those backstories in the detection task. Additionally, we find that models consistently overestimate how positively a message will make a listener feel. Our findings underscore the critical role of personalization to relationship contexts in enabling LLMs to serve as effective mediators in human communication for authentic connection.
中文摘要:本研究揭示了自然语言处理模型常忽视人际关系历史对冲突感知的影响,通过引入新数据集证明:尽管人类判断深受关系背景影响,现有模型却难以有效利用这些背景信息,且持续高估信息的积极情感反应。
English Summary: This study highlights how NLP models often miss the impact of personal relationship histories on conflict perception, introducing a novel dataset and demonstrating that while human judgment is significantly influenced by relational backstories, current models fail to effectively utilize this context and tend to overestimate positive emotional responses.

Authors:Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang
Title: GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution
Abstract:
Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376×8,376) and HighRS-VQA (avg. 2,000×1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K×8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.
Chinese: 超高分辨率遥感影像面临数据稀缺和令牌爆炸的挑战,通过引入高分辨率数据集和令牌精简策略,开发了GeoLLaVA-8K这一面向遥感的最先进多模态模型。
English: Ultra-high-resolution remote sensing imagery faces challenges of data scarcity and token explosion, which are addressed by introducing high-resolution datasets and token reduction strategies, leading to the development of GeoLLaVA-8K, a state-of-the-art multimodal model for remote sensing.

Authors:Zhaoqing Li, Haoning Xu, Xurong Xie, Zengrui Jin, Tianzi Wang, Xunying Liu
Title: Unfolding A Few Structures for The Many: Memory-Efficient Compression of Conformer and Speech Foundation Models
Abstract:
This paper presents a novel memory-efficient model compression approach for Conformer ASR and speech foundation systems. Our approach features a unique "small-to-large" design. A compact "seed" model containing a few Conformer or Transformer blocks is trained and unfolded many times to emulate the performance of larger uncompressed models with different logical depths. The seed model and many unfolded paths are jointly trained within a single unfolding cycle. The KL-divergence between the largest unfolded and smallest seed models is used in a self-distillation process to minimize their performance disparity. Experimental results show that our foldable model produces ASR performance comparable to individually constructed Conformer and wav2vec2/HuBERT speech foundation models under various depth configurations, while requiring only minimal memory and storage. Conformer and wav2vec2 models with a reduction of 35% and 30% parameters are obtained without loss of performance, respectively.
中文: 本文提出了一种内存高效的模型压缩方法,通过"从小到大"的设计将紧凑种子模型多次展开以模拟大型模型性能,在Conformer和wav2vec2上分别实现35%和30%的参数减少且保持性能无损。
English: This paper introduces a memory-efficient model compression method for Conformer ASR and speech foundation systems using a "small-to-large" design where a compact seed model is unfolded multiple times to match larger models' performance while achieving 30-35% parameter reduction without performance loss.
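
The self-distillation step described above relies on the KL-divergence between the output distributions of two models. A minimal sketch of that quantity follows, using only the standard library. Treating the largest unfolded model as the teacher distribution `p` and the seed model as the student `q` is an assumption about the direction of the divergence, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions of equal length.

    Terms with p_i == 0 contribute zero; q_i must be positive wherever p_i > 0.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i <= 0.0:
            raise ValueError("q must be positive wherever p is positive")
        total += p_i * math.log(p_i / q_i)
    return total


if __name__ == "__main__":
    teacher = softmax([2.0, 0.5, -1.0])  # stand-in for the largest unfolded model
    student = softmax([1.5, 0.7, -0.5])  # stand-in for the compact seed model
    print(f"distillation loss term: {kl_divergence(teacher, student):.4f}")
```

The divergence is zero only when the two distributions match, so minimizing it pushes the smaller model's predictions toward those of the larger one.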

Authors:Yiyuan Yang, Shitong Xu, Niki Trigoni, Andrew Markham
Title: Efficient and Microphone-Fault-Tolerant 3D Sound Source Localization
Abstract:
Sound source localization (SSL) is a critical technology for determining the position of sound sources in complex environments. However, existing methods face challenges such as high computational costs and precise calibration requirements, limiting their deployment in dynamic or resource-constrained environments. This paper introduces a novel 3D SSL framework, which uses sparse cross-attention, pretraining, and adaptive signal coherence metrics, to achieve accurate and computationally efficient localization with fewer input microphones. The framework is also fault-tolerant to unreliable or even unknown microphone position inputs, ensuring its applicability in real-world scenarios. Preliminary experiments demonstrate its scalability for multi-source localization without requiring additional hardware. This work advances SSL by balancing the model's performance and efficiency and improving its robustness for real-world scenarios.
中文摘要:本文提出了一种新型三维声源定位框架,通过稀疏交叉注意力和自适应度量提高了定位精度与计算效率,并在实际应用中展现出卓越的鲁棒性和低硬件依赖性。
English Summary: This paper presents a novel 3D sound source localization framework that enhances accuracy and computational efficiency through sparse cross-attention and adaptive metrics, while demonstrating robustness in real-world applications with minimal hardware requirements.

Authors:Yisen Gao, Jiaxin Bai, Tianshi Zheng, Qingyun Sun, Ziwei Zhang, Jianxin Li, Yangqiu Song, Xingcheng Fu
Title: Controllable Logical Hypothesis Generation for Abductive Reasoning in Knowledge Graphs
Abstract:
Abductive reasoning in knowledge graphs aims to generate plausible logical hypotheses from observed entities, with broad applications in areas such as clinical diagnosis and scientific discovery. However, due to a lack of controllability, a single observation may yield numerous plausible but redundant or irrelevant hypotheses on large-scale knowledge graphs. To address this limitation, we introduce the task of controllable hypothesis generation to improve the practical utility of abductive reasoning. This task faces two key challenges when controlling for generating long and complex logical hypotheses: hypothesis space collapse and hypothesis oversensitivity. To address these challenges, we propose CtrlHGen, a Controllable logical Hypothesis Generation framework for abductive reasoning over knowledge graphs, trained in a two-stage paradigm including supervised learning and subsequent reinforcement learning. To mitigate hypothesis space collapse, we design a dataset augmentation strategy based on sub-logical decomposition, enabling the model to learn complex logical structures by leveraging semantic patterns in simpler components. To address hypothesis oversensitivity, we incorporate smoothed semantic rewards including Dice and Overlap scores, and introduce a condition-adherence reward to guide the generation toward user-specified control constraints. Extensive experiments on three benchmark datasets demonstrate that our model not only better adheres to control conditions but also achieves superior semantic similarity performance compared to baselines.
中文:本研究提出CtrlHGen框架,通过逻辑分解解决假设空间塌陷问题,结合语义奖励机制应对假设过敏感,在知识图谱溯因推理中实现可控假设生成,实验证明其优于基线模型的控制遵循度与语义相似性。
English: The study introduces CtrlHGen, a two-stage reinforcement learning framework that enhances controllability in abductive reasoning by addressing hypothesis space collapse through logical decomposition and oversensitivity via semantic rewards, achieving superior adherence to constraints and semantic performance.
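
The Dice and Overlap scores named in the abstract above are standard set-similarity measures. The sketch below shows their usual definitions for two sets of entities. The function names and the example sets are illustrative assumptions, and the abstract does not specify how CtrlHGen smooths or combines these scores into a reward.

```python
def dice_score(predicted: set, observed: set) -> float:
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    if not predicted and not observed:
        return 1.0  # convention chosen here: two empty sets count as identical
    return 2 * len(predicted & observed) / (len(predicted) + len(observed))


def overlap_score(predicted: set, observed: set) -> float:
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    if not predicted or not observed:
        return 0.0  # convention chosen here: no overlap with an empty set
    return len(predicted & observed) / min(len(predicted), len(observed))


# Hypothetical example: entities implied by a hypothesis vs. observed entities.
observed = {"a", "b", "c", "d"}
predicted = {"a", "b"}
dice = dice_score(predicted, observed)        # 2 * 2 / (2 + 4)
overlap = overlap_score(predicted, observed)  # 2 / min(2, 4)
```

Both scores give partial credit when a hypothesis covers only part of the observation, which is one plausible reason they would serve as smoother reward signals than an exact-match criterion.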

Authors:Usman Naseem, Juan Ren, Saba Anwar, Sarah Kohail, Rudy Alexandro Garrido Veliz, Robert Geislinger, Aisha Jabr, Idris Abdulmumin, Laiba Qureshi, Aarushi Ajay Borkar, Maryam Ibrahim Mukhtar, Abinew Ali Ayele, Ibrahim Said Ahmad, Adem Ali, Martin Semmann, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Title: POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization
Abstract:
Online polarization poses a growing challenge for democratic discourse, yet most computational social science research remains monolingual, culturally narrow, or event-specific. We introduce POLAR, a multilingual, multicultural, and multi-event dataset with over 23k instances in seven languages from diverse online platforms and real-world events. Polarization is annotated along three axes: presence, type, and manifestation, using a variety of annotation platforms adapted to each cultural context. We conduct two main experiments: (1) we fine-tune six multilingual pretrained language models in both monolingual and cross-lingual setups; and (2) we evaluate a range of open and closed large language models (LLMs) in few-shot and zero-shot scenarios. Results show that while most models perform well on binary polarization detection, they achieve substantially lower scores when predicting polarization types and manifestations. These findings highlight the complex, highly contextual nature of polarization and the need for robust, adaptable approaches in NLP and computational social science. All resources will be released to support further research and effective mitigation of digital polarization globally.
中文: POLAR数据集通过提供多语言、多文化的在线极化标注,解决了单语和文化狭隘研究的局限,结果显示模型虽在二元检测上表现良好,但在极化类型和表现形式的语境细微差别上仍有不足。
English: The POLAR dataset addresses the limitations of monolingual and culturally narrow research by providing multilingual, multicultural annotations of online polarization, revealing that while models excel at binary detection, they struggle with contextual nuances of polarization types and manifestations.

Authors:Tong Wu, Zhiyong Chen, Dazhi He, Feng Yang, Meixia Tao, Xiaodong Xu, Wenjun Zhang, Ping Zhang
Title: ICDM: Interference Cancellation Diffusion Models for Wireless Semantic Communications
Abstract:
Diffusion models (DMs) have recently achieved significant success in wireless communication systems due to their denoising capabilities. The broadcast nature of wireless signals makes them susceptible not only to Gaussian noise, but also to unaware interference. This raises the question of whether DMs can effectively mitigate interference in wireless semantic communication systems. In this paper, we model the interference cancellation problem as a maximum a posteriori (MAP) problem over the joint posterior probability of the signal and interference, and theoretically prove that the solution provides excellent estimates for the signal and interference. To solve this problem, we develop an interference cancellation diffusion model (ICDM), which decomposes the joint posterior into independent prior probabilities of the signal and interference, along with the channel transition probability. The log-gradients of these distributions at each time step are learned separately by DMs and accurately estimated through derivation. ICDM further integrates these gradients with an advanced numerical iteration method, achieving accurate and rapid interference cancellation. Extensive experiments demonstrate that ICDM significantly reduces the mean square error (MSE) and enhances perceptual quality compared to schemes without ICDM. For example, on the CelebA dataset under the Rayleigh fading channel with a signal-to-noise ratio (SNR) of 20 dB and a signal-to-interference-plus-noise ratio (SINR) of 0 dB, ICDM reduces the MSE by 4.54 dB and improves the learned perceptual image patch similarity (LPIPS) by 2.47 dB.
中文: 本文提出了一种干扰消除扩散模型(ICDM),通过分解联合后验概率并将学习到的梯度与数值迭代相结合,有效消除无线语义通信中的干扰,显著降低均方误差并提升感知质量。
English: This paper introduces an interference cancellation diffusion model (ICDM) that effectively mitigates interference in wireless semantic communication by decomposing the joint posterior probability and integrating learned gradients with numerical iteration, significantly reducing mean square error and improving perceptual quality.
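
The decomposition described in the abstract above can be written out as a short worked equation. The symbols are assumptions introduced here for illustration (received signal $y$, transmitted signal $x$, interference $i$); the abstract does not fix a notation, and the paper's actual derivation may include diffusion time steps and channel terms that are omitted in this sketch.

```latex
% MAP estimate over the joint posterior of signal x and interference i
(\hat{x}, \hat{i}) = \arg\max_{x,\, i} \; p(x, i \mid y)

% Bayes' rule with independent priors on x and i
p(x, i \mid y) \;\propto\; p(y \mid x, i)\, p(x)\, p(i)

% Log-gradients (scores) split into a channel term and a prior term
\nabla_{x} \log p(x, i \mid y) = \nabla_{x} \log p(y \mid x, i) + \nabla_{x} \log p(x)
\nabla_{i} \log p(x, i \mid y) = \nabla_{i} \log p(y \mid x, i) + \nabla_{i} \log p(i)
```

Under this reading, the prior scores $\nabla_x \log p(x)$ and $\nabla_i \log p(i)$ are the quantities a diffusion model can learn, while the channel transition term $p(y \mid x, i)$ is the part that can be derived analytically.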

Authors:Yihao Ai, Zhiyuan Ning, Weiwei Dai, Pengfei Wang, Yi Du, Wenjuan Cui, Kunpeng Liu, Yuanchun Zhou
Title: Distilling Closed-Source LLM's Knowledge for Locally Stable and Economic Biomedical Entity Linking
Abstract:
Biomedical entity linking aims to map nonstandard entities to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer, limiting their usage in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address these limitations but risk stability issues and high economic costs: access to these models is restricted by commercial companies, and processing large amounts of data is expensive. To address this, we propose "RPDR", a framework combining closed-source LLMs and open-source LLMs for re-ranking candidates retrieved by a retriever fine-tuned with a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and fine-tuning an open-source LLM for re-ranking, we effectively distill the knowledge into an open-source LLM that can be deployed locally, thus avoiding the stability issues and the high economic costs. We evaluate RPDR on two datasets, including one real-world dataset and one publicly available dataset involving two languages: Chinese and English. RPDR achieves Acc@1 improvements of 0.019 and 0.036 on the Aier dataset and the Ask A Patient dataset, respectively, when training data is scarce. The results demonstrate the superiority and generalizability of the proposed framework.
中文:RPDR框架通过利用闭源大模型生成数据并微调开源模型,有效解决了低资源环境下生物医学实体链接的难题,提升了准确性并降低了成本。
English: The RPDR framework effectively addresses biomedical entity linking in low-resource scenarios by distilling knowledge from closed-source to open-source LLMs through data generation and fine-tuning, achieving improved accuracy and cost efficiency.

Authors:Zhaopeng Feng, Yupu Liang, Shaosheng Cao, Jiayuan Su, Jiahan Ren, Zhe Xu, Yao Hu, Wenxuan Huang, Jian Wu, Zuozhu Liu
Title: MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning
Abstract:
Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT's intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduce XHSPost, the first social media TIMT benchmark. Our MT$^{3}$-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.
Chinese: 本研究提出了MT$^{3}$框架,首次将多任务强化学习应用于多模态大语言模型,通过优化文本识别、上下文推理和翻译三个子技能,结合新型混合奖励机制,在端到端图像文本翻译任务中实现了最先进的性能,并创建了XHSPost基准用于真实场景评估。
English: The study introduces MT$^{3}$, a multi-task reinforcement learning framework for end-to-end text image machine translation, which achieves state-of-the-art performance and strong generalization through optimized sub-skills and a novel reward mechanism, alongside the new XHSPost benchmark for evaluation.

Authors:Yi Jiang, Sendong Zhao, Jianbo Li, Haochun Wang, Bing Qin
Title: GainRAG: Preference Alignment in Retrieval-Augmented Generation through Gain Signal Synthesis
Abstract:
The Retrieval-Augmented Generation (RAG) framework introduces a retrieval module to dynamically inject retrieved information into the input context of large language models (LLMs), and has demonstrated significant success in various NLP tasks. However, the current study points out that there is a preference gap between retrievers and LLMs in the RAG framework, which limits further improvement of system performance. Some highly relevant passages may interfere with LLM reasoning because they contain complex or contradictory information, while some indirectly related or even inaccurate content may help the LLM generate more accurate answers by providing suggestive information or logical clues. To solve this, we propose GainRAG, a novel approach that aligns the retriever's and LLM's preferences by defining a new metric, "gain", which measures how well an input passage contributes to correct outputs. Specifically, we propose a method to estimate these gain signals and train a middleware that aligns the preferences of the retriever and the LLM using only limited data. In addition, we introduce a pseudo-passage strategy to mitigate degradation. The experimental results on 6 datasets verify the effectiveness of GainRAG.
中文摘要:GainRAG框架通过定义衡量段落效用的“增益”指标,并利用有限数据训练中间件来对齐检索器与大语言模型的偏好,有效解决了RAG系统中两者间的偏好差距问题。
English Summary: The GainRAG framework addresses the preference gap between retrievers and large language models in RAG systems by introducing a "gain" metric to measure passage utility and aligning their preferences through limited data training.

Authors:Jinbang Huang, Yixin Xiao, Zhanguang Zhang, Mark Coates, Jianye Hao, Yingxue Zhang
Title: One Demo Is All It Takes: Planning Domain Derivation with LLMs from A Single Demonstration
Abstract:
Pre-trained Large Language Models (LLMs) have shown promise in solving planning problems but often struggle to ensure plan correctness, especially for long-horizon tasks. Meanwhile, traditional robotic task and motion planning (TAMP) frameworks address these challenges more reliably by combining high-level symbolic search with low-level motion planning. At the core of TAMP is the planning domain, an abstract world representation defined through symbolic predicates and actions. However, creating these domains typically involves substantial manual effort and domain expertise, limiting generalizability. We introduce Planning Domain Derivation with LLMs (PDDLLM), a novel approach that combines simulated physical interaction with LLM reasoning to improve planning performance. The method reduces reliance on humans by inferring planning domains from a single annotated task-execution demonstration. Unlike prior domain-inference methods that rely on partially predefined or language descriptions of planning domains, PDDLLM constructs domains entirely from scratch and automatically integrates them with low-level motion planning skills, enabling fully automated long-horizon planning. PDDLLM is evaluated on over 1,200 diverse tasks spanning nine environments and benchmarked against six LLM-based planning baselines, demonstrating superior long-horizon planning performance, lower token costs, and successful deployment on multiple physical robot platforms.
中文: PDDLLM通过结合大语言模型推理与物理仿真,从演示轨迹自动推导符号规划域,无需人工初始化即可提升长程任务规划成功率并降低成本。
English: PDDLLM enhances robotic long-horizon planning by automatically deriving symbolic domains from demonstrations using LLMs and simulation, achieving higher success rates and reduced costs without manual initialization.

Authors:Jinbang Huang, Yixin Xiao, Zhanguang Zhang, Mark Coates, Jianye Hao, Yingxue Zhang
Title: One Demo Is All It Takes: Planning Domain Derivation with LLMs from A Single Demonstration
Abstract:
Pre-trained large language models (LLMs) show promise for robotic task planning but often struggle to guarantee correctness in long-horizon problems. Task and motion planning (TAMP) addresses this by grounding symbolic plans in low-level execution, yet it relies heavily on manually engineered planning domains. To improve long-horizon planning reliability and reduce human intervention, we present Planning Domain Derivation with LLMs (PDDLLM), a framework that automatically induces symbolic predicates and actions directly from demonstration trajectories by combining LLM reasoning with physical simulation roll-outs. Unlike prior domain-inference methods that rely on partially predefined or language descriptions of planning domains, PDDLLM constructs domains without manual domain initialization and automatically integrates them with motion planners to produce executable plans, enhancing long-horizon planning automation. Across 1,200 tasks in nine environments, PDDLLM outperforms six LLM-based planning baselines, achieving at least 20% higher success rates, reduced token costs, and successful deployment on multiple physical robot platforms.
中文: PDDLLM通过结合大语言模型推理与物理仿真,从演示轨迹自动推导符号规划域,无需人工初始化即可提升长程任务规划成功率并降低成本。
English: PDDLLM enhances robotic long-horizon planning by automatically deriving symbolic domains from demonstrations using LLMs and simulation, achieving higher success rates and reduced costs without manual initialization.

Authors:Weihang You, Hanqi Jiang, Zishuai Liu, Zihang Xie, Tianming Liu, Jin Lu, Fei Dou
Title: ADLGen: Synthesizing Symbolic, Event-Triggered Sensor Sequences for Human Activity Modeling
Abstract:
Real-world collection of Activities of Daily Living (ADL) data is challenging due to privacy concerns, costly deployment and labeling, and the inherent sparsity and imbalance of human behavior. We present ADLGen, a generative framework specifically designed to synthesize realistic, event-triggered, and symbolic sensor sequences for ambient assistive environments. ADLGen integrates a decoder-only Transformer with sign-based symbolic temporal encoding, and a context- and layout-aware sampling mechanism to guide generation toward semantically rich and physically plausible sensor event sequences. To enhance semantic fidelity and correct structural inconsistencies, we further incorporate a large language model into an automatic generate-evaluate-refine loop, which verifies logical, behavioral, and temporal coherence and generates correction rules without manual intervention or environment-specific tuning. Through comprehensive experiments with novel evaluation metrics, ADLGen is shown to outperform baseline generators in statistical fidelity, semantic richness, and downstream activity recognition, offering a scalable and privacy-preserving solution for ADL data synthesis.
中文: ADLGen是一种生成框架,通过结合Transformer和大语言模型合成环境辅助系统中逼真的传感器序列,在保持语义和结构一致性的同时,在数据保真度和活动识别方面优于基线方法,且能保护隐私。
English: ADLGen is a generative framework that synthesizes realistic sensor sequences for ambient assistive environments using a Transformer and large language model to ensure semantic and structural coherence, outperforming baselines in fidelity and activity recognition while preserving privacy.

Authors:Weizhe Lin, Xing Li, Zhiyuan Yang, Xiaojin Fu, Hui-Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, Mingxuan Yuan
Title: TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling
Abstract:
Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs' accuracy boundaries, but they incur significant decoding overhead. A key source of inefficiency is that LRMs often generate redundant thinking CoTs, which demonstrate clear structured overthinking and underthinking patterns. Inspired by human cognitive reasoning processes and numerical optimization theories, we propose TrimR, a verifier-based, training-free, efficient framework for dynamic CoT compression to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment. Our method employs a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts of LRMs without any LRM or verifier fine-tuning. We present both the core algorithm and an asynchronous online system engineered for high-throughput industrial applications. Empirical evaluations on Ascend NPUs and vLLM show that our framework delivers substantial gains in inference efficiency under large-batch workloads. In particular, on four benchmarks (MATH500, AIME24, AIME25, and GPQA), the reasoning runtime of Pangu Pro MoE, Pangu-R-38B, QwQ-32B, and DeepSeek-R1-Distill-Qwen-32B is improved by up to 70% with negligible impact on accuracy.
中文: 大型推理模型在复杂任务中表现出色,但存在推理冗余问题,TrimR框架通过轻量级验证器动态压缩推理路径,在保持精度的同时将推理速度提升高达70%,适用于工业级部署。
English: Large Reasoning Models (LRMs) excel in complex tasks through extended Chain-of-Thought reasoning, but face inefficiency from redundant thinking, which the proposed TrimR framework addresses by dynamically compressing reasoning paths using a lightweight verifier to boost inference speed by up to 70% with minimal accuracy loss.

Authors:Rui Ye, Xiangrui Liu, Qimin Wu, Xianghe Pang, Zhenfei Yin, Lei Bai, Siheng Chen
Title: X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs
Abstract:
LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.
中文摘要:本文提出异构大语言模型驱动的多智能体系统(X-MAS),通过整合多样化大语言模型显著提升系统性能,实证研究表明该系统在多个领域比同构系统性能提升最高可达47%。
English Summary: This paper introduces heterogeneous LLM-driven multi-agent systems (X-MAS) that leverage diverse large language models to significantly enhance system performance, demonstrating up to 47% improvement over homogeneous systems through extensive evaluations across multiple domains.

Authors:Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen
Title: MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems
Abstract:
LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invite contributions from the broader open-source community.
中文摘要:MASLab作为统一全面的多智能体系统代码库,整合了20多种已验证方法,通过标准化环境和简化结构减少重复工作、确保公平比较,并降低研究门槛。
English Summary: MASLab is introduced as a unified and comprehensive codebase for LLM-based multi-agent systems, integrating over 20 validated methods to reduce redundancy, ensure fair comparisons, and lower research barriers through a streamlined structure.

Authors:Hongyu Li, Matteo Nerini, Shanpu Shen, Bruno Clerckx
Title: A Tutorial on Beyond-Diagonal Reconfigurable Intelligent Surfaces: Modeling, Architectures, System Design and Optimization, and Applications
Abstract:
Written by its inventors, this first tutorial on Beyond-Diagonal Reconfigurable Intelligent Surfaces (BD-RISs) provides the readers with the basics and fundamental tools necessary to appreciate, understand, and contribute to this emerging and disruptive technology. Conventional (Diagonal) RISs (D-RISs) are characterized by a diagonal scattering matrix $\mathbf{\Theta}$ such that the wave manipulation flexibility of D-RIS is extremely limited. In contrast, BD-RIS refers to a novel and general framework for RIS where its scattering matrix is not limited to be diagonal (hence, the "beyond-diagonal" terminology) and consequently, all entries of $\mathbf{\Theta}$ can potentially help shape waves for much higher manipulation flexibility. This physically means that BD-RIS can artificially engineer and reconfigure coupling across elements of the surface thanks to inter-element reconfigurable components which allow waves absorbed by one element to flow through other elements. Consequently, BD-RIS opens the door to more general and versatile intelligent surfaces that subsume existing RIS architectures as special cases. In this tutorial, we share all the secret sauce to model, design, and optimize BD-RIS and make BD-RIS transformative in many different applications. Topics discussed include physics-consistent and multi-port network-aided modeling; transmitting, reflecting, hybrid, and multi-sector mode analysis; reciprocal and non-reciprocal architecture designs and optimal performance-complexity Pareto frontier of BD-RIS; signal processing, optimization, and channel estimation for BD-RIS; hardware impairments (discrete-value impedance and admittance, lossy interconnections and components, wideband effects, mutual coupling) of BD-RIS; benefits and applications of BD-RIS in communications, sensing, and power transfer.
中文: 本教程介绍了超对角可重构智能表面(BD-RIS)这一创新框架,它通过实现单元间耦合突破了传统对角RIS的波束调控限制,并提供了完整的建模、设计与优化方法,将在通信和感知等领域带来革命性应用。
English: This tutorial introduces Beyond-Diagonal Reconfigurable Intelligent Surfaces (BD-RIS), a novel framework that overcomes the limitations of conventional diagonal RIS by enabling inter-element coupling for enhanced wave manipulation, providing comprehensive modeling, design, and optimization methods for transformative applications in communications and sensing.
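
The contrast between diagonal and beyond-diagonal scattering matrices in the abstract above can be made concrete with the constraints commonly used for idealized surfaces. The element count $N$, the phase shifts $\theta_n$, and the lossless and reciprocal assumptions are introduced here for illustration and are not stated in the abstract itself.

```latex
% D-RIS: each of the N elements applies only its own phase shift
\mathbf{\Theta}_{\text{D-RIS}} = \operatorname{diag}\!\left(e^{j\theta_1}, \dots, e^{j\theta_N}\right)

% BD-RIS: a full N x N matrix; a lossless surface is unitary
\mathbf{\Theta}_{\text{BD-RIS}} \in \mathbb{C}^{N \times N}, \qquad
\mathbf{\Theta}^{\mathsf{H}} \mathbf{\Theta} = \mathbf{I}_N

% A reciprocal architecture additionally has a symmetric scattering matrix
\mathbf{\Theta} = \mathbf{\Theta}^{\mathsf{T}}
```

A unit-modulus diagonal matrix satisfies both BD-RIS constraints, which is one way to see why the diagonal architecture is a special case of the more general framework.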

Authors:Wei Zhang, Zhenhong Zhou, Kun Wang, Junfeng Fang, Yuanhe Zhang, Rui Wang, Ge Zhang, Xavier Li, Li Sun, Lingjuan Lyu, Yang Liu, Sen Su
Title: LIFEBench: Evaluating Length Instruction Following in Large Language Models
Abstract:
While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., write a 10,000-word novel. Additionally, models often generate far too short outputs, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely-used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length instruction following. Notably, Reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' length instruction following ability, offering critical insights for future progress.
Chinese: 尽管大语言模型能解决博士级别的复杂推理问题,却常无法遵循明确的篇幅指令,为此推出的LIFEBench通过多任务评估揭示了它们在各类字数要求下的根本缺陷。
English: Despite excelling at complex reasoning tasks, large language models frequently fail to adhere to explicit length instructions, prompting the creation of LIFEBench to evaluate and reveal their limitations across diverse tasks and word counts.
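
To make the idea of length instruction following concrete, the sketch below computes a simple relative deviation between an output's word count and the requested length. This is an illustrative measure only: the function names are hypothetical, the abstract does not state which metrics LIFEBench uses, and whitespace splitting is a rough word count that suits English text but not Chinese.

```python
def word_count(text: str) -> int:
    """Rough word count via whitespace splitting (English-style text only)."""
    return len(text.split())


def length_deviation(text: str, target_words: int) -> float:
    """Signed relative deviation from the requested length.

    Negative values mean the output is too short, positive values too long.
    """
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    return (word_count(text) - target_words) / target_words


# Hypothetical example: a 4-word output against an 8-word instruction.
deviation = length_deviation("one two three four", 8)  # (4 - 8) / 8
```

A signed measure like this distinguishes the failure modes the abstract mentions: premature termination shows up as a large negative deviation, while overshooting shows up as a positive one.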

Authors:Mohammad Reza Taesiri, Brandon Collins, Logan Bolton, Viet Dac Lai, Franck Dernoncourt, Trung Bui, Anh Totti Nguyen
Title: Understanding Generative AI Capabilities in Everyday Image Editing Tasks
Abstract:
Generative AI (GenAI) holds significant promise for automating everyday image editing tasks, especially following the recent release of GPT-4o on March 25, 2025. However, what subjects do people most often want edited? What kinds of editing actions do they want to perform (e.g., removing or stylizing the subject)? Do people prefer precise edits with predictable outcomes or highly creative ones? By understanding the characteristics of real-world requests and the corresponding edits made by freelance photo-editing wizards, can we draw lessons for improving AI-based editors and determine which types of requests can currently be handled successfully by AI editors? In this paper, we present a unique study addressing these questions by analyzing 83k requests from the past 12 years (2013-2025) on the Reddit community, which collected 305k PSR-wizard edits. According to human ratings, only about 33% of requests can be fulfilled by the best AI editors (including GPT-4o, Gemini-2.0-Flash, SeedEdit). Interestingly, AI editors perform worse on low-creativity requests that require precise editing than on more open-ended tasks. They often struggle to preserve the identity of people and animals, and frequently make non-requested touch-ups. On the other side of the table, VLM judges (e.g., o1) perform differently from human judges and may prefer AI edits more than human edits. Code and qualitative examples are available at: https://psrdataset.github.io
中文: 本研究通过分析8.3万条真实图像编辑需求发现,当前GPT-4o等AI编辑器仅能完成约33%的编辑任务,在需要精确处理的低创意任务中表现尤差,且经常无法保持人物动物特征,但在开放性创意任务中表现相对较好。
English: This study analyzes 83,000 real-world image editing requests to reveal that current AI editors like GPT-4o can successfully handle only about one-third of tasks, struggling particularly with precise edits and preserving identities despite performing better on creative tasks.

Authors:Zeynab Kaseb, Matthias Moller, Peter Palensky, Pedro P. Vergara
Title: Quantum-Enhanced Power Flow and Optimal Power Flow based on Combinatorial Reformulation
Abstract:
This study introduces the Adiabatic Quantum Power Flow (AQPF) and Adiabatic Quantum Optimal Power Flow (AQOPF) algorithms to solve power flow (PF) and optimal power flow (OPF) problems, respectively. These algorithms utilize a novel combinatorial optimization reformulation of classical PF and OPF problems, and hence, enable their implementation on Ising machines, e.g., quantum and quantum-inspired hardware. The experiments are conducted on standard test cases ranging from 4-bus to 1354-bus systems, using D-Wave's Advantage system (QA), its hybrid quantum-classical solver (HA), as well as the third-generation Digital Annealer (DAv3) and Quantum-Inspired Integrated Optimization software (QIIO) developed by Fujitsu. The annealers are systematically evaluated based on: (i) full and partitioned formulations, (ii) ability to handle ill-conditioned cases, and (iii) scalability. The results are benchmarked against the Newton-Raphson numerical method (NR) and suggest that AQPF and AQOPF can serve as effective solvers or complementary tools to classical methods to address unsolved challenges in large-scale modern power systems.
中文: 本研究提出了绝热量子潮流(AQPF)和绝热量子最优潮流(AQOPF)算法,通过将经典潮流问题重构为组合优化形式实现在量子硬件上的运行,实验证明其可作为传统方法的有效求解器或补充工具应对现代大型电力系统的未解难题。
English: This study presents the Adiabatic Quantum Power Flow (AQPF) and Adiabatic Quantum Optimal Power Flow (AQOPF) algorithms, which reformulate classical power flow problems for implementation on quantum and quantum-inspired hardware, demonstrating their effectiveness as solvers or complementary tools to classical methods in large-scale power systems.

Authors:Fengyuan Dai, Zifeng Zhuang, Yufei Huang, Siteng Huang, Bangyan Liao, Donglin Wang, Fajie Yuan
Title: VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL
Abstract:
Diffusion models have emerged as powerful generative tools across various domains, yet tailoring pre-trained models to exhibit specific desirable properties remains challenging. While reinforcement learning (RL) offers a promising solution, current methods struggle to simultaneously achieve stable, efficient fine-tuning and support non-differentiable rewards. Furthermore, their reliance on sparse rewards provides inadequate supervision during intermediate steps, often resulting in suboptimal generation quality. To address these limitations, dense and differentiable signals are required throughout the diffusion process. Hence, we propose VAlue-based Reinforced Diffusion (VARD): a novel approach that first learns a value function predicting the expectation of rewards from intermediate states, and subsequently uses this value function with KL regularization to provide dense supervision throughout the generation process. Our method maintains proximity to the pretrained model while enabling effective and stable training via backpropagation. Experimental results demonstrate that our approach facilitates better trajectory guidance, improves training efficiency, and extends the applicability of RL to diffusion models optimized for complex, non-differentiable reward functions.
中文: VARD方法通过引入价值函数和KL正则化,在扩散过程中提供密集可微的监督信号,既能保持与原预训练模型的接近性,又能针对复杂不可微奖励实现稳定高效的强化学习微调。
English: VARD introduces a value function with KL regularization to provide dense, differentiable supervision throughout the diffusion process, enabling stable and efficient fine-tuning of pre-trained models for complex, non-differentiable rewards while maintaining proximity to the original model.

Authors:Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Idris Abdulmumin, Falalu Ibrahim Lawan, Babangida Sani, Sukairaj Hafiz Imam, Yusuf Aliyu, Sani Abdullahi Sani, Ali Usman Umar, Tajuddeen Gwadabe, Kenneth Church, Vukosi Marivate
Title: HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing
Abstract:
Hausa Natural Language Processing (NLP) has gained increasing attention in recent years, yet remains understudied as a low-resource language despite having over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. While significant advances have been made in high-resource languages, Hausa NLP faces persistent challenges, including limited open-source datasets and inadequate model representation. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks: text classification, machine translation, named entity recognition, speech recognition, and question answering. We introduce HausaNLP (https://catalog.hausanlp.org), a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development. Furthermore, we discuss challenges in integrating Hausa into large language models (LLMs), addressing issues of suboptimal tokenization and dialectal variation. Finally, we propose strategic research directions emphasizing dataset expansion, improved language modeling approaches, and strengthened community collaboration to advance Hausa NLP. Our work provides both a foundation for accelerating Hausa NLP progress and valuable insights for broader multilingual NLP research.
中文:尽管豪萨语使用人数众多,但其自然语言处理发展仍受限于开源数据集匮乏和模型代表性不足,本文通过建立资源目录并提出战略发展方向,旨在推动该领域研究进程。
English: Despite Hausa's large speaker base, its NLP development faces challenges like limited datasets and poor model representation, prompting this paper to introduce a resource catalog and propose strategic directions for advancement.

Authors:Zhiqian Lan, Yuxuan Jiang, Ruiqi Wang, Xuanbing Xie, Rongkui Zhang, Yicheng Zhu, Peihang Li, Tianshuo Yang, Tianxing Chen, Haoyu Gao, Xiaokang Yang, Xuelong Li, Hongyuan Zhang, Yao Mu, Ping Luo
Title: AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory
Abstract:
Vision-language-action (VLA) models have shown promise as generalist robotic policies by jointly leveraging visual, linguistic, and proprioceptive modalities to generate action trajectories. While recent benchmarks have advanced VLA research in domestic tasks, professional science-oriented domains remain underexplored. We introduce AutoBio, a simulation framework and benchmark designed to evaluate robotic automation in biology laboratory environments--an application domain that combines structured protocols with demanding precision and multimodal interaction. AutoBio extends existing simulation capabilities through a pipeline for digitizing real-world laboratory instruments, specialized physics plugins for mechanisms ubiquitous in laboratory workflows, and a rendering stack that supports dynamic instrument interfaces and transparent materials through physically based rendering. Our benchmark comprises biologically grounded tasks spanning three difficulty levels, enabling standardized evaluation of language-guided robotic manipulation in experimental protocols. We provide infrastructure for demonstration generation and seamless integration with VLA models. Baseline evaluations with two SOTA VLA models reveal significant gaps in precision manipulation, visual reasoning, and instruction following in scientific workflows. By releasing AutoBio, we aim to catalyze research on generalist robotic systems for complex, high-precision, and multimodal professional environments. The simulator and benchmark are publicly available to facilitate reproducible research.
中文摘要:AutoBio是一个专为生物实验室自动化评估而设计的新型仿真框架与基准测试平台,旨在解决专业科学工作流程中视觉语言动作模型在精密操作和多模态推理方面的不足。
English Summary: AutoBio is a new simulation framework and benchmark designed to evaluate vision-language-action models in biology laboratory automation, addressing gaps in precision manipulation and multimodal reasoning for professional scientific workflows.

Authors:Christopher Ick, Gordon Wichern, Yoshiki Masuyama, François Germain, Jonathan Le Roux
Title: Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse Responses
Abstract:
The characteristics of a sound field are intrinsically linked to the geometric and spatial properties of the environment surrounding a sound source and a listener. The physics of sound propagation is captured in a time-domain signal known as a room impulse response (RIR). Prior work using neural fields (NFs) has allowed learning spatially-continuous representations of RIRs from finite RIR measurements. However, previous NF-based methods have focused on monaural omnidirectional or at most binaural listeners, which does not precisely capture the directional characteristics of a real sound field at a single point. We propose a direction-aware neural field (DANF) that more explicitly incorporates the directional information by using Ambisonic-format RIRs. While DANF inherently captures spatial relations between sources and listeners, we further propose a direction-aware loss. In addition, we investigate the ability of DANF to adapt to new rooms in various ways, including low-rank adaptation.
Chinese: 本研究提出了一种方向感知神经场(DANF),通过采用Ambisonic格式的房间脉冲响应来捕捉声音的方向特性,解决了先前方法的局限性,并探索了其在新环境中的适应性。
English: This study introduces a Direction-Aware Neural Field (DANF) that incorporates Ambisonic-format room impulse responses to capture directional sound characteristics, addressing limitations of previous methods and exploring its adaptability to new environments.

Authors:Jean Tapie, Matteo Nerini, Bruno Clerckx, Philipp del Hougne
Title: Beyond-Diagonal RIS Prototype and Performance Evaluation
Abstract:
We present the first experimental prototype of a reflective beyond-diagonal reconfigurable intelligent surface (BD-RIS), i.e., a RIS with reconfigurable inter-element connections. Our BD-RIS consists of an antenna array whose ports are terminated by a tunable load network. The latter can terminate each antenna port with three distinct individual loads or connect it to an adjacent antenna port. Extensive performance evaluations in a rich-scattering environment validate that inter-element connections are beneficial. Moreover, we observe that our tunable load network's mentioned hardware constraints significantly influence, first, the achievable performance, second, the benefits of having inter-element connections, and, third, the importance of mutual-coupling awareness during optimization.
中文: 我们首次实验展示了反射型超对角可重构智能表面(BD-RIS)原型,其可调元件间连接在复杂散射环境中提升性能,但硬件约束显著影响优化效果和互耦意识的重要性。
English: We introduce the first experimental prototype of a reflective beyond-diagonal reconfigurable intelligent surface (BD-RIS), featuring tunable inter-element connections that enhance performance in rich-scattering environments, while hardware constraints significantly impact optimization and mutual-coupling awareness.

Authors:Zeynab Kaseb, Rahul Rane, Aleksandra Lekic, Matthias Moller, Amin Khodaei, Peter Palensky, Pedro P. Vergara
Title: Quantum Hardware-in-the-Loop for Optimal Power Flow in Renewable-Integrated Power Systems
Abstract:
This paper presents a proof-of-concept for integrating quantum hardware with real-time digital simulator (RTDS) to model and control modern power systems, including renewable energy resources. Power flow (PF) analysis and optimal power flow (OPF) studies are conducted using RTDS coupled with Fujitsu's CMOS Digital Annealer and D-Wave's Advantage quantum processors. The adiabatic quantum power flow (AQPF) and adiabatic quantum optimal power flow (AQOPF) algorithms are used to perform PF and OPF, respectively, on quantum and quantum-inspired hardware. The experiments are performed on the IEEE 9-bus test system and a modified version that includes solar and wind farms. The findings demonstrate that the AQPF and AQOPF algorithms can accurately perform PF and OPF, yielding results that closely match those of classical Newton-Raphson (NR) solvers while also exhibiting robust convergence. Furthermore, the integration of renewable energy sources (RES) within the AQOPF framework proves effective in maintaining system stability and performance, even under variable generation conditions. These findings highlight the potential of quantum computing to significantly enhance the modeling and control of future power grids, particularly in systems with high renewable energy penetration.
中文: 本文展示了将量子硬件与实时数字模拟器集成的概念验证,能够对含可再生能源的电力系统准确执行潮流和最优潮流分析,其结果与传统方法相当且能保持系统稳定性。
English: This paper demonstrates a proof-of-concept for integrating quantum hardware with real-time digital simulators to accurately perform power flow and optimal power flow analyses on power systems with renewable energy, showing results comparable to classical methods while maintaining system stability.

Authors:Ping Xu, Zhiyuan Ning, Pengjiang Li, Wenhao Liu, Pengyang Wang, Jiaxu Cui, Yuanchun Zhou, Pengfei Wang
Title: scSiameseClu: A Siamese Clustering Framework for Interpreting single-cell RNA Sequencing Data
Abstract:
Single-cell RNA sequencing (scRNA-seq) reveals cell heterogeneity, with cell clustering playing a key role in identifying cell types and marker genes. Recent advances, especially graph neural network (GNN)-based methods, have significantly improved clustering performance. However, the analysis of scRNA-seq data remains challenging due to noise, sparsity, and high dimensionality. Compounding these challenges, GNNs often suffer from over-smoothing, limiting their ability to capture complex biological information. In response, we propose scSiameseClu, a novel Siamese Clustering framework for interpreting single-cell RNA-seq data, comprising three key steps: (1) Dual Augmentation Module, which applies biologically informed perturbations to the gene expression matrix and cell graph relationships to enhance representation robustness; (2) Siamese Fusion Module, which combines cross-correlation refinement and adaptive information fusion to capture complex cellular relationships while mitigating over-smoothing; and (3) Optimal Transport Clustering, which utilizes Sinkhorn distance to efficiently align cluster assignments with predefined proportions while maintaining balance. Comprehensive evaluations on seven real-world datasets demonstrate that scSiameseClu outperforms state-of-the-art methods in single-cell clustering, cell type annotation, and cell type classification, providing a powerful tool for scRNA-seq data interpretation.
中文: 新型scSiameseClu框架通过结合生物信息增强数据、孪生网络融合防止过度平滑和最优传输聚类,解决了单细胞RNA测序分析中的关键难题,在多个数据集上展现出卓越性能。
English: The novel scSiameseClu framework addresses challenges in single-cell RNA-seq analysis by combining biologically-informed data augmentation, siamese network fusion to prevent over-smoothing, and optimal transport clustering, demonstrating superior performance across multiple datasets.
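
The optimal-transport step can be pictured with a minimal Sinkhorn iteration that turns a cell-by-cluster score matrix into a soft assignment whose cluster masses match prescribed proportions. This is a generic sketch under stated assumptions, not the scSiameseClu implementation: the function name `sinkhorn_plan`, the score values, the proportions, and the temperature `eps` are all hypothetical.

```python
import math

def sinkhorn_plan(scores, proportions, eps=1.0, n_iters=500):
    """Soft-assign n items to k clusters so that cluster j receives mass proportions[j].

    scores[i][j] is a similarity between item i and cluster j (higher is better);
    each item carries mass 1/n and proportions must sum to 1.
    """
    n, k = len(scores), len(proportions)
    top = max(max(row) for row in scores)  # shift scores for numerical stability
    kernel = [[math.exp((s - top) / eps) for s in row] for row in scores]
    row_mass = [1.0 / n] * n
    u, v = [1.0] * n, [1.0] * k
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [proportions[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]

# Hypothetical similarities for six cells and three clusters.
scores = [
    [2.0, 0.1, 0.3],
    [1.8, 0.2, 0.1],
    [0.2, 1.5, 0.4],
    [0.1, 0.3, 1.9],
    [1.2, 1.1, 0.2],
    [0.9, 0.2, 0.8],
]
proportions = [0.5, 0.3, 0.2]
plan = sinkhorn_plan(scores, proportions)
col_sums = [sum(row[j] for row in plan) for j in range(3)]
row_sums = [sum(row) for row in plan]
```

Taking the argmax of each row of `plan` gives a hard cluster label, while the column constraint discourages the degenerate solution in which one cluster absorbs every cell.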

Authors:Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio
Title: Search-Based Correction of Reasoning Chains for Language Models
Abstract:
Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we introduce a new self-correction framework that augments each reasoning step in a CoT with a latent variable indicating its veracity, enabling modeling of all possible truth assignments rather than assuming correctness throughout. To efficiently explore this expanded space, we introduce Search Corrector, a discrete search algorithm over boolean-valued veracity assignments. It efficiently performs otherwise intractable inference in the posterior distribution over veracity assignments by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time correction method facilitates supervised fine-tuning of an Amortized Corrector by providing pseudo-labels for veracity. The Amortized Corrector generalizes self-correction, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that Search Corrector reliably identifies errors in logical (ProntoQA) and mathematical reasoning (GSM8K) benchmarks. The Amortized Corrector achieves comparable zero-shot accuracy and improves final answer accuracy by up to 25%.
Chinese: 本文提出Search Corrector算法,在布尔型真实性赋值上进行离散搜索,以检测思维链推理中的错误,并训练Amortized Corrector实现零样本真实性推断,在逻辑(ProntoQA)和数学(GSM8K)推理基准上证明有效,最终答案准确率最高提升25%。
English: The paper introduces Search Corrector, a discrete search algorithm over boolean veracity assignments that detects inaccurate steps in Chain-of-Thought reasoning, and an Amortized Corrector for zero-shot veracity inference, showing effectiveness on logical (ProntoQA) and mathematical (GSM8K) benchmarks and improving final answer accuracy by up to 25%.

Authors:Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio
Title: Latent Veracity Inference for Identifying Errors in Stepwise Reasoning
Abstract:
Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
Chinese: 本文提出Veracity Search (VS)算法,通过潜在真实性变量检测思维链推理中的错误,并开发Amortized Veracity Inference (AVI)模型实现零样本验证,在逻辑、数学和常识推理任务中均证明有效。
English: The paper introduces Veracity Search (VS), a discrete search algorithm that uses latent veracity variables to detect inaccuracies in Chain-of-Thought reasoning, and an Amortized Veracity Inference (AVI) model for zero-shot verification, showing effectiveness across logical, mathematical, and commonsense tasks.

Authors:Sukairaj Hafiz Imam, Babangida Sani, Dawit Ketema Gete, Bedru Yimam Ahamed, Ibrahim Said Ahmad, Idris Abdulmumin, Seid Muhie Yimam, Muhammad Yahuza Bello, Shamsuddeen Hassan Muhammad
Title: Automatic Speech Recognition for African Low-Resource Languages: Challenges and Future Directions
Abstract:
Automatic Speech Recognition (ASR) technologies have transformed human-computer interaction; however, low-resource languages in Africa remain significantly underrepresented in both research and practical applications. This study investigates the major challenges hindering the development of ASR systems for these languages, which include data scarcity, linguistic complexity, limited computational resources, acoustic variability, and ethical concerns surrounding bias and privacy. The primary goal is to critically analyze these barriers and identify practical, inclusive strategies to advance ASR technologies within the African context. Recent advances and case studies emphasize promising strategies such as community-driven data collection, self-supervised and multilingual learning, lightweight model architectures, and techniques that prioritize privacy. Evidence from pilot projects involving various African languages showcases the feasibility and impact of customized solutions, which encompass morpheme-based modeling and domain-specific ASR applications in sectors like healthcare and education. The findings highlight the importance of interdisciplinary collaboration and sustained investment to tackle the distinct linguistic and infrastructural challenges faced by the continent. This study offers a progressive roadmap for creating ethical, efficient, and inclusive ASR systems that not only safeguard linguistic diversity but also improve digital accessibility and promote socioeconomic participation for speakers of African languages.
中文摘要:本研究分析了非洲低资源语言自动语音识别系统开发面临的主要挑战,并提出社区驱动数据收集、多语言学习等包容性策略,旨在建立保护语言多样性且符合伦理的高效系统。
English Summary: This study analyzes the key challenges in developing automatic speech recognition systems for low-resource African languages and proposes inclusive strategies like community-driven data collection and multilingual learning to create ethical, efficient systems that preserve linguistic diversity.

Authors:Yuzhuo Dai, Jiaqi Jin, Zhibin Dong, Siwei Wang, Xinwang Liu, En Zhu, Xihong Yang, Xinbiao Gan, Yu Feng
Title: Imputation-free and Alignment-free: Incomplete Multi-view Clustering Driven by Consensus Semantic Learning
Abstract:
In incomplete multi-view clustering (IMVC), missing data induce prototype shifts within views and semantic inconsistencies across views. A feasible solution is to explore cross-view consistency in paired complete observations, further imputing and aligning the similarity relationships inherently shared across views. Nevertheless, existing methods are constrained by two-tiered limitations: (1) Neither instance- nor cluster-level consistency learning constructs a semantic space shared across views to learn consensus semantics. The former enforces cross-view instance alignment and wrongly regards unpaired observations with semantic consistency as negative pairs; the latter focuses on cross-view cluster counterparts while coarsely handling fine-grained intra-cluster relationships within views. (2) Excessive reliance on consistency results in unreliable imputation and alignment without incorporating view-specific cluster information. Thus, we propose an IMVC framework, imputation- and alignment-free for consensus semantics learning (FreeCSL). To bridge semantic gaps across all observations, we learn consensus prototypes from available data to discover a shared space, where semantically similar observations are pulled closer for consensus semantics learning. To capture semantic relationships within specific views, we design a heuristic graph clustering based on modularity to recover cluster structure with intra-cluster compactness and inter-cluster separation for cluster semantics enhancement. Extensive experiments demonstrate that, compared to state-of-the-art competitors, FreeCSL achieves more confident and robust assignments on the IMVC task.
中文摘要:FreeCSL框架通过从可用数据中学习共识原型构建共享语义空间,并采用基于模块度的启发式图聚类增强簇结构,在不依赖补全和对齐的情况下有效解决了不完整多视图聚类问题。
English Summary: The FreeCSL framework addresses incomplete multi-view clustering by learning consensus prototypes to create a shared semantic space and using heuristic graph clustering to enhance cluster structures, achieving superior performance without imputation or alignment.

Authors:Gabriel Maldonado, Armin Danesh Pazho, Ghazal Alinezhad Noghre, Vinit Katariya, Hamed Tabkhi
Title: MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation
Abstract:
Human motion generation is essential for fields such as animation, robotics, and virtual reality, requiring models that effectively capture motion dynamics from text descriptions. Existing approaches often rely on Contrastive Language-Image Pretraining (CLIP)-based text encoders, but their training on text-image pairs constrains their ability to understand temporal and kinematic structures inherent in motion and motion generation. This work introduces MoCLIP, a fine-tuned CLIP model with an additional motion encoding head, trained on motion sequences using contrastive learning and tethering loss. By explicitly incorporating motion-aware representations, MoCLIP enhances motion fidelity while remaining compatible with existing CLIP-based pipelines and seamlessly integrating into various CLIP-based methods. Experiments demonstrate that MoCLIP improves Top-1, Top-2, and Top-3 accuracy while maintaining competitive FID, leading to improved text-to-motion alignment results. These results highlight MoCLIP's versatility and effectiveness, establishing it as a robust framework for enhancing motion generation.
Chinese: 本文提出MoCLIP,一种通过运动编码优化的CLIP模型,能更有效捕捉时序和运动结构,在保持与现有系统兼容的同时显著提升了文本到动作的生成质量与对齐效果。
English: This paper introduces MoCLIP, a fine-tuned CLIP model enhanced with motion encoding to better capture temporal and kinematic structures, improving text-to-motion alignment and fidelity while maintaining compatibility with existing pipelines.

Authors:Longchao Da, Parth Mitesh Shah, Kuan-Ru Liou, Jiaxing Zhang, Hua Wei
Title: GE-Chat: A Graph Enhanced RAG Framework for Evidential Response Generation of LLMs
Abstract:
Large Language Models are now key assistants in human decision-making processes. However, a common note always seems to follow: "LLMs can make mistakes. Be careful with important info." This points to the reality that not all outputs from LLMs are dependable, and users must evaluate them manually. The challenge deepens as hallucinated responses, often presented with seemingly plausible explanations, create complications and raise trust issues among users. To tackle this issue, this paper proposes GE-Chat, a knowledge Graph enhanced retrieval-augmented generation framework to provide Evidence-based response generation. Specifically, when the user uploads a material document, a knowledge graph is created, which helps construct a retrieval-augmented agent, enhancing the agent's responses with additional knowledge beyond its training corpus. Then we leverage Chain-of-Thought (CoT) logic generation, n-hop sub-graph searching, and entailment-based sentence generation to realize accurate evidence retrieval. We demonstrate that our method improves the existing models' performance in terms of identifying the exact evidence in a free-form context, providing a reliable way to examine the sources of an LLM's conclusions and helping users judge their trustworthiness.
中文摘要:本文提出GE-Chat框架,通过构建知识图谱增强检索机制,结合思维链推理和多跳子图搜索,为LLM生成基于证据的可验证回答,有效提升输出结果的可靠性。
English Summary: This paper introduces GE-Chat, a knowledge graph-enhanced framework that improves LLM reliability by generating evidence-based responses through structured knowledge retrieval and verification processes.

Authors:Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, Qing Li
Title: FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation
Abstract:
This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as explicit motion representations. FlowDreamer first predicts 3D scene flow from past frame and action conditions with a U-Net, and then a diffusion model will predict the future frame utilizing the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer achieves better performance compared to other baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.
中文摘要:本文提出FlowDreamer模型,通过显式三维场景流和扩散方法预测机器人操作中的未来视觉帧,在多项基准测试中相比基线模型在语义相似度、像素质量和成功率上均表现出显著优势。
English Summary: This paper introduces FlowDreamer, an RGB-D world model that uses explicit 3D scene flow and diffusion to predict future frames for robot manipulation, achieving superior performance across multiple benchmarks compared to baseline methods.

Authors:Philippe Sauter, Thomas Benz, Paul Scheffler, Martin Povišer, Frank K. Gürkaynak, Luca Benini
Title: Basilisk: A 34 mm2 End-to-End Open-Source 64-bit Linux-Capable RISC-V SoC in 130nm BiCMOS
Abstract:
End-to-end open-source electronic design automation (OSEDA) enables a collaborative approach to chip design conducive to supply chain diversification and zero-trust step-by-step design verification. However, existing end-to-end OSEDA flows have mostly been demonstrated on small designs and have not yet enabled large, industry-grade chips such as Linux-capable systems-on-chip (SoCs). This work presents Basilisk, the largest end-to-end open-source SoC to date. Basilisk's 34 mm2, 2.7 MGE design features a 64-bit Linux-capable RISC-V core, a lightweight 124 MB/s DRAM controller, and extensive IO, including a USB 1.1 host, a video output, and a fully digital 62 Mb/s chip-to-chip (C2C) link. We implement Basilisk in IHP's open 130 nm BiCMOS technology, significantly improving on the state-of-the-art (SoA) OSEDA flow. Our enhancements of the Yosys-based synthesis flow improve design timing and area by 2.3x and 1.6x, respectively, while consuming significantly less system resources. By tuning OpenROAD place and route (P&R) to our design and technology, we decrease the die size by 12%. The fabricated Basilisk chip reaches 62 MHz at its nominal 1.2 V core voltage and up to 102 MHz at 1.64 V. It achieves a peak energy efficiency of 18.9 DP MFLOP/s/W at 0.88 V.
中文:Basilisk是迄今为止最大的端到端开源片上系统,采用64位支持Linux的RISC-V内核,通过改进的开源设计流程在时序、面积和能效方面实现了显著提升。
English: Basilisk is the largest end-to-end open-source system-on-chip to date, featuring a 64-bit Linux-capable RISC-V core and achieving significant improvements in timing, area, and energy efficiency through enhanced open-source design flows.

Authors:Wenjie Liu, Yifei Li, Jian Sun, Gang Wang, Keyou You, Lihua Xie, Jie Chen
Title: Data-driven Internal Model Control for Output Regulation
Abstract:
Output regulation is a fundamental problem in control theory, extensively studied since the 1970s. Traditionally, research has primarily addressed scenarios where the system model is explicitly known, leaving the problem in the absence of a system model less explored. Leveraging the recent advancements in Willems et al.'s fundamental lemma, data-driven control has emerged as a powerful tool for stabilizing unknown systems. This paper tackles the output regulation problem for unknown single and multi-agent systems (MASs) using noisy data. Previous approaches have attempted to solve data-based output regulation equations (OREs), which are inadequate for achieving zero tracking error with noisy data. To circumvent the need for solving data-based OREs, we propose an internal model-based data-driven controller that reformulates the output regulation problem into a stabilization problem. This method is first applied to linear time-invariant (LTI) systems, demonstrating exact solution capabilities, i.e., zero tracking error, through solving a straightforward data-based linear matrix inequality (LMI). Furthermore, we extend our approach to solve the $k$th-order output regulation problem for nonlinear systems. Extensions to both linear and nonlinear MASs are discussed. Finally, numerical tests validate the effectiveness and correctness of the proposed controllers.
Chinese: 本文提出了一种基于内部模型的数据驱动控制器,将输出调节问题转化为镇定问题,无需求解传统输出调节方程即可利用噪声数据实现未知线性和非线性系统的零跟踪误差。
English: This paper introduces an internal model-based data-driven controller that transforms output regulation into a stabilization problem, enabling zero tracking error for unknown linear and nonlinear systems using noisy data without solving traditional output regulation equations.

Authors:Yuekang Li, Wei Song, Bangshuo Zhu, Dong Gong, Yi Liu, Gelei Deng, Chunyang Chen, Lei Ma, Jun Sun, Toby Walsh, Jingling Xue
Title: ai.txt: A Domain-Specific Language for Guiding AI Interactions with the Internet
Abstract:
We introduce ai.txt, a novel domain-specific language (DSL) designed to explicitly regulate interactions between AI models, agents, and web content, addressing critical limitations of the widely adopted robots.txt standard. As AI increasingly engages with online materials for tasks such as training, summarization, and content modification, existing regulatory methods lack the necessary granularity and semantic expressiveness to ensure ethical and legal compliance. ai.txt extends traditional URL-based access controls by enabling precise element-level regulations and incorporating natural language instructions interpretable by AI systems. To facilitate practical deployment, we provide an integrated development environment with code autocompletion and automatic XML generation. Furthermore, we propose two compliance mechanisms: XML-based programmatic enforcement and natural language prompt integration, and demonstrate their effectiveness through preliminary experiments and case studies. Our approach aims to aid the governance of AI-Internet interactions, promoting responsible AI use in digital ecosystems.
中文摘要:我们推出ai.txt这一领域特定语言,它在robots.txt基础上扩展了细粒度的元素级规则和自然语言指令,用以规范AI与网络内容的交互,并提供了开发工具与合规机制支持。
English Summary: We introduce ai.txt, a domain-specific language that extends robots.txt to provide granular, element-level regulations and natural language instructions for governing AI interactions with web content, supported by development tools and compliance mechanisms.

Authors:Frank Reyes, May Mahmoud, Federico Bono, Sarah Nadi, Benoit Baudry, Martin Monperrus
Title: Byam: Fixing Breaking Dependency Updates with Large Language Models
Abstract:
Application Programming Interfaces (APIs) facilitate the integration of third-party dependencies within the code of client applications. However, changes to an API, such as deprecation, modification of parameter names or types, or complete replacement with a new API, can break existing client code. These changes are called breaking dependency updates. It is often tedious for API users to identify the cause of these breaks and update their code accordingly. In this paper, we explore the use of Large Language Models (LLMs) to automate client code updates in response to breaking dependency updates. We evaluate our approach on the BUMP dataset, a benchmark for breaking dependency updates in Java projects. Our approach leverages LLMs with advanced prompts, including information from the build process and from the breaking dependency analysis. We assess effectiveness at three granularity levels: at the build level, the file level, and the individual compilation error level. We experiment with five LLMs: Google Gemini-2.0 Flash, OpenAI GPT4o-mini, OpenAI o3-mini, Alibaba Qwen2.5-32b-instruct, and DeepSeek V3. Our results show that LLMs can automatically repair breaking updates. Among the considered models, OpenAI's o3-mini is the best, able to completely fix 27% of the builds when using prompts that include contextual information such as the buggy line, API differences, error messages, and step-by-step reasoning instructions. Also, it fixes 78% of the individual compilation errors. Overall, our findings demonstrate the potential for LLMs to fix compilation errors due to breaking dependency updates, supporting developers in their efforts to stay up-to-date with changes in their dependencies.
中文: 研究表明,大型语言模型(特别是OpenAI的o3-mini)能通过结合构建信息和依赖变更分析的提示,自动修复因依赖更新导致的客户端代码错误,在Java项目中实现了27%构建级别和78%编译错误级别的修复率。
English: This study demonstrates that Large Language Models (LLMs), particularly OpenAI's o3-mini, can effectively automate client code repairs for breaking dependency updates by leveraging contextual prompts, achieving 27% build-level and 78% compilation error-level fixes in Java projects.

Authors:Chao Ding, Mouxiao Bian, Pengcheng Chen, Hongliang Zhang, Tianbin Li, Lihao Liu, Jiayuan Chen, Zhuoran Li, Yabei Zhong, Yongqi Liu, Haiqing Huang, Dongming Shan, Junjun He, Jie Xu
Title: Building a Human-Verified Clinical Reasoning Dataset via a Human LLM Hybrid Pipeline for Trustworthy Medical AI
Abstract:
Despite strong performance in medical question-answering, the clinical adoption of Large Language Models (LLMs) is critically hampered by their opaque 'black-box' reasoning, limiting clinician trust. This challenge is compounded by the predominant reliance of current medical LLMs on corpora from scientific literature or synthetic data, which often lack the granular expert validation and high clinical relevance essential for advancing their specialized medical capabilities. To address these critical gaps, we introduce a highly clinically relevant dataset with 31,247 medical question-answer pairs, each accompanied by expert-validated chain-of-thought (CoT) explanations. This resource, spanning multiple clinical domains, was curated via a scalable human-LLM hybrid pipeline: LLM-generated rationales were iteratively reviewed, scored, and refined by medical experts against a structured rubric, with substandard outputs revised through human effort or guided LLM regeneration until expert consensus. This publicly available dataset provides a vital source for the development of medical LLMs that are capable of transparent and verifiable reasoning, thereby advancing safer and more interpretable AI in medicine.
English Summary: This research addresses the lack of transparency in medical Large Language Models by introducing a clinically validated dataset with expert-reviewed chain-of-thought explanations to enable verifiable AI reasoning in healthcare.

Authors:Chengkai Xu, Jiaqi Liu, Yicheng Guo, Yuhang Zhang, Peng Hang, Jian Sun
Title: Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning
Abstract:
Autonomous driving has made significant strides through data-driven techniques, achieving robust performance in standardized tasks. However, existing methods frequently overlook user-specific preferences, offering limited scope for interaction and adaptation with users. To address these challenges, we propose a "fast-slow" decision-making framework that integrates a Large Language Model (LLM) for high-level instruction parsing with a Reinforcement Learning (RL) agent for low-level real-time decision. In this dual system, the LLM operates as the "slow" module, translating user directives into structured guidance, while the RL agent functions as the "fast" module, making time-critical maneuvers under stringent latency constraints. By decoupling high-level decision making from rapid control, our framework enables personalized user-centric operation while maintaining robust safety margins. Experimental evaluations across various driving scenarios demonstrate the effectiveness of our method. Compared to baseline algorithms, the proposed architecture not only reduces collision rates but also aligns driving behaviors more closely with user preferences, thereby achieving a human-centric mode. By integrating user guidance at the decision level and refining it with real-time control, our framework bridges the gap between individual passenger needs and the rigor required for safe, reliable driving in complex traffic environments.
中文摘要:该研究提出的"快慢"决策框架将大型语言模型用于解析用户指令与强化学习智能体进行实时控制相结合,在保证安全性的同时实现了更符合个人偏好的自动驾驶体验。
English Summary: The proposed "fast-slow" framework integrates an LLM for interpreting user preferences with an RL agent for real-time control, enabling personalized autonomous driving while maintaining safety across various scenarios.

Authors:Bonan Wang, Haicheng Liao, Chengyue Wang, Bin Rao, Yanchen Guan, Guyang Yu, Jiaxun Zhang, Songning Lai, Chengzhong Xu, Zhenning Li
Title: Beyond Patterns: Harnessing Causal Logic for Autonomous Driving Trajectory Prediction
Abstract:
Accurate trajectory prediction has long been a major challenge for autonomous driving (AD). Traditional data-driven models predominantly rely on statistical correlations, often overlooking the causal relationships that govern traffic behavior. In this paper, we introduce a novel trajectory prediction framework that leverages causal inference to enhance predictive robustness, generalization, and accuracy. By decomposing the environment into spatial and temporal components, our approach identifies and mitigates spurious correlations, uncovering genuine causal relationships. We also employ a progressive fusion strategy to integrate multimodal information, simulating human-like reasoning processes and enabling real-time inference. Evaluations on five real-world datasets--ApolloScape, nuScenes, NGSIM, HighD, and MoCAD--demonstrate our model's superiority over existing state-of-the-art (SOTA) methods, with improvements in key metrics such as RMSE and FDE. Our findings highlight the potential of causal reasoning to transform trajectory prediction, paving the way for robust AD systems.
Chinese: 本文提出了一种基于因果推理的轨迹预测框架,通过识别真实因果关系并融合多模态信息,显著提升了预测的鲁棒性和准确性,在多个真实数据集上验证了其优越性能。
English: This paper introduces a causal inference-based trajectory prediction framework that enhances robustness and accuracy by identifying genuine causal relationships and integrating multimodal information, demonstrating superior performance on multiple real-world datasets.

Authors:Xuzhi Zhang, Shaohui Peng, Qirui Zhou, Yuanbo Wen, Qi Guo, Ruizhi Chen, Xinguo Zhu, Weiqiang Xiong, Haixin Chen, Congying Ma, Ke Gao, Chen Zhao, Yanjun Wu, Yunji Chen, Ling Li
Title: QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives
Abstract:
Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks portability. LLMs excel at generating high-level language codes, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to 1291x performance improvement. Even compared with human experts, QiMeng-TensorOp could reach 251% of OpenBLAS on RISC-V CPUs, and 124% of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp also significantly reduces development costs by 200x compared with human experts.
中文: QiMeng-TensorOp 是一个自动生成框架,能让大语言模型利用硬件原语生成高性能张量算子,相比原始模型性能提升高达1291倍,并大幅降低开发成本。
English: QiMeng-TensorOp is an auto-generation framework that enables LLMs to create high-performance tensor operators by leveraging hardware primitives, achieving up to 1291x performance improvement over vanilla LLMs and significantly reducing development costs.

Authors:Zihang Song, Matteo Zecchin, Bipin Rajendran, Osvaldo Simeone
Title: Turbo-ICL: In-Context Learning-Based Turbo Equalization
Abstract:
This paper introduces a novel in-context learning (ICL) framework, inspired by large language models (LLMs), for soft-input soft-output channel equalization in coded multiple-input multiple-output (MIMO) systems. The proposed approach learns to infer posterior symbol distributions directly from a prompt of pilot signals and decoder feedback. A key innovation is the use of prompt augmentation to incorporate extrinsic information from the decoder output as additional context, enabling the ICL model to refine its symbol estimates iteratively across turbo decoding iterations. Two model variants, based on Transformer and state-space architectures, are developed and evaluated. Extensive simulations demonstrate that, when traditional linear assumptions break down, e.g., in the presence of low-resolution quantization, ICL equalizers consistently outperform conventional model-based baselines, even when the latter are provided with perfect channel state information. Results also highlight the advantage of Transformer-based models under limited training diversity, as well as the efficiency of state-space models in resource-constrained scenarios.
中文: 本文提出了一种新颖的情境学习框架,通过提示增强技术迭代优化MIMO信道均衡的符号估计,在低分辨率量化等非线性失真场景下显著优于传统方法。
English: This paper presents a novel in-context learning framework for MIMO channel equalization that leverages prompt augmentation to iteratively refine symbol estimates, outperforming conventional methods especially under nonlinear distortions like low-resolution quantization.

Authors:Zhiyuan Ning, Pengfei Wang, Ziyue Qiao, Pengyang Wang, Yuanchun Zhou
Title: Rethinking Graph Contrastive Learning through Relative Similarity Preservation
Abstract:
Graph contrastive learning (GCL) has achieved remarkable success by following the computer vision paradigm of preserving absolute similarity between augmented views. However, this approach faces fundamental challenges in graphs due to their discrete, non-Euclidean nature -- view generation often breaks semantic validity and similarity verification becomes unreliable. Through analyzing 11 real-world graphs, we discover a universal pattern transcending the homophily-heterophily dichotomy: label consistency systematically diminishes as structural distance increases, manifesting as smooth decay in homophily graphs and oscillatory decay in heterophily graphs. We establish theoretical guarantees for this pattern through random walk theory, proving label distribution convergence and characterizing the mechanisms behind different decay behaviors. This discovery reveals that graphs naturally encode relative similarity patterns, where structurally closer nodes exhibit collectively stronger semantic relationships. Leveraging this insight, we propose RELGCL, a novel GCL framework with complementary pairwise and listwise implementations that preserve these inherent patterns through collective similarity objectives. Extensive experiments demonstrate that our method consistently outperforms 20 existing approaches across both homophily and heterophily graphs, validating the effectiveness of leveraging natural relative similarity over artificial absolute similarity.
中文: 图对比学习应转向利用自然的相对相似性模式而非人为绝对相似性,本文提出的RELGCL框架通过捕捉标签一致性随结构距离系统性衰减的规律,在各类图数据上持续超越现有方法。
English: Graph contrastive learning should shift from preserving artificial absolute similarity to leveraging natural relative similarity patterns, as demonstrated by the proposed RELGCL framework which consistently outperforms existing methods by capturing how label consistency systematically decays with structural distance.
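
The decay pattern described in the abstract (label consistency falling as structural distance grows) can be measured on any labeled graph. The following is a minimal sketch on a toy path graph, not the authors' analysis code: the function names, the toy graph, and the choice of shortest-path hop count as "structural distance" are assumptions made for illustration.

```python
from collections import defaultdict, deque


def hop_distances(adj, source):
    """Breadth-first search: shortest hop count from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def label_consistency_by_distance(adj, labels):
    """For each hop distance d, the fraction of ordered node pairs at distance d sharing a label."""
    same = defaultdict(int)
    total = defaultdict(int)
    for s in adj:
        for t, d in hop_distances(adj, s).items():
            if d == 0:
                continue  # skip the node itself
            total[d] += 1
            same[d] += int(labels[s] == labels[t])
    return {d: same[d] / total[d] for d in sorted(total)}


# Hypothetical toy graph: a path 0-1-2-3 with two classes.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
consistency = label_consistency_by_distance(adj, labels)
for d, c in consistency.items():
    print(f"distance {d}: consistency {c:.2f}")
```

On this toy graph the consistency is highest among direct neighbors and drops at larger distances, which is the kind of relative ordering the abstract refers to.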

Authors:Zhaohan Feng, Ruiqi Xue, Lei Yuan, Yang Yu, Ning Ding, Meiqin Liu, Bingzhao Gao, Jian Sun, Xinhu Zheng, Gang Wang
Title: Multi-agent Embodied AI: Advances and Future Directions
Abstract:
Embodied artificial intelligence (Embodied AI) plays a pivotal role in the application of advanced technologies in the intelligent era, where AI systems are integrated with physical bodies that enable them to perceive, reason, and interact with their environments. Through the use of sensors for input and actuators for action, these systems can learn and adapt based on real-world feedback, allowing them to perform tasks effectively in dynamic and unpredictable environments. As techniques such as deep learning (DL), reinforcement learning (RL), and large language models (LLMs) mature, embodied AI has become a leading field in both academia and industry, with applications spanning robotics, healthcare, transportation, and manufacturing. However, most research has focused on single-agent systems that often assume static, closed environments, whereas real-world embodied AI must navigate far more complex scenarios. In such settings, agents must not only interact with their surroundings but also collaborate with other agents, necessitating sophisticated mechanisms for adaptation, real-time learning, and collaborative problem-solving. Despite increasing interest in multi-agent systems, existing research remains narrow in scope, often relying on simplified models that fail to capture the full complexity of dynamic, open environments for multi-agent embodied AI. Moreover, no comprehensive survey has systematically reviewed the advancements in this area. As embodied AI rapidly evolves, it is crucial to deepen our understanding of multi-agent embodied AI to address the challenges presented by real-world applications. To fill this gap and foster further development in the field, this paper reviews the current state of research, analyzes key contributions, and identifies challenges and future directions, providing insights to guide innovation and progress in this field.
Chinese: 具身人工智能将AI与物理系统结合以感知和交互环境,但目前多智能体系统的研究仍显不足且缺乏全面综述,本文旨在回顾进展并指明未来方向以推动该领域发展。
English: Embodied AI integrates AI with physical systems to perceive and interact with environments, yet current research on multi-agent systems remains limited and lacks comprehensive surveys, prompting this paper to review advancements and outline future directions.

Authors:Chenxu Peng, Chenxu Wang, Minrui Zou, Danyang Li, Zhengpeng Yang, Yimian Dai, Ming-Ming Cheng, Xiang Li
Title: A Simple Detector with Frame Dynamics is a Strong Tracker
Abstract:
Infrared object tracking plays a crucial role in Anti-Unmanned Aerial Vehicle (Anti-UAV) applications. Existing trackers often depend on cropped template regions and have limited motion modeling capabilities, which pose challenges when dealing with tiny targets. To address this, we propose a simple yet effective infrared tiny-object tracker that enhances tracking performance by integrating global detection and motion-aware learning with temporal priors. Our method is based on object detection and achieves significant improvements through two key innovations. First, we introduce frame dynamics, leveraging frame difference and optical flow to encode both prior target features and motion characteristics at the input level, enabling the model to better distinguish the target from background clutter. Second, we propose a trajectory constraint filtering strategy in the post-processing stage, utilizing spatio-temporal priors to suppress false positives and enhance tracking robustness. Extensive experiments show that our method consistently outperforms existing approaches across multiple metrics in challenging infrared UAV tracking scenarios. Notably, we achieve state-of-the-art performance in the 4th Anti-UAV Challenge, securing 1st place in Track 1 and 2nd place in Track 2.
中文: 本研究提出了一种红外微小目标跟踪器,通过结合全局检测与运动感知学习和时序先验来提升性能,在反无人机挑战赛中取得了领先成绩。
English: This study introduces an infrared tiny-object tracker that improves performance by integrating global detection with motion-aware learning and temporal priors, achieving top results in the Anti-UAV Challenge.

Authors:Yogya Gamage, Deepika Tiwari, Martin Monperrus, Benoit Baudry
Title: The Design Space of Lockfiles Across Package Managers
Abstract:
Software developers reuse third-party packages that are hosted in package registries. At build time, a package manager resolves and fetches the direct and indirect dependencies of a project. Most package managers also generate a lockfile, which records the exact set of resolved dependency versions. Lockfiles are used to reduce build times; to verify the integrity of resolved packages; and to support build reproducibility across environments and time. Despite these beneficial features, developers often struggle with their maintenance, usage, and interpretation. In this study, we unveil the major challenges related to lockfiles, such that future researchers and engineers can address them. We perform the first comprehensive study of lockfiles across 7 popular package managers, npm, pnpm, Cargo, Poetry, Pipenv, Gradle, and Go. First, we highlight the wide variety of design decisions that package managers make, regarding the generation process as well as the content of lockfiles. Next, we conduct a qualitative analysis based on semi-structured interviews with 15 developers. We capture first-hand insights about the benefits that developers perceive in lockfiles, as well as the challenges they face to manage these files. Following these observations, we make 5 recommendations to further improve lockfiles, for a better developer experience.
中文: 本研究揭示了软件开发中锁文件的主要挑战与优势,并针对七种包管理器提出五项改进建议,以优化其管理并提升开发者体验。
English: This study identifies key challenges and benefits of lockfiles in software development, offering recommendations to enhance their management and improve developer experience across seven package managers.
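
To make the integrity-verification role of a lockfile concrete, here is a minimal sketch, assuming a simplified in-memory lockfile that maps each package name to a pinned version and a SHA-256 digest. The dictionary layout, the function names, and the example package are hypothetical; real lockfile formats differ between the package managers studied.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw package bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_package(name: str, version: str, payload: bytes, lockfile: dict) -> bool:
    """True only if the fetched package matches the pinned version and recorded digest."""
    entry = lockfile.get(name)
    if entry is None:
        return False  # package was never locked
    return entry["version"] == version and entry["sha256"] == sha256_hex(payload)


# Hypothetical lockfile with a single locked dependency.
payload = b"example package contents"
lockfile = {"left-pad": {"version": "1.3.0", "sha256": sha256_hex(payload)}}

print(verify_package("left-pad", "1.3.0", payload, lockfile))
print(verify_package("left-pad", "1.3.0", b"tampered contents", lockfile))
```

Pinning both the version and the digest is what lets a later build detect either a silently changed resolution or a modified artifact.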

Authors:Hongjin Du, Tuanku Badzlin Hashfi, Rashmi Prasad, Pedro P. Vergara, Peter Palensky, Aleksandra Lekić
Title: Optimal Droop Control Strategy for Coordinated Voltage Regulation and Power Sharing in Hybrid AC-MTDC Systems
Abstract:
With the growing integration of modular multilevel converters (MMCs) in Multi-Terminal Direct Current (MTDC) transmission systems, there is an increasing need for control strategies that ensure both economic efficiency and robust dynamic performance. This paper presents an enhanced Optimal Power Flow (OPF) framework for hybrid AC-MTDC systems, integrating a novel droop control strategy that coordinates DC voltage and AC frequency regulation. By embedding frequency control loops into the MMCs, the method enables system-wide coordination, enhancing power sharing and improving system resilience under disturbances. The proposed strategy dynamically adjusts converter operating points to minimize generation costs and DC voltage deviations, thus balancing economic objectives with system stability. A modified Nordic test system integrated with a four-terminal MTDC grid is used to validate the approach. Optimization is performed using Julia, while the system's dynamic performance is evaluated through electromagnetic transient simulations with the EMTP software. Case studies across multiple scenarios demonstrate that the proposed method consistently achieves lower generation costs than active power control and adaptive droop control strategy while maintaining stable control characteristics. The results highlight the method's capability to deliver cost-effective operation without compromising performance, offering a promising solution for the coordinated control of future hybrid AC-DC transmission networks.
中文: 本文提出了一种用于混合交流-多端直流系统的改进最优潮流框架,通过新型下垂控制策略实现系统范围的协调,在改进的北欧测试系统中验证了其降低发电成本并保持稳定控制性能的能力。
English: This paper introduces an enhanced Optimal Power Flow framework with a novel droop control strategy for hybrid AC-MTDC systems, achieving cost-effective operation and improved stability through system-wide coordination validated on a modified Nordic test system.

Authors:Giacomo Avanzi, Marco Giordani, Michele Zorzi
Title: Multi-Agent Reinforcement Learning Scheduling to Support Low Latency in Teleoperated Driving
Abstract:
The teleoperated driving (TD) scenario comes with stringent Quality of Service (QoS) communication constraints, especially in terms of end-to-end (E2E) latency and reliability. In this context, Predictive Quality of Service (PQoS), possibly combined with Reinforcement Learning (RL) techniques, is a powerful tool to estimate QoS degradation and react accordingly. For example, an intelligent agent can be trained to select the optimal compression configuration for automotive data, and reduce the file size whenever QoS conditions deteriorate. However, compression may inevitably compromise data quality, with negative implications for the TD application. An alternative strategy involves operating at the Radio Access Network (RAN) level to optimize radio parameters based on current network conditions, while preserving data quality. In this paper, we propose Multi-Agent Reinforcement Learning (MARL) scheduling algorithms, based on Proximal Policy Optimization (PPO), to dynamically and intelligently allocate radio resources to minimize E2E latency in a TD scenario. We evaluate two training paradigms, i.e., decentralized learning with local observations (IPPO) vs. centralized aggregation (MAPPO), in conjunction with two resource allocation strategies, i.e., proportional allocation (PA) and greedy allocation (GA). We prove via ns-3 simulations that MAPPO, combined with GA, achieves the best results in terms of latency, especially as the number of vehicles increases.
中文摘要:本文提出基于近端策略优化的多智能体强化学习调度算法,通过动态分配无线资源,仿真验证了集中式训练与贪婪分配相结合的方法在远程驾驶场景中能最有效地降低端到端时延。
English Summary: This paper introduces Multi-Agent Reinforcement Learning scheduling algorithms using Proximal Policy Optimization to dynamically allocate radio resources, demonstrating through simulations that MAPPO combined with greedy allocation minimizes end-to-end latency in teleoperated driving scenarios.
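
The abstract names two resource allocation strategies, proportional allocation (PA) and greedy allocation (GA), without defining them. The sketch below shows one plausible reading of each, under loudly stated assumptions: demands are integer counts of resource blocks, PA splits the budget in proportion to demand, and GA serves the largest demand first. The paper's actual definitions, and how the learned policy drives them, may differ.

```python
def proportional_allocation(demands, total):
    """Split `total` resource blocks in proportion to each demand (rounded down)."""
    demand_sum = sum(demands)
    if demand_sum == 0:
        return [0] * len(demands)
    return [total * d // demand_sum for d in demands]


def greedy_allocation(demands, total):
    """Serve users in order of decreasing demand, fully satisfying each until blocks run out."""
    allocation = [0] * len(demands)
    remaining = total
    order = sorted(range(len(demands)), key=lambda i: demands[i], reverse=True)
    for i in order:
        give = min(demands[i], remaining)
        allocation[i] = give
        remaining -= give
    return allocation


# Hypothetical example: three vehicles requesting 6, 3, and 1 blocks, with 5 available.
demands = [6, 3, 1]
print(proportional_allocation(demands, 5))
print(greedy_allocation(demands, 5))
```

Under these assumptions PA spreads scarce blocks across vehicles, while GA concentrates them on the most demanding one.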

Authors:Shouyang Dong, Yuanbo Wen, Jun Bi, Di Huang, Jiaming Guo, Jianxing Xu, Ruibai Xu, Xinkai Song, Yifan Hao, Xuehai Zhou, Tianshi Chen, Qi Guo, Yunji Chen
Title: QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach
Abstract:
Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires developing multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques struggle with either tremendous manual efforts or functional incorrectness, rendering "Write Once, Run Anywhere" of tensor programs an open question. We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is leveraging the powerful code generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets with a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach to systematically explore both the parameters and sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs with an average accuracy of 95%, and the performance of translated programs achieves up to 2.0x over vendor-provided manually-optimized libraries. As a result, the programming productivity of DLS is improved by up to 96.0x via transcompiling legacy tensor programs.
Chinese: QiMeng-Xpiler是一种创新的跨平台编译器,通过神经符号合成技术自动转换异构深度学习系统中的张量程序,实现了95%的平均准确率、最高2.0倍的性能提升,并将编程效率提高了96.0倍。
English: QiMeng-Xpiler is a novel transcompiler that uses neural-symbolic synthesis to automatically translate tensor programs across heterogeneous deep learning systems, achieving 95% accuracy and up to 2.0x performance improvement while boosting programming productivity by 96.0x.

Authors:Ziqi Ding, Qian Fu, Junchen Ding, Gelei Deng, Yi Liu, Yuekang Li
Title: A Rusty Link in the AI Supply Chain: Detecting Evil Configurations in Model Repositories
Abstract:
Recent advancements in large language models (LLMs) have spurred the development of diverse AI applications from code generation and video editing to text generation. However, AI supply chains such as Hugging Face, which host pretrained models and their associated configuration files contributed by the public, face significant security challenges. In particular, configuration files originally intended to set up models by specifying parameters and initial settings can be exploited to execute unauthorized code, yet research has largely overlooked their security compared to that of the models themselves. In this work, we present the first comprehensive study of malicious configurations on Hugging Face, identifying three attack scenarios (file, website, and repository operations) that expose inherent risks. To address these threats, we introduce CONFIGSCAN, an LLM-based tool that analyzes configuration files in the context of their associated runtime code and critical libraries, effectively detecting suspicious elements with low false positive rates and high accuracy. Our extensive evaluation uncovers thousands of suspicious repositories and configuration files, underscoring the urgent need for enhanced security validation in AI model hosting platforms.
Chinese Summary: 近期大语言模型的进展揭示了如Hugging Face等AI供应链中的安全漏洞,恶意配置文件可能执行未授权代码,为此开发了基于大语言模型的CONFIGSCAN工具,能高效精准地检测此类威胁。
English Summary: Recent advancements in LLMs have exposed security vulnerabilities in AI supply chains like Hugging Face, where malicious configuration files can execute unauthorized code, prompting the development of CONFIGSCAN, an LLM-based tool that effectively detects these threats with high accuracy.

Authors:Hyunji Lee, Franck Dernoncourt, Trung Bui, Seunghyun Yoon
Title: CORG: Generating Answers from Complex, Interrelated Contexts
Abstract:
In a real-world corpus, knowledge frequently recurs across documents but often contains inconsistencies due to ambiguous naming, outdated information, or errors, leading to complex interrelationships between contexts. Previous research has shown that language models struggle with these complexities, typically focusing on single factors in isolation. We classify these relationships into four types: distracting, ambiguous, counterfactual, and duplicated. Our analysis reveals that no single approach effectively addresses all these interrelationships simultaneously. Therefore, we introduce Context Organizer (CORG), a framework that organizes multiple contexts into independently processed groups. This design allows the model to efficiently find all relevant answers while ensuring disambiguation. CORG consists of three key components: a graph constructor, a reranker, and an aggregator. Our results demonstrate that CORG balances performance and efficiency effectively, outperforming existing grouping methods and achieving comparable results to more computationally intensive, single-context approaches.
Chinese Summary: 本文提出Context Organizer (CORG)框架,通过将多重语境组织成独立处理组来有效应对分散、模糊、反事实和重复等复杂关系,在平衡性能与效率的同时超越了现有方法。
English Summary: This paper introduces Context Organizer (CORG), a framework that organizes multiple contexts into independently processed groups to effectively handle complex interrelationships like distracting, ambiguous, counterfactual, and duplicated information, outperforming existing methods while balancing performance and efficiency.

Authors:Yu Xi, Xiaoyu Gu, Haoyu Li, Jun Song, Bo Zheng, Kai Yu
Title: Masked Self-distilled Transducer-based Keyword Spotting with Semi-autoregressive Decoding
Abstract:
RNN-T-based keyword spotting (KWS) with autoregressive (AR) decoding has gained attention due to its streaming architecture and superior performance. However, the simplicity of the prediction network in RNN-T poses an overfitting issue, especially under challenging scenarios, resulting in degraded performance. In this paper, we propose a masked self-distillation (MSD) training strategy that keeps RNN-Ts from relying too heavily on prediction networks, thereby alleviating overfitting. Such training enables masked non-autoregressive (NAR) decoding, which fully masks the RNN-T predictor output during KWS decoding. In addition, we propose a semi-autoregressive (SAR) decoding approach to integrate the advantages of AR and NAR decoding. Our experiments across multiple KWS datasets demonstrate that MSD training effectively alleviates overfitting. The SAR decoding method preserves the superior performance of AR decoding while benefiting from the overfitting suppression of NAR decoding, achieving excellent results.
Chinese: 本文提出了一种掩码自蒸馏训练策略和半自回归解码方法,以解决基于RNN-T的关键词检测中的过拟合问题,并在多个数据集上提升了性能。
English: The paper introduces a masked self-distillation training strategy and semi-autoregressive decoding method to address overfitting in RNN-T-based keyword spotting, enhancing performance across various datasets.

Authors:Haozhan Tang, Tianyi Zhang, Matthew Johnson-Roberson, Weiming Zhi
Title: Bi-Manual Joint Camera Calibration and Scene Representation
Abstract:
Robot manipulation, especially bimanual manipulation, often requires setting up multiple cameras on multiple robot manipulators. Before robot manipulators can generate motion or even build representations of their environments, the cameras rigidly mounted to the robot need to be calibrated. Camera calibration is a cumbersome process involving collecting a set of images, with each capturing a pre-determined marker. In this work, we introduce the Bi-Manual Joint Calibration and Representation Framework (Bi-JCR). Bi-JCR enables multiple robot manipulators, each with cameras mounted, to circumvent taking images of calibration markers. By leveraging 3D foundation models for dense, marker-free multi-view correspondence, Bi-JCR jointly estimates: (i) the extrinsic transformation from each camera to its end-effector, (ii) the inter-arm relative poses between manipulators, and (iii) a unified, scale-consistent 3D representation of the shared workspace, all from the same captured RGB image sets. The representation, jointly constructed from images captured by cameras on both manipulators, lives in a common coordinate frame and supports collision checking and semantic segmentation to facilitate downstream bimanual coordination tasks. We empirically evaluate the robustness of Bi-JCR on a variety of tabletop environments, and demonstrate its applicability on a variety of downstream tasks.
中文摘要:Bi-JCR框架通过利用3D基础模型,无需标定标记即可同时校准多机器人臂上的摄像头,并从RGB图像构建统一的3D工作空间表示,从而有效实现双手协调操作。
English Summary: The Bi-JCR framework eliminates the need for calibration markers by using 3D foundation models to simultaneously calibrate cameras on multiple robot arms and construct a unified 3D workspace representation from RGB images, enabling effective bimanual manipulation.

Authors:Jiangpeng He, Zhihao Duan, Fengqing Zhu
Title: CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning
Abstract:
Class-Incremental Learning (CIL) aims to learn new classes sequentially while retaining the knowledge of previously learned classes. Recently, pre-trained models (PTMs) combined with parameter-efficient fine-tuning (PEFT) have shown remarkable performance in rehearsal-free CIL without requiring exemplars from previous tasks. However, existing adapter-based methods, which incorporate lightweight learnable modules into PTMs for CIL, create new adapters for each new task, leading to both parameter redundancy and failure to leverage shared knowledge across tasks. In this work, we propose ContinuaL Low-Rank Adaptation (CL-LoRA), which introduces a novel dual-adapter architecture combining task-shared adapters to learn cross-task knowledge and task-specific adapters to capture unique features of each new task. Specifically, the shared adapters utilize random orthogonal matrices and leverage knowledge distillation with gradient reassignment to preserve essential shared knowledge. In addition, we introduce learnable block-wise weights for task-specific adapters, which mitigate inter-task interference while maintaining the model's plasticity. We demonstrate CL-LoRA consistently achieves promising performance under multiple benchmarks with reduced training and inference computation, establishing a more efficient and scalable paradigm for continual learning with pre-trained models.
中文: CL-LoRA提出了一种双适配器架构,结合任务共享和任务专用组件,在持续学习中高效学习新类别并保留已有知识,以更少计算量实现了优异性能。
English: CL-LoRA introduces a dual-adapter architecture with task-shared and task-specific components to efficiently learn new classes while preserving previous knowledge in continual learning, achieving strong performance with reduced computation.

Authors:Rui Li, Junfeng Kang, Qi Liu, Liyang He, Zheng Zhang, Yunhao Sha, Linbo Zhu, Zhenya Huang
Title: MGS3: A Multi-Granularity Self-Supervised Code Search Framework
Abstract:
In the pursuit of enhancing software reusability and developer productivity, code search has emerged as a key area, aimed at retrieving code snippets relevant to functionalities based on natural language queries. Despite significant progress in self-supervised code pre-training utilizing the vast amount of code data in repositories, existing methods have primarily focused on leveraging contrastive learning to align natural language with function-level code snippets. These studies have overlooked the abundance of fine-grained (such as block-level and statement-level) code snippets prevalent within the function-level code snippets, which results in suboptimal performance across all levels of granularity. To address this problem, we first construct a multi-granularity code search dataset called MGCodeSearchNet, which contains 536K+ pairs of natural language and code snippets. Subsequently, we introduce a novel Multi-Granularity Self-Supervised contrastive learning code Search framework (MGS³). First, MGS³ features a Hierarchical Multi-Granularity Representation module (HMGR), which leverages syntactic structural relationships for hierarchical representation and aggregates fine-grained information into coarser-grained representations. Then, during the contrastive learning phase, we endeavor to construct positive samples of the same granularity for fine-grained code, and introduce in-function negative samples for fine-grained code. Finally, we conduct extensive experiments on code search benchmarks across various granularities, demonstrating that the framework exhibits outstanding performance in code search tasks of multiple granularities. These experiments also showcase its model-agnostic nature and compatibility with existing pre-trained code representation models.
中文: 本研究提出了一种多粒度自监督框架(MGS³),通过分层表示和跨粒度对比学习解决了现有代码搜索方法的局限性,在多种代码搜索基准测试中展现出卓越性能。
English: This study introduces a multi-granularity self-supervised framework (MGS³) that addresses limitations in existing code search methods by incorporating hierarchical representation and contrastive learning across different code granularities, demonstrating superior performance across various code search benchmarks.

Authors:Chenyou Fan, Fangzheng Yan, Chenjia Bai, Jiepeng Wang, Chi Zhang, Zhen Wang, Xuelong Li
Title: Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction
Abstract:
Learning a generalizable bimanual manipulation policy is extremely challenging for embodied agents due to the large action space and the need for coordinated arm movements. Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies. However, transferring knowledge from single-arm datasets or pre-trained VLA models often fails to generalize effectively, primarily due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. In this paper, we propose a novel bimanual foundation policy by fine-tuning the leading text-to-video models to predict robot trajectories and training a lightweight diffusion policy for action generation. Given the lack of embodied knowledge in text-to-video models, we introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models derived from a pre-trained text-to-video model. Specifically, optical flow serves as an intermediate variable, providing a concise representation of subtle movements between images. The text-to-flow model predicts optical flow to concretize the intent of language instructions, and the flow-to-video model leverages this flow for fine-grained video prediction. Our method mitigates the ambiguity of language in single-stage text-to-video prediction and significantly reduces the robot-data requirement by avoiding direct use of low-level actions. In experiments, we collect high-quality manipulation data on a real dual-arm robot, and the results of simulation and real-world experiments demonstrate the effectiveness of our method.
中文: 本文提出了一种新颖的双臂基础策略,通过微调文本到视频模型来预测机器人轨迹并训练轻量级扩散策略生成动作,利用光流作为中间变量有效解决了双臂操作中的泛化难题,显著降低了语言歧义和数据需求。
English: This paper introduces a novel bimanual foundation policy that fine-tunes text-to-video models to predict robot trajectories and trains a lightweight diffusion policy for action generation, effectively addressing the generalization challenges in bimanual manipulation by using optical flow as an intermediate variable to reduce language ambiguity and data requirements.

Authors:Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, Dahua Lin, Bo Dai
Title: AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views
Abstract:
We introduce AnySplat, a feed forward network for novel view synthesis from uncalibrated image collections. In contrast to traditional neural rendering pipelines that demand known camera poses and per scene optimization, or recent feed forward methods that buckle under the computational weight of dense views, our model predicts everything in one shot. A single forward pass yields a set of 3D Gaussian primitives encoding both scene geometry and appearance, and the corresponding camera intrinsics and extrinsics for each input image. This unified design scales effortlessly to casually captured, multi view datasets without any pose annotations. In extensive zero shot evaluations, AnySplat matches the quality of pose aware baselines in both sparse and dense view scenarios while surpassing existing pose free approaches. Moreover, it greatly reduces rendering latency compared to optimization based neural fields, bringing real time novel view synthesis within reach for unconstrained capture settings. Project page: https://city-super.github.io/anysplat/
中文: AnySplat是一种前馈网络模型,能够通过单次前向传播从无标定图像中预测3D高斯基元与相机参数,实现实时新视角合成,在无需相机位姿标注的情况下达到了与已知位姿方法相当的质量水平。
English: AnySplat is a feed-forward model that enables real-time novel view synthesis from uncalibrated images by predicting 3D Gaussian primitives and camera parameters in a single pass, achieving quality comparable to pose-aware methods without requiring camera pose annotations.

Authors:Fengxiang Wang, Mingshuo Chen, Xuming He, YiFan Zhang, Feng Liu, Zijie Guo, Zhenghao Hu, Jiong Wang, Jingyi Xu, Zhangrui Li, Fenghua Ling, Ben Fei, Weijia Li, Long Lan, Wenjing Yang, Wenlong Zhang, Lei Bai
Title: OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data
Abstract:
Existing benchmarks for Earth science multimodal learning exhibit critical limitations in systematic coverage of geosystem components and cross-sphere interactions, often constrained to isolated subsystems (only in Human-activities sphere or atmosphere) with limited evaluation dimensions (less than 16 tasks). To address these gaps, we introduce OmniEarth-Bench, the first comprehensive multimodal benchmark spanning all six Earth science spheres (atmosphere, lithosphere, Oceansphere, cryosphere, biosphere and Human-activities sphere) and cross-spheres with one hundred expert-curated evaluation dimensions. Leveraging observational data from satellite sensors and in-situ measurements, OmniEarth-Bench integrates 29,779 annotations across four tiers: perception, general reasoning, scientific knowledge reasoning and chain-of-thought (CoT) reasoning. This involves the efforts of 2-5 experts per sphere to establish authoritative evaluation dimensions and curate relevant observational datasets, 40 crowd-sourcing annotators to assist experts for annotations, and finally, OmniEarth-Bench is validated via hybrid expert-crowd workflows to reduce label ambiguity. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35% accuracy. Especially, in some cross-spheres tasks, the performance of leading models like GPT-4o drops to 0.0%. OmniEarth-Bench sets a new standard for geosystem-aware AI, advancing both scientific discovery and practical applications in environmental monitoring and disaster prediction. The dataset, source code, and trained models were released.
中文:OmniEarth-Bench作为首个全面的地球科学多模态基准,覆盖所有六个圈层及跨圈层交互,包含100个评估维度,实验显示即使最先进的模型也表现不佳,部分复杂任务准确率近乎为零。
English: OmniEarth-Bench is introduced as the first comprehensive multimodal benchmark for Earth science, spanning all six spheres and cross-sphere interactions with 100 evaluation dimensions, revealing that even advanced models struggle significantly, with some achieving near-zero accuracy in complex tasks.

Authors:Runmin Jiang, Genpei Zhang, Yuntian Yang, Siqi Wu, Yuheng Zhang, Wanyue Feng, Yizhou Zhao, Xi Xiao, Xiao Wang, Tianyang Wang, Xingjian Li, Min Xu
Title: CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis
Abstract:
Cryo-electron microscopy (cryo-EM) offers near-atomic resolution imaging of macromolecules, but developing robust models for downstream analysis is hindered by the scarcity of high-quality annotated data. While synthetic data generation has emerged as a potential solution, existing methods often fail to capture both the structural diversity of biological specimens and the complex, spatially varying noise inherent in cryo-EM imaging. To overcome these limitations, we propose CryoCCD, a synthesis framework that integrates biophysical modeling with generative techniques. Specifically, CryoCCD produces multi-scale cryo-EM micrographs that reflect realistic biophysical variability through compositional heterogeneity, cellular context, and physics-informed imaging. To generate realistic noise, we employ a conditional diffusion model, enhanced by cycle consistency to preserve structural fidelity and mask-aware contrastive learning to capture spatially adaptive noise patterns. Extensive experiments show that CryoCCD generates structurally accurate micrographs and enhances performance in downstream tasks, outperforming state-of-the-art baselines in both particle picking and reconstruction.
中文摘要:CryoCCD是一种创新的冷冻电镜合成框架,通过结合先进生物物理建模与条件扩散模型,能生成结构精确且噪声逼真的显微图像,显著提升粒子分析性能并超越现有最优方法。
English Summary: CryoCCD is a novel cryo-EM synthesis framework that combines advanced biophysical modeling with a conditional diffusion model to generate structurally accurate micrographs with realistic noise, significantly improving particle analysis and outperforming existing methods.

Authors:Runmin Jiang, Genpei Zhang, Yuntian Yang, Siqi Wu, Minhao Wu, Wanyue Feng, Yizhou Zhao, Xi Xiao, Xiao Wang, Tianyang Wang, Xingjian Li, Muyuan Chen, Min Xu
Title: CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis
Abstract:
Single-particle cryo-electron microscopy (cryo-EM) has become a cornerstone of structural biology, enabling near-atomic resolution analysis of macromolecules through advanced computational methods. However, the development of cryo-EM processing tools is constrained by the scarcity of high-quality annotated datasets. Synthetic data generation offers a promising alternative, but existing approaches lack thorough biophysical modeling of heterogeneity and fail to reproduce the complex noise observed in real imaging. To address these limitations, we present CryoCCD, a synthesis framework that unifies versatile biophysical modeling with the first conditional cycle-consistent diffusion model tailored for cryo-EM. The biophysical engine provides multi-functional generation capabilities to capture authentic biological organization, and the diffusion model is enhanced with cycle consistency and mask-guided contrastive learning to ensure realistic noise while preserving structural fidelity. Extensive experiments demonstrate that CryoCCD generates structurally faithful micrographs, enhances particle picking and pose estimation, and achieves superior performance over state-of-the-art baselines, while also generalizing effectively to held-out protein families.
中文摘要:CryoCCD是一种创新的冷冻电镜合成框架,通过结合先进生物物理建模与条件扩散模型,能生成结构精确且噪声逼真的显微图像,显著提升粒子分析性能并超越现有最优方法。
English Summary: CryoCCD is a novel cryo-EM synthesis framework that combines advanced biophysical modeling with a conditional diffusion model to generate structurally accurate micrographs with realistic noise, significantly improving particle analysis and outperforming existing methods.

Authors:Jian Yao, Ran Cheng, Xingyu Wu, Jibin Wu, Kay Chen Tan
Title: Diversity-Aware Policy Optimization for Large Language Model Reasoning
Abstract:
The reasoning capabilities of large language models (LLMs) have advanced rapidly, particularly following the release of DeepSeek R1, which has inspired a surge of research into data quality and reinforcement learning (RL) algorithms. Despite the pivotal role diversity plays in RL, its influence on LLM reasoning remains largely underexplored. To bridge this gap, this work presents a systematic investigation into the impact of diversity in RL-based training for LLM reasoning, and proposes a novel diversity-aware policy optimization method. Across evaluations on 12 LLMs, we observe a strong positive correlation between the solution diversity and Potential at k (a novel metric quantifying an LLM's reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Specifically, we design a token-level diversity measure, reformulate it into a practical objective, and selectively apply it to positive samples. Integrated into the R1-zero training framework, our method achieves a 3.5 percent average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions.
中文: 本研究提出了一种多样性感知策略优化方法,通过在强化学习训练中提升解决方案多样性来增强大语言模型的推理能力,在四项数学推理基准测试中平均提升3.5%,同时生成更多样化和稳健的解决方案。
English: The study introduces a diversity-aware policy optimization method that enhances LLM reasoning by promoting solution diversity during reinforcement learning training, achieving a 3.5% average improvement across mathematical benchmarks while generating more robust solutions.

Authors:Srishti Gupta, Daniele Angioni, Maura Pintor, Ambra Demontis, Lea Schönherr, Battista Biggio, Fabio Roli
Title: Buffer-free Class-Incremental Learning with Out-of-Distribution Detection
Abstract:
Class-incremental learning (CIL) poses significant challenges in open-world scenarios, where models must not only learn new classes over time without forgetting previous ones but also handle inputs from unknown classes that a closed-set model would misclassify. Recent works address both issues by (i) training multi-head models using the task-incremental learning framework, and (ii) predicting the task identity employing out-of-distribution (OOD) detectors. While effective, the latter mainly relies on joint training with a memory buffer of past data, raising concerns around privacy, scalability, and increased training time. In this paper, we present an in-depth analysis of post-hoc OOD detection methods and investigate their potential to eliminate the need for a memory buffer. We uncover that these methods, when applied appropriately at inference time, can serve as a strong substitute for buffer-based OOD detection. We show that this buffer-free approach achieves comparable or superior performance to buffer-based methods both in terms of class-incremental learning and the rejection of unknown samples. Experimental results on CIFAR-10, CIFAR-100 and Tiny ImageNet datasets support our findings, offering new insights into the design of efficient and privacy-preserving CIL systems for open-world settings.
中文: 本研究表明,在类别增量学习中,事后分布外检测方法可有效替代基于记忆缓冲区的方案,在处理新类别和拒绝未知样本方面达到相当或更优性能,同时保护隐私并提高效率。
English: This study demonstrates that post-hoc out-of-distribution detection methods can effectively replace memory buffer-based approaches in class-incremental learning, achieving comparable or better performance in handling new classes and rejecting unknown samples while preserving privacy and efficiency.
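
The task-identity step described in this abstract can be illustrated with a minimal sketch. It assumes maximum softmax probability as the post-hoc OOD score, which is one common choice and not necessarily one the paper analyses; the function names, the threshold value, and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_rejection(head_logits, threshold=0.5):
    """Pick the task head with the highest post-hoc confidence score.

    head_logits: non-empty list with one list of logits per task head.
    Returns (task_id, class_id), or None when every head's score falls
    below the threshold (the input is treated as an unknown class).
    The score used here is the maximum softmax probability (assumption).
    """
    best = None
    for task_id, logits in enumerate(head_logits):
        probs = softmax(logits)
        score = max(probs)
        if best is None or score > best[0]:
            best = (score, task_id, probs.index(score))
    if best[0] < threshold:
        return None
    return best[1], best[2]
```

No memory buffer is involved: the scores are computed at inference time from the already-trained heads, which is the property the abstract emphasises.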

Authors:Liu Liu, Xiaofeng Wang, Guosheng Zhao, Keyu Li, Wenkang Qin, Jiaxiong Qiu, Zheng Zhu, Guan Huang, Zhizhong Su
Title: RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer
Abstract:
Imitation Learning has become a fundamental approach in robotic manipulation. However, collecting large-scale real-world robot demonstrations is prohibitively expensive. Simulators offer a cost-effective alternative, but the sim-to-real gap makes it extremely challenging to scale. Therefore, we introduce RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike previous methods, RoboTransfer integrates multi-view geometry with explicit control over scene components, such as background and object attributes. By incorporating cross-view feature interactions and global depth/normal conditions, RoboTransfer ensures geometry consistency across views. This framework allows fine-grained control, including background edits and object swaps. Experiments demonstrate that RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity. In addition, policies trained on the data generated by RoboTransfer achieve a 33.3% relative improvement in the success rate in the DIFF-OBJ setting and a substantial 251% relative improvement in the more challenging DIFF-ALL scenario. Explore more demos on our project page: https://horizonrobotics.github.io/robot_lab/robotransfer
中文摘要:RoboTransfer是一种基于扩散的视频生成框架,通过多视角几何集成和场景组件控制,提升了机器人数据合成的几何一致性与视觉保真度,并在策略训练中实现了显著性能提升。
English Summary: RoboTransfer is a diffusion-based video generation framework that enhances geometric consistency and visual fidelity in robotic data synthesis, achieving significant performance improvements in policy training.
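
The abstract reports relative rather than absolute gains, so a short worked sketch may help in reading the 33.3% and 251% figures. The abstract does not give baseline success rates; the baselines used in the usage note below are purely hypothetical.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


def improved_from_relative(baseline, relative_gain):
    """Success rate implied by a baseline and a relative gain (fraction)."""
    return baseline * (1.0 + relative_gain)
```

For instance, with a hypothetical baseline success rate of 0.10, a 251% relative improvement corresponds to `improved_from_relative(0.10, 2.51)`, roughly 0.351, while moving from a hypothetical 0.30 to 0.40 is a relative improvement of about 33.3%.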

Authors:Jiarui Zhang, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, Guihai Chen
Title: Query Routing for Retrieval-Augmented Language Models
Abstract:
Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs' ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings show that RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints.
中文: 检索增强生成(RAG)提升了大型语言模型在知识密集型任务中的表现,但需智能路由机制选择最佳模型;RAGRouter通过结合文档影响和对比学习,平均性能提升达3.61%,显著优于现有方法。
English: Retrieval-Augmented Generation (RAG) enhances LLMs for knowledge tasks but requires intelligent routing to select the optimal model, leading to the development of RAGRouter, which integrates document influence and outperforms existing methods by up to 9.33%.

Authors:Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu
Title: Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates
Abstract:
This paper presents a novel approach to compressing speech foundation models that tightly integrates model pruning and parameter update into a single stage. Highly compact layer-level tied self-pinching gates, each containing only a single learnable threshold, are jointly trained with uncompressed models and used in fine-grained neuron level pruning. Experiments conducted on the LibriSpeech-100hr corpus suggest that our approach reduces the number of parameters of wav2vec2.0-base and HuBERT-large models by 65% and 60% respectively, while incurring no statistically significant word error rate (WER) increase on the test-clean dataset. Compared to previously published methods on the same task, our approach not only achieves the lowest WER of 7.05% on the test-clean dataset under a comparable model compression ratio of 4.26x, but also operates with at least 25% less model compression time.
中文: 本研究提出一种单阶段压缩语音基础模型的方法,通过紧凑的自压缩门将剪枝与参数更新结合,在显著减少参数的同时保持性能无损,并在准确性和效率上超越现有方法。
English: This study introduces a single-stage method for compressing speech foundation models by integrating pruning and parameter updates using compact self-pinching gates, achieving significant parameter reduction without compromising performance and outperforming prior methods in both accuracy and efficiency.
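
The abstract quotes both parameter reductions (65%, 60%) and a compression ratio (4.26x). A small sketch of how the two quantities relate, assuming the ratio is defined as original parameters divided by remaining parameters (the abstract does not state its definition):

```python
def compression_ratio(reduction_fraction):
    """Ratio original/remaining parameters for a given fraction removed."""
    return 1.0 / (1.0 - reduction_fraction)


def reduction_fraction(ratio):
    """Fraction of parameters removed for a given compression ratio."""
    return 1.0 - 1.0 / ratio
```

Under that assumption, a 65% reduction corresponds to a ratio of about 2.86x, a 60% reduction to 2.5x, and a 4.26x ratio to removing about 76.5% of the parameters.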

Authors:Xinyi Chen, Chenxiang Ma, Yujie Wu, Kay Chen Tan, Jibin Wu
Title: Neuromorphic Sequential Arena: A Benchmark for Neuromorphic Temporal Processing
Abstract:
Temporal processing is vital for extracting meaningful information from time-varying signals. Recent advancements in Spiking Neural Networks (SNNs) have shown immense promise in efficiently processing these signals. However, progress in this field has been impeded by the lack of effective and standardized benchmarks, which complicates the consistent measurement of technological advancements and limits the practical applicability of SNNs. To bridge this gap, we introduce the Neuromorphic Sequential Arena (NSA), a comprehensive benchmark that offers an effective, versatile, and application-oriented evaluation framework for neuromorphic temporal processing. The NSA includes seven real-world temporal processing tasks from a diverse range of application scenarios, each capturing rich temporal dynamics across multiple timescales. Utilizing NSA, we conduct extensive comparisons of recently introduced spiking neuron models and neural architectures, presenting comprehensive baselines in terms of task performance, training speed, memory usage, and energy efficiency. Our findings emphasize an urgent need for efficient SNN designs that can consistently deliver high performance across tasks with varying temporal complexities while maintaining low computational costs. NSA enables systematic tracking of advancements in neuromorphic algorithm research and paves the way for developing effective and efficient neuromorphic temporal processing systems.
中文摘要:神经形态序列竞技场(NSA)基准通过提供多样化的现实世界任务,系统性评估脉冲神经网络的性能、效率和能耗,解决了该领域缺乏标准化评估的问题,揭示了开发更稳定且低功耗SNN设计的迫切需求。
English Summary: The Neuromorphic Sequential Arena (NSA) benchmark addresses the lack of standardized evaluation for Spiking Neural Networks by providing diverse real-world tasks to systematically assess performance, efficiency, and energy consumption, revealing the need for more consistent and cost-effective SNN designs.

Authors:Jing-An Sun, Hang Fan, Junchao Gong, Ben Fei, Kun Chen, Fenghua Ling, Wenlong Zhang, Wanghan Xu, Li Yan, Pierre Gentine, Lei Bai
Title: Align-DA: Align Score-based Atmospheric Data Assimilation with Multiple Preferences
Abstract:
Data assimilation (DA) aims to estimate the full state of a dynamical system by combining partial and noisy observations with a prior model forecast, commonly referred to as the background. In atmospheric applications, this problem is fundamentally ill-posed due to the sparsity of observations relative to the high-dimensional state space. Traditional methods address this challenge by simplifying background priors to regularize the solution, which are empirical and require continual tuning for application. Inspired by alignment techniques in text-to-image diffusion models, we propose Align-DA, which formulates DA as a generative process and uses reward signals to guide background priors, replacing manual tuning with data-driven alignment. Specifically, we train a score-based model in the latent space to approximate the background-conditioned prior, and align it using three complementary reward signals for DA: (1) assimilation accuracy, (2) forecast skill initialized from the assimilated state, and (3) physical adherence of the analysis fields. Experiments with multiple reward signals demonstrate consistent improvements in analysis quality across different evaluation metrics and observation-guidance strategies. These results show that preference alignment, implemented as a soft constraint, can automatically adapt complex background priors tailored to DA, offering a promising new direction for advancing the field.
中文: Align-DA提出了一种生成式数据同化方法,通过奖励信号实现数据驱动的对齐以替代人工调参,在不同评估指标下持续提升分析质量,为该领域提供了新方向。
English: Align-DA introduces a generative data assimilation approach that replaces manual tuning with data-driven alignment using reward signals, consistently improving analysis quality across metrics and offering a novel direction for the field.

Authors:Weiguang Zhang, Huangcheng Lu, Maizhen Ning, Xiaowei Huang, Wei Wang, Kaizhu Huang, Qiufeng Wang
Title: DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model
Abstract:
Document dewarping aims to rectify deformations in photographic document images and thereby improve text readability. The task has attracted much attention and seen great progress, but preserving document structures remains challenging. Given recent advances in diffusion models, it is natural for us to consider their potential applicability to document dewarping. However, it is far from straightforward to adopt diffusion models in document dewarping due to their unfaithful control over highly complex document images (e.g., 2000×3000 resolution). In this paper, we propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework. To be specific, DvD introduces coordinate-level denoising instead of typical pixel-level denoising, generating a mapping for deformation rectification. In addition, we further propose a time-variant condition refinement mechanism to enhance the preservation of document structures. In experiments, we find that current document dewarping benchmarks cannot evaluate dewarping models comprehensively. To this end, we present AnyPhotoDoc6300, a rigorously designed large-scale document dewarping benchmark comprising 6,300 real image pairs across three distinct domains, enabling fine-grained evaluation of dewarping models. Comprehensive experiments demonstrate that our proposed DvD can achieve state-of-the-art performance with acceptable computational efficiency on multiple metrics across various benchmarks including DocUNet, DIR300, and AnyPhotoDoc6300. The new benchmark and code will be publicly available.
中文: 本文提出DvD,首个基于扩散模型的文档去扭曲生成方法,通过坐标级去噪和时间变体条件优化机制,在矫正文档形变的同时有效保持结构完整性,在包括新构建的AnyPhotoDoc6300数据集在内的多个基准测试中均达到最优性能。
English: This paper introduces DvD, the first diffusion-based generative model for document dewarping that uses coordinate-level denoising and a time-variant condition refinement mechanism to effectively rectify document deformations while preserving structural integrity, achieving state-of-the-art performance across multiple benchmarks including the newly proposed AnyPhotoDoc6300 dataset.
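
The abstract says the model generates a coordinate mapping rather than pixels. How such a mapping might be applied to rectify an image can be sketched as follows; the backward-map convention (each output pixel stores the source location it is read from), the nearest-neighbour sampling, and all names are assumptions for illustration, not the paper's implementation.

```python
def apply_backward_map(warped, mapping):
    """Rectify an image with a per-pixel backward coordinate map.

    warped:  2-D list of pixel values (the distorted photo).
    mapping: 2-D list of (src_row, src_col) pairs, one per output pixel,
             giving where in `warped` each rectified pixel is read from.
    Uses nearest-neighbour sampling; coordinates are rounded and clamped
    to the image bounds.
    """
    height, width = len(warped), len(warped[0])
    output = []
    for row in mapping:
        out_row = []
        for src_r, src_c in row:
            r = min(max(int(round(src_r)), 0), height - 1)
            c = min(max(int(round(src_c)), 0), width - 1)
            out_row.append(warped[r][c])
        output.append(out_row)
    return output


# Toy example: a horizontal flip expressed as a coordinate map.
warped = [[1, 2], [3, 4]]
flip = [[(0, 1), (0, 0)], [(1, 1), (1, 0)]]
rectified = apply_backward_map(warped, flip)
```

Because the output is assembled from pixels of the input photo, a predicted map of this kind cannot invent content, which is one plausible reading of why coordinate-level prediction suits high-resolution documents.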

Authors:Hongyao Chen, Tianyang Xu, Xiaojun Wu, Josef Kittler
Title: Hybrid Batch Normalisation: Resolving the Dilemma of Batch Normalisation in Federated Learning
Abstract:
Batch Normalisation (BN) is widely used in conventional deep neural network training to harmonise the input-output distributions for each batch of data. However, federated learning, a distributed learning paradigm, faces the challenge of dealing with non-independent and identically distributed data among the client nodes. Due to the lack of a coherent methodology for updating BN statistical parameters, standard BN degrades the federated learning performance. To this end, it is urgent to explore an alternative normalisation solution for federated learning. In this work, we resolve the dilemma of the BN layer in federated learning by developing a customised normalisation approach, Hybrid Batch Normalisation (HBN). HBN separates the update of statistical parameters (i.e., means and variances used for evaluation) from that of learnable parameters (i.e., parameters that require gradient updates), obtaining unbiased estimates of global statistical parameters in distributed scenarios. In contrast with the existing solutions, we emphasise the supportive power of global statistics for federated learning. The HBN layer introduces a learnable hybrid distribution factor, allowing each computing node to adaptively mix the statistical parameters of the current batch with the global statistics. Our HBN can serve as a powerful plugin to advance federated learning performance. It reflects promising merits across a wide range of federated learning settings, especially for small batch sizes and heterogeneous data.
中文摘要:批归一化在联邦学习中因非独立同分布数据而表现不佳,因此开发了混合批归一化方法,通过分离统计参数与可学习参数的更新,并自适应地融合本地批次统计量与全局统计量来解决此问题。
English Summary: Batch Normalization is problematic in federated learning due to non-IID data, so Hybrid Batch Normalization was developed to separate statistical parameter updates from learnable parameters and adaptively blend local batch statistics with global statistics.
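
The mixing of batch and global statistics described in this abstract can be sketched as below. The convex-combination form, the parameter name `mix`, and treating it as a fixed float are assumptions; in the paper the hybrid distribution factor is learnable and its exact formulation is not given in the abstract.

```python
import math


def hybrid_batch_norm(batch, global_mean, global_var, mix, eps=1e-5):
    """Normalise a 1-D batch with a blend of batch and global statistics.

    mix is a hypothetical hybrid factor in [0, 1]: 1.0 uses only the
    current batch statistics, 0.0 uses only the global statistics.
    """
    n = len(batch)
    batch_mean = sum(batch) / n
    batch_var = sum((x - batch_mean) ** 2 for x in batch) / n
    # Blend the two sets of statistics before normalising.
    mean = mix * batch_mean + (1.0 - mix) * global_mean
    var = mix * batch_var + (1.0 - mix) * global_var
    return [(x - mean) / math.sqrt(var + eps) for x in batch]
```

With `mix=1.0` this reduces to ordinary batch normalisation over the batch (without the affine scale and shift), and with `mix=0.0` it normalises purely with the global statistics, which is the regime that helps when the local batch is small or unrepresentative.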

Authors:Ziheng Cheng, Yixiao Huang, Hui Xu, Somayeh Sojoudi, Xuandong Zhao, Dawn Song, Song Mei
Title: OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models
Abstract:
Text-to-Image (T2I) models have achieved remarkable success in generating visual content from text inputs. Although multiple safety alignment strategies have been proposed to prevent harmful outputs, they often lead to overly cautious behavior -- rejecting even benign prompts -- a phenomenon known as over-refusal that reduces the practical utility of T2I models. Despite over-refusal having been observed in practice, there is no large-scale benchmark that systematically evaluates this phenomenon for T2I models. In this paper, we present an automatic workflow to construct synthetic evaluation data, resulting in OVERT (OVEr-Refusal evaluation on Text-to-image models), the first large-scale benchmark for assessing over-refusal behaviors in T2I models. OVERT includes 4,600 seemingly harmful but benign prompts across nine safety-related categories, along with 1,785 genuinely harmful prompts (OVERT-unsafe) to evaluate the safety-utility trade-off. Using OVERT, we evaluate several leading T2I models and find that over-refusal is a widespread issue across various categories (Figure 1), underscoring the need for further research to enhance the safety alignment of T2I models without compromising their functionality. As a preliminary attempt to reduce over-refusal, we explore prompt rewriting; however, we find it often compromises faithfulness to the meaning of the original prompts. Finally, we demonstrate the flexibility of our generation framework in accommodating diverse safety requirements by generating customized evaluation data adapting to user-defined policies.
中文: 本文提出首个评估文本到图像模型过度拒绝现象的大规模基准OVERT,揭示了该问题在各类安全主题中的普遍性,并指出需在保持模型实用性的同时改进安全对齐方法。
English: This paper introduces OVERT, the first large-scale benchmark for evaluating over-refusal in text-to-image models, revealing its prevalence across safety categories and highlighting the need for improved safety alignment that maintains model utility.

Authors:Yansen Zhang, Bowei He, Xiaokun Zhang, Haolun Wu, Zexu Sun, Chen Ma
Title: Counterfactual Multi-player Bandits for Explainable Recommendation Diversification
Abstract:
Existing recommender systems tend to prioritize items closely aligned with users' historical interactions, inevitably trapping users in the dilemma of the "filter bubble". Recent efforts are dedicated to improving the diversity of recommendations. However, they mainly suffer from two major issues: 1) a lack of explainability, making it difficult for the system designers to understand how diverse recommendations are generated, and 2) limitations to specific metrics, with difficulty in enhancing non-differentiable diversity metrics. To this end, we propose a Counterfactual Multi-player Bandits (CMB) method to deliver explainable recommendation diversification across a wide range of diversity metrics. Leveraging a counterfactual framework, our method identifies the factors influencing diversity outcomes. Meanwhile, we adopt the multi-player bandits to optimize the counterfactual optimization objective, making it adaptable to both differentiable and non-differentiable diversity metrics. Extensive experiments conducted on three real-world datasets demonstrate the applicability, effectiveness, and explainability of the proposed CMB.
Chinese: 现有推荐系统因偏重用户历史相似内容而陷入“过滤泡沫”困境,现有多样性改进方法缺乏可解释性且受限于特定指标,因此提出反事实多玩家赌博机方法,实现可解释且适应多种指标的推荐多样化。
English: Current recommender systems often trap users in filter bubbles by prioritizing items similar to their history, but existing diversity improvement methods lack explainability and flexibility across metrics, leading to the proposal of a Counterfactual Multi-player Bandits (CMB) approach that delivers explainable and adaptable diversification.

Authors:Hang Zeng, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, Shaojie Tang, Guihai Chen
Title: Automated Privacy Information Annotation in Large Language Model Interactions
Abstract:
Users interacting with large language models (LLMs) under their real identifiers often unknowingly risk disclosing private information. Automatically notifying users whether their queries leak privacy and which phrases leak what private information has therefore become a practical need. Existing privacy detection methods, however, were designed for different objectives and application domains, typically tagging personally identifiable information (PII) in anonymous content, which is insufficient in real-name interaction scenarios with LLMs. In this work, to support the development and evaluation of privacy detection models for LLM interactions that are deployable on local user devices, we construct a large-scale multilingual dataset with 249K user queries and 154K annotated privacy phrases. In particular, we build an automated privacy annotation pipeline with strong LLMs to automatically extract privacy phrases from dialogue datasets and annotate leaked information. We also design evaluation metrics at the levels of privacy leakage, extracted privacy phrase, and privacy information. We further establish baseline methods using light-weight LLMs with both tuning-free and tuning-based methods, and report a comprehensive evaluation of their performance. Evaluation results reveal a gap between current performance and the requirements of real-world LLM applications, motivating future research into more effective local privacy detection methods grounded in our dataset.
中文: 现有隐私检测方法在实名交互场景中不足,为此构建了大规模多语言数据集和基线模型以推动本地隐私检测发展,但当前性能与实际应用需求仍存差距。
English: Current privacy detection methods are inadequate for real-name interactions with large language models, prompting the creation of a large-scale multilingual dataset and baseline models to advance local privacy detection, though a performance gap remains for practical applications.

Authors:Yi Zhan, Qi Liu, Weibo Gao, Zheng Zhang, Tianfu Wang, Shuanghong Shen, Junyu Lu, Zhenya Huang
Title: CoderAgent: Simulating Student Behavior for Personalized Programming Learning with Large Language Models
Abstract:
Personalized programming tutoring, such as exercise recommendation, can enhance learners' efficiency, motivation, and outcomes, which is increasingly important in modern digital education. However, the lack of sufficient and high-quality programming data, combined with the mismatch between offline evaluation and real-world learning, hinders the practical deployment of such systems. To address this challenge, many approaches attempt to simulate learner practice data, yet they often overlook the fine-grained, iterative nature of programming learning, resulting in a lack of interpretability and granularity. To fill this gap, we propose an LLM-based agent, CoderAgent, to simulate students' programming processes in a fine-grained manner without relying on real data. Specifically, we equip each human learner with an intelligent agent, the core of which lies in capturing the cognitive states of the human programming practice process. Inspired by ACT-R, a cognitive architecture framework, we design the structure of CoderAgent to align with human cognitive architecture by focusing on the mastery of programming knowledge and the application of coding ability. Recognizing the inherent patterns in multi-layered cognitive reasoning, we introduce the Programming Tree of Thought (PTOT), which breaks down the process into four steps: why, how, where, and what. This approach enables a detailed analysis of iterative problem-solving strategies. Finally, experimental evaluations on real-world datasets demonstrate that CoderAgent provides interpretable insights into learning trajectories and achieves accurate simulations, paving the way for personalized programming education.
中文: 个性化编程辅导系统面临数据不足和评估不匹配的挑战,而提出的基于大语言模型的CoderAgent模拟器借鉴ACT-R认知架构,通过编程思维树实现细粒度学习过程模拟,为提升教育效果提供可解释的见解和精准模拟。
English: Personalized programming tutoring systems face challenges due to limited data and evaluation mismatches, but the proposed CoderAgent, an LLM-based agent inspired by ACT-R, simulates fine-grained learning processes using the Programming Tree of Thought to provide interpretable insights and accurate simulations for enhanced education.

Authors:Bhawna Piryani, Abdelrahman Abdallah, Jamshid Mozafari, Avishek Anand, Adam Jatowt
Title: It's High Time: A Survey of Temporal Question Answering
Abstract:
Time plays a critical role in how information is generated, retrieved, and interpreted. In this survey, we provide a comprehensive overview of Temporal Question Answering (TQA), a research area that focuses on answering questions involving temporal constraints or context. As the amount of time-stamped content from sources like news articles, web archives, and knowledge bases increases, systems must address challenges such as detecting temporal intent, normalizing time expressions, ordering events, and reasoning over evolving or ambiguous facts. We focus on recent advances in TQA enabled by neural architectures, especially transformer-based models and Large Language Models (LLMs), highlighting progress in temporal language modeling, retrieval-augmented generation (RAG), and temporal reasoning. We also discuss benchmark datasets and evaluation strategies designed to test temporal robustness, recency awareness, and generalization.
中文: 本综述全面介绍了时序问答领域,重点分析了基于Transformer和大语言模型的神经架构在时序意图识别、推理及时间标注数据处理方面的最新进展与挑战。
English: This survey offers a comprehensive overview of Temporal Question Answering (TQA), focusing on recent advances in neural architectures like transformers and LLMs that address challenges in temporal intent detection, reasoning, and leveraging time-stamped data.

Authors:Gianmarco Genalti, Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Title: Data-Dependent Regret Bounds for Constrained MABs
Abstract:
This paper initiates the study of data-dependent regret bounds in constrained MAB settings. These bounds depend on the sequence of losses that characterize the problem instance. Thus, they can be much smaller than classical $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bounds, while being equivalent to them in the worst case. Despite this, data-dependent regret bounds have been completely overlooked in constrained MAB settings. The goal of this paper is to answer the following question: Can data-dependent regret bounds be derived in the presence of constraints? We answer this question affirmatively in constrained MABs with adversarial losses and stochastic constraints. Specifically, our main focus is on the most challenging and natural settings with hard constraints, where the learner must ensure that the constraints are always satisfied with high probability. We design an algorithm with a regret bound consisting of two data-dependent terms. The first term captures the difficulty of satisfying the constraints, while the second one encodes the complexity of learning independently of the presence of constraints. We also prove a lower bound showing that these two terms are not artifacts of our specific approach and analysis, but rather the fundamental components that inherently characterize the complexities of the problem. Finally, in designing our algorithm, we also derive some novel results in the related (and easier) soft constraints settings, which may be of independent interest.
中文: 本文首次在具有对抗性损失和随机约束的多臂老虎机问题中引入数据依赖的遗憾界,提出了一种算法,其遗憾界包含反映约束满足难度和学习复杂度的两个关键项,并证明这些是问题的本质组成部分。
English: This paper introduces data-dependent regret bounds for constrained multi-armed bandits with adversarial losses and stochastic constraints, developing an algorithm that achieves bounds reflecting both constraint satisfaction difficulty and learning complexity, while proving these components are fundamental to the problem.

Authors:Zheng Zhang, Shaocheng Lan, Lei Song, Jiang Bian, Yexin Li, Kan Ren
Title: Learning to Select In-Context Demonstration Preferred by Large Language Model
Abstract:
In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks during inference using only a few demonstrations. However, ICL performance is highly dependent on the selection of these demonstrations. Recent work explores retrieval-based methods for selecting query-specific demonstrations, but these approaches often rely on surrogate objectives such as metric learning, failing to directly optimize ICL performance. Consequently, they struggle to identify truly beneficial demonstrations. Moreover, their discriminative retrieval paradigm is ineffective when the candidate pool lacks sufficient high-quality demonstrations. To address these challenges, we propose GenICL, a novel generative preference learning framework that leverages LLM feedback to directly optimize demonstration selection for ICL. Experiments on 19 datasets across 11 task categories demonstrate that GenICL outperforms existing methods in selecting the most effective demonstrations, leading to better ICL performance.
中文: GenICL是一种新颖的生成式偏好学习框架,它利用大语言模型的反馈直接优化上下文学习中的示例选择,在多个数据集和任务上相比现有方法实现了更优的性能。
English: GenICL is a novel generative preference learning framework that uses LLM feedback to directly optimize demonstration selection for in-context learning, achieving superior performance across multiple datasets and tasks compared to existing methods.

Authors:Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang
Title: Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles
Abstract:
Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing nice generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at https://seed-enigmata.github.io.
中文:大型语言模型在人类可解的谜题上表现不佳,因此Enigmata推出了首个集成可扩展生成器与验证器的谜题推理套件,显著提升了模型在各类基准测试中的表现,并展现出优异的数学与STEM任务泛化能力。
English: Large Language Models struggle with human-solvable puzzles, so Enigmata introduces a comprehensive suite with scalable generators and verifiers to enhance puzzle reasoning, achieving superior performance on benchmarks and demonstrating strong generalization to math and STEM tasks.
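The generator-verifier design described in the abstract can be illustrated with a toy Python sketch. The subset-sum task, the function names `generate_puzzle` and `verify`, and the use of list length as the difficulty knob are all hypothetical choices for illustration; they are not taken from Enigmata.

```python
import random


def generate_puzzle(difficulty, seed):
    """Return (numbers, target); `difficulty` sets how many numbers are drawn."""
    rng = random.Random(seed)
    numbers = [rng.randint(1, 50) for _ in range(difficulty)]
    # Sum a non-empty hidden subset so the puzzle is guaranteed to be solvable.
    size = rng.randint(1, difficulty)
    hidden = rng.sample(range(difficulty), size)
    target = sum(numbers[i] for i in hidden)
    return numbers, target


def verify(numbers, target, answer):
    """Rule-based check: `answer` lists distinct valid indices summing to target."""
    if len(answer) == 0 or len(set(answer)) != len(answer):
        return False
    if any(not isinstance(i, int) or i < 0 or i >= len(numbers) for i in answer):
        return False
    return sum(numbers[i] for i in answer) == target
```

Because the verifier is a deterministic rule rather than a learned judge, its boolean result could in principle serve as a verifiable reward signal, and a fixed seed makes every generated instance reproducible.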

Authors:Kunjun Li, Zigeng Chen, Cheng-Yen Yang, Jenq-Neng Hwang
Title: Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression
Abstract:
Visual Autoregressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction approach, which yields substantial improvements in efficiency, scalability, and zero-shot generalization. Nevertheless, the coarse-to-fine methodology inherent in VAR results in exponential growth of the KV cache during inference, causing considerable memory consumption and computational redundancy. To address these bottlenecks, we introduce ScaleKV, a novel KV cache compression framework tailored for VAR architectures. ScaleKV leverages two critical observations: varying cache demands across transformer layers and distinct attention patterns at different scales. Based on these insights, ScaleKV categorizes transformer layers into two functional groups: drafters and refiners. Drafters exhibit dispersed attention across multiple scales, thereby requiring greater cache capacity. Conversely, refiners focus attention on the current token map to process local details, consequently necessitating substantially reduced cache capacity. ScaleKV optimizes the multi-scale inference pipeline by identifying scale-specific drafters and refiners, facilitating differentiated cache management tailored to each scale. Evaluation on the state-of-the-art text-to-image VAR model family, Infinity, demonstrates that our approach effectively reduces the required KV cache memory to 10% while preserving pixel-level fidelity.
中文摘要:ScaleKV框架通过将变换器层分类为起草器和精炼器,有效解决了视觉自回归模型中KV缓存的效率问题,在保持图像质量的同时实现了90%的内存减少。
English Summary: The proposed ScaleKV framework addresses the KV cache inefficiency in Visual Autoregressive models by classifying transformer layers into drafters and refiners, achieving 90% memory reduction while maintaining image quality.

Authors:Takumi Goto, Yusuke Sakai, Taro Watanabe
Title: gec-metrics: A Unified Library for Grammatical Error Correction Evaluation
Abstract:
We introduce gec-metrics, a library for using and developing grammatical error correction (GEC) evaluation metrics through a unified interface. Our library enables fair system comparisons by ensuring that everyone conducts evaluations using a consistent implementation. Moreover, it is designed with a strong focus on API usage, making it highly extensible. It also includes meta-evaluation functionalities and provides analysis and visualization scripts, contributing to developing GEC evaluation metrics. Our code is released under the MIT license and is also distributed as an installable package. The video is available on YouTube.
中文: gec-metrics 库通过统一接口开发和运用语法纠错评估指标,确保公平的系统比较和 API 可扩展性,并提供元评估与分析工具,遵循 MIT 许可发布。
English: The gec-metrics library offers a unified interface for developing and applying grammatical error correction evaluation metrics, ensuring fair system comparisons and extensibility through its API design, along with meta-evaluation and analysis tools, released under the MIT license.

Authors:Chenrui Ma, Xi Xiao, Tianyang Wang, Yanning Shen
Title: Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions
Abstract:
Current text-driven image editing methods typically follow one of two directions: relying on large-scale, high-quality editing pair datasets to improve editing precision and diversity, or exploring alternative dataset-free techniques. However, constructing large-scale editing datasets requires carefully designed pipelines, is time-consuming, and often results in unrealistic samples or unwanted artifacts. Meanwhile, dataset-free methods may suffer from limited instruction comprehension and restricted editing capabilities. Faced with these challenges, the present work develops a novel paradigm for instruction-driven image editing that leverages widely available and enormous text-image pairs, instead of relying on editing pair datasets. Our approach introduces a multi-scale learnable region to localize and guide the editing process. By treating the alignment between images and their textual descriptions as supervision and learning to generate task-specific editing regions, our method achieves high-fidelity, precise, and instruction-consistent image editing. Extensive experiments demonstrate that the proposed approach attains state-of-the-art performance across various tasks and benchmarks, while exhibiting strong adaptability to various types of generative models.
Chinese Summary: 本研究提出了一种新颖的图像编辑范式,利用广泛可用的文本-图像对和可学习的多尺度区域来实现精确编辑,无需依赖专门的编辑数据集,在多个基准测试中展现出领先性能。
English Summary: This study introduces a novel image editing method that utilizes widely available text-image pairs and a multi-scale learnable region to achieve precise, high-fidelity edits without relying on specialized editing datasets, demonstrating state-of-the-art performance across multiple benchmarks.

Authors:Weiming Zhi, Ziyong Ma, Tianyi Zhang, Matthew Johnson-Roberson
Title: From Single Images to Motion Policies via Video-Generation Environment Representations
Abstract:
Autonomous robots typically need to construct representations of their surroundings and adapt their motions to the geometry of their environment. Here, we tackle the problem of constructing a policy model for collision-free motion generation, consistent with the environment, from a single input RGB image. Extracting 3D structures from a single image often involves monocular depth estimation. Developments in depth estimation have given rise to large pre-trained models such as DepthAnything. However, using outputs of these models for downstream motion generation is challenging due to frustum-shaped errors that arise. Instead, we propose a framework known as Video-Generation Environment Representation (VGER), which leverages the advances of large-scale video generation models to generate a moving camera video conditioned on the input image. Frames of this video, which form a multiview dataset, are then input into a pre-trained 3D foundation model to produce a dense point cloud. We then introduce a multi-scale noise approach to train an implicit representation of the environment structure and build a motion generation model that complies with the geometry of the representation. We extensively evaluate VGER over a diverse set of indoor and outdoor environments. We demonstrate its ability to produce smooth motions that account for the captured geometry of a scene, all from a single RGB input image.
中文摘要:本研究提出VGER框架,通过视频生成模型从单张RGB图像构建环境表征,使自主机器人能够生成符合场景几何结构的无碰撞运动轨迹。
English Summary: This study introduces VGER, a framework that uses video generation models to create environment representations from a single RGB image, enabling autonomous robots to produce collision-free motions consistent with scene geometry.

Authors:Shiyue Wang, Haozheng Xu, Yuhan Zhang, Jingran Lin, Changhong Lu, Xiangfeng Wang, Wenhao Li
Title: Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding
Abstract:
Multi-Agent Path Finding (MAPF) is a fundamental problem in artificial intelligence and robotics, requiring the computation of collision-free paths for multiple agents navigating from their start locations to designated goals. As autonomous systems become increasingly prevalent in warehouses, urban transportation, and other complex environments, MAPF has evolved from a theoretical challenge to a critical enabler of real-world multi-robot coordination. This comprehensive survey bridges the long-standing divide between classical algorithmic approaches and emerging learning-based methods in MAPF research. We present a unified framework that encompasses search-based methods (including Conflict-Based Search, Priority-Based Search, and Large Neighborhood Search), compilation-based approaches (SAT, SMT, CSP, ASP, and MIP formulations), and data-driven techniques (reinforcement learning, supervised learning, and hybrid strategies). Through systematic analysis of experimental practices across 200+ papers, we uncover significant disparities in evaluation methodologies, with classical methods typically tested on larger-scale instances (up to 200 by 200 grids with 1000+ agents) compared to learning-based approaches (predominantly 10-100 agents). We provide a comprehensive taxonomy of evaluation metrics, environment types, and baseline selections, highlighting the need for standardized benchmarking protocols. Finally, we outline promising future directions including mixed-motive MAPF with game-theoretic considerations, language-grounded planning with large language models, and neural solver architectures that combine the rigor of classical methods with the flexibility of deep learning. This survey serves as both a comprehensive reference for researchers and a practical guide for deploying MAPF solutions in increasingly complex real-world applications.
中文: 这篇多智能体路径规划(MAPF)综述弥合了经典算法与新兴学习方法之间的鸿沟,通过系统分析评估实践揭示了标准化基准的必要性,并展望了博弈论规划与神经求解器等未来方向。
English: This comprehensive survey on Multi-Agent Path Finding (MAPF) unifies classical algorithmic approaches with emerging learning-based methods, providing a systematic analysis of evaluation practices while highlighting the need for standardized benchmarking and outlining future directions like game-theoretic planning and neural solver architectures.

Authors:Zhenyu Li, Özlem Tuğfe Demir, Emil Björnson, Cicek Cavdar
Title: RIS-Assisted Survivable Fronthaul Design in Cell-Free Massive MIMO System
Abstract:
This paper investigates the application of reconfigurable intelligent surfaces (RISs) to improve fronthaul link survivability in cell-free massive MIMO (CF mMIMO) systems. To enhance the fronthaul survivability, two complementary mechanisms are considered. First, the RIS is set to provide reliable line-of-sight (LOS) connectivity and enhance the mmWave backup link. Second, a resource-sharing scheme that leverages redundant cable capacity through neighboring master access points (APs) to guarantee availability is considered. We formulate the redundant capacity minimization problem as a RIS-assisted multi-user MIMO rate control optimization problem, developing a novel solution that combines a modified weighted minimum mean square error (WMMSE) algorithm for precoding design with Riemannian gradient descent for RIS phase shift optimization. Our numerical evaluations show that RIS reduces the redundant capacity required to reach 99% survivability by 65.6% compared with the no-RIS case. The results show that the most substantial gains of RIS occur during complete outages of the direct disconnected master AP-CPU channel. These results demonstrate RIS's potential to significantly enhance fronthaul reliability while minimizing infrastructure costs in next-generation wireless networks.
中文: 本研究证明,在无蜂窝大规模MIMO系统中,可重构智能表面通过增强备用链路和实现资源共享,可将所需冗余容量降低65.6%以实现99%的生存率,显著提升前传可靠性。
English: This study demonstrates that reconfigurable intelligent surfaces (RIS) can enhance fronthaul survivability in cell-free massive MIMO systems by improving backup links and enabling resource sharing, reducing the redundant capacity required to reach 99% survivability by 65.6%.

Authors:Minsu Kim, Pingchuan Ma, Honglie Chen, Stavros Petridis, Maja Pantic
Title: Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis
Abstract:
This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS) where the voice can be generated from a face image, and the characteristics of the output speech (e.g., pace, noise level, distance, tone, place) can be controlled with a natural text description. Specifically, we aim to mitigate the following three challenges in face-driven TTS systems. 1) To overcome the limited audio quality of audio-visual speech corpora, we propose a training method that additionally utilizes high-quality audio-only speech corpora. 2) To generate voices not only from real human faces but also from artistic portraits, we propose augmenting the input face image with stylization. 3) To consider one-to-many possibilities in face-to-voice mapping and ensure consistent voice generation at the same time, we propose to first employ sampling-based decoding and then use prompting with generated speech samples. Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.
中文: 本文提出一种多模态文本转语音系统,能够通过人脸图像生成语音并利用文本描述控制语音特征,通过创新的训练和解码方法解决了音频质量、艺术肖像适应及语音一致性三大挑战。
English: This paper presents a multi-modal TTS system that generates voice from face images and controls speech characteristics via text descriptions, addressing challenges in audio quality, artistic portrait adaptation, and voice consistency through innovative training and decoding methods.

Authors:Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica
Title: Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
Abstract:
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively.
中文:SVG2提出了一种无需训练的框架,通过语义感知的令牌聚类和重排序,提升关键令牌识别精度并优化计算效率,在保持高质量视频生成的同时实现了显著加速。
English: SVG2 introduces a training-free framework with semantic-aware token clustering and reordering to enhance critical token identification and computational efficiency, achieving significant speedups while maintaining high video generation quality.
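The idea of clustering tokens with k-means and reordering them so that each cluster is contiguous can be shown with a small pure-Python sketch. This is not the SVG2 implementation: the function names, the deterministic centroid initialisation, and the tiny 2-D "token" vectors are assumptions for illustration, and the real system runs on GPUs with custom kernels.

```python
def kmeans_labels(points, k, iters=10):
    """Plain Lloyd's k-means; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def semantic_permutation(points, k):
    """Return (order, labels), where `order` groups token indices by cluster."""
    labels = kmeans_labels(points, k)
    # A stable sort keeps the original relative order inside each cluster.
    order = sorted(range(len(points)), key=lambda i: labels[i])
    return order, labels
```

For the six interleaved tokens `[[0.0, 0.0], [10.0, 10.0], [0.1, 0.0], [10.1, 9.9], [0.0, 0.2], [9.8, 10.0]]` with `k=2`, `order` comes out as `[0, 2, 4, 1, 3, 5]`. The tokens near the origin become one contiguous block and the tokens near (10, 10) another, which is the densified layout the abstract refers to.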

Authors:Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu
Title: Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
Abstract:
We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model's evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks -- MathVista, MathVision, and MathVerse -- demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.
中文: 本研究提出了v1,一种轻量级扩展,通过点选复制机制使模型能主动参考视觉信息,确保推理基于感知证据,从而在多模态推理任务中提升性能。
English: The study introduces v1, a lightweight extension that enables models to actively reference visual information through a point-and-copy mechanism, improving performance on multimodal reasoning tasks by keeping inferences grounded in perceptual evidence.

Authors:Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu
Title: v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
Abstract:
When thinking with images, humans rarely rely on a single glance: they revisit visual information repeatedly during reasoning. However, existing models typically process images only once and thereafter generate reasoning entirely in text, lacking mechanisms to re-access or ground inference in visual representations. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. In response, we introduce v1, a lightweight extension that enables active visual referencing through a simple point-and-copy approach. This allows the model to identify relevant image patches and copy their embeddings back into the reasoning stream, ensuring that evolving hypotheses remain grounded in perceptual evidence. Crucially, our pointing strategy lets the MLLM directly select image patches using their semantic representations as keys, keeping perceptual evidence embedded in the same space as the model's reasoning. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Across various multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines, establishing point-and-copy as a practical mechanism for grounded reasoning. The model checkpoint and dataset are available at github.com/jun297/v1.
中文: 本研究提出了v1,一种轻量级扩展,通过点选复制机制使模型能主动参考视觉信息,确保推理基于感知证据,从而在多模态推理任务中提升性能。
English: The study introduces v1, a lightweight extension that enables models to actively reference visual information through a point-and-copy mechanism, improving performance on multimodal reasoning tasks by keeping inferences grounded in perceptual evidence.
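
As a rough illustration of the point-and-copy idea described in the abstract above, the sketch below scores image-patch embeddings against a query vector, points at the best match, and copies that embedding out for reuse. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `point_and_copy`, the scaled dot-product scoring, and the toy inputs are all hypothetical.

```python
import numpy as np

def point_and_copy(query, patch_embeddings):
    """Point at the patch that best matches the query and return a copy of it.

    query: (d,) vector standing in for the model's current reasoning state.
    patch_embeddings: (num_patches, d) array of patch embeddings used as keys.
    Both inputs and the scoring rule are illustrative assumptions.
    """
    # Scaled dot-product score between the query and every patch key.
    scores = patch_embeddings @ query / np.sqrt(query.shape[0])
    # Softmax turns the scores into a pointing distribution over patches.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    index = int(np.argmax(probs))
    # "Copy": hand back the selected embedding so it can re-enter the sequence.
    return index, patch_embeddings[index].copy(), probs

patches = np.eye(4)                      # four toy patch embeddings
state = np.array([0.1, 0.2, 3.0, 0.1])   # toy state most aligned with patch 2
idx, copied, probs = point_and_copy(state, patches)
```

In this toy setup the query is most similar to the third patch, so `idx` is 2 and `copied` equals that patch's embedding.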

Authors:Cheng-Yen Yang, Hsiang-Wei Huang, Pyong-Kun Kim, Chien-Kai Kuo, Jui-Wei Chang, Kwang-Ju Kim, Chung-I Huang, Jenq-Neng Hwang
Title: Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking
Abstract:
We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance in VOT applications. By combining SAM2 with our proposed optimizations, we achieved a first-place AUC score of 89.4 on the 2024 ICPR Multi-modal Object Tracking challenge, demonstrating the effectiveness of our approach. This paper details our methodology, the specific enhancements made to SAM2, and a comprehensive analysis of our results in the context of VOT solutions, along with the multi-modality aspect of the dataset.
Chinese: 我们提出了一种有效方法,通过利用Segment Anything Model 2的预训练能力并引入关键优化技术,使其适用于视觉目标跟踪任务,在2024年ICPR挑战赛中取得了89.4的最高AUC评分。
English: We propose an effective method that adapts the Segment Anything Model 2 for Visual Object Tracking by leveraging its pre-trained capabilities and incorporating key optimizations, achieving a top AUC score of 89.4 in the 2024 ICPR challenge.

Authors:Brian B. Moser, Arundhati S. Shanbhag, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel
Title: A Coreset Selection of Coreset Selection Literature: Introduction and Recent Advances
Abstract:
Coreset selection targets the challenge of finding a small, representative subset of a large dataset that preserves essential patterns for effective machine learning. Although several surveys have examined data reduction strategies before, most focus narrowly on either classical geometry-based methods or active learning techniques. In contrast, this survey presents a more comprehensive view by unifying three major lines of coreset research, namely, training-free, training-oriented, and label-free approaches, into a single taxonomy. We present subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Additionally, we examine how pruning strategies influence generalization and neural scaling laws, offering new insights that are absent from prior reviews. Finally, we compare these methods under varying computational, robustness, and performance demands and highlight open challenges, such as robustness, outlier filtering, and adapting coreset selection to foundation models, for future research.
中文: 本综述提出了一个统一的核集选择方法分类法,整合了免训练、面向训练和无标签三大研究方向,涵盖了现有工作中常被忽视的子领域,并在不同计算与鲁棒性需求下进行了比较分析。
English: This survey provides a unified taxonomy of coreset selection methods, integrating training-free, training-oriented, and label-free approaches while addressing overlooked subfields and comparing their performance under diverse computational and robustness requirements.
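
To make the "classical geometry-based methods" mentioned in the abstract above concrete, here is a minimal sketch of greedy k-center selection, one widely used geometric coreset heuristic. It is an illustrative example chosen here, not a method attributed to the survey; the function name and toy data are hypothetical.

```python
import numpy as np

def k_center_greedy(points, k, first=0):
    """Greedy k-center: repeatedly add the point farthest from the selection."""
    selected = [first]
    # Distance from every point to its nearest selected point so far.
    min_dist = np.linalg.norm(points - points[first], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Toy data: two tight pairs and one isolated point.
data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
subset = k_center_greedy(data, k=3)
```

Starting from point 0, the heuristic picks the isolated point 4 and then point 2, so each cluster in the toy data is represented once.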

Authors:Ming Hu, Zhengdi Yu, Feilong Tang, Kaiwen Chen, Yulong Li, Imran Razzak, Junjun He, Tolga Birdal, Kaijing Zhou, Zongyuan Ge
Title: Towards Dynamic 3D Reconstruction of Hand-Instrument Interaction in Ophthalmic Surgery
Abstract:
Accurate 3D reconstruction of hands and instruments is critical for vision-based analysis of ophthalmic microsurgery, yet progress has been hampered by the lack of realistic, large-scale datasets and reliable annotation tools. In this work, we introduce OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, comprising 41 sequences from 40 surgeons and totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. To scalably produce high-fidelity labels, we design a multi-stage automatic annotation pipeline that integrates multi-view data observation, a data-driven motion prior with cross-view geometric consistency and biomechanical constraints, and collision-aware interaction constraints for instrument interactions. Building upon OphNet-3D, we establish two challenging benchmarks, bimanual hand pose estimation and hand-instrument interaction reconstruction, and propose two dedicated architectures: H-Net for dual-hand mesh recovery and OH-Net for joint reconstruction of two-hand-two-instrument interactions. These models leverage a novel spatial reasoning module with weak-perspective camera modeling and collision-aware center-based representation. Both architectures outperform existing methods by substantial margins, achieving improvements of over 2 mm in Mean Per Joint Position Error (MPJPE) and up to 23% in ADD-S metrics for hand and instrument reconstruction, respectively.
中文: OphNet-3D推出了首个全面的眼科手术RGB-D三维重建数据集,通过自动标注和新基准测试,其H-Net和OH-Net模型在精度上显著超越了现有方法。
English: OphNet-3D introduces the first comprehensive RGB-D dataset for 3D reconstruction in ophthalmic surgery, featuring automatic annotation and novel benchmarks where H-Net and OH-Net models significantly outperform existing methods in accuracy.

Authors:Wanghan Xu, Xiangyu Zhao, Yuhao Zhou, Xiaoyu Yue, Ben Fei, Fenghua Ling, Wenlong Zhang, Lei Bai
Title: EarthSE: A Benchmark for Evaluating Earth Scientific Exploration Capability of LLMs
Abstract:
Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks for domains such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs' capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold with new metrics, a dataset comprising open-ended multi-turn dialogues specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities. The benchmark is available at https://huggingface.co/ai-earth.
中文: 本文提出一个全面的地球科学基准,通过包含开放式对话的多层级数据集评估大语言模型能力,实验表明现有模型虽取得进展,但在科学探索能力上仍有显著不足。
English: This paper introduces a comprehensive Earth science benchmark to evaluate LLMs' capabilities, featuring multi-level datasets including open-ended dialogues that reveal significant limitations in current models despite their advancements.

Authors:Yepeng Liu, Xuandong Zhao, Christopher Kruegel, Dawn Song, Yuheng Bu
Title: In-Context Watermarks for Large Language Models
Abstract:
The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution.
中文摘要:本文提出的上下文水印技术通过提示工程实现模型无关的水印嵌入,解决了现有方法需要模型访问权限的限制,为实际场景中检测AI生成文本提供了可行方案。
English Summary: In-Context Watermarking (ICW) is introduced as a model-agnostic method that embeds watermarks through prompt engineering, addressing the limitations of existing techniques by enabling practical detection of AI-generated text without requiring model access.

Authors:Xiang Li, Yong Tao, Siyuan Zhang, Siwei Liu, Zhitong Xiong, Chunbo Luo, Lu Liu, Mykola Pechenizkiy, Xiao Xiang Zhu, Tianjin Huang
Title: REOBench: Benchmarking Robustness of Earth Observation Foundation Models
Abstract:
Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions; (2) the severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drops ranging from less than 1% to over 20%; and (3) vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models.
中文: 地球观测基础模型虽具良好泛化能力,但在现实图像扰动下性能显著下降,REOBench基准测试通过六类任务和十二种干扰类型的系统评估,揭示了模型脆弱性并指出构建更强健模型的必要性。
English: Earth observation foundation models exhibit strong generalization but suffer significant performance degradation under real-world image corruptions, as revealed by the comprehensive REOBench evaluation across six tasks and twelve corruption types, highlighting the need for more robust models.

Authors:Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zelin Peng, Zhiwei Yang, Jionglong Su, Minquan Lin, Yifan Peng, Xuelian Cheng, Imran Razzak, Zongyuan Ge
Title: Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
Abstract:
Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
中文: 针对多模态大语言模型中的幻觉问题,本研究提出FarSight解码策略,通过优化因果掩码增强令牌交互并减少异常令牌的注意力干扰,从而显著提升模型性能。
English: Recent multimodal large language models often suffer from hallucinations, but the proposed FarSight decoding strategy effectively mitigates this issue by optimizing causal masks to enhance token interaction and reduce attention to outlier tokens.

Authors:Renyi Zhong, Yichen Li, Guangba Yu, Wenwei Gu, Jinxi Kuang, Yintong Huo, Michael R. Lyu
Title: Larger Is Not Always Better: Exploring Small Open-source Language Models in Logging Statement Generation
Abstract:
Developers use logging statements to create logs that document system behavior and aid in software maintenance. As such, high-quality logging is essential for effective maintenance; however, manual logging often leads to errors and inconsistency. Recent methods emphasize using large language models (LLMs) for automated logging statement generation, but these present privacy and resource issues, hindering their suitability for enterprise use. This paper presents the first large-scale empirical study evaluating small open-source language models (SOLMs) for automated logging statement generation. We evaluate four prominent SOLMs using various prompt strategies, such as Retrieval-Augmented Generation (RAG), and parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA). Our results show that fine-tuned SOLMs with LoRA and RAG prompts, particularly Qwen2.5-coder-14B, outperform existing tools and LLM baselines in predicting logging locations and generating high-quality statements, with robust generalization across diverse repositories. These findings highlight SOLMs as a privacy-preserving, efficient alternative for automated logging.
中文摘要:本研究表明,经过微调的小型开源语言模型(特别是采用LoRA和RAG技术的Qwen2.5-coder-14B)在自动生成日志语句方面优于现有方法,同时具备隐私保护和高效能优势。
English Summary: This study demonstrates that fine-tuned small open-source language models, especially Qwen2.5-coder-14B with LoRA and RAG techniques, surpass existing methods in generating automated logging statements while offering privacy protection and efficiency advantages.

Authors:Huitong Yang, Zhuoxiao Chen, Fengyi Zhang, Zi Huang, Yadan Luo
Title: CodeMerge: Codebook-Guided Model Merging for Robust Test-Time Adaptation in Autonomous Driving
Abstract:
Maintaining robust 3D perception under dynamic and unpredictable test-time conditions remains a critical challenge for autonomous driving systems. Existing test-time adaptation (TTA) methods often fail in high-variance tasks like 3D object detection due to unstable optimization and sharp minima. While recent model merging strategies based on linear mode connectivity (LMC) offer improved stability by interpolating between fine-tuned checkpoints, they are computationally expensive, requiring repeated checkpoint access and multiple forward passes. In this paper, we introduce CodeMerge, a lightweight and scalable model merging framework that bypasses these limitations by operating in a compact latent space. Instead of loading full models, CodeMerge represents each checkpoint with a low-dimensional fingerprint derived from the source model's penultimate features and constructs a key-value codebook. We compute merging coefficients using ridge leverage scores on these fingerprints, enabling efficient model composition without compromising adaptation quality. Our method achieves strong performance across challenging benchmarks, improving end-to-end 3D detection by 14.9% NDS on nuScenes-C and LiDAR-based detection by over 7.6% mAP on nuScenes-to-KITTI, while benefiting downstream tasks such as online mapping, motion prediction, and planning even without training. Code and pretrained models are released in the supplementary material.
中文: CodeMerge提出了一种轻量级模型融合框架,通过在紧凑的潜在空间中操作来提升3D物体检测性能,在自动驾驶基准测试中取得了显著改进,且不牺牲适应质量。
English: CodeMerge introduces a lightweight model merging framework that enhances 3D object detection performance by operating in a compact latent space, achieving significant improvements in autonomous driving benchmarks without compromising adaptation quality.
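
The abstract above says merging coefficients are computed from ridge leverage scores on checkpoint fingerprints. The sketch below computes the standard ridge leverage score, tau_i = x_i^T (X^T X + lambda I)^{-1} x_i, for each fingerprint row. Normalizing the scores into coefficients, the function names, and the toy fingerprints are assumptions made for illustration; the abstract does not specify how CodeMerge maps scores to coefficients.

```python
import numpy as np

def ridge_leverage_scores(fingerprints, lam=1e-2):
    """tau_i = x_i^T (X^T X + lam * I)^{-1} x_i for each row x_i of X."""
    X = np.asarray(fingerprints, dtype=float)
    d = X.shape[1]
    gram = X.T @ X + lam * np.eye(d)
    # Solve a linear system instead of forming the explicit inverse.
    solved = np.linalg.solve(gram, X.T)          # shape (d, n)
    # Row-wise quadratic form x_i^T * solved[:, i].
    return np.einsum("ij,ji->i", X, solved)

def merging_coefficients(fingerprints, lam=1e-2):
    """Assumed mapping: normalize the scores so the coefficients sum to one."""
    scores = ridge_leverage_scores(fingerprints, lam)
    return scores / scores.sum()

# Toy fingerprints: two identical checkpoints and one distinct checkpoint.
F = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
scores = ridge_leverage_scores(F, lam=1.0)
coeffs = merging_coefficients(F, lam=1.0)
```

With these toy inputs the two redundant fingerprints each score 1/3 and the distinct one scores 1/2, so the distinct checkpoint receives the largest coefficient.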

Authors:Jing Bi, Pinxin Liu, Ali Vosoughi, Jiarui Wu, Jinxi He, Chenliang Xu
Title: $I^2G$: Generating Instructional Illustrations via Text-Conditioned Diffusion
Abstract:
The effective communication of procedural knowledge remains a significant challenge in natural language processing (NLP), as purely textual instructions often fail to convey complex physical actions and spatial relationships. We address this limitation by proposing a language-driven framework that translates procedural text into coherent visual instructions. Our approach models the linguistic structure of instructional content by decomposing it into goal statements and sequential steps, then conditioning visual generation on these linguistic elements. We introduce three key innovations: (1) a constituency parser-based text encoding mechanism that preserves semantic completeness even with lengthy instructions, (2) a pairwise discourse coherence model that maintains consistency across instruction sequences, and (3) a novel evaluation protocol specifically designed for procedural language-to-image alignment. Our experiments across three instructional datasets (HTStep, CaptainCook4D, and WikiAll) demonstrate that our method significantly outperforms existing baselines in generating visuals that accurately reflect the linguistic content and sequential nature of instructions. This work contributes to the growing body of research on grounding procedural language in visual content, with applications spanning education, task guidance, and multimodal language understanding.
中文: 本研究提出了一种语言驱动框架,通过建模指令语言结构并引入三项技术创新,将流程性文本转化为连贯的视觉指导,在多个数据集上显著提升了生成序列化视觉内容的准确性。
English: This study introduces a language-driven framework that converts procedural text into visual instructions through linguistic structure modeling and three technical innovations, significantly outperforming existing methods in generating accurate sequential visuals across multiple datasets.

Authors:Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, Junchi Yan
Title: Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)
Abstract:
Reinforcement Learning (RL) can mitigate the causal confusion and distribution shift inherent to imitation learning (IL). However, applying RL to end-to-end autonomous driving (E2E-AD) remains an open problem because of its training difficulty, and IL is still the mainstream paradigm in both academia and industry. Recently, Model-based Reinforcement Learning (MBRL) has demonstrated promising results in neural planning; however, these methods typically require privileged information as input rather than raw sensor data. We fill this gap by designing Raw2Drive, a dual-stream MBRL approach. Initially, we efficiently train an auxiliary privileged world model paired with a neural planner that uses privileged information as input. Subsequently, we introduce a raw sensor world model trained via our proposed Guidance Mechanism, which ensures consistency between the raw sensor world model and the privileged world model during rollouts. Finally, the raw sensor world model combines the prior knowledge embedded in the heads of the privileged world model to effectively guide the training of the raw sensor policy. Raw2Drive is so far the only RL-based end-to-end method on CARLA Leaderboard 2.0 and Bench2Drive, and it achieves state-of-the-art performance.
中文摘要:强化学习能克服模仿学习在自动驾驶中的局限性,而提出的Raw2Drive模型通过双流模型强化学习方法,在不依赖特权信息的情况下有效填补了研究空白,实现了最先进的性能表现。
English Summary: Reinforcement Learning can overcome imitation learning's limitations in autonomous driving, and the proposed Raw2Drive model effectively bridges the gap by using dual-stream model-based reinforcement learning to achieve state-of-the-art performance without requiring privileged information.

Authors:Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, Mengnan Du
Title: SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. Then we train linear classifiers to identify a small subspace of task-relevant dimensions in latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and political polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates with minimal degradation in generation quality compared to existing methods. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions.
中文摘要:本文提出一种基于稀疏自编码器的监督引导方法,通过可解释向量精准控制大语言模型行为,在多项任务中实现高效引导且保持生成质量。
English Summary: This paper introduces a supervised steering method using sparse autoencoders to create interpretable vectors that effectively control LLM behaviors in targeted tasks with minimal performance loss.
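
The abstract above describes restricting a steering vector to a small subspace of latent dimensions picked out by linear classifiers. The sketch below shows only that subspace constraint: it keeps the k dimensions with the largest absolute classifier weight and zeroes the rest. The selection rule, the function names, and the toy values are assumptions; the paper learns its steering vectors by optimization, which this sketch does not attempt.

```python
import numpy as np

def subspace_steering_vector(raw_vector, classifier_weights, k):
    """Keep the k latent dimensions with the largest |classifier weight|.

    Selecting dimensions by weight magnitude is an assumed, simplified rule.
    """
    top = np.argsort(-np.abs(classifier_weights))[:k]
    mask = np.zeros_like(raw_vector)
    mask[top] = 1.0
    return raw_vector * mask, np.sort(top)

def steer(latent, steering_vector, strength=1.0):
    """Shift a sparse latent representation along the constrained vector."""
    return latent + strength * steering_vector

weights = np.array([0.05, -2.0, 0.1, 1.5, 0.0])   # toy classifier weights
raw = np.ones(5)                                  # toy unconstrained vector
vec, dims = subspace_steering_vector(raw, weights, k=2)
steered = steer(np.zeros(5), vec, strength=0.5)
```

Here only dimensions 1 and 3 survive the mask, so the intervention touches two of the five latent dimensions and leaves the others unchanged.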

Authors:Kaiwen Zhou, Xuandong Zhao, Gaowen Liu, Jayanth Srinivasa, Aosong Feng, Dawn Song, Xin Eric Wang
Title: SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
Abstract:
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements in complex tasks. However, they pose great safety risks against harmful queries and adversarial attacks. While recent mainstream safety efforts on LRMs, such as supervised fine-tuning (SFT), improve safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After thorough investigation of LRMs' generation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the "key sentence", which follows the model's query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, including two complementary objectives to better activate the safety aha moment in the key sentence: (1) a Dual-Path Safety Head to enhance the safety signal in the model's internal representations before the key sentence, and (2) a Query-Mask Modeling objective to improve the model's attention to its query understanding, which has important safety hints. Experiments across multiple safety benchmarks demonstrate that our methods significantly improve safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the average harmfulness rate by 9.6%, while maintaining general abilities. Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.
中文摘要:大型推理模型(LRMs)通过显式推理获得优异性能,但面临有害查询的安全风险;提出的SafeKey方法通过激活生成过程中的关键安全节点,在保持通用能力的同时显著提升了安全泛化性能。
English Summary: Large Reasoning Models (LRMs) achieve strong performance through explicit reasoning but face safety risks from harmful queries, leading to the proposed SafeKey method that enhances safety generalization by activating critical safety moments in generation while preserving general capabilities.

Authors:Zidi Xiong, Yuping Lin, Wenya Xie, Pengfei He, Jiliang Tang, Himabindu Lakkaraju, Zhen Xiang
Title: How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior
Abstract:
Memory is a critical component in large language model (LLM)-based agents, enabling them to store and retrieve past executions to improve task performance over time. In this paper, we conduct an empirical study on how memory management choices impact LLM agents' behavior, especially their long-term performance. Specifically, we focus on two fundamental memory operations that are widely used by many agent frameworks (addition, which incorporates new experiences into the memory base, and deletion, which selectively removes past experiences) to systematically study their impact on agent behavior. Through our quantitative analysis, we find that LLM agents display an experience-following property: high similarity between a task input and the input in a retrieved memory record often results in highly similar agent outputs. Our analysis further reveals two significant challenges associated with this property: error propagation, where inaccuracies in past experiences compound and degrade future performance, and misaligned experience replay, where outdated or irrelevant experiences negatively influence current tasks. Through controlled experiments, we show that combining selective addition and deletion strategies can help mitigate these negative effects, yielding an average absolute performance gain of 10% compared to naive memory growth. Furthermore, we highlight how memory management choices affect agents' behavior under challenging conditions such as task distribution shifts and constrained memory resources. Our findings offer insights into the behavioral dynamics of LLM agent memory systems and provide practical guidance for designing memory components that support robust, long-term agent performance. We also release our code to facilitate further study.
Chinese: 研究表明,通过选择性添加和删除记忆经验的有效记忆管理,能够缓解大语言模型智能体中的错误传播和记忆回放失准问题,在多种挑战性条件下平均提升10%的长期性能。
English: This study demonstrates that effective memory management, through selective addition and deletion of experiences, can mitigate error propagation and misaligned experience replay in LLM agents, improving long-term performance by 10% on average under various challenging conditions.
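
The two memory operations studied in the abstract above, selective addition and selective deletion, can be sketched as a small memory store. This is a minimal sketch under stated assumptions: the `Record` and `Memory` classes, the `quality` score, and both thresholds are hypothetical stand-ins, since the abstract does not give the actual selection criteria.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    task: str
    output: str
    quality: float  # hypothetical score in [0, 1] from some evaluator

@dataclass
class Memory:
    add_threshold: float = 0.7      # assumed cut-off for storing a record
    delete_threshold: float = 0.5   # assumed cut-off for keeping a record
    records: list = field(default_factory=list)

    def add(self, record):
        """Selective addition: store only experiences judged good enough."""
        if record.quality >= self.add_threshold:
            self.records.append(record)
            return True
        return False

    def prune(self):
        """Selective deletion: drop stored records whose score has fallen."""
        before = len(self.records)
        self.records = [r for r in self.records
                        if r.quality >= self.delete_threshold]
        return before - len(self.records)

memory = Memory()
added = [memory.add(Record("t1", "o1", 0.9)),
         memory.add(Record("t2", "o2", 0.3))]
memory.records[0].quality = 0.2   # a later re-evaluation marks it unhelpful
removed = memory.prune()
```

The low-quality record is never stored, which limits error propagation at write time, and the stale record is removed on the next prune, which limits misaligned replay at read time.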

Authors:Kazuaki Mishima, Antoni Bigata Casademunt, Stavros Petridis, Maja Pantic, Kenji Suzuki
Title: FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose, Expression, and Emotion
Abstract:
Human facial images encode a rich spectrum of information, encompassing both stable identity-related traits and mutable attributes such as pose, expression, and emotion. While recent advances in image generation have enabled high-quality identity-conditional face synthesis, precise control over non-identity attributes remains challenging, and disentangling identity from these mutable factors is particularly difficult. To address these limitations, we propose a novel identity-conditional diffusion model that introduces two lightweight control modules designed to independently manipulate facial pose, expression, and emotion without compromising identity preservation. These modules are embedded within the cross-attention layers of the base diffusion model, enabling precise attribute control with minimal parameter overhead. Furthermore, our tailored training strategy, which leverages cross-attention between the identity feature and each non-identity control feature, encourages identity features to remain orthogonal to control signals, enhancing controllability and diversity. Quantitative and qualitative evaluations, along with perceptual user studies, demonstrate that our method surpasses existing approaches in terms of control accuracy over pose, expression, and emotion, while also improving generative diversity under identity-only conditioning.
Chinese: 本文提出了一种新颖的身份条件扩散模型,通过轻量级控制模块独立操纵面部姿态、表情和情感,并利用正交特征训练保持身份特征,在控制精度和生成多样性方面均优于现有方法。
English: This paper introduces a novel identity-conditional diffusion model with lightweight control modules that independently manipulate facial pose, expression, and emotion while preserving identity through orthogonal feature training, outperforming existing methods in control accuracy and diversity.

Authors:Chenlin Ming, Chendi Qu, Mengzhang Cai, Qizhi Pei, Zhuoshi Pan, Yu Li, Xiaoming Duan, Lijun Wu, Conghui He
Title: IDEAL: Data Equilibrium Adaptation for Multi-Capability Language Model Alignment
Abstract:
Large Language Models (LLMs) have achieved impressive performance through Supervised Fine-tuning (SFT) on diverse instructional datasets. When training on multiple capabilities simultaneously, the composition of the mixture training dataset, governed by the volumes of data from different domains, is a critical factor that directly impacts the final model's performance. While many studies focus on enhancing the quality of training datasets through data selection methods, few works explore the intricate relationship between the compositional quantity of mixture training datasets and the emergent capabilities of LLMs. Given the availability of a high-quality multi-domain training dataset, understanding the impact of data from each domain on the model's overall capabilities is crucial for preparing SFT data and training a well-balanced model that performs effectively across diverse domains. In this work, we introduce IDEAL, an innovative data equilibrium adaptation framework designed to effectively optimize volumes of data from different domains within mixture SFT datasets, thereby enhancing the model's alignment and performance across multiple capabilities. IDEAL employs a gradient-based approach to iteratively refine the training data distribution, dynamically adjusting the volumes of domain-specific data based on their impact on downstream task performance. By leveraging this adaptive mechanism, IDEAL ensures a balanced dataset composition, enabling the model to achieve robust generalization and consistent proficiency across diverse tasks. Experiments across different capabilities demonstrate that IDEAL outperforms conventional uniform data allocation strategies, achieving a comprehensive improvement of approximately 7% in multi-task evaluation scores.
中文摘要:IDEAL框架通过基于梯度的动态数据分配优化多领域训练集的组成,使模型在跨任务评估中实现约7%的综合性能提升。
English Summary: The IDEAL framework optimizes the data distribution in multi-domain training sets through gradient-based adjustments, enhancing model performance across diverse tasks by approximately 7% in multi-task evaluations.

Authors:Yang Mu, Zhitong Xiong, Yi Wang, Muhammad Shahzad, Franz Essl, Mark van Kleunen, Xiao Xiang Zhu
Title: GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification
Abstract:
Global tree species mapping using remote sensing data is vital for biodiversity monitoring, forest management, and ecological research. However, progress in this field has been constrained by the scarcity of large-scale, labeled datasets. To address this, we introduce GlobalGeoTree, a comprehensive global dataset for tree species classification. GlobalGeoTree comprises 6.3 million geolocated tree occurrences, spanning 275 families, 2,734 genera, and 21,001 species across the hierarchical taxonomic levels. Each sample is paired with Sentinel-2 image time series and 27 auxiliary environmental variables, encompassing bioclimatic, geographic, and soil data. The dataset is partitioned into GlobalGeoTree-6M for model pretraining and curated evaluation subsets, primarily GlobalGeoTree-10kEval for zero-shot and few-shot benchmarking. To demonstrate the utility of the dataset, we introduce a baseline model, GeoTreeCLIP, which leverages paired remote sensing data and taxonomic text labels within a vision-language framework pretrained on GlobalGeoTree-6M. Experimental results show that GeoTreeCLIP achieves substantial improvements in zero- and few-shot classification on GlobalGeoTree-10kEval over existing advanced models. By making the dataset, models, and code publicly available, we aim to establish a benchmark to advance tree species classification and foster innovation in biodiversity research and ecological applications.
Chinese: GlobalGeoTree数据集包含630万棵树样本及配套环境数据,其基准模型GeoTreeCLIP在零样本和小样本分类上表现优异,为树种识别和生物多样性研究提供了重要基准。
English: The GlobalGeoTree dataset, with 6.3 million tree samples and paired environmental data, enables advanced tree species classification through its baseline model GeoTreeCLIP, which shows significant improvements in zero- and few-shot learning.

Authors:Xiaokun Zhang, Bo Xu, Chenliang Li, Bowei He, Hongfei Lin, Chen Ma, Fenglong Ma
Title: A Survey on Side Information-driven Session-based Recommendation: From a Data-centric Perspective
Abstract:
Session-based recommendation is gaining increasing attention due to its practical value in predicting the intents of anonymous users based on limited behaviors. Emerging efforts incorporate various side information to alleviate inherent data scarcity issues in this task, leading to impressive performance improvements. The core of side information-driven session-based recommendation is the discovery and utilization of diverse data. In this survey, we provide a comprehensive review of this task from a data-centric perspective. Specifically, this survey commences with a clear formulation of the task. This is followed by a detailed exploration of various benchmarks rich in side information that are pivotal for advancing research in this field. Afterwards, we delve into how different types of side information enhance the task, underscoring data characteristics and utility. Moreover, we discuss the usage of various side information, including data encoding, data injection, and involved techniques. A systematic review of research progress is then presented, with the taxonomy by the types of side information. Finally, we summarize the current limitations and present the future prospects of this vibrant topic.
中文摘要:本综述从数据中心的视角系统回顾了基于会话的推荐研究,重点探讨了各类辅助信息如何通过丰富数据集、优化技术方法来解决数据稀疏性问题,并详细分析了其应用机制与研究进展。
English Summary: This survey comprehensively reviews session-based recommendation from a data-centric perspective, focusing on how diverse side information alleviates data scarcity and enhances prediction accuracy through systematic exploration of benchmarks, utility analysis, and technical implementations.

Authors:Ning Lu, Shengcai Liu, Jiahao Wu, Weiyu Chen, Zhirui Zhang, Yew-Soon Ong, Qi Wang, Ke Tang
Title: Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
Abstract:
Large language models (LLMs) have shown great potential as general-purpose AI assistants across various domains. To fully leverage this potential in specific applications, many companies provide fine-tuning API services, enabling users to upload their own data for LLM customization. However, fine-tuning services introduce a new safety threat: user-uploaded data, whether harmful or benign, can break the model's alignment, leading to unsafe outputs. Moreover, existing defense methods struggle to address the diversity of fine-tuning datasets (e.g., varying sizes, tasks), often sacrificing utility for safety or vice versa. To address this issue, we propose Safe Delta, a safety-aware post-training defense method that adjusts the delta parameters (i.e., the parameter change before and after fine-tuning). Specifically, Safe Delta estimates the safety degradation, selects delta parameters to maximize utility while limiting overall safety loss, and applies a safety compensation vector to mitigate residual safety loss. Through extensive experiments on four diverse datasets with varying settings, our approach consistently preserves safety while ensuring that the utility gain from benign datasets remains unaffected.
Chinese: 针对用户数据微调带来的安全风险,提出的Safe Delta方法通过调整增量参数,在保持模型效用的同时有效维护了安全性。
English: Large language models face safety risks from fine-tuning with user data, but the proposed Safe Delta method effectively maintains model safety without compromising utility by adjusting delta parameters.
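Illustrative sketch (not the authors' implementation): the abstract defines delta parameters as the parameter change before and after fine-tuning, and describes keeping a selected subset of them plus a safety compensation vector. The snippet below shows only that final merge step on flat weight lists. The function name `apply_safe_delta`, the boolean `keep_mask`, and the `compensation` values are hypothetical inputs; how Safe Delta actually estimates safety degradation and chooses the mask is not specified in the abstract and is not modeled here.

```python
def apply_safe_delta(base, finetuned, keep_mask, compensation):
    """Merge base weights with the selected delta parameters and a compensation term.

    All four arguments are flat lists of equal length (a toy stand-in for tensors).
    """
    if not (len(base) == len(finetuned) == len(keep_mask) == len(compensation)):
        raise ValueError("all inputs must have the same length")
    merged = []
    for w0, w1, keep, c in zip(base, finetuned, keep_mask, compensation):
        delta = w1 - w0  # delta parameter: change introduced by fine-tuning
        # Keep the delta only where the (hypothetical) mask allows it,
        # then add the (hypothetical) compensation value.
        merged.append(w0 + (delta if keep else 0.0) + c)
    return merged


base = [1.0, 2.0, 3.0]
finetuned = [1.5, 1.0, 3.25]
keep_mask = [True, False, True]      # assumed output of a selection step
compensation = [0.0, 0.25, 0.0]      # assumed compensation vector
merged = apply_safe_delta(base, finetuned, keep_mask, compensation)
```

In this toy run the second weight reverts to its base value plus the compensation term, while the other two keep their fine-tuned values.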

Authors:Yusu Qian, Jiasen Lu, Tsu-Jui Fu, Xinze Wang, Chen Chen, Yinfei Yang, Wenze Hu, Zhe Gan
Title: GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing
Abstract:
Editing images using natural language instructions has become a natural and expressive way to modify visual content; yet, evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent using an object-aware masking technique and preservation scoring. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy, but often over-modifies irrelevant image regions, highlighting a key trade-off in the current model behavior. GIE-Bench provides a scalable, reproducible framework for advancing more accurate evaluation of text-guided image editing.
Chinese: 本研究提出了GIE-Bench这一新基准,通过功能正确性和图像内容保持度指标评估文本引导图像编辑模型,发现GPT-Image-1虽在指令遵循上表现优异,但容易过度修改非目标区域。
English: This work introduces GIE-Bench, a new benchmark for evaluating text-guided image editing models through functional correctness and image content preservation metrics, revealing that while GPT-Image-1 excels in instruction accuracy, it tends to over-modify non-targeted areas.

Authors:Florinel-Alin Croitoru, Vlad Hondru, Marius Popescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah
Title: MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark
Abstract:
We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages are available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at: https://huggingface.co/datasets/unibuc-cs/MAVOS-DD.
中文: 我们推出了首个大规模多语言音视频深度伪造检测基准,包含八种语言超过250小时的内容,实验表明现有检测器在面对训练时未见的生成模型和语言时性能显著下降。
English: We introduce the first large-scale multilingual audio-video deepfake detection benchmark, featuring over 250 hours of content across eight languages and revealing that current detectors struggle in open-set scenarios where unseen generation models and languages are tested.

Authors:Shanhui Zhao, Hao Wen, Wenjie Du, Cheng Liang, Yunxin Liu, Xiaozhou Ye, Ye Ouyang, Yuanchun Li
Title: LLM-Explorer: Towards Efficient and Affordable LLM-based Exploration for Mobile Apps
Abstract:
Large language models (LLMs) have opened new opportunities for automated mobile app exploration, an important and challenging problem that used to suffer from the difficulty of generating meaningful UI interactions. However, existing LLM-based exploration approaches rely heavily on LLMs to generate actions in almost every step, leading to a huge cost in token fees and computational resources. We argue that such extensive usage of LLMs is neither necessary nor effective, since many actions during exploration do not require, or may even be biased by, the abilities of LLMs. Further, based on the insight that precise and compact knowledge plays the central role in effective exploration, we introduce LLM-Explorer, a new exploration agent designed for efficiency and affordability. LLM-Explorer uses LLMs primarily for maintaining the knowledge instead of generating actions, and the knowledge is used to guide action generation in an LLM-less manner. Based on a comparison with 5 strong baselines on 20 typical apps, LLM-Explorer was able to achieve the fastest and highest coverage among all automated app explorers, with over 148x lower cost than the state-of-the-art LLM-based approach.
中文: LLM-Explorer提出了一种高效的移动应用探索智能体,通过以知识维护为核心而非依赖大语言模型生成操作,在实现更高覆盖率的同时将成本降低至现有方法的1/148。
English: LLM-Explorer introduces an efficient mobile app exploration agent that reduces reliance on LLMs for action generation by prioritizing knowledge maintenance, achieving higher coverage at 148x lower cost than existing methods.

Authors:Ziming Liu, Yizhou Liu, Jeff Gore, Max Tegmark
Title: Neural Thermodynamic Laws for Large Language Model Training
Abstract:
Beyond neural scaling laws, little is known about the laws underlying large language models (LLMs). We introduce Neural Thermodynamic Laws (NTL) -- a new framework that offers fresh insights into LLM training dynamics. On the theoretical side, we demonstrate that key thermodynamic quantities (e.g., temperature, entropy, heat capacity, thermal conduction) and classical thermodynamic principles (e.g., the three laws of thermodynamics and the equipartition theorem) naturally emerge under river-valley loss landscape assumptions. On the practical side, this scientific perspective yields intuitive guidelines for designing learning rate schedules.
中文: 本文提出神经热力学定律(NTL)新框架,揭示了热力学原理在大语言模型训练中的涌现现象,并为学习率调度提供了实用指导原则。
English: This paper introduces Neural Thermodynamic Laws (NTL), a new framework that reveals how thermodynamic principles emerge in large language model training and provides practical guidelines for learning rate schedules.

Authors:JieHao Wu, Ziwei Wang, Junjie Sheng, Wenhao Li, Xiangfeng Wang, Jun Luo
Title: Learning Virtual Machine Scheduling in Cloud Computing through Language Agents
Abstract:
In cloud services, virtual machine (VM) scheduling is a typical Online Dynamic Multidimensional Bin Packing (ODMBP) problem, characterized by large-scale complexity and fluctuating demands. Traditional optimization methods struggle to adapt to real-time changes, domain-expert-designed heuristic approaches suffer from rigid strategies, and existing learning-based methods often lack generalizability and interpretability. To address these limitations, this paper proposes a hierarchical language agent framework named MiCo, which provides a large language model (LLM)-driven heuristic design paradigm for solving ODMBP. Specifically, ODMBP is formulated as a Semi-Markov Decision Process with Options (SMDP-Option), enabling dynamic scheduling through a two-stage architecture, i.e., Option Miner and Option Composer. Option Miner utilizes LLMs to discover diverse and useful non-context-aware strategies by interacting with constructed environments. Option Composer employs LLMs to discover a composing strategy that integrates the non-context-aware strategies with the contextual ones. Extensive experiments on real-world enterprise datasets demonstrate that MiCo achieves a 96.9% competitive ratio in large-scale scenarios involving more than 10,000 virtual machines. It maintains high performance even under nonstationary request flows and diverse configurations, thus validating its effectiveness in complex and large-scale cloud environments.
中文: 本文提出MiCo分层语言代理框架,利用大语言模型设计启发式方法解决云虚拟机调度中的在线动态多维装箱问题,在大规模场景下实现了96.9%的竞争比高性能。
English: This paper introduces MiCo, a hierarchical language agent framework that uses large language models to design heuristics for solving the complex online dynamic multidimensional bin packing problem in cloud VM scheduling, achieving high performance in large-scale scenarios.

Authors:Siyuan Yan, Xieji Li, Ming Hu, Yiwen Jiang, Zhen Yu, Zongyuan Ge
Title: MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment
Abstract:
Dermatological diagnosis represents a complex multimodal challenge that requires integrating visual features with specialized clinical knowledge. While vision-language pretraining (VLP) has advanced medical AI, its effectiveness in dermatology is limited by text length constraints and the lack of structured texts. In this paper, we introduce MAKE, a Multi-Aspect Knowledge-Enhanced vision-language pretraining framework for zero-shot dermatological tasks. Recognizing that comprehensive dermatological descriptions require multiple knowledge aspects that exceed standard text constraints, our framework introduces: (1) a multi-aspect contrastive learning strategy that decomposes clinical narratives into knowledge-enhanced sub-texts through large language models, (2) a fine-grained alignment mechanism that connects sub-captions with diagnostically relevant image features, and (3) a diagnosis-guided weighting scheme that adaptively prioritizes different sub-captions based on a clinical significance prior. Through pretraining on 403,563 dermatological image-text pairs collected from education resources, MAKE significantly outperforms state-of-the-art VLP models on eight datasets across zero-shot skin disease classification, concept annotation, and cross-modal retrieval tasks. Our code will be made publicly available at https://github.com/SiyuanYan1/MAKE.
中文摘要:本文提出MAKE框架,通过将临床描述分解为知识增强子文本并与图像特征进行细粒度对齐,解决了皮肤病AI中文本长度限制和结构化文本缺乏的问题,在多种零样本皮肤病任务中显著优于现有视觉语言预训练模型。
English Summary: The paper introduces MAKE, a multi-aspect knowledge-enhanced VLP framework that addresses dermatological AI limitations by decomposing clinical narratives into knowledge-enhanced sub-texts and aligning them with relevant image features, achieving superior zero-shot performance across multiple dermatological tasks.

Authors:Sheng Liang, Hang Lv, Zhihao Wen, Yaxiong Wu, Yongyue Zhang, Hao Wang, Yong Liu
Title: Adaptive Schema-aware Event Extraction with Retrieval-Augmented Generation
Abstract:
Event extraction (EE) is a fundamental task in natural language processing (NLP) that involves identifying and extracting event information from unstructured text. Effective EE in real-world scenarios requires two key steps: selecting appropriate schemas from hundreds of candidates and executing the extraction process. Existing research exhibits two critical gaps: (1) the rigid schema fixation in existing pipeline systems, and (2) the absence of benchmarks for evaluating joint schema matching and extraction. Although large language models (LLMs) offer potential solutions, their schema hallucination tendencies and context window limitations pose challenges for practical deployment. In response, we propose Adaptive Schema-aware Event Extraction (ASEE), a novel paradigm combining schema paraphrasing with schema retrieval-augmented generation. ASEE adeptly retrieves paraphrased schemas and accurately generates targeted structures. To facilitate rigorous evaluation, we construct the Multi-Dimensional Schema-aware Event Extraction (MD-SEE) benchmark, which systematically consolidates 12 datasets across diverse domains, complexity levels, and language settings. Extensive evaluations on MD-SEE show that our proposed ASEE demonstrates strong adaptability across various scenarios, significantly improving the accuracy of event extraction.
中文摘要:事件抽取作为自然语言处理中的基础任务,需从文本中识别事件信息,涉及模式选择与执行,但现有研究存在模式僵化和缺乏联合评估基准的问题;为此提出的自适应模式感知事件抽取(ASEE)方法结合模式释义与检索增强生成,在MD-SEE基准测试中展现出强大的跨场景适应能力,显著提升了抽取准确性。
English Summary: Event extraction is a crucial NLP task for identifying events from text, requiring schema selection and execution, yet faces gaps in schema flexibility and evaluation benchmarks, which the proposed Adaptive Schema-aware Event Extraction (ASEE) addresses through schema paraphrasing and retrieval-augmented generation, showing strong adaptability in evaluations on the MD-SEE benchmark.

Authors:Wenzhen Yue, Yong Liu, Haoxuan Li, Hao Wang, Xianghua Ying, Ruohao Guo, Bowei Xing, Ji Shi
Title: OLinear: A Linear Model for Time Series Forecasting in Orthogonally Transformed Domain
Abstract:
This paper presents $\mathbf{OLinear}$, a $\mathbf{linear}$-based multivariate time series forecasting model that operates in an $\mathbf{o}$rthogonally transformed domain. Recent forecasting models typically adopt the temporal forecast (TF) paradigm, which directly encodes and decodes time series in the time domain. However, the entangled step-wise dependencies in series data can hinder the performance of TF. To address this, some forecasters conduct encoding and decoding in the transformed domain using fixed, dataset-independent bases (e.g., sine and cosine signals in the Fourier transform). In contrast, we utilize $\mathbf{OrthoTrans}$, a data-adaptive transformation based on an orthogonal matrix that diagonalizes the series' temporal Pearson correlation matrix. This approach enables more effective encoding and decoding in the decorrelated feature domain and can serve as a plug-in module to enhance existing forecasters. To enhance the representation learning for multivariate time series, we introduce a customized linear layer, $\mathbf{NormLin}$, which employs a normalized weight matrix to capture multivariate dependencies. Empirically, the NormLin module shows a surprising performance advantage over multi-head self-attention, while requiring nearly half the FLOPs. Extensive experiments on 24 benchmarks and 140 forecasting tasks demonstrate that OLinear consistently achieves state-of-the-art performance with high efficiency. Notably, as a plug-in replacement for self-attention, the NormLin module consistently enhances Transformer-based forecasters. The code and datasets are available at https://anonymous.4open.science/r/OLinear
中文: 本文提出OLinear模型,通过自适应正交变换解耦时间依赖关系,并采用标准化线性层高效捕捉多变量关联,在保持高计算效率的同时实现了最先进的预测性能。
English: This paper introduces OLinear, a novel multivariate time series forecasting model that utilizes an adaptive orthogonal transformation to decorrelate temporal dependencies and a normalized linear layer to efficiently capture multivariate relationships, achieving state-of-the-art performance with high computational efficiency.
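Illustrative sketch (an assumption-laden reading of the abstract, not the released OLinear code): the abstract describes OrthoTrans as an orthogonal matrix that diagonalizes the series' temporal Pearson correlation matrix. One standard way to obtain such a matrix is the eigendecomposition of the symmetric correlation matrix. The function name `ortho_trans_basis`, the window layout (one row per window, one column per time step), and the toy random-walk data are all hypothetical choices made for this example.

```python
import numpy as np


def ortho_trans_basis(series_windows):
    """Return (Q, eigvals) where the columns of Q are orthonormal eigenvectors
    of the temporal Pearson correlation matrix of the windows.

    series_windows has shape (n_windows, T): each row is one length-T window.
    """
    # Treat each time step (column) as a variable: result is a T x T matrix.
    corr = np.corrcoef(series_windows, rowvar=False)
    # eigh is meant for symmetric matrices and returns orthonormal eigenvectors.
    eigvals, q = np.linalg.eigh(corr)
    return q, eigvals


rng = np.random.default_rng(0)
# Toy data: cumulative sums give windows whose time steps are strongly correlated.
windows = rng.standard_normal((200, 8)).cumsum(axis=1)

q, eigvals = ortho_trans_basis(windows)
transformed = windows @ q      # encode: project onto the data-adaptive basis
restored = transformed @ q.T   # decode: the inverse of an orthogonal matrix is its transpose
```

Because `q` is orthogonal, the transform is lossless, which is what would let it act as a plug-in step around an existing forecaster: the model predicts in the transformed domain and the result is mapped back with `q.T`.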

Authors:Wei Yang, Jingjing Fu, Rui Wang, Jinyu Wang, Lei Song, Jiang Bian
Title: OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval
Abstract:
Vision-language retrieval-augmented generation (RAG) has become an effective approach for tackling Knowledge-Based Visual Question Answering (KB-VQA), which requires external knowledge beyond the visual content presented in images. The effectiveness of Vision-language RAG systems hinges on multimodal retrieval, which is inherently challenging due to the diverse modalities and knowledge granularities in both queries and knowledge bases. Existing methods have not fully tapped into the potential interplay between these elements. We propose a multimodal RAG system featuring a coarse-to-fine, multi-step retrieval that harmonizes multiple granularities and modalities to enhance efficacy. Our system begins with a broad initial search aligning knowledge granularity for cross-modal retrieval, followed by a multimodal fusion reranking to capture the nuanced multimodal information for top entity selection. A text reranker then filters out the most relevant fine-grained section for augmented generation. Extensive experiments on the InfoSeek and Encyclopedic-VQA benchmarks show our method achieves state-of-the-art retrieval performance and highly competitive answering results, underscoring its effectiveness in advancing KB-VQA systems.
Chinese: 本文提出了一种多模态检索增强生成系统,采用由粗到精的多步检索策略,协调不同模态和知识粒度,在KB-VQA基准测试中实现了最先进的性能,显著提升了检索精度和答案质量。
English: This paper introduces a multimodal retrieval-augmented generation system that employs a coarse-to-fine, multi-step retrieval strategy to harmonize diverse modalities and knowledge granularities, achieving state-of-the-art performance on KB-VQA benchmarks by enhancing both retrieval accuracy and answer quality.

Authors:Oleg Sautenkov, Yasheerah Yaqoot, Muhammad Ahsan Mustafa, Faryal Batool, Jeffrin Sam, Artem Lykov, Chih-Yung Wen, Dzmitry Tsetserukou
Title: UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning
Abstract:
We present UAV-CodeAgents, a scalable multi-agent framework for autonomous UAV mission generation, built on large language and vision-language models (LLMs/VLMs). The system leverages the ReAct (Reason + Act) paradigm to interpret satellite imagery, ground high-level natural language instructions, and collaboratively generate UAV trajectories with minimal human supervision. A core component is a vision-grounded, pixel-pointing mechanism that enables precise localization of semantic targets on aerial maps. To support real-time adaptability, we introduce a reactive thinking loop, allowing agents to iteratively reflect on observations, revise mission goals, and coordinate dynamically in evolving environments. UAV-CodeAgents is evaluated on large-scale mission scenarios involving industrial and environmental fire detection. Our results show that a lower decoding temperature (0.5) yields higher planning reliability and reduced execution time, with an average mission creation time of 96.96 seconds and a success rate of 93%. We further fine-tune Qwen2.5VL-7B on 9,000 annotated satellite images, achieving strong spatial grounding across diverse visual categories. To foster reproducibility and future research, we will release the full codebase and a novel benchmark dataset for vision-language-based UAV planning.
中文摘要:UAV-CodeAgents是一个基于大语言/视觉语言模型的多智能体框架,通过卫星图像解析和协同轨迹规划自主生成无人机任务,在火灾检测场景中以96.96秒平均任务创建时间实现93%成功率,并具备实时环境适应能力。
English Summary: UAV-CodeAgents is a multi-agent framework using LLMs/VLMs to autonomously generate UAV missions through satellite imagery interpretation and collaborative trajectory planning, achieving 93% success rate in fire detection scenarios with minimal human supervision.

Authors:Mingxue Yan, Xuewen Zhang, Kaixiang Zhang, Zhaojian Li, Xunyuan Yin
Title: Economic data-enabled predictive control using machine learning
Abstract:
In this paper, we propose a convex data-based economic predictive control method within the framework of data-enabled predictive control (DeePC). Specifically, we use a neural network to transform the system output into a new state space, where the nonlinear economic cost function of the underlying nonlinear system is approximated using a quadratic function expressed by the transformed output in the new state space. Both the neural network parameters and the coefficients of the quadratic function are learned from open-loop data of the system. Additionally, we reconstruct constrained output variables from the transformed output through learning an output reconstruction matrix; this way, the proposed economic DeePC can handle output constraints explicitly. The performance of the proposed method is evaluated via a case study in a simulated chemical process.
中文: 本文提出了一种基于数据的凸经济预测控制方法,通过神经网络将系统输出转换到新状态空间,用二次函数近似非线性经济成本,并借助学习的重构矩阵处理输出约束。
English: This paper introduces a convex data-driven economic predictive control method that uses a neural network to transform system outputs into a new state space, approximating nonlinear economic costs with a quadratic function and handling output constraints through learned reconstruction.
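Illustrative sketch: the abstract does not spell out the data structures, but the standard DeePC formulation it builds on represents recorded open-loop trajectories through Hankel matrices, whose columns are overlapping windows of the data. The snippet below builds such a matrix for a scalar signal. The function name `hankel` and the toy input sequence are hypothetical, and the paper's neural-network output transformation and quadratic cost approximation are not modeled here.

```python
def hankel(signal, depth):
    """Build a Hankel matrix with `depth` rows from a scalar sequence.

    Column j holds the window signal[j : j + depth], so entry (i, j)
    equals signal[i + j].
    """
    n_cols = len(signal) - depth + 1
    if depth < 1 or n_cols < 1:
        raise ValueError("depth must be between 1 and len(signal)")
    return [[signal[i + j] for j in range(n_cols)] for i in range(depth)]


u = [0, 1, 2, 3, 4]   # toy recorded input sequence
H = hankel(u, 3)      # each column is a length-3 window of u
```

In a data-enabled controller, any admissible trajectory of the given length is then sought as a linear combination of these columns, which is what replaces an explicit system model.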

Authors:Yu Qiao, Huy Q. Le, Avi Deb Raha, Phuong-Nam Tran, Apurba Adhikary, Mengchun Zhang, Loc X. Nguyen, Eui-Nam Huh, Dusit Niyato, Choong Seon Hong
Title: Towards Artificial General or Personalized Intelligence? A Survey on Foundation Models for Personalized Federated Intelligence
Abstract:
The rise of large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, has reshaped the artificial intelligence landscape. As prominent examples of foundational models (FMs) built on LLMs, these models exhibit remarkable capabilities in generating human-like content, bringing us closer to achieving artificial general intelligence (AGI). However, their large-scale nature, sensitivity to privacy concerns, and substantial computational demands present significant challenges to personalized customization for end users. To bridge this gap, this paper presents the vision of artificial personalized intelligence (API), focusing on adapting these powerful models to meet the specific needs and preferences of users while maintaining privacy and efficiency. Specifically, this paper proposes personalized federated intelligence (PFI), which integrates the privacy-preserving advantages of federated learning (FL) with the zero-shot generalization capabilities of FMs, enabling personalized, efficient, and privacy-protective deployment at the edge. We first review recent advances in both FL and FMs, and discuss the potential of leveraging FMs to enhance federated systems. We then present the key motivations behind realizing PFI and explore promising opportunities in this space, including efficient PFI, trustworthy PFI, and PFI empowered by retrieval-augmented generation (RAG). Finally, we outline key challenges and future research directions for deploying FM-powered FL systems at the edge with improved personalization, computational efficiency, and privacy guarantees. Overall, this survey aims to lay the groundwork for the development of API as a complement to AGI, with a particular focus on PFI as a key enabling technique.
中文: 本文提出人工个性化智能(API)愿景,并引入个性化联邦智能(PFI),通过融合联邦学习与基础模型的优势,在边缘端实现兼顾隐私保护与效率的个性化部署。
English: This paper introduces the vision of artificial personalized intelligence (API) and proposes personalized federated intelligence (PFI) to address the challenges of large language models by integrating federated learning with foundational models for privacy-preserving, efficient personalization at the edge.

Authors:Xijie Yang, Linning Xu, Lihan Jiang, Dahua Lin, Bo Dai
Title: Virtualized 3D Gaussians: Flexible Cluster-based Level-of-Detail System for Real-Time Rendering of Composed Scenes
Abstract:
3D Gaussian Splatting (3DGS) enables the reconstruction of intricate digital 3D assets from multi-view images by leveraging a set of 3D Gaussian primitives for rendering. Its explicit and discrete representation facilitates the seamless composition of complex digital worlds, offering significant advantages over previous neural implicit methods. However, when applied to large-scale compositions, such as crowd-level scenes, it can encompass numerous 3D Gaussians, posing substantial challenges for real-time rendering. To address this, inspired by Unreal Engine 5's Nanite system, we propose Virtualized 3D Gaussians (V3DG), a cluster-based LOD solution that constructs hierarchical 3D Gaussian clusters and dynamically selects only the necessary ones to accelerate rendering speed. Our approach consists of two stages: (1) Offline Build, where hierarchical clusters are generated using a local splatting method to minimize visual differences across granularities, and (2) Online Selection, where footprint evaluation determines perceptible clusters for efficient rasterization during rendering. We curate a dataset of synthetic and real-world scenes, including objects, trees, people, and buildings, each requiring 0.1 billion 3D Gaussians to capture fine details. Experiments show that our solution balances rendering efficiency and visual quality across user-defined tolerances, facilitating downstream interactive applications that compose extensive 3DGS assets for consistent rendering performance.
中文: 3D高斯泼溅(3DGS)能够从多视角图像重建精细的3D模型,但在大规模场景中面临实时渲染难题;我们提出的虚拟化3D高斯(V3DG)通过基于聚类的细节层次系统,动态选择必要集群以加速渲染,同时保持视觉质量。
English: 3D Gaussian Splatting (3DGS) enables detailed 3D reconstruction from multi-view images but faces real-time rendering challenges with large-scale scenes, which our proposed Virtualized 3D Gaussians (V3DG) addresses through a cluster-based LOD system that dynamically selects necessary clusters to accelerate rendering while maintaining visual quality.
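Illustrative sketch (a generic level-of-detail traversal, not the V3DG implementation): the Online Selection stage is described as using footprint evaluation to pick perceptible clusters from a hierarchy. A common way to express that idea is to stop descending the hierarchy once a cluster's projected size falls below a user tolerance. The `Cluster` fields, the pinhole footprint formula, and the parameter names `focal_px` and `tolerance_px` are assumptions made for this example; per-cluster distances are stored directly to keep the toy self-contained and are assumed positive.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    radius: float      # world-space bounding radius (hypothetical field)
    distance: float    # distance from the camera, assumed > 0 (hypothetical field)
    children: list = field(default_factory=list)


def footprint_px(cluster, focal_px):
    """Approximate projected radius in pixels under a pinhole camera."""
    return focal_px * cluster.radius / cluster.distance


def select_clusters(root, focal_px, tolerance_px):
    """Return names of the coarsest clusters whose footprint is within tolerance.

    Leaves are always selected, since there is nothing finer to descend into.
    """
    selected, stack = [], [root]
    while stack:
        node = stack.pop()
        if not node.children or footprint_px(node, focal_px) <= tolerance_px:
            selected.append(node.name)
        else:
            # Too large on screen: refine by visiting the children in order.
            stack.extend(reversed(node.children))
    return selected


leaf_a = Cluster("a", radius=5.0, distance=100.0)
leaf_b = Cluster("b", radius=5.0, distance=100.0)
root = Cluster("root", radius=10.0, distance=100.0, children=[leaf_a, leaf_b])

coarse = select_clusters(root, focal_px=1000.0, tolerance_px=200.0)
fine = select_clusters(root, focal_px=1000.0, tolerance_px=50.0)
```

With a loose tolerance the single coarse cluster is enough, while a tight tolerance forces the traversal down to the finer clusters, mirroring the efficiency versus quality trade-off across user-defined tolerances mentioned in the abstract.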

Authors:Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, Dawn Song
Title: AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents
Abstract:
The strong planning and reasoning capabilities of Large Language Models (LLMs) have fostered the development of agent-based systems capable of leveraging external tools and interacting with increasingly complex environments. However, these powerful features also introduce a critical security risk: indirect prompt injection, a sophisticated attack vector that compromises the core of these agents, the LLM, by manipulating contextual information rather than direct user prompts. In this work, we propose a generic black-box fuzzing framework, AgentVigil, designed to automatically discover and exploit indirect prompt injection vulnerabilities across diverse LLM agents. Our approach starts by constructing a high-quality initial seed corpus, then employs a seed selection algorithm based on Monte Carlo Tree Search (MCTS) to iteratively refine inputs, thereby maximizing the likelihood of uncovering agent weaknesses. We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o, respectively, nearly doubling the performance of baseline attacks. Moreover, AgentVigil exhibits strong transferability across unseen tasks and internal LLMs, as well as promising results against defenses. Beyond benchmark evaluations, we apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
中文: 大语言模型的强大功能带来了间接提示注入等安全风险,为此提出的AgentVigil黑盒模糊测试框架,能在基准测试和实际环境中高效发现LLM代理的漏洞。
English: Large Language Models' advanced capabilities introduce security risks like indirect prompt injection, leading to the development of AgentVigil, a black-box fuzzing framework that effectively uncovers vulnerabilities in LLM agents across benchmarks and real-world scenarios.
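
The seed selection step described in the abstract can be illustrated with a simplified sketch. It uses a flat UCB1 score, the selection rule commonly used inside MCTS, rather than a full tree search, and it is not the authors' implementation. The function name, the `(visits, total_reward)` bookkeeping, the exploration constant, and the sample pool are all hypothetical.

```python
import math


def select_seed(stats, exploration=1.4):
    """Pick the next seed to mutate using a UCB1 score.

    stats maps a seed id to (visits, total_reward); both are assumed
    bookkeeping values maintained by the surrounding fuzzing loop.
    """
    if not stats:
        raise ValueError("seed pool is empty")
    total_visits = sum(visits for visits, _ in stats.values())
    best_seed, best_score = None, -math.inf
    for seed, (visits, total_reward) in stats.items():
        if visits == 0:
            # Unvisited seeds are tried first so every seed gets one estimate.
            return seed
        mean_reward = total_reward / visits
        bonus = exploration * math.sqrt(math.log(total_visits) / visits)
        score = mean_reward + bonus
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed


# Hypothetical pool: seed id -> (times selected, summed attack-success reward).
pool = {"seed_a": (10, 7.0), "seed_b": (2, 1.0), "seed_c": (0, 0.0)}
chosen = select_seed(pool)
```

With `exploration=0.0` the rule reduces to picking the seed with the best average reward; larger values favor rarely tried seeds.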

Authors:Genghua Kou, Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Ziheng Zhang, Osamu Yoshie, Tiancai Wang, Ying Li, Xiangyu Zhang
Title: PADriver: Towards Personalized Autonomous Driving
Abstract:
In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon a Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoregressively performs scene understanding, danger level estimation and action decision. The predicted danger level reflects the risk of the potential action and provides an explicit reference for the final action, which corresponds to the preset personalized prompt. Moreover, we construct a closed-loop benchmark named PAD-Highway based on the Highway-Env simulator to comprehensively evaluate the decision performance under traffic rules. The dataset contains 250 hours of videos with high-quality annotation to facilitate the development of PAD behavior analysis. Experimental results on the constructed benchmark show that PADriver outperforms state-of-the-art approaches on different evaluation metrics, and enables various driving modes.
中文: 本文提出PADriver,一种基于多模态大语言模型的个性化自动驾驶闭环框架,通过处理视频流和个性化提示进行场景理解、危险评估与决策,并在定制基准测试中展现出优于现有方法的性能。
English: This paper introduces PADriver, a closed-loop framework using Multi-modal Large Language Models for personalized autonomous driving, which processes streaming frames and personalized prompts to perform scene understanding, danger estimation, and action decisions, and demonstrates superior performance on a custom benchmark compared to existing methods.

Authors:Wenhao Li, Bo Jin, Mingyi Hong, Changhong Lu, Xiangfeng Wang
Title: Optimization Problem Solving Can Transition to Evolutionary Agentic Workflows
Abstract:
This position paper argues that optimization problem solving can transition from expert-dependent to evolutionary agentic workflows. Traditional optimization practices rely on human specialists for problem formulation, algorithm selection, and hyperparameter tuning, creating bottlenecks that impede industrial adoption of cutting-edge methods. We contend that an evolutionary agentic workflow, powered by foundation models and evolutionary search, can autonomously navigate the optimization space, comprising problem, formulation, algorithm, and hyperparameter spaces. Through case studies in cloud resource scheduling and ADMM parameter adaptation, we demonstrate how this approach can bridge the gap between academic innovation and industrial implementation. Our position challenges the status quo of human-centric optimization workflows and advocates for a more scalable, adaptive approach to solving real-world optimization problems.
中文: 本立场文件主张将优化问题求解从依赖专家的模式转变为进化智能工作流,通过基础模型和进化搜索自主探索优化空间,并在云资源调度和ADMM参数适配案例中验证了该方法的可行性。
English: This position paper advocates for shifting optimization problem solving from human-dependent methods to evolutionary agentic workflows using foundation models and evolutionary search to autonomously navigate optimization spaces, as demonstrated in cloud scheduling and ADMM adaptation case studies.

Authors:Pietro Bonazzi, Christian Vogt, Michael Jost, Haotong Qin, Lyes Khacef, Federico Paredes-Valles, Michele Magno
Title: RGB-Event Fusion with Self-Attention for Collision Prediction
Abstract:
Ensuring robust and real-time obstacle avoidance is critical for the safe operation of autonomous robots in dynamic, real-world environments. This paper proposes a neural network framework for predicting the time and collision position of an unmanned aerial vehicle with a dynamic object, using RGB and event-based vision sensors. The proposed architecture consists of two separate encoder branches, one for each modality, followed by fusion by self-attention to improve prediction accuracy. To facilitate benchmarking, we leverage the ABCD [8] dataset, which enables detailed comparisons of single-modality and fusion-based approaches. At the same prediction throughput of 50Hz, the experimental results show that the fusion-based model offers an improvement in prediction accuracy over single-modality approaches of 1% on average and 10% for distances beyond 0.5m, but comes at the cost of +71% in memory and +105% in FLOPs. Notably, the event-based model outperforms the RGB model by 4% for position and 26% for time error at a similar computational cost, making it a competitive alternative. Additionally, we evaluate quantized versions of the event-based models, applying 1- to 8-bit quantization to assess the trade-offs between predictive performance and computational efficiency. These findings highlight the trade-offs of multi-modal perception using RGB and event-based cameras in robotic applications.
中文: 本文提出一种融合RGB与事件视觉数据的神经网络框架,通过自注意力机制提升无人机动态避障预测精度,实验表明多模态融合在牺牲计算效率下提高准确性,而纯事件模型在相近成本下性能更具竞争力。
English: This paper introduces a neural network framework that fuses RGB and event-based vision data through self-attention to enhance real-time obstacle collision prediction for drones, showing improved accuracy with fusion but increased computational costs, while event-based models offer a competitive balance of performance and efficiency.

Authors:Luis Moreno, Miguel Altamirano Cabrera, Muhammad Haris Khan, Issatay Tokmurziyev, Yara Mahmoud, Valerii Serpiva, Dzmitry Tsetserukou
Title: FlyHaptics: Flying Multi-contact Haptic Interface
Abstract:
This work presents FlyHaptics, an aerial haptic interface tracked via a Vicon optical motion capture system and built around six five-bar linkage assemblies enclosed in a lightweight protective cage. We predefined five static tactile patterns, each characterized by distinct combinations of linkage contact points and vibration intensities, and evaluated them in a grounded pilot study, where participants achieved 86.5% recognition accuracy (F(4, 35) = 1.47, p = 0.23) with no significant differences between patterns. Complementary flight demonstrations confirmed stable hover performance and consistent force output under realistic operating conditions. These pilot results validate the feasibility of drone-mounted, multi-contact haptic feedback and lay the groundwork for future integration into fully immersive VR, teleoperation, and remote interaction scenarios.
中文: FlyHaptics是一种基于无人机的触觉接口,通过六组五连杆机构传递触觉模式,在测试中获得86.5%的识别准确率,其稳定表现为未来虚拟现实和遥操作应用奠定了基础。
English: FlyHaptics is a drone-based haptic interface using six five-bar linkages to deliver tactile patterns, achieving 86.5% recognition accuracy in tests and demonstrating stable performance for future VR and teleoperation applications.

Authors:Muhammad Haris Khan, Miguel Altamirano Cabrera, Dmitrii Iarchuk, Yara Mahmoud, Daria Trinitatova, Issatay Tokmurziyev, Dzmitry Tsetserukou
Title: HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction
Abstract:
This paper introduces HapticVLM, a novel multimodal system that integrates vision-language reasoning with deep convolutional networks to enable real-time haptic feedback. HapticVLM leverages a ConvNeXt-based material recognition module to generate robust visual embeddings for accurate identification of object materials, while a state-of-the-art Vision-Language Model (Qwen2-VL-2B-Instruct) infers ambient temperature from environmental cues. The system synthesizes tactile sensations by delivering vibrotactile feedback through speakers and thermal cues via a Peltier module, thereby bridging the gap between visual perception and tactile experience. Experimental evaluations demonstrate an average recognition accuracy of 84.67% across five distinct auditory-tactile patterns and a temperature estimation accuracy of 86.7% based on a tolerance-based evaluation method with an 8°C margin of error across 15 scenarios. Although promising, the current study is limited by the use of a small set of prominent patterns and a modest participant pool. Future work will focus on expanding the range of tactile patterns and increasing user studies to further refine and validate the system's performance. Overall, HapticVLM presents a significant step toward context-aware, multimodal haptic interaction with potential applications in virtual reality and assistive technologies.
中文: HapticVLM是一种结合视觉语言推理与深度学习的新型多模态系统,能通过实时触觉反馈准确识别物体材料和推断环境温度,在虚拟现实和辅助技术领域具有应用潜力。
English: HapticVLM is a multimodal system that combines vision-language reasoning and deep learning to provide real-time haptic feedback, achieving high accuracy in material recognition and temperature estimation for applications in virtual reality and assistive technologies.
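
The tolerance-based evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a prediction counts as correct when it lies within the stated 8°C margin of the reference temperature; the function name and the sample temperatures are hypothetical and are not taken from the paper.

```python
def tolerance_accuracy(predicted, reference, margin=8.0):
    """Fraction of predictions within +/- margin of the reference value."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one scenario is required")
    hits = sum(1 for p, r in zip(predicted, reference) if abs(p - r) <= margin)
    return hits / len(predicted)


# Hypothetical temperatures in degrees Celsius (illustrative values only).
predicted = [22.0, 5.0, 30.0, -3.0]
reference = [20.0, 15.0, 36.0, -4.0]
accuracy = tolerance_accuracy(predicted, reference)
```

Under this reading, 13 in-margin predictions out of 15 scenarios would give roughly 86.7%, although the abstract does not state the underlying counts.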

Authors:Yunfeng Ge, Jiawei Li, Yiji Zhao, Haomin Wen, Zhao Li, Meikang Qiu, Hongyan Li, Ming Jin, Shirui Pan
Title: T2S: High-resolution Time Series Generation with Text-to-Series Diffusion Models
Abstract:
Text-to-Time Series generation holds significant potential to address challenges such as data sparsity, imbalance, and limited availability of multimodal time series datasets across domains. While diffusion models have achieved remarkable success in Text-to-X (e.g., vision and audio data) generation, their use in time series generation remains in its nascent stages. Existing approaches face two critical limitations: (1) the lack of systematic exploration of general-purpose time series captions, as existing captions are often domain-specific and struggle with generalization; and (2) the inability to generate time series of arbitrary lengths, limiting their applicability to real-world scenarios. In this work, we first categorize time series captions into three levels: point-level, fragment-level, and instance-level. Additionally, we introduce a new fragment-level dataset containing over 600,000 high-resolution time series-text pairs. Second, we propose Text-to-Series (T2S), a diffusion-based framework that bridges the gap between natural language and time series in a domain-agnostic manner. T2S employs a length-adaptive variational autoencoder to encode time series of varying lengths into consistent latent embeddings. On top of that, T2S effectively aligns textual representations with latent embeddings by utilizing Flow Matching and employing Diffusion Transformer as the denoiser. We train T2S in an interleaved paradigm across multiple lengths, allowing it to generate sequences of any desired length. Extensive evaluations demonstrate that T2S achieves state-of-the-art performance across 13 datasets spanning 12 domains.
中文: 文本到时间序列生成通过引入领域无关的扩散框架T2S,采用长度自适应编码技术,有效解决了数据稀缺和泛化问题,并在多领域数据集上实现了最优性能。
English: Text-to-Time Series generation addresses data scarcity and generalization issues by introducing a domain-agnostic diffusion framework, T2S, which utilizes length-adaptive encoding and achieves state-of-the-art performance across diverse datasets.

Authors:Aidan Curtis, Hao Tang, Thiago Veloso, Kevin Ellis, Joshua Tenenbaum, Tomás Lozano-Pérez, Leslie Pack Kaelbling
Title: LLM-Guided Probabilistic Program Induction for POMDP Model Estimation
Abstract:
Partially Observable Markov Decision Processes (POMDPs) model decision making under uncertainty. While there are many approaches to approximately solving POMDPs, we aim to address the problem of learning such models. In particular, we are interested in a subclass of POMDPs wherein the components of the model, including the observation function, reward function, transition function, and initial state distribution function, can be modeled as low-complexity probabilistic graphical models in the form of a short probabilistic program. Our strategy to learn these programs uses an LLM as a prior, generating candidate probabilistic programs that are then tested against the empirical distribution and adjusted through feedback. We experiment on a number of classical toy POMDP problems, simulated MiniGrid domains, and two real mobile-base robotics search domains involving partial observability. Our results show that using an LLM to guide in the construction of a low-complexity POMDP model can be more effective than tabular POMDP learning, behavior cloning, or direct LLM planning.
Chinese: 本研究提出了一种利用大型语言模型作为先验知识来生成和优化概率程序,从而学习低复杂度部分可观测马尔可夫决策过程模型的方法,在多个仿真和实际应用中展现出优于传统方法的性能。
English: This research introduces a method for learning low-complexity POMDP models by using an LLM as a prior to generate and refine probabilistic programs, demonstrating superior effectiveness over traditional approaches in various simulated and real-world domains.

Authors:Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang
Title: TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
Abstract:
Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps. Experiments on temporal grounding and highlight detection benchmarks demonstrate that TEMPURA outperforms strong baseline models, confirming that integrating causal reasoning with fine-grained temporal segmentation leads to improved video understanding.
中文:TEMPURA是一种创新的两阶段训练框架,通过结合因果事件推理与细粒度时间分割来提升视频理解能力,在时间定位和高光检测任务上超越了现有模型。
English: TEMPURA is a novel two-stage training framework that enhances video understanding by integrating causal event reasoning with fine-grained temporal segmentation, outperforming existing models on temporal grounding and highlight detection tasks.

Authors:Yang Jin, Jun Lv, Wenye Yu, Hongjie Fang, Yong-Lu Li, Cewu Lu
Title: SIME: Enhancing Policy Self-Improvement with Modal-level Exploration
Abstract:
Self-improvement requires robotic systems to initially learn from human-provided data and then gradually enhance their capabilities through interaction with the environment. This is similar to how humans improve their skills through continuous practice. However, achieving effective self-improvement is challenging, primarily because robots tend to repeat their existing abilities during interactions, often failing to generate new, valuable data for learning. In this paper, we identify the key to successful self-improvement: modal-level exploration and data selection. By incorporating a modal-level exploration mechanism during policy execution, the robot can produce more diverse and multi-modal interactions. At the same time, we select the most valuable trials and high-quality segments from these interactions for learning. We successfully demonstrate effective robot self-improvement on both simulation benchmarks and real-world experiments. The capability for self-improvement will enable us to develop more robust and high-success-rate robotic control strategies at a lower cost. Our code and experiment scripts are available at https://ericjin2002.github.io/SIME/
中文: 机器人通过模态级探索和数据选择实现有效自我提升,产生多样化互动并学习高质量试验,从而开发出更稳健、高成功率的低成本控制策略。
English: Effective robot self-improvement is achieved through modal-level exploration and data selection, enabling diverse interactions and learning from high-quality trials to develop robust, cost-efficient control strategies.

Authors:Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zijia Chen, Zhilin Wang, David Mosallanezhad, Adi Renduchintala, Haifeng Qian, Dima Rekesh, Fei Jia, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad, Sean Narenthiran, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Igor Gitman, Ivan Moshkov, Wei Du, Shubham Toshniwal, George Armstrong, Branislav Kisacanin, Matvei Novikov, Daria Gitman, Evelina Bakhturina, Prasoon Varshney, Makesh Narsimhan, Jane Polak Scowcroft, John Kamalu, Dan Su, Kezhi Kong, Markus Kliegl, Rabeeh Karimi Mahabadi, Ying Lin, Sanjeev Satheesh, Jupinder Parmar, Pritam Gundecha, Brandon Norick, Joseph Jennings, Shrimai Prabhumoye, Syeda Nahida Akter, Mostofa Patwary, Abhinav Khattar, Deepak Narayanan, Roger Waleffe, Jimmy Zhang, Bor-Yiing Su, Guyue Huang, Terry Kong, Parth Chadha, Sahil Jain, Christine Harvey, Elad Segal, Jining Huang, Sergey Kashirsky, Robert McQueen, Izzy Putterman, George Lam, Arun Venkatesan, Sherry Wu, Vinh Nguyen, Manoj Kilaru, Andrew Wang, Anna Warno, Abhilash Somasamudramath, Sandip Bhaskar, Maka Dong, Nave Assaf, Shahar Mor, Omer Ullman Argov, Scot Junkin, Oleksandr Romanenko, Pedro Larroy, Monika Katariya, Marco Rovinelli, Viji Balas, Nicholas Edelman, Anahita Bhiwandiwalla, Muthu Subramaniam, Smita Ithape, Karthik Ramamoorthy, Yuting Wu, Suguna Varshini Velury, Omri Almog, Joyjit Daw, Denys Fridman, Erick Galinkin, Michael Evans, Shaona Ghosh, Katherine Luna, Leon Derczynski, Nikki Pope, Eileen Long, Seth Schneider, Guillermo Siman, Tomasz Grzegorzek, Pablo Ribalta, Monika Katariya, Chris Alexiuk, Joey Conway, Trisha Saar, Ann Guan, Krzysztof Pawelec, Shyamala Prayaga, Oleksii Kuchaiev, Boris Ginsburg, Oluwatobi Olabiyi, Kari Briski, Jonathan Cohen, Bryan Catanzaro, Jonah Alben, Yonatan Geifman, Eric Chung
Title: Llama-Nemotron: Efficient Reasoning Models
Abstract:
We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes -- Nano (8B), Super (49B), and Ultra (253B) -- and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models -- LN-Nano, LN-Super, and LN-Ultra -- under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.
中文:Llama-Nemotron系列推出了异构推理模型,提供三种规模,具备卓越的推理性能和高效的推理能力,支持动态推理切换,并以开放许可发布,同时提供了完整的训练资源。
English: The Llama-Nemotron series introduces heterogeneous reasoning models with three scalable sizes that deliver competitive reasoning performance and superior inference efficiency, featuring a unique dynamic reasoning toggle and being released under an open license along with training resources.

Authors:Tao Li, Ya-Ting Yang, Yunian Pan, Quanyan Zhu
Title: From Texts to Shields: Convergence of Large Language Models and Cybersecurity
Abstract:
This report explores the convergence of large language models (LLMs) and cybersecurity, synthesizing interdisciplinary insights from network security, artificial intelligence, formal methods, and human-centered design. It examines emerging applications of LLMs in software and network security, 5G vulnerability analysis, and generative security engineering. The report highlights the role of agentic LLMs in automating complex tasks, improving operational efficiency, and enabling reasoning-driven security analytics. Socio-technical challenges associated with the deployment of LLMs -- including trust, transparency, and ethical considerations -- can be addressed through strategies such as human-in-the-loop systems, role-specific training, and proactive robustness testing. The report further outlines critical research challenges in ensuring interpretability, safety, and fairness in LLM-based systems, particularly in high-stakes domains. By integrating technical advances with organizational and societal considerations, this report presents a forward-looking research agenda for the secure and effective adoption of LLMs in cybersecurity.
中文: 本报告探讨大语言模型与网络安全的融合,重点分析其在自动化与安全分析中的应用,通过人本策略应对社会技术挑战,并为安全有效应用提出前瞻性研究议程。
English: This report examines the integration of large language models into cybersecurity, highlighting their applications in automation and security analytics while addressing socio-technical challenges through human-centered strategies and outlining a research agenda for safe adoption.

Authors:Maozhe Zhao, Shengzhong Liu, Fan Wu, Guihai Chen
Title: Responsive DNN Adaptation for Video Analytics against Environment Shift via Hierarchical Mobile-Cloud Collaborations
Abstract:
Mobile video analysis systems often encounter varied deployment environments, where environment shifts demand more responsive adaptation of the deployed "expert DNN models". Existing model adaptation frameworks primarily operate in a cloud-centric way, exhibiting degraded performance during adaptation and delayed reactions to environment shifts. Instead, this paper proposes MOCHA, a novel framework optimizing the responsiveness of continuous model adaptation through hierarchical collaborations between mobile and cloud resources. Specifically, MOCHA (1) reduces adaptation response delays by performing on-device model reuse and fast fine-tuning before requesting cloud model retrieval and end-to-end retraining; (2) accelerates the retrieval of historical expert models by organizing them into a structured taxonomy, using domain semantics analyzed by a cloud foundation model as indices; (3) enables efficient local model reuse by maintaining onboard expert model caches for frequent scenes, which proactively prefetch model weights from the cloud model database. Extensive evaluations with real-world videos on three DNN tasks show that MOCHA improves model accuracy during adaptation by up to 6.8% while reducing the response delay and retraining time by up to 35.5x and 3.0x, respectively.
中文: 本文提出MOCHA框架,通过移动设备与云端的层级协作优化视频分析模型自适应响应能力,利用设备端快速微调、结构化模型检索和主动缓存机制,显著提升模型精度并大幅降低响应延迟与重训练时间。
English: This paper introduces MOCHA, a framework that enhances mobile video analysis by enabling hierarchical collaboration between mobile devices and the cloud to reduce adaptation delays and improve model accuracy through on-device fine-tuning, structured model retrieval, and proactive caching.

Authors:Nguyen Hoang Khoi Tran, Julie Stephany Berrio, Mao Shan, Zhenxing Ming, Stewart Worrall
Title: InterLoc: LiDAR-based Intersection Localization using Road Segmentation with Automated Evaluation Method
Abstract:
Online localization of road intersections is beneficial for autonomous vehicle localization, mapping and motion planning. Intersections offer strong landmarks for correcting vehicle pose estimation, anchoring new sensor data in up-to-date maps, and guiding vehicle routing in road network graphs. Despite this importance, intersection localization has not been widely studied, with existing methods either ignoring the rich semantic information already computed onboard or relying on scarce, hand-labeled intersection datasets. To close this gap, we present a novel LiDAR-based method for online vehicle-centric intersection localization. We detect the intersection candidates in a bird's eye view (BEV) representation formed by concatenating a sequence of semantic road scans. We then refine these candidates by analyzing the intersecting road branches and adjusting the intersection center point in a least-squares formulation. For evaluation, we introduce an automated pipeline that pairs localized intersection points with OpenStreetMap (OSM) intersection nodes using precise GNSS/INS ground-truth poses. Experiments on the SemanticKITTI dataset show that our method outperforms the latest learning-based baseline in accuracy and reliability. Sensitivity tests demonstrate the method's robustness to challenging segmentation errors, highlighting its applicability in the real world.
中文: 本文提出了一种新颖的基于激光雷达的在线车辆中心化交叉口定位方法,通过利用语义道路扫描和最小二乘优化,在精度和可靠性上超越了现有基于学习的方法,并展现出对分割误差的鲁棒性。
English: This paper introduces a novel LiDAR-based method for online vehicle-centric intersection localization, which outperforms existing learning-based approaches in accuracy and reliability by leveraging semantic road scans and least-squares refinement, while demonstrating robustness to segmentation errors.
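
The least-squares refinement of the intersection center can be illustrated with a generic sketch: given several road branch centerlines, find the point that minimizes the sum of squared perpendicular distances to them. This is a standard formulation chosen for illustration, not necessarily the authors' exact objective; the function name, the branch representation (a point plus a heading angle), and the sample values are hypothetical.

```python
import math


def refine_center(branches):
    """Least-squares point closest to a set of 2D lines.

    branches is a list of ((px, py), heading_rad) pairs, each describing
    a branch centerline through (px, py) with the given heading.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), heading in branches:
        dx, dy = math.cos(heading), math.sin(heading)
        # Projector onto the line's normal direction: I - d d^T.
        m11, m12, m22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        a11 += m11
        a12 += m12
        a22 += m22
        b1 += m11 * px + m12 * py
        b2 += m12 * px + m22 * py
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-9:
        raise ValueError("branches are (nearly) parallel; center is not unique")
    # Solve the 2x2 normal equations with Cramer's rule.
    x = (a22 * b1 - a12 * b2) / det
    y = (a11 * b2 - a12 * b1) / det
    return x, y


# Hypothetical branches: one along y = 2 and one along x = 3.
center = refine_center([((0.0, 2.0), 0.0), ((3.0, 0.0), math.pi / 2)])
```

With noisy branch estimates the lines no longer meet in a single point, and the same normal equations return the compromise point that is closest to all of them in the squared-distance sense.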

Authors:Antoni Bigata, Rodrigo Mira, Stella Bounareli, Michał Stypułkowski, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Title: KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution
Abstract:
Lip synchronization, known as the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, as well as suffering from the usual issues in talking head generation (e.g., temporal consistency), lip synchronization presents significant new challenges such as expression leakage from the input video and facial occlusions, which can severely impact real-world applications like automated dubbing, but are often neglected in existing works. To address these shortcomings, we present KeySync, a two-stage framework that succeeds in solving the issue of temporal consistency, while also incorporating solutions for leakage and occlusions using a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Code and model weights can be found at https://antonibigata.github.io/KeySync.
Chinese: KeySync提出了一种两阶段框架,有效解决了唇形同步中的时间一致性、表情泄漏和面部遮挡问题,在唇形重建和跨同步方面取得了最先进的成果。
English: KeySync introduces a two-stage framework that effectively addresses temporal consistency, expression leakage, and facial occlusions in lip synchronization, achieving state-of-the-art results in lip reconstruction and cross-synchronization.

Authors:Juraj Vladika, Annika Domres, Mai Nguyen, Rebecca Moser, Jana Nano, Felix Busch, Lisa C. Adams, Keno K. Bressem, Denise Bernhardt, Stephanie E. Combs, Kai J. Borm, Florian Matthes, Jan C. Peeken
Title: Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs
Abstract:
Large language models (LLMs) exhibit extensive medical knowledge but are prone to hallucinations and inaccurate citations, which pose a challenge to their clinical adoption and regulatory compliance. Current methods, such as Retrieval Augmented Generation, partially address these issues by grounding answers in source documents, but hallucinations and low fact-level explainability persist. In this work, we introduce a novel atomic fact-checking framework designed to enhance the reliability and explainability of LLMs used in medical long-form question answering. This method decomposes LLM-generated responses into discrete, verifiable units called atomic facts, each of which is independently verified against an authoritative knowledge base of medical guidelines. This approach enables targeted correction of errors and direct tracing to source literature, thereby improving the factual accuracy and explainability of medical Q&A. Extensive evaluation using multi-reader assessments by medical experts and an automated open Q&A benchmark demonstrated significant improvements in factual accuracy and explainability. Our framework achieved up to a 40% overall answer improvement and a 50% hallucination detection rate. The ability to trace each atomic fact back to the most relevant chunks from the database provides a granular, transparent explanation of the generated responses, addressing a major gap in current medical AI applications. This work represents a crucial step towards more trustworthy and reliable clinical applications of LLMs, addressing key prerequisites for clinical application and fostering greater confidence in AI-assisted healthcare.
中文: 本研究提出了一种原子事实核查框架,通过将大语言模型生成的医学回答分解为可验证单元,实现了针对性修正和溯源追踪,从而显著提升了事实准确性和可解释性。
English: This study introduces an atomic fact-checking framework that decomposes LLM-generated medical answers into verifiable units, significantly improving factual accuracy and explainability by enabling targeted corrections and source tracing.

Authors:Yueqi Zhang, Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Title: Mind the Quote: Enabling Quotation-Aware Dialogue in LLMs via Plug-and-Play Modules
Abstract:
Human-AI conversation frequently relies on quoting earlier text ("check it with the formula I just highlighted"), yet today's large language models (LLMs) lack an explicit mechanism for locating and exploiting such spans. We formalise the challenge as span-conditioned generation, decomposing each turn into the dialogue history, a set of token-offset quotation spans, and an intent utterance. Building on this abstraction, we introduce a quotation-centric data pipeline that automatically synthesises task-specific dialogues, verifies answer correctness through multi-stage consistency checks, and yields both a heterogeneous training corpus and the first benchmark covering five representative scenarios. To meet the benchmark's zero-overhead and parameter-efficiency requirements, we propose QuAda, a lightweight training-based method that attaches two bottleneck projections to every attention head, dynamically amplifying or suppressing attention to quoted spans at inference time while leaving the prompt unchanged and updating < 2.8% of backbone weights. Experiments across models show that QuAda is suitable for all scenarios and generalises to unseen topics, offering an effective, plug-and-play solution for quotation-aware dialogue.
中文摘要:该研究针对大型语言模型无法处理对话中引用片段的问题,提出了基于片段条件生成的框架,并开发了参数高效的QuAda方法,能在不修改提示的情况下动态调节对引用片段的注意力。
English Summary: The study introduces a framework for span-conditioned generation to address LLMs' inability to handle quoted references in dialogue, proposing QuAda—a parameter-efficient method that dynamically modulates attention to quoted spans without altering prompts.

Authors:Kazuki Egashira, Robin Staab, Mark Vero, Jingxuan He, Martin Vechev
Title: Mind the Gap: A Practical Attack on GGUF Quantization
Abstract:
With the increasing size of frontier LLMs, post-training quantization has become the standard for memory-efficient deployment. Recent work has shown that basic rounding-based quantization schemes pose security risks, as they can be exploited to inject malicious behaviors into quantized models that remain hidden in full precision. However, existing attacks cannot be applied to more complex quantization methods, such as the GGUF family used in the popular ollama and llama.cpp frameworks. In this work, we address this gap by introducing the first attack on GGUF. Our key insight is that the quantization error -- the difference between the full-precision weights and their (de-)quantized version -- provides sufficient flexibility to construct malicious quantized models that appear benign in full precision. Leveraging this, we develop an attack that trains the target malicious LLM while constraining its weights based on quantization errors. We demonstrate the effectiveness of our attack on three popular LLMs across nine GGUF quantization data types on three diverse attack scenarios: insecure code generation (Δ = 88.7%), targeted content injection (Δ = 85.0%), and benign instruction refusal (Δ = 30.1%). Our attack highlights that (1) the most widely used post-training quantization method is susceptible to adversarial interferences, and (2) the complexity of quantization schemes alone is insufficient as a defense.
中文摘要:本研究首次针对GGUF量化方法提出攻击,证明可利用量化误差在看似正常的全精度模型中植入隐藏恶意行为,揭示了当前广泛使用的训练后量化方案的安全隐患。
English Summary: This study introduces the first attack on GGUF quantization, demonstrating that quantization errors can be exploited to inject hidden malicious behaviors into models that appear benign in full precision, thereby revealing vulnerabilities in widely used post-training quantization methods.

Authors:Aneeshan Sain, Subhajit Maity, Pinaki Nath Chowdhury, Subhadeep Koley, Ayan Kumar Bhunia, Yi-Zhe Song
Title: Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch
Abstract:
As sketch research has collectively matured over time, its adaptation for at-mass commercialisation emerges on the immediate horizon. Despite an already mature research endeavour for photos, there is no research on efficient inference designed specifically for sketch data. In this paper, we first demonstrate that existing state-of-the-art efficient light-weight models designed for photos do not work on sketches. We then propose two sketch-specific components which work in a plug-n-play manner on any photo efficient network to adapt them to work on sketch data. We specifically chose fine-grained sketch-based image retrieval (FG-SBIR) as a demonstrator as the most recognised sketch problem with immediate commercial value. Technically speaking, we first propose a cross-modal knowledge distillation network to transfer existing photo efficient networks to be compatible with sketch, which brings down the number of FLOPs and model parameters by 97.96% and 84.89%, respectively. We then exploit the abstract trait of sketch to introduce an RL-based canvas selector that dynamically adjusts to the abstraction level, which further cuts the number of FLOPs by two thirds. The end result is an overall reduction of 99.37% of FLOPs (from 40.18G to 0.254G) when compared with a full network, while retaining the accuracy (33.03% vs 32.77%) -- finally making an efficient network for the sparse sketch data that exhibit even fewer FLOPs than the best photo counterpart.
Chinese: 本文提出了两种即插即用组件,可将照片高效网络适配于草图数据,在保持精度的同时将计算量减少99.37%,首次实现了面向商业化的高效草图推理方案。
English: This paper introduces two plug-and-play components that adapt photo-efficient networks for sketches, achieving a 99.37% reduction in FLOPs while maintaining accuracy, making efficient sketch inference viable for commercialization.
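
As a quick arithmetic check of the headline figure in the abstract above, the overall FLOPs reduction can be recomputed from the two reported endpoints (40.18G and 0.254G). This is only a sketch for the reader; the helper name `relative_reduction` is hypothetical and not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return 1.0 - after / before


# Endpoints reported in the abstract, in GFLOPs.
full_network_gflops = 40.18
efficient_network_gflops = 0.254

reduction = relative_reduction(full_network_gflops, efficient_network_gflops)
print(f"Overall FLOPs reduction: {reduction:.2%}")
```

The result agrees with the 99.37% stated in the abstract.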

Authors:Haokun Chen, Yueqi Zhang, Yuan Bi, Yao Zhang, Tong Liu, Jinhe Bi, Jian Lan, Jindong Gu, Claudia Grosser, Denis Krompass, Nassir Navab, Volker Tresp
Title: Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs
Abstract:
In recent years, Large Language Models (LLMs) have achieved remarkable advancements, drawing significant attention from the research community. Their capabilities are largely attributed to large-scale architectures, which require extensive training on massive datasets. However, such datasets often contain sensitive or copyrighted content sourced from the public internet, raising concerns about data privacy and ownership. Regulatory frameworks, such as the General Data Protection Regulation (GDPR), grant individuals the right to request the removal of such sensitive information. This has motivated the development of machine unlearning algorithms that aim to remove specific knowledge from models without the need for costly retraining. Despite these advancements, evaluating the efficacy of unlearning algorithms remains a challenge due to the inherent complexity and generative nature of LLMs. In this work, we introduce a comprehensive auditing framework for unlearning evaluation, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods. By using various auditing algorithms, we evaluate the effectiveness and robustness of different unlearning strategies. To explore alternatives beyond prompt-based auditing, we propose a novel technique that leverages intermediate activation perturbations, addressing the limitations of auditing methods that rely solely on model inputs and outputs.
中文: 本文提出一个全面的审计框架,通过整合基准数据集、多种遗忘策略及创新的激活扰动技术,评估大型语言模型的机器遗忘算法,以应对数据隐私问题并提升评估的鲁棒性。
English: This paper introduces a comprehensive auditing framework to evaluate machine unlearning algorithms for large language models, addressing data privacy concerns by incorporating benchmark datasets, multiple unlearning strategies, and a novel activation perturbation technique for robust assessment.

Authors:Linjie Mu, Zhongzhen Huang, Yakun Zhu, Xiangyu Zhao, Shaoting Zhang, Xiaofan Zhang
Title: Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios
Abstract:
Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose MedE², a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of MedE² in improving the reasoning performance of medical multimodal models. Notably, models trained with MedE² consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.
中文摘要:提出的MedE²框架通过两阶段后训练流程增强医学多模态推理能力,在多个基准测试中表现优于基线方法,并在大模型验证中进一步证实了其稳健性和实用性。
English Summary: The proposed MedE² framework enhances medical multimodal reasoning through a two-stage post-training process, demonstrating superior performance across multiple benchmarks and confirming its robustness through validation on larger models.

Authors:Ruixuan Zhang, He Wang, Zhengyu Zhao, Zhiqing Guo, Xun Yang, Yunfeng Diao, Meng Wang
Title: Adversarially Robust AI-Generated Image Detection for Free: An Information Theoretic Perspective
Abstract:
Rapid advances in Artificial Intelligence Generated Images (AIGI) have facilitated malicious use, such as forgery and misinformation. Therefore, numerous methods have been proposed to detect fake images. Although such detectors have been proven to be universally vulnerable to adversarial attacks, defenses in this field are scarce. In this paper, we first identify that adversarial training (AT), widely regarded as the most effective defense, suffers from performance collapse in AIGI detection. Through an information-theoretic lens, we further attribute the cause of collapse to feature entanglement, which disrupts the preservation of feature-label mutual information. Instead, standard detectors show clear feature separation. Motivated by this difference, we propose Training-free Robust Detection via Information-theoretic Measures (TRIM), the first training-free adversarial defense for AIGI detection. TRIM builds on standard detectors and quantifies feature shifts using prediction entropy and KL divergence. Extensive experiments across multiple datasets and attacks validate the superiority of our TRIM, e.g., outperforming the state-of-the-art defense by 33.88% (28.91%) on ProGAN (GenImage), while well maintaining original accuracy.
中文摘要:本文发现对抗训练在AIGI检测中因特征纠缠而失效,并提出无需训练的TRIM防御方法,通过信息论度量量化特征偏移,在保持精度的同时显著优于现有防御方案。
English Summary: This paper identifies that adversarial training fails in AIGI detection due to feature entanglement and proposes TRIM, a training-free defense method using information-theoretic measures that significantly outperforms existing defenses while maintaining accuracy.
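
The abstract above names two information-theoretic quantities, prediction entropy and KL divergence, without saying how TRIM combines them. The sketch below only illustrates the two quantities themselves on plain probability vectors. The function names and the `eps` smoothing term are assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def prediction_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; `eps` (an assumption) guards against division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical detector outputs over (real, fake) for one image before and
# after a small perturbation of the input.
clean_prediction = [0.95, 0.05]
perturbed_prediction = [0.55, 0.45]

print(prediction_entropy(clean_prediction))
print(prediction_entropy(perturbed_prediction))
print(kl_divergence(clean_prediction, perturbed_prediction))
```

A confident prediction has low entropy, and a large KL divergence between the two outputs indicates that the prediction moved substantially under the perturbation.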

Authors:Heng Tang, Feng Liu, Xinbo Chen, Jiawei Chen, Bohao Wang, Changwang Zhang, Jun Wang, Yuegang Sun, Bingde Hu, Can Wang
Title: Bridging the Gap: Self-Optimized Fine-Tuning for LLM-based Recommender Systems
Abstract:
Recent years have witnessed extensive exploration of Large Language Models (LLMs) in the field of Recommender Systems (RS). There are currently two commonly used strategies to enable LLMs to have recommendation capabilities: 1) The "Guidance-Only" strategy uses in-context learning to exploit and amplify the inherent semantic understanding and item recommendation capabilities of LLMs; 2) The "Tuning-Only" strategy uses supervised fine-tuning (SFT) to fine-tune LLMs with the aim of fitting them to real recommendation data. However, neither of these strategies can effectively bridge the gap between the knowledge space of LLMs and recommendation, and their performance does not meet our expectations. To better enable LLMs to learn recommendation knowledge, we combine the advantages of the above two strategies and propose a novel "Guidance+Tuning" method called Self-Optimized Fine-Tuning (SOFT), which adopts the idea of curriculum learning. It first employs self-distillation to construct an auxiliary easy-to-learn but meaningful dataset from a fine-tuned LLM. Then it further utilizes a self-adaptive curriculum scheduler to enable LLMs to gradually learn from simpler data (self-distilled data) to more challenging data (real RS data). Extensive experiments demonstrate that SOFT significantly enhances the recommendation accuracy (37.59% on average) of LLM-based methods. The code is available via https://anonymous.4open.science/r/Self-Optimized-Fine-Tuning-264E
中文: 研究者提出名为"自优化微调(SOFT)"的"引导+调优"新方法,通过课程学习和自蒸馏技术弥合大语言模型与推荐系统间的知识鸿沟,将推荐准确率平均提升37.59%。
English: Researchers propose a novel "Guidance+Tuning" method called Self-Optimized Fine-Tuning (SOFT) that combines curriculum learning with self-distillation to bridge the gap between LLMs and recommendation systems, achieving a 37.59% average improvement in recommendation accuracy.

Authors:Peiwen Yuan, Yiwei Li, Shaoxiong Feng, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Title: Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator
Abstract:
LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, while the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, referred to as self-bias, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose Silencer, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate high-quality, self-bias-silenced benchmark. Experimental results across various settings demonstrate that Silencer can suppress self-bias to near zero, significantly improve evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with high-quality human-annotated benchmark), while also exhibiting strong generalizability.
中文: 本研究系统定义并验证了LLM生成基准中的自我偏见现象,将其归因于问题领域、语言风格和错误标签,并提出Silencer框架,通过利用生成器异质性来抑制偏见,显著提升基准质量和评估效果。
English: The study identifies and validates self-bias in LLM-generated benchmarks, attributing it to domain, style, and labeling issues, and introduces Silencer, a framework that mitigates bias by leveraging generator heterogeneity to enhance benchmark quality and evaluation accuracy.

Authors:Zhewei Yao, Guoheng Sun, Lukasz Borchmann, Zheyu Shen, Minghang Deng, Bohan Zhai, Hao Zhang, Ang Li, Yuxiong He
Title: Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL
Abstract:
Translating natural language into SQL (Text2SQL) is a longstanding challenge at the intersection of natural language understanding and structured data access. While large language models (LLMs) have significantly improved fluency in SQL generation, producing correct and executable SQL--particularly for complex queries--remains a bottleneck. We present Arctic-Text2SQL-R1, a reinforcement learning (RL) framework and model family designed to generate accurate, executable SQL using a lightweight reward signal based solely on execution correctness. Our approach avoids brittle intermediate supervision and complex reward shaping, promoting stable training and alignment with the end task. Combined with carefully curated data, strong supervised initialization, and effective training practices, Arctic-Text2SQL-R1 achieves state-of-the-art execution accuracy across six diverse Text2SQL benchmarks, including the top position on the BIRD leaderboard. Notably, our 7B model outperforms prior 70B-class systems, highlighting the framework's scalability and efficiency. We further demonstrate inference-time robustness through simple extensions like value retrieval and majority voting. Extensive experiments and ablation studies offer both positive and negative insights, providing practical guidance for future Text2SQL research.
Chinese: Arctic-Text2SQL-R1提出了一种基于强化学习的框架,通过执行正确性作为轻量级奖励信号来生成精准可执行的SQL查询,在六大基准测试中取得最优性能,其70亿参数模型更超越了此前700亿参数级系统的表现。
English: Arctic-Text2SQL-R1 introduces a reinforcement learning framework that generates highly accurate and executable SQL queries using execution-based rewards, achieving state-of-the-art performance across multiple benchmarks with a compact 7B model surpassing prior 70B systems.
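
A reward "based solely on execution correctness", as the abstract above describes it, can be illustrated with a minimal sketch. It assumes an in-memory SQLite database and a binary reward that compares result sets without regard to row order. The function names, the demo table, and the exact comparison rule are assumptions for illustration and are not taken from the paper.

```python
import sqlite3


def execution_reward(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> float:
    """Return 1.0 if both queries yield the same rows (order-insensitive), else 0.0."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # SQL that fails to execute earns no reward
    gold_rows = conn.execute(gold_sql).fetchall()
    # Sort by repr so rows with mixed or NULL values can still be ordered.
    same = sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)
    return 1.0 if same else 0.0


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny hypothetical database for the usage example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, 25), (2, 40), (3, 31)])
    return conn


conn = make_demo_db()
gold = "SELECT id FROM users WHERE age > 30"
print(execution_reward(conn, "SELECT id FROM users WHERE age >= 31", gold))  # same rows
print(execution_reward(conn, "SELECT id FROM users", gold))                  # extra row
print(execution_reward(conn, "SELEC id FROM users", gold))                   # syntax error
```

Because the signal depends only on the executed result, two differently written queries that return the same rows receive the same reward.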

Authors:Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Shu Zhang, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang, Kaixuan Huang, Xun Jiang, Yuming Cao, Yue Chen, Yunfei Chen, Zhengyi Chen, Ruowei Dai, Mengqiu Deng, Jiye Fu, Yunting Gu, Zijie Guan, Zirui Huang, Xiaoyan Ji, Yumeng Jiang, Delong Kong, Haolong Li, Jiaqi Li, Ruipeng Li, Tianze Li, Zhuoran Li, Haixia Lian, Mengyue Lin, Xudong Liu, Jiayi Lu, Jinghan Lu, Wanyu Luo, Ziyue Luo, Zihao Pu, Zhi Qiao, Ruihuan Ren, Liang Wan, Ruixiang Wang, Tianhui Wang, Yang Wang, Zeyu Wang, Zihua Wang, Yujia Wu, Zhaoyi Wu, Hao Xin, Weiao Xing, Ruojun Xiong, Weijie Xu, Yao Shu, Yao Xiao, Xiaorui Yang, Yuchen Yang, Nan Yi, Jiadong Yu, Yangyuxuan Yu, Huiting Zeng, Danni Zhang, Yunjie Zhang, Zhaoyu Zhang, Zhiheng Zhang, Xiaofeng Zheng, Peirong Zhou, Linyan Zhong, Xiaoyin Zong, Ying Zhao, Zhenxin Chen, Lin Ding, Xiaoyu Gao, Bingbing Gong, Yichao Li, Yang Liao, Guang Ma, Tianyuan Ma, Xinrui Sun, Tianyi Wang, Han Xia, Ruobing Xian, Gen Ye, Tengfei Yu, Wentao Zhang, Yuxi Wang, Xi Gao, Mengdi Wang
Title: On Path to Multimodal Historical Reasoning: HistBench and HistAgent
Abstract:
Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI's capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems, from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Given the poor performance of LLMs and other agents on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%) and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.
中文摘要:该摘要介绍了专用于评估AI历史推理能力的HistBench基准测试,以及历史专用AI智能体HistAgent,它在基准测试中显著超越了现有大型语言模型和通用智能体的表现。
English Summary: The abstract introduces HistBench, a specialized benchmark for evaluating AI's historical reasoning, and HistAgent, a history-specific AI agent that significantly outperforms existing large language models and generalist agents on this benchmark.

Authors:Jinyan Wang, Liu Yang, Yuecen Wei, Jiaxuan Si, Chenhao Guo, Qingyun Sun, Xianxian Li, Xingcheng Fu
Title: An Out-Of-Distribution Membership Inference Attack Approach for Cross-Domain Graph Attacks
Abstract:
Graph Neural Network-based methods face privacy leakage risks due to the introduction of topological structures about the targets, which allows attackers to bypass the target's prior knowledge of the sensitive attributes and realize membership inference attacks (MIA) by observing and analyzing the topology distribution. As privacy concerns grow, the assumption of MIA, which presumes that attackers can obtain an auxiliary dataset with the same distribution, is increasingly deviating from reality. In this paper, we categorize the distribution diversity issue in real-world MIA scenarios as an Out-Of-Distribution (OOD) problem, and propose a novel Graph OOD Membership Inference Attack (GOOD-MIA) to achieve cross-domain graph attacks. Specifically, we construct shadow subgraphs with distributions from different domains to model the diversity of real-world data. We then explore the stable node representations that remain unchanged under external influences and consider eliminating redundant information from confounding environments and extracting task-relevant key information to more clearly distinguish between the characteristics of training data and unseen data. This OOD-based design makes cross-domain graph attacks possible. Finally, we perform risk extrapolation to optimize the attack's domain adaptability during attack inference to generalize the attack to other domains. Experimental results demonstrate that GOOD-MIA achieves superior attack performance in datasets designed for multiple domains.
中文摘要:图神经网络因引入目标拓扑结构而面临隐私泄露风险,本文提出新型跨域图成员推理攻击方法GOOD-MIA,通过构建多域分布影子图和处理分布外问题,实现在不同数据域间的有效攻击泛化。
English Summary: Graph Neural Networks are vulnerable to privacy breaches through membership inference attacks that exploit topological structures, prompting the development of a novel cross-domain attack method called GOOD-MIA which addresses distribution diversity as an Out-Of-Distribution problem to enhance attack adaptability across different data domains.

Authors:Peiqi Wang, ShengYun Peng, Xuewen Zhang, Hanchao Yu, Yibo Yang, Lifu Huang, Fujun Liu, Qifan Wang
Title: Inference Compute-Optimal Video Vision Language Models
Abstract:
This work investigates the optimal allocation of inference compute across three key scaling factors in video vision language models: language model size, frame count, and the number of visual tokens per frame. While prior work typically focuses on optimizing model efficiency or improving performance without considering resource constraints, we instead identify the optimal model configuration under fixed inference compute budgets. We conduct large-scale training sweeps and careful parametric modeling of task performance to identify the inference compute-optimal frontier. Our experiments reveal how task performance depends on scaling factors and finetuning data size, as well as how changes in data size shift the compute-optimal frontier. These findings translate to practical tips for selecting these scaling factors.
中文: 本研究通过分析语言模型规模、帧数和每帧视觉标记之间的权衡,确定了固定推理计算下视频视觉语言模型的最优配置,揭示了任务性能和计算最优边界如何随扩展因子和数据规模变化。
English: This study determines the optimal model configuration for video vision language models under fixed inference compute by analyzing the trade-offs between language model size, frame count, and visual tokens per frame, revealing how task performance and the compute-optimal frontier shift with scaling factors and data size.

Authors:Haoyue Bai, Guodong Chen, Wangyang Ying, Xinyuan Wang, Nanxu Gong, Sixun Dong, Giulia Pedrielli, Haoyu Wang, Haifeng Chen, Yanjie Fu
Title: Brownian Bridge Augmented Surrogate Simulation and Injection Planning for Geological CO₂ Storage
Abstract:
Geological CO2 storage (GCS) involves injecting captured CO2 into deep subsurface formations to support climate goals. The effective management of GCS relies on adaptive injection planning to dynamically control injection rates and well pressures to balance both storage safety and efficiency. Prior literature, including numerical optimization methods and surrogate-optimization methods, is limited by real-world GCS requirements of smooth state transitions and goal-directed planning within limited time. To address these limitations, we propose a Brownian Bridge-augmented framework for surrogate simulation and injection planning in GCS and develop two insights: (i) Brownian bridge as a smooth state regularizer for better surrogate simulation; (ii) Brownian bridge as goal-time-conditioned planning guidance for improved injection planning. Our method has three stages: (i) learning deep Brownian bridge representations with contrastive and reconstructive losses from historical reservoir and utility trajectories, (ii) incorporating Brownian bridge-based next state interpolation for simulator regularization, and (iii) guiding injection planning with Brownian utility-conditioned trajectories to generate high-quality injection plans. Experimental results across multiple datasets collected from diverse GCS settings demonstrate that our framework consistently improves simulation fidelity and planning effectiveness while maintaining low computational overhead.
中文摘要:该研究提出的布朗桥增强框架通过平滑状态正则化和目标时间条件引导,有效提升了地质二氧化碳封存中替代模拟的保真度和注入规划的效果,同时保持了较低的计算开销。
English Summary: The proposed Brownian Bridge-augmented framework enhances geological CO₂ storage management by improving surrogate simulation fidelity and injection planning effectiveness through smooth state regularization and goal-time-conditioned guidance, while maintaining low computational costs.

Authors:Di Jin, Jingyi Cao, Xiaobao Wang, Bingdao Feng, Dongxiao He, Longbiao Wang, Jianwu Dang
Title: Rethinking Contrastive Learning in Graph Anomaly Detection: A Clean-View Perspective
Abstract:
Graph anomaly detection aims to identify unusual patterns in graph-based data, with wide applications in fields such as web security and financial fraud detection. Existing methods typically rely on contrastive learning, assuming that a lower similarity between a node and its local subgraph indicates abnormality. However, these approaches overlook a crucial limitation: the presence of interfering edges invalidates this assumption, since it introduces disruptive noise that compromises the contrastive learning process. Consequently, this limitation impairs the ability to effectively learn meaningful representations of normal patterns, leading to suboptimal detection performance. To address this issue, we propose a Clean-View Enhanced Graph Anomaly Detection framework (CVGAD), which includes a multi-scale anomaly awareness module to identify key sources of interference in the contrastive learning process. Moreover, to mitigate bias from the one-step edge removal process, we introduce a novel progressive purification module. This module incrementally refines the graph by iteratively identifying and removing interfering edges, thereby enhancing model performance. Extensive experiments on five benchmark datasets validate the effectiveness of our approach.
Chinese Summary: 本研究提出CVGAD框架,通过渐进式净化干扰边并结合多尺度异常感知机制,有效提升图异常检测中对比学习的性能。
English Summary: The study introduces CVGAD, a framework that improves graph anomaly detection by progressively purifying interfering edges and enhancing contrastive learning through multi-scale anomaly awareness.

Authors:Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, Chunfeng Yuan, Bing Li, Lin Song, Jun Gao, Peng Li, Weiming Hu
Title: DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval
Abstract:
Composed Image Retrieval (CIR) aims to retrieve target images from a gallery based on a reference image and modification text as a combined query. Recent approaches focus on balancing global information from two modalities and encode the query into a unified feature for retrieval. However, due to insufficient attention to fine-grained details, these coarse fusion methods often struggle with handling subtle visual alterations or intricate textual instructions. In this work, we propose DetailFusion, a novel dual-branch framework that effectively coordinates information across global and detailed granularities, thereby enabling detail-enhanced CIR. Our approach leverages atomic detail variation priors derived from an image editing dataset, supplemented by a detail-oriented optimization strategy to develop a Detail-oriented Inference Branch. Furthermore, we design an Adaptive Feature Compositor that dynamically fuses global and detailed features based on fine-grained information of each unique multimodal query. Extensive experiments and ablation analyses not only demonstrate that our method achieves state-of-the-art performance on both CIRR and FashionIQ datasets but also validate the effectiveness and cross-domain adaptability of detail enhancement for CIR.
中文: 提出的DetailFusion框架通过动态融合全局与细粒度特征,结合细节优化策略和自适应组合机制,显著提升了组合图像检索的精度和跨领域适应性。
English: The proposed DetailFusion framework enhances composed image retrieval by dynamically integrating global and fine-grained features, achieving state-of-the-art performance through detail-oriented optimization and adaptive fusion.

Authors:Ting-Wei Li, Ruizhong Qiu, Hanghang Tong
Title: Model-Free Graph Data Selection under Distribution Shift
Abstract:
Graph domain adaptation (GDA) is a fundamental task in graph machine learning, with techniques like shift-robust graph neural networks (GNNs) and specialized training procedures to tackle the distribution shift problem. Although these model-centric approaches show promising results, they often struggle with severe shifts and constrained computational resources. To address these challenges, we propose a novel model-free framework, GRADATE (GRAph DATa sElector), that selects the best training data from the source domain for the classification task on the target domain. GRADATE picks training samples without relying on any GNN model's predictions or training recipes, leveraging optimal transport theory to capture and adapt to distribution changes. GRADATE is data-efficient and scalable, and it complements existing model-centric GDA approaches. Through comprehensive empirical studies on several real-world graph-level datasets and multiple covariate shift types, we demonstrate that GRADATE outperforms existing selection methods and enhances off-the-shelf GDA methods with far less training data.
中文:GRADATE是一种无需模型的框架,利用最优传输理论从源域选择最佳训练数据,有效解决图域适应中的分布偏移问题,并以更高的数据效率超越现有方法。
English: GRADATE is a model-free framework that selects optimal training data from the source domain using optimal transport theory, effectively addressing distribution shifts in graph domain adaptation and outperforming existing methods with greater data efficiency.

Authors:Jipeng Zhang, Haolin Yang, Kehao Miao, Ruiyuan Zhang, Renjie Pi, Jiahui Gao, Xiaofang Zhou
Title: ExeSQL: Self-Taught Text-to-SQL Models with Execution-Driven Bootstrapping for SQL Dialects
Abstract:
Recent text-to-SQL models have achieved strong performance, but their effectiveness remains largely confined to SQLite due to dataset limitations. However, real-world applications require SQL generation across multiple dialects with varying syntax and specialized features, which remains a challenge for current models. The main obstacle in building a dialect-aware model lies in acquiring high-quality dialect-specific data. Data generated purely through static prompting - without validating SQLs via execution - tends to be noisy and unreliable. Moreover, the lack of real execution environments in the training loop prevents models from grounding their predictions in executable semantics, limiting generalization despite surface-level improvements from data filtering. This work introduces ExeSQL, a text-to-SQL framework with execution-driven, agentic bootstrapping. The method consists of iterative query generation, execution-based filtering (e.g., rejection sampling), and preference-based training, enabling the model to adapt to new SQL dialects through verifiable, feedback-guided learning. Experiments show that ExeSQL bridges the dialect gap in text-to-SQL, achieving average improvements of 15.2%, 10.38%, and 4.49% over GPT-4o on PostgreSQL, MySQL, and Oracle, respectively, across multiple datasets of varying difficulty.
中文摘要:本文提出ExeSQL框架,通过执行驱动的迭代查询生成、过滤和训练方法,有效解决了文本到SQL模型在多方言适配中的挑战,在多种数据库系统上相比GPT-4o实现了显著性能提升。
English Summary: This paper introduces ExeSQL, an execution-driven framework that iteratively generates, filters, and trains on verifiable SQL queries to bridge the dialect gap in text-to-SQL systems, achieving significant improvements over GPT-4o across multiple database systems.

Authors:Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, Limin Wang
Title: CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
Abstract:
Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. Additionally, we introduce two new metrics for more robustly and affordably evaluating motion and guiding the development of motion learning methods: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified policy joint learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.
中文:CoMo提出了一种连续运动表征学习方法,通过时序特征差分和信息瓶颈约束从多样化网络视频中学习精细动态,其零样本泛化能力可生成跨领域伪动作,促进多源数据的策略协同训练。
English: CoMo introduces a continuous motion representation learning method that captures fine-grained dynamics from diverse internet videos, employing temporal feature differences and information bottleneck principles to enhance generalization and enable zero-shot pseudo action generation for unified policy training.

Authors:Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev
Title: Robust LLM Fingerprinting via Domain-Specific Watermarks
Abstract:
As open-source language models (OSMs) grow more capable and are widely shared and finetuned, ensuring model provenance, i.e., identifying the origin of a given model instance, has become an increasingly important issue. At the same time, existing backdoor-based model fingerprinting techniques often fall short of achieving key requirements of real-world model ownership detection. In this work, we build on the observation that while current open-source model watermarks fail to achieve reliable content traceability, they can be effectively adapted to address the challenge of model provenance. To this end, we introduce the concept of domain-specific watermarking for model fingerprinting. Rather than watermarking all generated content, we train the model to embed watermarks only within specified subdomains (e.g., particular languages or topics). This targeted approach ensures detection reliability, while improving watermark durability and quality under a range of real-world deployment settings. Our evaluations show that domain-specific watermarking enables model fingerprinting with strong statistical guarantees, controllable false positive rates, high detection power, and preserved generation quality. Moreover, we find that our fingerprints are inherently stealthy and naturally robust to real-world variability across deployment scenarios.
Chinese Summary: 本文提出了用于大语言模型指纹识别的领域特定水印方法,训练模型仅在指定子领域(如特定语言或主题)内嵌入水印,从而在保持生成质量的同时,实现具有强统计保证、可控误报率和高检测能力的模型溯源。
English Summary: This paper introduces domain-specific watermarking for LLM fingerprinting, training the model to embed watermarks only within specified subdomains (e.g., particular languages or topics), which enables model provenance detection with strong statistical guarantees, controllable false positive rates, high detection power, and preserved generation quality.

Authors:Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev
Title: LLM Fingerprinting via Semantically Conditioned Watermarks
Abstract:
Most LLM fingerprinting methods teach the model to respond to a few fixed queries with predefined atypical responses (keys). This memorization often does not survive common deployment steps such as finetuning or quantization, and such keys can be easily detected and filtered from LLM responses, ultimately breaking the fingerprint. To overcome these limitations we introduce LLM fingerprinting via semantically conditioned watermarks, replacing fixed query sets with a broad semantic domain, and replacing brittle atypical keys with a statistical watermarking signal diffused throughout each response. After teaching the model to watermark its responses only to prompts from a predetermined domain e.g., French language, the model owner can use queries from that domain to reliably detect the fingerprint and verify ownership. As we confirm in our thorough experimental evaluation, our fingerprint is both stealthy and robust to all common deployment scenarios.
Chinese Summary: 本文提出了一种基于语义条件的水印方法,用于大语言模型指纹识别,通过用广泛的语义域和统计信号替代固定查询和异常响应,确保了在各种常见部署场景下的隐蔽性和鲁棒性。
English Summary: This paper introduces a semantically conditioned watermarking method for LLM fingerprinting that replaces fixed queries and atypical responses with a broad semantic domain and statistical signals, ensuring stealth and robustness across common deployment scenarios.

Authors:Thibaud Gloaguen, Mark Vero, Robin Staab, Martin Vechev
Title: Finetuning-Activated Backdoors in LLMs
Abstract:
Finetuning openly accessible Large Language Models (LLMs) has become standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets led to predictable behaviors. In this paper, we demonstrate for the first time that an adversary can create poisoned LLMs that initially appear benign but exhibit malicious behaviors once finetuned by downstream users. To this end, our proposed attack, FAB (Finetuning-Activated Backdoor), poisons an LLM via meta-learning techniques to simulate downstream finetuning, explicitly optimizing for the emergence of malicious behaviors in the finetuned models. At the same time, the poisoned LLM is regularized to retain general capabilities and to exhibit no malicious behaviors prior to finetuning. As a result, when users finetune the seemingly benign model on their own datasets, they unknowingly trigger its hidden backdoor behavior. We demonstrate the effectiveness of FAB across multiple LLMs and three target behaviors: unsolicited advertising, refusal, and jailbreakability. Additionally, we show that FAB-backdoors are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler). Our findings challenge prevailing assumptions about the security of finetuning, revealing yet another critical attack vector exploiting the complexities of LLMs.
中文: 本文提出FAB攻击方法,通过元学习技术在大型语言模型中植入休眠的对抗行为,这些行为仅在下游微调后被激活,从而挑战了传统微调安全性的固有认知。
English: This paper introduces FAB, a novel attack that compromises large language models through meta-learning to embed dormant adversarial behaviors, which are activated only after downstream finetuning, challenging the assumed security of standard finetuning practices.

Authors:Thibaud Gloaguen, Mark Vero, Robin Staab, Martin Vechev
Title: Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning
Abstract:
Finetuning open-weight Large Language Models (LLMs) is standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets leads to predictable behaviors. In this paper, we demonstrate, for the first time, that an adversary can create compromised LLMs that are performant and benign, yet exhibit adversarial behaviors once finetuned by downstream users. To this end, we propose an attack, FAB (Finetuning-activated Adversarial Behaviors), which compromises an LLM via meta-learning techniques that simulate downstream finetuning, explicitly optimizing for the emergence of adversarial behaviors in the finetuned models. At the same time, the compromised LLM is regularized to retain general capabilities and to exhibit no adversarial behaviors prior to finetuning. As a result, when users finetune (e.g., instruction-tuning, distillation, DPO) the seemingly benign model on their own datasets, they unknowingly trigger its dormant adversarial behavior. We experimentally demonstrate the effectiveness of FAB across multiple LLMs and three commonly considered target behaviors: unsolicited advertising, jailbreakability, and over-refusal. We show that FAB-triggers are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler, post-training algorithm). Our findings challenge prevailing assumptions on the security of finetuning, revealing a critical attack vector.
中文: 本文提出FAB攻击方法,通过元学习技术在大型语言模型中植入休眠的对抗行为,这些行为仅在下游微调后被激活,从而挑战了传统微调安全性的固有认知。
English: This paper introduces FAB, a novel attack that compromises large language models through meta-learning to embed dormant adversarial behaviors, which are activated only after downstream finetuning, challenging the assumed security of standard finetuning practices.

Authors:Zirui Song, Guangxian Ouyang, Mingzhe Li, Yuheng Ji, Chenxi Wang, Zixiang Xu, Zeyu Zhang, Xiaoqing Zhang, Qian Jiang, Zhenhao Chen, Zhongzhi Li, Rui Yan, Xiuying Chen
Title: ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) have recently advanced robotic manipulation by leveraging vision for scene perception and language for instruction following. However, existing methods rely heavily on costly human-annotated training datasets, which limits their generalization and causes them to struggle in out-of-domain (OOD) scenarios, reducing real-world adaptability. To address these challenges, we propose ManipLVM-R1, a novel reinforcement learning framework that replaces traditional supervision with Reinforcement Learning using Verifiable Rewards (RLVR). By directly optimizing for task-aligned outcomes, our method enhances generalization and physical reasoning while removing the dependence on costly annotations. Specifically, we design two rule-based reward functions targeting key robotic manipulation subtasks: an Affordance Perception Reward to enhance localization of interaction regions, and a Trajectory Match Reward to ensure the physical plausibility of action paths. These rewards provide immediate feedback and impose spatial-logical constraints, encouraging the model to go beyond shallow pattern matching and instead learn deeper, more systematic reasoning about physical interactions.
中文摘要:ManipLVM-R1提出了一种采用可验证奖励的强化学习框架,通过设计基于规则的奖励函数来提升机器人操作的泛化能力和物理推理,摆脱了对人工标注数据的依赖。
English Summary: ManipLVM-R1 introduces a reinforcement learning framework with verifiable rewards to overcome the limitations of human-annotated data in robotic manipulation, enhancing generalization through affordance perception and trajectory match rewards.

Authors:Xiaoqing Zhang, Huabin Zheng, Ang Lv, Yuhan Liu, Zirui Song, Xiuying Chen, Rui Yan, Flood Sung
Title: Divide-Fuse-Conquer: Eliciting "Aha Moments" in Multi-Scenario Games
Abstract:
Large language models (LLMs) have been observed to suddenly exhibit advanced reasoning abilities during reinforcement learning (RL), resembling an "aha moment" triggered by simple outcome-based rewards. While RL has proven effective in eliciting such breakthroughs in tasks involving mathematics, coding, and vision, it faces significant challenges in multi-scenario games. The diversity of game rules, interaction modes, and environmental complexities often leads to policies that perform well in one scenario but fail to generalize to others. Simply combining multiple scenarios during training introduces additional challenges, such as training instability and poor performance. To overcome these challenges, we propose Divide-Fuse-Conquer, a framework designed to enhance generalization in multi-scenario RL. This approach starts by heuristically grouping games based on characteristics such as rules and difficulties. Specialized models are then trained for each group to excel at the games in that group; this is what we refer to as the divide step. Next, we fuse model parameters from different groups into a new model and continue training it on multiple groups until the scenarios in all groups are conquered. Experiments across 18 TextArena games show that Qwen2.5-32B-Align trained with the Divide-Fuse-Conquer strategy reaches a performance level comparable to Claude3.5, achieving 7 wins and 4 draws. We hope our approach can inspire future research on using reinforcement learning to improve the generalization of LLMs.
中文摘要:Divide-Fuse-Conquer框架通过分组游戏、训练专用模型和融合参数来增强多场景强化学习的泛化能力,在TextArena游戏中达到与Claude3.5相当的性能水平。
English Summary: The Divide-Fuse-Conquer framework enhances multi-scenario reinforcement learning by grouping games, training specialized models, and fusing parameters to overcome generalization challenges, achieving competitive performance with Claude3.5 in TextArena games.

Authors:Yan Zhao, Yang Li, Zhengxue Cheng, Hengdi Zhang, Li Song
Title: TacCompress: A Benchmark for Multi-Point Tactile Data Compression in Dexterous Hand
Abstract:
Though robotic dexterous manipulation has progressed substantially recently, challenges like in-hand occlusion still necessitate fine-grained tactile perception, leading to the integration of more tactile sensors into robotic hands. Consequently, the increased data volume imposes substantial bandwidth pressure on signal transmission from the hand's controller. However, the acquisition and compression of multi-point tactile signals based on the dexterous hands' physical structures have not been thoroughly explored. In this paper, our contributions are twofold. First, we introduce a Multi-Point Tactile Dataset for Dexterous Hand Grasping (Dex-MPTD). This dataset captures tactile signals from multiple contact sensors across various objects and grasping poses, offering a comprehensive benchmark for advancing dexterous robotic manipulation research. Second, we investigate both lossless and lossy compression on Dex-MPTD by converting tactile data into images and applying six lossless and five lossy image codecs for efficient compression. Experimental results demonstrate that tactile data can be losslessly compressed to as low as 0.0364 bits per sub-sample (bpss), achieving approximately 200× compression ratio compared to the raw tactile data. Efficient lossy compressors like HM and VTM can achieve about 1000× data reductions while preserving acceptable data fidelity. The exploration of lossy compression also reveals that screen-content-targeted coding tools outperform general-purpose codecs in compressing tactile data.
中文摘要:本文针对灵巧机器人操作中触觉数据量大的问题,提出了多点触觉数据集,并通过将触觉数据转换为图像进行压缩实验,实现了高达1000倍的数据缩减同时保持可接受的精度。
English Summary: This paper addresses the challenge of handling large tactile data volumes in dexterous robotic manipulation by introducing a multi-point tactile dataset and exploring efficient compression methods that achieve significant data reduction while maintaining acceptable fidelity.

Authors:Yan Zhao, Zhengxue Cheng, Junxuan Zhang, Qunshan Gu, Qi Wang, Li Song
Title: DualComp: End-to-End Learning of a Unified Dual-Modality Lossless Compressor
Abstract:
Most learning-based lossless compressors are designed for a single modality, requiring separate models for multi-modal data and lacking flexibility. However, different modalities vary significantly in format and statistical properties, making it ineffective to use compressors that lack modality-specific adaptations. While multi-modal large language models (MLLMs) offer a potential solution for modality-unified compression, their excessive complexity hinders practical deployment. To address these challenges, we focus on the two most common modalities, image and text, and propose DualComp, the first unified and lightweight learning-based dual-modality lossless compressor. Built on a lightweight backbone, DualComp incorporates three key structural enhancements to handle modality heterogeneity: modality-unified tokenization, modality-switching contextual learning, and modality-routing mixture-of-experts. A reparameterization training strategy is also used to boost compression performance. DualComp integrates both modality-specific and shared parameters for efficient parameter utilization, enabling near real-time inference (200KB/s) on desktop CPUs. With much fewer parameters, DualComp achieves compression performance on par with the SOTA LLM-based methods for both text and image datasets. Its simplified single-modality variant surpasses the previous best image compressor on the Kodak dataset by about 9% using just 1.2% of the model size.
English Summary: DualComp is a lightweight, unified lossless compressor for images and text that overcomes modality-specific limitations by integrating shared and specialized parameters, achieving near real-time performance and matching state-of-the-art compression with minimal model size.

Authors:Zifeng Wang, Benjamin Danek, Jimeng Sun
Title: BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research
Abstract:
Validating scientific hypotheses is a central challenge in biomedical research, and remains difficult for artificial intelligence (AI) agents due to the complexity of real-world data analysis and evidence interpretation. In this work, we present BioDSA-1K, a benchmark designed to evaluate AI agents on realistic, data-driven biomedical hypothesis validation tasks. BioDSA-1K consists of 1,029 hypothesis-centric tasks paired with 1,177 analysis plans, curated from over 300 published biomedical studies to reflect the structure and reasoning found in authentic research workflows. Each task includes a structured hypothesis derived from the original study's conclusions, expressed in the affirmative to reflect the language of scientific reporting, and one or more pieces of supporting evidence grounded in empirical data tables. While these hypotheses mirror published claims, they remain testable using standard statistical or machine learning methods. The benchmark enables evaluation along four axes: (1) hypothesis decision accuracy, (2) alignment between evidence and conclusion, (3) correctness of the reasoning process, and (4) executability of the AI-generated analysis code. Importantly, BioDSA-1K includes non-verifiable hypotheses: cases where the available data are insufficient to support or refute a claim, reflecting a common yet underexplored scenario in real-world science. We propose BioDSA-1K as a foundation for building and evaluating generalizable, trustworthy AI agents for biomedical discovery.
中文: BioDSA-1K是一个包含1,029项生物医学任务的基准测试,通过四个关键维度评估AI代理在真实假设验证中的表现,特别涵盖了实际研究中常见的不可验证假设场景。
English: BioDSA-1K is a benchmark of 1,029 data-driven biomedical tasks that evaluates AI agents on realistic hypothesis validation across four key dimensions, including handling non-verifiable claims common in real research.

Authors:Dong Won Lee, Hae Won Park, Cynthia Breazeal, Louis-Philippe Morency
Title: Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition
Abstract:
We propose a large language model based reward decomposition framework for aligning dialogue agents using only a single session-level feedback signal. We leverage the reasoning capabilities of a frozen, pretrained large language model (LLM) to infer fine-grained local implicit rewards by decomposing global, session-level feedback. Our first text-only variant prompts the LLM to perform reward decomposition using only the dialogue transcript. The second multimodal variant incorporates additional behavioral cues, such as pitch, gaze, and facial affect, expressed as natural language descriptions. These inferred turn-level rewards are distilled into a lightweight reward model, which we utilize for RL-based fine-tuning for dialogue generation. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods and demonstrate notable improvements in human evaluations of conversation quality, suggesting that LLMs are strong reward decomposers that obviate the need for manual reward shaping and granular human feedback.
Chinese: 我们提出了一种基于大型语言模型的奖励分解框架,利用冻结的预训练模型将全局反馈细化为局部奖励,无需人工设计即可显著提升对话质量。
English: We introduce a framework that uses a frozen large language model to decompose session-level feedback into fine-grained rewards, enabling effective dialogue agent alignment without manual reward engineering.

Authors:Wangyang Ying, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Sixun Dong, Haifeng Chen, Yanjie Fu
Title: Bridging the Domain Gap in Equation Distillation with Reinforcement Feedback
Abstract:
The data-to-equation (Data2Eqn) task aims to discover interpretable mathematical equations that map observed values to labels, offering physical insights and broad applicability across academic and industrial domains. Genetic programming and traditional deep learning-based approaches suffer from search inefficiency and poor generalization on small task-specific datasets. Foundation models have shown promise in this area, but existing approaches suffer from two limitations: 1) they are pretrained on general-purpose data distributions, making them less effective for domain-specific tasks; and 2) their training objectives focus on token-level alignment, overlooking mathematical semantics, which can lead to inaccurate equations. To address these issues, we aim to enhance the domain adaptability of foundation models for Data2Eqn tasks. In this work, we propose a reinforcement learning-based finetuning framework that directly optimizes the generation policy of a pretrained model through reward signals derived from downstream numerical fitness. Our method allows the model to adapt to specific and complex data distributions and generate mathematically meaningful equations. Extensive experiments demonstrate that our approach improves both the accuracy and robustness of equation generation under complex distributions.
中文摘要:本研究提出的强化学习微调框架通过下游数值适应度优化预训练模型,有效提升基础模型在数据到方程任务中的领域适应性和数学准确性,显著改善了复杂分布下的方程生成效果。
English Summary: The proposed reinforcement learning framework fine-tunes foundation models to enhance domain adaptability and mathematical accuracy for data-to-equation tasks, overcoming limitations of traditional methods and improving equation generation performance.

Authors:Nanxu Gong, Zijun Li, Sixun Dong, Haoyue Bai, Wangyang Ying, Xinyuan Wang, Yanjie Fu
Title: Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation
Abstract:
Feature Transformation (FT) crafts new features from original ones via mathematical operations to enhance dataset expressiveness for downstream models. However, existing FT methods exhibit critical limitations: discrete search struggles with enormous combinatorial spaces, impeding practical use; and continuous search, being highly sensitive to initialization and step sizes, often becomes trapped in local optima, restricting global exploration. To overcome these limitations, DIFFT redefines FT as a reward-guided generative task. It first learns a compact and expressive latent space for feature sets using a Variational Auto-Encoder (VAE). A Latent Diffusion Model (LDM) then navigates this space to generate high-quality feature embeddings, its trajectory guided by a performance evaluator towards task-specific optima. This synthesis of global distribution learning (from LDM) and targeted optimization (reward guidance) produces potent embeddings, which a novel semi-autoregressive decoder efficiently converts into structured, discrete features, preserving intra-feature dependencies while allowing parallel inter-feature generation. Extensive experiments on 14 benchmark datasets show DIFFT consistently outperforms state-of-the-art baselines in predictive accuracy and robustness, with significantly lower training and inference times.
Chinese: DIFFT通过变分自编码器和潜在扩散模型生成优化的特征嵌入,再解码为离散特征,克服了特征变换的局限性,在多个数据集上实现了更高的准确性、鲁棒性和效率。
English: DIFFT overcomes limitations in feature transformation by using a VAE and latent diffusion model to generate optimized feature embeddings, which are then decoded into discrete features, achieving superior accuracy, robustness, and efficiency across multiple datasets.

Authors:Nanxu Gong, Sixun Dong, Haoyue Bai, Xinyuan Wang, Wangyang Ying, Yanjie Fu
Title: Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories
Abstract:
As a widely used and practical tool, feature engineering transforms raw data into discriminative features to advance AI model performance. However, existing methods usually apply feature selection and generation separately, failing to strike a balance between reducing redundancy and adding meaningful dimensions. To fill this gap, we propose an agentic feature augmentation concept, where the unification of feature generation and selection is modeled as agentic teaming and planning. Specifically, we develop a Multi-Agent System with Long and Short-Term Memory (MAGS), comprising a selector agent to eliminate redundant features, a generator agent to produce informative new dimensions, and a router agent that strategically coordinates their actions. We leverage in-context learning with short-term memory for immediate feedback refinement and long-term memory for globally optimal guidance. Additionally, we employ offline Proximal Policy Optimization (PPO) reinforcement fine-tuning to train the router agent for effective decision-making to navigate a vast discrete feature space. Extensive experiments demonstrate that this unified agentic framework consistently achieves superior task performance by intelligently orchestrating feature selection and generation.
中文摘要:本文提出了一种代理特征增强框架,通过结合长短时记忆的多智能体系统统一特征选择与生成,利用智能体协同与强化学习优化决策,在减少冗余的同时有效扩充特征维度,显著提升模型性能。
English Summary: This paper introduces an agentic feature augmentation framework that unifies feature selection and generation through a multi-agent system with memory mechanisms, achieving superior AI performance by intelligently balancing redundancy reduction and feature enrichment.

Authors:Xiao Lin, Zhining Liu, Ze Yang, Gaotang Li, Ruizhong Qiu, Shuke Wang, Hui Liu, Haotian Li, Sumit Keswani, Vishwa Pardeshi, Huijun Zhao, Wei Fan, Hanghang Tong
Title: MORALISE: A Structured Benchmark for Moral Alignment in Visual Language Models
Abstract:
Warning: This paper contains examples of harmful language and images. Reader discretion is advised. Recently, vision-language models have demonstrated increasing influence in morally sensitive domains such as autonomous driving and medical analysis, owing to their powerful multimodal reasoning capabilities. As these models are deployed in high-stakes real-world applications, it is of paramount importance to ensure that their outputs align with human moral values and remain within moral boundaries. However, existing work on moral alignment either focuses solely on textual modalities or relies heavily on AI-generated images, leading to distributional biases and reduced realism. To overcome these limitations, we introduce MORALISE, a comprehensive benchmark for evaluating the moral alignment of vision-language models (VLMs) using diverse, expert-verified real-world data. We begin by proposing a comprehensive taxonomy of 13 moral topics grounded in Turiel's Domain Theory, spanning the personal, interpersonal, and societal moral domains encountered in everyday life. Built on this framework, we manually curate 2,481 high-quality image-text pairs, each annotated with two fine-grained labels: (1) topic annotation, identifying the violated moral topic(s), and (2) modality annotation, indicating whether the violation arises from the image or the text. For evaluation, we include two tasks, moral judgment and moral norm attribution, to assess models' awareness of moral violations and their reasoning ability on morally salient content. Extensive experiments on 19 popular open- and closed-source VLMs show that MORALISE poses a significant challenge, revealing persistent moral limitations in current state-of-the-art models. The full benchmark is publicly available at https://huggingface.co/datasets/Ze1025/MORALISE.
中文摘要:本文提出MORALISE基准,通过真实图像文本数据评估视觉语言模型的道德对齐能力,发现现有先进模型仍存在明显的道德认知缺陷。
English Summary: This paper introduces MORALISE, a benchmark using real-world image-text pairs to evaluate vision-language models' moral alignment, revealing significant ethical shortcomings in current models despite their advanced capabilities.

Authors:Sucheng Ren, Qihang Yu, Ju He, Alan Yuille, Liang-Chieh Chen
Title: Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers
Abstract:
Diffusion-based Transformers have demonstrated impressive generative capabilities, but their high computational costs hinder practical deployment; for example, generating an 8192×8192 image can take over an hour on an A100 GPU. In this work, we propose GRAT (GRouping first, ATtending smartly), a training-free attention acceleration strategy for fast image and video generation without compromising output quality. The key insight is to exploit the inherent sparsity in learned attention maps (which tend to be locally focused) in pretrained Diffusion Transformers and leverage better GPU parallelism. Specifically, GRAT first partitions contiguous tokens into non-overlapping groups, aligning both with GPU execution patterns and the local attention structures learned in pretrained generative Transformers. It then accelerates attention by having all query tokens within the same group share a common set of attendable key and value tokens. These key and value tokens are further restricted to structured regions, such as surrounding blocks or criss-cross regions, significantly reducing computational overhead (e.g., attaining a 35.8× speedup over full attention when generating 8192×8192 images) while preserving essential attention patterns and long-range context. We validate GRAT on pretrained Flux and HunyuanVideo for image and video generation, respectively. In both cases, GRAT achieves substantially faster inference without any fine-tuning, while maintaining the performance of full attention. We hope GRAT will inspire future research on accelerating Diffusion Transformers for scalable visual generation.
中文总结:GRAT是一种无需训练的注意力加速方法,通过令牌分组和限制注意力至结构化区域,在不损失生成质量的前提下显著提升了扩散Transformer在图像和视频生成中的速度。
English summary: GRAT is a training-free attention acceleration method that speeds up Diffusion Transformers by grouping tokens and restricting attention to structured regions, achieving significant speedups in image and video generation without quality loss.
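As a rough illustration of the grouping idea described in this abstract, the sketch below builds a group-level attention pattern on a 2D grid of token groups, so that every query in a group shares the same set of attendable key groups. The function names, the grid layout, and the `radius` parameter are assumptions made for illustration; this is not GRAT's actual implementation.

```python
def surrounding_blocks(rows, cols, radius=1):
    """Map each group (r, c) to the set of groups within `radius` of it
    (a square neighborhood, clipped at the grid border)."""
    allowed = {}
    for r in range(rows):
        for c in range(cols):
            allowed[(r, c)] = {
                (rr, cc)
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            }
    return allowed


def criss_cross(rows, cols):
    """Map each group (r, c) to all groups in the same row or the same column."""
    allowed = {}
    for r in range(rows):
        for c in range(cols):
            same_row = {(r, cc) for cc in range(cols)}
            same_col = {(rr, c) for rr in range(rows)}
            allowed[(r, c)] = same_row | same_col
    return allowed


def density(allowed, rows, cols):
    """Fraction of group-to-group pairs kept, relative to full attention."""
    total_pairs = (rows * cols) ** 2
    return sum(len(keys) for keys in allowed.values()) / total_pairs


if __name__ == "__main__":
    print(density(surrounding_blocks(4, 4), 4, 4))  # 0.390625
    print(density(criss_cross(4, 4), 4, 4))         # 0.4375
```

On this toy 4 x 4 grid the savings are modest; the kept fraction shrinks as the grid grows, since each group keeps a fixed-size neighborhood (or one row plus one column) while full attention grows quadratically.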

Authors:Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro
Title: Source Verification for Speech Deepfakes
Abstract:
With the proliferation of speech deepfake generators, it becomes crucial not only to assess the authenticity of synthetic audio but also to trace its origin. While source attribution models attempt to address this challenge, they often struggle in open-set conditions against unseen generators. In this paper, we introduce the source verification task, which, inspired by speaker verification, determines whether a test track was produced using the same model as a set of reference signals. Our approach leverages embeddings from a classifier trained for source attribution, computing distance scores between tracks to assess whether they originate from the same source. We evaluate multiple models across diverse scenarios, analyzing the impact of speaker diversity, language mismatch, and post-processing operations. This work provides the first exploration of source verification, highlighting its potential and vulnerabilities, and offers insights for real-world forensic applications.
中文总结:本文提出了一种声源验证方法,通过分析音频嵌入距离来判断是否来自同一深度伪造生成器,在多场景下评估性能并揭示了该技术的潜力与脆弱性。
English Summary: This paper introduces a source verification task to determine if audio tracks originate from the same deepfake generator by analyzing embedding distances, evaluating performance across various real-world scenarios while revealing both capabilities and vulnerabilities.
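The verification step described in this abstract can be sketched as a distance test between embeddings. The sketch assumes that embeddings are plain lists of floats, that the reference tracks are averaged into a centroid, and that a fixed `threshold` decides acceptance; all three are illustrative assumptions rather than details given in the abstract.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def same_source(test_emb, reference_embs, threshold=0.3):
    """Return (decision, score): accept when the test embedding lies within
    `threshold` (a hypothetical value) of the reference centroid."""
    dim = len(test_emb)
    centroid = [
        sum(ref[i] for ref in reference_embs) / len(reference_embs)
        for i in range(dim)
    ]
    score = cosine_distance(test_emb, centroid)
    return score <= threshold, score


if __name__ == "__main__":
    refs = [[1.0, 0.0], [0.9, 0.1]]
    print(same_source([1.0, 0.05], refs))  # accepted: very small distance
    print(same_source([0.0, 1.0], refs))   # rejected: distance close to 1
```

In practice the threshold would be calibrated on held-out trials, as in speaker verification; the value used here is arbitrary.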

Authors:Songhao Wu, Quan Tu, Hong Liu, Jia Xu, Zhongyi Liu, Guannan Zhang, Ran Wang, Xiuying Chen, Rui Yan
Title: Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search
Abstract:
Session search involves a series of interactive queries and actions to fulfill a user's complex information need. Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions. While some approaches focus on capturing structural information, they use a generalized representation for documents, neglecting word-level semantic modeling. In this paper, we propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches by leveraging the power of recent Large Language Models (LLMs). Concretely, we first introduce a set of symbolic grammar rules to convert the session graph into text. This allows integrating session history, interaction process, and task instruction seamlessly as inputs for the LLM. Moreover, given the natural discrepancy between LLMs pre-trained on textual corpora and the symbolic language we produce using our graph-to-text grammar, our objective is to enhance LLMs' ability to capture graph structures within a textual format. To achieve this, we introduce a set of self-supervised symbolic learning tasks including link prediction, node content generation, and generative contrastive learning, to enable LLMs to capture the topological information from coarse-grained to fine-grained. Experimental results and comprehensive analysis on two benchmark datasets, AOL and Tiangong-ST, confirm the superiority of our approach. Our paradigm also offers a novel and effective methodology that bridges the gap between traditional search strategies and modern LLMs.
中文摘要:本文提出的符号图排序器(SGR)通过符号语法规则将图结构与大型语言模型相结合,并采用自监督学习增强模型对拓扑信息的理解,在基准测试中表现优越,为传统搜索策略与现代大语言模型的融合提供了创新方法。
English Summary: The proposed Symbolic Graph Ranker (SGR) integrates graph structures with large language models through symbolic grammar rules and self-supervised learning, effectively bridging traditional search methods with modern AI capabilities while demonstrating superior performance on benchmark datasets.

Authors:Pengcheng Jiang, Xueqiang Xu, Jiacheng Lin, Jinfeng Xiao, Zifeng Wang, Jimeng Sun, Jiawei Han
Title: s3: You Don't Need That Much Data to Train a Search Agent via RL
Abstract:
Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility, or fine-tune the entire LLM to jointly reason and retrieve, entangling retrieval with generation and limiting the real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.
中文: 提出的s3框架将检索与生成解耦,通过基于生成准确率提升的奖励机制训练轻量级检索器,仅需少量训练数据即可在多个基准测试中实现更优性能。
English: The proposed s3 framework decouples retrieval from generation in RAG systems, training a lightweight searcher using a reward based on generation accuracy improvement over naive RAG, achieving superior performance with minimal training data across multiple benchmarks.
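The Gain Beyond RAG reward is defined in this abstract as the improvement in generation accuracy over naive RAG. A minimal sketch of that difference follows; the `generate` and `score` callables, the toy generator, and the exact-match scorer are hypothetical stand-ins, not the s3 codebase.

```python
def gain_beyond_rag(generate, score, question, gold, searcher_docs, naive_docs):
    """Reward = accuracy with the trained searcher's documents
    minus accuracy with naive-RAG documents, using the same frozen generator."""
    acc_searcher = score(generate(question, searcher_docs), gold)
    acc_naive = score(generate(question, naive_docs), gold)
    return acc_searcher - acc_naive


def toy_generate(question, docs):
    """Hypothetical frozen generator: answers 'Paris' only if a document mentions it."""
    return "Paris" if any("Paris" in doc for doc in docs) else "unknown"


def exact_match(prediction, gold):
    """Hypothetical accuracy metric: 1.0 on exact match, else 0.0."""
    return 1.0 if prediction == gold else 0.0


if __name__ == "__main__":
    reward = gain_beyond_rag(
        toy_generate, exact_match,
        question="What is the capital of France?",
        gold="Paris",
        searcher_docs=["Paris is the capital of France."],
        naive_docs=["France is a country in Europe."],
    )
    print(reward)  # 1.0
```

Because the reward is a difference, documents that merely match what naive retrieval already achieves earn zero, which is what ties the searcher's training signal to downstream utility.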

Authors:Yu Cui, Feng Liu, Jiawei Chen, Xingyu Lou, Changwang Zhang, Jun Wang, Yuegang Sun, Xiaohu Yang, Can Wang
Title: Field Matters: A lightweight LLM-enhanced Method for CTR Prediction
Abstract:
Click-through rate (CTR) prediction is a fundamental task in modern recommender systems. In recent years, the integration of large language models (LLMs) has been shown to effectively enhance the performance of traditional CTR methods. However, existing LLM-enhanced methods often require extensive processing of detailed textual descriptions for large-scale instances or user/item entities, leading to substantial computational overhead. To address this challenge, this work introduces LLaCTR, a novel and lightweight LLM-enhanced CTR method that employs a field-level enhancement paradigm. Specifically, LLaCTR first utilizes LLMs to distill crucial and lightweight semantic knowledge from small-scale feature fields through self-supervised field-feature fine-tuning. Subsequently, it leverages this field-level semantic knowledge to enhance both feature representation and feature interactions. In our experiments, we integrate LLaCTR with six representative CTR models across four datasets, demonstrating its superior performance in terms of both effectiveness and efficiency compared to existing LLM-enhanced methods. Our code is available at https://anonymous.4open.science/r/LLaCTR-EC46.
中文:LLaCTR是一种轻量级方法,通过利用大语言模型从特征字段中提取语义知识来增强点击率预测,在提升特征表示和交互的同时实现了更优的效率和性能。
English: LLaCTR is a lightweight method that enhances CTR prediction by using LLMs to distill semantic knowledge from feature fields, improving both feature representation and interactions with superior efficiency and effectiveness.

Authors:Hao Dong, Ziyue Qiao, Zhiyuan Ning, Qi Hao, Yi Du, Pengyang Wang, Yuanchun Zhou
Title: Disentangled Multi-span Evolutionary Network against Temporal Knowledge Graph Reasoning
Abstract:
Temporal Knowledge Graphs (TKGs), as an extension of static Knowledge Graphs (KGs), incorporate the temporal feature to express the transience of knowledge by describing when facts occur. TKG extrapolation aims to infer possible future facts based on known history, which has garnered significant attention in recent years. Some existing methods treat TKG as a sequence of independent subgraphs to model temporal evolution patterns, demonstrating impressive reasoning performance. However, they still have limitations: 1) In modeling subgraph semantic evolution, they usually neglect the internal structural interactions between subgraphs, which are actually crucial for encoding TKGs. 2) They overlook the potential smooth features that do not lead to semantic changes, which should be distinguished from the semantic evolution process. Therefore, we propose a novel Disentangled Multi-span Evolutionary Network (DiMNet) for TKG reasoning. Specifically, we design a multi-span evolution strategy that captures local neighbor features while perceiving historical neighbor semantic information, thus enabling internal interactions between subgraphs during the evolution process. To maximize the capture of semantic change patterns, we design a disentangle component that adaptively separates nodes' active and stable features, used to dynamically control the influence of historical semantics on future evolution. Extensive experiments conducted on four real-world TKG datasets show that DiMNet demonstrates substantial performance in TKG reasoning, and outperforms the state of the art by up to 22.7% in MRR.
中文: 时序知识图谱外推通过建模时序演化来推断未来事实,提出的DiMNet通过多跨度演化和特征解耦增强推理能力,实现了领先的性能表现。
English: Temporal Knowledge Graph extrapolation infers future facts by modeling temporal evolution, and the proposed DiMNet enhances reasoning through multi-span evolution and feature disentanglement, achieving state-of-the-art performance.

Authors:Yasi Zhang, Tianyu Chen, Zhendong Wang, Ying Nian Wu, Mingyuan Zhou, Oscar Leong
Title: Restoration Score Distillation: From Corrupted Diffusion Pretraining to One-Step High-Quality Generation
Abstract:
Learning generative models from corrupted data is a fundamental yet persistently challenging task across scientific disciplines, particularly when access to clean data is limited or expensive. Denoising Score Distillation (DSD) \cite{chen2025denoising} recently introduced a novel and surprisingly effective strategy that leverages score distillation to train high-fidelity generative models directly from noisy observations. Building upon this foundation, we propose Restoration Score Distillation (RSD), a principled generalization of DSD that accommodates a broader range of corruption types, such as blurred, incomplete, or low-resolution images. RSD operates by first pretraining a teacher diffusion model solely on corrupted data and subsequently distilling it into a single-step generator that produces high-quality reconstructions. Empirically, RSD consistently surpasses its teacher model across diverse restoration tasks on both natural and scientific datasets. Moreover, beyond standard diffusion objectives, the RSD framework is compatible with several corruption-aware training techniques such as Ambient Tweedie, Ambient Diffusion, and its Fourier-space variant, enabling flexible integration with recent advances in diffusion modeling. Theoretically, we demonstrate that in a linear regime, RSD recovers the eigenspace of the clean data covariance matrix from linear measurements, thereby serving as an implicit regularizer. This interpretation recasts score distillation not only as a sampling acceleration technique but as a principled approach to enhancing generative performance in severely degraded data regimes.
中文:修复分数蒸馏(RSD)通过直接在模糊或不完整等损坏数据上训练生成模型,扩展了去噪分数蒸馏方法,在多种修复任务中超越教师模型,并从理论上实现了对干净数据结构的恢复。
English: Restoration Score Distillation (RSD) extends Denoising Score Distillation by training generative models directly on corrupted data like blurred or incomplete images, outperforming teacher models across restoration tasks while theoretically recovering clean data structures.

Authors:Zhichen Zeng, Ruizhong Qiu, Wenxuan Bao, Tianxin Wei, Xiao Lin, Yuchen Yan, Tarek F. Abdelzaher, Jiawei Han, Hanghang Tong
Title: Pave Your Own Path: Graph Gradual Domain Adaptation on Fused Gromov-Wasserstein Geodesics
Abstract:
Graph neural networks, despite their impressive performance, are highly vulnerable to distribution shifts on graphs. Existing graph domain adaptation (graph DA) methods often implicitly assume a mild shift between source and target graphs, limiting their applicability to real-world scenarios with large shifts. Gradual domain adaptation (GDA) has emerged as a promising approach for addressing large shifts by gradually adapting the source model to the target domain via a path of unlabeled intermediate domains. Existing GDA methods exclusively focus on independent and identically distributed (IID) data with a predefined path, leaving their extension to non-IID graphs without a given path an open challenge. To bridge this gap, we present Gadget, the first GDA framework for non-IID graph data. First (theoretical foundation), the Fused Gromov-Wasserstein (FGW) distance is adopted as the domain discrepancy for non-IID graphs, based on which we derive an error bound revealing that the target domain error is proportional to the length of the path. Second (optimal path), guided by the error bound, we identify the FGW geodesic as the optimal path, which can be efficiently generated by our proposed algorithm. The generated path can be seamlessly integrated with existing graph DA methods to handle large shifts on graphs, improving state-of-the-art graph DA methods by up to 6.8% in node classification accuracy on real-world datasets.
中文: 图神经网络易受分布偏移影响,Gadget框架针对非独立同分布图数据,采用融合Gromov-Wasserstein距离构建最优适应路径,成功处理大幅偏移,将节点分类准确率最高提升6.8%。
English: Graph neural networks are vulnerable to distribution shifts, and the proposed Gadget framework addresses large shifts in non-IID graph data by using the Fused Gromov-Wasserstein distance to generate an optimal adaptation path, improving node classification accuracy by up to 6.8%.

Authors:Jinyuan Liu, Yuchen Sun, Yin Yang, Chenfanfu Jiang, Minchen Li, Bo Zhu
Title: Penetration-free Solid-Fluid Interaction on Shells and Rods
Abstract:
We introduce a novel approach to simulate the interaction between fluids and thin elastic solids without any penetration. Our approach is centered around an optimization system augmented with barriers, which aims to find a configuration that ensures the absence of penetration while enforcing incompressibility for the fluids and minimizing elastic potentials for the solids. Unlike previous methods that primarily focus on velocity coherence at the fluid-solid interfaces, we demonstrate the effectiveness and flexibility of explicitly resolving positional constraints, including both explicit representation of solid positions and the implicit representation of fluid level-set interface. To preserve the volume of the fluid, we propose a simple yet efficient approach that adjusts the associated level-set values. Additionally, we develop a distance metric capable of measuring the separation between an implicitly represented surface and a Lagrangian object of arbitrary codimension. By integrating the inertia, solid elastic potential, damping, barrier potential, and fluid incompressibility within a unified system, we are able to robustly simulate a wide range of processes involving fluid interactions with lower-dimensional objects such as shells and rods. These processes include topology changes, bouncing, splashing, sliding, rolling, floating, and more.
中文摘要:本文提出了一种基于屏障增强优化的新方法,用于模拟流体与薄弹性固体间的无穿透交互,通过统一系统有效处理位置约束和体积保持问题。
English Summary: This paper presents a barrier-augmented optimization method for simulating fluid-thin elastic solid interactions without penetration, effectively handling positional constraints and volume preservation through a unified system.

Authors:Yating Liu, Yujie Zhang, Qi Yang, Yiling Xu, Zhu Li, Ye-Kui Wang
Title: DPCD: A Quality Assessment Database for Dynamic Point Clouds
Abstract:
Recently, advancements in Virtual/Augmented Reality (VR/AR) have driven the demand for Dynamic Point Clouds (DPC). Unlike static point clouds, DPCs are capable of capturing temporal changes within objects or scenes, offering a more accurate simulation of the real world. While significant progress has been made in the quality assessment of static point clouds, little work has been done on Dynamic Point Cloud Quality Assessment (DPCQA), which hinders the development of quality-oriented applications, such as interframe compression and transmission in practical scenarios. In this paper, we introduce a large-scale DPCQA database, named DPCD, which includes 15 reference DPCs and 525 distorted DPCs from seven types of lossy compression and noise distortion. By rendering these samples to Processed Video Sequences (PVS), a comprehensive subjective experiment is conducted to obtain Mean Opinion Scores (MOS) from 21 viewers for analysis. The characteristics of the contents, the impact of various distortions, and the accuracy of the MOSs are presented to validate the heterogeneity and reliability of the proposed database. Furthermore, we evaluate the performance of several objective metrics on DPCD. The experimental results show that DPCQA is more challenging than static point cloud quality assessment. The DPCD, which serves as a catalyst for new research endeavors on DPCQA, is publicly available at https://huggingface.co/datasets/Olivialyt/DPCD.
中文: 随着VR/AR技术的发展,动态点云需求增长,但其质量评估研究尚不充分,为此建立的DPCD数据库通过主观实验验证了多种失真影响,为动态点云质量评估提供了可靠数据支持。
English: Recent VR/AR advancements have increased demand for Dynamic Point Clouds (DPC), yet research on their quality assessment remains limited, prompting the creation of the DPCD database to evaluate distortions and support future DPCQA studies.

Authors:Chi Zhang, Huaping Zhong, Hongtao Li, Chengliang Chai, Jiawei Hong, Yuhao Deng, Jiacheng Wang, Tian Tan, Yizhou Yan, Jiantao Qiu, Ye Yuan, Guoren Wang, Conghui He, Lei Cao
Title: Not All Documents Are What You Need for Extracting Instruction Tuning Data
Abstract:
Instruction tuning improves the performance of large language models (LLMs), but it heavily relies on high-quality training data. Recently, LLMs have been used to synthesize instruction data using seed question-answer (QA) pairs. However, these synthesized instructions often lack diversity and tend to be similar to the input seeds, limiting their applicability in real-world scenarios. To address this, we propose extracting instruction tuning data from web corpora that contain rich and diverse knowledge. A naive solution is to retrieve domain-specific documents and extract all QA pairs from them, but this faces two key challenges: (1) extracting all QA pairs using LLMs is prohibitively expensive, and (2) many extracted QA pairs may be irrelevant to the downstream tasks, potentially degrading model performance. To tackle these issues, we introduce EQUAL, an effective and scalable data extraction framework that iteratively alternates between document selection and high-quality QA pair extraction to enhance instruction tuning. EQUAL first clusters the document corpus based on embeddings derived from contrastive learning, then uses a multi-armed bandit strategy to efficiently identify clusters that are likely to contain valuable QA pairs. This iterative approach significantly reduces computational cost while boosting model performance. Experiments on AutoMathText and StackOverflow across four downstream tasks show that EQUAL reduces computational costs by 5-10x and improves accuracy by 2.5 percent on LLaMA-3.1-8B and Mistral-7B.
Chinese: 针对LLM合成指令数据多样性不足的问题,EQUAL框架通过迭代选择相关文档聚类并提取高质量问答对,从网络语料中高效获取指令调优数据,将计算成本降低5-10倍,并使模型准确率提升2.5%。
English: To overcome the limitations of low diversity in LLM-synthesized instruction data, the EQUAL framework efficiently extracts high-quality question-answer pairs from web corpora by iteratively selecting relevant document clusters and extracting valuable data, reducing computational costs by 5-10x and improving model accuracy by 2.5%.
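The abstract mentions a multi-armed bandit strategy for choosing document clusters but does not say which one. The sketch below uses the standard UCB1 rule as a stand-in, with a hypothetical `cluster_reward` callback that would, in a real pipeline, measure how useful the QA pairs sampled from a cluster turned out to be. It is an illustration of bandit-based cluster selection, not EQUAL's algorithm.

```python
import math


def ucb1_select(counts, totals, t, c=2.0):
    """Pick a cluster index: try every cluster once, then maximize
    mean reward + sqrt(c * ln(t) / pulls)."""
    for k, n in enumerate(counts):
        if n == 0:
            return k
    return max(
        range(len(counts)),
        key=lambda k: totals[k] / counts[k] + math.sqrt(c * math.log(t) / counts[k]),
    )


def run_bandit(cluster_reward, n_clusters, rounds):
    """Run UCB1 for `rounds` steps and return how often each cluster was chosen."""
    counts = [0] * n_clusters
    totals = [0.0] * n_clusters
    for t in range(1, rounds + 1):
        k = ucb1_select(counts, totals, t)
        counts[k] += 1
        totals[k] += cluster_reward(k)
    return counts


if __name__ == "__main__":
    # Hypothetical, deterministic usefulness of three clusters.
    usefulness = [0.1, 0.9, 0.2]
    print(run_bandit(lambda k: usefulness[k], n_clusters=3, rounds=200))
```

The exploration term keeps low-scoring clusters from being abandoned after a single unlucky sample, while most of the extraction budget goes to the cluster with the highest observed reward.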

Authors:Dong Yang, Yiyi Cai, Yuki Saito, Lixu Wang, Hiroshi Saruwatari
Title: Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis
Abstract:
We propose a shallow flow matching (SFM) mechanism to enhance flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. SFM constructs intermediate states along the FM paths using coarse output representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. The SFM inference starts from the intermediate state rather than pure noise and focuses computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments show that SFM consistently improves the naturalness of synthesized speech in both objective and subjective evaluations, while significantly reducing inference time when using adaptive-step ODE solvers. Demo and codes are available at https://ydqmkkx.github.io/SFMDemo/.
中文: 提出的浅层流匹配机制通过从中间状态而非纯噪声开始推理,有效提升语音合成的自然度并减少推理时间。
English: The proposed shallow flow matching (SFM) mechanism enhances text-to-speech models by starting inference from intermediate states rather than pure noise, improving speech naturalness while reducing inference time.

Authors:Chao Huang, Ruohan Gao, J. M. F. Tsang, Jan Kurcius, Cagdas Bilen, Chenliang Xu, Anurag Kumar, Sanjeel Parekh
Title: Learning to Highlight Audio by Watching Movies
Abstract:
Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process -- separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: https://wikichao.github.io/VisAH/.
中文摘要:本文提出了一种基于Transformer的多模态框架,通过视频引导的音频高亮技术来改善视听协调性,利用新构建的数据集在定量和主观评估中均优于现有基线方法。
English Summary: The paper introduces a transformer-based framework for visually-guided acoustic highlighting to enhance audio-visual harmony by transforming audio based on video cues, using a novel dataset and outperforming existing methods.

Authors:Nicholas Carlini, Milad Nasr, Edoardo Debenedetti, Barry Wang, Christopher A. Choquette-Choo, Daphne Ippolito, Florian Tramèr, Matthew Jagielski
Title: LLMs unlock new paths to monetizing exploits
Abstract:
We argue that large language models (LLMs) will soon alter the economics of cyberattacks. Instead of attacking the most commonly used software and monetizing exploits by targeting the lowest common denominator among victims, LLMs enable adversaries to launch tailored attacks on a user-by-user basis. On the exploitation front, instead of human attackers manually searching for one difficult-to-identify bug in a product with millions of users, LLMs can find thousands of easy-to-identify bugs in products with thousands of users. And on the monetization front, instead of generic ransomware that always performs the same attack (encrypt all your data and request payment to decrypt), an LLM-driven ransomware attack could tailor the ransom demand based on the particular content of each exploited device. We show that these two attacks (and several others) are imminently practical using state-of-the-art LLMs. For example, we show that without any human intervention, an LLM finds highly sensitive personal information in the Enron email dataset (e.g., an executive having an affair with another employee) that could be used for blackmail. While some of our attacks are still too expensive to scale widely today, the incentives to implement these attacks will only increase as LLMs get cheaper. Thus, we argue that LLMs create a need for new defense-in-depth approaches.
中文: 大型语言模型将改变网络攻击的经济模式,使其能够进行个性化大规模攻击和定制化勒索,从而迫切需要新的纵深防御方法。
English: Large language models are poised to revolutionize cyberattacks by enabling personalized, large-scale exploitation and tailored ransom demands, necessitating new defense strategies.

Authors:Mafalda Malafaia, Thalea Schlender, Tanja Alderliesten, Peter A. N. Bosman
Title: A Step towards Interpretable Multimodal AI Models with MultiFIX
Abstract:
Real-world problems are often dependent on multiple data modalities, making multimodal fusion essential for leveraging diverse information sources. In high-stakes domains, such as in healthcare, understanding how each modality contributes to the prediction is critical to ensure trustworthy and interpretable AI models. We present MultiFIX, an interpretability-driven multimodal data fusion pipeline that explicitly engineers distinct features from different modalities and combines them to make the final prediction. Initially, only deep learning components are used to train a model from data. The black-box (deep learning) components are subsequently either explained using post-hoc methods such as Grad-CAM for images or fully replaced by interpretable blocks, namely symbolic expressions for tabular data, resulting in an explainable model. We study the use of MultiFIX using several training strategies for feature extraction and predictive modeling. Besides highlighting strengths and weaknesses of MultiFIX, experiments on a variety of synthetic datasets with varying degrees of interaction between modalities demonstrate that MultiFIX can generate multimodal models that can be used to accurately explain both the extracted features and their integration without compromising predictive performance.
中文摘要:MultiFIX是一种可解释的多模态融合流程,通过事后解释方法或符号表达式替换黑盒深度学习组件,在保持预测性能的同时实现对多模态特征及其整合过程的准确解释。
English Summary: MultiFIX is an interpretable multimodal fusion pipeline that transforms black-box deep learning components into explainable models through post-hoc explanations or symbolic replacements while maintaining predictive accuracy across diverse data interactions.

Authors:Yunkang Cao, Yuqi Cheng, Xiaohao Xu, Yiheng Zhang, Yihan Sun, Yuxiang Tan, Yuxin Zhang, Xiaonan Huang, Weiming Shen
Title: Visual Anomaly Detection under Complex View-Illumination Interplay: A Large-Scale Benchmark
Abstract:
The practical deployment of Visual Anomaly Detection (VAD) systems is hindered by their sensitivity to real-world imaging variations, particularly the complex interplay between viewpoint and illumination which drastically alters defect visibility. Current benchmarks largely overlook this critical challenge. We introduce Multi-View Multi-Illumination Anomaly Detection (M2AD), a new large-scale benchmark comprising 119,880 high-resolution images designed explicitly to probe VAD robustness under such interacting conditions. By systematically capturing 999 specimens across 10 categories using 12 synchronized views and 10 illumination settings (120 configurations total), M2AD enables rigorous evaluation. We establish two evaluation protocols: M2AD-Synergy tests the ability to fuse information across diverse configurations, and M2AD-Invariant measures single-image robustness against realistic view-illumination effects. Our extensive benchmarking shows that state-of-the-art VAD methods struggle significantly on M2AD, demonstrating the profound challenge posed by view-illumination interplay. This benchmark serves as an essential tool for developing and validating VAD methods capable of overcoming real-world complexities. Our full dataset and test suite will be released at https://hustcyq.github.io/M2AD to facilitate the field.
Chinese: M2AD基准通过12种视角和10种光照组合生成的119,880张高清图像,首次系统揭示了视觉异常检测方法在真实成像变化下的脆弱性,为攻克视角-光照交互难题提供了关键评估工具。
English: The M2AD benchmark addresses the critical gap in visual anomaly detection by introducing 119,880 high-resolution images across 120 view-illumination configurations, revealing significant performance drops in state-of-the-art methods under realistic imaging variations.
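The dataset figures quoted in this abstract are mutually consistent under the assumption that every specimen is captured exactly once per view-illumination configuration. A quick arithmetic check (the function name is only illustrative):

```python
def total_images(specimens, views, illuminations):
    """Image count if each specimen is captured once per (view, illumination) pair."""
    return specimens * views * illuminations


if __name__ == "__main__":
    print(12 * 10)                    # 120 configurations
    print(total_images(999, 12, 10))  # 119880 images
```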

Authors:Shuchen Guo, Yun Wang, Jichao Yu, Xuansheng Wu, Bilgehan Ayik, Field M. Watts, Ehsan Latif, Ninghao Liu, Lei Liu, Xiaoming Zhai
Title: Artificial Intelligence Bias on English Language Learners in Automatic Scoring
Abstract:
This study investigated potential scoring biases and disparities toward English Language Learners (ELLs) when using automatic scoring systems for middle school students' written responses to science assessments. We specifically focus on examining how unbalanced training data with ELLs contributes to scoring bias and disparities. We fine-tuned BERT with four datasets: responses from (1) ELLs, (2) non-ELLs, (3) a mixed dataset reflecting the real-world proportion of ELLs and non-ELLs (unbalanced), and (4) a balanced mixed dataset with equal representation of both groups. The study analyzed 21 assessment items: 10 items with about 30,000 ELL responses, five items with about 1,000 ELL responses, and six items with about 200 ELL responses. Scoring accuracy (Acc) was calculated and compared to identify bias using Friedman tests. We measured the Mean Score Gaps (MSGs) between ELLs and non-ELLs and then calculated the differences in MSGs generated through both the human and AI models to identify the scoring disparities. We found no AI bias or distorted disparities between ELLs and non-ELLs when the training dataset was large enough (ELL = 30,000 and ELL = 1,000), but concerns could exist if the sample size is limited (ELL = 200).
中文: 本研究发现,当训练数据集足够大时,科学评估的自动评分系统不会对英语学习者产生偏见或扭曲的评分差异,但在样本量较小时可能存在潜在问题。
English: This study found that automatic scoring systems for science assessments do not exhibit bias or distorted disparities against English Language Learners when trained on sufficiently large datasets, but potential concerns arise with smaller sample sizes.
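The Mean Score Gap comparison described in this abstract can be sketched as follows. The sign conventions (non-ELL minus ELL, AI minus human) and the dictionary layout are assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean


def mean_score_gap(non_ell_scores, ell_scores):
    """MSG: mean non-ELL score minus mean ELL score (assumed sign convention)."""
    return mean(non_ell_scores) - mean(ell_scores)


def msg_difference(human, ai):
    """AI-scored MSG minus human-scored MSG. A value near zero suggests the AI
    reproduces the gap seen in human scoring without widening or narrowing it."""
    ai_gap = mean_score_gap(ai["non_ell"], ai["ell"])
    human_gap = mean_score_gap(human["non_ell"], human["ell"])
    return ai_gap - human_gap


if __name__ == "__main__":
    # Made-up scores, used only to exercise the functions.
    human = {"non_ell": [2, 4], "ell": [2, 2]}  # human gap = 1
    ai = {"non_ell": [3, 4], "ell": [1, 2]}     # AI gap = 2
    print(msg_difference(human, ai))
```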

Authors:Xuebo Ji, Zherong Pan, Xifeng Gao, Lei Yang, Xinxin Du, Kaiyun Li, Yongjin Liu, Wenping Wang, Changhe Tu, Jia Pan
Title: Internal State Estimation in Groups via Active Information Gathering
Abstract:
Accurately estimating human internal states, such as personality traits or behavioral patterns, is critical for enhancing the effectiveness of human-robot interaction, particularly in group settings. These insights are key in applications ranging from social navigation to autism diagnosis. However, prior methods are limited by scalability and passive observation, making real-time estimation in complex, multi-human settings difficult. In this work, we propose a practical method for active human personality estimation in groups, with a focus on applications related to Autism Spectrum Disorder (ASD). Our method combines a personality-conditioned behavior model, based on the Eysenck 3-Factor theory, with an active robot information gathering policy that triggers human behaviors through a receding-horizon planner. The robot's belief about human personality is then updated via Bayesian inference. We demonstrate the effectiveness of our approach through simulations, user studies with typical adults, and preliminary experiments involving participants with ASD. Our results show that our method can scale to tens of humans and reduce personality prediction error by 29.2% and uncertainty by 79.9% in simulation. User studies with typical adults confirm the method's ability to generalize across complex personality distributions. Additionally, we explore its application in autism-related scenarios, demonstrating that the method can identify the difference between neurotypical and autistic behavior, highlighting its potential for diagnosing ASD. The results suggest that our framework could serve as a foundation for future ASD-specific interventions.
中文: 本研究提出了一种主动式机器人方法,用于群体环境中的实时人格特质估计,该方法结合人格条件行为模型与贝叶斯推理,显著降低了预测误差和不确定性,并在自闭症谱系障碍诊断中展现出应用潜力。
English: This study introduces an active robot-based method for real-time personality estimation in group settings, combining a personality-conditioned behavior model with Bayesian inference to significantly reduce prediction error and uncertainty, and showing preliminary potential for Autism Spectrum Disorder diagnosis and intervention.
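
The Bayesian belief update mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a discrete set of two hypothetical personality hypotheses (the paper uses the continuous Eysenck 3-Factor model), and the likelihood values are made up for illustration.

```python
def bayes_update(belief, likelihoods):
    """Return the posterior over hypotheses after one observation.

    belief:      dict mapping hypothesis -> prior probability
    likelihoods: dict mapping hypothesis -> P(observation | hypothesis)
    """
    unnormalized = {h: belief[h] * likelihoods[h] for h in belief}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical example: the robot triggers an interaction and observes
# that the person approaches it.
belief = {"extravert": 0.5, "introvert": 0.5}
likelihoods = {"extravert": 0.8, "introvert": 0.2}  # assumed values
posterior = bayes_update(belief, likelihoods)
```

An active policy would then choose the next robot action whose expected observation reduces the uncertainty of `posterior` the most.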

Authors:Yumin Choi, Jinheon Baek, Sung Ju Hwang
Title: System Prompt Optimization with Meta-Learning
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.
中文摘要:本研究提出了双层系统提示优化方法,通过元学习框架开发适用于不同任务和领域的系统提示,实验证明该方法能有效提升模型性能并减少优化步骤。
English Summary: This study introduces bilevel system prompt optimization to create robust and transferable system prompts for LLMs, using a meta-learning framework that enhances performance across diverse tasks and domains with fewer optimization steps.

Authors:Xiangyuan Peng, Yu Wang, Miao Tang, Bierzynski Kay, Lorenzo Servadei, Robert Wille
Title: MoRAL: Motion-aware Multi-Frame 4D Radar and LiDAR Fusion for Robust 3D Object Detection
Abstract:
Reliable autonomous driving systems require accurate detection of traffic participants. To this end, multi-modal fusion has emerged as an effective strategy. In particular, 4D radar and LiDAR fusion methods based on multi-frame radar point clouds have demonstrated their effectiveness in bridging the point density gap. However, they often neglect radar point clouds' inter-frame misalignment caused by object movement during accumulation and do not fully exploit the object dynamic information from 4D radar. In this paper, we propose MoRAL, a motion-aware multi-frame 4D radar and LiDAR fusion framework for robust 3D object detection. First, a Motion-aware Radar Encoder (MRE) is designed to compensate for inter-frame radar misalignment from moving objects. Then, a Motion Attention Gated Fusion (MAGF) module integrates radar motion features to guide LiDAR features to focus on dynamic foreground objects. Extensive evaluations on the View-of-Delft (VoD) dataset demonstrate that MoRAL outperforms existing methods, achieving the highest mAP of 73.30% in the entire area and 88.68% in the driving corridor. Notably, our method also achieves the best AP of 69.67% for pedestrians in the entire area and 96.25% for cyclists in the driving corridor.
中文: 本文提出MoRAL,一种运动感知的4D雷达与激光雷达融合框架,通过补偿帧间错位并利用物体动态信息来提升3D检测性能,在VoD数据集上实现了最优效果。
English: This paper introduces MoRAL, a motion-aware fusion framework for 4D radar and LiDAR that compensates for inter-frame misalignment and leverages object dynamics to enhance 3D detection, achieving state-of-the-art performance on the VoD dataset.

Authors:Denis Donadel, Kavya Balasubramanian, Alessandro Brighente, Bhaskar Ramasubramanian, Mauro Conti, Radha Poovendran
Title: CANTXSec: A Deterministic Intrusion Detection and Prevention System for CAN Bus Monitoring ECU Activations
Abstract:
Despite being a legacy protocol with various known security issues, Controller Area Network (CAN) still represents the de-facto standard for communications within vehicles, ships, and industrial control systems. Many research works have designed Intrusion Detection Systems (IDSs) to identify attacks by training machine learning classifiers on bus traffic or its properties. Actions to take after detection are, on the other hand, less investigated, and prevention mechanisms usually include protocol modification (e.g., adding authentication). An effective solution has yet to be implemented on a large scale in the wild. The reasons are related to the effort to handle sporadic false positives, the inevitable delay introduced by authentication, and the closed-source automobile environment that does not easily permit modifying Electronic Control Units (ECUs) software. In this paper, we propose CANTXSec, the first deterministic Intrusion Detection and Prevention system based on physical ECU activations. It employs a new classification of attacks based on the attacker's need in terms of access level to the bus, distinguishing between Frame Injection Attacks (FIAs) (i.e., using frame-level access) and Single-Bit Attacks (SBAs) (i.e., employing bit-level access). CANTXSec detects and prevents classical attacks in the CAN bus, while detecting advanced attacks that have been less investigated in the literature. We prove the effectiveness of our solution on a physical testbed, where we achieve 100% detection accuracy in both classes of attacks while preventing 100% of FIAs. Moreover, to encourage developers to employ CANTXSec, we discuss implementation details, providing an analysis based on each user's risk assessment.
中文摘要:尽管CAN总线存在已知安全缺陷且现有防护方案存在实施困难,CANTXSec作为首个基于ECU物理激活的确定性入侵检测防御系统,在物理测试中实现了对帧注入攻击100%的拦截率和所有攻击的精准检测,同时提供了风险评估实施方案。
English Summary: Despite CAN's known vulnerabilities and the limitations of existing security measures, CANTXSec introduces a deterministic intrusion detection and prevention system based on physical ECU activations that, on a physical testbed, detects 100% of both frame injection and single-bit attacks and prevents 100% of frame injection attacks.

Authors:Yuhan Liu, Yuxuan Liu, Xiaoqing Zhang, Xiuying Chen, Rui Yan
Title: The Truth Becomes Clearer Through Debate! Multi-Agent Systems with Large Language Models Unmask Fake News
Abstract:
In today's digital environment, the rapid propagation of fake news via social networks poses significant social challenges. Most existing detection methods either employ traditional classification models, which suffer from low interpretability and limited generalization capabilities, or craft specific prompts for large language models (LLMs) to produce explanations and results directly, failing to leverage LLMs' reasoning abilities fully. Inspired by the saying that "truth becomes clearer through debate," our study introduces a novel multi-agent system with LLMs named TruEDebate (TED) to enhance the interpretability and effectiveness of fake news detection. TED employs a rigorous debate process inspired by formal debate settings. Central to our approach are two innovative components: the DebateFlow Agents and the InsightFlow Agents. The DebateFlow Agents organize agents into two teams, where one supports and the other challenges the truth of the news. These agents engage in opening statements, cross-examination, rebuttal, and closing statements, simulating a rigorous debate process akin to human discourse analysis, allowing for a thorough evaluation of news content. Concurrently, the InsightFlow Agents consist of two specialized sub-agents: the Synthesis Agent and the Analysis Agent. The Synthesis Agent summarizes the debates and provides an overarching viewpoint, ensuring a coherent and comprehensive evaluation. The Analysis Agent, which includes a role-aware encoder and a debate graph, integrates role embeddings and models the interactions between debate roles and arguments using an attention mechanism, providing the final judgment.
中文摘要:本研究提出TruEDebate(TED)系统,通过模拟正反方辩论的多智能体框架,利用大语言模型提升假新闻检测的可解释性和检测效果。
English Summary: This study introduces TruEDebate (TED), a multi-agent system using large language models that enhances fake news detection through simulated debates between supporting and challenging teams, improving both interpretability and effectiveness.

Authors:Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Title: Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments
Abstract:
State-space models (SSMs), particularly the Mamba architecture, have emerged as powerful alternatives to Transformers for sequence modeling, offering linear-time complexity and competitive performance across diverse tasks. However, their large parameter counts pose significant challenges for deployment in resource-constrained environments. We propose a novel unstructured pruning framework tailored for Mamba models that achieves up to 70% parameter reduction while retaining over 95% of the original performance. Our approach integrates three key innovations: (1) a gradient-aware magnitude pruning technique that combines weight magnitude and gradient information to identify less critical parameters, (2) an iterative pruning schedule that gradually increases sparsity to maintain model stability, and (3) a global pruning strategy that optimizes parameter allocation across the entire model. Through extensive experiments on WikiText-103, Long Range Arena, and ETT time-series benchmarks, we demonstrate significant efficiency gains with minimal performance degradation. Our analysis of pruning effects on Mamba's components reveals critical insights into the architecture's redundancy and robustness, enabling practical deployment in resource-constrained settings while broadening Mamba's applicability.
中文: 针对Mamba模型提出的非结构化剪枝框架通过梯度感知剪枝、迭代调度和全局优化,实现了高达70%的参数削减,同时保持超过95%的原始性能。
English: The proposed unstructured pruning framework for Mamba models reduces parameters by up to 70% while maintaining over 95% performance through gradient-aware pruning, iterative scheduling, and global optimization.
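
The three ingredients named in the abstract (a gradient-aware score, an iterative sparsity schedule, and a global threshold) can be sketched in plain Python. This is an illustrative sketch, not the paper's method: the score |w| * |g| and the linear schedule are assumptions, and the layer names and values are hypothetical.

```python
def importance(weights, grads):
    # Assumed score: product of weight magnitude and gradient magnitude.
    return [abs(w) * abs(g) for w, g in zip(weights, grads)]


def global_prune(layers, grads, sparsity):
    """Zero the lowest-scoring fraction of weights across ALL layers.

    layers, grads: dicts mapping layer name -> list of floats
    sparsity:      fraction of all weights to set to zero
    """
    scored = []
    for name in layers:
        for i, s in enumerate(importance(layers[name], grads[name])):
            scored.append((s, name, i))
    scored.sort(key=lambda t: t[0])  # one global ranking, not per layer
    n_prune = int(sparsity * len(scored))
    pruned = {name: list(ws) for name, ws in layers.items()}
    for _, name, i in scored[:n_prune]:
        pruned[name][i] = 0.0
    return pruned


def sparsity_schedule(final_sparsity, steps):
    # Assumed linear ramp toward the target sparsity.
    return [final_sparsity * (k + 1) / steps for k in range(steps)]


# Hypothetical two-layer model.
layers = {"a": [0.5, -0.1, 2.0, 0.05], "b": [1.0, -0.02]}
grads = {"a": [1.0, 1.0, 0.1, 2.0], "b": [0.5, 1.0]}
pruned = global_prune(layers, grads, sparsity=0.5)
schedule = sparsity_schedule(0.5, 2)
```

In an iterative setting one would call `global_prune` once per entry of `schedule` and fine-tune the remaining weights between calls.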

Authors:Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Title: Cache-Efficient Posterior Sampling for Reinforcement Learning with LLM-Derived Priors Across Discrete and Continuous Domains
Abstract:
Integrating large language models (LLMs) as priors in reinforcement learning (RL) offers significant advantages but comes with substantial computational costs. We present a principled cache-efficient framework for posterior sampling with LLM-derived priors that dramatically reduces these costs while maintaining high performance. At the core of our approach is an adaptive caching mechanism, where cache parameters are meta-optimized using surrogate gradients derived from policy performance. This design enables efficient inference across both discrete text environments (e.g., TextWorld, ALFWorld) and continuous control domains (e.g., MuJoCo), achieving a 3.8-4.7× reduction in LLM queries and 4.0-12.0× lower median latencies (85-93 ms on a consumer GPU) while retaining 96-98% of uncached performance. Our theoretical analysis provides KL divergence bounds on approximation quality, validated empirically. The framework extends to offline RL, where our CQL-Prior variant improves performance by 14-29% and reduces training time by 38-40%. Extensive evaluations across a diverse suite of eight tasks demonstrate the generalizability and practical viability of LLM-guided RL in resource-constrained settings.
中文: 本文提出了一种缓存优化框架,在保持高性能的同时大幅降低大语言模型引导强化学习的计算成本,并在多种任务和领域中验证了其有效性。
English: This paper introduces a cache-efficient framework that significantly reduces computational costs for LLM-guided reinforcement learning while maintaining high performance, validated across diverse tasks and domains.
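
The idea of caching LLM-derived priors so that nearby states reuse one query can be sketched as follows. This is a simplified illustration, not the paper's adaptive mechanism: the `PriorCache` class, the fixed `resolution` parameter (which the paper would instead meta-optimize), and the stand-in `fake_llm_prior` function are all hypothetical.

```python
class PriorCache:
    """Cache prior lookups keyed by a coarsely discretized state."""

    def __init__(self, query_fn, resolution=0.1):
        self.query_fn = query_fn      # expensive call, e.g. an LLM query
        self.resolution = resolution  # bucket width; assumed fixed here
        self.store = {}
        self.queries = 0

    def _key(self, state):
        # States falling in the same bucket share one cached prior.
        return tuple(round(x / self.resolution) for x in state)

    def prior(self, state):
        key = self._key(state)
        if key not in self.store:
            self.queries += 1
            self.store[key] = self.query_fn(state)
        return self.store[key]


def fake_llm_prior(state):
    # Stand-in for a real LLM call; returns a made-up action preference.
    return "left" if state[0] < 0.5 else "right"


cache = PriorCache(fake_llm_prior, resolution=0.1)
results = [cache.prior(s) for s in [(0.11, 0.52), (0.12, 0.49), (0.9, 0.9)]]
```

Here the first two states land in the same bucket, so three lookups cost two underlying queries. A coarser `resolution` saves more queries at the price of a less precise prior, which is the trade-off an adaptive scheme would tune.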

Authors:Xia Du, Jiajie Zhu, Jizhe Zhou, Chi-man Pun, Zheng Lin, Cong Wu, Zhe Chen, Jun Luo
Title: DP-TRAE: A Dual-Phase Merging Transferable Reversible Adversarial Example for Image Privacy Protection
Abstract:
In the field of digital security, Reversible Adversarial Examples (RAE) combine adversarial attacks with reversible data hiding techniques to effectively protect sensitive data and prevent unauthorized analysis by malicious Deep Neural Networks (DNNs). However, existing RAE techniques primarily focus on white-box attacks, lacking a comprehensive evaluation of their effectiveness in black-box scenarios. This limitation impedes their broader deployment in complex, dynamic environments. Furthermore, traditional black-box attacks are often characterized by poor transferability and high query costs, significantly limiting their practical applicability. To address these challenges, we propose the Dual-Phase Merging Transferable Reversible Attack method, which generates highly transferable initial adversarial perturbations in a white-box model and employs a memory-augmented black-box strategy to effectively mislead target models. Experimental results demonstrate the superiority of our approach, achieving a 99.0% attack success rate and 100% recovery rate in black-box scenarios, highlighting its robustness in privacy protection. Moreover, we successfully implemented a black-box attack on a commercial model, further substantiating the potential of this approach for practical use.
中文摘要:本研究提出的双阶段融合可逆攻击方法通过白盒模型生成高迁移性扰动并采用记忆增强黑盒策略,有效解决了现有可逆对抗技术在黑盒场景中的局限性,在黑盒环境中实现了99%的攻击成功率和100%恢复率。
English Summary: The proposed Dual-Phase Merging Transferable Reversible Attack method overcomes limitations in existing reversible adversarial techniques by generating highly transferable perturbations through white-box models and employing memory-augmented black-box strategies, achieving 99% attack success and 100% recovery rates in black-box scenarios.

Authors:Hao Xu, Yuntian Chen, Rui Cao, Tianning Tang, Mengge Du, Jian Li, Adrian H. Callaghan, Dongxiao Zhang
Title: Generative Discovery of Partial Differential Equations by Learning from Math Handbooks
Abstract:
Data-driven discovery of partial differential equations (PDEs) is a promising approach for uncovering the underlying laws governing complex systems. However, purely data-driven techniques face the dilemma of balancing search space with optimization efficiency. This study introduces a knowledge-guided approach that incorporates existing PDEs documented in a mathematical handbook to facilitate the discovery process. These PDEs are encoded as sentence-like structures composed of operators and basic terms, and used to train a generative model, called EqGPT, which enables the generation of free-form PDEs. A generation-evaluation-optimization loop is constructed to autonomously identify the most suitable PDE. Experimental results demonstrate that this framework can recover a variety of PDE forms with high accuracy and computational efficiency, particularly in cases involving complex temporal derivatives or intricate spatial terms, which are often beyond the reach of conventional methods. The approach also exhibits generalizability to irregular spatial domains and higher-dimensional settings. Notably, it succeeds in discovering a previously unreported PDE governing strongly nonlinear surface gravity waves propagating toward breaking, based on real-world experimental data, highlighting its applicability to practical scenarios and its potential to support scientific discovery.
中文: 本研究提出了一种知识引导的框架EqGPT,通过整合已有偏微分方程来高效生成和优化自由形式方程,在恢复复杂偏微分方程方面表现出高精度,并利用真实数据成功识别出非线性表面重力波的新方程。
English: This study introduces a knowledge-guided framework, EqGPT, which integrates documented PDEs to efficiently generate and optimize free-form equations, demonstrating high accuracy in recovering complex PDEs and successfully identifying a novel equation for nonlinear surface gravity waves using real-world data.

Authors:Ming Li, Lin Li, Xiaohui Tao, Dong Zhang, Jimmy Xiangji Huang
Title: Divide-and-Conquer: Cold-Start Bundle Recommendation via Mixture of Diffusion Experts
Abstract:
Cold-start bundle recommendation focuses on modeling new bundles with insufficient information to provide recommendations. Advanced bundle recommendation models usually learn bundle representations from multiple views (e.g., interaction view) at both the bundle and item levels. Consequently, the cold-start problem for bundles is more challenging than that for traditional items due to the dual-level multi-view complexity. In this paper, we propose a novel Mixture of Diffusion Experts (MoDiffE) framework, which employs a divide-and-conquer strategy for cold-start bundle recommendation and follows three steps: (1) Divide: The bundle cold-start problem is divided into independent but similar sub-problems sequentially by level and view, which can be summarized as the poor representation of feature-missing bundles in prior-embedding models. (2) Conquer: Beyond prior-embedding models that fundamentally provide the embedded representations, we introduce a diffusion-based method to solve all sub-problems in a unified way, which directly generates diffusion representations using diffusion models without depending on specific features. (3) Combine: A cold-aware hierarchical Mixture of Experts (MoE) is employed to combine results of the sub-problems for final recommendations, where the two models for each view serve as experts and are adaptively fused for different bundles in a multi-layer manner. Additionally, MoDiffE adopts a multi-stage decoupled training pipeline and introduces a cold-start gating augmentation method to enable the training of gating for cold bundles. Through extensive experiments on three real-world datasets, we demonstrate that MoDiffE significantly outperforms existing solutions in handling cold-start bundle recommendation. It achieves up to a 0.1027 absolute gain in Recall@20 in cold-start scenarios and up to a 47.43% relative improvement in all-bundle scenarios.
Chinese: 本文提出了混合扩散专家(MoDiffE)框架,通过分治策略处理冷启动捆绑推荐问题,利用扩散模型生成表示并自适应融合多视角结果,在实验中显著提升了推荐性能。
English: This paper introduces the Mixture of Diffusion Experts (MoDiffE) framework, which uses a divide-and-conquer approach to enhance cold-start bundle recommendations by generating diffusion representations and adaptively combining multi-view results, achieving significant performance improvements in experiments.
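
The "Combine" step, in which expert outputs are adaptively fused per bundle, can be illustrated with a generic softmax-gated mixture. This is a textbook Mixture of Experts fusion under assumed inputs, not the paper's cold-aware hierarchical design; the vectors and gate logits below are hypothetical.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(expert_vectors, gate_logits):
    """Weighted sum of expert representations using softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_vectors[0])
    return [
        sum(w * v[d] for w, v in zip(weights, expert_vectors))
        for d in range(dim)
    ]


# Hypothetical experts for one view: a prior-embedding representation and
# a diffusion-generated representation of the same bundle.
prior_embedding = [1.0, 0.0]
diffusion_embedding = [0.0, 1.0]
fused = fuse([prior_embedding, diffusion_embedding], gate_logits=[0.0, 0.0])
```

For a cold bundle, a trained gate would be expected to shift weight toward the diffusion expert, since the prior embedding of a feature-missing bundle is poorly learned.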

Authors:Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Title: HMAE: Self-Supervised Few-Shot Learning for Quantum Spin Systems
Abstract:
Quantum machine learning for spin and molecular systems faces critical challenges of scarce labeled data and computationally expensive simulations. To address these limitations, we introduce Hamiltonian-Masked Autoencoding (HMAE), a novel self-supervised framework that pre-trains transformers on unlabeled quantum Hamiltonians, enabling efficient few-shot transfer learning. Unlike random masking approaches, HMAE employs a physics-informed strategy based on quantum information theory to selectively mask Hamiltonian terms based on their physical significance. Experiments on 12,500 quantum Hamiltonians (60% real-world, 40% synthetic) demonstrate that HMAE achieves 85.3% ± 1.5% accuracy in phase classification and 0.15 ± 0.02 eV MAE in ground state energy prediction with merely 10 labeled examples, a statistically significant improvement (p < 0.01) over classical graph neural networks (78.1% ± 2.1%) and quantum neural networks (76.8% ± 2.3%). Our method's primary advantage is exceptional sample efficiency, reducing required labeled examples by 3-5x compared to baseline methods, though we emphasize that ground truth values for fine-tuning and evaluation still require exact diagonalization or tensor networks. We explicitly acknowledge that our current approach is limited to small quantum systems (specifically limited to 12 qubits during training, with limited extension to 16-20 qubits in testing) and that, while promising within this regime, this size restriction prevents immediate application to larger systems of practical interest in materials science and quantum chemistry.
中文: 本研究提出了哈密顿量掩码自编码(HMAE)方法,通过基于物理意义选择性掩蔽哈密顿量项的自监督框架,显著提升了量子系统的少样本学习效率,在极小标注数据下实现更高精度,但当前仅适用于小规模量子系统。
English: The study introduces Hamiltonian-Masked Autoencoding (HMAE), a self-supervised framework that enhances few-shot learning for quantum systems by selectively masking Hamiltonian terms based on physical significance, achieving superior accuracy with minimal labeled data but remaining limited to small-scale applications.

Authors:Zhihai Wang, Jie Wang, Jilai Pan, Xilin Xia, Huiling Zhen, Mingxuan Yuan, Jianye Hao, Feng Wu
Title: Accelerating Large Language Model Reasoning via Speculative Search
Abstract:
Tree-search-based reasoning methods have significantly enhanced the reasoning capability of large language models (LLMs) by facilitating the exploration of multiple intermediate reasoning steps, i.e., thoughts. However, these methods suffer from substantial inference latency, as they have to generate numerous reasoning thoughts, severely limiting LLM applicability. To address this challenge, we propose a novel Speculative Search (SpecSearch) framework that significantly accelerates LLM reasoning by optimizing thought generation. Specifically, SpecSearch utilizes a small model to strategically collaborate with a large model at both thought and token levels, efficiently generating high-quality reasoning thoughts. The major pillar of SpecSearch is a novel quality-preserving rejection mechanism, which effectively filters out thoughts whose quality falls below that of the large model's outputs. Moreover, we show that SpecSearch preserves comparable reasoning quality to the large model. Experiments on both the Qwen and Llama models demonstrate that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to 2.12× speedup with comparable reasoning quality.
中文摘要:提出的推测搜索框架通过让小模型与大模型在思维和标记层面协作,并采用质量保留的拒绝机制,在保持与大模型相当推理质量的同时显著加速了大型语言模型的推理过程。
English Summary: The proposed Speculative Search framework accelerates LLM reasoning by using a small model to collaborate with a large model and employing a quality-preserving rejection mechanism, achieving significant speedup while maintaining reasoning quality comparable to the large model.
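
The draft-then-reject pattern described above can be sketched at the thought level. This is a schematic illustration under strong assumptions, not SpecSearch itself: the scoring function, the fixed `threshold` (standing in for an estimate of the large model's output quality), and the `regenerate` callback are all hypothetical placeholders.

```python
def speculative_step(drafts, score, threshold, regenerate):
    """Keep cheap draft thoughts that pass the quality bar; redo the rest.

    drafts:     thoughts proposed by the small model
    score:      function mapping a thought to a quality score
    threshold:  assumed estimate of the large model's thought quality
    regenerate: fallback that asks the large model for a replacement
    """
    kept, n_rejected = [], 0
    for thought in drafts:
        if score(thought) >= threshold:
            kept.append(thought)
        else:
            n_rejected += 1
            kept.append(regenerate(thought))
    return kept, n_rejected


# Toy scores standing in for a real thought evaluator.
toy_scores = {"draft A": 0.9, "draft B": 0.3, "draft C": 0.7}
thoughts, rejected = speculative_step(
    list(toy_scores),
    toy_scores.get,
    threshold=0.6,
    regenerate=lambda t: t + " (redone by large model)",
)
```

The speedup comes from the fraction of drafts that pass: only rejected thoughts incur the cost of the large model.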

Authors:Zheng Lin, Yuxin Zhang, Zhe Chen, Zihan Fang, Xianhao Chen, Praneeth Vepakomma, Wei Ni, Jun Luo, Yue Gao
Title: HSplitLoRA: A Heterogeneous Split Parameter-Efficient Fine-Tuning Framework for Large Language Models
Abstract:
Recently, large language models (LLMs) have achieved remarkable breakthroughs, revolutionizing the natural language processing domain and beyond. Due to immense parameter sizes, fine-tuning these models with private data for diverse downstream tasks has become mainstream. Though federated learning (FL) offers a promising solution for fine-tuning LLMs without sharing raw data, substantial computing costs hinder its democratization. Moreover, in real-world scenarios, private client devices often possess heterogeneous computing resources, further complicating LLM fine-tuning. To combat these challenges, we propose HSplitLoRA, a heterogeneous parameter-efficient fine-tuning (PEFT) framework built on split learning (SL) and low-rank adaptation (LoRA) fine-tuning, for efficiently fine-tuning LLMs on heterogeneous client devices. HSplitLoRA first identifies important weights based on their contributions to LLM training. It then dynamically configures the decomposition ranks of LoRA adapters for selected weights and determines the model split point according to varying computing budgets of client devices. Finally, a noise-free adapter aggregation mechanism is devised to support heterogeneous adapter aggregation without introducing noise. Extensive experiments demonstrate that HSplitLoRA outperforms state-of-the-art benchmarks in training accuracy and convergence speed.
中文摘要:HSplitLoRA是一种基于分裂学习和低秩自适应优化的创新框架,能在异构计算设备上高效微调大语言模型,在训练精度和收敛速度方面均优于现有最优方法。
English Summary: HSplitLoRA is a novel framework that combines split learning and LoRA fine-tuning to efficiently adapt large language models on devices with varying computing resources, achieving superior accuracy and faster convergence.

Authors:Tommaso Bianchi, Alessandro Brighente, Mauro Conti, Edoardo Pavan
Title: SoK: Stealing Cars Since Remote Keyless Entry Introduction and How to Defend From It
Abstract:
Remote Keyless Entry (RKE) systems have been the target of thieves since their introduction in the automotive industry. Robberies targeting vehicles and their remote entry systems are booming again, without any significant advancement from the industrial sector to protect against them. Researchers and attackers continuously play cat and mouse to implement new methodologies to exploit weaknesses and defense strategies for RKEs. In this fragmented landscape, different attacks and defenses have been discussed in research and industry without proper bridging. In this paper, we provide a Systematization Of Knowledge (SOK) on RKE and Passive Keyless Entry and Start (PKES), focusing on their history and current situation, ranging from legacy systems to modern web-based ones. We provide insight into vehicle manufacturers' technologies and attacks and defense mechanisms involving them. To the best of our knowledge, this is the first comprehensive SOK on RKE systems, and we address specific research questions to understand the evolution and security status of such systems. By identifying the weaknesses RKE still faces, we provide future directions for security researchers and companies to find viable solutions to address old attacks, such as Relay and RollJam, as well as new ones, like API vulnerabilities.
中文摘要:本文首次对遥控钥匙进入系统进行了全面的知识体系化研究,通过分析从传统到现代系统的安全演变过程,揭示了持续存在的安全漏洞,并为解决传统及新型威胁提出了未来的研究方向。
English Summary: This paper presents the first comprehensive Systematization of Knowledge on Remote Keyless Entry systems, analyzing their security evolution from legacy to modern implementations while identifying persistent vulnerabilities and proposing future research directions to address both traditional and emerging threats.

Authors:Yu Mao, Jingzong Li, Jun Wang, Hong Xu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue
Title: Easz: An Agile Transformer-based Image Compression Framework for Resource-constrained IoTs
Abstract:
Neural image compression, necessary in various machine-to-machine communication scenarios, suffers from its heavy encode-decode structures and inflexibility in switching between different compression levels. Consequently, it is challenging to apply neural image compression, which is developed for powerful servers with high computational and storage capacities, to edge devices. We take a step toward solving these challenges by proposing a new transformer-based edge-compute-free image coding framework called Easz. Easz shifts the computational overhead to the server, and hence avoids the heavy encoding and model switching overhead on the edge. Easz utilizes a patch-erase algorithm to selectively remove image contents using a conditional uniform-based sampler. The erased pixels are reconstructed on the receiver side through a transformer-based framework. To further reduce the computational overhead on the receiver, we introduce a lightweight transformer-based reconstruction structure. Extensive evaluations conducted on a real-world testbed demonstrate multiple advantages of Easz over existing compression approaches, in terms of adaptability to different compression levels, computational efficiency, and image reconstruction quality.
中文:提出的Easz框架通过将计算负担转移至服务器,并采用基于补丁擦除的算法和轻量级变换器,有效解决了神经图像压缩在边缘设备上的应用难题,实现了高效且高质量的重建。
English: The proposed Easz framework addresses neural image compression challenges by shifting computational load to the server and using a patch-erase algorithm with a lightweight transformer for efficient, high-quality reconstruction on edge devices.

Authors:Joe Harrison, Peter A. N. Bosman, Tanja Alderliesten
Title: Thinking Outside the Template with Modular GP-GOMEA
Abstract:
The goal in Symbolic Regression (SR) is to discover expressions that accurately map input to output data. Because often the intent is to understand these expressions, there is a trade-off between accuracy and the interpretability of expressions. GP-GOMEA excels at producing small SR expressions (increasing the potential for interpretability) with high accuracy, but requires a fixed tree template, which limits the types of expressions that can be evolved. This paper presents a modular representation for GP-GOMEA that allows multiple trees to be evolved simultaneously that can be used as (functional) subexpressions. While each tree individually is constrained to a (small) fixed tree template, the final expression, if expanded, can exhibit a much larger structure. Furthermore, the use of subexpressions decomposes the original regression problem and opens the possibility for enhanced interpretability through the piece-wise understanding of small subexpressions. We compare the performance of GP-GOMEA with and without modular templates on a variety of datasets. We find that our proposed approach generally outperforms single-template GP-GOMEA and can moreover uncover ground-truth expressions underlying synthetic datasets with modular subexpressions at a faster rate than GP-GOMEA without modular subexpressions.
Chinese: 本文为符号回归中的GP-GOMEA提出了一种模块化表示法,通过同时演化多个小树作为子表达式来提高准确性和可解释性,其性能优于单模板方法。
English: This paper introduces a modular representation for GP-GOMEA in Symbolic Regression, enabling the evolution of multiple small trees as subexpressions to enhance both accuracy and interpretability, outperforming the single-template approach.

Authors:Kuan Zhang, Chengliang Chai, Jingzhe Xu, Chi Zhang, Ye Yuan, Guoren Wang, Lei Cao
Title: Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization
Abstract:
Recent studies indicate that deep neural networks degrade in generalization performance under noisy supervision. Existing methods focus on isolating clean subsets or correcting noisy labels, facing limitations such as high computational costs, heavy hyperparameter tuning process, and coarse-grained optimization. To address these challenges, we propose a novel two-stage noisy learning framework that enables instance-level optimization through a dynamically weighted loss function, avoiding hyperparameter tuning. To obtain stable and accurate information about noise modeling, we introduce a simple yet effective metric, termed wrong event, which dynamically models the cleanliness and difficulty of individual samples while maintaining computational costs. Our framework first collects wrong event information and builds a strong base model. Then we perform noise-robust training on the base model, using a probabilistic model to handle the wrong event information of samples. Experiments on five synthetic and real-world LNL benchmarks demonstrate our method surpasses state-of-the-art methods in performance, achieves a nearly 75% reduction in computational time and improves model scalability.
中文摘要:该研究提出了一种两阶段噪声学习框架,通过动态加权损失函数和创新的"错误事件"指标解决深度神经网络在噪声监督下的泛化问题,在提升性能的同时显著降低了计算成本并增强了模型可扩展性。
English Summary: The proposed two-stage framework addresses noisy supervision in deep neural networks by using a dynamically weighted loss and a novel "wrong event" metric, achieving superior performance with reduced computational time and enhanced scalability.

Authors:Yijie Hong, Xiaofei Yin, Xinzhong Wang, Yi Tu, Ya Guo, Sufeng Duan, Weiqiang Wang, Lingyong Fang, Depeng Wang, Huijia Zhu
Title: Keep the General, Inject the Specific: Structured Dialogue Fine-Tuning for Knowledge Injection without Catastrophic Forgetting
Abstract:
Large Vision Language Models have demonstrated impressive versatile capabilities through extensive multimodal pre-training, but face significant limitations when incorporating specialized knowledge domains beyond their training distribution. These models struggle with a fundamental dilemma: direct adaptation approaches that inject domain-specific knowledge often trigger catastrophic forgetting of foundational visual-linguistic abilities. We introduce Structured Dialogue Fine-Tuning (SDFT), an approach that effectively injects domain-specific knowledge while minimizing catastrophic forgetting. Drawing inspiration from supervised fine-tuning in LLMs and subject-driven personalization in text-to-image diffusion models, our method employs a three-phase dialogue structure: Foundation Preservation reinforces pre-trained visual-linguistic alignment through caption tasks; Contrastive Disambiguation introduces carefully designed counterfactual examples to maintain semantic boundaries; and Knowledge Specialization embeds specialized information through chain-of-thought reasoning. Experimental results across multiple domains confirm SDFT's effectiveness in balancing specialized knowledge acquisition with general capability retention. Our key contributions include a data-centric dialogue template that balances foundational alignment with targeted knowledge integration, a weighted multi-turn supervision framework, and comprehensive evaluation across diverse knowledge types.
中文摘要:大型视觉语言模型在融入专业知识时面临灾难性遗忘的困境,而提出的结构化对话微调方法通过三阶段对话结构,在强化基础能力的同时有效注入领域知识,实现了专业知识获取与通用能力保持的平衡。
English Summary: Large Vision Language Models face challenges in incorporating specialized knowledge without forgetting their foundational abilities, but the proposed Structured Dialogue Fine-Tuning (SDFT) method effectively injects domain expertise while preserving core visual-linguistic skills through a three-phase dialogue structure.

Authors:Luca Tognoni, Neil Reichlin, Edoardo Ghignone, Nicolas Baumann, Steven Marty, Liam Boyle, Michele Magno
Title: DTR: Delaunay Triangulation-based Racing for Scaled Autonomous Racing
Abstract:
Reactive controllers for autonomous racing avoid the computational overhead of full See-Think-Act autonomy stacks by directly mapping sensor input to control actions, eliminating the need for localization and planning. A widely used reactive strategy is FTG, which identifies gaps in LiDAR range measurements and steers toward a chosen one. While effective on fully bounded circuits, FTG fails in scenarios with incomplete boundaries and is prone to driving into dead-ends, known as FTG-traps. This work presents DTR, a reactive controller that combines Delaunay triangulation, from raw LiDAR readings, with track boundary segmentation to extract a centerline while systematically avoiding FTG-traps. Compared to FTG, the proposed method achieves lap times that are 70% faster and approaches the performance of map-dependent methods. With a latency of 8.95 ms and CPU usage of only 38.85% on the robot's OBC, DTR is real-time capable and has been successfully deployed and evaluated in field experiments.
中文: DTR是一种反应式控制器,结合Delaunay三角剖分和赛道边界分割来避免FTG陷阱,实现了70%的圈速提升,同时具备低延迟和低CPU占用率。
English: DTR is a reactive controller that uses Delaunay triangulation and boundary segmentation to avoid FTG-traps, achieving 70% faster lap times with low latency and CPU usage.
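The FTG baseline described in the abstract (find gaps in the LiDAR ranges, steer toward one) can be illustrated with a minimal sketch. This is not the DTR method and not the authors' FTG implementation: the function name, the `min_clearance` threshold, and the "widest gap, middle beam" selection rule are assumptions made for illustration only.

```python
import numpy as np

def follow_the_gap(ranges, angles, min_clearance=1.0):
    """Return the beam angle at the middle of the widest free gap, or None.

    A gap is a maximal run of consecutive beams whose measured range
    exceeds min_clearance (a hypothetical threshold, in metres).
    """
    ranges = np.asarray(ranges, dtype=float)
    angles = np.asarray(angles, dtype=float)
    free = ranges > min_clearance

    best_start, best_len = 0, 0
    start = None
    for i, is_free in enumerate(free):
        if is_free and start is None:
            start = i  # a new gap opens at this beam
        if start is not None and (not is_free or i == len(free) - 1):
            end = i if not is_free else i + 1  # exclusive end index
            if end - start > best_len:
                best_start, best_len = start, end - start
            start = None  # the gap is closed

    if best_len == 0:
        return None  # every beam is blocked
    mid = best_start + (best_len - 1) // 2
    return float(angles[mid])
```

Because such a rule only looks at the current scan, it can happily steer into a wide opening that later turns out to be a dead-end, which is the failure mode the abstract calls an FTG-trap.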

Authors:Ying Yang, Jie Zhang, Xiao Lv, Di Lin, Tao Xiang, Qing Guo
Title: Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models
Abstract:
While adversarial attacks on vision-and-language pretraining (VLP) models have been explored, generating natural adversarial samples crafted through realistic and semantically meaningful perturbations remains an open challenge. Existing methods, primarily designed for classification tasks, struggle when adapted to VLP models due to their restricted optimization spaces, leading to ineffective attacks or unnatural artifacts. To address this, we propose LightD, a novel framework that generates natural adversarial samples for VLP models via semantically guided relighting. Specifically, LightD leverages ChatGPT to propose context-aware initial lighting parameters and integrates a pretrained relighting model (IC-light) to enable diverse lighting adjustments. LightD expands the optimization space while ensuring perturbations align with scene semantics. Additionally, gradient-based optimization is applied to the reference lighting image to further enhance attack effectiveness while maintaining visual naturalness. The effectiveness and superiority of the proposed LightD have been demonstrated across various VLP models in tasks such as image captioning and visual question answering.
中文: LightD是一种新颖的框架,通过语义引导的重光照技术为视觉语言预训练模型生成自然对抗样本,利用ChatGPT提出上下文感知的初始光照参数并集成预训练重光照模型,在图像描述和视觉问答等任务中确保视觉自然性和攻击有效性。
English: LightD is a novel framework that generates natural adversarial samples for vision-and-language pretraining models by using semantically guided relighting, leveraging ChatGPT for context-aware lighting parameters and a pretrained relighting model to ensure visual naturalness and attack effectiveness across tasks like image captioning and visual question answering.

Authors:Kaijie Chen, Zihao Lin, Zhiyang Xu, Ying Shen, Yuguang Yao, Joy Rimchala, Jiaxin Zhang, Lifu Huang
Title: R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation
Abstract:
Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation, e.g., generating "a bitten apple that has been left in the air for more than a week" necessitates understanding temporal decay and commonsense concepts. While recent T2I models have made impressive progress in producing photorealistic images, their reasoning capability remains underdeveloped and insufficiently evaluated. To bridge this gap, we introduce R2I-Bench, a comprehensive benchmark specifically designed to rigorously assess reasoning-driven T2I generation. R2I-Bench comprises meticulously curated data instances, spanning core reasoning categories, including commonsense, mathematical, logical, compositional, numerical, causal, and concept mixing. To facilitate fine-grained evaluation, we design R2IScore, a QA-style metric based on instance-specific, reasoning-oriented evaluation questions that assess three critical dimensions: text-image alignment, reasoning accuracy, and image quality. Extensive experiments with 16 representative T2I models, including a strong pipeline-based framework that decouples reasoning and generation using the state-of-the-art language and image generation models, demonstrate consistently limited reasoning performance, highlighting the need for more robust, reasoning-aware architectures in the next generation of T2I systems. Project Page: https://r2i-bench.github.io
Chinese Summary: R2I-Bench作为专门评估文本生成图像推理能力的新基准,通过多维度测试揭示了现有模型在生成逼真图像的同时仍存在明显的推理缺陷。
English Summary: R2I-Bench is a new benchmark designed to evaluate reasoning capabilities in text-to-image generation, revealing current models' significant limitations despite their photorealism through comprehensive testing across multiple reasoning categories.

Authors:Jonas Kulhanek, Marie-Julie Rakotosaona, Fabian Manhardt, Christina Tsalicoglou, Michael Niemeyer, Torsten Sattler, Songyou Peng, Federico Tombari
Title: LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering
Abstract:
In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, thus largely reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.
中文: 本研究提出了一种创新的3D高斯泼溅细节层次方法,通过层级选择和空间分块技术大幅降低渲染时间与GPU内存占用,在大规模数据集上实现了领先的渲染性能。
English: This study introduces a novel level-of-detail method for 3D Gaussian Splatting that significantly reduces rendering time and GPU memory usage through hierarchical selection and spatial partitioning, achieving state-of-the-art performance on large-scale datasets.

Authors:Feng Luo, Yu-Neng Chuang, Guanchu Wang, Hoang Anh Duy Le, Shaochen Zhong, Hongyi Liu, Jiayi Yuan, Yang Sui, Vladimir Braverman, Vipin Chaudhary, Xia Hu
Title: AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models
Abstract:
Reasoning-capable large language models (LLMs) demonstrate strong performance on complex reasoning tasks but often suffer from overthinking, generating unnecessarily long chain-of-thought (CoT) reasoning paths for easy reasoning questions, thereby increasing inference cost and latency. Recent approaches attempt to address this challenge by manually deciding when to apply long or short reasoning. However, they lack the flexibility to adapt CoT length dynamically based on question complexity. In this paper, we propose Auto Long-Short Reasoning (AutoL2S), a dynamic and model-agnostic framework that enables LLMs to dynamically compress their generated reasoning path based on the complexity of the reasoning question. AutoL2S enables a learned paradigm, in which LLMs themselves can decide when longer reasoning is necessary and when shorter reasoning suffices, by training on data annotated with our proposed method, which includes both long and short CoT paths and a special token. We then use this token to indicate when the model can skip generating lengthy CoT reasoning. This proposed annotation strategy can enhance the LLMs' ability to generate shorter CoT reasoning paths with improved quality after training. Extensive evaluation results show that AutoL2S reduces the length of reasoning generation by up to 57% without compromising performance, demonstrating the effectiveness of AutoL2S for scalable and efficient LLM reasoning.
中文: AutoL2S是一种动态框架,使大语言模型能根据问题复杂度自适应压缩推理路径,在不损失性能的情况下将生成长度减少高达57%。
English: AutoL2S is a dynamic framework that enables large language models to adaptively compress reasoning paths based on question complexity, reducing generation length by up to 57% without performance loss.

Authors:Wenjun Lu, Haodong Chen, Anqi Yi, Yuk Ying Chung, Zhiyong Wang, Kun Hu
Title: Learning Fine-Grained Geometry for Sparse-View Splatting via Cascade Depth Loss
Abstract:
Novel view synthesis is a fundamental task in 3D computer vision that aims to reconstruct realistic images from a set of posed input views. However, reconstruction quality degrades significantly under sparse-view conditions due to limited geometric cues. Existing methods, such as Neural Radiance Fields (NeRF) and the more recent 3D Gaussian Splatting (3DGS), often suffer from blurred details and structural artifacts when trained with insufficient views. Recent works have identified the quality of rendered depth as a key factor in mitigating these artifacts, as it directly affects geometric accuracy and view consistency. In this paper, we address these challenges by introducing Hierarchical Depth-Guided Splatting (HDGS), a depth supervision framework that progressively refines geometry from coarse to fine levels. Central to HDGS is a novel Cascade Pearson Correlation Loss (CPCL), which aligns rendered and estimated monocular depths across multiple spatial scales. By enforcing multi-scale depth consistency, our method substantially improves structural fidelity in sparse-view scenarios. Extensive experiments on the LLFF and DTU benchmarks demonstrate that HDGS achieves state-of-the-art performance under sparse-view settings while maintaining efficient and high-quality rendering.
中文: 本文提出的分层深度引导溅射(HDGS)框架通过多尺度深度一致性逐步优化几何结构,有效提升了稀疏视图下的三维重建质量,在基准测试中达到了领先性能。
English: This paper introduces Hierarchical Depth-Guided Splatting (HDGS), a framework that enhances sparse-view 3D reconstruction by progressively refining geometry through multi-scale depth consistency, achieving state-of-the-art results on benchmarks.
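To make the idea of a multi-scale Pearson depth loss concrete, here is a rough sketch. It is not the paper's CPCL: the non-overlapping patch layout, the patch sizes, the plain averaging across scales, and all function names are assumptions. What it does show is why Pearson correlation suits monocular depth, whose scale and shift are unknown: the loss is unchanged by any positive affine rescaling of either depth map.

```python
import numpy as np

def pearson(a, b, eps=1e-8):
    """Pearson correlation between two arrays, flattened."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))

def patch_pearson_loss(rendered, estimated, patch):
    """Mean of (1 - correlation) over non-overlapping patch x patch tiles."""
    h, w = rendered.shape
    losses = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            r = rendered[y:y + patch, x:x + patch]
            e = estimated[y:y + patch, x:x + patch]
            losses.append(1.0 - pearson(r, e))
    return float(np.mean(losses))

def cascade_pearson_loss(rendered, estimated, patches=(8, 4, 2)):
    """Average the patch loss from coarse to fine (hypothetical patch sizes)."""
    return float(np.mean([patch_pearson_loss(rendered, estimated, p)
                          for p in patches]))
```

In this sketch a rendered depth map that is an affine copy of the estimated one gives a loss near 0, while a sign-flipped copy gives a loss near 2.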

Authors:Xiangyu Chen, Jing Liu, Ye Wang, Matthew Brand, Pu Wang, Toshiaki Koike-Akino
Title: TuneComp: Joint Fine-tuning and Compression for Large Foundation Models
Abstract:
To reduce model size during post-training, compression methods, including knowledge distillation, low-rank approximation, and pruning, are often applied after fine-tuning the model. However, sequential fine-tuning and compression sacrifices performance, while creating a larger than necessary model as an intermediate step. In this work, we aim to reduce this gap, by directly constructing a smaller model while guided by the downstream task. We propose to jointly fine-tune and compress the model by gradually distilling it to a pruned low-rank structure. Experiments demonstrate that joint fine-tuning and compression significantly outperforms other sequential compression methods.
Chinese: 本研究提出一种联合微调与压缩的方法,通过逐步将模型蒸馏为剪枝后的低秩结构,显著优于顺序压缩方法的性能。
English: This study introduces a method that jointly fine-tunes and compresses models by gradually distilling them into a pruned low-rank structure, significantly outperforming sequential approaches in performance.

Authors:Yiheng Liu, Liao Qu, Huichao Zhang, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Xian Li, Shuai Wang, Daniel K. Du, Shu Cheng, Zehuan Yuan, Xinglong Wu
Title: DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
Abstract:
This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method that models images through a novel next-detail prediction strategy. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow enables the generation process to start from the global structure and incrementally refine details. This coarse-to-fine 1D token sequence aligns well with the autoregressive inference mechanism, providing a more natural and efficient way for the AR model to generate complex visual content. Our compact 1D AR model achieves high-quality image synthesis with significantly fewer tokens than previous approaches, e.g., VAR/VQGAN. We further propose a parallel inference mechanism with self-correction that accelerates generation by approximately 8x while reducing the accumulated sampling error inherent in teacher-forcing supervision. On the ImageNet 256x256 benchmark, our method achieves 2.96 gFID with 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), which both require 680 tokens in their AR models. Moreover, due to the significantly reduced token count and parallel inference mechanism, our method runs inference nearly 2x faster than VAR and FlexVAR. Extensive experimental results demonstrate DetailFlow's superior generation quality and efficiency compared to existing state-of-the-art methods.
中文: DetailFlow是一种从粗到细的自回归图像生成方法,通过逐级细节预测策略,以更少的标记和更快的推理速度合成高质量图像,在基准测试中超越了现有模型。
English: DetailFlow is a coarse-to-fine autoregressive image generation method that uses a next-detail prediction strategy to efficiently synthesize high-quality images with fewer tokens and faster inference, outperforming existing models on benchmarks.

Authors:Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, John Dang, Samuel Cahyawijaya, Shivalika Singh, Seraphina Goldfarb-Tarrant, Viraat Aryabumi, Aakanksha, Wei-Yin Ko, Ahmet Üstün, Matthias Gallé, Marzieh Fadaee, Sara Hooker
Title: The Multilingual Divide and Its Impact on Global AI Safety
Abstract:
Despite advances in large language model capabilities in recent years, a large gap remains in their capabilities and safety performance for many languages beyond a relatively small handful of globally dominant languages. This paper provides researchers, policymakers and governance experts with an overview of key challenges to bridging the "language gap" in AI and minimizing safety risks across languages. We provide an analysis of why the language gap in AI exists and grows, and how it creates disparities in global AI safety. We identify barriers to address these challenges, and recommend how those working in policy and governance can help address safety concerns associated with the language gap by supporting multilingual dataset creation, transparency, and research.
中文: 大语言模型在非主流语言上的能力与安全性存在显著差距,需通过多语言数据集建设和透明化研究来应对全球人工智能安全风险。
English: Large language models exhibit significant disparities in capabilities and safety for non-dominant languages, requiring multilingual dataset development and transparent research to address global AI safety risks.

Authors:Ziming Wang, Zeyu Shi, Haoyi Zhou, Shiqi Gao, Qingyun Sun, Jianxin Li
Title: Towards Objective Fine-tuning: How LLMs' Prior Knowledge Causes Potential Poor Calibration?
Abstract:
Fine-tuned Large Language Models (LLMs) often demonstrate poor calibration, with their confidence scores misaligned with actual performance. While calibration has been extensively studied in models trained from scratch, the impact of LLMs' prior knowledge on calibration during fine-tuning remains understudied. Our research reveals that LLMs' prior knowledge causes potential poor calibration due to the ubiquitous presence of known data in real-world fine-tuning, which appears harmful for calibration. Specifically, data aligned with LLMs' prior knowledge would induce overconfidence, while new knowledge improves calibration. Our findings expose a tension: LLMs' encyclopedic knowledge, while enabling task versatility, undermines calibration through unavoidable knowledge overlaps. To address this, we propose CogCalib, a cognition-aware framework that applies targeted learning strategies according to the model's prior knowledge. Experiments across 7 tasks using 3 LLM families prove that CogCalib significantly improves calibration while maintaining performance, achieving an average 57% reduction in ECE compared to standard fine-tuning in Llama3-8B. These improvements generalize well to out-of-domain tasks, enhancing the objectivity and reliability of domain-specific LLMs, and making them more trustworthy for critical human-AI interaction applications.
中文摘要:微调后的大语言模型常因先验知识导致校准不佳,而提出的CogCalib框架通过根据模型已有知识实施针对性学习策略,显著提升了校准效果。
English Summary: Fine-tuned LLMs often show poor calibration due to their prior knowledge, but the proposed CogCalib framework significantly improves calibration by applying targeted learning strategies based on the model's existing knowledge.
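The ECE figure quoted in the abstract refers to expected calibration error. For readers unfamiliar with the metric, a common equal-width-bin formulation can be sketched as follows; the bin count, the function name, and the half-open bin convention are assumptions here, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins.

    confidences: predicted probability of the chosen answer, in (0, 1].
    correct:     1 if the prediction was right, else 0.
    A confidence of exactly 0 falls outside every (lo, hi] bin and is ignored.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's sample share
    return float(ece)
```

An overconfident model, for example one that reports 0.85 confidence while being right 75% of the time, yields an ECE of about 0.10 under this definition.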

Authors:Xiangyu Sun, Runnan Chen, Mingming Gong, Dong Xu, Tongliang Liu
Title: Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting
Abstract:
Sparse-view scene reconstruction often faces significant challenges due to the constraints imposed by limited observational data. These limitations result in incomplete information, leading to suboptimal reconstructions using existing methodologies. To address this, we present Intern-GS, a novel approach that effectively leverages rich prior knowledge from vision foundation models to enhance the process of sparse-view Gaussian Splatting, thereby enabling high-quality scene reconstruction. Specifically, Intern-GS utilizes vision foundation models to guide both the initialization and the optimization process of 3D Gaussian splatting, effectively addressing the limitations of sparse inputs. In the initialization process, our method employs DUSt3R to generate a dense and non-redundant gaussian point cloud. This approach significantly alleviates the limitations encountered by traditional structure-from-motion (SfM) methods, which often struggle under sparse-view constraints. During the optimization process, vision foundation models predict depth and appearance for unobserved views, refining the 3D Gaussians to compensate for missing information in unseen regions. Extensive experiments demonstrate that Intern-GS achieves state-of-the-art rendering quality across diverse datasets, including both forward-facing and large-scale scenes, such as LLFF, DTU, and Tanks and Temples.
中文:Intern-GS提出了一种利用视觉基础模型增强稀疏视角高斯溅射的新方法,通过引导初始化和优化过程,实现了卓越的场景重建效果。
English: Intern-GS introduces a novel method that utilizes vision foundation models to enhance sparse-view Gaussian Splatting, achieving superior scene reconstruction by guiding initialization and optimization processes.

Authors:Naiyu Fang, Zheyuan Zhou, Fayao Liu, Xulei Yang, Jiacheng Wei, Lemiao Qiu, Guosheng Lin
Title: OccLE: Label-Efficient 3D Semantic Occupancy Prediction
Abstract:
3D semantic occupancy prediction offers an intuitive and efficient scene understanding and has attracted significant interest in autonomous driving perception. Existing approaches either rely on full supervision, which demands costly voxel-level annotations, or on self-supervision, which provides limited guidance and yields suboptimal performance. To address these challenges, we propose OccLE, a Label-Efficient 3D Semantic Occupancy Prediction method that takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Our intuition is to decouple the semantic and geometric learning tasks and then fuse the learned feature grids from both tasks for the final semantic occupancy prediction. Therefore, the semantic branch distills a 2D foundation model to provide aligned pseudo labels for 2D and 3D semantic learning. The geometric branch integrates image and LiDAR inputs in cross-plane synergy based on their inherency, employing semi-supervision to enhance geometry learning. We fuse semantic-geometric feature grids through Dual Mamba and incorporate a scatter-accumulated projection to supervise unannotated prediction with aligned pseudo labels. Experiments show that OccLE achieves competitive performance with only 10% of voxel annotations on the SemanticKITTI and Occ3D-nuScenes datasets.
中文摘要:OccLE是一种标签高效的3D语义占据预测方法,通过解耦语义与几何学习任务,并融合二维基础模型与跨模态数据,仅需10%体素标注即可实现优越性能。
English Summary: OccLE is a label-efficient 3D semantic occupancy prediction method that decouples semantic and geometric learning, achieving competitive performance with only 10% of voxel annotations by leveraging 2D foundation models and cross-modal fusion.

Authors:Lei Tian, Xiaomin Li, Liqian Ma, Hao Yin, Zirui Zheng, Hefei Huang, Taiqing Li, Huchuan Lu, Xu Jia
Title: CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting
Abstract:
Recent advances in 3D reconstruction techniques and vision-language models have fueled significant progress in 3D semantic understanding, a capability critical to robotics, autonomous driving, and virtual/augmented reality. However, methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies induced by occlusion, image blur, and view-dependent variations. These inconsistencies, when propagated via projection supervision, deteriorate the quality of 3D Gaussian semantic fields and introduce artifacts in the rendered outputs. To mitigate this limitation, we propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. Specifically, our approach first employs a zero-shot tracker to align a set of SAM-generated 2D masks and reliably identify their corresponding categories. Next, we utilize CLIP to extract robust semantic encodings across views. Finally, our Contrastive Codebook Learning (CCL) module distills discriminative semantic features by enforcing intra-class compactness and inter-class distinctiveness. In contrast to previous methods that directly apply CLIP to imperfect masks, our framework explicitly resolves semantic conflicts while preserving category discriminability. Extensive experiments demonstrate that CCL-LGS outperforms previous state-of-the-art methods. Our project page is available at https://epsilontl.github.io/CCL-LGS/.
Chinese: 3D语义理解的最新进展面临跨视图不一致性的挑战,但提出的CCL-LGS框架通过整合多视图语义线索和对比学习,有效解决了这些问题,提高了准确性并减少了伪影。
English: Recent progress in 3D semantic understanding faces challenges from cross-view inconsistencies, but the proposed CCL-LGS framework effectively addresses these by integrating multi-view semantic cues and contrastive learning to enhance accuracy and reduce artifacts.

Authors:Rui Li, Quanyu Dai, Zeyu Zhang, Xu Chen, Zhenhua Dong, Ji-Rong Wen
Title: KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing
Abstract:
Recent advances in retrieval-augmented generation (RAG) furnish large language models (LLMs) with iterative retrievals of relevant information to handle complex multi-hop questions. These methods typically alternate between LLM reasoning and retrieval to accumulate external information into the LLM's context. However, the ever-growing context inherently imposes an increasing burden on the LLM to perceive connections among critical information pieces, with futile reasoning steps further exacerbating this overload issue. In this paper, we present KnowTrace, an elegant RAG framework to (1) mitigate the context overload and (2) bootstrap higher-quality multi-step reasoning. Instead of simply piling the retrieved contents, KnowTrace autonomously traces out desired knowledge triplets to organize a specific knowledge graph relevant to the input question. Such a structured workflow not only empowers the LLM with an intelligible context for inference, but also naturally inspires a reflective mechanism of knowledge backtracing to identify contributive LLM generations as process supervision data for self-bootstrapping. Extensive experiments show that KnowTrace consistently surpasses existing methods across three multi-hop question answering benchmarks, and the bootstrapped version further amplifies the gains.
中文摘要:KnowTrace是一种检索增强生成框架,通过将检索信息组织成结构化知识图谱来减轻上下文过载,并利用知识回溯机制实现自我提升的多步推理,在多项多跳问答基准测试中均优于现有方法。
English Summary: KnowTrace is a retrieval-augmented generation framework that organizes retrieved information into structured knowledge graphs to reduce context overload and enhance multi-step reasoning through self-bootstrapping, demonstrating superior performance on multi-hop question answering benchmarks.

Authors:Geon-Hyeong Kim, Youngsoo Jang, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, Moontae Lee
Title: SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety
Abstract:
As Large Language Models (LLMs) continue to advance and find applications across a growing number of fields, ensuring the safety of LLMs has become increasingly critical. To address safety concerns, recent studies have proposed integrating safety constraints into Reinforcement Learning from Human Feedback (RLHF). However, these approaches tend to be complex, as they encompass complicated procedures in RLHF along with additional steps required by the safety constraints. Inspired by Direct Preference Optimization (DPO), we introduce a new algorithm called SafeDPO, which is designed to directly optimize the safety alignment objective in a single stage of policy learning, without requiring relaxation. SafeDPO introduces only one additional hyperparameter to further enhance safety and requires only minor modifications to standard DPO. As a result, it eliminates the need to fit separate reward and cost models or to sample from the language model during fine-tuning, while still enhancing the safety of LLMs. Finally, we demonstrate that SafeDPO achieves competitive performance compared to state-of-the-art safety alignment algorithms, both in terms of aligning with human preferences and improving safety.
Chinese: 针对现有大语言模型安全对齐方法的复杂性,本文提出SafeDPO算法,通过单阶段策略学习直接优化安全目标,仅需对标准DPO进行微小改动,在安全性和人类偏好对齐方面均展现出与先进方法相当的竞争力。
English: To address the complexity of existing safety alignment methods for LLMs, this paper introduces SafeDPO, a simplified algorithm that optimizes safety in a single policy learning stage with minimal modifications to standard DPO, achieving competitive performance in both safety and human preference alignment.

Authors:Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
Title: An Embarrassingly Simple Defense Against LLM Abliteration Attacks
Abstract:
Large language models (LLMs) are typically aligned to refuse harmful instructions through safety fine-tuning. A recent attack, termed abliteration, identifies and suppresses the single latent direction most responsible for refusal behavior, thereby enabling models to generate harmful content. We propose a defense that fundamentally alters how models express refusal. We construct an extended-refusal dataset in which responses to harmful prompts provide detailed justifications before refusing, distributing the refusal signal across multiple token positions. Fine-tuning Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset yields models that maintain high refusal rates under abliteration: refusal rates drop by at most 10%, compared to 70-80% drops in baseline models. Comprehensive evaluations of safety and utility demonstrate that extended-refusal fine-tuning effectively neutralizes abliteration attacks while preserving general model performance and enhancing robustness across multiple alignment scenarios.
中文摘要:针对消除攻击的防御方法通过扩展拒绝数据集进行微调,将拒绝信号分散到多个标记位置,使模型在遭受攻击时仍能保持高拒绝率,同时确保安全性和实用性不受影响。
English Summary: The proposed defense against abliteration attacks involves fine-tuning models on an extended-refusal dataset that spreads refusal signals across multiple tokens, maintaining high refusal rates with minimal performance drops while preserving safety and utility.
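
As a rough illustration of the attack this entry describes, the sketch below removes a single "refusal direction" from an activation vector. It assumes the direction is estimated as a normalized difference of mean activations between harmful and harmless prompts, which is a common choice but is not stated in the abstract; the function names and toy vectors are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: difference-in-means is used to estimate the direction.
    """
    diff = [h - b for h, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(activation, direction):
    """Project out the component of `activation` along the unit `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy 3-dimensional activations (made up for illustration only).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = refusal_direction(harmful, harmless)
ablated = ablate([3.0, 1.0, 0.5], direction)
```

After ablation the activation has no component along the estimated direction. The defense summarized above aims to make refusal depend on more than one such direction, so that a single projection of this kind removes less of the behavior.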

Authors:Toshiaki Koike-Akino, Jing Liu, Ye Wang
Title: $μ$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts
Abstract:
To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these rely on calibration data, domain shift may arise for unknown downstream tasks. With a computationally efficient calibration, activation-aware pruning can be executed adaptively for every prompt while still achieving reduced complexity at inference. We formulate it as a mixture of micro-experts, called $μ$-MoE. Several experiments demonstrate that $μ$-MoE can dynamically adapt to task/prompt-dependent structured sparsity on the fly.
中文摘要:激活感知剪枝通过无需重训练的μ-MoE方法,动态适应不同任务和提示的结构化稀疏性,有效应对领域偏移并降低推理计算复杂度。
English Summary: Activation-aware pruning adapts to each prompt for efficient inference without retraining, formulated as μ-MoE to handle domain shifts and reduce computational complexity dynamically.
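
To make the idea of per-prompt activation-aware pruning concrete, here is a minimal sketch. It assumes a Wanda-style importance score (absolute weight times the norm of the matching input activation over the prompt's tokens) and a simple per-row top-k mask; neither the score nor the masking pattern is specified in the abstract, and `prune_for_prompt` is a hypothetical name.

```python
import math

def prune_for_prompt(weights, prompt_acts, keep):
    """Zero all but the `keep` highest-scoring weights in each output row.

    weights:     list of rows, one per output unit (n_out x n_in)
    prompt_acts: list of input activation vectors, one per token (n_tok x n_in)
    Score (an assumption): |w_ij| * L2 norm of input feature j over the prompt.
    """
    n_in = len(weights[0])
    col_norm = [math.sqrt(sum(tok[j] ** 2 for tok in prompt_acts))
                for j in range(n_in)]
    pruned = []
    for row in weights:
        scores = [abs(w) * col_norm[j] for j, w in enumerate(row)]
        top = set(sorted(range(n_in), key=lambda j: scores[j],
                         reverse=True)[:keep])
        pruned.append([w if j in top else 0.0 for j, w in enumerate(row)])
    return pruned

# Toy layer and two single-token "prompts" that excite different inputs.
weights = [[1.0, -2.0, 0.5],
           [0.1, 0.2, 3.0]]
prompt_a = [[1.0, 0.0, 1.0]]
prompt_b = [[0.0, 1.0, 0.0]]

mask_a = prune_for_prompt(weights, prompt_a, keep=1)
mask_b = prune_for_prompt(weights, prompt_b, keep=1)
```

The two prompts keep different weights from the same layer, which is the sense in which each prompt selects its own set of "micro-experts".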

Authors:Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu, Ye Wang, Pu Wang, Matthew Brand
Title: LatentLLM: Attention-Aware Joint Tensor Compression
Abstract:
Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Our framework can significantly improve the model accuracy over the existing model compression methods when reducing the latent dimension to realize computationally/memory-efficient LLMs/LMMs. We show the benefit on several benchmarks, including multi-modal reasoning tasks.
中文: 该框架通过全局注意力感知的联合张量分解将大型模型转换为降维潜在结构,在实现计算和内存效率的同时,显著提升了相对于现有压缩方法的模型精度。
English: The proposed framework employs a global attention-aware joint tensor decomposition to convert large models into reduced-dimension latent structures, significantly enhancing accuracy over existing compression methods while achieving computational and memory efficiency.

Authors:Runze Li, Siyu Wu, Jun Wang, Wei Zhang
Title: CIKT: A Collaborative and Iterative Knowledge Tracing Framework with Large Language Models
Abstract:
Knowledge Tracing (KT) aims to model a student's learning state over time and predict their future performance. However, traditional KT methods often face challenges in explainability, scalability, and effective modeling of complex knowledge dependencies. While Large Language Models (LLMs) present new avenues for KT, their direct application often struggles with generating structured, explainable student representations and lacks mechanisms for continuous, task-specific refinement. To address these gaps, we propose Collaborative Iterative Knowledge Tracing (CIKT), a framework that harnesses LLMs to enhance both prediction accuracy and explainability. CIKT employs a dual-component architecture: an Analyst generates dynamic, explainable user profiles from student historical responses, and a Predictor utilizes these profiles to forecast future performance. The core of CIKT is a synergistic optimization loop. In this loop, the Analyst is iteratively refined based on the predictive accuracy of the Predictor, which conditions on the generated profiles, and the Predictor is subsequently retrained using these enhanced profiles. Evaluated on multiple educational datasets, CIKT demonstrates significant improvements in prediction accuracy, offers enhanced explainability through its dynamically updated user profiles, and exhibits improved scalability. Our work presents a robust and explainable solution for advancing knowledge tracing systems, effectively bridging the gap between predictive performance and model transparency.
中文摘要:提出的协作迭代知识追踪(CIKT)框架通过结合大语言模型与双组件架构,迭代优化可解释的学生画像和预测精度,有效提升了知识追踪的性能与透明度。
English Summary: The proposed Collaborative Iterative Knowledge Tracing (CIKT) framework enhances knowledge tracing by combining Large Language Models with a dual-component architecture that iteratively refines explainable student profiles and prediction accuracy.

Authors:Geeta Chandra Raju Bethala, Hao Huang, Niraj Pudasaini, Abdullah Mohamed Ali, Shuaihang Yuan, Congcong Wen, Anthony Tzes, Yi Fang
Title: H2-COMPACT: Human-Humanoid Co-Manipulation via Adaptive Contact Trajectory Policies
Abstract:
We present a hierarchical policy-learning framework that enables a legged humanoid to cooperatively carry extended loads with a human partner using only haptic cues for intent inference. At the upper tier, a lightweight behavior-cloning network consumes six-axis force/torque streams from dual wrist-mounted sensors and outputs whole-body planar velocity commands that capture the leader's applied forces. At the lower tier, a deep-reinforcement-learning policy, trained under randomized payloads (0-3 kg) and friction conditions in Isaac Gym and validated in MuJoCo and on a real Unitree G1, maps these high-level twists to stable, under-load joint trajectories. By decoupling intent interpretation (force -> velocity) from legged locomotion (velocity -> joints), our method combines intuitive responsiveness to human inputs with robust, load-adaptive walking. We collect training data without motion-capture or markers, only synchronized RGB video and F/T readings, employing SAM2 and WHAM to extract 3D human pose and velocity. In real-world trials, our humanoid achieves cooperative carry-and-move performance (completion time, trajectory deviation, velocity synchrony, and follower-force) on par with a blindfolded human-follower baseline. This work is the first to demonstrate learned haptic guidance fused with full-body legged control for fluid human-humanoid co-manipulation. Code and videos are available on the H2-COMPACT website.
中文摘要:本研究提出了一种分层策略学习框架,使腿式人形机器人仅通过触觉线索就能与人类伙伴协同搬运物体,实现了直观响应与负载自适应行走的稳健结合。
English Summary: This study introduces a hierarchical policy-learning framework that enables a legged humanoid to cooperatively carry loads with a human partner using only haptic cues, combining intuitive responsiveness with robust, load-adaptive walking.

Authors:Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank
Title: Refusal Direction is Universal Across Safety-Aligned Languages
Abstract:
Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. While this is primarily demonstrated in an English-centric context, appropriate refusal behavior is important for any language, but poorly understood. In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned language transfer seamlessly to others. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks. These findings provide actionable insights for building more robust multilingual safety defenses and pave the way for a deeper mechanistic understanding of cross-lingual vulnerabilities in LLMs.
中文: 研究发现,从英语中提取的单一拒绝方向无需微调即可有效绕过14种语言的安全拒绝,揭示了跨语言普遍性,并为多语言防御机制提供了重要见解。
English: Researchers discovered that a single refusal direction extracted from English can effectively bypass safety refusals across 14 languages without fine-tuning, revealing cross-lingual universality and providing insights for multilingual defense mechanisms.

Authors:Junzhe Jiang, Nan Song, Jingyu Li, Xiatian Zhu, Li Zhang
Title: RealEngine: Simulating Autonomous Driving in Realistic Context
Abstract:
Driving simulation plays a crucial role in developing reliable driving agents by providing controlled, evaluative environments. To enable meaningful assessments, a high-quality driving simulator must satisfy several key requirements: multi-modal sensing capabilities (e.g., camera and LiDAR) with realistic scene rendering to minimize observational discrepancies; closed-loop evaluation to support free-form trajectory behaviors; highly diverse traffic scenarios for thorough evaluation; multi-agent cooperation to capture interaction dynamics; and high computational efficiency to ensure affordability and scalability. However, existing simulators and benchmarks fail to comprehensively meet these fundamental criteria. To bridge this gap, this paper introduces RealEngine, a novel driving simulation framework that holistically integrates 3D scene reconstruction and novel view synthesis techniques to achieve realistic and flexible closed-loop simulation in the driving context. By leveraging real-world multi-modal sensor data, RealEngine reconstructs background scenes and foreground traffic participants separately, allowing for highly diverse and realistic traffic scenarios through flexible scene composition. This synergistic fusion of scene reconstruction and view synthesis enables photorealistic rendering across multiple sensor modalities, ensuring both perceptual fidelity and geometric accuracy. Building upon this environment, RealEngine supports three essential driving simulation categories: non-reactive simulation, safety testing, and multi-agent interaction, collectively forming a reliable and comprehensive benchmark for evaluating the real-world performance of driving agents.
中文摘要:RealEngine是一种创新的驾驶仿真框架,通过融合三维场景重建与新颖视角合成技术,构建具备多模态感知的高拟真环境,为驾驶智能体提供涵盖非反应式测试、安全评估与多智能体交互的全面性能验证平台。
English Summary: RealEngine is a comprehensive driving simulation framework that integrates 3D scene reconstruction and novel view synthesis to create realistic, multi-modal environments for thorough evaluation of driving agents across diverse scenarios.

Authors:Meng Yan, Cai Xu, Xujing Wang, Ziyu Guan, Wei Zhao, Yuhang Zhou
Title: Conf-GNNRec: Quantifying and Calibrating the Prediction Confidence for GNN-based Recommendation Methods
Abstract:
Recommender systems based on graph neural networks perform well in tasks such as rating and ranking. However, in real-world recommendation scenarios, noise such as user misuse and malicious advertisement gradually accumulates through the message propagation mechanism. Even if existing studies mitigate their effects by reducing the noise propagation weights, the severe sparsity of the recommender system still leads to the low-weighted noisy neighbors being mistaken as meaningful information, and the prediction result obtained based on the polluted nodes is not entirely trustworthy. Therefore, it is crucial to measure the confidence of the prediction results in this highly noisy framework. Furthermore, our evaluation of existing representative GNN-based recommendation methods shows that they suffer from overconfidence. Based on the above considerations, we propose a new method to quantify and calibrate the prediction confidence of GNN-based recommendations (Conf-GNNRec). Specifically, we propose a rating calibration method that dynamically adjusts excessive ratings to mitigate overconfidence based on user personalization. We also design a confidence loss function to reduce the overconfidence of negative samples and effectively improve recommendation performance. Experiments on public datasets demonstrate the validity of Conf-GNNRec in prediction confidence and recommendation performance.
中文: 基于图神经网络的推荐系统因噪声积累和预测过度自信而面临挑战,为此提出的Conf-GNNRec方法通过动态评分校准和置信度损失函数来量化并修正预测置信度,有效提升了系统的可靠性和推荐性能。
English: Graph neural network-based recommender systems face challenges from accumulated noise and overconfidence in predictions, prompting the development of Conf-GNNRec, which calibrates confidence through dynamic rating adjustments and a specialized loss function to enhance reliability and performance.

Authors:Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, Yuki Mitsufuji
Title: SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet
Abstract:
Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction. ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions. In contrast, from-scratch models achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders. We have observed a performance gap between ControlNet-based and from-scratch foley models. To narrow this gap, we propose SpecMaskFoley, a method that steers the pretrained SpecMaskGIT model toward video-synchronized foley synthesis via ControlNet. To unlock the potential of a single ControlNet branch, we resolve the discrepancy between the temporal video features and the time-frequency nature of the pretrained SpecMaskGIT via a frequency-aware temporal feature aligner, eliminating the need for the complicated conditioning mechanisms widely used in prior art. Evaluations on a common foley synthesis benchmark demonstrate that SpecMaskFoley could even outperform strong from-scratch baselines, substantially advancing the development of ControlNet-based foley synthesis models. Demo page: https://zzaudio.github.io/SpecMaskFoley_Demo/
Chinese Summary: SpecMaskFoley通过引入频率感知时序特征对齐器,将预训练的SpecMaskGIT模型适配于ControlNet框架,显著提升了基于视频的拟音合成性能,甚至优于从头训练的基线模型。
English Summary: SpecMaskFoley bridges the performance gap in video-synchronized foley synthesis by adapting the pretrained SpecMaskGIT model through ControlNet with a frequency-aware temporal feature aligner, outperforming from-scratch models.

Authors:Congyuan Zhao, Lingwei Wei, Ziming Qin, Wei Zhou, Yunya Song, Songlin Hu
Title: MPPFND: A Dataset and Analysis of Detecting Fake News with Multi-Platform Propagation
Abstract:
Fake news spreads widely on social media, leading to numerous negative effects. Most existing detection algorithms focus on analyzing news content and social context to detect fake news. However, these approaches typically detect fake news based on specific platforms, ignoring differences in propagation characteristics across platforms. In this paper, we introduce the MPPFND dataset, which captures propagation structures across multiple platforms. We also describe the commenting and propagation characteristics of different platforms to show that their social contexts have distinct features. We propose a multi-platform fake news detection model (APSL) that uses graph neural networks to extract social context features from various platforms. Experiments show that accounting for cross-platform propagation differences improves fake news detection performance.
中文摘要:本研究通过引入多平台数据集和检测模型,利用图神经网络提取各平台独特的传播特征,有效提升了跨平台假新闻检测的性能。
English Summary: This study introduces a multi-platform dataset and a detection model that leverages graph neural networks to improve fake news detection by addressing the distinct propagation characteristics across different social media platforms.

Authors:Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du
Title: Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering
Abstract:
Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keeps the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up activations of top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16% across six challenging concepts, while maintaining topic relevance.
中文: SDCV通过提取最具区分度的特征并消除数据集噪声,将LLM引导成功率提升了4-16%,同时在六个挑战性概念中保持了主题相关性。
English: SDCV enhances LLM steering by isolating the most discriminative features from dataset noise, boosting success rates by 4-16% across six concepts while preserving topic relevance.

Authors:Size Peng, Yin Xu, Guanli Yi, Cixiao Zhang, Dazhi He, Wenjun Zhang
Title: Movable Antenna Aided Full-Duplex ISAC System with Self-Interference Mitigation
Abstract:
Movable antenna (MA) has shown significant potential for improving the performance of integrated sensing and communication (ISAC) systems. In this paper, we model an MA-aided ISAC system operating in a communication full-duplex mono-static sensing framework. The self-interference channel is modeled as a function of the antenna position vectors under the near-field channel condition. We develop an optimization problem to maximize the weighted sum of downlink and uplink communication rates alongside the mutual information relevant to the sensing task. To address this highly non-convex problem, we employ the fractional programming (FP) method and propose an alternating optimization (AO)-based algorithm that jointly optimizes the beamforming, user power allocation, and antenna positions at the transceivers. Given the sensitivity of the AO-based algorithm to the initial antenna positions, a PSO-based algorithm is proposed to explore superior sub-optimal antenna positions within the feasible region. Numerical results indicate that the proposed algorithms enable the MA system to effectively leverage the antenna position flexibility for accurate beamforming in a complex ISAC scenario. This enhances the system's self-interference cancellation (SIC) capabilities and markedly improves its overall performance and reliability compared to conventional fixed-position antenna designs.
Chinese: 本文提出了一种可移动天线系统,通过分数规划和交替优化方法联合优化波束成形、功率分配及天线位置,有效提升了集成感知与通信系统的性能及自干扰消除能力,显著优于传统固定天线设计。
English: This paper introduces a movable antenna (MA) system to enhance integrated sensing and communication (ISAC) by optimizing beamforming, power allocation, and antenna positions using fractional programming and alternating optimization, significantly improving performance and self-interference cancellation over fixed antennas.

Authors:Yunho Jin, Gu-Yeon Wei, David Brooks
Title: The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute
Abstract:
Scaling large language models (LLMs) has driven significant advancements, yet it faces diminishing returns and escalating energy demands. This work introduces test-time compute (TTC), the allocation of additional computational resources during inference, as a compelling complement to conventional scaling strategies. Specifically, we investigate whether employing TTC can achieve superior accuracy-energy trade-offs compared to simply increasing model size. Our empirical analysis reveals that TTC surpasses traditional model scaling in accuracy/energy efficiency, with notable gains in tasks demanding complex reasoning rather than mere factual recall. Further, we identify a critical interaction between TTC performance and output sequence length, demonstrating that strategically adjusting compute resources at inference time according to query complexity can substantially enhance efficiency. Our findings advocate for TTC as a promising direction, enabling more sustainable, accurate, and adaptable deployment of future language models without incurring additional pretraining costs.
中文摘要:在推理阶段分配测试时计算资源,相比传统模型扩展方法,能实现更优的精度-能耗平衡,尤其适用于复杂推理任务,无需额外预训练成本即可实现语言模型的可持续及自适应部署。
English Summary: Test-time compute allocation during inference offers superior accuracy-energy efficiency over traditional model scaling, especially for complex reasoning tasks, enabling sustainable and adaptable LLM deployment without extra pretraining costs.

Authors:Eugene Yang, Andrew Yates, Kathryn Ricci, Orion Weller, Vivek Chari, Benjamin Van Durme, Dawn Lawrie
Title: Rank-K: Test-Time Reasoning for Listwise Reranking
Abstract:
Retrieve-and-rerank is a popular retrieval pipeline because of its ability to make slow but effective rerankers efficient enough at query time by reducing the number of comparisons. Recent works in neural rerankers take advantage of large language models for their capability in reasoning between queries and passages and have achieved state-of-the-art retrieval effectiveness. However, such rerankers are resource-intensive, even after heavy optimization. In this work, we introduce Rank-K, a listwise passage reranking model that leverages the reasoning capability of a reasoning language model at query time, providing test-time scalability to serve hard queries. We show that Rank-K improves retrieval effectiveness by 23% over RankZephyr, the state-of-the-art listwise reranker, when reranking a BM25 initial ranked list and by 19% when reranking strong retrieval results from SPLADE-v3. Since Rank-K is inherently a multilingual model, we found that it ranks passages based on queries in different languages as effectively as it does in monolingual retrieval.
中文:Rank-K是一种列表式段落重排序模型,将检索效果较现有最优方法提升23%,并在多语言环境下高效处理复杂查询,实现可扩展性。
English: Rank-K is a listwise passage reranking model that enhances retrieval effectiveness by 23% over state-of-the-art methods while providing multilingual scalability for handling complex queries efficiently.

Authors:Pengzhou Cheng, Haowen Hu, Zheng Wu, Zongru Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu
Title: Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents
Abstract:
Graphical user interface (GUI) agents powered by multimodal large language models (MLLMs) have shown great promise for human interaction. However, due to the high fine-tuning cost, users often rely on open-source GUI agents or APIs offered by AI providers, which introduces a critical but underexplored supply chain threat: backdoor attacks. In this work, we first unveil that MLLM-powered GUI agents naturally expose multiple interaction-level triggers, such as historical steps, environment states, and task progress. Based on this observation, we introduce AgentGhost, an effective and stealthy framework for red-teaming backdoor attacks. Specifically, we first construct composite triggers by combining goal and interaction levels, allowing GUI agents to unintentionally activate backdoors while ensuring task utility. Then, we formulate backdoor injection as a Min-Max optimization problem that uses supervised contrastive learning to maximize the feature difference across sample classes at the representation space, improving flexibility of the backdoor. Meanwhile, it adopts supervised fine-tuning to minimize the discrepancy between backdoor and clean behavior generation, enhancing effectiveness and utility. Extensive evaluations of various agent models in two established mobile benchmarks show that AgentGhost is effective and generic, with attack accuracy that reaches 99.7% on three attack objectives, and shows stealthiness with only 1% utility degradation. Furthermore, we tailor a defense method against AgentGhost that reduces the attack accuracy to 22.1%. Our code is available at anonymous.
中文: 本研究提出AgentGhost框架,通过组合触发器对多模态大语言模型驱动的图形界面代理实施隐蔽后门攻击,在保持任务效用同时实现99.7%攻击成功率,并开发出将攻击率降至22.1%的防御方案。
English: This study introduces AgentGhost, a stealthy backdoor attack framework targeting MLLM-powered GUI agents that exploits composite triggers to achieve 99.7% attack accuracy while maintaining utility, and proposes a defense method reducing attacks to 22.1%.

Authors:Inder Pal Singh, Enjie Ghorbel, Anis Kacem, Djamila Aouada
Title: Domain Adaptation for Multi-label Image Classification: a Discriminator-free Approach
Abstract:
This paper introduces a discriminator-free adversarial-based approach termed DDA-MLIC for Unsupervised Domain Adaptation (UDA) in the context of Multi-Label Image Classification (MLIC). While recent efforts have explored adversarial-based UDA methods for MLIC, they typically include an additional discriminator subnet. Nevertheless, decoupling the classification and the discrimination tasks may harm their task-specific discriminative power. Herein, we address this challenge by presenting a novel adversarial critic directly derived from the task-specific classifier. Specifically, we employ a two-component Gaussian Mixture Model (GMM) to model both source and target predictions, distinguishing between two distinct clusters. Instead of using the traditional Expectation Maximization (EM) algorithm, our approach utilizes a Deep Neural Network (DNN) to estimate the parameters of each GMM component. Subsequently, the source and target GMM parameters are leveraged to formulate an adversarial loss using the Fréchet distance. The proposed framework is therefore not only fully differentiable but is also cost-effective as it avoids the expensive iterative process usually induced by the standard EM method. The proposed method is evaluated on several multi-label image datasets covering three different types of domain shift. The obtained results demonstrate that DDA-MLIC outperforms existing state-of-the-art methods in terms of precision while requiring a lower number of parameters. The code is made publicly available at github.com/cvi2snt/DDA-MLIC.
中文摘要:本文提出DDA-MLIC,一种用于多标签图像分类无监督领域适应的无判别器对抗方法,通过高斯混合模型和神经网络构建对抗损失,在减少参数量的同时实现了比现有方法更高的精确度。
English Summary: This paper presents DDA-MLIC, a discriminator-free adversarial method for unsupervised domain adaptation in multi-label image classification that uses a Gaussian Mixture Model and neural networks to create an adversarial loss, achieving higher precision with fewer parameters than existing approaches.
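
The Fréchet distance between Gaussians mentioned in this entry has a closed form, which the sketch below shows for the one-dimensional case: the squared distance is the squared difference of the means plus the squared difference of the standard deviations. Pairing source and target mixture components by index and summing their squared distances is an illustrative assumption, not the paper's stated loss; the function names and numbers are hypothetical.

```python
import math

def frechet_distance_1d(mu1, sigma1, mu2, sigma2):
    """Fréchet (2-Wasserstein) distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)

def paired_component_loss(source_gmm, target_gmm):
    """Sum of squared Fréchet distances between index-matched components.

    Each GMM is a list of (mean, std) pairs. Matching components by index
    is an assumption made for this sketch.
    """
    return sum(frechet_distance_1d(ms, ss, mt, st) ** 2
               for (ms, ss), (mt, st) in zip(source_gmm, target_gmm))

# Two-component mixtures over prediction scores (made-up values):
# one component near 0 (negative labels), one near 1 (positive labels).
source = [(0.1, 0.05), (0.9, 0.05)]
target = [(0.3, 0.05), (0.6, 0.05)]

loss = paired_component_loss(source, target)
```

Because the expression uses only means and standard deviations, it stays differentiable when those parameters are predicted by a network, which is consistent with the abstract's point about avoiding the iterative EM procedure.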

Authors:Yaqian Chen, Hanxue Gu, Haoyu Dong, Qihang Li, Yuwen Chen, Nicholas Konz, Lin Li, Maciej A. Mazurowski
Title: GuidedMorph: Two-Stage Deformable Registration for Breast MRI
Abstract:
Accurately registering breast MR images from different time points enables the alignment of anatomical structures and tracking of tumor progression, supporting more effective breast cancer detection, diagnosis, and treatment planning. However, the complexity of dense tissue and its highly non-rigid nature pose challenges for conventional registration methods, which primarily focus on aligning general structures while overlooking intricate internal details. To address this, we propose GuidedMorph, a novel two-stage registration framework designed to better align dense tissue. In addition to a single-scale network for global structure alignment, we introduce a framework that utilizes dense tissue information to track breast movement. The learned transformation fields are fused by introducing the Dual Spatial Transformer Network (DSTN), improving overall alignment accuracy. A novel warping method based on the Euclidean distance transform (EDT) is also proposed to accurately warp the registered dense tissue and breast masks, preserving fine structural details during deformation. The framework supports both paradigms that require external segmentation models and paradigms that use image data only. It also operates effectively with the VoxelMorph and TransMorph backbones, offering a versatile solution for breast registration. We validate our method on ISPY2 and an internal dataset, demonstrating superior performance in dense tissue, overall breast alignment, and breast structural similarity index measure (SSIM), with notable improvements of over 13.01% in dense tissue Dice, 3.13% in breast Dice, and 1.21% in breast SSIM compared to the best learning-based baseline.
中文: 提出的GuidedMorph框架通过整合密集组织追踪和双重空间变换网络,在乳腺MR图像配准中实现了更优性能,相比现有方法将密集组织对齐精度提升了超过13%。
English: The proposed GuidedMorph framework achieves superior breast MR image registration by integrating dense tissue tracking and a dual spatial transformer network, demonstrating over 13% improvement in dense tissue alignment accuracy compared to existing methods.
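
The Dice scores quoted in the abstract measure mask overlap after registration. As a minimal sketch (not the authors' evaluation code), the standard Dice coefficient on two binary masks can be computed as below; the flat 0/1 lists and the names `fixed` and `warped` are illustrative assumptions.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    # Voxels that are foreground in both masks.
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    # Total foreground voxels across the two masks.
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical toy masks: a fixed-image mask and a warped moving-image mask.
fixed = [0, 1, 1, 1, 0, 0]
warped = [0, 0, 1, 1, 1, 0]
print(round(dice_coefficient(fixed, warped), 4))
```

A registration method that aligns dense tissue better moves this value closer to 1.0 for the dense tissue mask.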

Authors:Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong
Title: Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Abstract:
Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.
中文:OSWorld-G是一个解决图形用户界面定位局限性的新基准,包含564个标注样本和400万条数据的Jedi数据集,显著提升了模型在复杂计算机任务上的性能和智能体能力。
English: OSWorld-G is a new benchmark addressing GUI grounding limitations with 564 annotated samples and the Jedi dataset of 4 million examples, significantly improving model performance and agent capabilities on complex computer tasks.

Authors:Piotr Borycki, Magdalena Trędowicz, Szymon Janusz, Jacek Tabor, Przemysław Spurek, Arkadiusz Lewicki, Łukasz Struski
Title: EPIC: Explanation of Pretrained Image Classification Networks via Prototype
Abstract:
Explainable AI (XAI) methods generally fall into two categories. Post-hoc approaches generate explanations for pre-trained models and are compatible with various neural network architectures. These methods often use feature importance visualizations, such as saliency maps, to indicate which input regions influenced the model's prediction. Unfortunately, they typically offer a coarse understanding of the model's decision-making process. In contrast, ante-hoc (inherently explainable) methods rely on specially designed model architectures trained from scratch. A notable subclass of these methods provides explanations through prototypes, representative patches extracted from the training data. However, prototype-based approaches have limitations: they require dedicated architectures, involve specialized training procedures, and perform well only on specific datasets. In this work, we propose EPIC (Explanation of Pretrained Image Classification), a novel approach that bridges the gap between these two paradigms. Like post-hoc methods, EPIC operates on pre-trained models without architectural modifications. Simultaneously, it delivers intuitive, prototype-based explanations inspired by ante-hoc techniques. To the best of our knowledge, EPIC is the first post-hoc method capable of fully replicating the core explanatory power of inherently interpretable models. We evaluate EPIC on benchmark datasets commonly used in prototype-based explanations, such as CUB-200-2011 and Stanford Cars, alongside large-scale datasets like ImageNet, typically employed by post-hoc methods. EPIC uses prototypes to explain model decisions, providing a flexible and easy-to-understand tool for creating clear, high-quality explanations.
中文: EPIC弥合了事后与事前可解释AI方法之间的差距,无需修改预训练模型架构即可提供基于原型的解释,直观展示模型的决策依据。
English: EPIC bridges the gap between post-hoc and ante-hoc explainable AI by providing prototype-based explanations for pre-trained models without architectural changes, offering intuitive insights into model decisions.

Authors:Wenjiao Feng, Rongxing Xiao, Zonghang Li, Hongfang Yu, Gang Sun, Long Luo, Mohsen Guizani, Qirong Ho, Steve Liu
Title: Learning In Chaos: Efficient Autoscaling and Self-Healing for Multi-Party Distributed Training
Abstract:
Node and link churn in multi-party, cross-region clusters over wide-area networks (WANs) often disrupts distributed training. However, checkpoint-based recovery and cloud-centric autoscaling react slowly and assume centralized control, which is misaligned with the self-governed setup where institutions can freely join and leave. This paper proposes Chaos, a multi-party distributed training system with self-healing and autoscaling, enabling robust and elastic training under churn. It speeds up autoscaling via multi-neighbor state replication and model sharding. We formalize the sharding and assignment as a MINLP that captures WAN heterogeneity, and reduce it to a tractable MILP by analyzing its monotonicity on a divisibility chain. By establishing an equivalence, we derive a greedy algorithm that follows optimality rules and yields the optimal solution in polynomial time. Chaos uses a cluster monitor to track resource and topology changes, and handles scaling events through peer negotiation protocols, enabling fully self-governed autoscaling among institutions. Experiments show that Chaos has substantially lower scale-out delay than Pollux, Elan, and Autoscaling, and handles scale-in, connect-link, and disconnect-link events within 20ms. It also delivers the lowest idle time, showing superior resource use and scalability as the cluster grows.
中文: 本文提出Chaos系统,一种具备自我修复和自动扩展功能的多方分布式训练方案,通过多邻居状态复制、模型分片及对等协商协议,在节点和链路变动下实现稳健高效的训练,相比现有方法显著降低了扩展延迟和空闲时间。
English: This paper introduces Chaos, a self-healing and autoscaling system for multi-party distributed training that ensures robustness and efficiency under node and link churn by employing multi-neighbor state replication, model sharding, and peer negotiation protocols, significantly reducing delays and idle time compared to existing solutions.

Authors:Luyu Chen, Zeyu Zhang, Haoran Tan, Quanyu Dai, Hao Yang, Zhenhua Dong, Xu Chen
Title: Beyond Single-Point Judgment: Distribution Alignment for LLM-as-a-Judge
Abstract:
LLMs have emerged as powerful evaluators in the LLM-as-a-Judge paradigm, offering significant efficiency and flexibility compared to human judgments. However, previous methods primarily rely on single-point evaluations, overlooking the inherent diversity and uncertainty in human evaluations. This approach leads to information loss and decreases the reliability of evaluations. To address this limitation, we propose a novel training framework that explicitly aligns the LLM-generated judgment distribution with empirical human distributions. Specifically, we propose a distributional alignment objective based on KL divergence, combined with an auxiliary cross-entropy regularization to stabilize the training process. Furthermore, considering that empirical distributions may derive from limited human annotations, we incorporate adversarial training to enhance model robustness against distribution perturbations. Extensive experiments across various LLM backbones and evaluation tasks demonstrate that our framework significantly outperforms existing closed-source LLMs and conventional single-point alignment methods, with improved alignment quality, evaluation accuracy, and robustness.
中文摘要:本研究提出了一种新颖的训练框架,通过KL散度对齐和对抗训练将LLM生成的判断分布与人类经验分布相匹配,相比传统单点评估方法显著提升了评估准确性与鲁棒性。
English Summary: This study introduces a novel training framework that aligns LLM-generated judgment distributions with human distributions using KL divergence and adversarial training, significantly improving evaluation accuracy and robustness over traditional single-point methods.
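
The objective described above combines a KL-divergence term between the human and model judgment distributions with an auxiliary cross-entropy term. The following is a minimal sketch of that combination, not the authors' implementation: the KL direction, the use of the majority human score as the cross-entropy target, and the weight `ce_weight` are all assumptions, and the adversarial training step is omitted.

```python
import math


def softmax(logits):
    """Convert raw scores over rating bins into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same rating bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def alignment_loss(logits, human_dist, ce_weight=0.1):
    """KL term plus a cross-entropy regularizer (ce_weight is hypothetical)."""
    model_dist = softmax(logits)
    kl = kl_divergence(human_dist, model_dist)
    # Assumed regularizer target: the rating most annotators chose.
    majority = max(range(len(human_dist)), key=lambda i: human_dist[i])
    ce = -math.log(model_dist[majority] + 1e-12)
    return kl + ce_weight * ce


# Hypothetical example: three rating bins, annotators split 10/60/30.
human = [0.1, 0.6, 0.3]
matched = [math.log(p) for p in human]   # model reproduces the human split
overconfident = [2.0, 0.0, 0.0]          # model piles mass on the wrong bin
print(alignment_loss(matched, human) < alignment_loss(overconfident, human))
```

A single-point objective would only see the majority label; the KL term additionally penalizes a model that ignores the 10% and 30% minority judgments.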

Authors:Jianxiang Yu, Jiapeng Zhu, Hao Qian, Ziqi Liu, Zhiqiang Zhang, Xiang Li
Title: Relation-Aware Graph Foundation Model
Abstract:
In recent years, large language models (LLMs) have demonstrated remarkable generalization capabilities across various natural language processing (NLP) tasks. Similarly, graph foundation models (GFMs) have emerged as a promising direction in graph learning, aiming to generalize across diverse datasets through large-scale pre-training. However, unlike language models that rely on explicit token representations, graphs lack a well-defined unit for generalization, making it challenging to design effective pre-training strategies. In this work, we propose REEF, a novel framework that leverages relation tokens as the basic units for GFMs. Inspired by the token vocabulary in LLMs, we construct a relation vocabulary of relation tokens to store relational information within graphs. To accommodate diverse relations, we introduce two hypernetworks that adaptively generate the parameters of aggregators and classifiers in graph neural networks based on relation tokens. In addition, we design another hypernetwork to construct dataset-specific projectors and incorporate a dataset-level feature bias into the initial node representations, enhancing flexibility across different datasets with the same relation. Further, we adopt graph data augmentation and a mixed-dataset pre-training strategy, allowing REEF to capture relational diversity more effectively and exhibit strong generalization capabilities. Extensive experiments show that REEF significantly outperforms existing methods on both pre-training and transfer learning tasks, underscoring its potential as a powerful foundation model for graph-based applications.
中文摘要:REEF框架创新性地将关系标记作为图基础模型的基本单元,通过超网络和混合数据集预训练策略,在图形学习中展现出卓越的泛化能力。
English Summary: The REEF framework introduces relation tokens as fundamental units for graph foundation models, utilizing hypernetworks and a mixed-dataset pre-training strategy to achieve superior generalization across graph learning tasks.

Authors:Qi Zhou, Jie Zhang, Dongxia Wang, Qiang Liu, Tianlin Li, Jin Song Dong, Wenhai Wang, Qing Guo
Title: Fair-PP: A Synthetic Dataset for Aligning LLM with Personalized Preferences of Social Equity
Abstract:
Human preference plays a crucial role in the refinement of large language models (LLMs). However, collecting human preference feedback is costly and most existing datasets neglect the correlation between personalization and preferences. To address this issue, we introduce Fair-PP, a synthetic dataset of personalized preferences targeting social equity, derived from real-world social survey data, which includes 28 social groups, 98 equity topics, and 5 personal preference dimensions. Leveraging GPT-4o-mini, we engage in role-playing based on seven representative persona portrayals guided by existing social survey data, yielding a total of 238,623 preference records. Through Fair-PP, we also contribute (i) an automated framework for generating preference data, along with a more fine-grained dataset of personalized preferences; (ii) analysis of the positioning of the existing mainstream LLMs across five major global regions within the personalized preference space; and (iii) a sample reweighting method for personalized preference alignment, enabling alignment with a target persona while maximizing the divergence from other personas. Empirical experiments show our method outperforms the baselines.
中文摘要:Fair-PP数据集通过GPT-4o-mini基于真实社会调查数据生成23.8万条个性化偏好记录,填补了LLM个性化偏好数据的空白,并提出自动化生成框架、区域模型分析和个性化对齐方法,实验证明其优于基线模型。
English Summary: The Fair-PP dataset addresses the gap in personalized human preference data for LLMs by synthesizing 238,623 preference records from diverse social groups using GPT-4o-mini, and introduces an automated generation framework, regional LLM analysis, and a persona-aligned reweighting method that outperforms baselines.

Authors:Raghuveer Thirukovalluru, Rui Meng, Ye Liu, Karthikeyan K, Mingyi Su, Ping Nie, Semih Yavuz, Yingbo Zhou, Wenhu Chen, Bhuwan Dhingra
Title: Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining
Abstract:
Contrastive learning (CL) is a prevalent technique for training embedding models, which pulls semantically similar examples (positives) closer in the representation space while pushing dissimilar ones (negatives) further apart. A key source of negatives is 'in-batch' examples, i.e., positives from other examples in the batch. The effectiveness of such models is hence strongly influenced by the size and quality of training batches. In this work, we propose 'Breaking the Batch Barrier' (B3), a novel batch construction strategy designed to curate high-quality batches for CL. Our approach begins by using a pretrained teacher embedding model to rank all examples in the dataset, from which a sparse similarity graph is constructed. A community detection algorithm is then applied to this graph to identify clusters of examples that serve as strong negatives for one another. The clusters are then used to construct batches that are rich in in-batch negatives. Empirical results on the MMEB multimodal embedding benchmark (36 tasks) demonstrate that our method sets a new state of the art, outperforming previous best methods by +1.3 and +2.9 points at the 7B and 2B model scales, respectively. Notably, models trained with B3 surpass existing state-of-the-art results even with a batch size as small as 64, which is 4-16x smaller than that required by other methods.
Chinese: B3方法通过使用教师模型构建相似性图并识别强负例簇,为对比学习构建高质量批次,在MMEB基准测试中以显著更小的批次规模实现了最先进的性能。
English: The proposed B3 method constructs high-quality batches for contrastive learning by using a teacher model to create a similarity graph and identify clusters of strong negatives, achieving state-of-the-art results on the MMEB benchmark with significantly smaller batch sizes.
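
The pipeline above (teacher ranking, sparse similarity graph, clustering, batch construction) can be sketched in a few functions. This is an illustrative sketch only: the similarity matrix is hand-written rather than produced by a teacher model, connected components stand in for the community detection algorithm, and `top_k` and `batch_size` are hypothetical parameters.

```python
from collections import defaultdict


def build_sparse_graph(teacher_sim, top_k):
    """Link each example to its top_k most similar other examples."""
    n = len(teacher_sim)
    edges = defaultdict(set)
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: teacher_sim[i][j], reverse=True)
        for j in ranked[:top_k]:
            edges[i].add(j)
            edges[j].add(i)
    return edges


def connected_components(n, edges):
    """Stand-in for community detection: group nodes reachable from each other."""
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in edges[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


def make_batches(clusters, batch_size):
    """Fill batches cluster by cluster so similar examples share a batch."""
    batches, current = [], []
    for cluster in clusters:
        for idx in cluster:
            current.append(idx)
            if len(current) == batch_size:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical teacher similarities: examples 0-2 and 3-5 form two tight groups.
sim = [[0.9 if (i < 3) == (j < 3) else 0.1 for j in range(6)] for i in range(6)]
clusters = connected_components(6, build_sparse_graph(sim, top_k=2))
print(make_batches(clusters, batch_size=3))
```

Each resulting batch contains examples the teacher considers close to one another, so every in-batch negative is a hard one rather than a random, easy one.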

Authors:Liam Boyle, Jonas Kühne, Nicolas Baumann, Niklas Bastuck, Michele Magno
Title: Planar Velocity Estimation for Fast-Moving Mobile Robots Using Event-Based Optical Flow
Abstract:
Accurate velocity estimation is critical in mobile robotics, particularly for driver assistance systems and autonomous driving. Wheel odometry fused with Inertial Measurement Unit (IMU) data is a widely used method for velocity estimation; however, it typically requires strong assumptions, such as non-slip steering, or complex vehicle dynamics models that do not hold under varying environmental conditions like slippery surfaces. We introduce an approach to velocity estimation that is decoupled from wheel-to-surface traction assumptions by leveraging planar kinematics in combination with optical flow from event cameras pointed perpendicularly at the ground. The asynchronous micro-second latency and high dynamic range of event cameras make them highly robust to motion blur, a common challenge in vision-based perception techniques for autonomous driving. The proposed method is evaluated through in-field experiments on a 1:10 scale autonomous racing platform and compared to precise motion capture data, demonstrating not only performance on par with the state-of-the-art Event-VIO method but also a 38.3 % improvement in lateral error. Qualitative experiments at highway speeds of up to 32 m/s further confirm the effectiveness of our approach, indicating significant potential for real-world deployment.
中文: 本文提出了一种新颖的移动机器人速度估计方法,通过结合平面运动学和地面事件相机的光流技术,摆脱了对车轮与地面附着条件的依赖,在自主赛车测试中展现出更优性能并显著降低了横向误差。
English: This paper presents a novel velocity estimation method for mobile robotics that combines planar kinematics with ground-facing event cameras' optical flow, eliminating dependency on wheel-surface traction assumptions and demonstrating superior performance with reduced lateral errors in autonomous racing tests.

Authors:Yuran Wang, Ruihai Wu, Yue Chen, Jiarui Wang, Jiaqi Liang, Ziyu Zhu, Haoran Geng, Jitendra Malik, Pieter Abbeel, Hao Dong
Title: DexGarmentLab: Dexterous Garment Manipulation Environment with Generalizable Policy
Abstract:
Garment manipulation is a critical challenge due to the diversity in garment categories, geometries, and deformations. Despite this, humans can effortlessly handle garments, thanks to the dexterity of our hands. However, existing research in the field has struggled to replicate this level of dexterity, primarily hindered by the lack of realistic simulations of dexterous garment manipulation. Therefore, we propose DexGarmentLab, the first environment specifically designed for dexterous (especially bimanual) garment manipulation, which features large-scale high-quality 3D assets for 15 task scenarios, and refines simulation techniques tailored for garment modeling to reduce the sim-to-real gap. Previous data collection typically relies on teleoperation or training expert reinforcement learning (RL) policies, which are labor-intensive and inefficient. In this paper, we leverage garment structural correspondence to automatically generate a dataset with diverse trajectories using only a single expert demonstration, significantly reducing manual intervention. However, even extensive demonstrations cannot cover the infinite states of garments, which necessitates the exploration of new algorithms. To improve generalization across diverse garment shapes and deformations, we propose a Hierarchical gArment-manipuLation pOlicy (HALO). It first identifies transferable affordance points to accurately locate the manipulation area, then generates generalizable trajectories to complete the task. Through extensive experiments and detailed analysis of our method and baseline, we demonstrate that HALO consistently outperforms existing methods, successfully generalizing to previously unseen instances even with significant variations in shape and deformation where others fail. Our project page is available at: https://wayrise.github.io/DexGarmentLab/.
中文: DexGarmentLab首创了专用于灵巧双手衣物操作的环境,通过改进模拟技术和提出分层操作策略HALO,利用可迁移的功用点定位和泛化轨迹生成,在未接触过的衣物形状与变形场景中显著优于现有方法。
English: DexGarmentLab introduces the first specialized environment for dexterous bimanual garment manipulation, featuring refined simulations and a hierarchical policy (HALO) that outperforms existing methods by generalizing to unseen garment variations through transferable affordance points and trajectories.

Authors:Maximilian Tölle, Theo Gruner, Daniel Palenicek, Tim Schneider, Jonas Günster, Joe Watson, Davide Tateo, Puze Liu, Jan Peters
Title: Towards Safe Robot Foundation Models Using Inductive Biases
Abstract:
Safety is a critical requirement for the real-world deployment of robotic systems. Unfortunately, while current robot foundation models show promising generalization capabilities across a wide variety of tasks, they fail to address safety, an important aspect for ensuring long-term operation. Current robot foundation models assume that safe behavior should emerge by learning from a sufficiently large dataset of demonstrations. However, this approach has two clear major drawbacks. Firstly, there are no formal safety guarantees for a behavior cloning policy trained using supervised learning. Secondly, without explicit knowledge of any safety constraints, the policy may require an unreasonable number of additional demonstrations to even approximate the desired constrained behavior. To solve these key issues, we show how we can instead combine robot foundation models with geometric inductive biases using ATACOM, a safety layer placed after the foundation policy that ensures safe state transitions by enforcing action constraints. With this approach, we can ensure formal safety guarantees for generalist policies without providing extensive demonstrations of safe behavior, and without requiring any specific fine-tuning for safety. Our experiments show that our approach can be beneficial both for classical manipulation tasks, where we avoid unwanted collisions with irrelevant objects, and for dynamic tasks, such as the robot air hockey environment, where we can generate fast trajectories respecting complex tasks and joint space constraints.
中文: 本文通过将机器人基础模型与ATACOM安全层相结合,解决了现有模型的安全局限性,该安全层通过强制动作约束来确保形式化安全保证,无需大量演示或专门微调。
English: This paper addresses the safety limitations of current robot foundation models by integrating them with ATACOM, a safety layer that enforces action constraints to ensure formal safety guarantees without requiring extensive demonstrations or fine-tuning.

Authors:Han Peng, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Lei Fang
Title: CAFE: Retrieval Head-based Coarse-to-Fine Information Seeking to Enhance Multi-Document QA Capability
Abstract:
Advancements in Large Language Models (LLMs) have extended their input context length, yet they still struggle with retrieval and reasoning in long-context inputs. Existing methods propose to utilize the prompt strategy and retrieval head to alleviate this limitation. However, they still face challenges in balancing retrieval precision and recall, impacting their efficacy in answering questions. To address this, we introduce CAFE, a two-stage coarse-to-fine method to enhance multi-document question-answering capacities. By gradually eliminating the negative impacts of background and distracting documents, CAFE makes the responses more reliant on the evidence documents. Initially, a coarse-grained filtering method leverages retrieval heads to identify and rank relevant documents. Then, a fine-grained steering method guides attention to the most relevant content. Experiments across benchmarks show CAFE outperforms baselines, achieving up to 22.1% and 13.7% SubEM improvement over SFT and RAG methods on the Mistral model, respectively.
Chinese: CAFE是一种两阶段由粗到精的方法,通过筛选和引导注意力至相关内容来增强多文档问答能力,相较于现有方法实现了显著的性能提升。
English: CAFE is a two-stage coarse-to-fine method that enhances multi-document question-answering by filtering and steering attention to relevant content, achieving significant performance improvements over existing methods.

Authors:Tianyu Jiao, Zhuoran Xiao, Yihang Huang, Chenhui Ye, Yijia Feng, Liyu Cai, Jiang Chang, Fangkun Liu, Yin Xu, Dazhi He, Yunfeng Guan, Wenjun Zhang
Title: AI2MMUM: AI-AI Oriented Multi-Modal Universal Model Leveraging Telecom Domain Large Model
Abstract:
Designing a 6G-oriented universal model capable of processing multi-modal data and executing diverse air interface tasks has emerged as a common goal in future wireless systems. Building on our prior work in communication multi-modal alignment and telecom large language model (LLM), we propose a scalable, task-aware artificial intelligence-air interface multi-modal universal model (AI2MMUM), which flexibly and effectively performs various physical layer tasks according to subtle task instructions. The LLM backbone provides robust contextual comprehension and generalization capabilities, while a fine-tuning approach is adopted to incorporate domain-specific knowledge. To enhance task adaptability, task instructions consist of fixed task keywords and learnable, implicit prefix prompts. Frozen radio modality encoders extract universal representations and adapter layers subsequently bridge radio and language modalities. Moreover, lightweight task-specific heads are designed to directly output task objectives. Comprehensive evaluations demonstrate that AI2MMUM achieves SOTA performance across five representative physical environment/wireless channel-based downstream tasks using the WAIR-D and DeepMIMO datasets.
中文摘要:本研究提出AI2MMUM这一可扩展的智能空口多模态通用模型,通过电信大语言模型框架与任务自适应机制,在五项物理层无线任务中实现了最优性能。
English Summary: The study introduces AI2MMUM, a scalable AI-air interface model leveraging a telecom LLM backbone and task-specific adaptations to achieve state-of-the-art performance across multiple physical layer wireless tasks.

Authors:Yi Gui, Zhen Li, Zhongyi Zhang, Yao Wan, Dongping Chen, Hongyu Zhang, Yi Su, Bohua Chen, Xing Zhou, Wenbin Jiang, Xiangliang Zhang
Title: UICopilot: Automating UI Synthesis via Hierarchical Code Generation from Webpage Designs
Abstract:
Automating the synthesis of User Interfaces (UIs) plays a crucial role in enhancing productivity and accelerating the development lifecycle, reducing both development time and manual effort. Recently, the rapid development of Multimodal Large Language Models (MLLMs) has made it possible to generate front-end Hypertext Markup Language (HTML) code directly from webpage designs. However, real-world webpages encompass not only a diverse array of HTML tags but also complex stylesheets, resulting in significantly lengthy code. The lengthy code poses challenges for the performance and efficiency of MLLMs, especially in capturing the structural information of UI designs. To address these challenges, this paper proposes UICopilot, a novel approach to automating UI synthesis via hierarchical code generation from webpage designs. The core idea of UICopilot is to decompose the generation process into two stages: first, generating the coarse-grained HTML hierarchical structure, followed by the generation of fine-grained code. To validate the effectiveness of UICopilot, we conduct experiments on a real-world dataset, i.e., WebCode2M. Experimental results demonstrate that UICopilot significantly outperforms existing baselines in both automatic evaluation metrics and human evaluations. Specifically, statistical analysis reveals that the majority of human annotators prefer the webpages generated by UICopilot over those produced by GPT-4V.
Chinese: UICopilot通过采用两阶段分层代码生成方法,先构建粗粒度HTML结构再生成细粒度代码,显著提升了用户界面合成的自动化水平,在自动评估和人工评价中均优于包括GPT-4V在内的现有基线方法。
English: UICopilot enhances UI synthesis by employing a two-stage hierarchical code generation process that first creates coarse-grained HTML structures and then fine-grained code, outperforming existing methods including GPT-4V in both automated metrics and human evaluations.

Authors:Saurabh Dash, Yiyang Nan, John Dang, Arash Ahmadian, Shivalika Singh, Madeline Smith, Bharat Venkitesh, Vlad Shmyhlo, Viraat Aryabumi, Walter Beller-Morales, Jeremy Pekmez, Jason Ozuzu, Pierre Richemond, Acyr Locatelli, Nick Frosst, Phil Blunsom, Aidan Gomez, Ivan Zhang, Marzieh Fadaee, Manoj Govindassamy, Sudip Roy, Matthias Gallé, Beyza Ermis, Ahmet Üstün, Sara Hooker
Title: Aya Vision: Advancing the Frontier of Multilingual Multimodality
Abstract:
Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are further magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced. To address the aforementioned challenges, we introduce novel techniques spanning both data and modeling. First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages. Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, effectively preserving text-only capabilities while simultaneously enhancing multimodal generative performance. Aya-Vision-8B achieves best-in-class performance compared to strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even much larger Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and LLaMA-3.2-90B-Vision. Our work advances multilingual progress on the multi-modal frontier, and provides insights into techniques that effectively bend the need for compute while delivering extremely high performance.
中文: Aya Vision模型通过合成标注创建多样化多语言指令数据,并采用跨模态合并技术避免灾难性遗忘,从而在性能上超越更大规模的竞争对手。
English: The Aya Vision models overcome multilingual multimodal challenges through synthetic annotation for diverse instruction data and cross-modal merging to prevent catastrophic forgetting, achieving superior performance against larger competitors.

Authors:Ruichu Cai, Xi Chen, Jie Qiao, Zijian Li, Yuequn Liu, Wei Chen, Keli Zhang, Jiale Zheng
Title: An Identifiable Cost-Aware Causal Decision-Making Framework Using Counterfactual Reasoning
Abstract:
Decision making under abnormal conditions is a critical process that involves evaluating the current state and determining the optimal action to restore the system to a normal state at an acceptable cost. However, in such scenarios, existing decision-making frameworks rely heavily on reinforcement learning or root cause analysis, so they frequently neglect the cost of the actions or fail to incorporate causal mechanisms adequately. By relaxing the existing causal decision framework to solve the necessary cause, we propose a minimum-cost causal decision (MiCCD) framework via counterfactual reasoning to address the above challenges. Emphasis is placed on making counterfactual reasoning processes identifiable in the presence of a large amount of mixed anomaly data, as well as finding the optimal intervention state in a continuous decision space. Specifically, it formulates a surrogate model based on causal graphs, using abnormal pattern clustering labels as supervisory signals. This enables the approximation of the structural causal model among the variables and lays a foundation for identifiable counterfactual reasoning. With the causal structure approximated, we then establish an optimization model based on counterfactual estimation. The Sequential Least Squares Programming (SLSQP) algorithm is further employed to optimize intervention strategies while taking costs into account. Experimental evaluations on both synthetic and real-world datasets reveal that MiCCD outperforms conventional methods across multiple metrics, including F1-score, cost efficiency, and ranking quality (nDCG@k values), thus validating its efficacy and broad applicability.
Chinese: 提出的最小成本因果决策(MiCCD)框架利用反事实推理和因果图优化干预策略,在合成和真实数据集上均展现出比传统方法更优的成本效益和准确性。
English: The proposed minimum-cost causal decision (MiCCD) framework utilizes counterfactual reasoning and causal graphs to optimize intervention strategies, demonstrating superior performance in cost efficiency and accuracy compared to traditional methods on both synthetic and real-world datasets.

Authors:Dong Shu, Xuansheng Wu, Haiyan Zhao, Mengnan Du, Ninghao Liu
Title: Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
Abstract:
Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations, without considering the causal influence between each latent feature and the model's output. This work is built on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model's output, and (2) only latents with high causal influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.
Chinese: 梯度稀疏自编码器(GradSAE)通过引入梯度信息来识别稀疏自编码器中最具影响力的潜在特征,解决了传统方法忽视对模型输出因果影响的局限。
English: The Gradient Sparse Autoencoder (GradSAE) is introduced to identify influential latents in sparse autoencoders by integrating gradient information, addressing the limitation of traditional methods that ignore causal effects on model outputs.

Authors:Benedict Hildisch, Edoardo Ghignone, Nicolas Baumann, Cheng Hu, Andrea Carron, Michele Magno
Title: Drive Fast, Learn Faster: On-Board RL for High Performance Autonomous Racing
Abstract:
Autonomous racing presents unique challenges due to its non-linear dynamics, the high speed involved, and the critical need for real-time decision-making under dynamic and unpredictable conditions. Most traditional Reinforcement Learning (RL) approaches rely on extensive simulation-based pre-training, which faces crucial challenges in transferring effectively to real-world environments. This paper introduces a robust on-board RL framework for autonomous racing, designed to eliminate the dependency on simulation-based pre-training, enabling direct real-world adaptation. The proposed system introduces a refined Soft Actor-Critic (SAC) algorithm, leveraging a residual RL structure to enhance classical controllers in real-time by integrating multi-step Temporal-Difference (TD) learning, an asynchronous training pipeline, and Heuristic Delayed Reward Adjustment (HDRA) to improve sample efficiency and training stability. The framework is validated through extensive experiments on the F1TENTH racing platform, where the residual RL controller consistently outperforms the baseline controllers and achieves up to an 11.5% reduction in lap times compared to the State-of-the-Art (SotA) with only 20 min of training. Additionally, an End-to-End (E2E) RL controller trained without a baseline controller surpasses the previous best results with sustained on-track learning. These findings position the framework as a robust solution for high-performance autonomous racing and a promising direction for other real-time, dynamic autonomous systems.
中文: 本文提出了一种用于自主赛车的鲁棒车载强化学习框架,通过改进的柔性演员-评论家算法结合多步时序差分学习和启发式奖励调整,无需仿真预训练即可实现高达11.5%的圈速提升。
English: This paper introduces a robust on-board reinforcement learning framework for autonomous racing that eliminates simulation dependency, employing an enhanced Soft Actor-Critic algorithm with multi-step TD learning and heuristic reward adjustment to achieve up to an 11.5% reduction in lap times with only 20 min of training.
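
The multi-step TD learning mentioned in the abstract above can be illustrated with the generic n-step return. This is a textbook sketch, not the authors' implementation: the function name is hypothetical, and in SAC the bootstrap value would additionally carry an entropy term that is omitted here.

```python
def n_step_td_target(rewards, bootstrap_value, gamma):
    # rewards: the n rewards observed after the current state, in time order.
    # bootstrap_value: the critic's value estimate at the state reached after n steps.
    # Fold from the last step backwards: G = r_0 + gamma * (r_1 + gamma * (... + gamma * V)).
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target
```

With a single reward this reduces to the usual one-step target `r + gamma * V`; longer reward windows rely less on the (possibly inaccurate) critic estimate.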

Authors:Alessandro Navone, Mauro Martini, Marcello Chiaberge
Title: Autonomous Robotic Pruning in Orchards and Vineyards: a Review
Abstract:
Manual pruning is labor intensive and represents up to 25% of annual labor costs in fruit production, notably in apple orchards and vineyards where operational challenges and cost constraints limit the adoption of large-scale machinery. In response, a growing body of research is investigating compact, flexible robotic platforms capable of precise pruning in varied terrains, particularly where traditional mechanization falls short. This paper reviews recent advances in autonomous robotic pruning for orchards and vineyards, addressing a critical need in precision agriculture. Our review examines literature published between 2014 and 2024, focusing on innovative contributions across key system components. Special attention is given to recent developments in machine vision, perception, plant skeletonization, and control strategies, areas that have experienced significant influence from advancements in artificial intelligence and machine learning. The analysis situates these technological trends within broader agricultural challenges, including rising labor costs, a decline in the number of young farmers, and the diverse pruning requirements of different fruit species such as apple, grapevine, and cherry trees. By comparing various robotic architectures and methodologies, this survey not only highlights the progress made toward autonomous pruning but also identifies critical open challenges and future research directions. The findings underscore the potential of robotic systems to bridge the gap between manual and mechanized operations, paving the way for more efficient, sustainable, and precise agricultural practices.
中文摘要:本文综述了果园和葡萄园自主修剪机器人的最新进展,重点探讨人工智能在机器视觉与控制策略方面的突破如何应对劳动力短缺问题,并指出了未来研究方向。
English Summary: This paper reviews recent advances in autonomous robotic pruning systems for orchards and vineyards, highlighting how AI-driven technologies in machine vision and control strategies address labor shortages and operational challenges while identifying future research directions.

Authors:Timing Li, Bing Cao, Pengfei Zhu, Bin Xiao, Qinghua Hu
Title: Bi-directional Self-Registration for Misaligned Infrared-Visible Image Fusion
Abstract:
Acquiring accurately aligned multi-modal image pairs is fundamental for achieving high-quality multi-modal image fusion. To address the lack of ground truth in current multi-modal image registration and fusion methods, we propose a novel self-supervised Bi-directional Self-Registration framework (B-SR). Specifically, B-SR utilizes a proxy data generator (PDG) and an inverse proxy data generator (IPDG) to achieve self-supervised global-local registration. Visible-infrared image pairs with spatially misaligned differences are aligned to obtain global differences through the registration module. The same image pairs are processed by PDG, such as cropping, flipping, stitching, etc., and then aligned to obtain local differences. IPDG converts the obtained local differences into pseudo-global differences, which are used to perform global-local difference consistency with the global differences. Furthermore, aiming at eliminating the effect of modal gaps on the registration module, we design a neighborhood dynamic alignment loss to achieve cross-modal image edge alignment. Extensive experiments on misaligned multi-modal images demonstrate the effectiveness of the proposed method in multi-modal image alignment and fusion against the competing methods. Our code will be publicly available.
中文: 提出的双向自注册(B-SR)框架通过结合全局-局部配准和跨模态边缘对齐,有效解决了多模态图像配准问题,在图像融合任务中展现出卓越性能。
English: The proposed self-supervised Bi-directional Self-Registration (B-SR) framework effectively aligns multi-modal images by integrating global-local registration and cross-modal edge alignment, demonstrating superior performance in image fusion tasks.

Authors:Mariam Elgamal, Abdulrahman Mahmoud, Gu-Yeon Wei, David Brooks, Gage Hills
Title: Modeling PFAS in Semiconductor Manufacturing to Quantify Trade-offs in Energy Efficiency and Environmental Impact of Computing Systems
Abstract:
The electronics and semiconductor industry is a prominent consumer of per- and poly-fluoroalkyl substances (PFAS), also known as forever chemicals. PFAS are persistent in the environment and can bioaccumulate to ecological and human toxic levels. Computer designers have an opportunity to reduce the use of PFAS in semiconductors and electronics manufacturing, including integrated circuits (IC), batteries, displays, etc., which currently account for a staggering 10% of the total PFAS fluoropolymers usage in Europe alone. In this paper, we present a framework where we (1) quantify the environmental impact of PFAS in computing systems manufacturing with granular consideration of the metal layer stack and patterning complexities in IC manufacturing at the design phase, (2) identify contending trends between embodied carbon (carbon footprint due to hardware manufacturing) and PFAS (for example, manufacturing an IC at a 7 nm technology node using EUV lithography uses 18% fewer PFAS-containing layers than manufacturing the same IC at the same node using DUV immersion lithography, unlike embodied carbon trends), and (3) conduct case studies to illustrate how to optimize and trade off designs with lower PFAS, while meeting power-performance-area constraints. We show that optimizing designs to use fewer back-end-of-line (BEOL) metal stack layers can reduce PFAS-containing layers by 1.7× in systolic arrays.
中文: 电子行业大量使用持久性PFAS化学品,本文提出一个框架来量化其环境影响,权衡与碳排放的关系,并通过优化设计在保持性能的同时减少PFAS使用。
English: The electronics industry significantly uses persistent PFAS chemicals, and this paper introduces a framework to quantify their environmental impact, identify trade-offs with carbon emissions, and optimize designs to reduce PFAS usage while maintaining performance.

Authors:Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen
Title: Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Abstract:
Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.
中文: 本文提出一种零样本视频到文本摘要方法,通过生成剧本表征整合关键视觉与文本信息,在减少75%视频输入的同时比先进视觉语言模型多捕捉20%相关视觉内容,并引入了多模态评估指标MFactSum来解决现有度量不足。
English: This paper introduces a zero-shot video-to-text summarization method that creates screenplay representations to integrate visual and textual content, outperforming advanced VLMs by capturing 20% more relevant visual information with 75% less video input, and proposes MFactSum, a new multimodal evaluation metric.

Authors:Umberto Albertin, Mauro Martini, Alessandro Navone, Marcello Chiaberge
Title: Adaptive Robot Localization with Ultra-wideband Novelty Detection
Abstract:
Ultra-wideband (UWB) technology has shown remarkable potential as a low-cost general solution for robot localization. However, limitations of the UWB signal for precise positioning arise from the disturbances caused by the environment itself, due to reflectance, multi-path effect, and Non-Line-of-Sight (NLoS) conditions. This problem is emphasized in cluttered indoor spaces where service robotic platforms usually operate. Both model-based and learning-based methods are currently under investigation to precisely predict the UWB error patterns. Despite the great capability in approximating strong non-linearity, learning-based methods often do not consider environmental factors and require data collection and re-training for unseen data distributions, making them not practically feasible on a large scale. The goal of this research is to develop a robust and adaptive UWB localization method for indoor confined spaces. A novelty detection technique is used to recognize outlier conditions from nominal UWB range data with a semi-supervised autoencoder. Then, the obtained novelty scores are combined with an Extended Kalman filter, leveraging a dynamic estimation of covariance and bias error for each range measurement received from the UWB anchors. The resulting solution is a compact, flexible, and robust system which enables the localization system to adapt the trustworthiness of UWB data spatially and temporally in the environment. The extensive experimentation conducted with a real robot in a wide range of testing scenarios demonstrates the advantages and benefits of the proposed solution in indoor cluttered spaces presenting NLoS conditions, reaching an average improvement in absolute positioning error of almost 60% and more than 25 cm.
中文摘要:本研究提出了一种结合异常值检测和扩展卡尔曼滤波的鲁棒超宽带定位方法,能够动态适应环境干扰,在复杂室内环境中实现定位精度近60%的平均提升。
English Summary: This research develops a robust UWB localization method using novelty detection and an Extended Kalman Filter to dynamically adapt to environmental disturbances, achieving an average improvement of almost 60% in positioning accuracy in cluttered indoor spaces.
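
The idea of letting a novelty score modulate how much a filter trusts each UWB range can be sketched with a one-dimensional linear Kalman update. This is a deliberately simplified illustration, not the paper's Extended Kalman filter: the function name, the linear inflation rule `base_var * (1 + gain * novelty)`, and the `gain` parameter are all assumptions, and the bias estimation mentioned in the abstract is omitted.

```python
def kalman_range_update(x, p, z, base_var, novelty, gain=10.0):
    # x, p: prior state estimate and its variance.
    # z: incoming measurement; base_var: nominal measurement variance.
    # novelty: score in [0, 1]; higher means the measurement looks more anomalous.
    # Assumed rule: inflate the measurement variance linearly with the novelty score.
    r = base_var * (1.0 + gain * novelty)
    k = p / (p + r)              # Kalman gain shrinks as r grows
    x_new = x + k * (z - x)      # corrected state
    p_new = (1.0 - k) * p        # corrected variance
    return x_new, p_new
```

A measurement flagged as novel therefore pulls the estimate less than a nominal one, which is the qualitative behavior the abstract describes as adapting the trustworthiness of UWB data.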

Authors:Tongda Xu, Jiahao Li, Bin Li, Yan Wang, Ya-Qin Zhang, Yan Lu
Title: PICD: Versatile Perceptual Image Compression with Diffusion Rendering
Abstract:
Recently, perceptual image compression has achieved significant advancements, delivering high visual quality at low bitrates for natural images. However, for screen content, existing methods often produce noticeable artifacts when compressing text. To tackle this challenge, we propose versatile perceptual screen image compression with diffusion rendering (PICD), a codec that works well for both screen and natural images. More specifically, we propose a compression framework that encodes the text and image separately, and renders them into one image using a diffusion model. For this diffusion rendering, we integrate conditional information into diffusion models at three distinct levels: (1) Domain level: we fine-tune the base diffusion model using text content prompts with screen content. (2) Adaptor level: we develop an efficient adaptor to control the diffusion model using compressed image and text as input. (3) Instance level: we apply instance-wise guidance to further enhance the decoding process. Empirically, our PICD surpasses existing perceptual codecs in terms of both text accuracy and perceptual quality. Additionally, without text conditions, our approach serves effectively as a perceptual codec for natural images.
中文摘要:提出的PICD方法通过扩散模型分别编码和渲染文本与图像,有效提升了屏幕图像和自然图像的感知压缩质量,在文本准确性和视觉保真度方面均优于现有技术。
English Summary: The proposed PICD method enhances perceptual image compression for both screen and natural images by using a diffusion model to separately encode and render text and image components, achieving superior text accuracy and visual quality.

Authors:Haizhen Xie, Kunpeng Du, Qiangyu Yan, Sen Lu, Jianhong Han, Hanting Chen, Hailin Hu, Jie Hu
Title: EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution
Abstract:
Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, Ψ-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.
中文: 提出的增强任意模型(EAM)通过引入新型Ψ-DiT模块和渐进式掩码图像建模策略,利用扩散变换器实现盲超分辨率,在多个数据集上取得了最先进的性能表现。
English: The proposed Enhancing Anything Model (EAM) introduces a novel Ψ-DiT block and progressive Masked Image Modeling strategy to leverage Diffusion Transformers for blind super-resolution, achieving state-of-the-art performance across multiple datasets.

Authors:Ruichu Cai, Junjie Wan, Weilin Chen, Zeqin Yang, Zijian Li, Peng Zhen, Jiecheng Guo
Title: Long-Term Individual Causal Effect Estimation via Identifiable Latent Representation Learning
Abstract:
Estimating long-term causal effects by combining long-term observational and short-term experimental data is a crucial but challenging problem in many real-world scenarios. In existing methods, several ideal assumptions, e.g., the latent unconfoundedness assumption or the additive equi-confounding bias assumption, are proposed to address the latent confounder problem raised by the observational data. However, in real-world applications, these assumptions are typically violated, which limits their practical effectiveness. In this paper, we tackle the problem of estimating the long-term individual causal effects without the aforementioned assumptions. Specifically, we propose to utilize the natural heterogeneity of data, such as data from multiple sources, to identify latent confounders, thereby significantly avoiding reliance on idealized assumptions. Practically, we devise a latent representation learning-based estimator of long-term causal effects. Theoretically, we establish the identifiability of latent confounders, with which we further achieve long-term effect identification. Extensive experimental studies, conducted on multiple synthetic and semi-synthetic datasets, demonstrate the effectiveness of our proposed method.
Chinese Summary: 本文提出一种利用数据自然异质性识别潜在混杂变量的新方法,无需依赖理想化假设即可估计长期个体因果效应,并通过大量实验验证了其有效性。
English Summary: This paper introduces a novel method for estimating long-term individual causal effects by leveraging data heterogeneity to identify latent confounders, eliminating the need for restrictive assumptions and demonstrating effectiveness through extensive experiments.

Authors:Minh K. Quan, Pubudu N. Pathirana, Mayuri Wijayasundara, Sujeeva Setunge, Dinh C. Nguyen, Christopher G. Brinton, David J. Love, H. Vincent Poor
Title: Federated Learning for Cyber Physical Systems: A Comprehensive Survey
Abstract:
The integration of machine learning (ML) in cyber physical systems (CPS) is a complex task due to the challenges that arise in terms of real-time decision making, safety, reliability, device heterogeneity, and data privacy. There are also open research questions that must be addressed in order to fully realize the potential of ML in CPS. Federated learning (FL), a distributed approach to ML, has become increasingly popular in recent years. It allows models to be trained using data from decentralized sources. This approach has been gaining popularity in the CPS field, as it integrates computer, communication, and physical processes. Therefore, the purpose of this work is to provide a comprehensive analysis of the most recent developments of FL-CPS, including the numerous application areas, system topologies, and algorithms developed in recent years. The paper starts by discussing recent advances in both FL and CPS, followed by their integration. Then, the paper compares the application of FL in CPS with its applications in the internet of things (IoT) in further depth to show their connections and distinctions. Furthermore, the article scrutinizes how FL is utilized in critical CPS applications, e.g., intelligent transportation systems, cybersecurity services, smart cities, and smart healthcare solutions. The study also includes critical insights and lessons learned from various FL-CPS implementations. The paper's concluding section delves into significant concerns and suggests avenues for further research in this fast-paced and dynamic era.
中文: 本文全面分析了联邦学习在信息物理系统中的集成应用,涵盖系统架构、算法实现及与物联网应用的对比,并探讨了未来研究方向。
English: This paper provides a comprehensive analysis of federated learning's integration into cyber-physical systems, examining applications, system architectures, and algorithms while comparing it with IoT implementations and highlighting future research directions.

Authors:Wenzhao Liu, Haoran Li, Congying Han, Zicheng Zhang, Anqi Li, Tiande Guo
Title: Purity Law for Generalizable Neural TSP Solvers
Abstract:
Achieving generalization in neural approaches across different scales and distributions remains a significant challenge for the Traveling Salesman Problem (TSP). A key obstacle is that neural networks often fail to learn robust principles for identifying universal patterns and deriving optimal solutions from diverse instances. In this paper, we first uncover Purity Law (PuLa), a fundamental structural principle for optimal TSP solutions, defining that edge prevalence grows exponentially with the sparsity of surrounding vertices. Statistically validated across diverse instances, PuLa reveals a consistent bias toward local sparsity in global optima. Building on this insight, we propose Purity Policy Optimization (PUPO), a novel training paradigm that explicitly aligns characteristics of neural solutions with PuLa during the solution construction process to enhance generalization. Extensive experiments demonstrate that PUPO can be seamlessly integrated with popular neural solvers, significantly enhancing their generalization performance without incurring additional computational overhead during inference.
Chinese Summary: 本文提出了旅行商问题最优解的结构性原理——纯度定律(PuLa),并开发了纯度策略优化(PUPO)训练范式,通过在求解过程中显式对齐神经解与纯度定律来显著提升泛化性能。
English Summary: The paper introduces the Purity Law (PuLa), a structural principle for optimal TSP solutions, and proposes Purity Policy Optimization (PUPO) to enhance neural network generalization by aligning solutions with this law during training.

Authors:Qianru Zhang, Liang Qu, Honggang Wen, Dong Huang, Siu-Ming Yiu, Nguyen Quoc Viet Hung, Hongzhi Yin
Title: M2Rec: Multi-scale Mamba for Efficient Sequential Recommendation
Abstract:
Sequential recommendation systems aim to predict users' next preferences based on their interaction histories, but existing approaches face critical limitations in efficiency and multi-scale pattern recognition. While Transformer-based methods struggle with quadratic computational complexity, recent Mamba-based models improve efficiency but fail to capture periodic user behaviors, leverage rich semantic information, or effectively fuse multimodal features. To address these challenges, we propose M2Rec, a novel sequential recommendation framework that integrates multi-scale Mamba with Fourier analysis, Large Language Models (LLMs), and adaptive gating. First, we enhance Mamba with Fast Fourier Transform (FFT) to explicitly model periodic patterns in the frequency domain, separating meaningful trends from noise. Second, we incorporate LLM-based text embeddings to enrich sparse interaction data with semantic context from item descriptions. Finally, we introduce a learnable gate mechanism to dynamically balance temporal (Mamba), frequency (FFT), and semantic (LLM) features, ensuring harmonious multimodal fusion. Extensive experiments demonstrate that M2Rec achieves state-of-the-art performance, improving Hit Rate@10 by 3.2% over existing Mamba-based models while maintaining 20% faster inference than Transformer baselines. Our results highlight the effectiveness of combining frequency analysis, semantic understanding, and adaptive fusion for sequential recommendation. Code and datasets are available at: https://anonymous.4open.science/r/M2Rec.
中文: 本文提出了一种新颖的序列推荐框架,通过结合多尺度Mamba、傅里叶分析、大语言模型和自适应门控机制,有效解决了现有方法在效率和多尺度模式识别上的不足,在保持比Transformer基线更快推理速度的同时实现了最优性能。
English: This paper introduces a novel sequential recommendation framework that integrates multi-scale Mamba with Fourier analysis, LLMs, and adaptive gating to overcome limitations in efficiency and multi-scale pattern recognition, achieving state-of-the-art performance with faster inference than Transformer baselines.
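
The gate that balances temporal, frequency, and semantic features can be pictured as a weighted sum whose weights come from a softmax. The sketch below is one common way to build such a gate and is an assumption made for illustration; the abstract does not specify the gate's form. In a trained model the `gate_logits` would be produced by a learned layer, whereas here they are passed in directly.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(features, gate_logits):
    # features: one vector per branch (e.g. temporal, frequency, semantic), same length.
    # gate_logits: one unnormalized score per branch.
    weights = softmax(gate_logits)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
```

Equal logits average the three branches, while a dominant logit makes the fused vector follow that branch almost exclusively.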

Authors:Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, Minyi Guo
Title: Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
Abstract:
Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in the high-bandwidth memory (HBM) of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving performance metrics such as Time-To-First-Token (TTFT), neglecting usage dependencies when caching LoRAs and KVs. We therefore propose FASTLIBRA, a Multi-LoRA caching system to optimize the serving performance. FASTLIBRA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during the inference with a unified caching pool. The cache swapper determines whether to swap LoRAs and KV caches in or out based on a unified cost model, when the HBM is idle or busy, respectively. Experimental results show that FASTLIBRA reduces the TTFT by 63.4% on average, compared to state-of-the-art works.
中文: FASTLIBRA 是一种多LoRA缓存系统,通过管理适配器与KV缓存间的使用依赖关系优化推理性能,相比现有技术将首令牌生成时间平均降低了63.4%。
English: FASTLIBRA is a Multi-LoRA caching system that enhances inference performance by managing dependencies between adapters and KV caches, reducing Time-To-First-Token by 63.4% compared to existing methods.

Authors:Faizan M. Tariq, Zheng-Hang Yeh, Avinash Singh, David Isele, Sangjae Bae
Title: Frenet Corridor Planner: An Optimal Local Path Planning Framework for Autonomous Driving
Abstract:
Motivated by the requirements for effectiveness and efficiency, path-speed decomposition-based trajectory planning methods have been widely adopted for autonomous driving applications. While a global route can be pre-computed offline, real-time generation of adaptive local paths remains crucial. Therefore, we present the Frenet Corridor Planner (FCP), an optimization-based local path planning strategy for autonomous driving that ensures smooth and safe navigation around obstacles. Modeling the vehicles as safety-augmented bounding boxes and pedestrians as convex hulls in the Frenet space, our approach defines a drivable corridor by determining the appropriate deviation side for static obstacles. Thereafter, a modified space-domain bicycle kinematics model enables path optimization for smoothness, boundary clearance, and dynamic obstacle risk minimization. The optimized path is then passed to a speed planner to generate the final trajectory. We validate FCP through extensive simulations and real-world hardware experiments, demonstrating its efficiency and effectiveness.
中文: Frenet走廊规划器(FCP)是一种基于优化的局部路径规划方法,通过定义可行驶走廊避开障碍物并优化路径平滑性与安全性,已在仿真和实际测试中验证其高效性和有效性。
English: The Frenet Corridor Planner (FCP) is an optimization-based local path planning method for autonomous driving that ensures smooth and safe navigation by defining drivable corridors around obstacles and optimizing paths for smoothness and safety, with validation through simulations and real-world tests.

Authors:Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Jing Liu, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang
Title: RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
Abstract:
The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system which exploits the inherent attention sparsity to accelerate long-context LLM inference. At its core is the wave index, an Attention-aWare VEctor index that enables efficient and accurate retrieval of critical tokens through techniques such as tripartite attention approximation, accuracy-bounded attention estimation, and segmented clustering. Complementing this is the wave buffer, which coordinates KV cache placement and overlaps computation and data transfer across GPU and CPU to sustain high throughput. Unlike prior sparsity-based methods that struggle with token selection and hardware coordination, RetroInfer delivers robust performance without compromising model accuracy. Experiments on long-context benchmarks show up to 4.5X speedup over full attention within GPU memory limits and up to 10.5X over sparse attention baselines when KV cache is extended to CPU memory, all while preserving full-attention-level accuracy.
Chinese: RetroInfer提出了一种创新系统,将KV缓存重构为带有波索引和缓冲的向量存储,通过利用注意力稀疏性和高效的硬件协调,在保持全精度的情况下实现了长上下文LLM推理高达10.5倍的加速。
English: RetroInfer introduces a novel system that reimagines the KV cache as a vector storage with a wave index and buffer, achieving up to 10.5x speedup in long-context LLM inference while maintaining full accuracy by leveraging attention sparsity and efficient hardware coordination.

Authors:Dyuman Aditya, Junning Huang, Nico Bohlinger, Piotr Kicki, Krzysztof Walas, Jan Peters, Matteo Luperto, Davide Tateo
Title: Robust Localization, Mapping, and Navigation for Quadruped Robots
Abstract:
Quadruped robots are currently a widespread platform for robotics research, thanks to powerful Reinforcement Learning controllers and the availability of cheap and robust commercial platforms. However, to broaden the adoption of the technology in the real world, we require robust navigation stacks relying only on low-cost sensors such as depth cameras. This paper presents a first step towards a robust localization, mapping, and navigation system for low-cost quadruped robots. In pursuit of this objective we combine contact-aided kinematic, visual-inertial odometry, and depth-stabilized vision, enhancing stability and accuracy of the system. Our results in simulation and two different real-world quadruped platforms show that our system can generate an accurate 2D map of the environment, robustly localize itself, and navigate autonomously. Furthermore, we present in-depth ablation studies of the important components of the system and their impact on localization accuracy. Videos, code, and additional experiments can be found on the project website: https://sites.google.com/view/low-cost-quadruped-slam
中文: 本文提出了一种基于低成本深度相机的四足机器人鲁棒定位、建图与导航系统,通过融合多种算法提升了稳定性与精度,并在仿真和实际平台中验证了其有效性。
English: This paper introduces a robust localization, mapping, and navigation system for low-cost quadruped robots using depth cameras and enhanced algorithms, demonstrating accurate performance in simulations and real-world tests.

Authors:Zhenyu Liu, Yi Ma, Rahim Tafazolli
Title: ResiTok: A Resilient Tokenization-Enabled Framework for Ultra-Low-Rate and Robust Image Transmission
Abstract:
Real-time transmission of visual data over wireless networks remains highly challenging, even when leveraging advanced deep neural networks, particularly under severe channel conditions such as limited bandwidth and weak connectivity. In this paper, we propose a novel Resilient Tokenization-Enabled (ResiTok) framework designed for ultra-low-rate image transmission that achieves exceptional robustness while maintaining high reconstruction quality. By reorganizing visual information into hierarchical token groups consisting of essential key tokens and supplementary detail tokens, ResiTok enables progressive encoding and graceful degradation of visual quality under constrained channel conditions. A key contribution is our resilient 1D tokenization method integrated with a specialized zero-out training strategy, which systematically simulates token loss during training, empowering the neural network to effectively compress and reconstruct images from incomplete token sets. Furthermore, the channel-adaptive coding and modulation design dynamically allocates coding resources according to prevailing channel conditions, yielding superior semantic fidelity and structural consistency even at extremely low channel bandwidth ratios. Evaluation results demonstrate that ResiTok outperforms state-of-the-art methods in both semantic similarity and visual quality, with significant advantages under challenging channel conditions.
中文: 本文提出的ResiTok框架通过弹性令牌化和自适应编码技术,在低带宽无线网络中实现鲁棒的高质量图像传输,在恶劣信道条件下性能显著优于现有方法。
English: This paper introduces the ResiTok framework, which uses resilient tokenization and adaptive coding to enable robust, high-quality image transmission over low-bandwidth wireless networks, outperforming existing methods under challenging conditions.

Authors:Yuwen Chen, Zafer Yildiz, Qihang Li, Yaqian Chen, Haoyu Dong, Hanxue Gu, Nicholas Konz, Maciej A. Mazurowski
Title: Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2
Abstract:
Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on three public datasets covering organs, bones, and muscles across MRI and CT modalities. We show that the proposed method markedly outperforms the default SAM 2, achieving average Dice Similarity Coefficient improvement of 0.14 and 0.11 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
中文:提出的SLM-SAM 2模型通过整合独立的短期与长期记忆库,显著提升了医学图像标注的准确性和抗误差传播能力,在多种MRI与CT数据集上明显优于SAM 2。
English: The proposed SLM-SAM 2 model enhances medical image annotation by integrating separate short-term and long-term memory banks, significantly outperforming SAM 2 with improved segmentation accuracy and resistance to error propagation across various MRI and CT datasets.
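
The Dice Similarity Coefficient cited in the abstract above is the standard overlap metric 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a reference mask B. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name `dice_coefficient`, the flat 0/1 list representation of a mask, and the convention for two empty masks are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A n B| / (|A| + |B|) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# A slice where the prediction over-propagates by one voxel.
reference = [1, 0, 0, 0]
prediction = [1, 1, 0, 0]
score = dice_coefficient(prediction, reference)  # 2*1 / (2+1)
```

On this scale an improvement of 0.14 is large, since the score is bounded between 0 and 1.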

Authors:Run Ling, Wenji Wang, Yuting Liu, Guibing Guo, Haowei Liu, Jian Lu, Quanwei Zhang, Yexing Xu, Shuo Lu, Yun Wang, Yihua Shao, Zhanjie Zhang, Ao Ma, Linying Jiang, Xingwei Wang
Title: RAGAR: Retrieval Augmented Personalized Image Generation Guided by Recommendation
Abstract:
Personalized image generation is crucial for improving the user experience, as it renders reference images into preferred ones according to user visual preferences. Although effective, existing methods face two main issues. First, existing methods treat all items in the user historical sequence equally when extracting user preferences, overlooking the varying semantic similarities between historical items and the reference item. Disproportionately high weights for low-similarity items distort users' visual preferences for the reference item. Second, existing methods heavily rely on consistency between generated and reference images to optimize the generation, which leads to underfitting user preferences and hinders personalization. To address these issues, we propose Retrieval Augmented Personalized Image GenerAtion guided by Recommendation (RAGAR). Our approach uses a retrieval mechanism to assign different weights to historical items according to their similarities to the reference item, thereby extracting more refined user visual preferences for the reference item. Then we introduce a novel rank task based on the multi-modal ranking model to optimize the personalization of the generated images instead of forcing dependence on consistency. Extensive experiments and human evaluations on three real-world datasets demonstrate that RAGAR achieves significant improvements in both personalization and semantic metrics compared to five baselines.
Chinese: 提出的RAGAR方法通过检索机制根据历史项与参考图像的相似性分配权重,并引入排序任务优化个性化,在个性化和语义指标上均显著优于现有基准方法。
English: The proposed RAGAR method enhances personalized image generation by using a retrieval mechanism to weight historical items based on their similarity to the reference image and introducing a rank task to optimize personalization, outperforming existing baselines in both personalization and semantic metrics.

Authors:Cong Xu, Wenbin Liang, Mo Yu, Anan Liu, Ke-Yue Zhang, Shunli Wang, Lizhuang Ma, Jianyong Wang, Jun Wang, Wei Zhang
Title: Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics
Abstract:
The rapid scaling of models has led to prohibitively high training and fine-tuning costs. A major factor accounting for memory consumption is the widespread use of stateful optimizers (e.g., Adam), which maintain auxiliary information of even 2x the model size in order to achieve optimal convergence. We therefore present SOLO in this work to spawn a novel type of optimizer that requires an extremely light memory footprint. While previous efforts have achieved certain success in 8-bit or 4-bit cases, SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. This immense progress is due to the identification and resolution of two key challenges: the signal swamping problem in unsigned quantization that results in unchanged state dynamics, and the increased gradient variance in signed quantization that leads to incorrect descent directions. The theoretical analysis suggests a tailored logarithmic quantization for the former and a precision-specific momentum hyperparameter for the latter. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
中文: SOLO提出了一种新型内存高效优化器,通过解决量化中的关键问题,将Adam类优化器的内存占用降至2-3位,在保证精度的同时实现显著的内存节省。
English: SOLO introduces a novel memory-efficient optimizer that reduces Adam-style optimizer memory usage to as low as 2-3 bits by addressing quantization challenges, achieving significant memory savings with minimal accuracy loss.

Authors:Longjie Luo, Lin Li, Qingyang Hong
Title: SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recognition
Abstract:
Due to the lack of target speech annotations in real-recorded far-field conversational datasets, speech enhancement (SE) models are typically trained on simulated data. However, the trained models often perform poorly in real-world conditions, hindering their application in far-field speech recognition. To address the issue, we (a) propose direct sound estimation (DSE) to estimate the oracle direct sound of real-recorded data for SE; and (b) present a novel pseudo-supervised learning method, SuPseudo, which leverages DSE-estimates as pseudo-labels and enables SE models to directly learn from and adapt to real-recorded data, thereby improving their generalization capability. Furthermore, an SE model called FARNET is designed to fully utilize SuPseudo. Experiments on the MISP2023 corpus demonstrate the effectiveness of SuPseudo, and our system significantly outperforms the previous state-of-the-art. A demo of our method can be found at https://EeLLJ.github.io/SuPseudo/.
中文: 针对模拟数据训练的语音增强模型在真实场景中表现不佳的问题,本研究提出直接声学估计和SuPseudo学习方法,使模型能基于真实录音数据进行自适应训练,在MISP2023数据集上取得了突破性性能提升。
English: To overcome the limitations of simulated data in speech enhancement, this study introduces direct sound estimation and the SuPseudo learning method, enabling models to adapt to real-recorded data and achieve superior performance on the MISP2023 corpus.

Authors:Longjie Luo, Shenghui Lu, Lin Li, Qingyang Hong
Title: Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge
Abstract:
This paper presents our system for the MISP-Meeting Challenge Track 2. The primary difficulty lies in the dataset, which contains strong background noise, reverberation, overlapping speech, and diverse meeting topics. To address these issues, we (a) designed G-SpatialNet, a speech enhancement (SE) model to improve Guided Source Separation (GSS) signals; (b) proposed TLS, a framework comprising time alignment, level alignment, and signal-to-noise ratio filtering, to generate signal-level pseudo labels for real-recorded far-field audio data, thereby facilitating SE models' training; and (c) explored fine-tuning strategies, data augmentation, and multimodal information to enhance the performance of pre-trained Automatic Speech Recognition (ASR) models in meeting scenarios. Finally, our system achieved character error rates (CERs) of 5.44% and 9.52% on the Dev and Eval sets, respectively, with relative improvements of 64.8% and 52.6% over the baseline, securing second place.
中文: 本文针对MISP-Meeting挑战赛第二赛道,通过设计G-SpatialNet语音增强模型、提出TLS框架生成伪标签以及优化ASR模型,有效解决了强噪声、混响和重叠语音问题,最终以显著降低的错误率获得第二名。
English: This paper introduces a system for the MISP-Meeting Challenge Track 2 that tackles strong background noise, reverberation, and overlapping speech by developing G-SpatialNet for speech enhancement, creating a TLS framework for pseudo labels, and optimizing ASR models, achieving second place with significant error rate reductions.
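
The reported CERs and relative improvements can be cross-checked with a little arithmetic. The sketch below assumes that "relative improvement" means (baseline - system) / baseline; the abstract does not state the baseline CERs, so the values computed here are only what the reported figures imply under that assumption. The function names are hypothetical.

```python
def relative_improvement(baseline, system):
    """Relative error reduction, assumed to be (baseline - system) / baseline."""
    return (baseline - system) / baseline


def implied_baseline(system, rel_improvement):
    """Invert the formula above: the baseline error that yields the given reduction."""
    return system / (1.0 - rel_improvement)


# Figures from the abstract: CER in percent and relative improvement as a fraction.
dev_baseline = implied_baseline(5.44, 0.648)   # about 15.45
eval_baseline = implied_baseline(9.52, 0.526)  # about 20.08
```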

Authors:Peijie Chen, Wenhao Guan, Kaidi Wang, Weijie Wu, Hukai Huang, Qingyang Hong, Lin Li
Title: DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec
Abstract:
Neural speech codecs are essential for advancing text-to-speech (TTS) systems. With the recent success of large language models in text generation, developing high-quality speech tokenizers has become increasingly important. This paper introduces DS-Codec, a novel neural speech codec featuring a dual-stage training framework with mirror and non-mirror architectures switching, designed to achieve superior speech reconstruction. We conduct extensive experiments and ablation studies to evaluate the effectiveness of our training strategy and compare the performance of the two architectures. Our results show that the mirrored structure significantly enhances the robustness of the learned codebooks, and the training strategy balances the advantages between mirrored and non-mirrored structures, leading to improved high-fidelity speech reconstruction.
中文: 本文提出DS-Codec,一种采用镜像与非镜像架构切换的双阶段训练框架的神经语音编解码器,通过增强码本鲁棒性和平衡结构优势,实现了更优的高保真语音重建。
English: This paper introduces DS-Codec, a neural speech codec with a dual-stage training framework that alternates between mirror and non-mirror architectures, achieving superior speech reconstruction by enhancing codebook robustness and balancing structural advantages.

Authors:Kaidi Wang, Wenhao Guan, Ziyue Jiang, Hukai Huang, Peijie Chen, Weijie Wu, Qingyang Hong, Lin Li
Title: Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion
Abstract:
Currently, zero-shot voice conversion systems are capable of synthesizing the voice of unseen speakers. However, most existing approaches struggle to accurately replicate the speaking style of the source speaker or mimic the distinctive speaking style of the target speaker, thereby limiting the controllability of voice conversion. In this work, we propose Discl-VC, a novel voice conversion framework that disentangles content and prosody information from self-supervised speech representations and synthesizes the target speaker's voice through in-context learning with a flow matching transformer. To enable precise control over the prosody of generated speech, we introduce a mask generative transformer that predicts discrete prosody tokens in a non-autoregressive manner based on prompts. Experimental results demonstrate the superior performance of Discl-VC in zero-shot voice conversion and its remarkable accuracy in prosody control for synthesized speech.
中文: 现有零样本语音转换系统虽能合成未见说话者的语音,但难以准确复制源说话者或模仿目标说话者的独特说话风格,限制了可控性;而提出的Discl-VC框架通过解耦自监督语音表征中的内容和韵律信息,结合上下文学习与掩码生成变换器,实现了卓越的转换效果和精确的韵律控制。
English: Current zero-shot voice conversion systems can generate unseen speakers' voices but often fail to replicate or mimic speaking styles accurately, limiting controllability, whereas the proposed Discl-VC framework disentangles content and prosody for enhanced synthesis and precise prosody control through in-context learning and a mask generative transformer.

Authors:Haojie Jia, Zhenhao Li, Gen Li, Minxian Xu, Kejiang Ye
Title: SealOS+: A Sealos-based Approach for Adaptive Resource Optimization Under Dynamic Workloads for Securities Trading System
Abstract:
As securities trading systems transition to a microservices architecture, optimizing system performance presents challenges such as inefficient resource scheduling and high service response delays. Existing container orchestration platforms lack tailored performance optimization mechanisms for trading scenarios, making it difficult to meet the stringent 50ms response time requirement imposed by exchanges. This paper introduces SealOS+, a Sealos-based performance optimization approach for securities trading, incorporating an adaptive resource scheduling algorithm leveraging deep reinforcement learning, a three-level caching mechanism for trading operations, and a Long Short-Term Memory (LSTM) based load prediction model. Real-world deployment at a securities exchange demonstrates that the optimized system achieves an average CPU utilization of 78%, reduces transaction response time to 105ms, and reaches a peak processing capacity of 15,000 transactions per second, effectively meeting the rigorous performance and reliability demands of securities trading.
中文摘要:本文提出SealOS+证券交易性能优化系统,通过深度强化学习资源调度、三级缓存和LSTM负载预测,实现105毫秒响应时间并达到每秒15,000笔交易处理能力。
English Summary: This paper presents SealOS+, a performance optimization system for securities trading that uses deep reinforcement learning for resource scheduling, a three-level cache, and LSTM load prediction to achieve 105ms response times and handle 15,000 transactions per second.

Authors:Vladimir Bataev, Andrei Andrusenko, Lilit Grigoryan, Aleksandr Laptev, Vitaly Lavrukhin, Boris Ginsburg
Title: NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding
Abstract:
Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.
Chinese: 本研究提出了NGPU-LM方法,通过重构统计n元语言模型的数据结构,在GPU上实现高效并行运算,以不到7%的计算开销显著缩小了贪婪解码与束搜索在语音识别系统中的准确率差距。
English: This work introduces NGPU-LM, a GPU-optimized approach for statistical n-gram language models that enables efficient parallel operations with minimal computational overhead, significantly narrowing the accuracy gap between greedy and beam search decoding in ASR systems.

Authors:Xuhang Chen, Michael Kwok-Po Ng, Kim-Fung Tsang, Chi-Man Pun, Shuqiang Wang
Title: ConnectomeDiffuser: Generative AI Enables Brain Network Construction from Diffusion Tensor Imaging
Abstract:
Brain network analysis plays a crucial role in diagnosing and monitoring neurodegenerative disorders such as Alzheimer's disease (AD). Existing approaches for constructing structural brain networks from diffusion tensor imaging (DTI) often rely on specialized toolkits that suffer from inherent limitations: operator subjectivity, labor-intensive workflows, and restricted capacity to capture complex topological features and disease-specific biomarkers. To overcome these challenges and advance computational neuroimaging instrumentation, ConnectomeDiffuser is proposed as a novel diffusion-based framework for automated end-to-end brain network construction from DTI. The proposed model combines three key components: (1) a Template Network that extracts topological features from 3D DTI scans using Riemannian geometric principles, (2) a diffusion model that generates comprehensive brain networks with enhanced topological fidelity, and (3) a Graph Convolutional Network classifier that incorporates disease-specific markers to improve diagnostic accuracy. ConnectomeDiffuser demonstrates superior performance by capturing a broader range of structural connectivity and pathology-related information, enabling more sensitive analysis of individual variations in brain networks. Experimental validation on datasets representing two distinct neurodegenerative conditions demonstrates significant performance improvements over other brain network methods. This work contributes to the advancement of instrumentation in the context of neurological disorders, providing clinicians and researchers with a robust, generalizable measurement framework that facilitates more accurate diagnosis, deeper mechanistic understanding, and improved therapeutic monitoring of neurodegenerative diseases such as AD.
中文:ConnectomeDiffuser是一种基于扩散的自动化框架,通过拓扑特征提取和疾病特异性生物标志物从DTI扫描构建脑网络,在诊断阿尔茨海默病等神经退行性疾病方面展现出卓越性能。
English: ConnectomeDiffuser is an automated diffusion-based framework that constructs brain networks from DTI scans using topological feature extraction and disease-specific biomarkers, demonstrating superior performance in diagnosing neurodegenerative disorders like Alzheimer's disease.

Authors:Dekai Zhu, Yixuan Hu, Youquan Liu, Dongyue Lu, Lingdong Kong, Slobodan Ilic
Title: SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation
Abstract:
Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.
中文:提出的Spiral模型是一种新颖的基于距离视图的LiDAR扩散方法,能同时生成深度、反射率和语义图,以最少的参数量实现最优性能,并有效支持下游任务的合成数据增强。
English: The proposed Spiral model is a novel range-view LiDAR diffusion approach that simultaneously generates depth, reflectance, and semantic maps, achieving state-of-the-art performance with minimal parameters while enabling effective synthetic data augmentation for downstream tasks.

Authors:Yu Zhang, Yuqi Xie, Huihan Liu, Rutav Shah, Michael Wan, Linxi Fan, Yuke Zhu
Title: SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning
Abstract:
Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations. However, large-scale datasets used for policy training often introduce substantial variability in quality, which can negatively impact performance. As a result, automatically curating datasets by filtering low-quality samples to improve quality becomes essential. Existing robotic curation approaches rely on costly manual annotations and perform curation at a coarse granularity, such as the dataset or trajectory level, failing to account for the quality of individual state-action pairs. To address this, we introduce SCIZOR, a self-supervised data curation framework that filters out low-quality state-action pairs to improve the performance of imitation learning policies. SCIZOR targets two complementary sources of low-quality data: suboptimal data, which hinders learning with undesirable actions, and redundant data, which dilutes training with repetitive patterns. SCIZOR leverages a self-supervised task progress predictor for suboptimal data to remove samples lacking task progression, and a deduplication module operating on joint state-action representation for samples with redundant patterns. Empirically, we show that SCIZOR enables imitation learning policies to achieve higher performance with less data, yielding an average improvement of 15.4% across multiple benchmarks. More information is available at: https://ut-austin-rpl.github.io/SCIZOR/
中文:SCIZOR是一种自监督数据筛选框架,通过剔除次优和冗余的状态-动作对来优化模仿学习数据集,使策略性能平均提升15.4%且所需数据更少。
English: SCIZOR is a self-supervised framework that curates imitation learning datasets by filtering out both suboptimal and redundant state-action pairs, enhancing policy performance by 15.4% on average with less data.

Authors:Antonia Karamolegkou, Angana Borah, Eunjung Cho, Sagnik Ray Choudhury, Martina Galletti, Rajarshi Ghosh, Pranav Gupta, Oana Ignat, Priyanka Kargupta, Neema Kotonya, Hemank Lamba, Sun-Joo Lee, Arushi Mangla, Ishani Mondal, Deniz Nazarova, Poli Nemkova, Dina Pisarevskaya, Naquee Rizwan, Nazanin Sabri, Dominik Stammbach, Anna Steinberg, David Tomás, Steven R Wilson, Bowen Yi, Jessica H Zhu, Arkaitz Zubiaga, Anders Søgaard, Alexander Fraser, Zhijing Jin, Rada Mihalcea, Joel R. Tetreault, Daryna Dementieva
Title: NLP for Social Good: A Survey of Challenges, Opportunities, and Responsible Deployment
Abstract:
Recent advancements in large language models (LLMs) have unlocked unprecedented possibilities across a range of applications. However, as a community, we believe that the field of Natural Language Processing (NLP) has a growing need to approach deployment with greater intentionality and responsibility. In alignment with the broader vision of AI for Social Good (Tomašev et al., 2020), this paper examines the role of NLP in addressing pressing societal challenges. Through a cross-disciplinary analysis of social goals and emerging risks, we highlight promising research directions and outline challenges that must be addressed to ensure responsible and equitable progress in NLP4SG research.
中文: 尽管大语言模型的进展带来了巨大潜力,但自然语言处理领域需更负责任地部署技术,通过跨学科研究应对社会挑战,确保公平发展。
English: Recent advances in large language models offer vast potential, but the NLP field must prioritize responsible deployment and address societal challenges through interdisciplinary research to ensure equitable progress.

Authors:Le Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu
Title: Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning
Abstract:
Current vision-guided audio captioning systems frequently fail to address audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. To bridge this critical gap, we present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. Our novel approach employs attention entropy analysis in cross-attention layers to automatically identify and suppress misleading visual cues during modal fusion. Complementing this architecture, we develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs, greatly enhancing model resilience against alignment noise. Evaluations on the AudioCaps benchmark demonstrate our system's superior performance over existing baselines, especially in mismatched modality scenarios. Furthermore, our solution demonstrates an approximately 6x improvement in inference speed compared to the baseline.
中文摘要:本研究提出了一种基于熵感知的门控融合框架,通过跨模态不确定性量化动态调节视觉信息流,并结合批处理视听混洗技术增强模型对不匹配场景的鲁棒性,在AudioCaps基准测试中表现出优越性能且推理速度显著提升。
English Summary: This study introduces an entropy-aware gated fusion framework that dynamically regulates visual information using cross-modal uncertainty analysis, combined with a batch-wise audiovisual shuffling technique to enhance model robustness against audiovisual misalignment, achieving superior performance on AudioCaps with significantly faster inference speeds.
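
The idea of using attention entropy to suppress unreliable visual cues can be illustrated with a toy gate. This is a sketch under stated assumptions rather than the paper's formulation: the abstract does not give the gating function, so the mapping `1 - H / log(n)` and the names `attention_entropy` and `visual_gate` are hypothetical. The intuition is that a diffuse (high-entropy) attention distribution over visual tokens suggests that no visual region matches the audio, so the visual stream is down-weighted.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution that sums to 1."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def visual_gate(weights):
    """Assumed gate in [0, 1]: 1 for sharply peaked attention, 0 for uniform attention."""
    n = len(weights)
    if n < 2:
        return 1.0  # a single token has zero entropy by definition
    return 1.0 - attention_entropy(weights) / math.log(n)


peaked = visual_gate([0.97, 0.01, 0.01, 0.01])   # close to 1: trust the visual cue
diffuse = visual_gate([0.25, 0.25, 0.25, 0.25])  # close to 0: suppress the visual cue
```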

Authors:Cameron Gordon, Yiping Ji, Hemanth Saratchandran, Paul Albert, Simon Lucey
Title: Compressing Sine-Activated Low-Rank Adapters through Post-Training Quantization
Abstract:
Low-Rank Adaptation (LoRA) has become a standard approach for parameter-efficient fine-tuning, offering substantial reductions in trainable parameters by modeling updates as the product of two low-rank matrices. While effective, the low-rank constraint inherently limits representational capacity, often resulting in reduced performance compared to full-rank fine-tuning. Recent work by Ji et al. (2025) has addressed this limitation by applying a fixed-frequency sinusoidal transformation to low-rank adapters, increasing their stable rank without introducing additional parameters. This raises a crucial question: can the same sine-activated technique be successfully applied within the context of Post-Training Quantization to retain benefits even after model compression? In this paper, we investigate this question by extending the sinusoidal transformation framework to quantized LoRA adapters. We develop a theoretical analysis showing that the stable rank of a quantized adapter is tightly linked to that of its full-precision counterpart, motivating the use of such rank-enhancing functions even under quantization. Our results demonstrate that the expressivity gains from a sinusoidal non-linearity persist after quantization, yielding highly compressed adapters with negligible loss in performance. We validate our approach across a range of fine-tuning tasks for language, vision and text-to-image generation achieving significant memory savings while maintaining competitive accuracy.
中文总结:最新研究将正弦变换应用于量化LoRA适配器,证明该技术能在量化后保持表达能力增益,并在语言、视觉及文生图任务中实现显著内存节省。
English Summary: Recent research extends sinusoidal transformations to quantized LoRA adapters, demonstrating that this technique preserves expressivity gains post-quantization while achieving substantial memory savings across language, vision, and text-to-image tasks.
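
The stable rank referred to in the abstract is the ratio of the squared Frobenius norm to the squared spectral norm of a matrix. The following minimal sketch computes it in pure Python and compares a rank-1 product with its element-wise sine. The form `sin(omega * U V^T)`, the frequency `omega = 1.0`, and the toy factors are assumptions for illustration; the sketch does not include quantization and is not the authors' code.

```python
import math


def stable_rank(matrix, iters=500):
    """Stable rank ||A||_F^2 / ||A||_2^2, with ||A||_2^2 estimated by power iteration."""
    rows, cols = len(matrix), len(matrix[0])
    fro_sq = sum(x * x for row in matrix for x in row)
    if fro_sq == 0.0:
        return 0.0
    # Deterministic, slightly non-uniform start vector (an arbitrary choice).
    v = [1.0 + 0.01 * j for j in range(cols)]
    sigma_sq = 0.0
    for _ in range(iters):
        av = [sum(row[j] * v[j] for j in range(cols)) for row in matrix]  # A v
        w = [sum(matrix[i][j] * av[i] for i in range(rows)) for j in range(cols)]  # A^T A v
        # Rayleigh quotient of A^T A at v, i.e. ||A v||^2 / ||v||^2.
        sigma_sq = sum(x * x for x in av) / sum(x * x for x in v)
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            break
        v = [x / norm_w for x in w]
    if sigma_sq == 0.0:
        raise ValueError("start vector lies in the null space; choose another start")
    return fro_sq / sigma_sq


u = [1.0, 2.0, 3.0]
low_rank = [[a * b for b in u] for a in u]  # rank-1 product, stable rank 1
omega = 1.0                                  # hypothetical fixed frequency
sine_activated = [[math.sin(omega * x) for x in row] for row in low_rank]

rank_before = stable_rank(low_rank)
rank_after = stable_rank(sine_activated)
```

In this toy case the sine increases the stable rank without adding parameters, since the stored factors are unchanged.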

Authors:Xunpeng Huang, Yingyu Lin, Nikki Lijing Kuang, Hanze Dong, Difan Zou, Yian Ma, Tong Zhang
Title: Almost Linear Convergence under Minimal Score Assumptions: Quantized Transition Diffusion
Abstract:
Continuous diffusion models have demonstrated remarkable performance in data generation across various domains, yet their efficiency remains constrained by two critical limitations: (1) the local adjacency structure of the forward Markov process, which restricts long-range transitions in the data space, and (2) inherent biases introduced during the simulation of time-inhomogeneous reverse denoising processes. To address these challenges, we propose Quantized Transition Diffusion (QTD), a novel approach that integrates data quantization with discrete diffusion dynamics. Our method first transforms the continuous data distribution $p_*$ into a discrete one $q_*$ via histogram approximation and binary encoding, enabling efficient representation in a structured discrete latent space. We then design a continuous-time Markov chain (CTMC) with Hamming distance-based transitions as the forward process, which inherently supports long-range movements in the original data space. For reverse-time sampling, we introduce a truncated uniformization technique to simulate the reverse CTMC, which can provably provide unbiased generation from $q_*$ under minimal score assumptions. Through a novel KL dynamic analysis of the reverse CTMC, we prove that QTD can generate samples with $O(d\ln^2(d/\epsilon))$ score evaluations in expectation to approximate the $d$-dimensional target distribution $p_*$ within an $\epsilon$ error tolerance. Our method not only establishes state-of-the-art inference efficiency but also advances the theoretical foundations of diffusion-based generative modeling by unifying discrete and continuous diffusion paradigms.
中文: 提出的量化转移扩散方法通过结合数据量化和离散扩散动力学,克服了连续扩散模型的效率限制,实现了数据空间的长程转移和无偏生成,同时达到了最先进的推理效率。
English: The proposed Quantized Transition Diffusion (QTD) method overcomes efficiency limitations in continuous diffusion models by combining data quantization with discrete diffusion dynamics, enabling long-range transitions and unbiased generation while achieving state-of-the-art inference efficiency.

Authors:Zhenghai You, Zhenyu Zhou, Lantian Li, Dong Wang
Title: An Investigation on Speaker Augmentation for End-to-End Speaker Extraction
Abstract:
Target confusion, defined as occasional switching to non-target speakers, poses a key challenge for end-to-end speaker extraction (E2E-SE) systems. We argue that this problem is largely caused by the lack of generalizability and discrimination of the speaker embeddings, and introduce a simple yet effective speaker augmentation strategy to tackle the problem. Specifically, we propose a time-domain resampling and rescaling pipeline that alters speaker traits while preserving other speech properties. This generates a variety of pseudo-speakers to help establish a generalizable speaker embedding space, while the speaker-trait-specific augmentation creates hard samples that force the model to focus on genuine speaker characteristics. Experiments on WSJ0-2Mix and LibriMix show that our method mitigates the target confusion and improves extraction performance. Moreover, it can be combined with metric learning, another effective approach to address target confusion, leading to further gains.
中文摘要:本文针对端到端说话人提取系统中的目标混淆问题,提出了一种说话人增强策略,通过时域重采样和重缩放生成伪说话人,增强模型对说话人特征的区分能力,在基准数据集上有效提升了提取性能。
English Summary: This paper addresses target confusion in end-to-end speaker extraction systems by proposing a speaker augmentation strategy that generates pseudo-speakers through time-domain resampling and rescaling, enhancing the model's ability to distinguish speaker traits and improving extraction performance on benchmark datasets.
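
Time-domain resampling changes the apparent pitch and tempo of a waveform when the result is played back at the original sample rate, which is one way to create a pseudo-speaker. The sketch below shows only this resampling step, using linear interpolation; it is an illustration under assumptions, not the paper's pipeline. The function name `resample_linear`, the `factor` parameter, and the choice of interpolation are hypothetical, and the rescaling step mentioned in the abstract is not modeled.

```python
def resample_linear(signal, factor):
    """Read the waveform at step `factor`: factor > 1 shortens it, factor < 1 lengthens it."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not signal:
        return []
    n_out = max(1, int(len(signal) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor                    # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)   # clamp at the last sample
        frac = pos - lo
        # Linear interpolation between the two neighbouring input samples.
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out


ramp = [0.0, 1.0, 2.0, 3.0]
faster = resample_linear(ramp, 2.0)  # [0.0, 2.0]
slower = resample_linear(ramp, 0.5)  # eight samples, interpolated between neighbours
```

A production pipeline would more likely use a band-limited resampler to avoid aliasing; linear interpolation is used here only to keep the example dependency-free.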

Authors:Xuhang Chen, Zhuo Li, Yanyan Shen, Mufti Mahmud, Hieu Pham, Chi-Man Pun, Shuqiang Wang
Title: High-Fidelity Functional Ultrasound Reconstruction via A Visual Auto-Regressive Framework
Abstract:
Functional ultrasound (fUS) imaging provides exceptional spatiotemporal resolution for neurovascular mapping, yet its practical application is significantly hampered by critical challenges. Foremost among these are data scarcity, arising from ethical considerations and signal degradation through the cranium, which collectively limit dataset diversity and compromise the fairness of downstream machine learning models.
中文: 功能性超声成像虽能实现高分辨率神经血管成像,却因伦理限制和颅骨信号衰减导致数据稀缺,严重制约数据集多样性并损害下游机器学习模型的公平性。
English: Functional ultrasound imaging offers high-resolution neurovascular mapping but faces major hurdles including data scarcity from ethical issues and signal loss through the skull, which reduce dataset diversity and undermine machine learning model fairness.

Authors:Yihan Wang, Qiao Yan, Zhenghao Xing, Lihao Liu, Junjun He, Chi-Wing Fu, Xiaowei Hu, Pheng-Ann Heng
Title: Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making
Abstract:
Large language models (LLMs) have demonstrated strong potential in clinical question answering, with recent multi-agent frameworks further improving diagnostic accuracy via collaborative reasoning. However, we identify a recurring issue of Silent Agreement, where agents prematurely converge on diagnoses without sufficient critical analysis, particularly in complex or ambiguous cases. We present a new concept called Catfish Agent, a role-specialized LLM designed to inject structured dissent and counter silent agreement. Inspired by the "catfish effect" in organizational psychology, the Catfish Agent is designed to challenge emerging consensus to stimulate deeper reasoning. We formulate two mechanisms to encourage effective and context-aware interventions: (i) a complexity-aware intervention that modulates agent engagement based on case difficulty, and (ii) a tone-calibrated intervention articulated to balance critique and collaboration. Evaluations on nine medical Q&A and three medical VQA benchmarks show that our approach consistently outperforms both single- and multi-agent LLM frameworks, including leading commercial models such as GPT-4o and DeepSeek-R1.
Chinese: 本研究提出“鲶鱼代理”概念,通过复杂性感知和语气校准干预机制,有效解决多智能体临床诊断中的过早共识问题,在多项基准测试中显著超越现有领先模型。
English: The study introduces a Catfish Agent to counteract premature consensus in multi-agent clinical diagnostics, employing complexity-aware and tone-calibrated interventions that significantly outperform existing models across multiple benchmarks.

Authors:Kerui Ren, Jiayang Bai, Linning Xu, Lihan Jiang, Jiangmiao Pang, Mulin Yu, Bo Dai
Title: MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation
Abstract:
Object compositing offers significant promise for augmented reality (AR) and embodied intelligence applications. Existing approaches predominantly focus on single-image scenarios or intrinsic decomposition techniques, facing challenges with multi-view consistency, complex scenes, and diverse lighting conditions. Recent inverse rendering advancements, such as 3D Gaussian and diffusion-based methods, have enhanced consistency but are limited by scalability, heavy data requirements, or prolonged reconstruction time per scene. To broaden the applicability of object compositing, we introduce MV-CoLight, a two-stage framework for illumination-consistent object compositing in both 2D images and 3D scenes. Our novel feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. We employ a Hilbert curve-based mapping to align 2D image inputs with 3D Gaussian scene representations seamlessly. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments demonstrate state-of-the-art harmonized results across standard benchmarks and our dataset, and results on casually captured real-world scenes demonstrate the framework's robustness and wide generalization.
中文: MV-CoLight提出了一种新颖的两阶段框架,通过直接光照建模和基于希尔伯特曲线的对齐方法,在2D和3D场景中实现光照一致的目标合成,在标准基准测试和实际场景中均展现出领先性能。
English: MV-CoLight introduces a novel two-stage framework that achieves illumination-consistent object compositing in both 2D and 3D scenes through direct lighting modeling and Hilbert curve-based alignment, demonstrating state-of-the-art performance across benchmarks and real-world applications.

Authors:Pedro Pereira, José Gonçalves, João Vitorino, Eva Maia, Isabel Praça
Title: Enhancing JavaScript Malware Detection through Weighted Behavioral DFAs
Abstract:
This work addresses JavaScript malware detection to enhance client-side web application security with a behavior-based system. The ability to detect malicious JavaScript execution sequences is a critical problem in modern web security as attack techniques become more sophisticated. This study introduces a new system for detecting JavaScript malware using a Deterministic Finite Automaton (DFA) along with a weighted-behavior system, which we call behavior DFA. This system captures malicious patterns and provides a dynamic mechanism to classify new sequences that exhibit partial similarity to known attacks, labeling them as benign, partially malicious, or fully malicious. Experimental evaluation on a dataset of 1,058 sequences captured in a real-world environment demonstrates the capability of the system to detect and classify threats effectively, with the behavior DFA successfully identifying exact matches and partial similarities to known malicious behaviors. The results highlight the adaptability of the system in detecting emerging threats while maintaining transparency in decision making.
中文: 本研究提出一种基于行为特征的JavaScript恶意软件检测系统,采用确定性有限自动机结合加权行为机制,能够将代码序列分类为良性、部分恶意或完全恶意,并在真实环境测试中展现出有效的威胁识别能力。
English: This study presents a behavior-based JavaScript malware detection system using a Deterministic Finite Automaton with weighted behaviors to classify sequences as benign, partially malicious, or fully malicious, demonstrating effective threat identification in real-world testing.
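
The idea of a weighted behavior DFA described in this entry can be illustrated with a minimal sketch. It is not the authors' implementation: the event names, transition weights, the 0.4 threshold, and the rule that unmatched events leave the state unchanged are all assumptions chosen only to show how exact matches and partial similarities could be separated.

```python
from typing import Dict, List, Tuple

# Hypothetical transition table: (state, event) -> (next_state, weight).
# Event names and weights are invented for illustration only.
TRANSITIONS: Dict[Tuple[int, str], Tuple[int, float]] = {
    (0, "create_element"): (1, 0.2),
    (1, "set_src"): (2, 0.3),
    (2, "eval"): (3, 0.5),
}
ACCEPTING = {3}          # reaching this state means an exact pattern match
PARTIAL_THRESHOLD = 0.4  # assumed cut-off for "partially malicious"


def classify(events: List[str]) -> str:
    """Walk the DFA over an event sequence and label the sequence."""
    state, score = 0, 0.0
    for event in events:
        step = TRANSITIONS.get((state, event))
        if step is None:
            # Unmatched events act as implicit self-loops with zero weight.
            continue
        state, weight = step
        score += weight
        if state in ACCEPTING:
            return "fully malicious"
    # No exact match: fall back to the accumulated behavior weight.
    return "partially malicious" if score >= PARTIAL_THRESHOLD else "benign"
```

Under these assumptions, a sequence that completes the whole pattern is labeled fully malicious, one that covers only the first two steps accumulates a weight of 0.5 and is labeled partially malicious, and one that covers only the first step stays benign.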

Authors:Yue Zhang, Yingzhao Jian, Hehe Fan, Yi Yang, Roger Zimmermann
Title: Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts
Abstract:
Recent advancements in multimodal large language models (MLLMs) have demonstrated considerable potential for comprehensive 3D scene understanding. However, existing approaches typically utilize only one or a limited subset of 3D modalities, resulting in incomplete representations of 3D scenes and reduced interpretive accuracy. Furthermore, different types of queries inherently depend on distinct modalities, indicating that uniform processing of all modality tokens may fail to effectively capture query-specific context. To address these challenges, we propose Uni3D-MoE, a sparse Mixture-of-Experts (MoE)-based 3D MLLM designed to enable adaptive 3D multimodal fusion. Specifically, Uni3D-MoE integrates a comprehensive set of 3D modalities, including multi-view RGB and depth images, bird's-eye-view (BEV) maps, point clouds, and voxel representations. At its core, our framework employs a learnable routing mechanism within the sparse MoE-based large language model, dynamically selecting appropriate experts at the token level. Each expert specializes in processing multimodal tokens based on learned modality preferences, thus facilitating flexible collaboration tailored to diverse task-specific requirements. Extensive evaluations on standard 3D scene understanding benchmarks and specialized datasets demonstrate the efficacy of Uni3D-MoE.
中文: Uni3D-MoE是一种基于稀疏专家混合的多模态大语言模型,它整合了多种3D模态并采用可学习路由机制动态选择专业专家进行自适应融合,从而显著提升了三维场景理解的准确性。
English: Uni3D-MoE is a sparse Mixture-of-Experts-based multimodal large language model that integrates multiple 3D modalities and employs a learnable routing mechanism to dynamically select specialized experts for adaptive fusion, significantly enhancing 3D scene understanding accuracy.
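
Token-level expert selection in a sparse MoE layer is commonly done with top-k routing. The sketch below shows that generic technique in plain Python; it is not taken from Uni3D-MoE, and the function names, the choice of k = 2, and passing router logits in directly (instead of computing them with a learned linear layer) are assumptions for illustration.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    The weights of the selected experts are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]
```

With this scheme, each token's output would be the weighted sum of the outputs of its selected experts only, which is what keeps the layer sparse.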

Authors:Duzhen Zhang, Yong Ren, Chenxing Li, Dong Yu, Tielin Zhang
Title: Information-Theoretic Complementary Prompts for Improved Continual Text Classification
Abstract:
Continual Text Classification (CTC) aims to continuously classify new text data over time while minimizing catastrophic forgetting of previously acquired knowledge. However, existing methods often focus on task-specific knowledge, overlooking the importance of shared, task-agnostic knowledge. Inspired by the complementary learning systems theory, which posits that humans learn continually through the interaction of two systems -- the hippocampus, responsible for forming distinct representations of specific experiences, and the neocortex, which extracts more general and transferable representations from past experiences -- we introduce Information-Theoretic Complementary Prompts (InfoComp), a novel approach for CTC. InfoComp explicitly learns two distinct prompt spaces: P(rivate)-Prompt and S(hared)-Prompt. These respectively encode task-specific and task-invariant knowledge, enabling models to sequentially learn classification tasks without relying on data replay. To promote more informative prompt learning, InfoComp uses an information-theoretic framework that maximizes mutual information between different parameters (or encoded representations). Within this framework, we design two novel loss functions: (1) to strengthen the accumulation of task-specific knowledge in P-Prompt, effectively mitigating catastrophic forgetting, and (2) to enhance the retention of task-invariant knowledge in S-Prompt, improving forward knowledge transfer. Extensive experiments on diverse CTC benchmarks show that our approach outperforms previous state-of-the-art methods.
中文: 本文提出InfoComp方法,通过信息论框架分别学习任务特定和任务不变知识的提示空间,无需数据回放即可缓解灾难性遗忘并提升知识迁移能力,在持续文本分类中表现优异。
English: This paper introduces InfoComp, a novel approach for Continual Text Classification that learns task-specific and task-invariant knowledge in separate prompt spaces using an information-theoretic framework to mitigate catastrophic forgetting and enhance knowledge transfer without data replay.

Authors:Hemanth Saratchandran, Damien Teney, Simon Lucey
Title: Leaner Transformers: More Heads, Less Depth
Abstract:
Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means better", leading to ever-increasing model sizes. This paper challenges this ideology by showing that many existing transformers might be unnecessarily oversized. We discover a theoretical principle that redefines the role of multi-head attention. An important benefit of the multiple heads is in improving the conditioning of the attention block. We exploit this theoretical insight and redesign popular architectures with an increased number of heads. The improvement in the conditioning proves so significant in practice that model depth can be decreased, reducing the parameter count by up to 30-50% while maintaining accuracy. We obtain consistent benefits across a variety of transformer-based architectures of various scales, on tasks in computer vision (ImageNet-1k) as well as language and sequence modeling (GLUE benchmark, TinyStories, and the Long-Range Arena benchmark).
中文: 本文挑战了Transformer模型“越大越好”的观念,通过增加注意力头数量改善条件化,使模型在保持精度的同时深度减少、参数降低30-50%,在视觉与语言任务中均验证了有效性。
English: This paper challenges the "bigger is better" trend in transformers by demonstrating that increasing the number of attention heads improves conditioning, allowing models to reduce depth and parameters by 30-50% while maintaining accuracy across various tasks.

Authors:Jian Wang, Boyan Zhu, Chak Tou Leong, Yongqi Li, Wenjie Li
Title: Scaling over Scaling: Exploring Test-Time Scaling Plateau in Large Reasoning Models
Abstract:
Large reasoning models (LRMs) have exhibited the capacity to enhance reasoning performance via internal test-time scaling. Building upon this, a promising direction is to further scale test-time compute to unlock even greater reasoning capabilities. However, as we push these scaling boundaries, systematically understanding the practical limits and achieving optimal resource allocation becomes a critical challenge. In this paper, we investigate the scaling plateau of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM). We theoretically analyze two fundamental paradigms for such extended scaling, parallel scaling and sequential scaling, from a probabilistic modeling perspective. Our primary contribution is the derivation of the saturation point on the scaling budget for both strategies, identifying thresholds beyond which additional computation yields diminishing returns. Remarkably, despite their distinct mechanisms, both paradigms converge to a unified mathematical structure in their upper bounds. We empirically validate our theoretical findings on challenging reasoning benchmarks, including AIME, MATH-500, and GPQA, demonstrating the practical utility of these bounds for test-time resource allocation. We hope that this work provides insights into the cost-benefit trade-offs of test-time scaling, guiding the development of more resource-efficient inference strategies for large reasoning models.
中文: 大型推理模型通过测试时扩展可提升性能,但本研究提出测试时扩展性能模型,确定了两种扩展策略的预算饱和点,证明超过该点后计算回报递减,并在多个基准测试中验证了这一发现。
English: Large reasoning models can boost performance through test-time scaling, but this study introduces a Test-Time Scaling Performance Model to identify budget saturation points where further computation yields diminishing returns, validated across multiple benchmarks.
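
The notion of a saturation point for parallel scaling can be illustrated with a textbook model; this is not the paper's TTSPM. The sketch assumes each sample is independently correct with probability p, so the chance that at least one of n samples is correct is 1 - (1 - p)^n and the marginal gain of one more sample is p(1 - p)^n. The function names and the threshold eps are hypothetical.

```python
def parallel_success(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def saturation_budget(p: float, eps: float) -> int:
    """Smallest n for which one additional sample gains less than eps.

    Under the independence assumption, going from n to n + 1 samples
    raises the success probability by p * (1 - p) ** n.
    """
    if not (0.0 < p < 1.0) or eps <= 0.0:
        raise ValueError("need 0 < p < 1 and eps > 0")
    n = 0
    while p * (1.0 - p) ** n >= eps:
        n += 1
    return n
```

For example, with p = 0.5 and eps = 0.01 the marginal gain first drops below the threshold at n = 6, where the success probability is already 0.984375; sampling further buys very little under this simplified model.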

Authors:Yuxuan Sun, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Bowen Ding, Tao Lin, Lin Yang
Title: CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic
Abstract:
Recent advances in computational pathology have led to the emergence of numerous foundation models. However, these approaches fail to replicate the diagnostic process of pathologists, as they either simply rely on general-purpose encoders with multi-instance learning for classification or directly apply multimodal models to generate reports from images. A significant limitation is their inability to emulate the diagnostic logic employed by pathologists, who systematically examine slides at low magnification for overview before progressively zooming in on suspicious regions to formulate comprehensive diagnoses. To address this gap, we introduce CPathAgent, an innovative agent-based model that mimics pathologists' reasoning processes by autonomously executing zoom-in/out and navigation operations across pathology images based on observed visual features. To achieve this, we develop a multi-stage training strategy unifying patch-level, region-level, and whole-slide capabilities within a single model, which is essential for mimicking pathologists, who require understanding and reasoning capabilities across all three scales. This approach generates substantially more detailed and interpretable diagnostic reports compared to existing methods, particularly for huge region understanding. Additionally, we construct an expert-validated PathMMU-HR$^{2}$, the first benchmark for huge region analysis, a critical intermediate scale between patches and whole slides, as diagnosticians typically examine several key regions rather than entire slides at once. Extensive experiments demonstrate that CPathAgent consistently outperforms existing approaches across three scales of benchmarks, validating the effectiveness of our agent-based diagnostic approach and highlighting a promising direction for the future development of computational pathology.
中文: CPathAgent是一种基于智能体的模型,通过自主执行病理图像的缩放和导航操作来模拟病理医生的诊断逻辑,采用多阶段训练策略在多个尺度上超越现有方法,并生成更详细的诊断报告。
English: CPathAgent is an agent-based model that replicates pathologists' diagnostic logic by autonomously navigating pathology images through zoom-in/out operations, employing a multi-stage training strategy to outperform existing methods across multiple scales and generate more detailed diagnostic reports.

Authors:Avinash Madasu, Vasudev Lal, Phillip Howard
Title: Cultural Awareness in Vision-Language Models: A Cross-Country Exploration
Abstract:
Vision-Language Models (VLMs) are increasingly deployed in diverse cultural contexts, yet their internal biases remain poorly understood. In this work, we propose a novel framework to systematically evaluate how VLMs encode cultural differences and biases related to race, gender, and physical traits across countries. We introduce three retrieval-based tasks: (1) Race to Country retrieval, which examines the association between individuals from specific racial groups (East Asian, White, Middle Eastern, Latino, South Asian, and Black) and different countries; (2) Personal Traits to Country retrieval, where images are paired with trait-based prompts (e.g., Smart, Honest, Criminal, Violent) to investigate potential stereotypical associations; and (3) Physical Characteristics to Country retrieval, focusing on visual attributes like skinny, young, obese, and old to explore how physical appearances are culturally linked to nations. Our findings reveal persistent biases in VLMs, highlighting how visual representations may inadvertently reinforce societal stereotypes.
中文: 本研究提出评估视觉语言模型中文化偏见的框架,发现其倾向于强化不同国家间关于种族、性别和身体特征的刻板印象。
English: This study introduces a framework to assess cultural biases in Vision-Language Models, revealing their tendency to reinforce stereotypes related to race, gender, and physical traits across different countries.

Authors:Changjian Jiang, Kerui Ren, Linning Xu, Jiong Chen, Jiangmiao Pang, Yu Zhang, Bo Dai, Mulin Yu
Title: HaloGS: Loose Coupling of Compact Geometry and Gaussian Splats for 3D Scenes
Abstract:
High-fidelity 3D reconstruction and rendering hinge on capturing precise geometry while preserving photorealistic detail. Most existing methods either fuse these goals into a single cumbersome model or adopt hybrid schemes whose uniform primitives lead to a trade-off between efficiency and fidelity. In this paper, we introduce HaloGS, a dual representation that loosely couples coarse triangles for geometry with Gaussian primitives for appearance, motivated by the lightweight classic geometry representations and their proven efficiency in real-world applications. Our design yields a compact yet expressive model capable of photorealistic rendering across both indoor and outdoor environments, seamlessly adapting to varying levels of scene complexity. Experiments on multiple benchmark datasets demonstrate that our method yields both compact, accurate geometry and high-fidelity renderings, especially in challenging scenarios where robust geometric structure makes a clear difference.
中文:HaloGS提出了一种结合粗糙三角形几何与高斯外观基元的双重表示法,能在保持稳健几何结构的同时,实现跨场景的紧凑型照片级真实感渲染。
English: HaloGS introduces a dual representation combining coarse triangles for geometry and Gaussian primitives for appearance, achieving compact yet photorealistic rendering across diverse environments while maintaining robust geometric structure.

Authors:Isabelle Augenstein, Michiel Bakker, Tanmoy Chakraborty, David Corney, Emilio Ferrara, Iryna Gurevych, Scott Hale, Eduard Hovy, Heng Ji, Irene Larraz, Filippo Menczer, Preslav Nakov, Paolo Papotti, Dhruv Sahnan, Greta Warren, Giovanni Zagni
Title: Community Moderation and the New Epistemology of Fact Checking on Social Media
Abstract:
Social media platforms have traditionally relied on internal moderation teams and partnerships with independent fact-checking organizations to identify and flag misleading content. Recently, however, platforms including X (formerly Twitter) and Meta have shifted towards community-driven content moderation by launching their own versions of crowd-sourced fact-checking -- Community Notes. If effectively scaled and governed, such crowd-checking initiatives have the potential to combat misinformation with increased scale and speed as successfully as community-driven efforts once did with spam. Nevertheless, general content moderation, especially for misinformation, is inherently more complex. Public perceptions of truth are often shaped by personal biases, political leanings, and cultural contexts, complicating consensus on what constitutes misleading content. This suggests that community efforts, while valuable, cannot replace the indispensable role of professional fact-checkers. Here we systematically examine the current approaches to misinformation detection across major platforms, explore the emerging role of community-driven moderation, and critically evaluate both the promises and challenges of crowd-checking at scale.
中文: 社交媒体平台正从传统审核转向社区驱动的辟谣机制,如社区笔记,虽能更高效打击虚假信息,但因公众认知的主观性和专业核查者的不可替代性而面临挑战。
English: Social media platforms are shifting from traditional moderation to community-driven fact-checking like Community Notes, which could combat misinformation more efficiently but face challenges due to subjective public perceptions and the irreplaceable role of professional fact-checkers.

Authors:Jianqiao Zheng, Xueqian Li, Hemanth Saratchandran, Simon Lucey
Title: Structured Initialization for Vision Transformers
Abstract:
Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into ViTs, not through an architectural intervention but solely through initialization. The motivation here is to have a ViT that can enjoy strong CNN-like performance when data assets are small, but can still scale to ViT-like performance as the data expands. Our approach is motivated by our empirical results that random impulse filters can achieve commensurate performance to learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structures. Empirical results demonstrate that our method significantly outperforms standard ViT initialization across numerous small and medium-scale benchmarks, including Food-101, CIFAR-10, CIFAR-100, STL-10, Flowers, and Pets, while maintaining comparable performance on large-scale datasets such as ImageNet-1K. Moreover, our initialization strategy can be easily integrated into various transformer-based architectures such as Swin Transformer and MLP-Mixer with consistent improvements in performance.
中文: 本文提出了一种新颖的初始化方法,无需改变架构即可将类似CNN的归纳偏置融入视觉Transformer中,显著提升了中小规模数据集的性能,同时在大规模基准测试中保持竞争力。
English: This paper introduces a novel initialization method that incorporates CNN-like inductive biases into Vision Transformers (ViTs) without architectural changes, significantly enhancing performance on small to medium-scale datasets while maintaining competitiveness on large-scale benchmarks.

Authors:Yu Wang, Junshu Dai, Yuchen Ying, Yuxuan Liang, Tongya Zheng, Mingli Song
Title: Adaptive Location Hierarchy Learning for Long-Tailed Mobility Prediction
Abstract:
Human mobility prediction, which aims to forecast users' next location visits based on historical trajectories, is crucial for applications ranging from location-based recommendations to urban planning. Despite the severe long-tailed distribution of locations, the problem of long-tailed mobility prediction remains largely underexplored. Existing long-tailed learning methods primarily focus on rebalancing the skewed distribution at the data, model, or class level, neglecting to exploit the spatiotemporal semantics of locations. To address this gap, we propose the first plug-and-play framework for long-tailed mobility prediction in an exploitation and exploration manner, named Adaptive LOcation HierArchy learning (ALOHA). First, we construct a city-tailored location hierarchy based on Large Language Models (LLMs) by exploiting Maslow's theory of human motivation to design Chain-of-Thought (CoT) prompts that capture spatiotemporal semantics. Second, we optimize the location hierarchy predictions by Gumbel disturbance and node-wise adaptive weights within the hierarchical tree structure. Experiments on state-of-the-art models across six datasets demonstrate the framework's consistent effectiveness and generalizability, striking a good balance between head and tail locations. Weight analysis and ablation studies reveal the optimization differences of each component for head and tail locations. Furthermore, in-depth analyses of hierarchical distance and a case study demonstrate the effective semantic guidance from the location hierarchy. Our code will be made publicly available.
中文: 本研究提出ALOHA框架,通过利用大语言模型构建具有时空语义的位置层级,以探索性方式解决长尾移动预测问题,在多个数据集的实验中实现了头部与尾部位置预测的有效平衡。
English: The study introduces ALOHA, a plug-and-play framework that leverages LLM-constructed location hierarchies with spatiotemporal semantics to address long-tailed mobility prediction, demonstrating consistent effectiveness in balancing head and tail locations across multiple datasets.

Authors:Jiaxin Song, Yixu Wang, Jie Li, Rui Yu, Yan Teng, Xingjun Ma, Yingchun Wang
Title: JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models
Abstract:
Vision-Language Models (VLMs) exhibit impressive performance, yet the integration of powerful vision encoders has significantly broadened their attack surface, rendering them increasingly susceptible to jailbreak attacks. However, lacking well-defined attack objectives, existing jailbreak methods often struggle with gradient-based strategies that are prone to local optima and lack precise directional guidance, and they typically decouple visual and textual modalities, thereby limiting their effectiveness by neglecting crucial cross-modal interactions. Inspired by the Eliciting Latent Knowledge (ELK) framework, we posit that VLMs encode safety-relevant information within their internal fusion-layer representations, revealing an implicit safety decision boundary in the latent space. This motivates exploiting the boundary to steer model behavior. Accordingly, we propose JailBound, a novel latent space jailbreak framework comprising two stages: (1) Safety Boundary Probing, which addresses the guidance issue by approximating the decision boundary within the fusion layer's latent space, thereby identifying optimal perturbation directions towards the target region; and (2) Safety Boundary Crossing, which overcomes the limitations of decoupled approaches by jointly optimizing adversarial perturbations across both image and text inputs. This latter stage employs an innovative mechanism to steer the model's internal state towards policy-violating outputs while maintaining cross-modal semantic consistency. Extensive experiments on six diverse VLMs demonstrate JailBound's efficacy: it achieves average attack success rates of 94.32% (white-box) and 67.28% (black-box), which are 6.17% and 21.13% higher than SOTA methods, respectively. Our findings expose an overlooked safety risk in VLMs and highlight the urgent need for more robust defenses. Warning: This paper contains potentially sensitive, harmful and offensive content.
中文: 该研究提出JailBound框架,通过联合优化跨模态扰动来利用视觉语言模型中的潜在安全边界,相比现有方法显著提高了攻击成功率,揭示了关键的安全风险。
English: The study introduces JailBound, a novel jailbreak framework that exploits latent safety boundaries in Vision-Language Models by jointly optimizing cross-modal perturbations, achieving significantly higher attack success rates than existing methods and revealing critical security vulnerabilities.

Authors:Keane Ong, Rui Mao, Deeksha Varshney, Paul Pu Liang, Erik Cambria, Gianmarco Mengaldo
Title: Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation
Abstract:
Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form, forward counterfactual reasoning, focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force (FINancial FORward Counterfactual Evaluation). By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM-based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research.
中文摘要:前瞻性反事实推理有助于预测金融市场未来动向,而FIN-FORCE基准通过评估大语言模型生成此类预测的能力,为自动化决策支持开辟了新途径。
English Summary: Forward counterfactual reasoning helps anticipate future financial market developments, and the FIN-FORCE benchmark enables automated evaluation of large language models for generating these insights to support decision-making.

Authors:Hossein Zaremehrjerdi, Shreyan Ganguly, Ashlyn Rairdin, Elizabeth Tranel, Benjamin Feuer, Juan Ignacio Di Salvo, Srikanth Panthulugiri, Hernan Torres Pacin, Victoria Moser, Sarah Jones, Joscif G Raigne, Yanben Shen, Heidi M. Dornath, Aditya Balu, Adarsh Krishnamurthy, Asheesh K Singh, Arti Singh, Baskar Ganapathysubramanian, Chinmay Hegde, Soumik Sarkar
Title: Towards Large Reasoning Models for Agriculture
Abstract:
Agricultural decision-making involves complex, context-specific reasoning, where choices about crops, practices, and interventions depend heavily on geographic, climatic, and economic conditions. Traditional large language models (LLMs) often fall short in navigating this nuanced problem due to limited reasoning capacity. We hypothesize that recent advances in large reasoning models (LRMs) can better handle such structured, domain-specific inference. To investigate this, we introduce AgReason, the first expert-curated open-ended science benchmark with 100 questions for agricultural reasoning. Evaluations across thirteen open-source and proprietary models reveal that LRMs outperform conventional ones, though notable challenges persist, with the strongest Gemini-based baseline achieving 36% accuracy. We also present AgThoughts, a large-scale dataset of 44.6K question-answer pairs generated with human oversight and equipped with synthetically generated reasoning traces. Using AgThoughts, we develop AgThinker, a suite of small reasoning models that can be run on consumer-grade GPUs, and show that our dataset can be effective in unlocking agricultural reasoning abilities in LLMs. Our project page is here: https://baskargroup.github.io/Ag_reasoning/
中文: 农业决策需要细致推理,传统语言模型难以胜任,但大型推理模型在新基准AgReason上表现更优,而AgThoughts数据集则支持开发适用于消费级硬件的紧凑型AgThinker模型。
English: Agricultural decision-making requires nuanced reasoning that traditional language models struggle with, but large reasoning models show improved performance on the new AgReason benchmark, while the AgThoughts dataset enables the development of compact AgThinker models for consumer hardware.

Authors:Vignesh Kottayam Viswanathan, Akash Patel, Mario Alberto Valdes Saucedo, Sumeet Satpute, Christoforos Kanellakis, George Nikolakopoulos
Title: SPADE: Towards Scalable Path Planning Architecture on Actionable Multi-Domain 3D Scene Graphs
Abstract:
In this work, we introduce SPADE, a path planning framework designed for autonomous navigation in dynamic environments using 3D scene graphs. SPADE combines hierarchical path planning with local geometric awareness to enable collision-free movement in dynamic scenes. The framework bifurcates the planning problem into two: (a) solving the sparse abstract global layer plan and (b) iterative path refinement across denser lower local layers in step with local geometric scene navigation. To ensure efficient extraction of a feasible route in dense multi-task domain scene graphs, the framework enforces informed sampling of traversable edges prior to path planning. This removes extraneous information not relevant to path planning and reduces the overall planning complexity over a graph. Existing approaches address the problem of path planning over scene graphs by decoupling hierarchical and geometric path evaluation processes. Specifically, this results in inefficient replanning over the entire scene graph when encountering path obstructions blocking the original route. In contrast, SPADE prioritizes local layer planning coupled with local geometric scene navigation, enabling navigation through dynamic scenes while maintaining efficiency in computing a traversable route. We validate SPADE through extensive simulation experiments and real-world deployment on a quadrupedal robot, demonstrating its efficacy in handling complex and dynamic scenarios.
中文: SPADE是一种分层路径规划框架,通过结合全局路径规划和局部几何优化,实现在动态环境中的高效无碰撞自主导航。
English: SPADE is a hierarchical path planning framework that enables autonomous navigation in dynamic environments by combining global route planning with local geometric refinements, ensuring efficient and collision-free movement.

Authors:Yongjie Wang, Yibo Wang, Xin Zhou, Zhiqi Shen
Title: Response Uncertainty and Probe Modeling: Two Sides of the Same Coin in LLM Interpretability?
Abstract:
Probing techniques have shown promise in revealing how LLMs encode human-interpretable concepts, particularly when applied to curated datasets. However, the factors governing a dataset's suitability for effective probe training are not well-understood. This study hypothesizes that probe performance on such datasets reflects characteristics of both the LLM's generated responses and its internal feature space. Through quantitative analysis of probe performance and LLM response uncertainty across a series of tasks, we find a strong correlation: improved probe performance consistently corresponds to a reduction in response uncertainty, and vice versa. Subsequently, we delve deeper into this correlation through the lens of feature importance analysis. Our findings indicate that high LLM response variance is associated with a larger set of important features, which poses a greater challenge for probe models and often results in diminished performance. Moreover, leveraging the insights from response uncertainty analysis, we are able to identify concrete examples where LLM representations align with human knowledge across diverse domains, offering additional evidence of interpretable reasoning in LLMs.
中文摘要:本研究发现探针性能提升与大语言模型响应不确定性降低存在强相关性,表明高响应方差会扩大关键特征集从而增加探针解析难度,同时通过不确定性分析找到了模型表征与人类知识相契合的具体例证。
English Summary: This study demonstrates a strong correlation between improved probe performance and reduced LLM response uncertainty, revealing that higher response variance complicates feature interpretation and diminishes probe effectiveness, while also identifying instances where LLM representations align with human knowledge.

Authors:Qinfan Xiao, Ziyun Cui, Chi Zhang, Siqi Chen, Wen Wu, Andrew Thwaites, Alexandra Woolgar, Bowen Zhou, Chao Zhang
Title: BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals
Abstract:
Electroencephalography (EEG) and magnetoencephalography (MEG) measure neural activity non-invasively by capturing electromagnetic fields generated by dendritic currents. Although rooted in the same biophysics, EEG and MEG exhibit distinct signal patterns, further complicated by variations in sensor configurations across modalities and recording devices. Existing approaches typically rely on separate, modality- and dataset-specific models, which limits the performance and cross-domain scalability. This paper proposes BrainOmni, the first brain foundation model that generalises across heterogeneous EEG and MEG recordings. To unify diverse data sources, we introduce BrainTokenizer, the first tokenizer that quantises spatiotemporal brain activity into discrete representations. Central to BrainTokenizer is a novel Sensor Encoder that encodes sensor properties such as spatial layout, orientation, and type, enabling compatibility across devices and modalities. Building upon the discrete representations, BrainOmni learns unified semantic embeddings of brain signals by self-supervised pretraining. To the best of our knowledge, it is the first foundation model to support both EEG and MEG signals, as well as the first to incorporate large-scale MEG pretraining. A total of 1,997 hours of EEG and 656 hours of MEG data are curated and standardised from publicly available sources for pretraining. Experiments show that BrainOmni outperforms both existing foundation models and state-of-the-art task-specific models on a range of downstream tasks. It also demonstrates strong generalisation to unseen EEG and MEG devices. Further analysis reveals that joint EEG-MEG (EMEG) training yields consistent improvements across both modalities. Code and model checkpoints will be released upon acceptance.
中文: 本文提出首个统一处理脑电图和脑磁图信号的基础模型BrainOmni,通过BrainTokenizer的传感器编码和自监督学习,在多项任务中展现出优于现有方法的性能及强大的跨设备泛化能力。
English: This paper introduces BrainOmni, the first foundation model that unifies EEG and MEG signal processing through BrainTokenizer's sensor encoding and self-supervised learning, demonstrating superior performance across diverse tasks and devices.

Authors:Yang Xiao, Jiashuo Wang, Qiancheng Xu, Changhe Song, Chunpu Xu, Yi Cheng, Wenjie Li, Pengfei Liu
Title: Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
Abstract:
As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities - particularly their ability to track dynamic mental states - becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present DynToM, a novel benchmark specifically designed to evaluate LLMs' ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that their average performance underperforms humans by 44.7%, with performance degrading significantly when tracking and reasoning about the shift of mental states. This performance gap highlights fundamental limitations in current LLMs' ability to model the dynamic nature of human mental states.
中文摘要:DynToM基准测试表明,大型语言模型在追踪动态心理状态变化方面的表现比人类低44.7%,这暴露出当前模型在模拟人类认知动态演变方面存在根本性缺陷。
English Summary: The DynToM benchmark reveals that large language models significantly underperform humans by 44.7% in tracking dynamic mental state changes, exposing critical limitations in their ability to model evolving human cognition during social interactions.

Authors:Kentaro Onda, Yosuke Kashiwagi, Emiru Tsunoo, Hayato Futami, Shinji Watanabe
Title: Differentiable K-means for Fully-optimized Discrete Token-based ASR
Abstract:
Recent studies have highlighted the potential of discrete tokens derived from self-supervised learning (SSL) models for various speech-related tasks. These tokens serve not only as substitutes for text in language modeling but also as intermediate representations for tasks such as automatic speech recognition (ASR). However, discrete tokens are typically obtained via k-means clustering of SSL features independently of downstream tasks, making them suboptimal for specific applications. This paper proposes the use of differentiable k-means, enabling the joint optimization of tokenization and downstream tasks. This approach enables the fine-tuning of the SSL parameters and learning weights for outputs from multiple SSL layers. Experiments were conducted with ASR as a downstream task. ASR accuracy successfully improved owing to the optimized tokens. The acquired tokens also exhibited greater purity of phonetic information and were found to be useful even in speech resynthesis.
中文摘要:本文提出使用可微分k均值聚类优化自监督学习模型的离散标记,使其在下游任务中联合优化,从而提升自动语音识别精度并获得更纯净的语音特征,适用于语音重合成等应用。
English Summary: This paper introduces differentiable k-means clustering to optimize discrete tokens from self-supervised learning models for downstream tasks, improving ASR accuracy and enhancing phonetic purity for applications like speech resynthesis.
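The abstract above does not spell out how k-means is made differentiable. One common relaxation, shown here only as a minimal sketch and not as the authors' formulation, replaces the hard nearest-centroid choice with a softmax over negative squared distances, so gradients can reach both the features and the centroids. The function names and the `temperature` parameter are hypothetical.

```python
import math


def soft_assign(x, centroids, temperature=1.0):
    """Softmax weights over centroids for one feature vector x.

    Assumption: logits are negative squared Euclidean distances divided by
    a temperature; the paper's actual relaxation may differ.
    """
    logits = [
        -sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / temperature
        for c in centroids
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_quantize(x, centroids, temperature=1.0):
    """Weighted average of centroids: a smooth stand-in for a hard token."""
    weights = soft_assign(x, centroids, temperature)
    return [
        sum(w * c[d] for w, c in zip(weights, centroids))
        for d in range(len(x))
    ]


if __name__ == "__main__":
    centroids = [[0.0, 0.0], [4.0, 0.0]]
    # A point closer to the first centroid gets most of the weight.
    print(soft_assign([1.0, 0.0], centroids))
    # Lowering the temperature pushes the weights toward a hard assignment.
    print(soft_assign([1.0, 0.0], centroids, temperature=0.01))
    # A point equidistant from both centroids maps to their midpoint.
    print(soft_quantize([2.0, 0.0], centroids))
```

As the temperature approaches zero the weights approach the one-hot assignment of ordinary k-means, which is why a relaxation of this kind can be trained jointly with a downstream loss and then discretized at inference.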

Authors:Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng
Title: Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders
Abstract:
The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this work, we conduct a comprehensive analysis of how vision models encode the ImageNet hierarchy, leveraging Sparse Autoencoders (SAEs) to probe their internal representations. SAEs have been widely used as an explanation tool for large language models (LLMs), where they enable the discovery of semantically meaningful features. Here, we extend their use to vision models to investigate whether learned representations align with the ontological structure defined by the ImageNet taxonomy. Our results show that SAEs uncover hierarchical relationships in model activations, revealing an implicit encoding of taxonomic structure. We analyze the consistency of these representations across different layers of the popular vision foundation model DINOv2 and provide insights into how deep vision models internalize hierarchical category information by increasing information in the class token through each layer. Our study establishes a framework for systematic hierarchical analysis of vision model representations and highlights the potential of SAEs as a tool for probing semantic structure in deep networks.
中文: 本研究利用稀疏自编码器分析视觉模型如何内部表征ImageNet层次结构,发现DINOv2通过层级激活和逐层增强的类别标记信息,隐式编码了分类学关系。
English: This study uses Sparse Autoencoders to analyze how vision models internally represent the ImageNet hierarchy, revealing that DINOv2 implicitly encodes taxonomic relationships through layered activations and increasing class token information.

Authors:Dikshit Chauhan, Bapi Dutta, Indu Bala, Niki van Stein, Thomas Bäck, Anupam Yadav
Title: Evolutionary Computation and Large Language Models: A Survey of Methods, Synergies, and Applications
Abstract:
Integrating Large Language Models (LLMs) and Evolutionary Computation (EC) represents a promising avenue for advancing artificial intelligence by combining powerful natural language understanding with optimization and search capabilities. This manuscript explores the synergistic potential of LLMs and EC, reviewing their intersections, complementary strengths, and emerging applications. We identify key opportunities where EC can enhance LLM training, fine-tuning, prompt engineering, and architecture search, while LLMs can, in turn, aid in automating the design, analysis, and interpretation of ECs. The manuscript explores the synergistic integration of EC and LLMs, highlighting their bidirectional contributions to advancing artificial intelligence. It first examines how EC techniques enhance LLMs by optimizing key components such as prompt engineering, hyperparameter tuning, and architecture search, demonstrating how evolutionary methods automate and refine these processes. Secondly, the survey investigates how LLMs improve EC by automating metaheuristic design, tuning evolutionary algorithms, and generating adaptive heuristics, thereby increasing efficiency and scalability. Emerging co-evolutionary frameworks are discussed, showcasing applications across diverse fields while acknowledging challenges like computational costs, interpretability, and algorithmic convergence. The survey concludes by identifying open research questions and advocating for hybrid approaches that combine the strengths of EC and LLMs.
中文: 本文探讨大型语言模型与进化计算的协同融合,重点阐释进化计算如何优化大模型的关键环节,同时大模型自动化进化算法设计,并讨论了新兴应用与现存挑战。
English: This manuscript explores the synergistic integration of Large Language Models (LLMs) and Evolutionary Computation (EC), highlighting how EC enhances LLM optimization processes while LLMs automate EC design, with emerging applications and challenges discussed.

Authors:Xiangyu Wang, Donglin Yang, Yue Liao, Wenhao Zheng, Wenjun Wu, Bin Dai, Hongsheng Li, Si Liu
Title: UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning
Abstract:
Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive platforms, enabling more intuitive forms of human-drone interaction. While prior works have primarily focused on high-level planning and long-horizon navigation, we shift attention to language-guided fine-grained trajectory control, where UAVs execute short-range, reactive flight behaviors in response to language instructions. We formalize this problem as the Flying-on-a-Word (Flow) task and introduce UAV imitation learning as an effective approach. In this framework, UAVs learn fine-grained control policies by mimicking expert pilot trajectories paired with atomic language instructions. To support this paradigm, we present UAV-Flow, the first real-world benchmark for language-conditioned, fine-grained UAV control. It includes a task formulation, a large-scale dataset collected in diverse environments, a deployable control framework, and a simulation suite for systematic evaluation. Our design enables UAVs to closely imitate the precise, expert-level flight trajectories of human pilots and supports direct deployment without sim-to-real gap. We conduct extensive experiments on UAV-Flow, benchmarking VLN and VLA paradigms. Results show that VLA models are superior to VLN baselines and highlight the critical role of spatial grounding in the fine-grained Flow setting.
中文:无人机正通过"飞行指令"任务实现语言引导的精细轨迹控制,利用模仿学习模拟专家飞行路径,并支持直接部署到现实环境,无需仿真到现实的转换。
English: Unmanned Aerial Vehicles (UAVs) are advancing toward language-guided fine-grained trajectory control through the Flying-on-a-Word task, utilizing imitation learning to mimic expert pilot paths and enabling direct real-world deployment without a sim-to-real gap.

Authors:Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, Linxi Fan
Title: FLARE: Robot Learning with Implicit World Modeling
Abstract:
We introduce Future LAtent REpresentation Alignment (FLARE), a novel framework that integrates predictive latent world modeling into robot policy learning. By aligning features from a diffusion transformer with latent embeddings of future observations, FLARE enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, FLARE requires only minimal architectural modifications -- adding a few tokens to standard vision-language-action (VLA) models -- yet delivers substantial performance gains. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, FLARE achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, FLARE unlocks the ability to co-train with human egocentric video demonstrations without action labels, significantly boosting policy generalization to a novel object with unseen geometry with as few as a single robot demonstration. Our results establish FLARE as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
中文摘要:FLARE是一种轻量级框架,通过将潜在表征与未来观测对齐来增强机器人策略学习,仅需极少的架构修改即可实现顶尖性能并显著提升泛化能力。
English Summary: FLARE is a lightweight framework that enhances robot policy learning by aligning latent representations with future observations, achieving state-of-the-art performance and improved generalization through minimal architectural adjustments.

Authors:Jixun Yao, Hexin Liu, Eng Siong Chng, Lei Xie
Title: EASY: Emotion-aware Speaker Anonymization via Factorized Distillation
Abstract:
Emotion plays a significant role in speech interaction, conveyed through tone, pitch, and rhythm, enabling the expression of feelings and intentions beyond words to create a more personalized experience. However, most existing speaker anonymization systems employ parallel disentanglement methods, which only separate speech into linguistic content and speaker identity, often neglecting the preservation of the original emotional state. In this study, we introduce EASY, an emotion-aware speaker anonymization framework. EASY employs a novel sequential disentanglement process to disentangle speaker identity, linguistic content, and emotional representation, modeling each speech attribute in distinct subspaces through a factorized distillation approach. By independently constraining speaker identity and emotional representation, EASY minimizes information leakage, enhancing privacy protection while preserving original linguistic content and emotional state. Experimental results on the VoicePrivacy Challenge official datasets demonstrate that our proposed approach outperforms all baseline systems, effectively protecting speaker privacy while maintaining linguistic content and emotional state.
Chinese: 本研究提出EASY,一种情感感知的说话人匿名化框架,通过顺序解耦说话人身份、语言内容和情感表征,在保护隐私的同时保持语言内容和情感状态,实验证明其性能优于所有基线系统。
English: This study introduces EASY, an emotion-aware speaker anonymization framework that sequentially disentangles speaker identity, linguistic content, and emotional representation to enhance privacy protection while preserving both linguistic content and emotional state, outperforming all baseline systems in experiments.

Authors:Akash Patel, Mario A. V. Saucedo, Nikolaos Stathoulopoulos, Viswa Narayanan Sankaranarayanan, Ilias Tevetzidis, Christoforos Kanellakis, George Nikolakopoulos
Title: A Hierarchical Graph-Based Terrain-Aware Autonomous Navigation Approach for Complementary Multimodal Ground-Aerial Exploration
Abstract:
Autonomous navigation in unknown environments is a fundamental challenge in robotics, particularly in coordinating ground and aerial robots to maximize exploration efficiency. This paper presents a novel approach that utilizes a hierarchical graph to represent the environment, encoding both geometric and semantic traversability. The framework enables the robots to compute a shared confidence metric, which helps the ground robot assess terrain and determine when deploying the aerial robot will extend exploration. The robot's confidence in traversing a path is based on factors such as predicted volumetric gain, path traversability, and collision risk. A hierarchy of graphs is used to maintain an efficient representation of traversability and frontier information through multi-resolution maps. Evaluated in a real subterranean exploration scenario, the approach allows the ground robot to autonomously identify zones that are no longer traversable but suitable for aerial deployment. By leveraging this hierarchical structure, the ground robot can selectively share graph information on confidence-assessed frontier targets from parts of the scene, enabling the aerial robot to navigate beyond obstacles and continue exploration.
中文摘要:本文提出了一种基于分层图的框架,通过计算共享置信度指标,实现地面与空中机器人在未知环境中的协同自主导航探索。
English Summary: This paper introduces a hierarchical graph-based framework that enables coordinated ground-aerial robot exploration by computing shared confidence metrics for autonomous navigation in unknown environments.

Authors:Varun Raaghav, Dimitrios Bikos, Antonio Rago, Francesca Toni, Maria Charalambides
Title: Explainable Prediction of the Mechanical Properties of Composites with CNNs
Abstract:
Composites are amongst the most important materials manufactured today, as evidenced by their use in countless applications. In order to establish the suitability of composites in specific applications, finite element (FE) modelling, a numerical method based on partial differential equations, is the industry standard for assessing their mechanical properties. However, FE modelling is exceptionally costly from a computational viewpoint, a limitation which has led to efforts towards applying AI models to this task. However, in these approaches: the chosen model architectures were rudimentary, feed-forward neural networks giving limited accuracy; the studies focused on predicting elastic mechanical properties, without considering material strength limits; and the models lacked transparency, hindering trustworthiness by users. In this paper, we show that convolutional neural networks (CNNs) equipped with methods from explainable AI (XAI) can be successfully deployed to solve this problem. Our approach uses customised CNNs trained on a dataset we generate using transverse tension tests in FE modelling to predict composites' mechanical properties, i.e., Young's modulus and yield strength. We show empirically that our approach achieves high accuracy, outperforming a baseline, ResNet-34, in estimating the mechanical properties. We then use SHAP and Integrated Gradients, two post-hoc XAI methods, to explain the predictions, showing that the CNNs use the critical geometrical features that influence the composites' behaviour, thus allowing engineers to verify that the models are trustworthy by representing the science of composites.
中文: 本研究证明,结合可解释人工智能技术的卷积神经网络能够准确预测复合材料力学性能,其表现优于传统模型,同时通过对关键几何特征的透明化解读为工程验证提供可靠依据。
English: This study demonstrates that convolutional neural networks enhanced with explainable AI techniques can accurately predict composite materials' mechanical properties, outperforming traditional models while providing transparent insights into critical geometric features for engineering verification.

Authors:Wenjun Hou, Yi Cheng, Kaishuai Xu, Heng Li, Yan Hu, Wenjie Li, Jiang Liu
Title: RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in various domains, including radiology report generation. Previous approaches have attempted to utilize multimodal LLMs for this task, enhancing their performance through the integration of domain-specific knowledge retrieval. However, these approaches often overlook the knowledge already embedded within the LLMs, leading to redundant information integration. To address this limitation, we propose Radar, a framework for enhancing radiology report generation with supplementary knowledge injection. Radar improves report generation by systematically leveraging both the internal knowledge of an LLM and externally retrieved information. Specifically, it first extracts the model's acquired knowledge that aligns with expert image-based classification outputs. It then retrieves relevant supplementary knowledge to further enrich this information. Finally, by aggregating both sources, Radar generates more accurate and informative radiology reports. Extensive experiments on MIMIC-CXR, CheXpert-Plus, and IU X-ray demonstrate that our model outperforms state-of-the-art LLMs in both language quality and clinical accuracy.
Chinese: Radar框架通过系统整合大语言模型内部知识与外部检索补充信息,在多个数据集上实现了放射学报告生成的语言质量和临床准确性的双重提升,显著优于现有先进方法。
English: The proposed Radar framework enhances radiology report generation by synergistically integrating internal knowledge from large language models with externally retrieved supplementary information, achieving superior performance in both linguistic quality and clinical accuracy across multiple datasets.

Authors:Yanan Li, Fanxu Meng, Muhan Zhang, Shiai Zhu, Shangguang Wang, Mengwei Xu
Title: LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades
Abstract:
As Large Language Models (LLMs) are frequently updated, LoRA weights trained on earlier versions quickly become obsolete. The conventional practice of retraining LoRA weights from scratch on the latest model is costly, time-consuming, and environmentally detrimental, particularly as the diversity of LLMs and downstream tasks expands. This motivates a critical question: "How can we efficiently leverage existing LoRA weights to adapt to newer model versions?" To address this, we propose LoRASuite, a modular approach tailored specifically to various types of LLM updates. First, we compute a transfer matrix utilizing known parameters from both old and new LLMs. Next, we allocate corresponding layers and attention heads based on centered kernel alignment and cosine similarity metrics, respectively. A subsequent small-scale, skillful fine-tuning step ensures numerical stability. Experimental evaluations demonstrate that LoRASuite consistently surpasses small-scale vanilla LoRA methods. Notably, on backbone LLMs such as MiniCPM and Qwen, LoRASuite even exceeds the performance of full-scale LoRA retraining, with average improvements of +1.4 and +6.6 points on math tasks, respectively. Additionally, LoRASuite significantly reduces memory consumption by 5.5 GB and computational time by 78.23%.
中文:LoRASuite通过模块化方法将现有LoRA权重高效适配至新版大语言模型,在显著降低内存和计算成本的同时,性能甚至优于完全重新训练。
English: LoRASuite efficiently adapts existing LoRA weights to new LLM versions through a modular approach, outperforming full retraining while reducing memory and computational costs significantly.
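The LoRASuite abstract mentions centered kernel alignment (CKA) as the metric for matching layers between the old and new model. The sketch below computes the standard linear form of CKA between two activation matrices whose rows are the same inputs. It illustrates the metric only: how LoRASuite collects activations or turns scores into a layer mapping is not stated in the abstract, and the function names are hypothetical.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-aligned matrices a and b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) after centering.

    x and y are lists of rows (one row per input example); they may have
    different numbers of columns but must have the same number of rows.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(
        cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc)
    )
    return numerator / denominator


if __name__ == "__main__":
    acts = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [4.0, 4.0]]
    rotated = [[-r[1], r[0]] for r in acts]  # 90-degree rotation of features
    # Both scores should be close to 1.0: linear CKA is invariant to
    # orthogonal transforms and isotropic scaling of the features.
    print(linear_cka(acts, acts))
    print(linear_cka(acts, rotated))
```

Because the score ignores rotations and uniform rescaling of the feature space, it can compare layers of different models even when their hidden dimensions or bases differ, which is presumably what makes it useful for layer allocation across model versions.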

Authors:Ashish Gurung, Jionghao Lin, Zhongtian Huang, Conrad Borchers, Ryan S. Baker, Vincent Aleven, Kenneth R. Koedinger
Title: Starting Seatwork Earlier as a Valid Measure of Student Engagement
Abstract:
Prior work has developed a range of automated measures ("detectors") of student self-regulation and engagement from student log data. These measures have been successfully used to make discoveries about student learning. Here, we extend this line of research to an underexplored aspect of self-regulation: students' decisions about when to start and stop working on learning software during classwork. In the first of two analyses, we build on prior work on session-level measures (e.g., delayed start, early stop) to evaluate their reliability and predictive validity. We compute these measures from year-long log data from Cognitive Tutor for students in grades 8-12 (N = 222). Our findings show that these measures exhibit moderate to high month-to-month reliability (G > .75), comparable to or exceeding gaming-the-system behavior. Additionally, they enhance the prediction of final math scores beyond prior knowledge and gaming-the-system behaviors. The improvement in learning outcome predictions beyond time-on-task suggests they capture a broader motivational state tied to overall learning. The second analysis demonstrates the cross-system generalizability of these measures in i-Ready, where they predict state test scores for grade 7 students (N = 818). By leveraging log data, we introduce system-general naturally embedded measures that complement motivational surveys without extra instrumentation or disruption of instruction time. Our findings demonstrate the potential of session-level logs to mine valid and generalizable measures with broad applications in the predictive modeling of learning outcomes and analysis of learner self-regulation.
中文: 本研究通过分析学习软件中的会话级决策,扩展了学生自我调节的自动检测方法,证明了这些可靠且具预测性的指标能提升学习成果预测,并在不同教育系统中具有普适性。
English: This study extends automated detection of student self-regulation by analyzing session-level decisions in learning software, demonstrating reliable and predictive measures that enhance outcome predictions and generalize across educational systems.

Authors:Ajian Liu, Haocheng Yuan, Xiao Guo, Hui Ma, Wanyi Zhuang, Changtao Miao, Yan Hong, Chuanbiao Song, Jun Lan, Qi Chu, Tao Gong, Yanyan Liang, Weiqiang Wang, Jun Wan, Xiaoming Liu, Zhen Lei
Title: Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning
Abstract:
PAD and FFD are proposed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively. However, isolated training of these two models significantly increases vulnerability towards unknown attacks, burdening deployment environments. The lack of a Unified Face Attack Detection model to simultaneously handle attacks in these two categories is mainly attributed to two factors: (1) A benchmark that is sufficient for models to explore is lacking. Existing UAD datasets only contain limited attack types and samples, leading to the model's confined ability to address abundant advanced threats. In light of these, through an explainable hierarchical way, we propose the most extensive and sophisticated collection of forgery techniques available to date, namely UniAttackDataPlus. Our UniAttackData+ encompasses 2,875 identities and their 54 kinds of corresponding falsified samples, in a total of 697,347 videos. (2) The absence of a trustworthy classification criterion. Current methods endeavor to explore an arbitrary criterion within the same semantic space, which fails to exist when encountering diverse attacks. Thus, we present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework that adaptively explores multiple classification criteria from different semantic spaces. Specifically, we construct a VP-Tree to explore various classification rules hierarchically. Then, by adaptively pruning the prompts, the model can select the most suitable prompts guiding the encoder to extract discriminative features at different levels in a coarse-to-fine manner. Finally, to help the model understand the classification criteria in visual space, we propose a DPI module to project the visual prompts to the text encoder to help obtain a more accurate semantics.
中文摘要:作者提出了迄今为止最全面的人脸伪造数据集UniAttackData+,并开发了一种基于视觉语言模型的分层提示调优框架,通过跨多语义空间的分层分类方法,解决了物理呈现攻击和数字深度伪造缺乏统一检测模型的问题。
English Summary: The authors introduce UniAttackData+, the most extensive face forgery dataset to date, and a novel Visual-Language Model-based framework to address the lack of unified detection for both physical presentation attacks and digital deepfakes by enabling hierarchical classification across multiple semantic spaces.

Authors:Yong Ren, Chenxing Li, Le Xu, Hao Gu, Duzhen Zhang, Yujie Chen, Manjie Xu, Ruibo Fu, Shan Yang, Dong Yu
Title: Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
Abstract:
Humans can intuitively infer sounds from silent videos, but whether multimodal large language models can perform modal-mismatch reasoning without accessing target modalities remains relatively unexplored. Current text-assisted-video-to-audio (VT2A) methods excel in video foley tasks but struggle to acquire audio descriptions during inference. We introduce the task of Reasoning Audio Descriptions from Silent Videos (SVAD) to address this challenge and investigate vision-language models' (VLMs) capabilities on this task. To further enhance the VLMs' reasoning capacity for the SVAD task, we construct a CoT-AudioCaps dataset and propose a Chain-of-Thought-based supervised fine-tuning strategy. Experiments on SVAD and subsequent VT2A tasks demonstrate our method's effectiveness in two key aspects: significantly improving VLMs' modal-mismatch reasoning for SVAD and effectively addressing the challenge of acquiring audio descriptions during VT2A inference.
中文: 本研究提出SVAD任务以探索视觉语言模型从无声视频推理音频描述的能力,并设计思维链微调策略,显著增强了模型在模态不匹配情况下的推理性能,有效解决了视频到音频任务中音频描述获取的难题。
English: This study introduces the SVAD task to explore vision-language models' ability to infer audio descriptions from silent videos and proposes a Chain-of-Thought fine-tuning strategy that significantly enhances modal-mismatch reasoning for both SVAD and video-to-audio tasks.

Authors:Hemanth Saratchandran, Simon Lucey
Title: Enhancing Transformers Through Conditioned Embedded Tokens
Abstract:
Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics. At the core of their success lies the attention mechanism, which enables the modeling of global dependencies among input tokens. However, we reveal that the attention block in transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and leads to inefficient training. To address this, we develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. Building on this insight, we introduce conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill-conditioning, leading to more stable and efficient training. We validate our methodology across various transformer architectures, achieving consistent improvements in image classification, object detection, instance segmentation, and natural language processing, highlighting its broad applicability and effectiveness.
中文摘要:本研究揭示了Transformer注意力块存在的固有不良条件性问题影响训练效率,并提出通过条件化嵌入令牌的方法优化注意力机制条件,在多种应用中实现了更稳定的训练和性能提升。
English Summary: The study identifies inherent ill-conditioning in transformer attention blocks that impedes training efficiency and proposes a method using conditioned embedded tokens to enhance attention mechanism conditioning, achieving improved stability and performance across diverse applications.

Authors:Zhengyi Luo, Chen Tessler, Toru Lin, Ye Yuan, Tairan He, Wenli Xiao, Yunrong Guo, Gal Chechik, Kris Kitani, Linxi Fan, Yuke Zhu
Title: Emergent Active Perception and Dexterity of Simulated Humanoids from Visual Reinforcement Learning
Abstract:
Human behavior is fundamentally shaped by visual perception -- our ability to interact with the world depends on actively gathering relevant information and adapting our movements accordingly. Behaviors like searching for objects, reaching, and hand-eye coordination naturally emerge from the structure of our sensory system. Inspired by these principles, we introduce Perceptive Dexterous Control (PDC), a framework for vision-driven dexterous whole-body control with simulated humanoids. PDC operates solely on egocentric vision for task specification, enabling object search, target placement, and skill selection through visual cues, without relying on privileged state information (e.g., 3D object positions and geometries). This perception-as-interface paradigm enables learning a single policy to perform multiple household tasks, including reaching, grasping, placing, and articulated object manipulation. We also show that training from scratch with reinforcement learning can produce emergent behaviors such as active search. These results demonstrate how vision-driven control and complex tasks induce human-like behaviors and can serve as the key ingredients in closing the perception-action loop for animation, robotics, and embodied AI.
Chinese: 感知灵巧控制(PDC)框架仅通过以自我为中心的视觉,使模拟人形机器人能够执行复杂的家务任务,并通过强化学习无需依赖特权状态信息即可习得主动搜索等涌现行为。
English: The Perceptive Dexterous Control (PDC) framework enables simulated humanoids to perform complex household tasks using only egocentric vision, learning emergent behaviors like active search through reinforcement learning without relying on privileged state information.

Authors:Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, Jiaya Jia
Title: VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning
Abstract:
Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks within a shared model. Specifically, by designing novel multi-object cognitive learning strategies and systematic task reformulation, VisionReasoner enhances its reasoning capabilities to analyze visual inputs, and addresses diverse perception tasks in a unified framework. The model generates a structured reasoning process before delivering the desired outputs responding to user queries. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting).
Chinese: VisionReasoner通过统一的奖励机制和认知学习策略增强推理能力,在检测、分割和计数等十个视觉感知任务中均优于基准模型,展现出卓越的综合性能。
English: VisionReasoner is a unified framework that enhances reasoning capabilities through a shared reward mechanism and cognitive learning strategies, achieving superior performance across ten diverse visual perception tasks compared to baseline models.

Authors:Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, Jiaya Jia
Title: VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning
Abstract:
Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks within a shared model. Specifically, by designing a unified reward mechanism and multi-object cognitive learning strategies, VisionReasoner enhances its reasoning capabilities to analyze visual inputs, and addresses diverse perception tasks within a unified model. VisionReasoner generates a structured reasoning process before delivering the desired outputs responding to user queries. Human evaluation reveals that the reasoning process of VisionReasoner is faithful and reliable even without annotated reasoning training data. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming the baseline Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting).
Chinese: VisionReasoner通过统一的奖励机制和认知学习策略增强推理能力,在检测、分割和计数等十个视觉感知任务中均优于基准模型,展现出卓越的综合性能。
English: VisionReasoner is a unified framework that enhances reasoning capabilities through a shared reward mechanism and cognitive learning strategies, achieving superior performance across ten diverse visual perception tasks compared to baseline models.

Authors:Yian Zhao, Wanshi Xu, Ruochong Zheng, Pengchong Qiao, Chang Liu, Jie Chen
Title: iSegMan: Interactive Segment-and-Manipulate 3D Gaussians
Abstract:
The efficient rendering and explicit nature of 3DGS promote the advancement of 3D scene manipulation. However, existing methods typically encounter challenges in controlling the manipulation region and are unable to furnish the user with interactive feedback, which inevitably leads to unexpected results. Intuitively, incorporating interactive 3D segmentation tools can compensate for this deficiency. Nevertheless, existing segmentation frameworks impose a pre-processing step of scene-specific parameter training, which limits the efficiency and flexibility of scene manipulation. To deliver a 3D region control module that is well-suited for scene manipulation with reliable efficiency, we propose interactive Segment-and-Manipulate 3D Gaussians (iSegMan), an interactive segmentation and manipulation framework that only requires simple 2D user interactions in any view. To propagate user interactions to other views, we propose Epipolar-guided Interaction Propagation (EIP), which innovatively exploits epipolar constraint for efficient and robust interaction matching. To avoid scene-specific training to maintain efficiency, we further propose the novel Visibility-based Gaussian Voting (VGV), which obtains 2D segmentations from SAM and models the region extraction as a voting game between 2D Pixels and 3D Gaussians based on Gaussian visibility. Taking advantage of the efficient and precise region control of EIP and VGV, we put forth a Manipulation Toolbox to implement various functions on selected regions, enhancing the controllability, flexibility and practicality of scene manipulation. Extensive results on 3D scene manipulation and segmentation tasks fully demonstrate the significant advantages of iSegMan. Project page is available at https://zhao-yian.github.io/iSegMan.
中文摘要:提出的iSegMan框架通过简单的2D用户交互实现三维场景操控,采用创新的极线约束交互传播和基于可见性的高斯投票技术,无需场景特定训练即可完成高效分割。
English Summary: The proposed iSegMan framework enables interactive 3D scene manipulation through simple 2D user inputs, utilizing novel Epipolar-guided Interaction Propagation and Visibility-based Gaussian Voting techniques to achieve efficient segmentation without scene-specific training.
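The voting idea behind Visibility-based Gaussian Voting can be illustrated with a small sketch. This is a simplified majority-vote reading of the abstract, not the authors' formulation: the data layout (`visibility`, `masks`), the function names, and the `min_ratio` threshold are all assumptions made for illustration. Each view in which a Gaussian is visible casts one vote, in favour if the Gaussian projects inside that view's 2D mask.

```python
def gaussian_votes(visibility, masks):
    """Tally votes per Gaussian (hypothetical data layout).

    visibility[v] maps a Gaussian id to the pixel (row, col) where it is
    visible in view v; masks[v] is the set of pixels inside the 2D mask
    for that view. Returns {gaussian_id: (votes_for, times_visible)}.
    """
    tally = {}
    for view_vis, mask in zip(visibility, masks):
        for gid, pixel in view_vis.items():
            votes, seen = tally.get(gid, (0, 0))
            # A visible Gaussian votes "in" when its pixel lies in the mask.
            tally[gid] = (votes + (pixel in mask), seen + 1)
    return tally


def select_region(tally, min_ratio=0.5):
    """Keep Gaussians whose in-mask vote share exceeds min_ratio (assumed rule)."""
    return sorted(g for g, (votes, seen) in tally.items() if votes / seen > min_ratio)


# Two views, three Gaussians; Gaussian 2 is only visible in the second view.
visibility = [{0: (1, 1), 1: (5, 5)}, {0: (2, 1), 1: (6, 5), 2: (0, 0)}]
masks = [{(1, 1)}, {(2, 1), (0, 0)}]
selected = select_region(gaussian_votes(visibility, masks))
```

Because the tally only needs per-view visibility and 2D masks, a scheme of this kind requires no scene-specific training, which is the property the abstract emphasizes.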

Authors:Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Title: MINGLE: Mixtures of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging
Abstract:
Continual model merging integrates independently fine-tuned models sequentially without access to original training data, providing a scalable and efficient solution to continual learning. However, current methods still face critical challenges, notably parameter interference among tasks and limited adaptability to evolving test distributions. The former causes catastrophic forgetting of integrated tasks, while the latter hinders effective adaptation to new tasks. To address these, we propose MINGLE, a novel framework for test-time continual model merging, which leverages test-time adaptation using a small set of unlabeled test samples from the current task to dynamically guide the merging process. MINGLE employs a mixture-of-experts architecture composed of parameter-efficient, low-rank experts, enabling efficient adaptation and improving robustness to distribution shifts. To mitigate catastrophic forgetting, we propose Null-Space Constrained Gating, which restricts gating updates to subspaces orthogonal to prior task representations. This suppresses activations on old task inputs and preserves model behavior on past tasks. To further balance stability and adaptability, we design an Adaptive Relaxation Strategy, which dynamically adjusts the constraint strength based on interference signals captured during test-time adaptation. Extensive experiments on standard continual merging benchmarks demonstrate that MINGLE achieves robust generalization, reduces forgetting significantly, and consistently surpasses previous state-of-the-art methods by 7-9% on average across diverse task orders.
中文: 本文提出MINGLE框架,通过专家混合架构和零空间约束门控机制,在测试时持续合并模型中有效缓解参数冲突并适应分布变化,相比现有方法实现了7-9%的性能提升。
English: The paper introduces MINGLE, a novel framework for Test-Time Continual Model Merging that employs a mixture-of-experts architecture and Null-Space Constrained Gating to mitigate parameter interference and adapt to distribution shifts, achieving significant performance improvements over existing methods.
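The core of a null-space constraint, restricting an update to the subspace orthogonal to prior-task representations, can be sketched in a few lines. This is a generic projection example under stated assumptions, not MINGLE's implementation: the helper names are hypothetical, prior-task features are given as plain vectors, and the Adaptive Relaxation Strategy is not modelled.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-9):
    """Gram-Schmidt: orthonormal basis for the span of prior-task features."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(update, basis):
    """Remove the components of an update that lie in the prior-task span."""
    w = list(update)
    for b in basis:
        c = dot(w, b)
        w = [wi - c * bi for wi, bi in zip(w, b)]
    return w


# Assumed toy data: two prior-task feature vectors in a 3-D gating space.
prior_features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(prior_features)
projected = project_to_null_space([0.5, -2.0, 3.0], basis)
```

After projection the update has zero inner product with every prior-task feature, so a linear gate changed by it produces the same response on those old inputs.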

Authors:Seungeun Oh, Jinhyuk Kim, Jihong Park, Seung-Woo Ko, Jinho Choi, Tony Q. S. Quek, Seong-Lyun Kim
Title: Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission
Abstract:
To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between SLM's uncertainty and LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206× higher token throughput by skipping 74.8% of transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.
Chinese: 提出的通信高效且不确定性感知的混合语言模型(CU-HLM)通过让小型语言模型仅在不确定性高时发送截断词汇分布,相比标准HLM实现了高达206倍的令牌吞吐量提升,同时保持97.4%的准确率。
English: The proposed communication-efficient and uncertainty-aware hybrid language model (CU-HLM) reduces transmission overhead by having the small language model send truncated vocabulary distributions only during high uncertainty, achieving up to 206× higher token throughput while maintaining 97.4% accuracy compared to standard HLM.
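The opportunistic, compressed transmission rule described above can be sketched as follows. The abstract does not specify the uncertainty measure or the truncation rule, so this sketch assumes Shannon entropy and top-k truncation; the function names, `threshold`, and `k` are hypothetical and are not the optimal values derived in the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats); an assumed stand-in for SLM uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def truncate_top_k(probs, k):
    """Keep the k most likely token ids and renormalize their probabilities."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def slm_step(probs, threshold, k):
    """Return (transmit, payload) for one draft token.

    Low uncertainty: skip the upload and keep the SLM's most likely token.
    High uncertainty: upload a truncated distribution for LLM validation.
    """
    if entropy(probs) <= threshold:
        return False, max(range(len(probs)), key=lambda i: probs[i])
    return True, truncate_top_k(probs, k)


confident = slm_step([0.97, 0.01, 0.01, 0.01], threshold=0.5, k=2)
uncertain = slm_step([0.4, 0.3, 0.2, 0.1], threshold=0.5, k=2)
```

In this toy run the confident distribution is handled locally, while the uncertain one is sent as a two-entry distribution instead of the full vocabulary.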

Authors:Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, Xin Liu
Title: MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
Abstract:
We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88× compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.
Chinese: MegaScale-MoE 是一个专为高效训练大规模专家混合模型而设计的系统,通过优化通信策略和重叠计算,在训练吞吐量上实现了比现有方法1.88倍的提升。
English: MegaScale-MoE is a production system designed to enhance the training efficiency of large-scale mixture-of-experts models by optimizing communication strategies and overlapping communication with computation, achieving a 1.88× efficiency improvement over Megatron-LM.

Authors:Annie Wong, Thomas Bäck, Aske Plaat, Niki van Stein, Anna V. Kononova
Title: Reasoning Capabilities of Large Language Models on Dynamic Tasks
Abstract:
Large language models excel on static benchmarks, but their ability as self-learning agents in dynamic environments remains unclear. We evaluate three prompting strategies (self-reflection, heuristic mutation, and planning) across dynamic tasks with open-source models. First, we find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, an overly long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce instability and can lead to big performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in areas like planning and spatial coordination, suggesting that large language models still suffer fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task, and while methods like Chain-of-thought improve multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.
中文: 大语言模型在动态环境中持续表现出局限性,策略性提示虽可缩小性能差距,但高级推理方法效果不稳定且未能实现真正的人类级涌现推理。
English: Large language models show persistent limitations in dynamic environments, where strategic prompting can narrow performance gaps but advanced reasoning methods yield unstable results and fail to achieve true emergent reasoning comparable to humans.

Authors:Yitao Zhu, Yuan Yin, Zhenrong Shen, Zihao Zhao, Haiyu Song, Sheng Wang, Dinggang Shen, Qian Wang
Title: UniCAD: Efficient and Extendable Architecture for Multi-Task Computer-Aided Diagnosis System
Abstract:
The growing complexity and scale of visual model pre-training have made developing and deploying multi-task computer-aided diagnosis (CAD) systems increasingly challenging and resource-intensive. Furthermore, the medical imaging community lacks an open-source CAD platform to enable the rapid creation of efficient and extendable diagnostic models. To address these issues, we propose UniCAD, a unified architecture that leverages the robust capabilities of pre-trained vision foundation models to seamlessly handle both 2D and 3D medical images while requiring only minimal task-specific parameters. UniCAD introduces two key innovations: (1) Efficiency: A low-rank adaptation strategy is employed to adapt a pre-trained visual model to the medical image domain, achieving performance on par with fully fine-tuned counterparts while introducing only 0.17% trainable parameters. (2) Plug-and-Play: A modular architecture that combines a frozen foundation model with multiple plug-and-play experts, enabling diverse tasks and seamless functionality expansion. Building on this unified CAD architecture, we establish an open-source platform where researchers can share and access lightweight CAD experts, fostering a more equitable and efficient research ecosystem. Comprehensive experiments across 12 diverse medical datasets demonstrate that UniCAD consistently outperforms existing methods in both accuracy and deployment efficiency. The source code and project page are available at https://mii-laboratory.github.io/UniCAD/.
中文:UniCAD是一种统一架构,利用预训练视觉模型高效处理2D和3D医学图像,仅需极少量任务特定参数,在精度和部署效率上均优于现有方法,并建立了开源平台促进协作研究。
English: UniCAD is a unified architecture that utilizes pre-trained vision models to efficiently handle 2D and 3D medical images with minimal task-specific parameters, outperforming existing methods in accuracy and deployment efficiency while establishing an open-source platform for collaborative research.
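To make the low-rank adaptation idea concrete, the sketch below counts trainable parameters for a single adapted weight matrix and applies the adapted layer to one input. It is a generic LoRA-style illustration, not UniCAD's code: the layer sizes, rank, and function names are assumptions, and the resulting fraction is not meant to reproduce the 0.17% figure reported for the full model.

```python
def lora_param_fraction(d_in, d_out, rank):
    """Fraction of parameters that train when only the low-rank factors train.

    The frozen weight W has d_out * d_in entries; the trainable update
    B @ A uses B (d_out x rank) and A (rank x d_in).
    """
    frozen = d_in * d_out
    trainable = rank * (d_in + d_out)
    return trainable / (frozen + trainable)


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, scale=1.0):
    """Compute y = W x + scale * B (A x) for one input vector x."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


# Assumed sizes: one 768 x 768 projection adapted with rank 4.
fraction = lora_param_fraction(768, 768, 4)

# With B initialised to zero (the usual LoRA convention), the adapted layer
# starts out identical to the frozen layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.0], [0.0]]
y = lora_forward([2.0, 3.0], W, A, B)
```

Because the frozen weights are shared, each additional task only needs to store its own small `A` and `B` factors, which is what makes plug-and-play experts lightweight to share.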

Authors:Zhenzhou Jin, Li You, Xiang-Gen Xia, Xiqi Gao
Title: EnvCDiff: Joint Refinement of Environmental Information and Channel Fingerprints via Conditional Generative Diffusion Model
Abstract:
The paradigm shift from environment-unaware communication to intelligent environment-aware communication is expected to facilitate the acquisition of channel state information for future wireless communications. Channel Fingerprint (CF), as an emerging enabling technology for environment-aware communication, provides channel-related knowledge for potential locations within the target communication area. However, due to the limited availability of practical devices for sensing environmental information and measuring channel-related knowledge, most of the acquired environmental information and CF are coarse-grained, insufficient to guide the design of wireless transmissions. To address this, this paper proposes a deep conditional generative learning approach, namely a customized conditional generative diffusion model (CDiff). The proposed CDiff simultaneously refines environmental information and CF, reconstructing a fine-grained CF that incorporates environmental information, referred to as EnvCF, from its coarse-grained counterpart. Experimental results show that the proposed approach significantly improves the performance of EnvCF construction compared to the baselines.
Chinese: 本文提出了一种名为CDiff的深度条件生成扩散模型,通过同时优化环境信息和信道指纹,从粗粒度数据中重构出融合环境信息的精细EnvCF,实验结果显示其构建性能显著优于基线方法。
English: This paper introduces a deep conditional generative diffusion model called CDiff, which refines coarse environmental data and Channel Fingerprint to reconstruct a fine-grained, environment-integrated EnvCF, significantly outperforming baseline methods in construction performance.

Authors:Zhenzhou Jin, Li You, Xudong Li, Zhen Gao, Yuanwei Liu, Xiang-Gen Xia, Xiqi Gao
Title: Channel Fingerprint Construction for Massive MIMO: A Deep Conditional Generative Approach
Abstract:
Accurate channel state information (CSI) acquisition for massive multiple-input multiple-output (MIMO) systems is essential for future mobile communication networks. Channel fingerprint (CF), also referred to as channel knowledge map, is a key enabler for intelligent environment-aware communication and can facilitate CSI acquisition. However, due to the cost limitations of practical sensing nodes and test vehicles, the resulting CF is typically coarse-grained, making it insufficient for wireless transceiver design. In this work, we introduce the concept of CF twins and design a conditional generative diffusion model (CGDM) with strong implicit prior learning capabilities as the computational core of the CF twin to establish the connection between coarse- and fine-grained CFs. Specifically, we employ a variational inference technique to derive the evidence lower bound (ELBO) for the log-marginal distribution of the observed fine-grained CF conditioned on the coarse-grained CF, enabling the CGDM to learn the complicated distribution of the target data. During the denoising neural network optimization, the coarse-grained CF is introduced as side information to accurately guide the conditioned generation of the CGDM. To make the proposed CGDM lightweight, we further leverage the additivity of network layers and introduce a one-shot pruning approach along with a multi-objective knowledge distillation technique. Experimental results show that the proposed approach exhibits significant improvement in reconstruction performance compared to the baselines. Additionally, zero-shot testing on reconstruction tasks with different magnification factors further demonstrates the scalability and generalization ability of the proposed approach.
中文摘要:本文提出信道指纹孪生概念,采用条件生成扩散模型将粗粒度信道指纹转化为细粒度版本,在大规模MIMO系统中实现了卓越的重建性能和泛化能力。
English Summary: This paper introduces a CF twin concept using a conditional generative diffusion model to convert coarse-grained channel fingerprints into fine-grained ones, achieving superior reconstruction and generalization in massive MIMO systems.
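The role of the coarse-grained CF as side information can be illustrated with the standard diffusion training step. This sketch uses the common DDPM forward-noising formula and the simplified noise-prediction loss as assumptions; the paper derives its own ELBO, and the concatenation-based conditioning, function names, and toy values here are hypothetical.

```python
import math
import random


def forward_noise(x0, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + s * e for x, e in zip(x0, eps)], eps


def denoiser_input(x_t, coarse_cf, t):
    """Assumed conditioning: concatenate noisy fine CF, coarse CF, and timestep."""
    return list(x_t) + list(coarse_cf) + [float(t)]


def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)


rng = random.Random(0)
fine_cf = [1.0, 2.0, 3.0, 4.0]   # toy fine-grained CF, flattened
coarse_cf = [1.5, 3.5]           # toy coarse-grained CF, flattened
x_t, eps = forward_noise(fine_cf, alpha_bar=0.5, rng=rng)
features = denoiser_input(x_t, coarse_cf, t=10)
```

A denoising network trained on `features` to predict `eps` sees the coarse CF at every step, which is one way to guide generation toward a fine-grained map consistent with the coarse measurement.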

Authors:Rong Kang, Yanbin Chen, Ye Liu, Fuxin Jiang, Qingshuo Li, Miao Ma, Jian Liu, Guangliang Zhao, Tieying Zhang, Jianjun Chen, Lei Zhang
Title: ABase: the Multi-Tenant NoSQL Serverless Database for Diverse and Dynamic Workloads in Large-scale Cloud Environments
Abstract:
Multi-tenant architectures enhance the elasticity and resource utilization of NoSQL databases by allowing multiple tenants to co-locate and share resources. However, in large-scale cloud environments, the diverse and dynamic nature of workloads poses significant challenges for multi-tenant NoSQL databases. Based on our practical observations, we have identified three crucial challenges: (1) the impact of caching on performance isolation, as cache hits alter request execution and resource consumption, leading to inaccurate traffic control; (2) the dynamic changes in traffic, with changes in tenant traffic trends causing throttling or resource wastage, and changes in access distribution causing hot key pressure or cache hit ratio drops; and (3) the imbalanced layout of data nodes due to tenants' diverse resource requirements, leading to low resource utilization. To address these challenges, we introduce ABase, a multi-tenant NoSQL serverless database developed at ByteDance. ABase introduces a two-layer caching mechanism with a cache-aware isolation mechanism to ensure accurate resource consumption estimates. Furthermore, ABase employs a predictive autoscaling policy to dynamically adjust resources in response to tenant traffic changes and a multi-resource rescheduling algorithm to balance resource utilization across data nodes. With these innovations, ABase has successfully served ByteDance's large-scale cloud environment, supporting a total workload that has achieved a peak QPS of over 13 billion and total storage exceeding 1 EB.
中文摘要:多租户NoSQL数据库在性能隔离、动态流量管理和资源利用方面面临挑战,ABase通过双层缓存机制、预测性自动扩展和多资源重调度技术有效解决这些问题,支撑了大规模工作负载。
English Summary: Multi-tenant NoSQL databases face challenges in performance isolation, dynamic traffic management, and resource utilization, which ABase addresses through a two-layer caching mechanism, predictive autoscaling, and multi-resource rescheduling to support massive workloads.

Authors:Payal Varshney, Adriano Lucieri, Christoph Balada, Andreas Dengel, Sheraz Ahmed
Title: Discovering Concept Directions from Diffusion-based Counterfactuals via Latent Clustering
Abstract:
Concept-based explanations have emerged as an effective approach within Explainable Artificial Intelligence, enabling interpretable insights by aligning model decisions with human-understandable concepts. However, existing methods rely on computationally intensive procedures and struggle to efficiently capture complex, semantic concepts. Recently, the Concept Discovery through Latent Diffusion-based Counterfactual Trajectories (CDCT) framework, introduced by Varshney et al. (2025), attempts to identify concepts via dimension-wise traversal of the latent space of a Variational Autoencoder trained on counterfactual trajectories. Extending the CDCT framework, this work introduces Concept Directions via Latent Clustering (CDLC), which extracts global, class-specific concept directions by clustering latent difference vectors derived from factual and diffusion-generated counterfactual image pairs. CDLC substantially reduces computational complexity by eliminating the exhaustive latent dimension traversal required in CDCT and enables the extraction of multidimensional semantic concepts encoded across the latent dimensions. This approach is validated on a real-world skin lesion dataset, demonstrating that the extracted concept directions align with clinically recognized dermoscopic features and, in some cases, reveal dataset-specific biases or unknown biomarkers. These results highlight that CDLC is interpretable, scalable, and applicable across high-stakes domains and diverse data modalities.
中文摘要:CDLC框架通过潜在聚类提取类别特定的语义概念,显著降低了计算复杂度,在保持可解释性的同时有效识别医学数据中的临床特征,提升了概念解释方法的实用性和扩展性。
English Summary: The CDLC framework enhances concept-based explainable AI by efficiently extracting class-specific semantic concepts through latent clustering, significantly reducing computational costs while maintaining interpretability and revealing clinically relevant features in medical datasets.
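The step of clustering latent difference vectors into concept directions can be sketched as below. The abstract does not name the clustering algorithm, so plain k-means with a deterministic initialisation is an assumption, as are the function names and the toy 2-D latents.

```python
def difference_vectors(factual, counterfactual):
    """Latent difference z_cf - z_f for each factual/counterfactual pair."""
    return [[c - f for f, c in zip(zf, zc)]
            for zf, zc in zip(factual, counterfactual)]


def kmeans(points, k, iters=20):
    """Plain k-means; centroids start at the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def unit(v):
    """Normalize a centroid into a direction; zero vectors are returned as-is."""
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v] if norm > 0 else list(v)


# Toy latents: counterfactuals shift either along axis 0 or along axis 1.
factual = [[0.0, 0.0]] * 4
counterfactual = [[1.0, 0.0], [0.0, 1.0], [1.2, 0.0], [0.0, 0.8]]
diffs = difference_vectors(factual, counterfactual)
directions = [unit(c) for c in kmeans(diffs, k=2)]
```

Each centroid summarises many image pairs at once, so a direction found this way can span several latent dimensions, in contrast to traversing one dimension at a time.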

Authors:Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, Weiwei Liu, Wenqian Wang, Xianhan Zeng, Xiao Liu, Xiaobo Qin, Xiaohan Ding, Xiaojun Xiao, Xiaoying Zhang, Xuanwei Zhang, Xuehan Xiong, Yanghua Peng, Yangrui Chen, Yanwei Li, Yanxu Hu, Yi Lin, Yiyuan Hu, Yiyuan Zhang, Youbin Wu, Yu Li, Yudong Liu, Yue Ling, Yujia Qin, Zanbo Wang, Zhiwu He, Aoxue Zhang, Bairen Yi, Bencheng Liao, Can Huang, Can Zhang, Chaorui Deng, Chaoyi Deng, Cheng Lin, Cheng Yuan, Chenggang Li, Chenhui Gou, Chenwei Lou, Chengzhi Wei, Chundian Liu, Chunyuan Li, Deyao Zhu, Donghong Zhong, Feng Li, Feng Zhang, Gang Wu, Guodong Li, Guohong Xiao, Haibin Lin, Haihua Yang, Haoming Wang, Heng Ji, Hongxiang Hao, Hui Shen, Huixia Li, Jiahao Li, Jialong Wu, Jianhua Zhu, Jianpeng Jiao, Jiashi Feng, Jiaze Chen, Jianhui Duan, Jihao Liu, Jin Zeng, Jingqun Tang, Jingyu Sun, Joya Chen, Jun Long, Junda Feng, Junfeng Zhan, Junjie Fang, Junting Lu, Kai Hua, Kai Liu, Kai Shen, Kaiyuan Zhang, Ke Shen, Ke Wang, Keyu Pan, Kun Zhang, Kunchang Li, Lanxin Li, Lei Li, Lei Shi, Li Han, Liang Xiang, Liangqiang Chen, Lin Chen, Lin Li, Lin Yan, Liying Chi, Longxiang Liu, Mengfei Du, Mingxuan Wang, Ningxin Pan, Peibin Chen, Pengfei Chen, Pengfei Wu, Qingqing Yuan, Qingyao Shuai, Qiuyan Tao, Renjie Zheng, Renrui Zhang, Ru Zhang, Rui Wang, Rui Yang, Rui Zhao, Shaoqiang Xu, Shihao Liang, Shipeng Yan, Shu Zhong, Shuaishuai Cao, Shuangzhi Wu, Shufan Liu, Shuhan Chang, Songhua Cai, Tenglong Ao, Tianhao Yang, Tingting Zhang, Wanjun Zhong, Wei Jia, Wei Weng, Weihao Yu, Wenhao Huang, Wenjia Zhu, Wenli Yang, Wenzhi Wang, Xiang Long, XiangRui Yin, Xiao Li, Xiaolei Zhu, Xiaoying Jia, Xijin Zhang, Xin Liu, Xinchen Zhang, Xinyu Yang, Xiongcai Luo, Xiuli Chen, Xuantong Zhong, Xuefeng Xiao, Xujing Li, Yan Wu, Yawei Wen, Yifan Du, Yihao Zhang, Yining Ye, Yonghui Wu, Yu Liu, Yu Yue, Yufeng Zhou, Yufeng Yuan, Yuhang Xu, Yuhong Yang, Yun Zhang, Yunhao Fang, Yuntao Li, Yurui Ren, Yuwen Xiong, Zehua Hong, Zehua Wang, Zewei Sun, Zeyu Wang, Zhao Cai, Zhaoyue Zha, Zhecheng An, Zhehui Zhao, Zhengzhuo Xu, Zhipeng Chen, Zhiyong Wu, Zhuofan Zheng, Zihao Wang, Zilong Huang, Ziyu Zhu, Zuquan Song
Title: Seed1.5-VL Technical Report
Abstract:
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428).
中文: Seed1.5-VL 是一款结构紧凑但功能强大的视觉语言模型,在多项基准测试中达到领先水平,并在智能体任务和多模态推理方面表现卓越。
English: Seed1.5-VL is a compact yet powerful vision-language model that achieves state-of-the-art performance across numerous benchmarks and excels in agent-centric tasks and multimodal reasoning.

Authors:Zhenzhou Jin, Li You, Derrick Wing Kwan Ng, Xiang-Gen Xia, Xiqi Gao
Title: Near-Field Channel Estimation for XL-MIMO: A Deep Generative Model Guided by Side Information
Abstract:
This paper investigates near-field (NF) channel estimation (CE) for extremely large-scale multiple-input multiple-output (XL-MIMO) systems. Considering the pronounced NF effects in XL-MIMO communications, we first establish a joint angle-distance (AD) domain-based spherical-wavefront physical channel model that captures the inherent sparsity of XL-MIMO channels. Leveraging the channel's sparsity in the joint AD domain, the CE is approached as a task of reconstructing sparse signals. Anchored in this framework, we first propose a compressed sensing algorithm to acquire a preliminary channel estimate. Harnessing the powerful implicit prior learning capability of generative artificial intelligence (GenAI), we further propose a GenAI-based approach to refine the estimated channel. Specifically, we introduce the preliminary estimated channel as side information, and derive the evidence lower bound (ELBO) of the log-marginal distribution of the target NF channel conditioned on the preliminary estimated channel, which serves as the optimization objective for the proposed generative diffusion model (GDM). Additionally, we introduce a more generalized version of the GDM, the non-Markovian GDM (NM-GDM), to accelerate the sampling process, achieving an approximately tenfold enhancement in sampling efficiency. Experimental results indicate that the proposed approach is capable of offering substantial performance gain in CE compared to existing benchmark schemes within NF XL-MIMO systems. Furthermore, our approach exhibits enhanced generalization capabilities in both the NF and far-field (FF) regions.
中文: 本文针对超大规模MIMO系统的近场信道估计,提出了一种基于生成式扩散模型的方法,通过结合压缩感知和生成式人工智能技术,在近场和远场区域均实现了优越的性能和泛化能力。
English: This paper proposes a generative diffusion model-based approach for near-field channel estimation in XL-MIMO systems, leveraging compressed sensing and generative AI to achieve superior performance and generalization across both near-field and far-field regions.

Authors:Tobias Preintner, Weixuan Yuan, Qi Huang, Adrian König, Thomas Bäck, Elena Raponi, Niki van Stein
Title: Why Are You Wrong? Counterfactual Explanations for Language Grounding with 3D Objects
Abstract:
Combining natural language and geometric shapes is an emerging research area with multiple applications in robotics and language-assisted design. A crucial task in this domain is object referent identification, which involves selecting a 3D object given a textual description of the target. Variability in language descriptions and spatial relationships of 3D objects makes this a complex task, increasing the need to better understand the behavior of neural network models in this domain. However, limited research has been conducted in this area. Specifically, when a model makes an incorrect prediction despite being provided with a seemingly correct object description, practitioners are left wondering: "Why is the model wrong?". In this work, we present a method answering this question by generating counterfactual examples. Our method takes a misclassified sample, which includes two objects and a text description, and generates an alternative yet similar formulation that would have resulted in a correct prediction by the model. We have evaluated our approach with data from the ShapeTalk dataset along with three distinct models. Our counterfactual examples maintain the structure of the original description and are semantically similar and meaningful. They reveal weaknesses in the description and model bias, and enhance the understanding of the model's behavior. These insights help practitioners to better interact with systems and help engineers to improve models.
中文: 本研究提出一种生成反事实示例的方法,通过构建能导致正确预测的替代文本描述来解释三维物体指代识别中的模型错误,从而揭示模型偏差并提升可解释性。
English: This study introduces a method to generate counterfactual examples that explain model errors in 3D object referent identification by creating alternative text descriptions that would lead to correct predictions, revealing model biases and improving interpretability.

Authors:Jinkun Lin, Ziheng Jiang, Zuquan Song, Sida Zhao, Menghan Yu, Zhanghan Wang, Chenyuan Wang, Zuocheng Shi, Xiang Shi, Wei Jia, Zherui Liu, Shuguang Wang, Haibin Lin, Xin Liu, Aurojit Panda, Jinyang Li
Title: Understanding Stragglers in Large Model Training Using What-if Analysis
Abstract:
Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where the training can be stalled by a few slow workers. At ByteDance we find that stragglers are not always trivially caused by hardware failures, but can arise from multiple complex factors. This work aims to present a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates the scenario without any stragglers and contrasts it with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes for stragglers?
中文: 本研究基于字节跳动大语言模型训练集群的五个月追踪数据,通过假设分析探究了训练减速现象的发生频率、影响程度、时空规律及硬件故障之外的复杂成因。
English: This study analyzes straggler issues in large language model training through a five-month trace from ByteDance's cluster, using what-if analysis to examine their frequency, impact, patterns, and root causes beyond simple hardware failures.
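
To make the what-if idea concrete, here is a minimal sketch under strong simplifying assumptions that are not taken from the paper: training is fully synchronous, so each step lasts as long as its slowest worker, and the hypothetical straggler-free scenario replaces every worker's duration with the per-step median. The function name `what_if_slowdown` and the trace layout are illustrative only.

```python
from statistics import median


def what_if_slowdown(step_times):
    """Estimate the slowdown attributable to stragglers.

    step_times[s][w] is the duration of worker w in step s (hypothetical
    trace layout). Assumes synchronous steps: a step ends when its slowest
    worker ends.
    """
    # Actual job time: every step waits for its slowest worker.
    actual = sum(max(step) for step in step_times)
    # What-if job time: pretend every worker ran at the per-step median.
    ideal = sum(median(step) for step in step_times)
    return actual / ideal


if __name__ == "__main__":
    trace = [
        [1.0, 1.0, 1.0, 2.0],  # one slow worker stalls this step
        [1.0, 1.0, 1.0, 1.0],  # no straggler
    ]
    print(what_if_slowdown(trace))
```

A ratio near 1 suggests stragglers cost little; larger values indicate how much faster the job might have run without them under this simplified model.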

Authors:José Gonçalves, Miguel Silva, Eva Maia, Isabel Praça
Title: Enhancing Large Language Models with Faster Code Preprocessing for Vulnerability Detection
Abstract:
The application of Artificial Intelligence has become a powerful approach to detecting software vulnerabilities. However, effective vulnerability detection relies on accurately capturing the semantic structure of code and its contextual relationships. Given that the same functionality can be implemented in various forms, a preprocessing tool that standardizes code representation is important. This tool must be efficient, adaptable across programming languages, and capable of supporting new transformations. To address this challenge, we build on the existing SCoPE framework and introduce SCoPE2, an enhanced version with improved performance. We compare both versions in terms of processing time and memory usage and evaluate their impact on a Large Language Model (LLM) for vulnerability detection. Our results show a 97.3% reduction in processing time with SCoPE2, along with an improved F1-score for the LLM, solely due to the refined preprocessing approach.
中文: SCoPE2在SCoPE框架基础上大幅优化,通过改进代码预处理方法,使处理时间减少97.3%,并提升了基于大语言模型的漏洞检测F1分数。
English: SCoPE2 significantly enhances the SCoPE framework, reducing processing time by 97.3% and improving the F1-score for LLM-based vulnerability detection through refined code preprocessing.

Authors:Jinke Tang, Li You, Xinrui Gong, Chenjie Xie, Xiqi Gao, Xiang-Gen Xia, Xueyuan Shi
Title: Statistical CSI Acquisition for Multi-frequency Massive MIMO Systems
Abstract:
Multi-frequency massive multi-input multi-output (MIMO) communication is a promising strategy for both 5G and future 6G systems, ensuring reliable transmission while enhancing frequency resource utilization. Statistical channel state information (CSI) has been widely adopted in multi-frequency massive MIMO transmissions to reduce overhead and improve transmission performance. In this paper, we propose efficient and accurate methods for obtaining statistical CSI in multi-frequency massive MIMO systems. First, we introduce a multi-frequency massive MIMO channel model and analyze the mapping relationship between two types of statistical CSI, namely the angular power spectrum (APS) and the spatial covariance matrix, along with their correlation across different frequency bands. Next, we propose an autoregressive (AR) method to predict the spatial covariance matrix of any frequency band based on that of another frequency band. Furthermore, we emphasize that channels across different frequency bands share similar APS characteristics. Leveraging the maximum entropy (ME) criterion, we develop a low-complexity algorithm for high-resolution APS estimation. Simulation results validate the effectiveness of the AR-based covariance prediction method and demonstrate the high-resolution estimation capability of the ME-based approach. Furthermore, we demonstrate the effectiveness of multi-frequency cooperative transmission by applying the proposed methods to obtain statistical CSI from low-frequency bands and utilizing it for high-frequency channel transmission. This approach significantly enhances high-frequency transmission performance while effectively reducing system overhead.
中文: 本文针对多频大规模MIMO系统提出了自回归空间协方差预测方法和基于最大熵的高分辨率角功率谱估计算法,在降低系统开销的同时显著提升了高频传输性能。
English: This paper introduces efficient methods for multi-frequency massive MIMO systems, including an autoregressive model for spatial covariance prediction and a maximum entropy-based algorithm for high-resolution angular power spectrum estimation, which enhance transmission performance while reducing overhead.

Authors:Jinke Tang, Xiqi Gao, Li You, Ding Shi, Jiyuan Yang, Xiang-Gen Xia, Xinwei Zhao, Peigang Jiang
Title: Massive MIMO-OFDM Channel Acquisition with Time-Frequency Phase-Shifted Pilots
Abstract:
In this paper, we propose a channel acquisition approach with time-frequency phase-shifted pilots (TFPSPs) for massive multi-input multi-output orthogonal frequency division multiplexing (MIMO-OFDM) systems. We first present a triple-beam (TB) based channel tensor model, allowing for the representation of the space-frequency-time (SFT) domain channel as the product of beam matrices and the TB domain channel tensor. By leveraging the specific characteristics of TB domain channels, we develop TFPSPs, where distinct pilot signals are simultaneously transmitted in the frequency and time domains. Then, we present the optimal TFPSP design and provide the corresponding pilot scheduling algorithm. Further, we propose a tensor-based information geometry approach (IGA) to estimate the TB domain channel tensors. Leveraging the specific structure of beam matrices and the properties of TFPSPs, we propose a low-complexity implementation of the tensor-based IGA. We validate the efficiency of our proposed channel acquisition approach through extensive simulations. Simulation results demonstrate the superior performance of our approach. The proposed approach can effectively suppress inter-UT interference with low complexity and limited pilot overhead, thereby enhancing channel estimation performance. Particularly in scenarios with a large number of UTs, the channel acquisition method outperforms existing approaches by reducing the normalized mean square error (NMSE) by more than 8 dB.
中文: 本文提出了一种采用时频相移导频的大规模MIMO-OFDM系统信道获取方法,通过基于张量的信息几何方法,在降低复杂度和导频开销的同时,能有效抑制用户间干扰,使多用户场景下的信道估计归一化均方误差降低超过8分贝。
English: This paper introduces a channel acquisition method using time-frequency phase-shifted pilots for massive MIMO-OFDM systems, which employs a tensor-based information geometry approach to effectively reduce interference and pilot overhead while improving estimation accuracy by over 8 dB NMSE in high-user scenarios.

Authors:Jie Cao, Chloe Qianhui Zhao, Xian Chen, Shuman Wang, Christian Schunn, Kenneth R. Koedinger, Jionghao Lin
Title: From First Draft to Final Insight: A Multi-Agent Approach for Feedback Generation
Abstract:
Producing large volumes of high-quality, timely feedback poses significant challenges to instructors. To address this issue, automation technologies, particularly Large Language Models (LLMs), show great potential. However, current LLM-based research still shows room for improvement in terms of feedback quality. Our study proposed a multi-agent approach performing "generation, evaluation, and regeneration" (G-E-RG) to further enhance feedback quality. In the first-generation phase, six methods were adopted, combining three feedback theoretical frameworks and two prompt methods: zero-shot and retrieval-augmented generation with chain-of-thought (RAG_CoT). The results indicated that, compared to first-round feedback, G-E-RG significantly improved final feedback across six methods for most dimensions. Specifically: (1) evaluation accuracy for the six methods increased by 3.36% to 12.98% (p<0.001); (2) the proportion of feedback containing four effective components rose from an average of 27.72% to an average of 98.49% among the six methods, and the sub-dimensions of providing critiques, highlighting strengths, encouraging agency, and cultivating dialogue also showed great enhancement (p<0.001); (3) there was a significant improvement in most of the feature values (p<0.001), although some sub-dimensions (e.g., strengthening the teacher-student relationship) still require further enhancement; (4) the simplicity of feedback was effectively enhanced (p<0.001) for three methods.
中文摘要:本研究提出的多智能体G-E-RG方法通过整合理论框架与提示策略,在六个反馈维度上显著提升了评估准确性、有效反馈成分占比及简洁性,有效解决了自动化反馈的质量优化问题。
English Summary: This study introduces a multi-agent G-E-RG approach that significantly enhances feedback quality by integrating theoretical frameworks and prompt methods, showing marked improvements in evaluation accuracy, feedback components, and simplicity across multiple dimensions.

Authors:Chloe Qianhui Zhao, Jie Cao, Eason Chen, Kenneth R. Koedinger, Jionghao Lin
Title: SlideItRight: Using AI to Find Relevant Slides and Provide Feedback for Open-Ended Questions
Abstract:
Feedback is important in supporting student learning. While various automated feedback systems have been implemented to make feedback scalable, many existing solutions focus only on generating text-based feedback. As indicated by the multimedia learning principle, learning with more modalities could help utilize more separate channels, reduce cognitive load, and facilitate students' learning. Hence, it is important to explore the potential of Artificial Intelligence (AI) in feedback generation from and to different modalities. Our study leverages Large Language Models (LLMs) for textual feedback with supplementary guidance from another modality: a relevant lecture slide retrieved from the slides hub. Through an online crowdsourcing study (N=91), this study investigates learning gains and student perceptions using a 2x2 design (i.e., human feedback vs. AI feedback and with vs. without relevant slide), evaluating the clarity, engagement, perceived effectiveness, and reliability of AI-facilitated multimodal feedback. We observed significant pre-to-post learning gains across all conditions. However, the differences in these gains were not statistically significant between conditions. The post-survey revealed that students found the slide feedback helpful in their learning process, though they reported difficulty in understanding it. Regarding the AI-generated open-ended feedback, students considered it personalized and relevant to their responses, but they expressed lower trust in the AI feedback compared to human-generated feedback.
中文摘要:本研究利用大型语言模型和补充幻灯片探索人工智能生成的多模态反馈,发现尽管学生在所有条件下均取得学习进步,但相较于人工反馈,他们对AI反馈的信任度较低,尽管认为其具有相关性和个性化特点。
English Summary: This study explores AI-generated multimodal feedback using LLMs and lecture slides, finding that while students achieved learning gains across all conditions, they trusted AI feedback less than human feedback despite its perceived relevance and personalization.

Authors:Wei Wei, Zheng Lin, Tao Li, Xuanheng Li, Xianhao Chen
Title: Pipelining Split Learning in Multi-hop Edge Networks
Abstract:
To support large-scale model training, split learning (SL) enables multiple edge devices/servers to share the intensive training workload. However, most existing works on SL focus solely on two-tier model splitting. Moreover, while some recent works have investigated the model splitting and placement problems for multi-hop SL, these solutions fail to overcome the resource idleness issue, resulting in significant network idle time. In this work, we propose a pipelined SL scheme by addressing the joint optimization problem of model splitting and placement (MSP) in multi-hop edge networks. By applying pipeline parallelism to SL, we identify that the MSP problem can be mapped to a problem of minimizing the weighted sum of a bottleneck cost function (min-max) and a linear cost function (min-sum). Based on graph theory, we devise a bottleneck-aware shortest-path algorithm to obtain the optimal solution. Besides, given the MSP outcomes, we also derive the closed-form solution to the micro-batch size in the pipeline. Finally, we develop an alternating optimization algorithm of MSP and micro-batch size to solve the joint optimization problem to minimize the end-to-end training latency. Extensive simulations have demonstrated the significant advantages of our algorithm compared to existing benchmarks without pipeline parallelism.
Chinese: 本文提出了一种流水线分割学习方案,通过在多跳边缘网络中联合优化模型分割与放置,利用瓶颈感知算法和微批次大小优化解决资源闲置问题,从而显著降低端到端训练延迟。
English: This paper introduces a pipelined split learning scheme that optimizes model splitting and placement in multi-hop edge networks to minimize training latency by addressing resource idleness through a bottleneck-aware algorithm and micro-batch size optimization.
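
One way to see why a pipelined latency objective mixes a bottleneck (min-max) term with a linear (min-sum) term is the textbook idealized pipeline model sketched below. This is an assumption for illustration, not the paper's cost model: it considers a single pass through the stages, ignores backward propagation, and treats per-micro-batch stage times (compute plus communication) as given constants. The function name `pipeline_latency` is hypothetical.

```python
def pipeline_latency(stage_times, num_micro_batches):
    """Idealized latency of pushing micro-batches through a linear pipeline.

    stage_times[i] is the time stage i needs per micro-batch (hypothetical
    inputs). The first micro-batch traverses every stage (linear term);
    each further micro-batch is released at the pace of the slowest stage
    (bottleneck term).
    """
    if not stage_times or num_micro_batches < 1:
        raise ValueError("need at least one stage and one micro-batch")
    fill_time = sum(stage_times)   # min-sum part
    bottleneck = max(stage_times)  # min-max part
    return fill_time + (num_micro_batches - 1) * bottleneck


if __name__ == "__main__":
    # Same total work, different balance across three stages.
    print(pipeline_latency([1.0, 3.0, 2.0], 4))  # unbalanced split
    print(pipeline_latency([2.0, 2.0, 2.0], 4))  # balanced split
```

Under this model, two splits with the same total work differ only through the bottleneck stage, which is why a balanced split finishes sooner.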

Authors:Jiacheng Wang, Le Liang, Hao Ye, Chongtao Guo, Shi Jin
Title: Small-Scale-Fading-Aware Resource Allocation in Wireless Federated Learning
Abstract:
Judicious resource allocation can effectively enhance federated learning (FL) training performance in wireless networks by addressing both system and statistical heterogeneity. However, existing strategies typically rely on block fading assumptions, which overlook rapid channel fluctuations within each round of FL gradient uploading, leading to a degradation in FL training performance. Therefore, this paper proposes a small-scale-fading-aware resource allocation strategy using a multi-agent reinforcement learning (MARL) framework. Specifically, we establish a one-step convergence bound of the FL algorithm and formulate the resource allocation problem as a decentralized partially observable Markov decision process (Dec-POMDP), which is subsequently solved using the QMIX algorithm. In our framework, each client serves as an agent that dynamically determines spectrum and power allocations within each coherence time slot, based on local observations and a reward derived from the convergence analysis. The MARL setting reduces the dimensionality of the action space and facilitates decentralized decision-making, enhancing the scalability and practicality of the solution. Experimental results demonstrate that our QMIX-based resource allocation strategy significantly outperforms baseline methods across various degrees of statistical heterogeneity. Additionally, ablation studies validate the critical importance of incorporating small-scale fading dynamics, highlighting its role in optimizing FL performance.
中文摘要:本文提出了一种基于QMIX的多智能体强化学习框架,通过考虑小尺度衰落动态优化联邦学习中的资源分配,在不同统计异质性条件下显著优于基准方法,有效提升了训练性能。
English Summary: This paper introduces a multi-agent reinforcement learning framework using QMIX to optimize resource allocation in federated learning by accounting for small-scale fading, significantly improving training performance over baseline methods under varying statistical heterogeneity.

Authors:Huibin Zhou, Xinrui Gong, Christos G. Tsinos, Li You, Xiqi Gao, Björn Ottersten
Title: GNN-enabled Precoding for Massive MIMO LEO Satellite Communications
Abstract:
Low Earth Orbit (LEO) satellite communication is a critical component in the development of sixth generation (6G) networks. The integration of massive multiple-input multiple-output (MIMO) technology is being actively explored to enhance the performance of LEO satellite communications. However, the limited power of LEO satellites poses a significant challenge in improving communication energy efficiency (EE) under constrained power conditions. Artificial intelligence (AI) methods are increasingly recognized as promising solutions for optimizing energy consumption while enhancing system performance, thus enabling more efficient and sustainable communications. This paper proposes approaches to address the challenges associated with precoding in massive MIMO LEO satellite communications. First, we introduce an end-to-end graph neural network (GNN) framework that effectively reduces the computational complexity of traditional precoding methods. Next, we introduce a deep unfolding of the Dinkelbach algorithm and the weighted minimum mean square error (WMMSE) approach to achieve enhanced EE, transforming iterative optimization processes into a structured neural network, thereby improving convergence speed and computational efficiency. Furthermore, we incorporate the Taylor expansion method to approximate matrix inversion within the GNN, enhancing both the interpretability and performance of the proposed method. Numerical experiments demonstrate the validity of our proposed method in terms of complexity and robustness, achieving significant improvements over state-of-the-art methods.
Chinese: 本文提出了基于人工智能的方法,如图神经网络框架和深度展开技术,以提升6G网络中大规模MIMO低地球轨道卫星通信的能效并降低计算复杂度。
English: This paper introduces AI-driven methods, including a graph neural network framework and deep unfolding techniques, to enhance energy efficiency and reduce computational complexity in massive MIMO LEO satellite communications for 6G networks.

Authors:Luis Miguel Vieira da Silva, Aljosha Köcher, Nicolas König, Felix Gehlhoff, Alexander Fay
Title: Capability-Driven Skill Generation with LLMs: A RAG-Based Approach for Reusing Existing Libraries and Interfaces
Abstract:
Modern automation systems increasingly rely on modular architectures, with capabilities and skills as one solution approach. Capabilities define the functions of resources in a machine-readable form and skills provide the concrete implementations that realize those capabilities. However, the development of a skill implementation conforming to a corresponding capability remains a time-consuming and challenging task. In this paper, we present a method that treats capabilities as contracts for skill implementations and leverages large language models to generate executable code based on natural language user input. A key feature of our approach is the integration of existing software libraries and interface technologies, enabling the generation of skill implementations across different target languages. We introduce a framework that allows users to incorporate their own libraries and resource interfaces into the code generation process through a retrieval-augmented generation architecture. The proposed method is evaluated using an autonomous mobile robot controlled via Python and ROS 2, demonstrating the feasibility and flexibility of the approach.
中文: 本文提出一种方法,将能力作为契约并利用大语言模型根据自然语言输入生成可执行代码,通过整合现有软件库和接口技术,实现跨编程语言的灵活技能实现。
English: This paper introduces a method that uses capabilities as contracts and leverages large language models to generate executable code from natural language input, integrating existing software libraries and interfaces for flexible skill implementation across different programming languages.

Authors:Max Qiushi Lin, Jincheng Mei, Matin Aghaei, Michael Lu, Bo Dai, Alekh Agarwal, Dale Schuurmans, Csaba Szepesvari, Sharan Vaswani
Title: Rethinking the Global Convergence of Softmax Policy Gradient with Linear Function Approximation
Abstract:
Policy gradient (PG) methods have played an essential role in the empirical successes of reinforcement learning. In order to handle large state-action spaces, PG methods are typically used with function approximation. In this setting, the approximation error in modeling problem-dependent quantities is a key notion for characterizing the global convergence of PG methods. We focus on Softmax PG with linear function approximation (referred to as $\texttt{Lin-SPG}$) and demonstrate that the approximation error is irrelevant to the algorithm's global convergence even for the stochastic bandit setting. Consequently, we first identify the necessary and sufficient conditions on the feature representation that can guarantee the asymptotic global convergence of $\texttt{Lin-SPG}$. Under these feature conditions, we prove that $T$ iterations of $\texttt{Lin-SPG}$ with a problem-specific learning rate result in an $O(1/T)$ convergence to the optimal policy. Furthermore, we prove that $\texttt{Lin-SPG}$ with any arbitrary constant learning rate can ensure asymptotic global convergence to the optimal policy.
中文: 采用线性函数近似的策略梯度方法可在满足特定特征条件下实现全局收敛至最优策略,其收敛性与近似误差无关,且无论使用特定学习率还是任意常数学习率均能保证渐近收敛性。
English: Policy gradient methods with linear function approximation can achieve global convergence to the optimal policy regardless of approximation error, requiring only specific feature conditions and enabling both asymptotic convergence and an O(1/T) convergence rate with appropriate learning rates.
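
As a rough illustration of the setting, the sketch below runs softmax policy gradient with linear logits on a small bandit. It is a toy under stated assumptions, not the paper's algorithm or analysis: it uses the exact expected-reward gradient rather than sampled rewards, a constant learning rate, and hand-picked one-dimensional features and rewards (the rewards are deliberately not linear in the features). All names (`lin_spg`, `FEATURES`, `REWARDS`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, features):
    """Softmax policy whose logits are linear in the arm features."""
    logits = [sum(t * x for t, x in zip(theta, feat)) for feat in features]
    return softmax(logits)


def lin_spg(features, rewards, lr=1.0, iters=2000):
    """Gradient ascent on J(theta) = sum_a pi_theta(a) * r(a)."""
    dim = len(features[0])
    theta = [0.0] * dim
    for _ in range(iters):
        pi = policy(theta, features)
        value = sum(p * r for p, r in zip(pi, rewards))
        # Exact gradient: sum_a pi(a) * (r(a) - J) * x_a
        grad = [
            sum(p * (r - value) * feat[j]
                for p, r, feat in zip(pi, rewards, features))
            for j in range(dim)
        ]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta, policy(theta, features)


# Illustrative 3-armed bandit with 1-D features; arm 0 is optimal.
FEATURES = [[1.0], [0.0], [-1.0]]
REWARDS = [1.0, 0.2, 0.6]

if __name__ == "__main__":
    theta, pi = lin_spg(FEATURES, REWARDS)
    print(theta, pi)
```

In this toy instance the policy concentrates on the best arm even though no linear function of the feature reproduces the rewards exactly; whether that happens in general depends on the feature conditions the abstract refers to.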

Authors:Mario A. V. Saucedo, Vignesh Kottayam Viswanathan, Christoforos Kanellakis, George Nikolakopoulos
Title: Estimating Commonsense Scene Composition on Belief Scene Graphs
Abstract:
This work establishes the concept of commonsense scene composition, with a focus on extending Belief Scene Graphs by estimating the spatial distribution of unseen objects. Specifically, the commonsense scene composition capability refers to the understanding of the spatial relationships among related objects in the scene, which in this article is modeled as a joint probability distribution for all possible locations of the semantic object class. The proposed framework includes two variants of a Correlation Information (CECI) model for learning probability distributions: (i) a baseline approach based on a Graph Convolutional Network, and (ii) a neuro-symbolic extension that integrates a spatial ontology based on Large Language Models (LLMs). Furthermore, this article provides a detailed description of the dataset generation process for such tasks. Finally, the framework has been validated through multiple runs on simulated data, as well as in a real-world indoor environment, demonstrating its ability to spatially interpret scenes across different room types.
中文: 本研究提出常识场景组合概念,通过扩展信念场景图来估计未见物体的空间分布,采用两种关联信息模型,并在模拟和真实室内环境中验证了该框架的有效性。
English: This study introduces commonsense scene composition by extending Belief Scene Graphs to estimate the spatial distribution of unseen objects, utilizing two Correlation Information models and validating the framework through simulated and real-world indoor environments.

Authors:Yiping Ji, Hemanth Saratchandran, Peyman Moghadam, Simon Lucey
Title: Always Skip Attention
Abstract:
We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying -- a simple yet effective complement (to skip connections) that further improves the condition of input tokens. We validate our approach in both supervised and self-supervised training methods.
中文: 现代视觉变换器中的自注意力机制若无跳跃连接则无法有效训练,因其本质上是病态的,而提出的令牌灰度化方法能进一步改善输入令牌的条件。
English: Modern Vision Transformers critically depend on skip connections to train self-attention effectively, as it is fundamentally ill-conditioned without them, and the proposed Token Graying method further enhances input token conditioning.

Authors:Jun Takamatsu, Atsushi Kanehira, Kazuhiro Sasabuchi, Naoki Wake, Katsushi Ikeuchi
Title: IK Seed Generator for Dual-Arm Human-like Physicality Robot with Mobile Base
Abstract:
Robots are widely expected to take over human tasks. If a robot has a human-like physicality, the possibility of replacing human tasks increases. In the case of household service robots, it is desirable for them to be of a human-like size so that they do not become excessively large and can coexist with humans in their operating environment. However, robots with size limitations tend to have difficulty solving inverse kinematics (IK) due to mechanical limitations, such as joint angle limitations. Conversely, if the difficulty arising from these limitations could be mitigated, the use of such robots would become more valuable. In numerical IK solvers, which are commonly used for robots with higher degrees-of-freedom (DOF), the solvability of IK depends on the initial guess given to the solver. Thus, this paper proposes a method for generating a good initial guess for a numerical IK solver given the target hand configuration. For this purpose, we define the goodness of an initial guess using the scaled Jacobian matrix, from which a manipulability index that accounts for the joint limits can be calculated. These two factors, manipulability and joint limits, are related to the difficulty of solving IK. We generate the initial guess by optimizing the goodness using a genetic algorithm (GA). To enumerate as many IK solutions as possible, we use the reachability map that represents the reachable area of the robot hand in the arm-base coordinate system. We conduct a quantitative evaluation and show that using an initial guess judged to be better by the goodness value increases the probability that IK is solved. Finally, as an application of the proposed method, we show that by generating good initial guesses for IK, a robot can actually carry out three typical scenarios.
中文: 本文提出了一种利用遗传算法生成数值逆运动学求解器最优初始猜测的方法,通过评估可操作性和关节限制来提高求解概率,使机器人能在机械限制下有效执行任务。
English: This paper proposes a method using a genetic algorithm to generate optimal initial guesses for numerical inverse kinematics solvers by evaluating manipulability and joint limits, thereby enhancing solving probability and enabling robots to perform tasks despite mechanical constraints.
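
As a rough illustration of the "goodness" idea in the abstract above, the following sketch scores a candidate seed for a planar two-link arm. It scales each Jacobian column by a joint-limit weight and takes the absolute determinant as a manipulability measure. The arm model, the quadratic `limit_weight` function, and all names here are assumptions made for illustration; the paper's actual scaled Jacobian and its GA-based optimization are not reproduced.

```python
import math

def planar_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian of a planar two-link arm (hypothetical test model)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]

def limit_weight(q, lo, hi):
    """Assumed weight: 1 at the middle of the joint range, 0 at or beyond a limit."""
    if not lo < q < hi:
        return 0.0
    return 4.0 * (q - lo) * (hi - q) / (hi - lo) ** 2

def seed_goodness(q, limits):
    """Manipulability |det J| of the Jacobian after scaling column i by joint i's weight."""
    J = planar_jacobian(*q)
    w = [limit_weight(qi, lo, hi) for qi, (lo, hi) in zip(q, limits)]
    Js = [[J[r][c] * w[c] for c in range(2)] for r in range(2)]
    return abs(Js[0][0] * Js[1][1] - Js[0][1] * Js[1][0])

limits = [(-math.pi, math.pi), (0.0, math.pi)]
centered = seed_goodness((0.0, math.pi / 2), limits)   # mid-range elbow
near_limit = seed_goodness((0.0, 0.05), limits)        # elbow almost straight
at_limit = seed_goodness((0.0, math.pi), limits)       # elbow on its upper limit
print(centered, near_limit, at_limit)
```

In this toy setup a seed close to a joint limit or to a singular pose scores near zero, so an optimizer that maximizes the score would steer seeds away from both.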

Authors:Abu Saleh Musa Miah, Taro Suzuki, Jungpil Shin
Title: A Methodological and Structural Review of Parkinsons Disease Detection Across Diverse Data Modalities
Abstract:
Parkinson's Disease (PD) is a progressive neurological disorder that primarily affects motor functions and can lead to mild cognitive impairment (MCI) and dementia in its advanced stages. With approximately 10 million people diagnosed globally (1 to 1.8 per 1,000 individuals, according to reports by the Japan Times and the Parkinson Foundation), early and accurate diagnosis of PD is crucial for improving patient outcomes. While numerous studies have utilized machine learning (ML) and deep learning (DL) techniques for PD recognition, existing surveys are limited in scope, often focusing on single data modalities and failing to capture the potential of multimodal approaches. To address these gaps, this study presents a comprehensive review of PD recognition systems across diverse data modalities, including Magnetic Resonance Imaging (MRI), gait-based pose analysis, gait sensory data, handwriting analysis, speech test data, Electroencephalography (EEG), and multimodal fusion techniques. Based on over 347 articles from leading scientific databases, this review examines key aspects such as data collection methods, settings, feature representations, and system performance, with a focus on recognition accuracy and robustness. This survey aims to serve as a comprehensive resource for researchers, providing actionable guidance for the development of next-generation PD recognition systems. By leveraging diverse data modalities and cutting-edge machine learning paradigms, this work contributes to advancing the state of PD diagnostics and improving patient care through innovative, multimodal approaches.
中文: 本综述研究通过分析七种数据模态的347篇文献,弥补了现有帕金森病识别研究局限于单一模态的不足,为开发基于多模态融合的新一代诊断系统提供实践指导,旨在通过前沿机器学习方法提升诊疗水平。
English: This comprehensive review addresses limitations in existing Parkinson's Disease recognition surveys by examining multimodal approaches across seven data types, analyzing 347 studies to provide actionable guidance for developing next-generation diagnostic systems that improve patient care through advanced machine learning techniques.

Authors:Harsh Chaudhari, Jamie Hayes, Matthew Jagielski, Ilia Shumailov, Milad Nasr, Alina Oprea
Title: Cascading Adversarial Bias from Injection to Distillation in Language Models
Abstract:
Model distillation has become essential for creating smaller, deployable language models that retain larger system capabilities. However, widespread deployment raises concerns about resilience to adversarial manipulation. This paper investigates vulnerability of distilled models to adversarial injection of biased content during training. We demonstrate that adversaries can inject subtle biases into teacher models through minimal data poisoning, which propagates to student models and becomes significantly amplified. We propose two propagation modes: Untargeted Propagation, where bias affects multiple tasks, and Targeted Propagation, focusing on specific tasks while maintaining normal behavior elsewhere. With only 25 poisoned samples (0.25% poisoning rate), student models generate biased responses 76.9% of the time in targeted scenarios - higher than 69.4% in teacher models. For untargeted propagation, adversarial bias appears 6x-29x more frequently in student models on unseen tasks. We validate findings across six bias types (targeted advertisements, phishing links, narrative manipulations, insecure coding practices), various distillation methods, and different modalities spanning text and code generation. Our evaluation reveals shortcomings in current defenses - perplexity filtering, bias detection systems, and LLM-based autorater frameworks - against these attacks. Results expose significant security vulnerabilities in distilled models, highlighting need for specialized safeguards. We propose practical design principles for building effective adversarial bias mitigation strategies.
中文摘要:本研究发现,蒸馏语言模型在训练过程中极易受到对抗性偏见注入的攻击,教师模型的少量数据污染会导致学生模型中偏见显著放大,暴露出当前防御机制存在严重安全漏洞。
English Summary: This study reveals that distilled language models are highly vulnerable to adversarial bias injection during training, where minimal data poisoning in teacher models leads to significantly amplified biases in student models, exposing critical security gaps in current defense mechanisms.

Authors:Jiaxu Zhang, Xianfang Zeng, Xin Chen, Wei Zuo, Gang Yu, Guosheng Lin, Zhigang Tu
Title: DreamDance: Animating Character Art via Inpainting Stable Gaussian Worlds
Abstract:
This paper presents DreamDance, a novel character art animation framework capable of producing stable, consistent character and scene motion conditioned on precise camera trajectories. To achieve this, we re-formulate the animation task as two inpainting-based steps: Camera-aware Scene Inpainting and Pose-aware Video Inpainting. The first step leverages a pre-trained image inpainting model to generate multi-view scene images from the reference art and optimizes a stable large-scale Gaussian field, which enables coarse background video rendering with camera trajectories. However, the rendered video is rough and only conveys scene motion. To resolve this, the second step trains a pose-aware video inpainting model that injects the dynamic character into the scene video while enhancing background quality. Specifically, this model is a DiT-based video generation model with a gating strategy that adaptively integrates the character's appearance and pose information into the base background video. Through extensive experiments, we demonstrate the effectiveness and generalizability of DreamDance, producing high-quality and consistent character animations with remarkable camera dynamics.
中文: DreamDance是一种新颖的动画框架,它通过相机感知的场景修复生成背景视频,再结合姿态感知的视频修复融入动态角色,从而创造出稳定的角色与场景运动。
English: DreamDance is a novel animation framework that creates stable character and scene motion by first generating background videos through camera-aware scene inpainting and then integrating dynamic characters via pose-aware video inpainting.

Authors:Zeyuan Liu, Zhihe Yang, Jiawei Xu, Rui Yang, Jiafei Lyu, Baoxiang Wang, Yunjian Xu, Xiu Li
Title: ADG: Ambient Diffusion-Guided Dataset Recovery for Corruption-Robust Offline Reinforcement Learning
Abstract:
Real-world datasets collected from sensors or human inputs are prone to noise and errors, posing significant challenges for applying offline reinforcement learning (RL). While existing methods have made progress in addressing corrupted actions and rewards, they remain insufficient for handling corruption in high-dimensional state spaces and for cases where multiple elements in the dataset are corrupted simultaneously. Diffusion models, known for their strong denoising capabilities, offer a promising direction for this problem, but their tendency to overfit noisy samples limits their direct applicability. To overcome this, we propose Ambient Diffusion-Guided Dataset Recovery (ADG), a novel approach that pioneers the use of diffusion models to tackle data corruption in offline RL. First, we introduce Ambient Denoising Diffusion Probabilistic Models (DDPM) from approximated distributions, which enable learning on partially corrupted datasets with theoretical guarantees. Second, we use the noise-prediction property of Ambient DDPM to distinguish between clean and corrupted data, and then use the clean subset to train a standard DDPM. Third, we employ the trained standard DDPM to refine the previously identified corrupted data, enhancing data quality for subsequent offline RL training. A notable strength of ADG is its versatility: it can be seamlessly integrated with any offline RL algorithm. Experiments on a range of benchmarks, including MuJoCo, Kitchen, and Adroit, demonstrate that ADG effectively mitigates the impact of corrupted data and improves the robustness of offline RL under various noise settings, achieving state-of-the-art results.
中文: 提出的ADG方法利用扩散模型区分并修复离线强化学习中的噪声数据,能在不改变现有算法的情况下有效提升多类基准任务的数据鲁棒性。
English: The proposed ADG method leverages diffusion models to effectively recover corrupted datasets in offline reinforcement learning by distinguishing and refining noisy data, enhancing robustness across various benchmarks without requiring modifications to existing RL algorithms.

Authors:Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni
Title: ATLAS: Learning to Optimally Memorize the Context at Test Time
Abstract:
Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, limit their applicability to longer sequences and have motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite their recent success in diverse downstream tasks, they struggle in tasks that require long-context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects in their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To enhance all these three aspects, we present ATLAS, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long-context performance of Titans, achieving +80% accuracy at 10M context length on the BABILong benchmark.
中文: Transformer在序列建模中表现出色,但因二次复杂度在长序列中受限,为此开发了ATLAS记忆模块,该模块通过优化记忆机制提升长上下文处理能力,在多项任务中超越现有模型表现。
English: Transformers excel in sequence modeling but face limitations with long sequences due to quadratic complexity, leading to the development of ATLAS, a memory module that enhances long-term context handling and outperforms existing models in various tasks.

Authors:Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, Sergey Levine
Title: Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
Abstract:
Vision-language-action (VLA) models provide a powerful approach to training control policies for physical systems, such as robots, by combining end-to-end learning with transfer of semantic knowledge from web-scale vision-language model (VLM) training. However, the constraints of real-time control are often at odds with the design of VLMs: the most powerful VLMs have tens or hundreds of billions of parameters, presenting an obstacle to real-time inference, and operate on discrete tokens rather than the continuous-valued outputs that are required for controlling robots. To address this challenge, recent VLA models have used specialized modules for efficient continuous control, such as action experts or continuous output heads, which typically require adding new untrained parameters to the pretrained VLM backbone. While these modules improve real-time and control capabilities, it remains an open question whether they preserve or degrade the semantic knowledge contained in the pretrained VLM, and what effect they have on the VLA training dynamics. In this paper, we study this question in the context of VLAs that include a continuous diffusion or flow matching action expert, showing that naively including such experts significantly harms both training speed and knowledge transfer. We provide an extensive analysis of various design choices, their impact on performance and knowledge transfer, and propose a technique for insulating the VLM backbone during VLA training that mitigates this issue. Videos are available at https://pi.website/research/knowledge_insulation.
中文: 视觉-语言-动作模型在整合连续动作专家时面临实时控制与语义知识保留的平衡难题,本研究提出一种隔离训练技术以缓解由此引发的性能下降问题。
English: Vision-language-action models face a challenge in balancing real-time control needs with preserving semantic knowledge from large vision-language models, and this study proposes a technique to mitigate performance degradation caused by integrating continuous action experts.

Authors:Xu Shen, Yixin Liu, Yiwei Dai, Yili Wang, Rui Miao, Yue Tan, Shirui Pan, Xin Wang
Title: Understanding the Information Propagation Effects of Communication Topologies in LLM-based Multi-Agent Systems
Abstract:
The communication topology in large language model-based multi-agent systems fundamentally governs inter-agent collaboration patterns, critically shaping both the efficiency and effectiveness of collective decision-making. While recent studies on automated communication topology design tend to construct sparse structures for efficiency, they often overlook why and when sparse and dense topologies help or hinder collaboration. In this paper, we present a causal framework to analyze how agent outputs, whether correct or erroneous, propagate under topologies with varying sparsity. Our empirical studies reveal that moderately sparse topologies, which effectively suppress error propagation while preserving beneficial information diffusion, typically achieve optimal task performance. Guided by this insight, we propose a novel topology design approach, EIB-Learner, that balances error suppression and beneficial information propagation by fusing connectivity patterns from both dense and sparse graphs. Extensive experiments show the superior effectiveness, communication cost, and robustness of EIB-Learner.
中文: 本研究提出因果分析框架,揭示适度稀疏的通信拓扑能通过抑制错误传播同时保留有益信息扩散来优化多智能体系统性能,由此开发的EIB-Learner方法在效能与通信成本方面均优于现有方案。
English: This study introduces a causal framework revealing that moderately sparse communication topologies optimize multi-agent system performance by curbing error propagation while maintaining beneficial information flow, leading to the development of the EIB-Learner design method that outperforms alternatives in effectiveness and efficiency.

Authors:Yixin Ren, Chenghou Jin, Yewei Xia, Li Ke, Longtao Huang, Hui Xue, Hao Zhang, Jihong Guan, Shuigeng Zhou
Title: Score-based Generative Modeling for Conditional Independence Testing
Abstract:
Determining conditional independence (CI) relationships between random variables is a fundamental yet challenging task in machine learning and statistics, especially in high-dimensional settings. Existing generative model-based CI testing methods, such as those utilizing generative adversarial networks (GANs), often struggle with undesirable modeling of conditional distributions and training instability, resulting in subpar performance. To address these issues, we propose a novel CI testing method via score-based generative modeling, which achieves precise Type I error control and strong testing power. Concretely, we first employ a sliced conditional score matching scheme to accurately estimate conditional score and use Langevin dynamics conditional sampling to generate null hypothesis samples, ensuring precise Type I error control. Then, we incorporate a goodness-of-fit stage into the method to verify generated samples and enhance interpretability in practice. We theoretically establish the error bound of conditional distributions modeled by score-based generative models and prove the validity of our CI tests. Extensive experiments on both synthetic and real-world datasets show that our method significantly outperforms existing state-of-the-art methods, providing a promising way to revitalize generative model-based CI testing.
中文摘要:本文提出了一种基于分数生成建模的新型条件独立性检验方法,通过条件分数估计和拟合优度验证,克服了现有方法的局限性,实现了精确的第一类错误控制和强大的检验效能。
English Summary: This paper introduces a novel conditional independence testing method using score-based generative modeling, which overcomes limitations of existing approaches by ensuring precise Type I error control and strong testing power through conditional score estimation and goodness-of-fit verification.
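
The Langevin dynamics conditional sampling step mentioned in the abstract above can be sketched on a toy problem. The example below assumes a known closed-form conditional score for a linear Gaussian model X | Z=z ~ N(a*z, sigma^2); in the paper this score is estimated by sliced conditional score matching, which is not shown here. The function names, step size, and chain length are illustrative assumptions.

```python
import math
import random

def conditional_score(x, z, a=2.0, sigma=1.0):
    """Score d/dx log p(x | z) for the toy model X | Z=z ~ N(a*z, sigma^2)."""
    return -(x - a * z) / sigma ** 2

def langevin_sample(z, steps=200, eps=0.1, rng=random):
    """Draw one approximate sample of X given Z=z with unadjusted Langevin dynamics."""
    x = rng.gauss(0.0, 1.0)  # arbitrary starting point
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift along the score plus injected Gaussian noise.
        x = x + 0.5 * eps * conditional_score(x, z) + math.sqrt(eps) * noise
    return x

rng = random.Random(0)
samples = [langevin_sample(z=1.0, rng=rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))  # should be close to a*z = 2.0
```

Samples drawn this way for X given Z, without looking at Y, play the role of null-hypothesis samples in a conditional independence test; how they enter the test statistic is specific to the paper.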

Authors:Jingxuan Wei, Nan Xu, Junnan Zhu, Yanni Hao, Gaowei Wu, Bihui Yu, Lei Wang
Title: ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering
Abstract:
Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.
中文:ChartMind是一个面向复杂图表问答的新基准,支持多语言、开放式回答和多种图表类型,而提出的ChartLLM框架通过提取关键上下文和减少噪声来增强推理能力,在实际评估中优于现有方法。
English: ChartMind is a new benchmark for complex chart question answering that supports multilingual, open-ended responses and diverse chart types, while the proposed ChartLLM framework enhances reasoning by extracting key context and reducing noise, outperforming existing methods in real-world evaluations.

Authors:Alex Iacob, Lorenzo Sani, Mher Safaryan, Paris Giampouras, Samuel Horváth, Andrej Jovanovic, Meghdad Kurmanji, Preslav Aleksandrov, William F. Shen, Xinchi Qiu, Nicholas D. Lane
Title: DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models
Abstract:
Scaling foundation model training with Distributed Data Parallel (DDP) methods is bandwidth-limited. Existing infrequent communication methods like Local SGD were designed to synchronize only model parameters and cannot be trivially applied to adaptive optimizers due to additional optimizer states. Current approaches extending Local SGD either lack convergence guarantees or require synchronizing all optimizer states, tripling communication costs. We propose Desynced Low Communication Adaptive Optimizers (DES-LOC), a family of optimizers assigning independent synchronization periods to parameters and momenta, enabling lower communication costs while preserving convergence. Through extensive experiments on language models of up to 1.7B, we show that DES-LOC can communicate 170x less than DDP and 2x less than the previous state-of-the-art Local ADAM. Furthermore, unlike previous heuristic approaches, DES-LOC is suited for practical training scenarios prone to system failures. DES-LOC offers a scalable, bandwidth-efficient, and fault-tolerant solution for foundation model training.
Chinese: DES-LOC优化器通过异步更新参数和动量状态,显著降低了分布式训练中的通信开销,比DDP减少高达170倍的通信量,同时保持收敛性并具备容错能力。
English: DES-LOC optimizers reduce communication costs in distributed training by desynchronizing parameter and momentum updates, achieving up to 170x less communication than DDP while maintaining convergence and fault tolerance.
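
The idea of giving parameters and momenta independent synchronization periods can be illustrated with a small scheduling sketch. The periods below, the state names, and the cost model (one unit per state-sized synchronization) are hypothetical choices for illustration and are not taken from the paper.

```python
from fractions import Fraction

def states_due(step, periods):
    """Return the optimizer states whose synchronization period divides `step`."""
    return [name for name, period in periods.items() if step % period == 0]

def avg_syncs_per_step(periods):
    """Average number of state-sized synchronizations per training step."""
    return sum(Fraction(1, period) for period in periods.values())

# Hypothetical periods: parameters are synced more often than the two Adam moments.
desynced = {"params": 4, "first_moment": 8, "second_moment": 16}
# Baseline that syncs all three states together every 4 steps.
lockstep = {"params": 4, "first_moment": 4, "second_moment": 4}

for step in (4, 8, 16):
    print(step, states_due(step, desynced))
print(avg_syncs_per_step(desynced), avg_syncs_per_step(lockstep))
```

Under this toy cost model the desynced schedule averages 7/16 synchronizations per step against 3/4 for the lockstep baseline, which shows where the savings come from without reproducing the ratios reported in the abstract.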

Authors:Haomiao Qiu, Miao Zhang, Ziyue Qiao, Weili Guan, Min Zhang, Liqiang Nie
Title: SplitLoRA: Balancing Stability and Plasticity in Continual Learning Through Gradient Space Splitting
Abstract:
Continual Learning (CL) requires a model to learn multiple tasks in sequence while maintaining both stability (preserving knowledge from previously learned tasks) and plasticity (effectively learning new tasks). Gradient projection has emerged as an effective and popular paradigm in CL, where it partitions the gradient space of previously learned tasks into two orthogonal subspaces: a primary subspace and a minor subspace. New tasks are learned effectively within the minor subspace, thereby reducing interference with previously acquired knowledge. However, existing gradient projection methods struggle to achieve an optimal balance between plasticity and stability, as it is hard to appropriately partition the gradient space. In this work, we consider a continual learning paradigm based on Low-Rank Adaptation, which has gained considerable attention due to its efficiency and wide applicability, and propose a novel approach for continual learning, called SplitLoRA. We first provide a theoretical analysis of how subspace partitioning affects model stability and plasticity. Informed by this analysis, we then introduce an effective method that derives the optimal partition of the gradient space for previously learned tasks. This approach effectively balances stability and plasticity in continual learning. Experimental results on multiple datasets demonstrate that the proposed method achieves state-of-the-art performance.
中文: 提出的SplitLoRA方法通过利用低秩自适应优化持续学习中的梯度空间划分,有效平衡稳定性和可塑性,在多个数据集上实现了最先进的性能。
English: The proposed SplitLoRA method optimizes gradient space partitioning in continual learning by leveraging Low-Rank Adaptation, effectively balancing stability and plasticity to achieve state-of-the-art performance across multiple datasets.
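
The gradient projection step that the abstract builds on can be sketched as follows: a new-task gradient is stripped of its components in the primary subspace so that the update stays in the minor subspace. The basis here is given by hand and assumed orthonormal; how to choose the split between primary and minor subspaces is the paper's contribution and is not modeled. All names are illustrative.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_to_minor_subspace(grad, primary_basis):
    """Remove the components of `grad` that lie in the span of `primary_basis`.

    `primary_basis` is assumed to be a list of orthonormal vectors.
    """
    projected = list(grad)
    for u in primary_basis:
        coeff = dot(grad, u)
        projected = [p - coeff * ui for p, ui in zip(projected, u)]
    return projected

# Hypothetical primary subspace spanned by the first coordinate axis.
primary_basis = [[1.0, 0.0, 0.0]]
new_grad = project_to_minor_subspace([3.0, 4.0, 5.0], primary_basis)
print(new_grad)  # the first component is removed
```

A larger primary subspace protects more of the old tasks' gradient directions (stability) but leaves fewer directions for the new task (plasticity), which is the trade-off the abstract describes.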

Authors:Mengdan Zhu, Senhao Cheng, Guangji Bai, Yifei Zhang, Liang Zhao
Title: Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation
Abstract:
Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in both retrieval and generation quality, while maintaining high efficiency.
中文:跨模态RAG提出了一种创新框架,将查询和图像分解为子维度组件,通过混合检索策略实现子查询感知的检索与生成,在多个数据集上显著超越了现有方法的检索效果和图像合成质量。
English: Cross-modal RAG introduces a novel framework that decomposes queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation through a hybrid retrieval strategy, which significantly outperforms existing methods in both retrieval and image synthesis across multiple datasets.
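
To make the notion of a Pareto-optimal image set concrete, the sketch below keeps every candidate image whose per-subquery scores are not dominated by another candidate. The score values and image names are made up for illustration, and the paper's hybrid sparse and dense retrieval procedure is not reproduced.

```python
def dominates(a, b):
    """True if score vector `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]

# Hypothetical relevance scores of four images against three subqueries.
candidates = {
    "img_a": (0.9, 0.1, 0.2),
    "img_b": (0.2, 0.8, 0.3),
    "img_c": (0.1, 0.1, 0.1),
    "img_d": (0.3, 0.2, 0.9),
}
front = pareto_front(candidates)
print(front)
```

Each image on the front is the best available source for at least one trade-off among subqueries, which matches the abstract's description of images contributing complementary aspects of the query.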

Authors:Mengdan Zhu, Senhao Cheng, Guangji Bai, Yifei Zhang, Liang Zhao
Title: Cross-modal RAG: Sub-dimensional Text-to-Image Retrieval-Augmented Generation
Abstract:
Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture, necessitating the integration of retrieval methods. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in the retrieval and further contributes to generation quality, while maintaining high efficiency.
中文:跨模态RAG提出了一种创新框架,将查询和图像分解为子维度组件,通过混合检索策略实现子查询感知的检索与生成,在多个数据集上显著超越了现有方法的检索效果和图像合成质量。
English: Cross-modal RAG introduces a novel framework that decomposes queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation through a hybrid retrieval strategy, which significantly outperforms existing methods in both retrieval and image synthesis across multiple datasets.

Authors:Xiaomeng Yang, Zhiyu Tan, Junyan Wang, Zhijian Zhou, Hao Li
Title: SDPO: Importance-Sampled Direct Preference Optimization for Stable Diffusion Training
Abstract:
Preference learning has become a central technique for aligning generative models with human expectations. Recently, it has been extended to diffusion models through methods like Direct Preference Optimization (DPO). However, existing approaches such as Diffusion-DPO suffer from two key challenges: timestep-dependent instability, caused by a mismatch between the reverse and forward diffusion processes and by high gradient variance in early noisy timesteps, and off-policy bias arising from the mismatch between optimization and data collection policies. We begin by analyzing the reverse diffusion trajectory and observe that instability primarily occurs at early timesteps with low importance weights. To address these issues, we first propose DPO-C&M, a practical strategy that improves stability by clipping and masking uninformative timesteps while partially mitigating off-policy bias. Building on this, we introduce SDPO (Importance-Sampled Direct Preference Optimization), a principled framework that incorporates importance sampling into the objective to fully correct for off-policy bias and emphasize informative updates during the diffusion process. Experiments on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B demonstrate that both methods outperform standard Diffusion-DPO, with SDPO achieving superior VBench scores, human preference alignment, and training robustness. These results highlight the importance of timestep-aware, distribution-corrected optimization in diffusion-based preference learning.
中文摘要:本研究提出DPO-C&M和SDPO方法,解决了扩散偏好学习中的时间步不稳定性和离策略偏差问题,在多个模型上的实验表明这些方法优于现有技术。
English Summary: The study introduces DPO-C&M and SDPO methods to overcome timestep instability and off-policy bias in diffusion-based preference learning, demonstrating superior performance over existing approaches through experiments on multiple models.

Authors:Omer Dahary, Yehonathan Cohen, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
Title: Be Decisive: Noise-Induced Layouts for Multi-Subject Generation
Abstract:
Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject's spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model's prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model's original distribution.
中文: 本文提出一种方法,通过在去噪过程中预测并优化与初始噪声对齐的空间布局,有效防止文本到图像扩散模型中的主体混淆,提升多主体生成的准确性和稳定性,同时保持模型原有的多样性。
English: This paper introduces a method that predicts and refines a spatial layout aligned with the initial noise during denoising to prevent subject leakage and enhance multi-subject generation in text-to-image diffusion models, improving alignment and stability while preserving diversity.

Authors:Nastaran Saadati, Zhanhong Jiang, Joshua R. Waite, Shreyan Ganguly, Aditya Balu, Chinmay Hegde, Soumik Sarkar
Title: DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models
Abstract:
Low-Rank Adaptation (LoRA) has emerged as one of the most effective, computationally tractable fine-tuning approaches for training Vision-Language Models (VLMs) and Large Language Models (LLMs). LoRA accomplishes this by freezing the pre-trained model weights and injecting trainable low-rank matrices, allowing for efficient learning of these foundation models even on edge devices. However, LoRA in decentralized settings remains underexplored, particularly its theoretical underpinnings, owing to the lack of a smoothness guarantee and to model consensus interference (defined formally in the paper). This work improves the convergence rate of decentralized LoRA (DLoRA) to match the rate of decentralized SGD by ensuring gradient smoothness. We also introduce DeCAF, a novel algorithm integrating DLoRA with truncated singular value decomposition (TSVD)-based matrix factorization to resolve consensus interference. Theoretical analysis shows that TSVD's approximation error is bounded and that consensus differences between DLoRA and DeCAF vanish as rank increases, yielding DeCAF's matching convergence rate. Extensive experiments across vision/language tasks demonstrate that our algorithms outperform local training and rival federated learning under both IID and non-IID data distributions.
中文: 本研究通过确保梯度平滑性使去中心化LoRA达到与去中心化SGD相同的收敛速度,并提出DeCAF算法——将DLoRA与基于截断奇异值分解的矩阵分解相结合以消除共识干扰,在视觉和语言任务中显著优于本地训练并与联邦学习相媲美。
English: This work enhances decentralized LoRA (DLoRA) by ensuring gradient smoothness to match decentralized SGD's convergence rate and introduces DeCAF, a novel algorithm combining DLoRA with TSVD-based factorization to eliminate consensus interference, outperforming local training and rivaling federated learning across vision and language tasks.
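The role of truncated SVD in the DeCAF entry above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that "consensus interference" refers to the fact that averaging the LoRA factors B and A separately differs from averaging their products B @ A; that reading, and every name below (`truncated_svd_factors`, `n_agents`, the matrix sizes), is hypothetical.

```python
import numpy as np

def truncated_svd_factors(product, rank):
    """Return (B, A) such that B @ A is the best rank-`rank` approximation of `product`."""
    u, s, vt = np.linalg.svd(product, full_matrices=False)
    root = np.sqrt(s[:rank])
    b = u[:, :rank] * root            # shape (d_out, rank)
    a = root[:, None] * vt[:rank]     # shape (rank, d_in)
    return b, a

rng = np.random.default_rng(0)
d_out, d_in, rank, n_agents = 8, 6, 2, 3   # toy sizes, chosen for illustration

# Each agent holds its own low-rank factors (random stand-ins here).
bs = [rng.normal(size=(d_out, rank)) for _ in range(n_agents)]
as_ = [rng.normal(size=(rank, d_in)) for _ in range(n_agents)]

# Averaging the products is not the same as multiplying the averaged factors.
mean_of_products = sum(b @ a for b, a in zip(bs, as_)) / n_agents
product_of_means = (sum(bs) / n_agents) @ (sum(as_) / n_agents)
interference = np.linalg.norm(mean_of_products - product_of_means)

# Re-factorize the averaged product back into rank-`rank` factors with TSVD.
b_new, a_new = truncated_svd_factors(mean_of_products, rank)
tsvd_error = np.linalg.norm(mean_of_products - b_new @ a_new)

print(f"factor-averaging gap: {interference:.4f}, TSVD gap: {tsvd_error:.4f}")
```

Because truncated SVD gives the best rank-r approximation in the Frobenius norm, its gap can never exceed the gap left by multiplying the averaged factors, and it shrinks to zero once the rank reaches the rank of the averaged product.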

Authors:Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue
Title: MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Abstract:
Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate their reasoning abilities due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial limitations of state-of-the-art MLLMs when subjected to holistic assessments of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conducted an in-depth analysis of approaches such as "thinking mode" and Rule-based RL, which are commonly believed to enhance reasoning abilities. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios, providing comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.
中文: MME-Reasoning基准通过涵盖归纳、演绎和溯因三种推理类型的问题,全面评估多模态大语言模型的逻辑推理能力,揭示了当前最先进模型存在显著局限性和性能不平衡问题。
English: The MME-Reasoning benchmark is introduced to comprehensively evaluate multimodal large language models' logical reasoning abilities across inductive, deductive, and abductive types, revealing significant limitations and performance imbalances in current state-of-the-art models.

Authors:Shuai Wang, Zexian Li, Qipeng Zhang, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang
Title: Differentiable Solver Search for Fast Diffusion Sampling
Abstract:
Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion models and reveal a compact search space comprising time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify a better solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet256 with only 10 steps. Meanwhile, the DDPM model DiT-XL/2 reaches an FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates generality across various model architectures, resolutions, and model sizes.
Chinese: 本文提出了一种可微分的求解器搜索算法,为扩散模型找到了更高效的求解器,在仅10步采样下显著超越传统方法,并在多种架构上实现了更优的FID分数。
English: This paper introduces a differentiable solver search algorithm that identifies more efficient solvers for diffusion models, significantly outperforming traditional methods with improved FID scores across various architectures using only 10 sampling steps.
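For readers unfamiliar with the baseline this abstract criticizes, the sketch below shows how an Adams-like multistep solver obtains its coefficients from Lagrange interpolation in t alone: each coefficient is the integral of a Lagrange basis polynomial over the next step. It illustrates only that classical construction, not the searched solver from the paper; the function name and the example time grid are assumptions.

```python
import numpy as np

def lagrange_multistep_coeffs(past_times, t_start, t_end):
    """Integrate each Lagrange basis polynomial over [t_start, t_end].

    The result weights past derivative evaluations in an explicit
    Adams-type update: x(t_end) ~= x(t_start) + sum_j coeffs[j] * f_j.
    """
    coeffs = []
    for j, tj in enumerate(past_times):
        basis = np.poly1d([1.0])
        for m, tm in enumerate(past_times):
            if m != j:
                # Multiply by (t - tm) / (tj - tm).
                basis = basis * np.poly1d([1.0, -tm]) / (tj - tm)
        antiderivative = basis.integ()
        coeffs.append(antiderivative(t_end) - antiderivative(t_start))
    return np.array(coeffs)

# Two past evaluations at t = 0 and t = 1, stepping from t = 1 to t = 2.
ab2 = lagrange_multistep_coeffs([0.0, 1.0], t_start=1.0, t_end=2.0)
print(ab2)  # weights for the older and the newer evaluation: -0.5 and 1.5
```

With a uniform grid this reproduces the two-step Adams-Bashforth weights (3/2 on the newest evaluation, -1/2 on the older one). The coefficients depend only on the time grid, which is the restriction the abstract argues against.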

Authors:Kaining Wang, Bo Yang, Yusheng Lei, Zhiwen Yu, Xuelin Cao, George C. Alexandropoulos, Marco Di Renzo, Chau Yuen
Title: Dynamical ON-OFF Control with Trajectory Prediction for Multi-RIS Wireless Networks
Abstract:
Reconfigurable intelligent surfaces (RISs) have demonstrated an unparalleled ability to reconfigure wireless environments by dynamically controlling the phase, amplitude, and polarization of impinging waves. However, as nearly passive reflective metasurfaces, RISs may not distinguish between desired and interference signals, which can lead to severe spectrum pollution and even affect performance negatively. In particular, in large-scale networks, the signal-to-interference-plus-noise ratio (SINR) at the receiving node can be degraded due to excessive interference reflected from the RIS. To overcome this fundamental limitation, we propose in this paper a trajectory prediction-based dynamical control algorithm (TPC) for anticipating the sequence of RIS ON-OFF states, integrating a long short-term memory (LSTM) scheme to predict user trajectories. In particular, through a codebook-based algorithm, the RIS controller adaptively coordinates the configuration of the RIS elements to maximize the received SINR. Our simulation results demonstrate the superiority of the proposed TPC method across various system settings.
中文: 可重构智能表面能动态调控无线环境但可能引发干扰,本文提出基于轨迹预测的控制算法,利用长短期记忆网络优化其配置以提升信干噪比。
English: Reconfigurable intelligent surfaces (RISs) can dynamically control wireless environments but may cause interference, so this paper proposes a trajectory prediction-based control algorithm using LSTM to optimize RIS configurations and improve signal-to-interference-plus-noise ratio.

Authors:Kohei Obata, Yasuko Matsubara, Yasushi Sakurai
Title: Robust and Explainable Detector of Time Series Anomaly via Augmenting Multiclass Pseudo-Anomalies
Abstract:
Unsupervised anomaly detection in time series has been a pivotal research area for decades. Current mainstream approaches focus on learning normality, on the assumption that all or most of the samples in the training set are normal. However, anomalies in the training set (i.e., anomaly contamination) can be misleading. Recent studies employ data augmentation to generate pseudo-anomalies and learn the boundary separating the training samples from the augmented samples. Although this approach mitigates anomaly contamination if augmented samples mimic unseen real anomalies, it suffers from several limitations. (1) Covering a wide range of time series anomalies is challenging. (2) It disregards augmented samples that resemble normal samples (i.e., false anomalies). (3) It places too much trust in the labels of training and augmented samples. In response, we propose RedLamp, which employs diverse data augmentations to generate multiclass pseudo-anomalies and learns the multiclass boundary. Such multiclass pseudo-anomalies cover a wide variety of time series anomalies. We conduct multiclass classification using soft labels, which prevents the model from being overconfident and ensures its robustness against contaminated/false anomalies. The learned latent space is inherently explainable as it is trained to separate pseudo-anomalies into multiclasses. Extensive experiments demonstrate the effectiveness of RedLamp in anomaly detection and its robustness against anomaly contamination.
中文: RedLamp通过数据增强生成多样化的多类伪异常,并采用软标签分类方法,有效提升了时间序列异常检测的鲁棒性,同时增强了模型对异常污染和虚假异常的抵抗能力。
English: RedLamp introduces a novel approach to unsupervised time series anomaly detection by generating diverse multiclass pseudo-anomalies through data augmentation and employing soft-label classification to enhance robustness against contamination and false anomalies.
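The claim that soft labels keep a multiclass classifier from becoming overconfident can be illustrated with a short NumPy sketch. The abstract does not say how RedLamp builds its soft labels, so uniform label smoothing is used here purely as an assumed stand-in; the function names, the smoothing factor `eps`, and the four-class setup are all hypothetical.

```python
import numpy as np

def smooth_labels(class_ids, n_classes, eps=0.1):
    """Turn hard class ids into soft targets via uniform label smoothing."""
    one_hot = np.eye(n_classes)[class_ids]
    return one_hot * (1.0 - eps) + eps / n_classes

def soft_cross_entropy(logits, soft_targets):
    """Mean cross-entropy between softmax(logits) and soft target rows."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(soft_targets * log_probs).sum(axis=1).mean())

# Three samples over four classes (e.g. normal plus three pseudo-anomaly types).
class_ids = np.array([0, 2, 1])
targets = smooth_labels(class_ids, n_classes=4, eps=0.1)

# Both predictions pick the right class; one is far more confident.
confident_logits = 20.0 * np.eye(4)[class_ids]
moderate_logits = 3.0 * np.eye(4)[class_ids]

loss_confident = soft_cross_entropy(confident_logits, targets)
loss_moderate = soft_cross_entropy(moderate_logits, targets)
print(loss_confident, loss_moderate)
```

Under soft targets the extremely confident prediction receives the larger loss even though both predictions are correct, so training no longer rewards pushing probabilities toward 0 or 1. That is the general mechanism by which soft labels limit the damage from mislabeled (contaminated or false-anomaly) samples.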

Authors:Kai Chen, Zihao He, Taiwei Shi, Kristina Lerman
Title: STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models
Abstract:
Steerability, or the ability of large language models (LLMs) to adapt outputs to align with diverse community-specific norms, perspectives, and communication styles, is critical for real-world applications but remains under-evaluated. We introduce Steer-Bench, a benchmark for assessing population-specific steering using contrasting Reddit communities. Covering 30 contrasting subreddit pairs across 19 domains, Steer-Bench includes over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions with corresponding silver labels to test alignment with diverse community norms. Our evaluation of 13 popular LLMs using Steer-Bench reveals that while human experts achieve an accuracy of 81% with silver labels, the best-performing models reach only around 65% accuracy depending on the domain and configuration. Some models lag behind human-level alignment by over 15 percentage points, highlighting significant gaps in community-sensitive steerability. Steer-Bench is a benchmark to systematically assess how effectively LLMs understand community-specific instructions, their resilience to adversarial steering attempts, and their ability to accurately represent diverse cultural and ideological perspectives.
Chinese: Steer-Bench 是一个用于评估大型语言模型适应不同社区规范能力的基准测试,通过对13个流行模型的评估发现,它们在社区敏感性调控方面显著落后于人类专家,某些情况下准确率差距超过15个百分点。
English: Steer-Bench is a benchmark introduced to evaluate how well large language models can adapt their outputs to align with diverse community norms, revealing through tests on 13 models that they significantly lag behind human experts in community-sensitive steerability, with accuracy gaps exceeding 15 percentage points in some cases.

Authors:Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, Kam-Fai Wong
Title: MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents
Abstract:
Modern task-oriented dialogue (TOD) systems increasingly rely on large language model (LLM) agents, leveraging Retrieval-Augmented Generation (RAG) and long-context capabilities for long-term memory utilization. However, these methods are primarily based on semantic similarity, overlooking task intent and reducing task coherence in multi-session dialogues. To address this challenge, we introduce MemGuide, a two-stage framework for intent-driven memory selection. (1) Intent-Aligned Retrieval matches the current dialogue context with stored intent descriptions in the memory bank, retrieving QA-formatted memory units that share the same goal. (2) Missing-Slot Guided Filtering employs a chain-of-thought slot reasoner to enumerate unfilled slots, then uses a fine-tuned LLaMA-8B filter to re-rank the retrieved units by marginal slot-completion gain. The resulting memory units inform a proactive strategy that minimizes conversational turns by directly addressing information gaps. Based on this framework, we introduce MS-TOD, the first multi-session TOD benchmark comprising 132 diverse personas, 956 task goals, and annotated intent-aligned memory targets, supporting efficient multi-session task completion. Evaluations on MS-TOD show that MemGuide raises the task success rate by 11 percentage points (88% -> 99%) and reduces dialogue length by 2.84 turns in multi-session settings, while maintaining parity with single-session benchmarks.
中文摘要:MemGuide提出了一种意图驱动的记忆选择框架,通过主动填补信息空白来提升多轮任务导向对话的任务连贯性并减少对话轮次。
English Summary: MemGuide introduces an intent-driven memory selection framework that enhances multi-session task-oriented dialogues by improving task coherence and reducing conversational turns through proactive gap-filling.

Authors:Gokul Adethya, Bhanu Pratyush Mantha, Tianyang Wang, Xingjian Li, Min Xu
Title: SaSi: A Self-augmented and Self-interpreted Deep Learning Approach for Few-shot Cryo-ET Particle Detection
Abstract:
Cryo-electron tomography (cryo-ET) has emerged as a powerful technique for imaging macromolecular complexes in their near-native states. However, the localization of 3D particles in cellular environments still presents a significant challenge due to low signal-to-noise ratios and missing wedge artifacts. Deep learning approaches have shown great potential, but they need huge amounts of data, which can be a challenge in cryo-ET scenarios where labeled data is often scarce. In this paper, we propose a novel Self-augmented and Self-interpreted (SaSi) deep learning approach towards few-shot particle detection in 3D cryo-ET images. Our method builds upon self-augmentation techniques to further boost data utilization and introduces a self-interpreted segmentation strategy for alleviating dependency on labeled data, hence improving generalization and robustness. As demonstrated by experiments conducted on both simulated and real-world cryo-ET datasets, the SaSi approach significantly outperforms existing state-of-the-art methods for particle localization. This research increases understanding of how to detect particles with very few labels in cryo-ET and thus sets a new benchmark for few-shot learning in structural biology.
Chinese: 本文提出了一种自增强自解释(SaSi)深度学习方法,通过提高数据利用率和减少对标注数据的依赖,显著提升了冷冻电子断层扫描中的少样本粒子检测性能,在模拟和真实数据集上均优于现有先进技术。
English: The paper introduces a Self-augmented and Self-interpreted (SaSi) deep learning method that enhances few-shot particle detection in cryo-electron tomography by improving data utilization and reducing reliance on labeled data, outperforming current techniques in both simulated and real-world datasets.

Authors:Zexi Li, Xiangzhu Wang, William F. Shen, Meghdad Kurmanji, Xinchi Qiu, Dongqi Cai, Chao Wu, Nicholas D. Lane
Title: Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning?
Abstract:
Large language model (LLM) unlearning, i.e., selectively removing information from LLMs, is vital for responsible model deployment. In contrast, LLM knowledge editing aims to modify LLM knowledge instead of removing it. Though editing and unlearning seem to be two distinct tasks, we find there is a tight connection between them. In this paper, we conceptualize unlearning as a special case of editing where information is modified to a refusal or "empty set" (∅) response, signifying its removal. This paper thus investigates whether knowledge editing techniques are strong baselines for LLM unlearning. We evaluate state-of-the-art (SOTA) editing methods (e.g., ROME, MEMIT, GRACE, WISE, and AlphaEdit) against existing unlearning approaches on pretrained and finetuned knowledge. Results show that certain editing methods, notably WISE and AlphaEdit, are effective unlearning baselines, especially for pretrained knowledge, and excel in generating human-aligned refusal answers. To better adapt editing methods for unlearning applications, we propose practical recipes including self-improvement and query merging. The former leverages the LLM's own in-context learning ability to craft a more human-aligned unlearning target, and the latter enables ROME and MEMIT to perform well in unlearning longer sample sequences. We advocate for the unlearning community to adopt SOTA editing methods as baselines and explore unlearning from an editing perspective for more holistic LLM memory control.
中文摘要:本文将大语言模型遗忘重新定义为通过修改知识生成拒绝回应的特殊编辑形式,研究表明WISE和AlphaEdit等前沿编辑技术可作为有效的遗忘基准方法,特别适用于预训练知识,并提出了自我改进与查询合并等优化策略。
English Summary: This paper redefines unlearning as a form of knowledge editing that modifies information to refusal responses, demonstrating that certain editing techniques like WISE and AlphaEdit serve as strong baselines for unlearning, particularly for pretrained knowledge, and proposes methods to enhance their application.

Authors:Shaukat Ali, Ana Cavalcanti, Cláudio Ângelo Gonçalves Gomes, Peter Gorm Larsen, Hassan Sartaj, Anastasios Tefas, Jim Woodcock, Houxiang Zhang
Title: Software Engineering for Self-Adaptive Robotics: A Research Agenda
Abstract:
Self-adaptive robotic systems are designed to operate autonomously in dynamic and uncertain environments, requiring robust mechanisms to monitor, analyse, and adapt their behaviour in real-time. Unlike traditional robotic software, which follows predefined logic, self-adaptive robots leverage artificial intelligence, machine learning, and model-driven engineering to continuously adjust to changing operational conditions while ensuring reliability, safety, and performance. This paper presents a research agenda for software engineering in self-adaptive robotics, addressing critical challenges across two key dimensions: (1) the development phase, including requirements engineering, software design, co-simulation, and testing methodologies tailored to adaptive robotic systems, and (2) key enabling technologies, such as digital twins, model-driven engineering, and AI-driven adaptation, which facilitate runtime monitoring, fault detection, and automated decision-making. We discuss open research challenges, including verifying adaptive behaviours under uncertainty, balancing trade-offs between adaptability, performance, and safety, and integrating self-adaptation frameworks like MAPE-K. By providing a structured roadmap, this work aims to advance the software engineering foundations for self-adaptive robotic systems, ensuring they remain trustworthy, efficient, and capable of handling real-world complexities.
中文摘要:自适应机器人系统利用人工智能和模型驱动方法在动态环境中自主调整,本文提出了软件工程路线图,通过解决生命周期挑战和数字孪生等使能技术,旨在到2030年构建可信赖的自适应系统。
English Summary: Self-adaptive robotic systems leverage AI and model-driven approaches to autonomously adjust in dynamic environments, with this paper outlining a software engineering roadmap addressing lifecycle challenges and enabling technologies like digital twins to build trustworthy systems by 2030.

Authors:Hassan Sartaj, Shaukat Ali, Ana Cavalcanti, Lukas Esterle, Cláudio Gomes, Peter Gorm Larsen, Anastasios Tefas, Jim Woodcock, Houxiang Zhang
Title: Software Engineering for Self-Adaptive Robotics: A Research Agenda
Abstract:
Self-adaptive robotic systems operate autonomously in dynamic and uncertain environments, requiring robust real-time monitoring and adaptive behaviour. Unlike traditional robotic software with predefined logic, self-adaptive robots exploit artificial intelligence (AI), machine learning, and model-driven engineering to adapt continuously to changing conditions, thereby ensuring reliability, safety, and optimal performance. This paper presents a research agenda for software engineering in self-adaptive robotics, structured along two dimensions. The first concerns the software engineering lifecycle, requirements, design, development, testing, and operations, tailored to the challenges of self-adaptive robotics. The second focuses on enabling technologies such as digital twins, AI-driven adaptation, and quantum computing, which support runtime monitoring, fault detection, and automated decision-making. We identify open challenges, including verifying adaptive behaviours under uncertainty, balancing trade-offs between adaptability, performance, and safety, and integrating self-adaptation frameworks like MAPE-K/MAPLE-K. By consolidating these challenges into a roadmap toward 2030, this work contributes to the foundations of trustworthy and efficient self-adaptive robotic systems capable of meeting the complexities of real-world deployment.
中文摘要:自适应机器人系统利用人工智能和模型驱动方法在动态环境中自主调整,本文提出了软件工程路线图,通过解决生命周期挑战和数字孪生等使能技术,旨在到2030年构建可信赖的自适应系统。
English Summary: Self-adaptive robotic systems leverage AI and model-driven approaches to autonomously adjust in dynamic environments, with this paper outlining a software engineering roadmap addressing lifecycle challenges and enabling technologies like digital twins to build trustworthy systems by 2030.

Authors:Hassan Sartaj, Shaukat Ali
Title: Search-Based Software Engineering in the Landscape of AI Foundation Models
Abstract:
Search-based software engineering (SBSE), at the intersection of artificial intelligence (AI) and software engineering, has been an active area of research for about 25 years. It has been applied to solve numerous problems across the entire software engineering lifecycle and has demonstrated its versatility in multiple domains. With the recent advancements in AI, particularly the emergence of foundation models (FMs), the evolution of SBSE alongside FMs remains undetermined. In this window of opportunity, we propose a research roadmap that articulates the current landscape of SBSE in relation to foundation models (FMs), highlights open challenges, and outlines potential research directions for advancing SBSE through its interplay with FMs. This roadmap aims to establish a forward-thinking and innovative perspective for the future of SBSE in the era of FMs.
中文摘要:基于搜索的软件工程(SBSE)历经25年发展已解决软件生命周期中的众多问题,本研究路线图通过分析五个核心维度并展望未来协同可能,系统阐述了SBSE与基础模型的融合路径。
English Summary: Search-based software engineering (SBSE) has evolved over 25 years to address diverse software lifecycle challenges, and this research roadmap explores its integration with foundation models by analyzing five core aspects and envisioning future synergies.

Authors:Hassan Sartaj, Shaukat Ali, Paolo Arcaini, Andrea Arcuri
Title: Search-Based Software Engineering and AI Foundation Models: Current Landscape and Future Roadmap
Abstract:
Search-based software engineering (SBSE), which integrates metaheuristic search techniques with software engineering, has been an active area of research for about 25 years. It has been applied to solve numerous problems across the entire software engineering lifecycle and has demonstrated its versatility in multiple domains. With recent advances in AI, particularly the emergence of foundation models (FMs) such as large language models (LLMs), the evolution of SBSE alongside these models remains undetermined. In this window of opportunity, we present a research roadmap that articulates the current landscape of SBSE in relation to FMs, identifies open challenges, and outlines potential research directions to advance SBSE through its integration and interplay with FMs. Specifically, we analyze five core aspects: leveraging FMs for SBSE design, applying FMs to complement SBSE in SE problems, employing SBSE to address FM challenges, adapting SBSE practices for FMs tailored to SE activities, and exploring the synergistic potential between SBSE and FMs. Furthermore, we present a forward-thinking perspective that envisions the future of SBSE in the era of FMs, highlighting promising research opportunities to address challenges in emerging domains.
中文摘要:基于搜索的软件工程(SBSE)历经25年发展已解决软件生命周期中的众多问题,本研究路线图通过分析五个核心维度并展望未来协同可能,系统阐述了SBSE与基础模型的融合路径。
English Summary: Search-based software engineering (SBSE) has evolved over 25 years to address diverse software lifecycle challenges, and this research roadmap explores its integration with foundation models by analyzing five core aspects and envisioning future synergies.

Authors:Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li
Title: LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
Abstract:
While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA, and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates a highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: https://ml-gsai.github.io/LLaDA-1.5-Demo/.
Chinese: 本文提出方差缩减偏好优化(VRPO)框架,通过分析ELBO估计器的方差并引入无偏方差缩减策略,有效解决了掩码扩散模型对齐中的高方差问题,由此开发的LLaDA 1.5模型在数学、代码和对齐基准测试中均展现出显著性能提升。
English: This paper introduces Variance-Reduced Preference Optimization (VRPO), a novel framework that addresses the high variance in ELBO-based likelihood estimates to effectively align Masked Diffusion Models like LLaDA with human preferences, resulting in the enhanced LLaDA 1.5 model which demonstrates significant performance improvements across multiple benchmarks.
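Antithetic sampling, one of the variance reduction strategies named in this abstract, can be shown on a toy Monte Carlo problem. The sketch below is the textbook version of the technique (pairing each uniform draw u with 1 - u), not the VRPO estimator for ELBO terms; the integrand, the sample budget, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def plain_estimate(f, rng, n):
    """Standard Monte Carlo estimate of E[f(U)], U ~ Uniform(0, 1), with n draws."""
    u = rng.random(n)
    return f(u).mean()

def antithetic_estimate(f, rng, n):
    """Same budget of n function evaluations, using negatively correlated pairs."""
    u = rng.random(n // 2)
    return (0.5 * (f(u) + f(1.0 - u))).mean()

rng = np.random.default_rng(0)
f = np.exp                      # E[exp(U)] = e - 1, known in closed form
true_value = np.e - 1.0

# Repeat both estimators many times to compare their spread.
plain = np.array([plain_estimate(f, rng, 64) for _ in range(2000)])
anti = np.array([antithetic_estimate(f, rng, 64) for _ in range(2000)])

print(f"plain: mean={plain.mean():.4f}, var={plain.var():.2e}")
print(f"antithetic: mean={anti.mean():.4f}, var={anti.var():.2e}")
```

Both estimators are unbiased, since u and 1 - u have the same distribution. For a monotone integrand the paired evaluations are negatively correlated, so the antithetic estimate has much lower variance at the same evaluation budget.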

Authors:Chengbo He, Bochao Zou, Junliang Xing, Jiansheng Chen, Yuanchun Shi, Huimin Ma
Title: DeCoDe: Defer-and-Complement Decision-Making via Decoupled Concept Bottleneck Models
Abstract:
In human-AI collaboration, a central challenge is deciding whether a task should be handled by the AI, deferred to a human expert, or addressed through collaborative effort. Existing Learning to Defer approaches typically make binary choices between AI and humans, neglecting their complementary strengths. They also lack interpretability, a critical property in high-stakes scenarios where users must understand and, if necessary, correct the model's reasoning. To overcome these limitations, we propose Defer-and-Complement Decision-Making via Decoupled Concept Bottleneck Models (DeCoDe), a concept-driven framework for human-AI collaboration. DeCoDe makes strategy decisions based on human-interpretable concept representations, enhancing transparency throughout the decision process. It supports three flexible modes: autonomous AI prediction, deferral to humans, and human-AI collaborative complementarity, selected via a gating network that takes concept-level inputs and is trained using a novel surrogate loss that balances accuracy and human effort. This approach enables instance-specific, interpretable, and adaptive human-AI collaboration. Experiments on real-world datasets demonstrate that DeCoDe significantly outperforms AI-only, human-only, and traditional deferral baselines, while maintaining strong robustness and interpretability even under noisy expert annotations.
中文: 提出的DeCoDe框架通过基于可解释概念的决策机制,在自主AI处理、转交人类专家和人机协同三种模式间灵活选择,实现了透明且自适应的人机协作,其性能显著优于现有方法。
English: The proposed DeCoDe framework enables interpretable and adaptive human-AI collaboration by using concept-driven decision-making to flexibly choose among autonomous AI, human deferral, or complementary collaboration modes, significantly outperforming existing methods while maintaining transparency.

Authors:Yuqi Liu, Qin Jin, Tianyuan Qu, Xuan Liu, Yang Du, Bei Yu, Jiaya Jia
Title: RTime-QA: A Benchmark for Atomic Temporal Event Understanding in Large Multi-modal Models
Abstract:
Accurately understanding atomic temporal events is essential for video comprehension. However, current video-language benchmarks often fall short in evaluating the temporal event understanding capabilities of Large Multi-modal Models (LMMs), as they can be effectively addressed using image-language models. In this paper, we introduce RTime-QA, a novel benchmark specifically designed to assess the atomic temporal event understanding ability of LMMs. RTime-QA comprises 822 high-quality, carefully curated video-text questions, each meticulously annotated by human experts. Each question features a video depicting an atomic temporal event, paired with both correct answers and temporal negative descriptions, specifically designed to evaluate temporal understanding. To advance LMMs' temporal event understanding ability, we further introduce RTime-IT, a 14k instruction-tuning dataset that employs an annotation process similar to that of RTime-QA. Extensive experimental analysis demonstrates that RTime-QA presents a significant challenge for LMMs: the state-of-the-art model Qwen2-VL achieves only 34.6 on the strict-ACC metric, substantially lagging behind human performance. Furthermore, our experiments reveal that RTime-IT effectively enhances LMMs' capacity for temporal understanding. By fine-tuning on RTime-IT, our Qwen2-VL achieves 65.9 on RTime-QA.
中文摘要:本文提出了RTime-QA基准测试,专门用于评估大型多模态模型对视频中原子时序事件的理解能力,并通过RTime-IT指令调优数据集有效提升了模型在该任务上的表现。
English Summary: The paper introduces RTime-QA, a benchmark designed to evaluate large multi-modal models' understanding of atomic temporal events in videos, and RTime-IT, an instruction-tuning dataset that significantly improves model performance on this task.

Authors:Yanben Shen, Timilehin T. Ayanlade, Venkata Naresh Boddepalli, Mojdeh Saadati, Ashlyn Rairdin, Zi K. Deng, Muhammad Arbab Arshad, Aditya Balu, Daren Mueller, Asheesh K Singh, Wesley Everman, Nirav Merchant, Baskar Ganapathysubramanian, Meaghan Anderson, Soumik Sarkar, Arti Singh
Title: WeedNet: A Foundation Model-Based Global-to-Local AI Approach for Real-Time Weed Species Identification and Classification
Abstract:
Early identification of weeds is essential for effective management and control, and there is growing interest in automating the process using computer vision techniques coupled with AI methods. However, challenges associated with training AI-based weed identification models, such as limited expert-verified data and the complexity and variability of morphological features, have hindered progress. To address these issues, we present WeedNet, the first global-scale weed identification model capable of recognizing an extensive set of weed species, including noxious and invasive plant species. WeedNet is an end-to-end real-time weed identification pipeline that uses self-supervised learning, fine-tuning, and enhanced trustworthiness strategies. WeedNet achieved 91.02% accuracy across 1,593 weed species, with 41% of species achieving 100% accuracy. Using a fine-tuning strategy and a Global-to-Local approach, the local Iowa WeedNet model achieved an overall accuracy of 97.38% for 85 Iowa weeds, with most classes exceeding 90% mean per-class accuracy. Testing across intra-species dissimilarity (developmental stages) and inter-species similarity (look-alike species) suggests that diversity in the images collected, spanning all the growth stages and distinguishable plant characteristics, is crucial in driving model performance. The generalizability and adaptability of the Global WeedNet model enable it to function as a foundational model, with the Global-to-Local strategy allowing fine-tuning for region-specific weed communities. Additional validation on drone- and ground-rover-based images highlights the potential of WeedNet for integration into robotic platforms. Furthermore, integration with AI for conversational use provides intelligent agricultural and ecological conservation consulting tools for farmers, agronomists, researchers, land managers, and government agencies across diverse landscapes.
Chinese: WeedNet是全球首个大规模杂草识别模型,通过自监督学习和微调策略实现对多种杂草的高精度实时检测,可集成至机器人平台并为农业管理提供智能决策支持。
English: WeedNet is a global-scale AI model that achieves high accuracy in identifying diverse weed species through self-supervised learning and fine-tuning, enabling real-time weed recognition and integration with robotic platforms for agricultural management.

Authors:Xunlian Dai, Li Zhou, Benyou Wang, Haizhou Li
Title: From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test
Abstract:
The human-centered word association test (WAT) serves as a cognitive proxy, revealing sociocultural variations through lexical-semantic patterns. We extend this test into an LLM-adaptive, free-relation task to assess the alignment of large language models (LLMs) with cross-cultural cognition. To mitigate culture preference, we propose CultureSteer, an innovative approach that integrates a culture-aware steering mechanism to guide semantic representations toward culturally specific spaces. Experiments show that current LLMs exhibit significant bias toward Western cultural (notably American) schemas at the word association level. In contrast, our model substantially improves cross-cultural alignment, surpassing prompt-based methods in capturing diverse semantic associations. Further validation on culture-sensitive downstream tasks confirms its efficacy in fostering cognitive alignment across cultures. This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.
Chinese: 本研究提出CultureSteer方法,将文化特定语义关联嵌入大语言模型内部表征空间,有效纠正其西方认知偏见,并通过词汇联想测试与文化敏感任务验证了跨文化认知对齐的显著提升。
English: This study introduces CultureSteer, a method that embeds cultural-specific semantic associations into LLMs to address their Western bias and improve cross-cultural cognitive alignment, as validated through word association tests and culture-sensitive tasks.

Authors:Xunlian Dai, Li Zhou, Benyou Wang, Haizhou Li
Title: From Word to World: Evaluate and Mitigate Culture Bias in LLMs via Word Association Test
Abstract:
The human-centered word association test (WAT) serves as a cognitive proxy, revealing sociocultural variations through culturally shared semantic expectations and implicit linguistic patterns shaped by lived experiences. We extend this test into an LLM-adaptive, free-relation task to assess the alignment of large language models (LLMs) with cross-cultural cognition. To address culture preference, we propose CultureSteer, an innovative approach that moves beyond superficial cultural prompting by embedding cultural-specific semantic associations directly within the model's internal representation space. Experiments show that current LLMs exhibit significant bias toward Western (notably American) schemas at the word association level. In contrast, our model substantially improves cross-cultural alignment, capturing diverse semantic associations. Further validation on culture-sensitive downstream tasks confirms its efficacy in fostering cognitive alignment across cultures. This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.
Chinese: 本研究提出CultureSteer方法,将文化特定语义关联嵌入大语言模型内部表征空间,有效纠正其西方认知偏见,并通过词汇联想测试与文化敏感任务验证了跨文化认知对齐的显著提升。
English: This study introduces CultureSteer, a method that embeds cultural-specific semantic associations into LLMs to address their Western bias and improve cross-cultural cognitive alignment, as validated through word association tests and culture-sensitive tasks.

Authors:He Zhu, Zhiwen Ruan, Junyou Su, Xingwei He, Yun Chen, Wenjia Zhang, Guanhua Chen
Title: TAG-INSTRUCT: Controlled Instruction Complexity Enhancement through Structure-based Augmentation
Abstract:
High-quality instruction data is crucial for developing large language models (LLMs), yet existing approaches struggle to effectively control instruction complexity. We present TAG-INSTRUCT, a novel framework that enhances instruction complexity through structured semantic compression and controlled difficulty augmentation. Unlike previous prompt-based methods operating on raw text, TAG-INSTRUCT compresses instructions into a compact tag space and systematically enhances complexity through RL-guided tag expansion. Through extensive experiments, we show that TAG-INSTRUCT outperforms existing instruction complexity augmentation approaches. Our analysis reveals that operating in tag space provides superior controllability and stability across different instruction synthesis frameworks.
中文: TAG-INSTRUCT框架通过将指令压缩至标签空间并利用强化学习系统性地提升难度,有效增强了大型语言模型的指令复杂度,其可控性和稳定性均优于现有方法。
English: The TAG-INSTRUCT framework improves instruction complexity for large language models by compressing instructions into a tag space and systematically enhancing difficulty through reinforcement learning, outperforming existing methods with superior controllability and stability.

Authors:Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, Haoxiang Wang
Title: Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
Abstract:
Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) -- a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM we target to optimize on positive data, enabling direct policy optimization on all LLMs' generations. We conduct experiments on 7B and 32B models in math reasoning tasks. Results consistently show that through the additional leverage of negative feedback, NFT significantly improves over SL baselines like Rejection sampling Fine-Tuning, matching or even surpassing leading RL algorithms like GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are actually equivalent in strict-on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.
Chinese: 本研究提出负感知微调(NFT)这一监督学习方法,使大语言模型能够通过二元反馈从自身错误中学习,挑战了强化学习在自我改进中的主导地位,并展现出与主流RL算法相当甚至更优的性能。
English: This work introduces Negative-aware Fine-Tuning (NFT), a supervised learning method that enables large language models to learn from their mistakes using binary feedback, challenging the dominance of reinforcement learning in self-improvement and demonstrating performance comparable to or better than leading RL algorithms.

Authors:Xingjian Li, Qifeng Wu, Colleen Que, Yiran Ding, Adithya S. Ubaradka, Jianhua Xing, Tianyang Wang, Min Xu
Title: AutoMiSeg: Automatic Medical Image Segmentation via Test-Time Adaptation of Foundation Models
Abstract:
Medical image segmentation is vital for clinical diagnosis, yet current deep learning methods often demand extensive expert effort, i.e., either through annotating large training datasets or providing prompts at inference time for each new case. This paper introduces a zero-shot and automatic segmentation pipeline that combines off-the-shelf vision-language and segmentation foundation models. Given a medical image and a task definition (e.g., "segment the optic disc in an eye fundus image"), our method uses a grounding model to generate an initial bounding box, followed by a visual prompt boosting module that enhances the prompts, which are then processed by a promptable segmentation model to produce the final mask. To address the challenges of domain gap and result verification, we introduce a test-time adaptation framework featuring a set of learnable adaptors that align the medical inputs with foundation model representations. Its hyperparameters are optimized via Bayesian Optimization, guided by a proxy validation model without requiring ground-truth labels. Our pipeline offers an annotation-efficient and scalable solution for zero-shot medical image segmentation across diverse tasks. Our pipeline is evaluated on seven diverse medical imaging datasets and shows promising results. By proper decomposition and test-time adaptation, our fully automatic pipeline performs competitively with weakly-prompted interactive foundation models.
中文: 本文提出一种零样本医学图像分割流程,结合视觉语言与分割基础模型,通过无需专家标注的测试时自适应方法,显著提升了分割精度。
English: This paper presents a zero-shot medical image segmentation pipeline that integrates vision-language and segmentation foundation models, achieving significant accuracy improvements through test-time adaptation without requiring expert annotations.

Authors:Xingjian Li, Qifeng Wu, Adithya S. Ubaradka, Yiran Ding, Colleen Que, Runmin Jiang, Jianhua Xing, Tianyang Wang, Min Xu
Title: AutoMiSeg: Automatic Medical Image Segmentation via Test-Time Adaptation of Foundation Models
Abstract:
Medical image segmentation is vital for clinical diagnosis, yet current deep learning methods often demand extensive expert effort, i.e., either through annotating large training datasets or providing prompts at inference time for each new case. This paper introduces a zero-shot and automatic segmentation pipeline that combines off-the-shelf vision-language and segmentation foundation models. Given a medical image and a task definition (e.g., "segment the optic disc in an eye fundus image"), our method uses a grounding model to generate an initial bounding box, followed by a visual prompt boosting module that enhances the prompts, which are then processed by a promptable segmentation model to produce the final mask. To address the challenges of domain gap and result verification, we introduce a test-time adaptation framework featuring a set of learnable adaptors that align the medical inputs with foundation model representations. Its hyperparameters are optimized via Bayesian Optimization, guided by a proxy validation model without requiring ground-truth labels. Our pipeline offers an annotation-efficient and scalable solution for zero-shot medical image segmentation across diverse tasks. Our pipeline is evaluated on seven diverse medical imaging datasets and shows promising results. By proper decomposition and test-time adaptation, our fully automatic pipeline not only substantially surpasses the previously best-performing method, yielding a 69% relative improvement in accuracy (Dice Score from 42.53 to 71.81), but also performs competitively with weakly-prompted interactive foundation models.
中文: 本文提出一种零样本医学图像分割流程,结合视觉语言与分割基础模型,通过无需专家标注的测试时自适应方法,显著提升了分割精度。
English: This paper presents a zero-shot medical image segmentation pipeline that integrates vision-language and segmentation foundation models, achieving significant accuracy improvements through test-time adaptation without requiring expert annotations.
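As a quick check of the arithmetic behind the "69% relative improvement" quoted in the abstract above, the figure follows from the two Dice scores it reports, assuming "relative improvement" means the gain divided by the baseline score:

```latex
\[
\frac{71.81 - 42.53}{42.53} \;=\; \frac{29.28}{42.53} \;\approx\; 0.688 \;\approx\; 69\%
\]
```

The absolute gain is 29.28 Dice points; the 69% figure is relative to the earlier method's 42.53.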

Authors:Xiang Liu, Zhaoxiang Liu, Peng Wang, Kohou Wang, Huan Hu, Kai Wang, Shiguo Lian
Title: SLearnLLM: A Self-Learning Framework for Efficient Domain-Specific Adaptation of Large Language Models
Abstract:
When using supervised fine-tuning (SFT) to adapt large language models (LLMs) to specific domains, a significant challenge arises: should we use the entire SFT dataset for fine-tuning? Common practice often involves fine-tuning directly on the entire dataset due to limited information on the LLM's past training data. However, if the SFT dataset largely overlaps with the model's existing knowledge, the performance gains are minimal, leading to wasted computational resources. Identifying the unknown knowledge within the SFT dataset and using it to fine-tune the model could substantially improve the training efficiency. To address this challenge, we propose a self-learning framework for LLMs inspired by human learning patterns. This framework takes a fine-tuning (SFT) dataset in a specific domain as input. First, the LLMs answer the questions in the SFT dataset. The LLMs then objectively grade the responses and retain only the incorrectly answered QA pairs. Finally, we fine-tune the LLMs on this filtered QA set. Experimental results in the fields of agriculture and medicine demonstrate that our method substantially reduces training time while achieving improvements comparable to those attained with full dataset fine-tuning. By concentrating on the unknown knowledge within the SFT dataset, our approach enhances the efficiency of fine-tuning LLMs.
中文: 为提高微调效率,我们提出一种自学习框架,仅利用SFT数据集中的未知知识进行训练,在保持与全数据集微调相当性能的同时,显著降低了计算成本。
English: To enhance fine-tuning efficiency, we propose a self-learning framework that identifies and utilizes only the unknown knowledge in the SFT dataset for training, significantly reducing computational costs while maintaining performance comparable to full dataset fine-tuning.
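The answer, grade, and filter steps described in the abstract above can be sketched as follows. This is a minimal illustration and not the authors' implementation: `select_unknown_pairs`, `answer_fn`, and `grade_fn` are hypothetical names, and the grading callable stands in for the LLM-based grading the abstract mentions.

```python
from typing import Callable, Iterable, List, Tuple

QAPair = Tuple[str, str]  # (question, reference answer)


def select_unknown_pairs(
    dataset: Iterable[QAPair],
    answer_fn: Callable[[str], str],
    grade_fn: Callable[[str, str, str], bool],
) -> List[QAPair]:
    """Keep only the QA pairs the model currently answers incorrectly.

    answer_fn(question) -> the model's answer (hypothetical stand-in for an LLM call).
    grade_fn(question, reference, prediction) -> True if the prediction is judged correct.
    """
    unknown: List[QAPair] = []
    for question, reference in dataset:
        prediction = answer_fn(question)
        if not grade_fn(question, reference, prediction):
            # The model got this wrong, so it is treated as "unknown knowledge".
            unknown.append((question, reference))
    return unknown
```

The returned subset would then be passed to an ordinary SFT run in place of the full dataset; the fine-tuning step itself is omitted here.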

Authors:Xinbang Dai, Huikang Hu, Yuncheng Hua, Jiaqi Li, Yongrui Chen, Rihui Jin, Nan Hu, Guilin Qi
Title: After Retrieval, Before Generation: Enhancing the Trustworthiness of Large Language Models in RAG
Abstract:
Retrieval-augmented generation (RAG) systems face critical challenges in balancing internal (parametric) and external (retrieved) knowledge, especially when these sources conflict or are unreliable. To analyze these scenarios comprehensively, we construct the Trustworthiness Response Dataset (TRD) with 36,266 questions spanning four RAG settings. We reveal that existing approaches address isolated scenarios (prioritizing one knowledge source, naively merging both, or refusing answers) but lack a unified framework to handle different real-world conditions simultaneously. Therefore, we propose the BRIDGE framework, which dynamically determines a comprehensive response strategy of large language models (LLMs). BRIDGE leverages an adaptive weighting mechanism named soft bias to guide knowledge collection, followed by a Maximum Soft-bias Decision Tree to evaluate knowledge and select optimal response strategies (trust internal/external knowledge, or refuse). Experiments show BRIDGE outperforms baselines by 5-15% in accuracy while maintaining balanced performance across all scenarios. Our work provides an effective solution for LLMs' trustworthy responses in real-world RAG applications.
中文摘要:BRIDGE框架通过自适应软偏置和决策树动态协调RAG系统中相互冲突的内外部知识,在多种场景下实现准确率提升5-15%的均衡表现。
English Summary: The BRIDGE framework dynamically balances conflicting internal and external knowledge in RAG systems using adaptive soft bias and decision trees, achieving 5-15% higher accuracy across diverse scenarios.

Authors:Yihong Wu, Liheng Ma, Muzhi Li, Jiaming Zhou, Jianye Hao, Ho-fung Leung, Irwin King, Yingxue Zhang, Jian-Yun Nie
Title: Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization
Abstract:
Although Large Language Models (LLMs) have demonstrated remarkable versatility, their application to Question Answering (QA) tasks remains hindered by hallucination due to a lack of factual knowledge. While Retrieval-Augmented Generation mitigates these issues by integrating external knowledge, existing approaches rely heavily on in-context learning, whose performance is constrained by the fundamental reasoning capabilities of LLMs. In this paper, we propose Mujica, a Multi-hop Joint Intelligence for Complex Question Answering, comprising a planner that decomposes questions into a directed acyclic graph of subquestions and a worker that resolves questions via retrieval and reasoning. Additionally, we introduce MyGO (Minimalist policy Gradient Optimization), a novel reinforcement learning method that replaces traditional policy gradient updates with Maximum Likelihood Estimation (MLE) by sampling trajectories from an asymptotically optimal policy. MyGO eliminates the need for gradient rescaling and reference models, ensuring stable and efficient training. Empirical results across multiple datasets demonstrate the effectiveness of Mujica-MyGO in enhancing multi-hop QA performance for various LLMs, offering a scalable and resource-efficient solution for complex QA tasks.
中文: 大型语言模型在问答任务中因幻觉问题难以保证事实准确性,而提出的Mujica框架结合MyGO强化学习方法,通过问题分解和响应优化有效提升了多跳推理能力。
English: Large Language Models struggle with factual accuracy in Question Answering due to hallucinations, but the proposed Mujica framework with MyGO reinforcement learning enhances multi-hop reasoning by decomposing questions and optimizing responses efficiently.

Authors:Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li
Title: LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Abstract:
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and an MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks and shows better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: https://ml-gsai.github.io/LLaDA-V-demo/.
中文: LLaDA-V是一种基于扩散的多模态模型,通过将视觉指令调整与掩码扩散模型相结合,在文本任务能力较弱的情况下仍展现出竞争力的多模态性能,并在同类扩散模型中实现了最先进的效果。
English: LLaDA-V is a diffusion-based multimodal model that integrates visual instruction tuning with masked diffusion models, demonstrating competitive performance in multimodal tasks despite weaker text-only capabilities and achieving state-of-the-art results among diffusion-based MLLMs.

Authors:Hongji Yang, Yucheng Zhou, Wencheng Han, Jianbing Shen
Title: Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation
Abstract:
Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework, designed to rephrase a simple user prompt into a sophisticated prompt to a text-to-image model. Specifically, we employ the large vision language models (LVLMs) as the solver to rewrite the user prompt, and concurrently, employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Simultaneously, the solver and the reward model are unified into one model and iterated in reinforcement learning to achieve self-improvement by giving a solution and judging itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.
中文摘要:本文提出了一种新颖的提示优化框架,利用大型视觉语言模型对文本到图像生成的用户提示进行优化,并通过AI反馈评估输出质量,无需大量人工标注数据即可实现卓越性能。
English Summary: This paper introduces a novel prompt optimization framework that uses large vision-language models to refine user prompts for text-to-image generation and evaluate the output quality through AI feedback, achieving superior performance without extensive manual data annotation.

Authors:Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang
Title: KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models
Abstract:
Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.
中文摘要:本文提出KRIS-Bench诊断基准,通过三种知识类型评估图像编辑系统的知识推理能力,综合评估显示现有模型存在明显性能差距。
English Summary: The paper introduces KRIS-Bench, a diagnostic benchmark for evaluating knowledge-based reasoning in image-editing systems across three knowledge types, revealing significant performance gaps in current models through comprehensive evaluation.

Authors:Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Title: Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning
Abstract:
Teaching large language models (LLMs) to be faithful to the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to reduce faithfulness hallucinations of LLMs across different downstream tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.
中文: CANOE框架通过合成短问答数据和应用Dual-GRPO强化学习方法,有效减少大语言模型的忠实性幻觉,在无需人工标注的情况下显著提升了11项任务中的模型表现。
English: The CANOE framework reduces faithfulness hallucinations in large language models by synthesizing short-form QA data and applying the Dual-GRPO reinforcement learning method, significantly improving model performance across 11 tasks without human annotations.

Authors:Pinxin Liu, Haiyang Liu, Luchuan Song, Chenliang Xu
Title: Intentional Gesture: Deliver Your Intentions with Gestures for Speech
Abstract:
When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but are semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations. It injects high-level communicative functions (e.g., intentions) into tokenized motion representations to enable intention-aware gesture synthesis that is both temporally aligned and semantically meaningful, achieving new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: https://andypinxinliu.github.io/Intentional-Gesture
中文:提出的Intentional-Gesture框架通过将交流意图融入动作合成,解决了当前手势生成方法语义浅薄的问题,利用意图标注数据集和标记化动作表征实现了最先进的性能。
English: The proposed Intentional-Gesture framework addresses the semantic limitations of current gesture generation methods by incorporating communicative intentions into motion synthesis, achieving state-of-the-art performance through intention-annotated datasets and tokenized motion representations.

Authors:Pinxin Liu, Haiyang Liu, Luchuan Song, Jason J. Corso, Chenliang Xu
Title: Intentional Gesture: Deliver Your Intentions with Gestures for Speech
Abstract:
When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but are semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations. It injects high-level communicative functions (e.g., intentions) into tokenized motion representations to enable intention-aware gesture synthesis that is both temporally aligned and semantically meaningful, achieving new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: https://andypinxinliu.github.io/Intentional-Gesture
中文:提出的Intentional-Gesture框架通过将交流意图融入动作合成,解决了当前手势生成方法语义浅薄的问题,利用意图标注数据集和标记化动作表征实现了最先进的性能。
English: The proposed Intentional-Gesture framework addresses the semantic limitations of current gesture generation methods by incorporating communicative intentions into motion synthesis, achieving state-of-the-art performance through intention-annotated datasets and tokenized motion representations.

Authors:Jinghui Lu, Haiyang Yu, Siliang Xu, Shiwei Ran, Guozhi Tang, Siqi Wang, Bin Shan, Teng Fu, Hao Feng, Jingqun Tang, Han Wang, Can Huang
Title: Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning
Abstract:
Recent advancements in reasoning have significantly enhanced the capabilities of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) across diverse tasks. However, excessive reliance on chain-of-thought (CoT) reasoning can impair model performance and produce unnecessarily long outputs, reducing efficiency. Our work reveals that prolonged reasoning does not universally improve accuracy and can even degrade performance on simpler tasks. To address this, we propose Certainty-based Adaptive Reasoning (CAR), a novel framework that dynamically switches between short answers and long-form reasoning based on the model's perplexity. CAR first generates a short answer and evaluates its perplexity, triggering reasoning only when the model exhibits low confidence (i.e., high perplexity). Experiments across diverse multimodal VQA/KIE benchmarks and text reasoning datasets show that CAR outperforms both short-answer and long-form reasoning approaches, striking an optimal balance between accuracy and efficiency.
中文摘要:CAR框架根据模型置信度动态调整推理长度,通过平衡准确性与效率,在多种任务中均优于简短回答和长链推理方法。
English Summary: The CAR framework dynamically adjusts reasoning length based on model confidence, outperforming both short and long reasoning methods by balancing accuracy and efficiency across diverse tasks.
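
The routing rule described in the abstract (answer briefly first, reason at length only when perplexity is high) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the fixed `threshold`, and the assumption that per-token natural-log probabilities of the short answer are available are all hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def route(short_answer_logprobs, threshold=1.5):
    """Decide whether to keep the short answer or fall back to long-form reasoning.

    `threshold` is a hypothetical fixed cutoff chosen for illustration;
    the abstract does not say how the decision boundary is set.
    """
    if perplexity(short_answer_logprobs) > threshold:
        return "reason"  # low confidence: trigger long-form reasoning
    return "short"       # high confidence: keep the short answer


# A confident short answer (token probability 0.9) versus an uncertain one (0.5).
confident = [math.log(0.9)] * 4
uncertain = [math.log(0.5)] * 4
print(route(confident), route(uncertain))
```

A fixed cutoff is the simplest possible choice; any calibrated decision rule over perplexity would fit the same interface.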

Authors:Hongli Zhou, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, Wei Bao, Jian Dong, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao
Title: Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory
Abstract:
The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining mainstream prominent LLM benchmarks using results from diverse models. We first propose Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT can be utilized for accurate and reliable estimations of item characteristics and model abilities. Based on PSN-IRT, we conduct extensive analysis on 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we demonstrate that leveraging PSN-IRT is able to construct smaller benchmarks while maintaining stronger alignment with human preference.
中文: 本文批判性分析了大型语言模型基准测试的有效性,提出了基于项目反应理论的增强框架PSN-IRT,该框架不仅揭示了当前基准测试存在的显著测量缺陷,还能构建出更精简且更符合人类偏好的评估体系。
English: This paper critically analyzes the effectiveness of LLM benchmarks and introduces PSN-IRT, an enhanced Item Response Theory framework that reveals significant measurement flaws in current benchmarks while enabling the creation of more compact yet human-aligned evaluation tools.
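
For readers unfamiliar with Item Response Theory, the classical four-parameter logistic (4PL) item response function gives a feel for what "item parameters" and "model abilities" mean. This is the textbook form, shown as background only; the abstract does not state which parameters PSN-IRT uses, and the parameter names below are the conventional ones, not taken from the paper.

```python
import math


def irt_4pl(theta, a, b, c=0.0, d=1.0):
    """Probability that a test-taker with ability `theta` answers an item correctly.

    a: discrimination, b: difficulty, c: guessing floor, d: upper asymptote.
    With c=0 and d=1 this reduces to the two-parameter logistic model.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


# An item of difficulty 0 answered by models of increasing ability.
for ability in (-2.0, 0.0, 2.0):
    print(ability, round(irt_4pl(ability, a=1.5, b=0.0, c=0.25), 3))
```

An item with low discrimination `a` yields nearly the same probability for weak and strong models, which is one way poor separability among top models can arise.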

Authors:Jiahao Yu, Haozhuang Liu, Yeqiu Yang, Lu Chen, Wu Jian, Yuning Jiang, Bo Zheng
Title: TranSUN: A Preemptive Paradigm to Eradicate Retransformation Bias Intrinsically from Regression Models in Recommender Systems
Abstract:
Regression models are crucial in recommender systems. However, the retransformation bias problem has been conspicuously neglected within the community. While many works in other fields have devised effective bias correction methods, all of them are post-hoc cures external to the model, facing practical challenges when applied to real-world recommender systems. Hence, we propose a preemptive paradigm to eradicate the bias intrinsically from the models via minor model refinement. Specifically, a novel TranSUN method is proposed that learns the bias jointly, offering theoretically guaranteed unbiasedness with empirically superior convergence. It is further generalized into a novel generic regression model family, termed Generalized TranSUN (GTS), which not only offers more theoretical insights but also serves as a generic framework for flexibly developing various bias-free models. Comprehensive experimental results demonstrate the superiority of our methods across data from various domains, which have been successfully deployed in two real-world industrial recommendation scenarios, i.e. product and short video recommendation scenarios in Guess What You Like business domain in the homepage of Taobao App (a leading e-commerce platform), to serve the major online traffic. Codes will be released after this paper is published.
中文摘要:本文提出了一种新颖的预防性范式及TranSUN方法,通过模型微调从本质上消除推荐系统中的重变换偏差,该方法已在淘宝主要推荐场景中成功部署应用。
English Summary: This paper introduces a novel preemptive paradigm and TranSUN method to intrinsically eliminate retransformation bias in recommender systems through model refinement, which has been successfully deployed in Taobao's major recommendation scenarios.
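
The retransformation bias that the abstract refers to can be seen in a small worked example. The sketch below assumes a log-transformed target, the classic setting for this bias; it only illustrates the problem and is not the TranSUN method, whose details are not given in the abstract.

```python
import math


def naive_retransform(ys):
    """Fit the mean in log space, then map back with exp.

    This is what a regressor trained on log(y) effectively predicts
    when its output is exponentiated without any correction.
    """
    mean_log = sum(math.log(y) for y in ys) / len(ys)
    return math.exp(mean_log)


def arithmetic_mean(ys):
    """The quantity we actually want: the mean of y in the original space."""
    return sum(ys) / len(ys)


targets = [1.0, 10.0, 100.0]
print(naive_retransform(targets))  # geometric mean of the targets
print(arithmetic_mean(targets))    # arithmetic mean of the targets
```

By Jensen's inequality the exponentiated log-space mean (the geometric mean) never exceeds the arithmetic mean, so the naive back-transform systematically underestimates the target.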

Authors:Jiahao Yu, Haozhuang Liu, Yeqiu Yang, Lu Chen, Jian Wu, Yuning Jiang, Bo Zheng
Title: TranSUN: A Preemptive Paradigm to Eradicate Retransformation Bias Intrinsically from Regression Models in Recommender Systems
Abstract:
Regression models are crucial in recommender systems. However, the retransformation bias problem has been conspicuously neglected within the community. While many works in other fields have devised effective bias correction methods, all of them are post-hoc cures external to the model, facing practical challenges when applied to real-world recommender systems. Hence, we propose a preemptive paradigm to eradicate the bias intrinsically from the models via minor model refinement. Specifically, a novel TranSUN method is proposed that learns the bias jointly, offering theoretically guaranteed unbiasedness with empirically superior convergence. It is further generalized into a novel generic regression model family, termed Generalized TranSUN (GTS), which not only offers more theoretical insights but also serves as a generic framework for flexibly developing various bias-free models. Comprehensive experimental results demonstrate the superiority of our methods across data from various domains, which have been successfully deployed in two real-world industrial recommendation scenarios, i.e. product and short video recommendation scenarios in Guess What You Like business domain in the homepage of Taobao App (a leading e-commerce platform with DAU > 300M), to serve the major online traffic.
中文摘要:本文提出了一种新颖的预防性范式及TranSUN方法,通过模型微调从本质上消除推荐系统中的重变换偏差,该方法已在淘宝主要推荐场景中成功部署应用。
English Summary: This paper introduces a novel preemptive paradigm and TranSUN method to intrinsically eliminate retransformation bias in recommender systems through model refinement, which has been successfully deployed in Taobao's major recommendation scenarios.

Authors:Aymeric Capitaine, Maxime Haddouche, Eric Moulines, Michael I. Jordan, Etienne Boursier, Alain Durmus
Title: Online Decision-Focused Learning
Abstract:
Decision-focused learning (DFL) is an increasingly popular paradigm for training predictive models whose outputs are used in decision-making tasks. Instead of merely optimizing for predictive accuracy, DFL trains models to directly minimize the loss associated with downstream decisions. This end-to-end strategy holds promise for tackling complex combinatorial problems; however, existing studies focus solely on scenarios where a fixed batch of data is available and the objective function does not change over time. We instead investigate DFL in dynamic environments where the objective function and data distribution evolve over time. This setting is challenging because the objective function has zero or undefined gradients -- which prevents the use of standard first-order optimization methods -- and is generally non-convex. To address these difficulties, we (i) regularize the objective to make it differentiable and (ii) make use of the optimism principle, based on a near-optimal oracle along with an appropriate perturbation. This leads to a practical online algorithm for which we establish bounds on the expected dynamic regret, both when the decision space is a simplex and when it is a general bounded convex polytope. Finally, we demonstrate the effectiveness of our algorithm by comparing its performance with a classic prediction-focused approach on a simple knapsack experiment.
Chinese: 本研究针对动态环境中的决策聚焦学习,提出了一种实用的在线算法,通过正则化和基于乐观原则的扰动技术解决了目标函数不可微和非凸等难题,给出了期望动态遗憾界,并在背包实验中与经典的预测聚焦方法进行了对比。
English: This study introduces a practical online algorithm for decision-focused learning in dynamic environments, addressing non-differentiable and non-convex objectives through regularization and optimism-based perturbation, establishing bounds on the expected dynamic regret, and comparing against a classic prediction-focused approach on a knapsack experiment.

Authors:Aymeric Capitaine, Maxime Haddouche, Eric Moulines, Michael I. Jordan, Etienne Boursier, Alain Durmus
Title: Online Decision-Focused Learning
Abstract:
Decision-focused learning (DFL) is an increasingly popular paradigm for training predictive models whose outputs are used in decision-making tasks. Instead of merely optimizing for predictive accuracy, DFL trains models to directly minimize the loss associated with downstream decisions. However, existing studies focus solely on scenarios where a fixed batch of data is available and the objective function does not change over time. We instead investigate DFL in dynamic environments where the objective function and data distribution evolve over time. This setting is challenging for online learning because the objective function has zero or undefined gradients -- which prevents the use of standard first-order optimization methods -- and is generally non-convex. To address these difficulties, we (i) regularize the objective to make it differentiable and (ii) use perturbation techniques along with a near-optimal oracle to overcome non-convexity. Combining those techniques yields two original online algorithms tailored for DFL, for which we establish respectively static and dynamic regret bounds. These are the first provable guarantees for the online decision-focused problem. Finally, we showcase the effectiveness of our algorithms on a knapsack experiment, where they outperform two standard benchmarks.
Chinese: 本研究针对动态环境中的决策聚焦学习,提出了两种创新的在线算法,通过正则化和扰动技术解决了目标函数不可微和非凸等难题,并凭借理论保证和实验验证展示了其优越性能。
English: This study introduces two novel online algorithms for decision-focused learning in dynamic environments, addressing challenges like non-differentiable objectives and non-convexity through regularization and perturbation techniques, and demonstrates their superior performance with provable guarantees and experimental validation.

Authors:Zekun Cai, Yiheng Yao, Guangji Bai, Renhe Jiang, Xuan Song, Ryosuke Shibasaki, Liang Zhao
Title: Continuous Domain Generalization
Abstract:
Real-world data distributions often shift continuously across multiple latent factors such as time, geography, and socioeconomic context. However, existing domain generalization approaches typically treat domains as discrete or evolving along a single axis (e.g., time), which fails to capture the complex, multi-dimensional nature of real-world variation. This paper introduces the task of Continuous Domain Generalization (CDG), which aims to generalize predictive models to unseen domains defined by arbitrary combinations of continuous variation descriptors. We present a principled framework grounded in geometric and algebraic theory, showing that optimal model parameters across domains lie on a low-dimensional manifold. To model this structure, we propose a Neural Lie Transport Operator (NeuralLTO), which enables structured parameter transitions by enforcing geometric continuity and algebraic consistency. To handle noisy or incomplete domain descriptors, we introduce a gating mechanism to suppress irrelevant dimensions and a local chart-based strategy for robust generalization. Extensive experiments on synthetic and real-world datasets (including remote sensing, scientific documents, and traffic forecasting) demonstrate that our method significantly outperforms existing baselines in generalization accuracy and robustness under descriptor imperfections.
中文摘要:本文提出连续域泛化(CDG)任务,通过神经李传输算子建模参数流形结构,结合门控机制与局部坐标策略处理不完整描述符,在遥感、文献和交通预测等数据集中实现了优于现有方法的泛化性能。
English Summary: This paper introduces Continuous Domain Generalization (CDG), addressing multi-dimensional real-world data shifts through a Neural Lie Transport Operator that models parameter manifolds and handles imperfect descriptors with gating mechanisms and local charts, achieving superior generalization across diverse datasets.

Authors:David Noever, Forrest McKee
Title: Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale
Abstract:
This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development. This work presents a new benchmark that evaluates LLMs on freelance programming and data analysis tasks derived from economic data. We construct the benchmark using synthetic tasks created from a Kaggle Freelancer dataset of job postings, with all job prices standardized to USD (median fixed-project price around $250, and an average of $306). Each task is accompanied by structured input-output test cases and an estimated price tag, enabling automated correctness checking and a monetary performance valuation. This approach is inspired by OpenAI's recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1M total). Still, our framework simplifies evaluation using programmatically testable tasks and predicted price values, making it highly scalable and repeatable. On this benchmark, we evaluate four modern LLMs - Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. We report each model's accuracy (task success rate and test-case pass rate) and the total "freelance earnings" it achieves (sum of prices of solved tasks). Our results show that Claude 3.5 Haiku performs best, earning approximately $1.52 million USD, followed closely by GPT-4o-mini at $1.49 million, then Qwen 2.5 ($1.33M) and Mistral ($0.70M). We analyze the distribution of errors per task and observe that the strongest models solve the most tasks and rarely fail completely on any project. We discuss the implications of these results for the feasibility of AI as a freelance developer, the advantages and limitations of our automated benchmark approach, and the gap between performance on structured tasks versus the true complexity of real-world freelance jobs.
中文: 本研究通过构建可扩展的自动评估框架测试大语言模型在自由职业开发中的表现,结果显示Claude 3.5 Haiku模型以152万美元模拟收入领先,其在标准化编程与数据分析任务中展现出最优解决能力。
English: This study introduces a scalable benchmark to evaluate Large Language Models as autonomous freelance developers, finding Claude 3.5 Haiku achieves the highest simulated earnings of $1.52 million by successfully completing standardized programming and data analysis tasks.
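
The three reported quantities (task success rate, test-case pass rate, and "freelance earnings" as the sum of prices of solved tasks) can be computed as in the sketch below. The `TaskResult` record and the rule that a task counts as solved only when every test case passes are assumptions made for illustration; the abstract does not define the success criterion.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    price_usd: float   # estimated price tag attached to the task
    tests_passed: int  # number of test cases the model's solution passed
    tests_total: int   # number of test cases for the task

    @property
    def solved(self) -> bool:
        # Assumption: a task is solved only if all of its test cases pass.
        return self.tests_total > 0 and self.tests_passed == self.tests_total


def summarize(results):
    """Aggregate per-task results into the three benchmark metrics."""
    total_tests = sum(r.tests_total for r in results)
    if not results or total_tests == 0:
        raise ValueError("need at least one task with at least one test case")
    return {
        "task_success_rate": sum(r.solved for r in results) / len(results),
        "test_pass_rate": sum(r.tests_passed for r in results) / total_tests,
        "earnings_usd": sum(r.price_usd for r in results if r.solved),
    }


results = [TaskResult(250.0, 4, 4), TaskResult(100.0, 2, 4), TaskResult(50.0, 4, 4)]
print(summarize(results))
```

Under this rule a partially correct solution contributes to the test-case pass rate but earns nothing, which is why the two accuracy figures can diverge.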

Authors:Jia-Hui Pan, Yeok Tatt Cheah, Zhengzhe Liu, Ka-Hei Hui, Xiaojie Gao, Pheng-Ann Heng, Yun-Hui Liu, Chi-Wing Fu
Title: OPA-Pack: Object-Property-Aware Robotic Bin Packing
Abstract:
Robotic bin packing aids in a wide range of real-world scenarios such as e-commerce and warehouses. Yet, existing works focus mainly on considering the shape of objects to optimize packing compactness and neglect object properties such as fragility, edibility, and chemistry that humans typically consider when packing objects. This paper presents OPA-Pack (Object-Property-Aware Packing framework), the first framework that equips the robot with object property considerations in planning the object packing. Technically, we develop a novel object property recognition scheme with retrieval-augmented generation and chain-of-thought reasoning, and build a dataset with object property annotations for 1,032 everyday objects. Also, we formulate OPA-Net, aiming to jointly separate incompatible object pairs and reduce pressure on fragile objects, while compacting the packing. Further, OPA-Net consists of a property embedding layer to encode the property of candidate objects to be packed, together with a fragility heightmap and an avoidance heightmap to keep track of the packed objects. Then, we design a reward function and adopt a deep Q-learning scheme to train OPA-Net. Experimental results show that OPA-Pack greatly improves the accuracy of separating incompatible object pairs (from 52% to 95%) and largely reduces pressure on fragile objects (by 29.4%), while maintaining good packing compactness. Besides, we demonstrate the effectiveness of OPA-Pack on a real packing platform, showcasing its practicality in real-world scenarios.
中文: 本文提出OPA-Pack框架,首次在机器人打包规划中引入物体属性考量,通过新型属性识别和网络设计,显著提升不兼容物体分离准确率并大幅减少易碎物压力,同时保持良好的打包紧凑性。
English: This paper introduces OPA-Pack, the first robotic packing framework that incorporates object properties like fragility and compatibility into packing planning, significantly improving separation of incompatible pairs and reducing pressure on fragile objects while maintaining packing compactness.

Authors:Xiang Fei, Jinghui Lu, Qi Sun, Hao Feng, Yanjie Wang, Wei Shi, An-Lan Wang, Jingqun Tang, Can Huang
Title: Advancing Sequential Numerical Prediction in Autoregressive Models
Abstract:
Autoregressive models have become the de facto choice for sequence generation tasks, but standard approaches treat digits as independent tokens and apply cross-entropy loss, overlooking the coherent structure of numerical sequences. This paper introduces Numerical Token Integrity Loss (NTIL) to address this gap. NTIL operates at two levels: (1) token-level, where it extends the Earth Mover's Distance (EMD) to preserve ordinal relationships between numerical values, and (2) sequence-level, where it penalizes the overall discrepancy between the predicted and actual sequences. This dual approach improves numerical prediction and integrates effectively with LLMs/MLLMs. Extensive experiments show significant performance improvements with NTIL.
中文: 本文提出数值标记完整性损失(NTIL),通过标记级有序关系保持和序列级差异惩罚的双层方法,增强自回归模型的数值连贯性,显著提升模型性能。
English: This paper proposes the Numerical Token Integrity Loss (NTIL), a dual-level approach that enhances autoregressive models by preserving numerical coherence through token-level ordinal relationships and sequence-level discrepancy penalties, leading to significant performance improvements.
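
The token-level idea (an Earth Mover's Distance that respects the ordering of digits, unlike cross-entropy) can be illustrated with the standard one-dimensional EMD between two distributions over the digits 0 to 9. This is a sketch of the underlying distance only, assuming unit spacing between adjacent digits; how NTIL extends it and combines it with the sequence-level term is not described in the abstract.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Earth Mover's Distance between two distributions on ordered classes.

    For unit-spaced ordinal classes, EMD equals the sum of absolute
    differences between the two cumulative distribution functions.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def one_hot(digit, num_classes=10):
    """Distribution that puts all probability mass on a single digit."""
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]


target = one_hot(3)
print(emd_1d(one_hot(4), target))  # near miss: small distance
print(emd_1d(one_hot(9), target))  # far miss: large distance
```

Cross-entropy would penalize the predictions "4" and "9" identically when the target is "3", whereas this distance grows with how far the predicted digit is from the target.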

Authors:Zeyi Ren, Jingreng Lei, Yichen Jin, Ermo Hua, Qingfeng Lin, Chen Zhang, Bowen Zhou, Yik-Chung Wu
Title: Deep Unfolding with Kernel-based Quantization in MIMO Detection
Abstract:
The development of edge computing places critical demands on energy-efficient model deployment for multiple-input multiple-output (MIMO) detection tasks. Deploying deep unfolding models such as PGD-Nets and ADMM-Nets into resource-constrained edge devices using quantization methods is challenging. Existing quantization methods based on quantization aware training (QAT) suffer from performance degradation due to their reliance on parametric distribution assumptions about activations and static quantization step sizes. To address these challenges, this paper proposes a novel kernel-based adaptive quantization (KAQ) framework for deep unfolding networks. By utilizing a joint kernel density estimation (KDE) and maximum mean discrepancy (MMD) approach to align activation distributions between full-precision and quantized models, the need for prior distribution assumptions is eliminated. Additionally, a dynamic step size updating method is introduced to adjust the quantization step size based on the channel conditions of wireless networks. Extensive simulations demonstrate that the proposed KAQ framework outperforms traditional methods in accuracy and reduces the model's inference latency.
Chinese: 本文提出了一种基于核的自适应量化(KAQ)框架,通过核密度估计消除对激活分布假设的依赖,并根据信道条件动态调整量化步长,显著提升了深度展开网络在MIMO检测中的精度并降低了推理延迟。
English: This paper introduces a kernel-based adaptive quantization (KAQ) framework that enhances deep unfolding networks for MIMO detection by eliminating reliance on activation distribution assumptions through kernel density estimation and dynamically adjusting quantization step sizes based on channel conditions, achieving superior accuracy and reduced inference latency.
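
To make the distribution-alignment idea concrete, the sketch below measures the maximum mean discrepancy (MMD) between full-precision activations and their uniformly quantized counterparts. It uses the standard biased squared-MMD estimator with a Gaussian kernel on scalar samples; the uniform quantizer, the bandwidth, and the sample values are illustrative assumptions, and the paper's joint KDE and MMD objective and its dynamic step-size rule are not reproduced here.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two scalar samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def quantize(xs, step):
    """Uniform quantizer with a fixed step size (illustrative, not the paper's)."""
    return [round(x / step) * step for x in xs]


activations = [0.1, 0.4, 0.7, 1.2]
for step in (0.1, 1.0):
    print(step, mmd_squared(activations, quantize(activations, step)))
```

A coarser step moves the quantized activations further from the full-precision ones, and the MMD value grows accordingly; a training loss can penalize this discrepancy without assuming any parametric form for the activation distribution.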

Authors:Shuo Wang, Yongcai Wang, Wanting Li, Xudong Cai, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Deying Li, Zhaoxin Fan
Title: Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation
Abstract:
Vision-Language Navigation (VLN) is a critical task for developing embodied agents that can follow natural language instructions to navigate in complex real-world environments. Recent advances in VLN by large pretrained models have significantly improved generalization and instruction grounding compared to traditional approaches. However, the role of reasoning strategies in navigation, an action-centric, long-horizon task, remains underexplored, despite Chain-of-Thought (CoT) reasoning's demonstrated success in static tasks like visual question answering. To address this gap, we conduct the first systematic evaluation of reasoning strategies for VLN, including No-Think (direct action prediction), Pre-Think (reason before action), and Post-Think (reason after action). Surprisingly, our findings reveal the Inference-time Reasoning Collapse issue, where inference-time reasoning degrades navigation accuracy, highlighting the challenges of integrating reasoning into VLN. Based on this insight, we propose Aux-Think, a framework that trains models to internalize structured reasoning patterns through CoT supervision, while inferring action directly without reasoning in online prediction. To support this framework, we release R2R-CoT-320k, the first Chain-of-Thought annotated dataset for VLN. Extensive experiments show that Aux-Think reduces training effort greatly and achieves the best performance under the same data scale.
中文: 本研究提出Aux-Think框架,通过思维链监督训练模型内化推理能力,在推理时直接预测动作,解决了视觉语言导航中新发现的推理崩溃问题,并以更少训练成本实现了最佳性能。
English: This study introduces Aux-Think, a framework that trains models to internalize reasoning through Chain-of-Thought supervision while directly predicting actions during inference, addressing the newly identified Inference-time Reasoning Collapse issue in Vision-Language Navigation and achieving superior performance with reduced training effort.

Authors:Xiaoli Lian, Shuaisong Wang, Hanyu Zou, Fang Liu, Jiajun Wu, Li Zhang
Title: Incorporating Verification Standards for Security Requirements Generation from Functional Specifications
Abstract:
In the current software-driven era, ensuring privacy and security is critical. Despite this, the specification of security requirements for software is still largely a manual and labor-intensive process. Engineers are tasked with analyzing potential security threats based on functional requirements (FRs), a procedure prone to omissions and errors due to the expertise gap between cybersecurity experts and software engineers. To bridge this gap, we introduce F2SRD (Function to Security Requirements Derivation), an automated approach that proactively derives security requirements (SRs) from functional specifications under the guidance of relevant security verification requirements (VRs) drawn from the well-recognized OWASP Application Security Verification Standard (ASVS). F2SRD operates in two main phases: initially, we develop a VR retriever trained on a custom database of FR and VR pairs, enabling it to adeptly select applicable VRs from ASVS. This targeted retrieval informs the precise and actionable formulation of SRs. Subsequently, these VRs are used to construct structured prompts that direct GPT4 in generating SRs. Our comparative analysis against two established models demonstrates F2SRD's enhanced performance in producing SRs that excel in inspiration, diversity, and specificity, essential attributes for effective security requirement generation. By leveraging security verification standards, we believe that the generated SRs are not only more focused but also resonate more strongly with the needs of engineers.
中文: F2SRD是一种自动化方法,通过利用OWASP ASVS标准从功能规范中主动推导出精确的安全需求,其在生成具有启发性、多样性和针对性的安全需求方面优于现有模型。
English: F2SRD is an automated method that proactively derives precise security requirements from functional specifications by leveraging the OWASP ASVS standard, outperforming existing models in generating inspired, diverse, and specific security requirements.

Authors:Derek Ming Siang Tan, Shailesh, Boyang Liu, Alok Raj, Qi Xuan Ang, Weiheng Dai, Tanishq Duhan, Jimmy Chiun, Yuhong Cao, Florian Shkurti, Guillaume Sartoretti
Title: Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild
Abstract:
To perform outdoor autonomous visual navigation and search, a robot may leverage satellite imagery as a prior map. This can help inform high-level search and exploration strategies, even when such images lack sufficient resolution to allow for visual recognition of targets. However, there are limited training datasets of satellite images with annotated targets that are not directly visible. Furthermore, approaches which leverage large Vision Language Models (VLMs) for generalization may yield inaccurate outputs due to hallucination, leading to inefficient search. To address these challenges, we introduce Search-TTA, a multimodal test-time adaptation framework with a flexible plug-and-play interface compatible with various input modalities (e.g. image, text, sound) and planning methods. First, we pretrain a satellite image encoder to align with CLIP's visual encoder to output probability distributions of target presence used for visual search. Second, our framework dynamically refines CLIP's predictions during search using a test-time adaptation mechanism. Through a novel feedback loop inspired by Spatial Poisson Point Processes, uncertainty-weighted gradient updates are used to correct potentially inaccurate predictions and improve search performance. To train and evaluate Search-TTA, we curate AVS-Bench, a visual search dataset based on internet-scale ecological data that contains up to 380k training and 8k validation images (in- and out-domain). We find that Search-TTA improves planner performance by up to 30.0%, particularly in cases with poor initial CLIP predictions due to limited training data. It also performs comparably with significantly larger VLMs, and achieves zero-shot generalization to unseen modalities. Finally, we deploy Search-TTA on a real UAV via hardware-in-the-loop testing, by simulating its operation within a large-scale simulation that provides onboard sensing.
中文: 本文提出Search-TTA多模态测试时适应框架,通过动态优化CLIP预测来提升机器人视觉搜索能力,使规划器性能提高达30%,并能实现跨模态的零样本泛化。
English: The paper introduces Search-TTA, a multimodal test-time adaptation framework that enhances visual search in robots by refining CLIP predictions during operation, improving planner performance by up to 30% and enabling zero-shot generalization across modalities.

Authors:Zhirui Fang, Kai Yang, Jian Tao, Jiafei Lyu, Lusong Li, Li Shen, Xiu Li
Title: Exploration by Random Distribution Distillation
Abstract:
Exploration remains a critical challenge in online reinforcement learning, as an agent must effectively explore unknown environments to achieve high returns. Currently, the main exploration algorithms are primarily count-based methods and curiosity-based methods, with prediction-error methods being a prominent example. In this paper, we propose a novel method called Random Distribution Distillation (RDD), which samples the output of a target network from a normal distribution. RDD facilitates a more extensive exploration by explicitly treating the difference between the prediction network and the target network as an intrinsic reward. Furthermore, by introducing randomness into the output of the target network for a given state and modeling it as a sample from a normal distribution, intrinsic rewards are bounded by two key components: a pseudo-count term ensuring proper exploration decay and a discrepancy term accounting for predictor convergence. We demonstrate that RDD effectively unifies both count-based and prediction-error approaches. It retains the advantages of prediction-error methods in high-dimensional spaces, while also implementing an intrinsic reward decay mode akin to the pseudo-count method. In the experimental section, RDD is compared with more advanced methods in a series of environments. Both theoretical analysis and experimental results confirm the effectiveness of our approach in improving online exploration for reinforcement learning tasks.
Chinese: 本文提出了随机分布蒸馏(RDD)这一新方法,通过从正态分布采样目标网络输出并将预测误差作为内在奖励,有效结合了基于计数和基于预测误差的方法,从而增强了在线强化学习中的探索能力。
English: The paper introduces Random Distribution Distillation (RDD), a novel method that enhances exploration in online reinforcement learning by sampling target network outputs from a normal distribution and using the prediction error as an intrinsic reward, effectively unifying count-based and prediction-error approaches.

Authors:Ratnadira Widyasari, Martin Weyssow, Ivana Clairine Irsan, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, Hong Jin Kang, David Lo
Title: Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents
Abstract:
Detecting vulnerabilities in source code remains a critical yet challenging task, especially when benign and vulnerable functions share significant similarities. In this work, we introduce VulTrial, a courtroom-inspired multi-agent framework designed to enhance automated vulnerability detection. It employs four role-specific agents: a security researcher, a code author, a moderator, and a review board. Through extensive experiments using GPT-3.5 and GPT-4o, we demonstrate that VulTrial outperforms single-agent and multi-agent baselines. Using GPT-4o, VulTrial improves the performance by 102.39% and 84.17% over its respective baselines. Additionally, we show that role-specific instruction tuning in the multi-agent setting with small data (50 pair samples) improves the performance of VulTrial further by 139.89% and 118.30%. Furthermore, we analyze the impact of increasing the number of agent interactions on VulTrial's overall performance. While multi-agent setups inherently incur higher costs due to increased token usage, our findings reveal that applying VulTrial to a cost-effective model like GPT-3.5 can improve its performance by 69.89% compared to GPT-4o in a single-agent setting, at a lower overall cost.
中文: VulTrial是一个法庭启发的多智能体框架,通过四个专业角色显著提升漏洞检测性能,在使用GPT-4o时比基线提升超100%,并通过角色特定微调获得进一步改善,同时证明使用GPT-3.5能在更低成本下实现性能提升。
English: VulTrial, a courtroom-inspired multi-agent framework with four specialized roles, significantly enhances vulnerability detection by outperforming single-agent and multi-agent baselines, achieving over 100% performance improvement with GPT-4o and further gains through role-specific tuning.

Authors:Qing Yu, Xiaobei Wang, Shuchang Liu, Yandong Bai, Xiaoyu Yang, Xueliang Wang, Chang Meng, Shanshan Wu, Hailan Yang, Huihui Xiao, Xiang Li, Fan Yang, Xiaoqiang Feng, Lantao Hu, Han Li, Kun Gai, Lixin Zou
Title: Who You Are Matters: Bridging Topics and Social Roles via LLM-Enhanced Logical Recommendation
Abstract:
Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focuses on discovering and modeling item topics (e.g., categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of user characteristics and their social roles, which are logical confounders influencing the correlated interest and user preference transition. To bridge this gap, we introduce the user role identification task and the behavioral logic modeling task that aim to explicitly model user roles and learn the logical relations between item topics and user social roles. We show that it is possible to explicitly solve these tasks through an efficient integration framework of Large Language Model (LLM) and recommendation systems, for which we propose TagCF. On the one hand, TagCF exploits the (Multi-modal) LLM's world knowledge and logic inference ability to extract realistic tag-based virtual logic graphs that reveal dynamic and expressive knowledge of users, refining our understanding of user behaviors. On the other hand, TagCF presents empirically effective integration modules that take advantage of the extracted tag-logic information, augmenting the recommendation performance. We conduct both online experiments and offline experiments with industrial and public datasets as verification of TagCF's effectiveness, and we empirically show that the user role modeling strategy is potentially a better choice than the modeling of item topics. Additionally, we provide evidence that the extracted logic graphs are empirically a general and transferable knowledge that can benefit a wide range of recommendation tasks.
中文: 该摘要提出了TagCF框架,通过整合大语言模型与推荐系统来建模用户社会角色和行为逻辑,弥补了传统方法忽视用户特征的不足,并经验证显著提升了推荐性能。
English: This abstract introduces TagCF, a framework that integrates Large Language Models with recommender systems to model user social roles and behavioral logic, addressing the neglect of user characteristics in traditional approaches and demonstrating enhanced performance through empirical validation.

Authors:Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman
Title: MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
Abstract:
The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.
中文:MMLongBench作为首个全面评估长上下文视觉语言模型的基准,涵盖多样化任务和图像类型,揭示了现有模型在长上下文任务中仍面临挑战,且更强的推理能力与更优表现相关。
English: MMLongBench is introduced as the first comprehensive benchmark for evaluating long-context vision-language models across diverse tasks and image types, revealing that current models still struggle with long-context challenges and that stronger reasoning correlates with better performance.

Authors:Chengsen Wang, Qi Qi, Zhongwen Rao, Lujia Pan, Jingyu Wang, Jianxin Liao
Title: ChronoSteer: Bridging Large Language Model and Time Series Foundation Model via Synthetic Data
Abstract:
Conventional forecasting methods rely on unimodal time series data, limiting their ability to exploit rich textual information. Recently, large language models (LLMs) and time series foundation models (TSFMs) have demonstrated powerful capability in textual reasoning and temporal modeling, respectively. Integrating the strengths of both to construct a multimodal model that concurrently leverages both temporal and textual information for future inference has emerged as a critical research challenge. To address the scarcity of event-series paired data, we propose a decoupled framework: an LLM is employed to transform textual events into revision instructions, which are then used to steer the output of TSFM. To implement this framework, we introduce ChronoSteer, a multimodal TSFM that can be steered through textual revision instructions, effectively bridging LLM and TSFM. Moreover, to mitigate the shortage of cross-modal instruction-series paired data, we devise a two-stage training strategy based on synthetic data. In addition, we also construct a high-quality multimodal time series forecasting benchmark to address the information leakage concerns during evaluation. After integrating with an LLM, ChronoSteer, which is trained exclusively on synthetic data, achieves a 25.7% improvement in prediction accuracy compared to the unimodal backbone and a 22.5% gain over the previous state-of-the-art multimodal method.
中文:ChronoSteer是一种创新的多模态框架,通过文本修订指令整合大型语言模型与时间序列基础模型,在利用时间和文本数据的基础上显著提升了预测准确性。
English: ChronoSteer is a novel multimodal framework that integrates large language models and time series foundation models through textual revision instructions, achieving significant accuracy improvements in forecasting by leveraging both temporal and textual data.

Authors:Shivam Sood, Laukik B Nakhwa, Yuhong Cao, Sun Ge, Guillaume Sartoretti
Title: APEX: Action Priors Enable Efficient Exploration for Skill Imitation on Articulated Robots
Abstract:
Learning by imitation provides an effective way for robots to develop well-regulated complex behaviors and directly benefit from natural demonstrations. State-of-the-art imitation learning (IL) approaches typically leverage Adversarial Motion Priors (AMP), which, despite their impressive results, suffer from two key limitations. They are prone to mode collapse, which often leads to overfitting to the simulation environment and thus increased sim-to-real gap, and they struggle to learn diverse behaviors effectively. To overcome these limitations, we introduce APEX (Action Priors enable Efficient eXploration): a simple yet versatile IL framework that integrates demonstrations directly into reinforcement learning (RL), maintaining high exploration while grounding behavior with expert-informed priors. We achieve this through a combination of decaying action priors, which initially bias exploration toward expert demonstrations but gradually allow the policy to explore independently. This is complemented by a multi-critic RL framework that effectively balances stylistic consistency with task performance. Our approach achieves sample-efficient IL and enables the acquisition of diverse skills within a single policy. APEX generalizes to varying velocities and preserves reference-like styles across complex tasks such as navigating rough terrain and climbing stairs, utilizing only flat-terrain kinematic motion data as a prior. We validate our framework through extensive hardware experiments on the Unitree Go2 quadruped. There, APEX yields diverse and agile locomotion gaits, inherent gait transitions, and the highest reported speed for the platform to the best of our knowledge (peak velocity of ~3.3 m/s on hardware). Our results establish APEX as a compelling alternative to existing IL methods, offering better efficiency, adaptability, and real-world performance. https://marmotlab.github.io/APEX/
中文摘要:APEX是一种新颖的模仿学习框架,通过衰减动作先验和多评价器机制将示范融入强化学习,能在四足机器人上高效学习多样化运动技能并实现优越的现实世界性能。
English Summary: APEX is a novel imitation learning framework that integrates demonstrations into reinforcement learning with decaying action priors and a multi-critic setup, enabling efficient learning of diverse locomotion skills with improved real-world performance on quadruped robots.

Authors:Ying Zang, Yuanqi Hu, Xinyu Chen, Yuxia Xu, Suhui Wang, Chunan Yu, Lanyun Zhu, Deyi Ji, Xin Xu, Tianrun Chen
Title: From Air to Wear: Personalized 3D Digital Fashion with AR/VR Immersive 3D Sketching
Abstract:
In the era of immersive consumer electronics, such as AR/VR headsets and smart devices, people increasingly seek ways to express their identity through virtual fashion. However, existing 3D garment design tools remain inaccessible to everyday users due to steep technical barriers and limited data. In this work, we introduce a 3D sketch-driven 3D garment generation framework that empowers ordinary users - even those without design experience - to create high-quality digital clothing through simple 3D sketches in AR/VR environments. By combining a conditional diffusion model, a sketch encoder trained in a shared latent space, and an adaptive curriculum learning strategy, our system interprets imprecise, free-hand input and produces realistic, personalized garments. To address the scarcity of training data, we also introduce KO3DClothes, a new dataset of paired 3D garments and user-created sketches. Extensive experiments and user studies confirm that our method significantly outperforms existing baselines in both fidelity and usability, demonstrating its promise for democratized fashion design on next-generation consumer platforms.
中文: 本文提出一种三维草图驱动的服装生成框架,通过AR/VR环境中的简易草图让普通用户也能创建个性化数字服装,结合条件扩散模型与自适应课程学习策略的新方法,经新数据集和用户研究验证显著提升了生成质量与易用性。
English: This paper presents a 3D sketch-driven garment generation framework that enables novice users to create personalized digital clothing through simple AR/VR sketches, overcoming technical barriers with a novel diffusion model and curriculum learning approach validated by a new dataset and user studies.

Authors:Meritxell Riera-Marin, Sikha O K, Julia Rodriguez-Comas, Matthias Stefan May, Zhaohong Pan, Xiang Zhou, Xiaokun Liang, Franciskus Xaverius Erick, Andrea Prenner, Cedric Hemon, Valentin Boussot, Jean-Louis Dillenseger, Jean-Claude Nunes, Abdul Qayyum, Moona Mazher, Steven A Niederer, Kaisar Kushibar, Carlos Martin-Isla, Petia Radeva, Karim Lekadir, Theodore Barfoot, Luis C. Garcia Peraza Herrera, Ben Glocker, Tom Vercauteren, Lucas Gago, Justin Englemann, Joy-Marie Kleiss, Anton Aubanell, Andreu Antolin, Javier Garcia-Lopez, Miguel A. Gonzalez Ballester, Adrian Galdran
Title: Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results
Abstract:
Deep learning (DL) has become the dominant approach for medical image segmentation, yet ensuring the reliability and clinical applicability of these models requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. This is why we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS), which highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models evaluated using metrics such as Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and dissensus ground truth, we assess how DL models handle uncertainty and whether their confidence estimates align with true segmentation performance. Our findings reinforce the importance of well-calibrated models, as better calibration is strongly correlated with the quality of the results. Furthermore, we demonstrate that segmentation models trained on diverse datasets and enriched with pre-trained knowledge exhibit greater robustness, particularly in cases deviating from standard anatomical structures. Notably, the best-performing models achieved high DSC and well-calibrated uncertainty estimates. This work underscores the need for multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations to develop trustworthy and clinically reliable DL-based medical image segmentation models.
中文:CURVAS框架通过利用多标注者差异性来改进模型校准和不确定性估计,解决了医学图像分割中的关键挑战,研究表明基于多样化数据训练且校准良好的模型能够产生更稳健和临床可靠的结果。
English: The CURVAS challenge addresses key issues in medical image segmentation by using multi-annotator variability to evaluate model calibration and uncertainty estimation; its results indicate that well-calibrated models trained on diverse datasets are more robust and clinically reliable.
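
Note: two of the metrics named in the abstract above, the Dice Similarity Coefficient (DSC) and the Expected Calibration Error (ECE), can be illustrated with a minimal sketch. It uses the standard textbook definitions on flat binary masks and per-prediction confidences; the function names, the 10-bin default, and the equal-width binning are assumptions made for illustration and are not taken from the challenge's evaluation code, which the abstract does not describe.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2|A and B| / (|A| + |B|) for flat binary masks (sequences of 0/1)."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: sample-weighted mean of |average confidence - accuracy| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A lower ECE means that the predicted confidences track the observed accuracy more closely, which is the sense of "well-calibrated" used in the abstract.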

Authors:Ruikun Hou, Babette Bühler, Tim Fütterer, Efe Bozkir, Peter Gerjets, Ulrich Trautwein, Enkelejda Kasneci
Title: Multimodal Assessment of Classroom Discourse Quality: A Text-Centered Attention-Based Multi-Task Learning Approach
Abstract:
Classroom discourse is an essential vehicle through which teaching and learning take place. Assessing different characteristics of discursive practices and linking them to student learning achievement enhances the understanding of teaching quality. Traditional assessments rely on manual coding of classroom observation protocols, which is time-consuming and costly. Despite many studies utilizing AI techniques to analyze classroom discourse at the utterance level, investigations into the evaluation of discursive practices throughout an entire lesson segment remain limited. To address this gap, our study proposes a novel text-centered multimodal fusion architecture to assess the quality of three discourse components grounded in the Global Teaching InSights (GTI) observation protocol: Nature of Discourse, Questioning, and Explanations. First, we employ attention mechanisms to capture inter- and intra-modal interactions from transcript, audio, and video streams. Second, a multi-task learning approach is adopted to jointly predict the quality scores of the three components. Third, we formulate the task as an ordinal classification problem to account for rating level order. The effectiveness of these designed elements is demonstrated through an ablation study on the GTI Germany dataset containing 92 videotaped math lessons. Our results highlight the dominant role of text modality in approaching this task. Integrating acoustic features enhances the model's consistency with human ratings, achieving an overall Quadratic Weighted Kappa score of 0.384, comparable to human inter-rater reliability (0.326). Our study lays the groundwork for the future development of automated discourse quality assessment to support teacher professional development through timely feedback on multidimensional discourse practices.
中文: 本研究提出了一种结合文本、音频和视频的多模态融合模型,用于自动评估课堂话语质量,其表现与人工评估相当,有助于教师专业发展。
English: This study introduces a multimodal fusion model using text, audio, and video to automatically assess classroom discourse quality, achieving performance comparable to human raters and supporting teacher development.
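
Note: the Quadratic Weighted Kappa reported in the abstract above measures agreement between two sets of ordinal ratings, penalizing each disagreement by the squared distance between the rating levels. The sketch below follows the standard definition; the function name and the example ratings are hypothetical, and it is not the authors' evaluation code.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's kappa with quadratic weights for ordinal labels in 0..num_levels-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if num_levels < 2:
        raise ValueError("at least two rating levels are required")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    numerator = 0.0    # weighted disagreement actually observed
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            numerator += weight * observed[(i, j)] / n
            denominator += weight * hist_a[i] * hist_b[j] / (n * n)
    if denominator == 0:
        # Both raters used the same single level throughout; kappa is
        # formally undefined, and this sketch treats it as full agreement.
        return 1.0
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    model_scores = [0, 1, 2, 3, 2, 1]  # hypothetical ordinal predictions
    human_scores = [0, 1, 2, 2, 2, 0]  # hypothetical human ratings
    print(quadratic_weighted_kappa(model_scores, human_scores, num_levels=4))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.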

Authors:Zhuo Song, Ye Zhang, Kunhong Li, Longguang Wang, Yulan Guo
Title: A Unified Hierarchical Framework for Fine-grained Cross-view Geo-localization over Large-scale Scenarios
Abstract:
Cross-view geo-localization is a promising solution for large-scale localization problems, requiring the sequential execution of retrieval and metric localization tasks to achieve fine-grained predictions. However, existing methods typically focus on designing standalone models for these two tasks, resulting in inefficient collaboration and increased training overhead. In this paper, we propose UnifyGeo, a novel unified hierarchical geo-localization framework that integrates retrieval and metric localization tasks into a single network. Specifically, we first employ a unified learning strategy with shared parameters to jointly learn multi-granularity representation, facilitating mutual reinforcement between these two tasks. Subsequently, we design a re-ranking mechanism guided by a dedicated loss function, which enhances geo-localization performance by improving both retrieval accuracy and metric localization references. Extensive experiments demonstrate that UnifyGeo significantly outperforms state-of-the-art methods in both task-isolated and task-associated settings. Remarkably, on the challenging VIGOR benchmark, which supports fine-grained localization evaluation, the 1-meter-level localization recall rate improves from 1.53% to 39.64% and from 0.43% to 25.58% under same-area and cross-area evaluations, respectively. Code will be made publicly available.
中文: UnifyGeo是一个统一的分层地理定位框架,将检索和度量定位任务整合到单一网络中,通过联合多粒度学习和重排序机制显著提升了定位性能。
English: UnifyGeo is a unified hierarchical framework that integrates retrieval and metric localization tasks into a single network, significantly enhancing geo-localization performance through joint multi-granularity learning and a re-ranking mechanism.

Authors:Efe Bozkir, Christian Kosel, Tina Seidel, Enkelejda Kasneci
Title: Automated Visual Attention Detection using Mobile Eye Tracking in Behavioral Classroom Studies
Abstract:
Teachers' visual attention and its distribution across the students in classrooms can have important implications for student engagement, achievement, and professional teacher training. Despite that, inferring where teachers look and which student they focus on is not trivial. Mobile eye tracking can provide vital help to solve this issue; however, the use of mobile eye tracking alone requires a significant amount of manual annotation. To address this limitation, we present an automated processing pipeline concept that requires minimal manually annotated data to recognize which student the teachers focus on. To this end, we utilize state-of-the-art face detection models and face recognition feature embeddings to train face recognition models with transfer learning in the classroom context and combine these models with the teachers' gaze from mobile eye trackers. We evaluated our approach with data collected from four different classrooms, and our results show that while it is possible to estimate the visually focused students with reasonable performance in all of our classroom setups, U-shaped and small classrooms led to the best results with accuracies of approximately 0.7 and 0.9, respectively. While we did not evaluate our method for teacher-student interactions and focused on the validity of the technical approach, as our methodology does not require a vast amount of manually annotated data and offers a non-intrusive way of handling teachers' visual attention, it could help improve instructional strategies, enhance classroom management, and provide feedback for professional teacher development.
中文: 本研究提出一种自动化处理流程,结合人脸检测、识别技术及移动眼动仪数据,以最少人工标注识别教师课堂视觉关注对象,在U形和小型教室中准确率分别达约0.7和0.9,有望优化教学策略与教师专业发展。
English: The study introduces an automated pipeline using face detection and recognition with minimal manual data to identify which students teachers focus on via mobile eye tracking, showing promising accuracy especially in U-shaped and small classrooms, which could enhance teaching strategies and professional development.

Authors:Suleyman Ozdel, Can Sarpkaya, Efe Bozkir, Hong Gao, Enkelejda Kasneci
Title: Examining the Role of LLM-Driven Interactions on Attention and Cognitive Engagement in Virtual Classrooms
Abstract:
Transforming educational technologies through the integration of large language models (LLMs) and virtual reality (VR) offers the potential for immersive and interactive learning experiences. However, the effects of LLMs on user engagement and attention in educational environments remain open questions. In this study, we utilized a fully LLM-driven virtual learning environment, where peers and teachers were LLM-driven, to examine how students behaved in such settings. Specifically, we investigate how peer question-asking behaviors influenced student engagement, attention, cognitive load, and learning outcomes and found that, in conditions where LLM-driven peer learners asked questions, students exhibited more targeted visual scanpaths, with their attention directed toward the learning content, particularly in complex subjects. Our results suggest that peer questions did not introduce extraneous cognitive load directly, as the cognitive load is strongly correlated with increased attention to the learning material. Considering these findings, we provide design recommendations for optimizing VR learning spaces.
中文摘要:通过将大型语言模型与虚拟现实结合于教育中,同伴提问能引导学生注意力聚焦于学习内容,增强参与度且不增加额外认知负担,从而优化沉浸式学习效果。
English Summary: Integrating large language models with virtual reality in education enhances student engagement by directing attention to learning content through peer-driven questions, without increasing cognitive load, leading to more effective immersive learning experiences.

Authors:Hao Xu, Yinqiao Wang, Niloy J. Mitra, Shuaicheng Liu, Pheng-Ann Heng, Chi-Wing Fu
Title: Hand-Shadow Poser
Abstract:
Hand shadow art is a captivating art form, creatively using hand shadows to reproduce expressive shapes on the wall. In this work, we study an inverse problem: given a target shape, find the poses of left and right hands that together best produce a shadow resembling the input. This problem is nontrivial, since the design space of 3D hand poses is huge while being restrictive due to anatomical constraints. Also, we need to attend to the input's shape and crucial features, though the input is colorless and textureless. To meet these challenges, we design Hand-Shadow Poser, a three-stage pipeline, to decouple the anatomical constraints (by hand) and semantic constraints (by shadow shape): (i) a generative hand assignment module to explore diverse but reasonable left/right-hand shape hypotheses; (ii) a generalized hand-shadow alignment module to infer coarse hand poses with a similarity-driven strategy for selecting hypotheses; and (iii) a shadow-feature-aware refinement module to optimize the hand poses for physical plausibility and shadow feature preservation. Further, we design our pipeline to be trainable on generic public hand data, thus avoiding the need for any specialized training dataset. For method validation, we build a benchmark of 210 diverse shadow shapes of varying complexity and a comprehensive set of metrics, including a novel DINOv2-based evaluation metric. Through extensive comparisons with multiple baselines and user studies, our approach is demonstrated to effectively generate bimanual hand poses for a large variety of hand shapes for over 85% of the benchmark cases.
中文: 本研究提出Hand-Shadow Poser三阶段流程,通过解耦解剖约束与语义约束,成功解决了根据目标阴影生成双手姿态的逆向问题,在无需专用训练数据的情况下对85%以上的测试案例有效生成双手姿势。
English: This research introduces Hand-Shadow Poser, a three-stage pipeline that solves the inverse problem of generating anatomically plausible bimanual hand poses to create target shadow shapes, achieving over 85% success on a diverse benchmark without requiring specialized training data.

Authors:Fuhui Zhou, Chunyu Liu, Hao Zhang, Wei Wu, Qihui Wu, Derrick Wing Kwan Ng, Tony Q. S. Quek, Chan-Byoung Chae
Title: SpectrumFM: A Foundation Model for Intelligent Spectrum Management
Abstract:
Intelligent spectrum management is crucial for improving spectrum efficiency and achieving secure utilization of spectrum resources. However, existing intelligent spectrum management methods, typically based on small-scale models, suffer from notable limitations in recognition accuracy, convergence speed, and generalization, particularly in the complex and dynamic spectrum environments. To address these challenges, this paper proposes a novel spectrum foundation model, termed SpectrumFM, establishing a new paradigm for spectrum management. SpectrumFM features an innovative encoder architecture that synergistically exploits the convolutional neural networks and the multi-head self-attention mechanisms to enhance feature extraction and enable robust representation learning. The model is pre-trained via two novel self-supervised learning tasks, namely masked reconstruction and next-slot signal prediction, which leverage large-scale in-phase and quadrature (IQ) data to achieve comprehensive and transferable spectrum representations. Furthermore, a parameter-efficient fine-tuning strategy is proposed to enable SpectrumFM to adapt to various downstream spectrum management tasks, including automatic modulation classification (AMC), wireless technology classification (WTC), spectrum sensing (SS), and anomaly detection (AD). Extensive experiments demonstrate that SpectrumFM achieves superior performance in terms of accuracy, robustness, adaptability, few-shot learning efficiency, and convergence speed, consistently outperforming conventional methods across multiple benchmarks. Specifically, SpectrumFM improves AMC accuracy by up to 12.1% and WTC accuracy by 9.3%, achieves an area under the curve (AUC) of 0.97 in SS at -4 dB signal-to-noise ratio (SNR), and enhances AD performance by over 10%.
中文摘要:本文提出了一种名为SpectrumFM的新型频谱基础模型,通过结合卷积神经网络和多头自注意力机制,利用自监督预训练和高效微调策略,在多种频谱管理任务中实现了精度、鲁棒性和适应性的显著提升。
English Summary: This paper introduces SpectrumFM, a novel spectrum foundation model that integrates convolutional neural networks and multi-head self-attention mechanisms, achieving superior performance in accuracy, robustness, and adaptability across various spectrum management tasks through self-supervised pre-training and efficient fine-tuning.

Authors:Dongqian Guo, Wencheng Han, Pang Lyu, Yuxi Zhou, Jianbing Shen
Title: Towards Better Cephalometric Landmark Detection with Diffusion Data Generation
Abstract:
Cephalometric landmark detection is essential for orthodontic diagnostics and treatment planning. Nevertheless, the scarcity of samples in data collection and the extensive effort required for manual annotation have significantly impeded the availability of diverse datasets. This limitation has restricted the effectiveness of deep learning-based detection methods, particularly those based on large-scale vision models. To address these challenges, we have developed an innovative data generation method capable of producing diverse cephalometric X-ray images along with corresponding annotations without human intervention. To achieve this, our approach begins by constructing new cephalometric landmark annotations using anatomical priors. Then, we employ a diffusion-based generator to create realistic X-ray images that correspond closely with these annotations. To achieve precise control in producing samples with different attributes, we introduce a novel prompt cephalometric X-ray image dataset. This dataset includes real cephalometric X-ray images and detailed medical text prompts describing the images. By leveraging these detailed prompts, our method improves the generation process to control different styles and attributes. Facilitated by the large, diverse generated data, we introduce large-scale vision detection models into the cephalometric landmark detection task to improve accuracy. Experimental results demonstrate that training with the generated data substantially enhances the performance. Compared to methods without using the generated data, our approach improves the Success Detection Rate (SDR) by 6.5%, attaining a notable 82.2%. All code and data are available at: https://um-lab.github.io/cepha-generation
中文摘要:本研究提出了一种创新的数据生成方法,通过解剖学先验和扩散模型生成具有自动标注的多样化头影测量X射线图像,使大规模视觉模型成功检测率达到82.2%,显著提升了标志点检测性能。
English Summary: This study introduces an innovative data generation method using anatomical priors and diffusion models to create diverse cephalometric X-ray images with automated annotations, enabling large-scale vision models to achieve an 82.2% success detection rate and significantly improving landmark detection performance.

Authors:Minkyu Choi, Yunhao Yang, Neel P. Bhatt, Kushagra Gupta, Sahil Shah, Aditya Rai, David Fridovich-Keil, Ufuk Topcu, Sandeep P. Chinchali
Title: Real-Time Privacy Preservation for Robot Visual Perception
Abstract:
Many robots (e.g., iRobot's Roomba) operate based on visual observations from live video streams, and such observations may inadvertently include privacy-sensitive objects, such as personal identifiers. Existing approaches for preserving privacy rely on deep learning models, differential privacy, or cryptography. They lack guarantees for the complete concealment of all sensitive objects. Guaranteeing concealment requires post-processing techniques and thus is inadequate for real-time video streams. We develop a method for privacy-constrained video streaming, PCVS, that conceals sensitive objects within real-time video streams. PCVS takes a logical specification constraining the existence of privacy-sensitive objects, e.g., never show faces when a person exists. It uses a detection model to evaluate the existence of these objects in each incoming frame. Then, it blurs out a subset of objects such that the existence of the remaining objects satisfies the specification. We then propose a conformal prediction approach to (i) establish a theoretical lower bound on the probability of the existence of these objects in a sequence of frames satisfying the specification and (ii) update the bound with the arrival of each subsequent frame. Quantitative evaluations show that PCVS achieves over 95 percent specification satisfaction rate in multiple datasets, significantly outperforming other methods. The satisfaction rate is consistently above the theoretical bounds across all datasets, indicating that the established bounds hold. Additionally, we deploy PCVS on robots in real-time operation and show that the robots operate normally without being compromised when PCVS conceals objects.
中文: PCVS方法通过逻辑规范识别并模糊实时视频流中的敏感对象,实现了超过95%的规范满足率,具备理论保障且不影响机器人正常运行。
English: The proposed PCVS method ensures real-time privacy in video streams by logically specifying and blurring sensitive objects, achieving over 95% specification satisfaction with theoretical guarantees and seamless robot operation.
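
The per-frame step described above (detect objects, then blur a subset so that the remaining visible objects satisfy the specification) can be sketched as a small search. This is an illustrative sketch only: the brute-force search, the `sensitive` label set, and all function names are assumptions, and the conformal-prediction bound from the abstract is not modeled.

```python
from itertools import combinations

def spec_no_face_with_person(visible_labels):
    """Example specification: never show a face when a person is visible."""
    return not ("person" in visible_labels and "face" in visible_labels)

def choose_objects_to_blur(detections, spec, sensitive):
    """Return indices of a smallest set of sensitive detections to blur.

    detections: list of object labels detected in one frame.
    spec: predicate over the set of labels that remain visible.
    sensitive: labels that may be blurred (an assumption of this sketch).
    """
    candidates = [i for i, label in enumerate(detections) if label in sensitive]
    # Try blurring 0 objects, then 1, then 2, ... until the spec holds.
    for k in range(len(candidates) + 1):
        for blurred in combinations(candidates, k):
            visible = {label for i, label in enumerate(detections)
                       if i not in blurred}
            if spec(visible):
                return list(blurred)
    # No subset satisfies the spec: fall back to blurring every candidate.
    return candidates

frame = ["person", "face", "chair"]
to_blur = choose_objects_to_blur(frame, spec_no_face_with_person, {"face"})
print(to_blur)
```

The exhaustive search is exponential in the number of sensitive detections, so a real-time system would need a cheaper selection rule. It is used here only to make the "blur a subset such that the rest satisfies the specification" idea concrete.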

Authors:Khaled Saab, Jan Freyberg, Chunjong Park, Tim Strother, Yong Cheng, Wei-Hung Weng, David G. T. Barrett, David Stutz, Nenad Tomasev, Anil Palepu, Valentin Liévin, Yash Sharma, Roma Ruparel, Abdullah Ahmed, Elahe Vedadi, Kimberly Kanada, Cian Hughes, Yun Liu, Geoff Brown, Yang Gao, Sean Li, S. Sara Mahdavi, James Manyika, Katherine Chou, Yossi Matias, Avinatan Hassidim, Dale R. Webster, Pushmeet Kohli, S. M. Ali Eslami, Joëlle Barral, Adam Rodman, Vivek Natarajan, Mike Schaekermann, Tao Tu, Alan Karthikesalingam, Ryutaro Tanno
Title: Advancing Conversational Diagnostic AI with Multimodal Reasoning
Abstract:
Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the ability of LLMs to reason over such data while preserving other attributes of competent diagnostic conversation remains unknown. Here we advance the conversational diagnosis and management performance of the Articulate Medical Intelligence Explorer (AMIE) through a new capability to gather and interpret multimodal data, and reason about this precisely during consultations. Leveraging Gemini 2.0 Flash, our system implements a state-aware dialogue framework, where conversation flow is dynamically controlled by intermediate model outputs reflecting patient states and evolving diagnoses. Follow-up questions are strategically directed by uncertainty in such patient states, leading to a more structured multimodal history-taking process that emulates experienced clinicians. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors. We constructed 105 evaluation scenarios using artifacts like smartphone skin photos, ECGs, and PDFs of clinical documents across diverse conditions and demographics. Our rubric assessed multimodal capabilities and other clinically meaningful axes like history-taking, diagnostic accuracy, management reasoning, communication, and empathy. Specialist evaluation showed AMIE to be superior to PCPs on 7/9 multimodal and 29/32 non-multimodal axes (including diagnostic accuracy). The results show clear progress in multimodal conversational diagnostic AI, but real-world translation needs further research.
中文: 大型语言模型AMIE在诊断对话中实现了多模态医疗数据的解读能力,在结构化评估中表现优于初级保健医生,但实际应用仍需进一步研究。
English: Large language models like AMIE have advanced to interpret multimodal medical data during diagnostic conversations, showing superior performance to primary care physicians in structured evaluations but requiring further research for real-world application.

Authors:Xueyao Zhang, Yuancheng Wang, Chaoren Wang, Ziniu Li, Zhuo Chen, Zhizheng Wu
Title: Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
Abstract:
Modern zero-shot text-to-speech (TTS) systems, despite using extensive pre-training, often struggle in challenging scenarios such as tongue twisters, repeated words, code-switching, and cross-lingual synthesis, leading to intelligibility issues. To address these limitations, this paper leverages preference alignment techniques, which enable targeted construction of out-of-pretraining-distribution data to enhance performance. We introduce a new dataset, named the Intelligibility Preference Speech Dataset (INTP), and extend the Direct Preference Optimization (DPO) framework to accommodate diverse TTS architectures. After INTP alignment, in addition to intelligibility, we observe overall improvements including naturalness, similarity, and audio quality for multiple TTS models across diverse domains. Based on that, we also verify the weak-to-strong generalization ability of INTP for more intelligible models such as CosyVoice 2 and Ints. Moreover, we showcase the potential for further improvements through iterative alignment based on Ints. Audio samples are available at https://intalign.github.io/.
中文: 本文提出可懂度偏好语音数据集INTP并扩展直接偏好优化框架,有效提升了零样本语音合成系统在可懂度、自然度和音质方面的表现,同时验证了弱到强泛化能力和迭代优化的潜力。
English: This paper introduces the Intelligibility Preference Speech Dataset (INTP) and extends Direct Preference Optimization to enhance zero-shot TTS systems, achieving improved intelligibility, naturalness, and audio quality across multiple models while demonstrating weak-to-strong generalization and iterative alignment potential.

Authors:David Noever, Forrest McKee
Title: Alpha Excel Benchmark
Abstract:
This study presents a novel benchmark for evaluating Large Language Models (LLMs) using challenges derived from the Financial Modeling World Cup (FMWC) Excel competitions. We introduce a methodology for converting 113 existing FMWC challenges into programmatically evaluable JSON formats and use this dataset to compare the performance of several leading LLMs. Our findings demonstrate significant variations in performance across different challenge categories, with models showing specific strengths in pattern recognition tasks but struggling with complex numerical reasoning. The benchmark provides a standardized framework for assessing LLM capabilities in realistic business-oriented tasks rather than abstract academic problems. This research contributes to the growing field of AI benchmarking by establishing proficiency in Microsoft Excel, which 1.5 billion people use daily, as a meaningful evaluation metric that bridges the gap between academic AI benchmarks and practical business applications.
中文: 本研究通过金融Excel挑战创建新基准来评估大语言模型,发现不同任务表现差异显著,在连接学术AI标准与实际商业应用方面具有突破意义。
English: This study introduces a new benchmark using financial Excel challenges to evaluate LLMs, revealing varied performance across tasks while bridging academic AI metrics with practical business applications.

Authors:Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang
Title: Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
Chinese: “绝对零度”范式让模型能够自主生成任务并在无外部数据的情况下提升推理能力,在编程和数学任务上实现了最先进的性能。
English: The Absolute Zero paradigm enables a model to self-generate tasks and improve reasoning without external data, achieving state-of-the-art performance on coding and math tasks.
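
The abstract says a code executor both validates proposed tasks and verifies answers. A minimal sketch of that dual role follows, assuming a task is a Python program defining `f(x)` plus an input. The helper names, the determinism check, and the binary reward are assumptions for illustration and are not taken from the paper. Note that `exec` runs arbitrary code without sandboxing, so this is unsuitable for untrusted input.

```python
def run_program(source, arg):
    """Execute `source` (which must define f) and return f(arg).

    Hypothetical helper: uses plain exec with no sandboxing.
    """
    namespace = {}
    exec(source, namespace)
    return namespace["f"](arg)

def validate_task(source, arg):
    """Return the reference output if the proposed task is usable, else None.

    Assumed validity criteria: the program runs without error and gives
    the same output on two consecutive runs.
    """
    try:
        first = run_program(source, arg)
        second = run_program(source, arg)
    except Exception:
        return None
    return first if first == second else None

def solver_reward(source, arg, solver_answer):
    """Binary verifiable reward: 1.0 only if the solver matches the executor."""
    gold = validate_task(source, arg)
    if gold is None:
        return 0.0
    return 1.0 if solver_answer == gold else 0.0

task = "def f(x):\n    return x * 2 + 1"
print(solver_reward(task, 3, 7))
```

In this sketch the executor is the only source of ground truth: the proposer never supplies the answer, which is what lets the reward stay verifiable without external data.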

Authors:Vladyslav Zalevskyi, Thomas Sanchez, Misha Kaandorp, Margaux Roulet, Diego Fajardo-Rojas, Liu Li, Jana Hutter, Hongwei Bran Li, Matthew Barkovich, Hui Ji, Luca Wilhelmi, Aline Dändliker, Céline Steger, Mériam Koob, Yvan Gomez, Anton Jakovčić, Melita Klaić, Ana Adžić, Pavel Marković, Gracia Grabarić, Milan Rados, Jordina Aviles Verdera, Gregor Kasprian, Gregor Dovjak, Raphael Gaubert-Rachmühl, Maurice Aschwanden, Qi Zeng, Davood Karimi, Denis Peruzzo, Tommaso Ciceri, Giorgio Longari, Rachika E. Hamadache, Amina Bouzid, Xavier Lladó, Simone Chiarella, Gerard Martí-Juan, Miguel Ángel González Ballester, Marco Castellaro, Marco Pinamonti, Valentina Visani, Robin Cremese, Keïn Sam, Fleur Gaudfernau, Param Ahir, Mehul Parikh, Maximilian Zenk, Michael Baumgartner, Klaus Maier-Hein, Li Tianhong, Yang Hong, Zhao Longfei, Domen Preloznik, Žiga Špiclin, Jae Won Choi, Muyang Li, Jia Fu, Guotai Wang, Jingwen Jiang, Lyuyang Tong, Bo Du, Andrea Gondova, Sungmin You, Kiho Im, Abdul Qayyum, Moona Mazher, Steven A Niederer, Andras Jakab, Roxane Licandro, Kelly Payette, Meritxell Bach Cuadra
Title: Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge
Abstract:
Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics were also expanded to include the topology-specific Euler characteristic difference (ED). Sixteen teams submitted segmentation methods, most of which performed consistently across both high- and low-field scans. However, longitudinal trends indicate that segmentation accuracy may be reaching a plateau, with results now approaching inter-rater variability. The ED metric uncovered topological differences that were missed by conventional metrics, while the low-field dataset achieved the highest segmentation scores, highlighting the potential of affordable imaging systems when paired with high-quality reconstruction. Seven teams participated in the biometry task, but most methods failed to outperform a simple baseline that predicted measurements based solely on gestational age, underscoring the challenge of extracting reliable biometric estimates from image data alone. Domain shift analysis identified image quality as the most significant factor affecting model generalization, with super-resolution pipelines also playing a substantial role. Other factors, such as gestational age, pathology, and acquisition site, had smaller, though still measurable, effects. Overall, FeTA 2024 offers a comprehensive benchmark for multi-class segmentation and biometry estimation in fetal brain MRI, underscoring the need for data-centric approaches, improved topological evaluation, and greater dataset diversity to enable clinically robust and generalizable AI tools.
中文: FeTA 2024挑战赛通过引入生物测量预测与组织分割任务,推进了胎儿脑部MRI分析,发现分割精度已接近人类水平,同时揭示了仅从图像数据获取可靠生物测量的挑战,并强调了数据多样性和拓扑评估对开发稳健AI工具的重要性。
English: The FeTA Challenge 2024 advanced fetal brain MRI analysis by introducing biometry prediction alongside tissue segmentation, revealing segmentation accuracy approaching human-level performance while highlighting challenges in biometric estimation and the importance of data diversity and topological evaluation for robust AI tools.

Authors:Chuxue Cao, Zhenghao Zhu, Junqi Zhu, Guoying Lu, Siyu Peng, Juntao Dai, Weijie Shi, Sirui Han, Yike Guo
Title: Measuring Hong Kong Massive Multi-Task Language Understanding
Abstract:
Multilingual understanding is crucial for the cross-cultural applicability of Large Language Models (LLMs). However, evaluation benchmarks designed for Hong Kong's unique linguistic landscape, which combines Traditional Chinese script with Cantonese as the spoken form and its cultural context, remain underdeveloped. To address this gap, we introduce HKMMLU, a multi-task language understanding benchmark that evaluates Hong Kong's linguistic competence and socio-cultural knowledge. The HKMMLU includes 26,698 multi-choice questions across 66 subjects, organized into four categories: Science, Technology, Engineering, and Mathematics (STEM), Social Sciences, Humanities, and Other. To evaluate the multilingual understanding ability of LLMs, 90,550 Mandarin-Cantonese translation tasks were additionally included. We conduct comprehensive experiments on GPT-4o, Claude 3.7 Sonnet, and 18 open-source LLMs of varying sizes on HKMMLU. The results show that the best-performing model, DeepSeek-V3, struggles to achieve an accuracy of 75%, significantly lower than that of MMLU and CMMLU. This performance gap highlights the need to improve LLMs' capabilities in Hong Kong-specific language and knowledge domains. Furthermore, we investigate how question language, model size, prompting strategies, and question and reasoning token lengths affect model performance. We anticipate that HKMMLU will significantly advance the development of LLMs in multilingual and cross-cultural contexts, thereby enabling broader and more impactful applications.
中文: 针对香港独特语言文化背景评估基准的缺失,HKMMLU基准的推出揭示了包括DeepSeek-V3在内的最优模型在香港特定知识领域表现不佳,凸显了提升大语言模型多语言理解能力的迫切需求。
English: To address the lack of evaluation benchmarks for Hong Kong's unique linguistic and cultural context, HKMMLU is introduced as a comprehensive benchmark revealing that even top-performing LLMs like DeepSeek-V3 struggle with Hong Kong-specific knowledge, highlighting the need for improved multilingual capabilities.

Authors:Yuchen Wang, Xuefeng Bai, Xiucheng Li, Weili Guan, Liqiang Nie, Xinyang Chen
Title: Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin
Abstract:
Adapting vision-language models (VLMs) to downstream tasks with pseudolabels has gained increasing attention. A major obstacle is that the pseudolabels generated by VLMs tend to be imbalanced, leading to inferior performance. While existing methods have explored various strategies to address this, the underlying causes of imbalance remain insufficiently investigated. To fill this gap, we delve into imbalanced pseudolabels and identify two primary contributing factors: concept mismatch and concept confusion. To mitigate these two issues, we propose a novel framework incorporating concept alignment and confusion-aware calibrated margin mechanisms. The core of our approach lies in enhancing underperforming classes and promoting balanced predictions across categories, thus mitigating imbalance. Extensive experiments on six benchmark datasets with three learning paradigms demonstrate that the proposed method effectively enhances the accuracy and balance of pseudolabels, achieving a relative improvement of 6.29% over the SoTA method. Our code is available at https://anonymous.4open.science/r/CAP-C642/
Chinese: 本研究针对视觉语言模型生成的伪标签不平衡问题,揭示了概念不匹配与概念混淆两大成因,提出了增强弱势类别的新框架,在六个基准数据集上实现了相比现有最优方法6.29%的相对性能提升。
English: This study addresses the imbalance in pseudolabels generated by vision-language models by identifying concept mismatch and concept confusion as key causes, proposing a novel framework that enhances underperforming classes and achieves a 6.29% relative improvement over the state-of-the-art method on six benchmark datasets.

Authors:Theo Guidroz, Diego Ardila, Jimmy Li, Adam Mansour, Paul Jhun, Nina Gonzalez, Xiang Ji, Mike Sanchez, Sujay Kakarmath, Mathias MJ Bellaiche, Miguel Ángel Garrido, Faruk Ahmed, Divyansh Choudhary, Jay Hartford, Chenwei Xu, Henry Javier Serrano Echeverria, Yifan Wang, Jeff Shaffer, Eric, Cao, Yossi Matias, Avinatan Hassidim, Dale R Webster, Yun Liu, Sho Fujiwara, Peggy Bui, Quang Duong
Title: LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load
Abstract:
Information on the web, such as scientific publications and Wikipedia, often surpasses users' reading level. To help address this, we used a self-refinement approach to develop an LLM capability for minimally lossy text simplification. To validate our approach, we conducted a randomized study involving 4563 participants and 31 texts spanning 6 broad subject areas: PubMed (biomedical scientific articles), biology, law, finance, literature/philosophy, and aerospace/computer science. Participants were randomized to viewing original or simplified texts in a subject area, and answered multiple-choice questions (MCQs) that tested their comprehension of the text. The participants were also asked to provide qualitative feedback such as task difficulty. Our results indicate that participants who read the simplified text answered more MCQs correctly than their counterparts who read the original text (3.9% absolute increase, p<0.05). This gain was most striking with PubMed (14.6%), while more moderate gains were observed for the finance (5.5%), aerospace/computer science (3.8%), and legal (3.5%) domains. Notably, the results were robust to whether participants could refer back to the text while answering MCQs. The absolute accuracy decreased by up to ~9% for both original and simplified setups where participants could not refer back to the text, but the ~4% overall improvement persisted. Finally, participants' self-reported perceived ease based on a simplified NASA Task Load Index was greater for those who read the simplified text (absolute change on a 5-point scale 0.33, p<0.05). This randomized study, involving an order of magnitude more participants than prior works, demonstrates the potential of LLMs to make complex information easier to understand. Our work aims to enable a broader audience to better learn and make use of expert knowledge available on the web, improving information accessibility.
中文: 本研究通过4563名参与者的随机试验证明,大型语言模型能有效简化复杂文本,阅读简化版本的参与者不仅理解正确率显著提高,还反馈任务难度明显降低。
English: This study demonstrates that a large language model can effectively simplify complex texts, as shown by a randomized trial with 4563 participants where those reading simplified versions achieved significantly higher comprehension scores and reported lower perceived difficulty.

Authors:Md Arafat Habib, Pedro Enrique Iturria Rivera, Yigit Ozcan, Medhat Elsayed, Majid Bavand, Raimundas Gaigalas, Melike Erol-Kantarci
Title: Harnessing the Power of LLMs, Informers and Decision Transformers for Intent-driven RAN Management in 6G
Abstract:
Intent-driven network management is critical for managing the complexity of 5G and 6G networks. It enables adaptive, on-demand management of the network based on the objectives of the network operators. In this paper, we propose an innovative three-step framework for intent-driven network management based on Generative AI (GenAI) algorithms. First, we fine-tune a Large Language Model (LLM) on a custom dataset using a Quantized Low-Rank Adapter (QLoRA) to enable memory-efficient intent processing within limited computational resources. A Retrieval Augmented Generation (RAG) module is included to support dynamic decision-making. Second, we utilize a transformer architecture for time series forecasting to predict key parameters, such as power consumption, traffic load, and packet drop rate, to facilitate proactive intent validation. Lastly, we introduce a Hierarchical Decision Transformer with Goal Awareness (HDTGA) to optimize the selection and orchestration of network applications and, in turn, the network itself. Our intent guidance and processing approach improves BERTScore by 6% and the semantic similarity score by 9% compared to the base LLM. In addition, the proposed predictive intent validation approach rules out performance-degrading intents with an average accuracy of 88%. Finally, compared to the baselines, the proposed HDTGA algorithm increases throughput by at least 19.3%, reduces delay by 48.5%, and boosts energy efficiency by 54.9%.
中文: 本文提出了一种基于生成式人工智能的三步框架,通过QLoRA微调大语言模型实现高效意图处理,采用时序预测进行主动验证,并利用分层决策变换器优化网络,在各项性能指标上均取得显著提升。
English: This paper introduces a GenAI-based three-step framework for intent-driven network management, utilizing LLM fine-tuning with QLoRA for efficient intent processing, transformer-based forecasting for proactive validation, and HDTGA for network optimization, achieving significant improvements in performance metrics.

Authors:Han Wan, Rui Zhang, Qi Wang, Yang Liu, Hao Sun
Title: PeSANet: Physics-encoded Spectral Attention Network for Simulating PDE-Governed Complex Systems
Abstract:
Accurately modeling and forecasting complex systems governed by partial differential equations (PDEs) is crucial in various scientific and engineering domains. However, traditional numerical methods struggle in real-world scenarios due to incomplete or unknown physical laws. Meanwhile, machine learning approaches often fail to generalize effectively when faced with scarce observational data and the challenge of capturing local and global features. To this end, we propose the Physics-encoded Spectral Attention Network (PeSANet), which integrates local and global information to forecast complex systems with limited data and incomplete physical priors. The model consists of two key components: a physics-encoded block that uses hard constraints to approximate local differential operators from limited data, and a spectral-enhanced block that captures long-range global dependencies in the frequency domain. Specifically, we introduce a novel spectral attention mechanism to model inter-spectrum relationships and learn long-range spatial features. Experimental results demonstrate that PeSANet outperforms existing methods across all metrics, particularly in long-term forecasting accuracy, providing a promising solution for simulating complex systems with limited data and incomplete physics.
Chinese: 提出的物理编码谱注意力网络(PeSANet)通过结合局部物理约束和全局谱注意力机制,能够在数据有限和物理知识不完整的情况下准确预测复杂系统,在长期预测精度上展现出卓越性能。
English: The proposed Physics-encoded Spectral Attention Network (PeSANet) integrates local physics constraints and global spectral attention to accurately forecast complex systems with limited data and incomplete physical knowledge, demonstrating superior performance in long-term predictions.

Authors:Ziyan Guo, Haoxuan Qu, Hossein Rahmani, Dewen Soh, Ping Hu, Qiuhong Ke, Jun Liu
Title: TSTMotion: Training-free Scene-aware Text-to-motion Generation
Abstract:
Text-to-motion generation has recently garnered significant research interest, primarily focusing on generating human motion sequences in blank backgrounds. However, human motions commonly occur within diverse 3D scenes, which has prompted exploration into scene-aware text-to-motion generation methods. Yet, existing scene-aware methods often rely on large-scale ground-truth motion sequences in diverse 3D scenes, which poses practical challenges due to the expensive cost. To mitigate this challenge, we are the first to propose a Training-free Scene-aware Text-to-Motion framework, dubbed TSTMotion, that efficiently empowers pre-trained blank-background motion generators with the scene-aware capability. Specifically, conditioned on the given 3D scene and text description, we adopt foundation models together to reason, predict and validate a scene-aware motion guidance. Then, the motion guidance is incorporated into the blank-background motion generators with two modifications, resulting in scene-aware text-driven motion sequences. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework. We release our code on the project page: https://tstmotion.github.io/
中文: 提出的TSTMotion框架无需训练,通过利用基础模型生成和验证三维场景中的运动引导,有效增强了预训练的空白背景运动生成器的场景感知能力。
English: The proposed TSTMotion framework is a training-free solution that enhances pre-trained blank-background motion generators with scene-aware capabilities by leveraging foundation models to create and validate motion guidance for 3D environments.

Authors:Chengsen Wang, Qi Qi, Jingyu Wang, Haifeng Sun, Zirui Zhuang, Jianxin Liao
Title: Unlocking the Potential of Linear Networks for Irregular Multivariate Time Series Forecasting
Abstract:
Time series forecasting holds significant importance across various industries, including finance, transportation, energy, healthcare, and climate. Despite the widespread use of linear networks due to their low computational cost and effectiveness in modeling temporal dependencies, most existing research has concentrated on regularly sampled and fully observed multivariate time series. However, in practice, we frequently encounter irregular multivariate time series characterized by variable sampling intervals and missing values. The inherent intra-series inconsistency and inter-series asynchrony in such data hinder effective modeling and forecasting with traditional linear networks relying on static weights. To tackle these challenges, this paper introduces a novel model named AiT. AiT utilizes an adaptive linear network capable of dynamically adjusting weights according to observation time points to address intra-series inconsistency, thereby enhancing the accuracy of temporal dependencies modeling. Furthermore, by incorporating the Transformer module on variable semantics embeddings, AiT efficiently captures variable correlations, avoiding the challenge of inter-series asynchrony. Comprehensive experiments across four benchmark datasets demonstrate the superiority of AiT, improving prediction accuracy by 11% and decreasing runtime by 52% compared to existing state-of-the-art methods.
中文: 本文提出AiT模型,通过自适应线性网络和Transformer模块有效处理具有可变采样间隔和缺失值的不规则多元时间序列,相比现有最优方法,预测精度提升11%,运行时间减少52%。
English: This paper introduces AiT, a novel model that employs an adaptive linear network and Transformer module to effectively handle irregular multivariate time series with variable sampling intervals and missing values, achieving an 11% improvement in prediction accuracy and 52% reduction in runtime over state-of-the-art methods.

Authors:Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, Chun Yuan
Title: A Simple Linear Patch Revives Layer-Pruned Large Language Models
Abstract:
Layer pruning has become a popular technique for compressing large language models (LLMs) due to its simplicity. However, existing layer pruning methods often suffer from significant performance drops. We identify that this degradation stems from the mismatch of activation magnitudes across layers and tokens at the pruning interface. To address this, we propose LinearPatch, a simple plug-and-play technique to revive the layer-pruned LLMs. The proposed method adopts Hadamard transformation to suppress massive outliers in particular tokens, and channel-wise scaling to align the activation magnitudes. These operations can be fused into a single matrix, which functions as a patch to bridge the pruning interface with negligible inference overhead. LinearPatch retains up to 94.15% performance of the original model when pruning 5 layers of LLaMA-3-8B on the question answering benchmark, surpassing existing state-of-the-art methods by 4%. In addition, the patch matrix can be further optimized with memory efficient offline knowledge distillation. With only 5K samples, the retained performance of LinearPatch can be further boosted to 95.16% within 30 minutes on a single computing card.
中文摘要:层剪枝技术因简单性被广泛用于压缩大语言模型,但现有方法常导致性能显著下降;LinearPatch通过哈达玛变换抑制异常值并进行通道缩放,以可忽略的推理开销有效弥补剪枝接口的激活不匹配,在LLaMA-3-8B上剪除5层后仍保持94.15%性能,优于现有最佳方法4%。
English Summary: Layer pruning in large language models often causes performance loss due to activation mismatches, but LinearPatch, a plug-and-play technique using Hadamard transformation and channel-wise scaling, effectively bridges this gap with minimal overhead, achieving up to 94.15% performance retention on LLaMA-3-8B and surpassing existing methods by 4%.
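
The abstract describes fusing a Hadamard transformation and channel-wise scaling into a single patch matrix placed at the pruning interface. The sketch below shows one way such a fusion could look in NumPy. The composition `H @ diag(scale) @ H.T`, the mean-absolute-value statistic used for the scale, and the function names are all assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via Sylvester construction.

    n must be a power of two.
    """
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def build_patch(x_in, x_out, eps=1e-8):
    """Build a single d x d patch matrix from calibration activations.

    x_in:  activations entering the pruned block, shape (tokens, d).
    x_out: activations leaving the pruned block, shape (tokens, d).
    """
    d = x_in.shape[1]
    H = hadamard(d)
    # Rotate both activation sets; the rotation spreads token-specific
    # outliers across channels before magnitudes are compared.
    r_in, r_out = x_in @ H, x_out @ H
    # Per-channel ratio of mean magnitudes (an assumed statistic).
    scale = np.abs(r_out).mean(axis=0) / (np.abs(r_in).mean(axis=0) + eps)
    # Fuse rotate -> scale -> rotate back into one matrix.
    return H @ np.diag(scale) @ H.T

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
P = build_patch(x, 2.0 * x)  # pruned block that simply doubles activations
print(np.round(np.diag(P), 3))
```

Because the three steps collapse into one `d x d` matrix, applying the patch costs a single extra matrix multiply per token, which is consistent with the negligible inference overhead claimed in the abstract.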

Authors:Sakhinana Sagar Srinivas, Shivam Gupta, Venkataramana Runkana
Title: AutoChemSchematic AI: Agentic Physics-Aware Automation for Chemical Manufacturing Scale-Up
Abstract:
Recent advances in generative AI have accelerated the discovery of novel chemicals and materials. However, scaling these discoveries to industrial production remains a major bottleneck due to the synthesis gap -- the need to develop entirely new manufacturing processes. This challenge requires detailed engineering blueprints: PFDs for equipment layouts and material/energy flows, and PIDs for process plant operations. Current AI systems cannot yet reliably generate these critical engineering schematics, creating a fundamental obstacle to manufacturing scale-up of novel discoveries. We present a closed-loop, physics-aware framework for automated generation of industrially viable PFDs and PIDs. The framework integrates three key components: (1) domain-specialized small language models (SLMs) trained for auto-generation of PFDs and PIDs, (2) a hierarchical knowledge graph containing process flow and instrumentation descriptions for 1,020+ chemicals for Graph Retrieval-Augmented Generation (GRAG), and (3) an open-source chemical process simulator for modeling, simulation, optimization, and analysis of novel chemical processes. The SLMs are trained through a multi-stage pipeline on synthetic datasets, with process simulator-in-the-loop validation ensuring feasibility. To enhance computational efficiency, the framework implements structural pruning (width and depth) guided by importance heuristics to reduce language model size while preserving accuracy, followed by advanced inference optimizations including FlashAttention, Lookahead Decoding, PagedAttention with KV-cache quantization, and Test-Time Inference Scaling. Experimental results demonstrate that our framework generates simulator-validated process descriptions with high fidelity.
中文: 针对新化学品发现与工业化生产之间的合成鸿沟,我们开发了物理感知的闭环框架,通过专业小语言模型、分级知识图谱和开源模拟器自动生成经模拟验证的工业流程图与管道仪表图,有效解决了规模化生产瓶颈。
English: Recent generative AI advances in chemical discovery face a manufacturing bottleneck, which our new physics-aware framework overcomes by automatically generating validated industrial blueprints (PFDs/PIDs) using specialized language models, knowledge graphs, and process simulation.

Authors:Saeed Ibrahim, Yue Xiao, Dimitrios Tyrovolas, Sotiris A. Tegos, Panagiotis D. Diamantoulakis, Zheng Ma, George K. Karagiannidis, Pinghzi Fan
Title: Cognitive-Radio Functionality: A Novel Configuration for STAR-RIS assisted RSMA Networks
Abstract:
Cognitive radio rate-splitting multiple access (CR-RSMA) has emerged as a promising multiple access framework that can efficiently manage interference and adapt dynamically to heterogeneous quality-of-service (QoS) requirements. To effectively support such demanding access schemes, programmable wireless environments have attracted considerable attention, especially through simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs), which can enable full-space control of signal propagation in asymmetric user deployments. In this paper, we propose the cognitive radio (CR) functionality for STAR-RIS-assisted CR-RSMA systems, leveraging the unique capability of the STAR-RIS to combine element and power splitting for adaptive control of transmission and reflection in CR scenarios. Specifically, the proposed CR functionality partitions the STAR-RIS into two regions independently controlling the transmission and reflection of signals, simultaneously ensuring the required QoS for the primary user and enhancing the performance of the secondary user. To accurately characterize the system performance, we derive analytical expressions for the ergodic rate of the secondary user and the outage rate of the primary user under Nakagami-m fading. Finally, simulation results show that the proposed approach effectively manages interference, guarantees the QoS of the primary user, and significantly improves the throughput of the secondary user, highlighting STAR-RIS as an efficient solution for CR-RSMA-based services.
中文: 该研究提出的STAR-RIS辅助认知无线电速率分割多址系统的认知无线电功能,通过分区独立控制信号传输与反射,在保障主用户服务质量的同时有效提升次用户性能,实现了干扰管理和吞吐量的显著优化。
English: The proposed cognitive radio functionality in STAR-RIS-assisted CR-RSMA systems partitions the surface to independently control signal transmission and reflection, ensuring primary user QoS while enhancing secondary user performance through effective interference management and throughput improvement.

Authors:Fangyikang Wang, Hubery Yin, Lei Qian, Yinan Li, Shaobin Zhuang, Huminhao Zhu, Yilin Zhang, Yanlong Tang, Chao Zhang, Hanbin Zhao, Hui Qian, Chen Li
Title: Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin
Abstract:
Diffusion models (DMs) have demonstrated a remarkable capability to generate images by learning the noised score function of the data distribution. Current DM sampling techniques typically rely on first-order Langevin dynamics at each noise level, with efforts concentrated on refining inter-level denoising strategies. While leveraging additional second-order Hessian geometry to enhance the sampling quality of Langevin dynamics is common practice in Markov chain Monte Carlo (MCMC), naive attempts to utilize Hessian geometry in high-dimensional DMs lead to quadratic-complexity computational costs, rendering them non-scalable. In this work, we introduce a novel Levenberg-Marquardt-Langevin (LML) method that approximates the diffusion Hessian geometry in a training-free manner, drawing inspiration from the celebrated Levenberg-Marquardt optimization algorithm. Our approach introduces two key innovations: (1) a low-rank approximation of the diffusion Hessian, leveraging the DMs' inherent structure and circumventing explicit quadratic-complexity computations; (2) a damping mechanism to stabilize the approximated Hessian. This LML-approximated Hessian geometry enables diffusion sampling to take more accurate steps and improves image generation quality. We further conduct a theoretical analysis to substantiate the approximation error bound of the low-rank approximation and the convergence property of the damping mechanism. Extensive experiments across multiple pretrained DMs validate that the LML method significantly improves image generation quality, with negligible computational overhead.
中文: 本文提出了Levenberg-Marquardt-Langevin (LML)方法,通过低秩近似和阻尼机制高效逼近扩散模型的Hessian几何结构,在几乎不增加计算成本的情况下显著提升了图像生成质量。
English: This paper introduces the Levenberg-Marquardt-Langevin (LML) method, which efficiently approximates the Hessian geometry in diffusion models using a low-rank approximation and damping mechanism to enhance image generation quality with minimal computational cost.
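
To illustrate why a damped low-rank curvature term can be applied without quadratic cost, here is a minimal sketch. It assumes a rank-one curvature approximation `u u^T` and uses the Sherman-Morrison identity; both choices, and the name `damped_rank_one_step`, are illustrative assumptions and not the construction used in the LML paper, whose abstract only states that a low-rank approximation and a damping mechanism are used.

```python
def damped_rank_one_step(grad, u, damping):
    """Apply (damping * I + u u^T)^{-1} to grad in O(d) time.

    Sherman-Morrison gives
        (damping*I + u u^T)^{-1} = (1/damping) * (I - u u^T / (damping + u.u)),
    so the d x d matrix is never formed. The rank-one structure is an
    assumption made for this example only.
    """
    if damping <= 0:
        raise ValueError("damping must be positive to keep the matrix invertible")
    u_dot_g = sum(ui * gi for ui, gi in zip(u, grad))
    u_dot_u = sum(ui * ui for ui in u)
    coeff = u_dot_g / (damping + u_dot_u)
    return [(gi - coeff * ui) / damping for gi, ui in zip(grad, u)]


# Hypothetical toy values: the matrix here is [[2, 0], [0, 1]].
step = damped_rank_one_step(grad=[2.0, 3.0], u=[1.0, 0.0], damping=1.0)
```

The damping term plays the same role as in Levenberg-Marquardt optimization: it keeps the preconditioner invertible when the low-rank part alone would be singular.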

Authors:Longze Chen, Renke Shan, Huiming Wang, Lu Wang, Ziqiang Liu, Run Luo, Jiawei Wang, Hamid Alinejad-Rokny, Min Yang
Title: CLaSp: In-Context Layer Skip for Self-Speculative Decoding
Abstract:
Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verify model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and ensure compatibility across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp does not require additional drafting modules or extra training. Instead, it employs a plug-and-play mechanism by skipping intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by leveraging the complete hidden states from the last verification stage as an objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3x ~ 1.7x on LLaMA3 series models without altering the original distribution of the generated text.
Chinese: CLaSp是一种新颖的自推测解码上下文层跳跃策略,通过动态跳过中间层无需额外训练即可加速大语言模型,在LLaMA3系列上实现1.3~1.7倍加速同时保持生成文本的原始分布。
English: CLaSp is a novel in-context layer-skipping strategy for self-speculative decoding that accelerates LLMs by dynamically skipping intermediate layers without requiring additional training, achieving 1.3x~1.7x speedup on LLaMA3 models without altering the original distribution of the generated text.
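
The draft-then-verify step that speculative decoding relies on can be sketched as follows. This is a minimal illustration of greedy verification in general, not CLaSp's implementation: the function name `verify_draft` and the toy token ids are hypothetical, and the layer-skipping draft model and the dynamic programming search described in the abstract are not shown.

```python
def verify_draft(draft_tokens, verify_greedy_tokens):
    """Accept the longest draft prefix that the full model agrees with.

    draft_tokens: tokens proposed by the cheap draft model.
    verify_greedy_tokens: token the full (verify) model would pick at each
        draft position given the prefix so far, plus one extra token for the
        position after the draft; length must be len(draft_tokens) + 1.
    Returns the tokens actually emitted in this round.
    """
    if len(verify_greedy_tokens) != len(draft_tokens) + 1:
        raise ValueError("verify output must cover every draft position plus one")
    emitted = []
    for i, token in enumerate(draft_tokens):
        if token == verify_greedy_tokens[i]:
            emitted.append(token)  # draft agreed with the full model
        else:
            emitted.append(verify_greedy_tokens[i])  # replace first mismatch and stop
            return emitted
    # Every draft token was accepted, so the bonus token comes for free.
    emitted.append(verify_greedy_tokens[len(draft_tokens)])
    return emitted


# Hypothetical token ids: the draft diverges at the third position.
round_output = verify_draft([5, 9, 2, 7], [5, 9, 4, 1, 8])
```

Because every emitted token is one the full model would have chosen, greedy output is unchanged; the speedup depends on how often the draft agrees with the verify model.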

Authors:Yunze Man, De-An Huang, Guilin Liu, Shiwei Sheng, Shilong Liu, Liang-Yan Gui, Jan Kautz, Yu-Xiong Wang, Zhiding Yu
Title: Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
Abstract:
Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective. Project page: https://yunzeman.github.io/argus/
Chinese: 本文提出Argus模型,通过以物体为中心的视觉定位机制增强多模态推理能力,在需要精确视觉关注的视觉语言任务中表现卓越。
English: The paper introduces Argus, a multimodal model that enhances visual reasoning through object-centric grounding, effectively improving performance in vision-language tasks requiring precise visual focus.

Authors:Fangyikang Wang, Hubery Yin, Shaobin Zhuang, Huminhao Zhu, Yinan Li, Lei Qian, Chao Zhang, Hanbin Zhao, Hui Qian, Chen Li
Title: Efficiently Access Diffusion Fisher: Within the Outer Product Span Space
Abstract:
Recent advancements in diffusion models (DMs) have explored incorporating the second-order diffusion Fisher information (DF), defined as the negative Hessian of the log density, into various downstream tasks and theoretical analyses. However, current practices typically approximate the diffusion Fisher by applying auto-differentiation to the learned score network. This black-box method, though straightforward, lacks any accuracy guarantee and is time-consuming. In this paper, we show that the diffusion Fisher actually resides within a space spanned by the outer products of the score and the initial data. Based on this outer-product structure, we develop two efficient approximation algorithms to access the trace and matrix-vector multiplication of the DF, respectively. These algorithms bypass the auto-differentiation operations with time-efficient vector-product calculations. Furthermore, we establish approximation error bounds for the proposed algorithms. Experiments in likelihood evaluation and adjoint optimization demonstrate the superior accuracy and reduced computational cost of our proposed algorithms. Additionally, based on the novel outer-product formulation of the DF, we design the first numerical verification experiment for the optimal transport property of the general PF-ODE deduced map.
Chinese: 本文基于扩散Fisher信息的外积结构,提出了两种高效近似算法,在保持理论误差界的同时,显著提升了似然评估和伴随优化任务的精度与计算效率。
English: This paper introduces efficient algorithms for approximating the diffusion Fisher information in diffusion models by leveraging its outer-product structure, achieving higher accuracy and lower computational cost than existing methods while providing theoretical error bounds.

Authors:Juwei Yue, Haikuo Li, Jiawei Sheng, Xiaodong Li, Taoyu Su, Tingwen Liu, Li Guo
Title: Hyperbolic-PDE GNN: Spectral Graph Neural Networks in the Perspective of A System of Hyperbolic Partial Differential Equations
Abstract:
Graph neural networks (GNNs) leverage message passing mechanisms to learn the topological features of graph data. Traditional GNNs learn node features in a spatial domain unrelated to the topology, which can hardly ensure that topological features are captured. In this paper, we formulate message passing as a system of hyperbolic partial differential equations (hyperbolic PDEs), constituting a dynamical system that explicitly maps node representations into a particular solution space. This solution space is spanned by a set of eigenvectors describing the topological structure of graphs. Within this system, at any moment in time, a node's features can be decomposed into a superposition of the eigenvector basis. This not only enhances the interpretability of message passing but also enables the explicit extraction of fundamental characteristics of the topological structure. Furthermore, by solving this system of hyperbolic partial differential equations, we establish a connection with spectral graph neural networks (spectral GNNs), serving as a message passing enhancement paradigm for spectral GNNs. We further introduce polynomials to approximate arbitrary filter functions. Extensive experiments demonstrate that the paradigm of hyperbolic PDEs not only exhibits strong flexibility but also significantly enhances the performance of various spectral GNNs across diverse graph tasks.
中文摘要:本文提出将图神经网络消息传递构建为双曲偏微分方程系统的新方法,能够在显式提取拓扑特征的同时增强模型可解释性,并显著提升各类谱图神经网络在不同图任务中的性能表现。
English Summary: This paper proposes a novel approach that formulates graph neural network message passing as hyperbolic partial differential equations, enabling explicit topological feature extraction and enhanced interpretability while significantly boosting spectral GNN performance across various graph tasks.

Authors:Maksim Kolodiazhnyi, Denis Tarasov, Dmitrii Zhemchuzhnikov, Alexander Nikulin, Ilya Zisman, Anna Vorontsova, Anton Konushin, Vladislav Kurenkov, Danila Rukhovich
Title: cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning
Abstract:
Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLMs), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback obtained programmatically. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Policy Optimization (GRPO) outperform offline alternatives. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets a new state of the art on three challenging datasets, including a real-world one.
中文: 本文提出了一种多模态CAD重建模型,能同时处理点云、图像和文本三种输入,通过监督微调和强化学习两阶段训练流程,在多个数据集上实现了最先进的性能。
English: This paper introduces a multi-modal CAD reconstruction model that processes point clouds, images, and text simultaneously, achieving state-of-the-art performance through a two-stage training pipeline combining supervised fine-tuning and reinforcement learning.

Authors:Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie
Title: Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Abstract:
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
Chinese: 本文提出了一种块级KV缓存机制和置信度感知并行解码策略,在保持生成质量的同时显著提升了扩散大语言模型的推理速度,实现了高达27.6倍的吞吐量提升且精度损失极小。
English: This paper introduces a block-wise KV Cache and a confidence-aware parallel decoding strategy to significantly enhance the inference speed of Diffusion LLMs while maintaining generation quality, achieving up to 27.6× throughput improvement with minimal accuracy loss.
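
The confidence-aware selection rule described in the abstract can be sketched as below. This is a minimal illustration under stated assumptions: the function name `select_tokens_to_unmask`, the default threshold of 0.9, and the fallback to the single most confident position are all hypothetical choices for this example, not details confirmed by the abstract.

```python
def select_tokens_to_unmask(candidates, threshold=0.9):
    """Pick which masked positions to decode in the current step.

    candidates: dict mapping masked position -> (predicted_token, confidence),
        where confidence is the model's probability for that token.
    Returns a dict position -> token for every position whose confidence
    reaches the threshold. If none qualifies, the single most confident
    position is decoded so the loop always makes progress (an assumption
    of this sketch).
    """
    if not candidates:
        return {}
    chosen = {pos: tok for pos, (tok, conf) in candidates.items() if conf >= threshold}
    if not chosen:
        best_pos = max(candidates, key=lambda pos: candidates[pos][1])
        chosen = {best_pos: candidates[best_pos][0]}
    return chosen


# Hypothetical predictions for three masked positions.
step_candidates = {3: ("cat", 0.97), 4: ("sat", 0.55), 7: ("mat", 0.92)}
decoded_now = select_tokens_to_unmask(step_candidates)
```

Low-confidence positions stay masked and are revisited in a later step, once their neighbours have been filled in, which is how the rule limits violations of token dependencies.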

Authors:Feibo Jiang, Cunhua Pan, Li Dong, Kezhi Wang, Octavia A. Dobre, Merouane Debbah
Title: From Large AI Models to Agentic AI: A Tutorial on Future Intelligent Communications
Abstract:
With the advent of 6G communications, intelligent communication systems face multiple challenges, including constrained perception and response capabilities, limited scalability, and low adaptability in dynamic environments. This tutorial provides a systematic introduction to the principles, design, and applications of Large Artificial Intelligence Models (LAMs) and Agentic AI technologies in intelligent communication systems, aiming to offer researchers a comprehensive overview of cutting-edge technologies and practical guidance. First, we outline the background of 6G communications, review the technological evolution from LAMs to Agentic AI, and clarify the tutorial's motivation and main contributions. Subsequently, we present a comprehensive review of the key components required for constructing LAMs. We further categorize LAMs and analyze their applicability, covering Large Language Models (LLMs), Large Vision Models (LVMs), Large Multimodal Models (LMMs), Large Reasoning Models (LRMs), and lightweight LAMs. Next, we propose a LAM-centric design paradigm tailored for communications, encompassing dataset construction and both internal and external learning approaches. Building upon this, we develop an LAM-based Agentic AI system for intelligent communications, clarifying its core components such as planners, knowledge bases, tools, and memory modules, as well as its interaction mechanisms. We also introduce a multi-agent framework with data retrieval, collaborative planning, and reflective evaluation for 6G. Subsequently, we provide a detailed overview of the applications of LAMs and Agentic AI in communication scenarios. Finally, we summarize the research challenges and future directions in current studies, aiming to support the development of efficient, secure, and sustainable next-generation intelligent communication systems.
中文摘要:本教程系统阐述大型人工智能模型与智能体AI如何应对6G通信的感知局限与适应性不足等挑战,通过解析技术架构、设计范式与应用场景,为构建高效安全的下一代通信系统提供全面指导。
English Summary: This tutorial systematically explores how Large AI Models and Agentic AI can address 6G communication challenges by detailing their components, design paradigms, and applications while outlining future research directions.

Authors:Xiangxiang Dai, Xiaowei Sun, Jinhang Zuo, Xutong Liu, John C. S. Lui
Title: A Unified Online-Offline Framework for Co-Branding Campaign Recommendations
Abstract:
Co-branding has become a vital strategy for businesses aiming to expand market reach within recommendation systems. However, identifying effective cross-industry partnerships remains challenging due to resource imbalances, uncertain brand willingness, and ever-changing market conditions. In this paper, we provide the first systematic study of this problem and propose a unified online-offline framework to enable co-branding recommendations. Our approach begins by constructing a bipartite graph linking "initiating" and "target" brands to quantify co-branding probabilities and assess market benefits. During the online learning phase, we dynamically update the graph in response to market feedback, while striking a balance between exploring new collaborations for long-term gains and exploiting established partnerships for immediate benefits. To address the high initial co-branding costs, our framework mitigates redundant exploration, thereby enhancing short-term performance while ensuring sustainable strategic growth. In the offline optimization phase, our framework consolidates the interests of multiple sub-brands under the same parent brand to maximize overall returns, avoid excessive investment in single sub-brands, and reduce unnecessary costs associated with over-prioritizing a single sub-brand. We present a theoretical analysis of our approach, establishing a highly nontrivial sublinear regret bound for online learning in the complex co-branding problem, and enhancing the approximation guarantee for the NP-hard offline budget allocation optimization. Experiments on both synthetic and real-world co-branding datasets demonstrate the practical effectiveness of our framework, with at least 12% improvement.
Chinese Summary: 本文提出一个统一的线上线下框架,通过平衡探索与利用来动态推荐跨行业品牌合作,在理论和实践评估中均实现了显著性能提升。
English Summary: This paper introduces a unified online-offline framework that dynamically recommends cross-industry co-branding partnerships by balancing exploration and exploitation, achieving significant performance improvements in both theoretical and practical evaluations.

Authors:Di Wu, Jiaxin Fan, Junzhe Zang, Guanbo Wang, Wei Yin, Wenhao Li, Bo Jin
Title: Reinforced Reasoning for Embodied Planning
Abstract:
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. While recent vision-language models (VLMs) excel at static perception tasks, they struggle with the temporal reasoning, spatial understanding, and commonsense grounding needed for planning in interactive environments. In this work, we introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning. We first distill a high-quality dataset from a powerful closed-source model and perform supervised fine-tuning (SFT) to equip the model with structured decision-making priors. We then design a rule-based reward function tailored to multi-step action quality and optimize the policy via Group Relative Policy Optimization (GRPO). Our approach is evaluated on Embench, a recent benchmark for interactive embodied tasks, covering both in-domain and out-of-domain scenarios. Experimental results show that our method significantly outperforms models of similar or larger scale, including GPT-4o-mini and 70B+ open-source baselines, and exhibits strong generalization to unseen environments. This work highlights the potential of reinforcement-driven reasoning to advance long-horizon planning in embodied AI.
中文: 本文提出一种强化微调框架,通过结合监督学习和规则化奖励来增强具身规划能力,在已知和未知环境中均显著超越现有模型。
English: This paper introduces a reinforcement fine-tuning framework that enhances embodied planning by combining supervised learning with rule-based rewards, significantly outperforming existing models in both familiar and new environments.
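
A rule-based reward combined with GRPO-style optimization can be sketched as follows. This is a minimal illustration, not the paper's reward: `rule_based_reward` (matching-prefix fraction) and the action strings are hypothetical, and `group_relative_advantages` shows only the group-normalization step commonly described for GRPO, not the full policy update.

```python
from statistics import mean, pstdev


def rule_based_reward(predicted_actions, reference_actions):
    """Hypothetical reward: fraction of the reference plan matched as a prefix."""
    if not reference_actions:
        return 0.0
    matched = 0
    for predicted, reference in zip(predicted_actions, reference_actions):
        if predicted != reference:
            break
        matched += 1
    return matched / len(reference_actions)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts sampled for the same prompt.

    Each rollout's advantage is its reward minus the group mean, divided by
    the group standard deviation, so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical rollouts for the same task.
reference = ["find mug", "pick mug", "place mug"]
rollouts = [
    ["find mug", "pick mug", "place mug"],
    ["find mug", "pick plate", "place plate"],
    ["open door", "pick mug", "place mug"],
]
rewards = [rule_based_reward(r, reference) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's average get positive advantages and are reinforced; when every rollout earns the same reward, all advantages are zero and that group contributes no gradient.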

Authors:Ziyang Zheng, Kezhi Li, Zhengyuan Shi, Qiang Xu
Title: Functional Matching of Logic Subgraphs: Beyond Structural Isomorphism
Abstract:
Subgraph matching in logic circuits is foundational for numerous Electronic Design Automation (EDA) applications, including datapath optimization, arithmetic verification, and hardware trojan detection. However, existing techniques rely primarily on structural graph isomorphism and thus fail to identify function-related subgraphs when synthesis transformations substantially alter circuit topology. To overcome this critical limitation, we introduce the concept of functional subgraph matching, a novel approach that identifies whether a given logic function is implicitly present within a larger circuit, irrespective of structural variations induced by synthesis or technology mapping. Specifically, we propose a two-stage multi-modal framework: (1) learning robust functional embeddings across AIG and post-mapping netlists for functional subgraph detection, and (2) identifying fuzzy boundaries using a graph segmentation approach. Evaluations on standard benchmarks (ITC99, OpenABCD, ForgeEDA) demonstrate significant performance improvements over existing structural methods, with an average accuracy of 93.8% in functional subgraph detection and a Dice score of 91.3% in fuzzy boundary identification.
中文: 本文提出功能子图匹配的新方法,通过两阶段多模态框架识别电路中的逻辑功能而不受结构变化影响,在功能检测和模糊边界识别中分别达到93.8%和91.3%的准确率。
English: This paper introduces functional subgraph matching, a novel approach that identifies logic functions in circuits regardless of structural changes, achieving 93.8% accuracy in detection and a Dice score of 91.3% in boundary identification through a two-stage multimodal framework.
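
For readers unfamiliar with the boundary metric, the Dice score over two node sets can be computed as in this minimal sketch. The function name `dice_score` and the gate ids are hypothetical, and how the paper aggregates the score across benchmarks is not stated in the abstract.

```python
def dice_score(predicted_nodes, target_nodes):
    """Dice coefficient between a predicted and a ground-truth node set.

    Dice = 2 * |P intersect T| / (|P| + |T|), ranging from 0 (disjoint)
    to 1 (identical). Two empty sets are treated as a perfect match.
    """
    predicted, target = set(predicted_nodes), set(target_nodes)
    if not predicted and not target:
        return 1.0
    return 2 * len(predicted & target) / (len(predicted) + len(target))


# Hypothetical gate ids: the prediction shares two of the three target gates.
score = dice_score({1, 2, 3, 4}, {3, 4, 5})
```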

Authors:Weiyu Liu, Neil Nie, Ruohan Zhang, Jiayuan Mao, Jiajun Wu
Title: Learning Compositional Behaviors from Demonstration and Language
Abstract:
We introduce Behavior from Language and Demonstration (BLADE), a framework for long-horizon robotic manipulation by integrating imitation learning and model-based planning. BLADE leverages language-annotated demonstrations, extracts abstract action knowledge from large language models (LLMs), and constructs a library of structured, high-level action representations. These representations include preconditions and effects grounded in visual perception for each high-level action, along with corresponding controllers implemented as neural network-based policies. BLADE can recover such structured representations automatically, without manually labeled states or symbolic definitions. BLADE shows significant capabilities in generalizing to novel situations, including novel initial states, external state perturbations, and novel goals. We validate the effectiveness of our approach both in simulation and on real robots with a diverse set of objects with articulated parts, partial observability, and geometric constraints.
Chinese: BLADE框架通过结合模仿学习和基于模型的规划,利用语言标注演示和大语言模型自动提取结构化动作表示,在仿真和真实机器人实验中均展现出对新颖情境的强大泛化能力。
English: BLADE is a framework that combines imitation learning and model-based planning to enable long-horizon robotic manipulation by automatically extracting structured action representations from language-annotated demonstrations and large language models, demonstrating strong generalization in both simulation and real-world scenarios.

Authors:Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, Bang An, Bayan Bruss, John Langford, Furong Huang
Title: EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles
Abstract:
With Large Language Models (LLMs) rapidly approaching and potentially surpassing human-level performance, it has become imperative to develop approaches capable of effectively supervising and enhancing these powerful models using smaller, human-level models exposed to only human-level data. We address this critical weak-to-strong (W2S) generalization challenge by proposing a novel method that improves weak experts by training them on the same limited human-level data, enabling them to generalize to complex, super-human-level tasks. Our approach, called EnsemW2S, employs a token-level ensemble strategy that iteratively combines multiple weak experts, systematically addressing the shortcomings identified in preceding iterations. By continuously refining these weak models, we significantly enhance their collective ability to supervise stronger student models. We extensively evaluate the generalization performance of both the ensemble of weak experts and the subsequent strong student model across in-distribution (ID) and out-of-distribution (OOD) datasets. For OOD, we specifically introduce question difficulty as an additional dimension for defining distributional shifts. Our empirical results demonstrate notable improvements of 4% and 3.2% on ID datasets and up to 6% and 2.28% on OOD datasets for the expert and student models, respectively, underscoring the effectiveness of our proposed method in advancing W2S generalization.
中文摘要:提出的EnsemW2S方法通过令牌级集成策略迭代结合多个弱专家模型,有效提升了弱监督至强泛化能力,在分布内和分布外任务上均取得了显著性能提升。
English Summary: The proposed EnsemW2S method enhances weak-to-strong generalization by iteratively combining multiple weak experts through token-level ensemble training, achieving significant performance improvements on both in-distribution and out-of-distribution tasks.
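
The token-level combining step can be sketched as below. This is a minimal illustration that assumes each weak expert exposes a next-token probability distribution and that expert weights are already known; the name `ensemble_next_token` and the weights are hypothetical, and the iterative procedure that refines experts across rounds is not shown.

```python
def ensemble_next_token(distributions, weights):
    """Combine per-expert next-token distributions by a weighted average.

    distributions: list of dicts mapping token -> probability, one per expert.
    weights: non-negative expert weights (hypothetical; normalized here).
    Returns (most_likely_token, combined_distribution).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one expert weight must be positive")
    combined = {}
    for dist, weight in zip(distributions, weights):
        for token, prob in dist.items():
            combined[token] = combined.get(token, 0.0) + (weight / total) * prob
    best = max(combined, key=combined.get)
    return best, combined


# Three hypothetical weak experts voting on the next answer token.
experts = [{"A": 0.6, "B": 0.4}, {"A": 0.2, "B": 0.8}, {"A": 0.45, "B": 0.55}]
token_equal, _ = ensemble_next_token(experts, [1, 1, 1])
token_skewed, mix = ensemble_next_token(experts, [5, 1, 1])
```

With equal weights the ensemble follows the majority of probability mass; raising one expert's weight lets it override the others. A boosting-style scheme would set those weights from each expert's past errors.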

Authors:Lianghui Zhu, Xitong Ling, Minxi Ouyang, Xiaoping Liu, Tian Guan, Mingxi Fu, Zhiqiang Cheng, Fanglei Fu, Maomao Zeng, Liming Liu, Song Duan, Qiang Huang, Ying Xiao, Jianming Li, Shanming Lu, Zhenghua Piao, Mingxi Zhu, Yibo Jin, Shan Xu, Qiming He, Yizhi Wang, Junru Cheng, Xuanyu Wang, Luxi Xie, Houqiang Li, Sufang Tian, Yonghong He
Title: Subspecialty-Specific Foundation Model for Intelligent Gastrointestinal Pathology
Abstract:
Gastrointestinal (GI) diseases represent a clinically significant burden, necessitating precise diagnostic approaches to optimize patient outcomes. Conventional histopathological diagnosis suffers from limited reproducibility and diagnostic variability. To overcome these limitations, we develop Digepath, a specialized foundation model for GI pathology. Our framework introduces a dual-phase iterative optimization strategy combining pretraining with fine-screening, specifically designed to address the detection of sparsely distributed lesion areas in whole-slide images. Digepath is pretrained on over 353 million multi-scale images from 210,043 H&E-stained slides of GI diseases. It attains state-of-the-art performance on 33 out of 34 tasks related to GI pathology, including pathological diagnosis, protein expression status prediction, gene mutation prediction, and prognosis evaluation. We further translate the intelligent screening module for early GI cancer and achieve near-perfect 99.70% sensitivity across nine independent medical institutions. This work not only advances AI-driven precision pathology for GI diseases but also bridges critical gaps in histopathological practice.
中文: Digepath作为胃肠病理学的基础模型,在多数诊断任务中达到领先水平,并在多家医疗机构实现了近乎完美的早期癌症筛查灵敏度。
English: Digepath, a foundation model for gastrointestinal pathology, achieves state-of-the-art performance on most diagnostic tasks and demonstrates near-perfect sensitivity in early cancer screening across multiple institutions.

Authors:Xupeng Zhu, Yu Qi, Yizhe Zhu, Robin Walters, Robert Platt
Title: EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation
Abstract:
Transformer architectures can effectively learn language-conditioned, multi-task 3D open-loop manipulation policies from demonstrations by jointly processing natural language instructions and 3D observations. However, although both the robot policy and language instructions inherently encode rich 3D geometric structures, standard transformers lack built-in guarantees of geometric consistency, often resulting in unpredictable behavior under SE(3) transformations of the scene. In this paper, we leverage SE(3) equivariance as a key structural property shared by both policy and language, and propose EquAct, a novel SE(3)-equivariant multi-task transformer. EquAct is theoretically guaranteed to be SE(3) equivariant and consists of two key components: (1) an efficient SE(3)-equivariant point cloud-based U-net with spherical Fourier features for policy reasoning, and (2) SE(3)-invariant Feature-wise Linear Modulation (iFiLM) layers for language conditioning. To evaluate its spatial generalization ability, we benchmark EquAct on 18 RLBench simulation tasks with both SE(3) and SE(2) scene perturbations, and on 4 physical tasks. EquAct achieves state-of-the-art performance across these simulation and physical tasks.
Chinese: 尽管Transformer架构能从语言和演示中有效学习多任务3D操作策略,但缺乏几何一致性保证,因此提出EquAct这一SE(3)等变Transformer,通过等变点云网络和不变语言调制层实现空间泛化,在仿真与实体任务中达到最优性能。
English: Transformer architectures effectively learn multi-task 3D manipulation policies from language and demonstrations but lack geometric consistency, leading to the development of EquAct, an SE(3)-equivariant transformer that ensures spatial generalization and achieves state-of-the-art performance in simulation and physical tasks.

Authors:Ekaterina Fadeeva, Aleksandr Rubashevskii, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, Maxim Panov
Title: Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation
Abstract:
Large Language Models (LLMs) enhanced with external knowledge retrieval, an approach known as Retrieval-Augmented Generation (RAG), have shown strong performance in open-domain question answering. However, RAG systems remain susceptible to hallucinations: factually incorrect outputs that may arise either from inconsistencies in the model's internal knowledge or incorrect use of the retrieved context. Existing approaches often conflate factuality with faithfulness to the retrieved context, misclassifying factually correct statements as hallucinations if they are not directly supported by the retrieval. In this paper, we introduce FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), a novel method for hallucination detection in RAG outputs. FRANQ applies different Uncertainty Quantification (UQ) techniques to estimate factuality based on whether a statement is faithful to the retrieved context or not. To evaluate FRANQ and other UQ techniques for RAG, we present a new long-form Question Answering (QA) dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging examples. Extensive experiments on long- and short-form QA across multiple datasets and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing methods.
中文: 检索增强生成(RAG)系统常产生事实性幻觉,但现有方法混淆了事实准确性与对检索内容的忠实度;提出的FRANQ方法通过不确定性量化区分忠实度和整体事实性,在多个数据集和大语言模型上实现了更精确的幻觉检测。
English: Retrieval-Augmented Generation (RAG) systems often produce hallucinations, but existing methods confuse factuality with faithfulness, leading to misclassification; the proposed FRANQ method applies uncertainty quantification to distinguish between faithfulness to retrieved context and overall factuality, achieving superior hallucination detection accuracy across multiple datasets and LLMs.

Authors:Xiao Hu, Xingyu Lu, Liyuan Mao, YiFan Zhang, Tianke Zhang, Bin Wen, Fan Yang, Tingting Gao, Guorui Zhou
Title: Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning
Abstract:
Reinforcement learning (RL) has played an important role in improving the reasoning ability of large language models (LLMs). Some studies apply RL directly to smaller base models (known as zero-RL) and also achieve notable progress. However, in this paper, we show that using only 920 examples, a simple distillation method based on the base model can clearly outperform zero-RL, which typically requires much more data and computational cost. By analyzing the token frequency in model outputs, we find that the distilled model shows more flexible reasoning. It uses anthropomorphic tokens and logical connectors much more often than the zero-RL model. Further analysis reveals that distillation enhances the presence of two advanced cognitive behaviors: Multi-Perspective Thinking or Attempting and Metacognitive Awareness. Frequent occurrences of these two advanced cognitive behaviors give rise to flexible reasoning, which is essential for solving complex reasoning problems, while zero-RL fails to significantly boost the frequency of these behaviors.
中文: 仅用920个示例的简单蒸馏方法明显优于零强化学习,它通过更频繁地使用拟人化标记和逻辑连接词,增强了高级认知行为,从而实现了更灵活的推理。
English: A simple distillation method using only 920 examples outperforms zero-RL by fostering more flexible reasoning through increased use of anthropomorphic tokens and logical connectors, enhancing advanced cognitive behaviors.

Authors:Zhuo Li, Guodong Du, Weiyang Guo, Yigeng Zhou, Xiucheng Li, Wenya Wang, Fangming Liu, Yequan Wang, Deheng Ye, Min Zhang, Jing Li
Title: Multi-objective Large Language Model Alignment with Hierarchical Experts
Abstract:
Aligning large language models (LLMs) to simultaneously satisfy multiple objectives remains a significant challenge, especially given the diverse and often conflicting nature of human preferences. Existing alignment methods struggle to balance trade-offs effectively, often requiring costly retraining or yielding suboptimal results across the Pareto frontier of preferences. In this paper, we introduce HoE (Hierarchical Mixture-of-Experts), a lightweight, parameter-efficient, and plug-and-play approach that eliminates the need for model training, while enabling LLMs to adapt across the entire Pareto frontier and accommodate diverse user preferences. In particular, HoE consists of three hierarchical components: LoRA Experts, Router Experts and Preference Routing, reaching optimal Pareto frontiers and achieving a trade-off between parameter size, training cost, and performance. We evaluate HoE across various tasks on 14 objectives and 200 different preferences among 6 benchmarks, demonstrating superior performance over 15 recent baselines. Code is available in the supplementary materials.
中文: 本文提出HoE方法,这是一种轻量级、参数高效的即插即用方案,无需训练即可让大语言模型适应帕累托前沿,在多个基准测试中优于现有方法。
English: The paper introduces HoE, a lightweight and parameter-efficient approach that enables large language models to adapt across the Pareto frontier without training, outperforming existing methods on multiple benchmarks.

Authors:Tianhua Qi, Shiyan Wang, Cheng Lu, Tengfei Song, Hao Yang, Zhanglin Wu, Wenming Zheng
Title: PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts
Abstract:
Controllable emotional voice conversion (EVC) aims to manipulate emotional expressions to increase the diversity of synthesized speech. Existing methods typically rely on predefined labels, reference audios, or prespecified factor values, often overlooking individual differences in emotion perception and expression. In this paper, we introduce PromptEVC, which utilizes natural language prompts for precise and flexible emotion control. To bridge text descriptions with emotional speech, we propose an emotion descriptor and a prompt mapper to generate fine-grained emotion embeddings, trained jointly with reference embeddings. To enhance naturalness, we present a prosody modeling and control pipeline that adjusts the rhythm based on linguistic content and emotional cues. Additionally, a speaker encoder is incorporated to preserve identity. Experimental results demonstrate that PromptEVC outperforms state-of-the-art controllable EVC methods in emotion conversion, intensity control, mixed emotion synthesis, and prosody manipulation. Speech samples are available at https://jeremychee4.github.io/PromptEVC/.
Chinese: PromptEVC通过自然语言提示实现精确的情感语音转换,在情感表达和韵律控制方面优于现有方法,提升了语音合成的自然度和多样性。
English: PromptEVC introduces a novel approach to emotional voice conversion using natural language prompts for precise emotion control, outperforming existing methods in flexibility and performance.

Authors:Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma
Title: Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction
Abstract:
Audio-visual speaker extraction isolates a target speaker's speech from a mixture speech signal conditioned on a visual cue, typically using the target speaker's face recording. However, in real-world scenarios, other co-occurring faces are often present on-screen, providing valuable speaker activity cues in the scene. In this work, we introduce a plug-and-play inter-speaker attention module to process these flexible numbers of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: the AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and sparsely overlapped MISP, demonstrate that our approach consistently outperforms baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
Chinese: 本研究提出了一种即插即用的说话人交互注意力模块,通过利用多人场景中同时出现的面部信息来增强音视频说话人提取效果,在多个数据集上均表现出优于基准方法的性能,并通过跨数据集验证证实了其鲁棒性和泛化能力。
English: This research introduces a plug-and-play inter-speaker attention module that leverages co-occurring faces in multi-person environments to enhance audio-visual speaker extraction, demonstrating consistent performance improvements across diverse datasets and confirming robustness through cross-dataset evaluations.

Authors:Jiabao Ji, Yongchao Chen, Yang Zhang, Ramana Rao Kompella, Chuchu Fan, Gaowen Liu, Shiyu Chang
Title: Collision- and Reachability-Aware Multi-Robot Control with Grounded LLM Planners
Abstract:
Large language models (LLMs) have demonstrated strong performance in various robot control tasks. However, their deployment in real-world applications remains constrained. Even state-of-the-art LLMs, such as GPT-o4mini, frequently produce invalid action plans that violate physical constraints, such as directing a robot to an unreachable location or causing collisions between robots. This issue primarily arises from a lack of awareness of these physical constraints during the reasoning process. To address this issue, we propose a novel framework that integrates reinforcement learning with verifiable rewards (RLVR) to incentivize knowledge of physical constraints into LLMs to induce constraints-aware reasoning during plan generation. In this approach, only valid action plans that successfully complete a control task receive positive rewards. We applied our method to two small-scale LLMs: a non-reasoning Qwen2.5-3B-Instruct and a reasoning Qwen3-4B. The experimental results demonstrate that constraint-aware small LLMs largely outperform large-scale models without constraints, grounded on both the BoxNet task and a newly developed BoxNet3D environment built using MuJoCo. This work highlights the effectiveness of grounding even small LLMs with physical constraints to enable scalable and efficient multi-robot control in complex, physically constrained environments.
中文: 本研究提出了一种带可验证奖励的强化学习框架,通过将物理约束知识融入小型语言模型,使其在机器人控制任务中生成避免越界或碰撞的有效行动计划,从而超越无约束的大型模型性能。
English: This study introduces a reinforcement learning framework with verifiable rewards to enhance physical constraint awareness in small language models, enabling them to outperform larger models in robot control tasks by generating valid action plans that avoid violations like unreachable targets or collisions.
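
The reward rule stated in the abstract above (only valid plans that complete the task are rewarded) can be illustrated with a toy checker. This is a minimal sketch under assumptions that are not from the paper: a 2D grid, a hypothetical plan format (one robot-to-cell mapping per time step), a hypothetical `reachable` workspace table, and a binary 0/1 reward.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]
# Hypothetical plan format: one dict per time step, mapping robot name -> target cell.
Plan = List[Dict[str, Cell]]


def verifiable_reward(
    plan: Plan,
    reachable: Dict[str, Set[Cell]],
    goal: Dict[str, Cell],
) -> float:
    """Return 1.0 only if every step is valid and the final step equals the goal."""
    if not plan:
        return 0.0
    for step in plan:
        # Reachability: each robot must stay inside its own (assumed) workspace.
        for robot, cell in step.items():
            if cell not in reachable.get(robot, set()):
                return 0.0
        # Collision: no two robots may occupy the same cell in the same step.
        cells = list(step.values())
        if len(cells) != len(set(cells)):
            return 0.0
    # Task completion: the last step must match the goal configuration.
    return 1.0 if plan[-1] == goal else 0.0
```

Because the check is deterministic, the same function could serve both as a training reward and as an evaluation filter; how the paper actually shapes its reward is not specified in the abstract.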

Authors:Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, Xun Huang
Title: Long-Context State-Space Video World Models
Abstract:
Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.
中文: 本文提出了一种新颖的视频扩散架构,通过结合状态空间模型与分块扫描机制及局部注意力,有效增强了自回归帧预测中的长期记忆能力,在空间检索任务中表现优异的同时保持了计算效率。
English: This paper introduces a novel video diffusion architecture that integrates state-space models with a block-wise scanning scheme and local attention to enhance long-term memory in autoregressive frame prediction, achieving superior performance in spatial retrieval tasks while maintaining computational efficiency.

Authors:Qi Li, Kun Li, Haozhi Han, Honghui Shang, Xinfu He, Yunquan Zhang, Hong An, Ting Cao, Mao Yang
Title: SwarmThinkers: Learning Physically Consistent Atomic KMC Transitions at Scale
Abstract:
Can a scientific simulation system be physically consistent, interpretable by design, and scalable across regimes--all at once? Despite decades of progress, this trifecta remains elusive. Classical methods like Kinetic Monte Carlo ensure thermodynamic accuracy but scale poorly; learning-based methods offer efficiency but often sacrifice physical consistency and interpretability. We present SwarmThinkers, a reinforcement learning framework that recasts atomic-scale simulation as a physically grounded swarm intelligence system. Each diffusing particle is modeled as a local decision-making agent that selects transitions via a shared policy network trained under thermodynamic constraints. A reweighting mechanism fuses learned preferences with transition rates, preserving statistical fidelity while enabling interpretable, step-wise decision making. Training follows a centralized-training, decentralized-execution paradigm, allowing the policy to generalize across system sizes, concentrations, and temperatures without retraining. On a benchmark simulating radiation-induced Fe-Cu alloy precipitation, SwarmThinkers is the first system to achieve full-scale, physically consistent simulation on a single A100 GPU, previously attainable only via OpenKMC on a supercomputer. It delivers up to 4963x (3185x on average) faster computation with 485x lower memory usage. By treating particles as decision-makers, not passive samplers, SwarmThinkers marks a paradigm shift in scientific simulation--one that unifies physical consistency, interpretability, and scalability through agent-driven intelligence.
中文: SwarmThinkers 通过将原子粒子建模为决策代理的强化学习框架,在单个GPU上实现了物理一致、可解释且可扩展的仿真,计算速度提升高达4963倍,内存使用降低485倍。
English: SwarmThinkers is a reinforcement learning framework that models atomic particles as decision-making agents, achieving physically consistent, interpretable, and scalable simulations with up to 4963x speedup and 485x lower memory usage on a single GPU.

Authors:Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Mrinmaya Sachan, Preslav Nakov, Artem Shelmanov
Title: Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs
Abstract:
Large language models (LLMs) exhibit impressive fluency, but often produce critical errors known as "hallucinations". Uncertainty quantification (UQ) methods are a promising tool for coping with this fundamental shortcoming. Yet, existing UQ methods face challenges such as high computational overhead or reliance on supervised learning. Here, we aim to bridge this gap. In particular, we propose RAUQ (Recurrent Attention-based Uncertainty Quantification), an unsupervised approach that leverages intrinsic attention patterns in transformers to detect hallucinations efficiently. By analyzing attention weights, we identified a peculiar pattern: drops in attention to preceding tokens are systematically observed during incorrect generations for certain "uncertainty-aware" heads. RAUQ automatically selects such heads, recurrently aggregates their attention weights and token-level confidences, and computes sequence-level uncertainty scores in a single forward pass. Experiments across 4 LLMs and 12 question answering, summarization, and translation tasks demonstrate that RAUQ yields excellent results, outperforming state-of-the-art UQ methods using minimal computational overhead (<1% latency). Moreover, it requires no task-specific labels and no careful hyperparameter tuning, offering plug-and-play real-time hallucination detection in white-box LLMs.
中文摘要:提出的RAUQ方法通过分析注意力模式有效检测大语言模型中的幻觉,无需监督且计算成本极低,性能优于现有方法。
English Summary: The proposed RAUQ method efficiently detects hallucinations in large language models by analyzing attention patterns, outperforming existing methods with minimal computational cost and no supervision.

Authors:Juwei Yue, Haikuo Li, Jiawei Sheng, Yihan Guo, Xinghua Zhang, Chuan Zhou, Tingwen Liu, Li Guo
Title: Graph Wave Networks
Abstract:
Dynamics modeling has been introduced as a novel paradigm in message passing (MP) of graph neural networks (GNNs). Existing methods consider MP between nodes as a heat diffusion process, and leverage the heat equation to model the temporal evolution of nodes in the embedding space. However, the heat equation can hardly depict the wave nature of graph signals in graph signal processing. Besides, the heat equation is essentially a partial differential equation (PDE) involving a first partial derivative of time, whose numerical solution usually has low stability and leads to inefficient model training. In this paper, we would like to depict more wave details in MP, since graph signals are essentially wave signals that can be seen as a superposition of a series of waves in the form of eigenvectors. This motivates us to consider MP as a wave propagation process to capture the temporal evolution of wave signals in the space. Based on the wave equation in physics, we develop a graph wave equation to model wave propagation on graphs. In detail, we demonstrate that the graph wave equation can be connected to traditional spectral GNNs, facilitating the design of graph wave networks (GWNs) based on various Laplacians and enhancing the performance of the spectral GNNs. Besides, the graph wave equation is a PDE involving a second partial derivative of time, which has stronger stability on graphs than the heat equation that involves a first partial derivative of time. Additionally, we theoretically prove that the numerical solution derived from the graph wave equation is constantly stable, which significantly enhances model efficiency while ensuring its performance. Extensive experiments show that GWNs achieve SOTA and efficient performance on benchmark datasets, and exhibit outstanding performance in addressing challenging graph problems, such as over-smoothing and heterophily.
中文: 本文提出图波动方程作为图神经网络中消息传递的新范式,替代传统热扩散方法以更好地捕捉波动式信号传播,增强数值稳定性,并在基准数据集上实现最优性能。
English: This paper introduces a graph wave equation as a novel message passing paradigm in graph neural networks, replacing the traditional heat diffusion approach to better capture wave-like signal propagation, enhance numerical stability, and achieve state-of-the-art performance on benchmark datasets.

Authors:Yihan Chen, Benfeng Xu, Xiaorui Wang, Yongdong Zhang, Zhendong Mao
Title: Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking
Abstract:
Autonomous agents, which perceive environments and take actions to achieve goals, have become increasingly feasible with the advancements in large language models (LLMs). However, current powerful agents often depend on sophisticated prompt engineering combined with closed-source LLMs like GPT-4. Although training open-source LLMs using expert trajectories from teacher models has yielded some improvements in agent capabilities, this approach still faces limitations such as performance plateauing and error propagation. To mitigate these challenges, we propose STeP, a novel method for improving LLM-based agent training. We synthesize self-reflected trajectories that include reflections and corrections of error steps, which enhance the effectiveness of LLM agents in learning from teacher models, enabling them to become agents capable of self-reflecting and correcting. We also introduce a partial masking strategy that prevents the LLM from internalizing incorrect or suboptimal steps. Experiments demonstrate that our method improves agent performance across three representative tasks: ALFWorld, WebShop, and SciWorld. For the open-source model LLaMA2-7B-Chat, when trained using self-reflected trajectories constructed with Qwen1.5-110B-Chat as the teacher model, it achieves comprehensive improvements with less training data compared to agents trained exclusively on expert trajectories.
中文: STeP方法通过使用自我反思轨迹和部分遮蔽策略,改进了基于大语言模型的智能体训练,使其在多个任务中以更少数据实现更优性能。
English: The STeP method enhances LLM-based agent training by using self-reflected trajectories and partial masking to improve learning from teacher models, achieving better performance with less data across multiple tasks.
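
The partial masking idea in the abstract above (do not let the model learn from incorrect or suboptimal steps) can be sketched as a loss that skips flagged steps. This is an illustrative sketch only: the trajectory format, the per-step error flag, and the choice of a plain mean over unmasked tokens are assumptions, not details taken from the paper.

```python
from typing import List, Tuple

# Hypothetical trajectory format: (per-token losses for one step, step_is_erroneous).
Step = Tuple[List[float], bool]


def masked_mean_loss(steps: List[Step]) -> float:
    """Average token loss over steps that are not flagged as erroneous."""
    total, count = 0.0, 0
    for token_losses, is_error in steps:
        if is_error:
            continue  # masked step: its tokens contribute nothing to the loss
        total += sum(token_losses)
        count += len(token_losses)
    # If every step is masked there is nothing to learn from; report zero loss.
    return total / count if count else 0.0
```

In this sketch the erroneous step stays in the context that later steps condition on; it is only excluded from the training signal, so the reflection and correction that follow it are still learned.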

Authors:Yi Liu, Dianqing Liu, Mingye Zhu, Junbo Guo, Yongdong Zhang, Zhendong Mao
Title: Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models
Abstract:
The widespread adoption of large language models (LLMs) across industries has increased the demand for high-quality and customizable outputs. However, traditional alignment methods often require retraining large pretrained models, making it difficult to quickly adapt and optimize LLMs for diverse applications. To address this limitation, we propose a novel Residual Alignment Model (RAM) that formalizes the alignment process as a type of importance sampling. In this framework, the unaligned upstream model serves as the proposal distribution, while the alignment process is framed as secondary sampling based on an autoregressive alignment module that acts as an estimator of the importance weights. This design enables a natural detachment of the alignment module from the target aligned model, improving flexibility and scalability. Based on this model, we derive an efficient sequence-level training strategy for the alignment module, which operates independently of the proposal module. Additionally, we develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods. Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.
中文: 提出的残差对齐模型(RAM)将对齐过程形式化为重要性采样,实现了与上游模型的解耦,通过高效的序列级训练和迭代解码策略,在多项任务中优于基线方法且提升了灵活性。
English: The proposed Residual Alignment Model (RAM) introduces a flexible alignment framework that treats alignment as importance sampling, enabling efficient training and improved performance across diverse tasks without retraining base models.
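
The importance-sampling framing in the abstract above (proposal samples from the upstream model, then secondary sampling by estimated importance weights) can be illustrated with generic sampling-importance-resampling. This is a sketch under stated assumptions: it works on whole candidates rather than the paper's iterative token-level decoding, and `weight_fn` is a hypothetical stand-in for the alignment module's weight estimate.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def importance_resample(
    candidates: Sequence[T],
    weight_fn: Callable[[T], float],
    rng: random.Random,
) -> T:
    """Pick one proposal sample with probability proportional to its weight."""
    # Clamp negative estimates to zero so they can never be selected.
    weights: List[float] = [max(weight_fn(c), 0.0) for c in candidates]
    if sum(weights) == 0.0:
        # Degenerate case: fall back to a uniform pick among the proposals.
        return rng.choice(list(candidates))
    return rng.choices(list(candidates), weights=weights, k=1)[0]
```

A caller would first draw `candidates` from the unaligned model and then pass them through this step; the upstream model is never updated, which is the detachment the abstract describes.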

Authors:Pusheng Xu, Xia Gong, Xiaolan Chen, Weiyi Zhang, Jiancheng Yang, Bingjie Yan, Meng Yuan, Yalin Zheng, Mingguang He, Danli Shi
Title: Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat
Abstract:
Purpose: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating VLMs in ophthalmology. Methods: Ophthalmic image posts and associated captions published between January 1, 2016, and December 31, 2024, were collected from WeChat Official Accounts. Based on these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was used to evaluate the performance of three VLMs: GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B-Instruct. Results: The final OphthalWeChat dataset included 3,469 images and 30,120 QA pairs across 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.548), outperforming GPT-4o (0.522, P < 0.001) and Qwen2.5-VL-72B-Instruct (0.514, P < 0.001). It also led in both Chinese (0.546) and English subsets (0.550). Subset-specific performance showed Gemini 2.0 Flash excelled in Binary_CN (0.687), Single-choice_CN (0.666), and Single-choice_EN (0.646), while GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (BLEU-1: 0.301; BERTScore: 0.382), and Open-ended_EN (BLEU-1: 0.183; BERTScore: 0.240). Conclusions: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset reflects authentic clinical decision-making scenarios and enables quantitative evaluation of VLMs, supporting the development of accurate, specialized, and trustworthy AI systems for eye care.
中文: 本研究推出了首个眼科双语视觉问答基准,利用微信真实数据评估视觉语言模型,其中Gemini 2.0 Flash在性能比较中取得了最高综合准确率。
English: This study introduces the first bilingual VQA benchmark for ophthalmology, leveraging real-world WeChat data to evaluate VLMs, with Gemini 2.0 Flash achieving the highest overall accuracy in performance comparisons.

Authors:Jingyuan Liu, Zeyu Zhang, Xuchuang Wang, Xutong Liu, John C. S. Lui, Mohammad Hajiesmaili, Carlee Joe-Wong
Title: Offline Clustering of Linear Bandits: Unlocking the Power of Clusters in Data-Limited Environments
Abstract:
Contextual linear multi-armed bandits are a learning framework for making a sequence of decisions, e.g., advertising recommendations for a sequence of arriving users. Recent works have shown that clustering these users based on the similarity of their learned preferences can significantly accelerate the learning. However, prior work has primarily focused on the online setting, which requires continually collecting user data, ignoring the offline data widely available in many applications. To tackle these limitations, we study the offline clustering of bandits (Off-ClusBand) problem, which studies how to use the offline dataset to learn cluster properties and improve decision-making across multiple users. The key challenge in Off-ClusBand arises from data insufficiency for users: unlike the online case, in the offline case, we have a fixed, limited dataset to work from and thus must determine whether we have enough data to confidently cluster users together. To address this challenge, we propose two algorithms: Off-C²LUB, which we analytically show performs well for arbitrary amounts of user data, and Off-CLUB, which is prone to bias when data is limited but, given sufficient data, matches a theoretical lower bound that we derive for the offline clustered MAB problem. We experimentally validate these results on both real and synthetic datasets.
中文摘要:本文提出Off-ClusBand框架解决上下文线性多臂老虎机在离线聚类中的数据不足问题,通过设计Off-C²LUB和Off-CLUB两种算法,在理论证明与实验验证中展示了如何利用离线数据集优化跨用户决策。
English Summary: This paper introduces the Off-ClusBand framework to address data limitations in offline clustering of contextual linear multi-armed bandits, proposing two algorithms—Off-C²LUB and Off-CLUB—that effectively leverage offline datasets to improve decision-making while handling varying data sufficiency levels.

Authors:Longfei Yun, Chenyang An, Zilong Wang, Letian Peng, Jingbo Shang
Title: The Price of Format: Diversity Collapse in LLMs
Abstract:
Instruction-tuned large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. However, we identify a critical limitation of such formatting: it induces a phenomenon we term diversity collapse, where the model generates semantically similar outputs for open-ended inputs, undermining creativity and variability. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that (1) diversity collapse persists even under high-temperature sampling, and (2) structural tokens in templates significantly constrain the model's output space. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate them across three axes: downstream task performance, alignment behavior, and output diversity. Our analysis shows that format consistency between fine-tuning and inference is crucial for structure-sensitive tasks (e.g., GSM8K, IFEval), but has marginal influence on knowledge-heavy tasks (e.g., MMLU, WebQuestions). In contrast, output diversity is primarily governed by the presence or absence of structural tokens, with minimal formatting yielding the most diverse outputs. These findings reveal that current prompting conventions, while beneficial for alignment, may inadvertently suppress output diversity, underscoring the need for diversity-aware prompt design and instruction tuning.
中文摘要:指令调优大语言模型的结构化模板会导致多样性崩溃,限制创意输出的可变性;尽管格式一致性有利于结构化任务,但最小化格式化最能保持输出多样性。
English Summary: Instruction-tuned LLMs' structured templates cause diversity collapse, limiting creative output variability, and while format consistency benefits structured tasks, minimal formatting best preserves diversity.

Authors:Wasi Uddin Ahmad, Somshubra Majumdar, Boris Ginsburg
Title: From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?
Abstract:
Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation due to the frequent presence of extraneous code in raw outputs. This extraneous generation suggests a lack of awareness regarding output boundaries, requiring truncation for effective evaluation. The determination of an optimal truncation strategy, however, often proves intricate, particularly when the scope includes several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that seamlessly integrates with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without post-processing, especially when the middle consists of complete lines. However, post-processing of the LLM outputs remains necessary when the middle is a random span of code.
中文:后处理对于评估大语言模型在代码填充生成中的表现至关重要,用以去除多余代码,而监督微调则能通过实现代码无缝集成来提升性能,尤其在处理完整代码行时效果显著。
English: Post-processing is essential for evaluating LLMs in fill-in-the-middle code generation to remove extraneous code, though supervised fine-tuning improves performance by enabling seamless code integration, particularly with complete lines.
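
The kind of truncation the abstract above refers to can be illustrated with one simple heuristic: cut the generated middle where it starts reproducing the known suffix. This is a hypothetical sketch, not the post-processing used in the paper; the function name and the `min_overlap` threshold are assumptions chosen for illustration.

```python
def truncate_middle(generated: str, suffix: str, min_overlap: int = 8) -> str:
    """Cut the generated middle at the point where it begins repeating the suffix."""
    # Use the first few characters of the suffix as a probe for the overlap.
    probe = suffix[:min_overlap]
    if len(probe) < min_overlap:
        return generated  # suffix too short to detect an overlap reliably
    cut = generated.find(probe)
    # No overlap found: keep the raw output unchanged.
    return generated if cut == -1 else generated[:cut]
```

For example, with the suffix `"    return total\n"`, a raw output that continues past the missing line and re-emits `return total` is cut back to just the missing line. A fixed-length probe like this can misfire when the middle legitimately contains the same text as the suffix, which is one reason a single truncation rule is hard to get right across languages.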

Authors:Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, Xinchao Wang
Title: Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps
Abstract:
Multimodal large language models (MLLMs) have recently achieved significant progress in visual tasks, including semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on complex tasks involving mathematics and logic. However, their capacity for reasoning tasks involving fine-grained visual understanding remains insufficiently evaluated. To address this gap, we introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern: among open-source models, base models outperform reasoning ones, while the opposite trend is observed in closed-source models. Additionally, performance generally degrades when visual inputs are masked, indicating that while MLLMs can leverage prior knowledge to answer some questions, fine-grained visual reasoning tasks still require genuine visual perception for strong performance. Our benchmark study offers new insights into visual reasoning and contributes to investigating the gap between open-source and closed-source models.
中文: 本文提出ReasonMap基准,通过交通地图评估多模态大语言模型的细粒度视觉理解与空间推理能力,发现开源基础模型优于推理变体,而闭源模型呈现相反趋势。
English: This paper introduces ReasonMap, a benchmark for evaluating multimodal large language models' fine-grained visual understanding and spatial reasoning using transit maps, revealing that open-source base models outperform reasoning variants while closed-source models show the opposite trend.

Authors:Guoheng Sun, Ziyao Wang, Xuandong Zhao, Bowei Tian, Zheyu Shen, Yexiao He, Jinming Xing, Ang Li
Title: Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services
Abstract:
Modern large language model (LLM) services increasingly rely on complex, often abstract operations, such as multi-step reasoning and multi-agent collaboration, to generate high-quality outputs. While users are billed based on token consumption and API usage, these internal steps are typically not visible. We refer to such systems as Commercial Opaque LLM Services (COLS). This position paper highlights emerging accountability challenges in COLS: users are billed for operations they cannot observe, verify, or contest. We formalize two key risks: quantity inflation, where token and call counts may be artificially inflated, and quality downgrade, where providers might quietly substitute lower-cost models or tools. Addressing these risks requires a diverse set of auditing strategies, including commitment-based, predictive, behavioral, and signature-based methods. We further explore the potential of complementary mechanisms such as watermarking and trusted execution environments to enhance verifiability without compromising provider confidentiality. We also propose a modular three-layer auditing framework for COLS and users that enables trustworthy verification across execution, secure logging, and user-facing auditability without exposing proprietary internals. Our aim is to encourage further research and policy development toward transparency, auditability, and accountability in commercial LLM services.
中文摘要:本文揭示了商业不透明大语言服务中用户为不可见操作付费的问责风险,提出在不泄露商业机密的前提下通过审计框架和验证机制确保服务透明度的解决方案。
English Summary: This position paper identifies accountability risks in Commercial Opaque LLM Services where users are billed for unobservable operations, proposing auditing frameworks and verification mechanisms to ensure transparency without compromising proprietary information.

Authors:Junyan Zhang, Yiming Huang, Shuliang Liu, Yubo Gao, Xuming Hu
Title: Do BERT-Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLMs?
Abstract:
The rapid adoption of LLMs has overshadowed the potential advantages of traditional BERT-like models in text classification. This study challenges the prevailing "LLM-centric" trend by systematically comparing three categories of methods, i.e., fine-tuning BERT-like models, utilizing LLM internal states, and zero-shot inference, across six high-difficulty datasets. Our findings reveal that BERT-like models often outperform LLMs. We further categorize datasets into three types, perform PCA and probing experiments, and identify task-specific model strengths: BERT-like models excel in pattern-driven tasks, while LLMs dominate those requiring deep semantics or world knowledge. Based on this, we propose TaMAS, a fine-grained task selection strategy, advocating for a nuanced, task-driven approach over a one-size-fits-all reliance on LLMs.
中文: 本研究表明,在文本分类中,BERT类模型常优于大语言模型,尤其在模式驱动任务上,据此提出任务特定策略TaMAS,倡导根据任务特性选择模型而非一概依赖大语言模型。
English: This study demonstrates that BERT-like models frequently surpass LLMs in text classification, particularly for pattern-driven tasks, leading to the development of TaMAS, a task-specific strategy that promotes a tailored approach over universal LLM dependency.

Authors:Feifan Wang, Tengfei Song, Minggui He, Chang Su, Zhanglin Wu, Hao Yang, Wenming Zheng, Osamu Yoshie
Title: Emotion Knowledge Enhancement for Vision Large Language Models: A Self-Verification Approach for High-Quality Emotion Instruction Data Generation
Abstract:
Facial emotion perception in the vision large language model (VLLM) is crucial for achieving natural human-machine interaction. However, creating high-quality annotations for both coarse- and fine-grained facial emotion analysis demands costly expertise. The lack of such high-quality instruction data limits the performance of VLLMs in facial emotion perception. To address this, we propose a self-verification approach with emotion knowledge enhancement (SEKE), which generates high-quality instruction data for multi-grained emotion analysis cost-effectively using closed-source VLLM. This approach integrates prior human knowledge to VLLM inference, guided by the inherent correlations between three grained levels of emotion descriptions, i.e., discrete expression, valence-arousal, and action unit, to reliably generate comprehensive annotations. A self-verification strategy with Uncertainty-Aware Monte Carlo sampling (SV-UAMC) is further embedded to efficiently extract more accurate VLLM predictions, further improving annotation reliability. Consequently, we construct a facial emotion instruction dataset (FEID) containing three comprehensive descriptions, which provides coarse- and fine-grained emotional information for effective model training. Additionally, we introduce a facial emotion analysis benchmark (FEAB) to measure the VLLM's corresponding ability. Our method significantly outperforms state-of-the-art methods on three downstream facial emotion analysis tasks.
中文: SEKE方法通过自验证和情感知识增强,为视觉大语言模型高效生成高质量多粒度指令数据,显著提升了面部情感感知性能,在情感分析任务中优于现有先进方法。
English: The SEKE method enhances facial emotion perception in vision large language models by generating high-quality, multi-grained instruction data cost-effectively through self-verification and emotion knowledge integration, significantly outperforming existing approaches on emotion analysis tasks.

Authors:Zherui Zhang, Jiaxin Wu, Changwei Wang, Rongtao Xu, Longzhao Huang, Wenhao Xu, Wenbo Xu, Li Guo, Shibiao Xu
Title: FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation
Abstract:
Prompt learning is a parameter-efficient method that has been widely adopted to adapt Vision-Language Models (VLMs) to downstream tasks. While hard-prompt design requires domain expertise and iterative optimization, soft-prompt methods rely heavily on task-specific hard labels, limiting their generalization to unseen categories. Recent popular distillation-based prompt learning methods improve generalization by exploiting larger teacher VLMs and unsupervised knowledge transfer, yet their repetitive online inference of the teacher model sacrifices the inherent training efficiency advantage of prompt learning. In this paper, we propose Faster Distillation-Based Prompt Learning (FDBPL), which addresses these issues by sharing soft supervision contexts across multiple training stages and implementing accelerated I/O. Furthermore, FDBPL introduces a region-aware prompt learning paradigm with dual positive-negative prompt spaces to fully exploit randomly cropped regions that contain multi-level information. We propose a positive-negative space mutual learning mechanism based on similarity-difference learning, enabling student CLIP models to recognize correct semantics while learning to reject weakly related concepts, thereby improving zero-shot performance. Unlike existing distillation-based prompt learning methods that sacrifice parameter efficiency for generalization, FDBPL maintains the dual advantages of parameter efficiency and strong downstream generalization. Comprehensive evaluations across 11 datasets demonstrate superior performance in base-to-new generalization, cross-dataset transfer, and robustness tests, achieving $2.2\times$ faster training speed.
English Summary: FDBPL introduces a faster distillation-based prompt learning method that enhances training efficiency and generalization by sharing supervision contexts, accelerating I/O, and employing dual positive-negative prompt spaces with mutual learning.

Authors:Ancheng Xu, Zhihao Yang, Jingpeng Li, Guanghu Yuan, Longze Chen, Liang Yan, Jiehui Zhou, Zhen Qin, Hengyun Chang, Hamid Alinejad-Rokny, Bo Zheng, Min Yang
Title: EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications
Abstract:
E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content: inputs (text or images) that superficially comply with platform policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks that induce overt failures, evasive content exploits ambiguity and context, making it far harder to detect. Existing robustness benchmarks provide little guidance for this demanding, real-world challenge. We introduce EVADE, the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories, including body shaping, height growth, and health supplements. Two complementary tasks assess distinct capabilities: Single-Violation, which probes fine-grained reasoning under short prompts, and All-in-One, which tests long-context reasoning by merging overlapping policy rules into unified instructions. Notably, the All-in-One setting significantly narrows the performance gap between partial and full-match accuracy, suggesting that clearer rule definitions improve alignment between human and model judgment. We benchmark 26 mainstream LLMs and VLMs and observe substantial performance gaps: even state-of-the-art models frequently misclassify evasive samples. By releasing EVADE and strong baselines, we provide the first rigorous standard for evaluating evasive-content detection, expose fundamental limitations in current multimodal reasoning, and lay the groundwork for safer and more transparent content moderation systems in e-commerce. The dataset is publicly available at https://huggingface.co/datasets/koenshen/EVADE-Bench.
中文: EVADE是首个专家构建的中文多模态基准,专门用于评估基础模型在电商领域识别规避性内容的能力,结果显示即使最先进的模型也频繁误判,为内容审核系统的改进奠定了基础。
English: The EVADE benchmark is introduced as the first expert-curated, multimodal dataset in Chinese to evaluate foundation models' ability to detect evasive content in e-commerce, revealing significant performance gaps in current LLMs and VLMs despite their advanced capabilities.

Authors:Rudrajit Choudhuri, Bianca Trinkenreich, Rahul Pandita, Eirini Kalliamvakou, Igor Steinmacher, Marco Gerosa, Christopher Sanchez, Anita Sarma
Title: What Needs Attention? Prioritizing Drivers of Developers' Trust and Adoption of Generative AI
Abstract:
Generative AI (genAI) tools are advertised as productivity aids. Yet, issues related to miscalibrated trust and usage friction continue to hinder their adoption. Additionally, AI can be exclusionary, failing to support diverse users adequately, further exacerbating these concerns. One such aspect of diversity is cognitive diversity -- variations in users' cognitive styles -- that leads to divergence in interaction styles. When an individual's cognitive styles are unsupported, it creates additional barriers to technology adoption. Thus, to design tools that developers trust, we must first understand what factors affect their trust and intentions to use these tools in practice. We developed a theoretical model of factors influencing trust and adoption intentions towards genAI through a large-scale survey with developers (N=238) at GitHub and Microsoft. Using Partial Least Squares-Structural Equation Modeling (PLS-SEM), we found that genAI's system/output quality, functional value, and goal maintenance significantly influence developers' trust, which, along with their cognitive styles, affects their intentions to use these tools in work. An Importance-Performance Matrix Analysis (IPMA) identified factors that, despite their strong influence, underperform, revealing specific genAI aspects that need design prioritization. We bolster these findings by qualitatively analyzing developers' perceived challenges and risks of genAI usage to uncover why these gaps persist in development contexts. For genAI to indeed be a true productivity aid rather than a disguised productivity sink, it must align with developers' goals, maintain contextual transparency, reduce cognitive burden, and provide equitable interaction support. We provide practical suggestions to guide future genAI tool design for effective, trustworthy, and inclusive human-genAI interactions.
中文: 生成式AI工具的采用受制于信任失调和使用摩擦,认知多样性是关键因素,开发者调查表明系统质量、功能价值和目标维护影响信任与使用意愿,需通过设计优化实现公平交互支持。
English: Generative AI tools face adoption barriers due to trust miscalibration and usage friction, with cognitive diversity being a key factor, as shown by a developer survey revealing that system quality, functional value, and goal maintenance influence trust and usage intentions, necessitating design improvements for equitable support.

Authors:Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Bernhard Kainz, Bjoern Menze
Title: CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation
Abstract:
Evaluating long-context radiology report generation is challenging. NLG metrics fail to capture clinical correctness, while LLM-based metrics often lack generalizability. Clinical accuracy metrics are more relevant but are sensitive to class imbalance, frequently favoring trivial predictions. We propose the CRG Score, a distribution-aware and adaptable metric that evaluates only clinically relevant abnormalities explicitly described in reference reports. CRG supports both binary and structured labels (e.g., type, location) and can be paired with any LLM for feature extraction. By balancing penalties based on label distribution, it enables fairer, more robust evaluation and serves as a clinically aligned reward function.
中文摘要:CRG评分是一种新颖的、考虑分布特征的评估指标,旨在通过关注相关异常并平衡惩罚机制,对放射学报告生成的临床准确性进行更公平的评估。
English Summary: The CRG Score is a novel, distribution-aware metric designed to evaluate clinical accuracy in radiology report generation by focusing on relevant abnormalities and balancing penalties for fairer assessment.

Authors:Weiyang Guo, Jing Li, Wenya Wang, YU LI, Daojing He, Jun Yu, Min Zhang
Title: MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
Abstract:
The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden in interactions, leading LLMs to be more prone to produce harmful responses. In this paper, we propose the Multi-Turn Safety Alignment (MTSA) framework to address the challenge of securing LLMs in multi-round interactions. It consists of two stages: In the thought-guided attack learning stage, the red-team model learns about thought-guided multi-round jailbreak attacks to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities in interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.
中文摘要:本文提出多轮安全对齐(MTSA)框架,通过思维引导攻击学习和对抗性迭代优化两阶段设计,有效提升大语言模型在多轮对话中的安全防御能力。
English Summary: The paper introduces the Multi-Turn Safety Alignment (MTSA) framework to counter hidden malicious intents in multi-round dialogues, using thought-guided attack learning and adversarial optimization to enhance LLM security.

Authors:Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu
Title: LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions
Abstract:
High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.
Chinese: LongMagpie是一种自合成框架,通过利用对齐的大语言模型从文档中生成上下文相关的查询,自动创建高质量的长上下文指令数据,在长上下文任务中表现领先,同时在短上下文任务中保持竞争力。
English: LongMagpie is a self-synthesis framework that automatically generates high-quality long-context instruction data by leveraging aligned LLMs to produce contextually relevant queries from documents, achieving top performance in long-context tasks while remaining competitive in short-context ones.

Authors:Junlin Li, Guodong DU, Jing Li, Sim Kuan Goh, Wenya Wang, Yequan Wang, Fangming Liu, Ho-Kin Tang, Saleh Alharbi, Daojing He, Min Zhang
Title: Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling
Abstract:
Fine-tuning Large Language Models (LLMs) with multimodal encoders on modality-specific data expands the modalities that LLMs can handle, leading to the formation of Multimodal LLMs (MLLMs). However, this paradigm heavily relies on resource-intensive and inflexible fine-tuning from scratch with new multimodal data. In this paper, we propose MMER (Multi-modality Expansion and Retention), a training-free approach that integrates existing MLLMs for effective multimodal expansion while retaining their original performance. Specifically, MMER reuses MLLMs' multimodal encoders while merging their LLM parameters. By comparing original and merged LLM parameters, MMER generates binary masks to approximately separate LLM parameters for each modality. These decoupled parameters can independently process modality-specific inputs, reducing parameter conflicts and preserving original MLLMs' fidelity. MMER can also mitigate catastrophic forgetting by applying a similar process to MLLMs fine-tuned on new tasks. Extensive experiments show significant improvements over baselines, proving that MMER effectively expands LLMs' multimodal capabilities while retaining 99% of the original performance, and also markedly mitigates catastrophic forgetting.
中文: 本文提出MMER,一种无需训练的方法,通过复用现有多模态大语言模型(MLLM)的编码器并融合参数,在保持99%原始性能的同时扩展多模态能力,显著减轻灾难性遗忘问题。
English: The paper introduces MMER, a training-free method that integrates existing Multimodal Large Language Models (MLLMs) by reusing their encoders and merging parameters to expand multimodal capabilities while preserving 99% of original performance and reducing catastrophic forgetting.
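
The mask-based decoupling described in this abstract can be illustrated with a toy sketch. The abstract does not state how parameters are merged or how the masks are computed, so everything below is an assumption made for illustration: parameters are flat lists, merging is a plain sum of task vectors (fine-tuned minus base), and the hypothetical rule `binary_mask` keeps an entry when the original task vector and the merged one agree in sign. It is not the authors' implementation.

```python
def task_vector(finetuned, base):
    """Difference between fine-tuned and base parameters (assumed representation)."""
    return [f - b for f, b in zip(finetuned, base)]


def merge(task_vectors):
    """Merge by summing task vectors element-wise (an assumed merging scheme)."""
    return [sum(vals) for vals in zip(*task_vectors)]


def binary_mask(original_tv, merged_tv):
    """Hypothetical rule: keep entries where original and merged agree in sign."""
    return [1 if o * m > 0 else 0 for o, m in zip(original_tv, merged_tv)]


def decouple(base, merged_tv, mask):
    """Approximate one modality's parameters from the merged model via its mask."""
    return [b + k * m for b, k, m in zip(base, mask, merged_tv)]


# Toy example with two "modalities" sharing one base model.
base = [0.0, 0.0, 0.0]
tv_vision = task_vector([1.0, 0.0, -2.0], base)
tv_audio = task_vector([0.0, 3.0, 1.0], base)
merged_tv = merge([tv_vision, tv_audio])

mask_vision = binary_mask(tv_vision, merged_tv)
mask_audio = binary_mask(tv_audio, merged_tv)
params_vision = decouple(base, merged_tv, mask_vision)
params_audio = decouple(base, merged_tv, mask_audio)
```

In this toy setting each mask selects the merged entries that plausibly belong to one modality, so modality-specific inputs can be routed through their own approximate parameter set instead of the fully merged one.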

Authors:Boce Hu, Dian Wang, David Klee, Heng Tian, Xupeng Zhu, Haojie Huang, Robert Platt, Robin Walters
Title: 3D Equivariant Visuomotor Policy Learning via Spherical Projection
Abstract:
Equivariant models have recently been shown to improve the data efficiency of diffusion policy by a significant margin. However, prior work that explored this direction focused primarily on point cloud inputs generated by multiple cameras fixed in the workspace. This type of point cloud input is not compatible with the now-common setting where the primary input modality is an eye-in-hand RGB camera like a GoPro. This paper closes this gap by incorporating into the diffusion policy model a process that projects features from the 2D RGB camera image onto a sphere. This enables us to reason about symmetries in SO(3) without explicitly reconstructing a point cloud. We perform extensive experiments in both simulation and the real world that demonstrate that our method consistently outperforms strong baselines in terms of both performance and sample efficiency. Our work is the first SO(3)-equivariant policy learning framework for robotic manipulation that works using only monocular RGB inputs.
中文: 本文提出了一种SO(3)等变扩散策略框架,通过将二维RGB相机特征投影到球面上,无需点云重建即可实现高效机器人操作,并在性能和样本效率上均优于现有基线方法。
English: This paper introduces an SO(3)-equivariant diffusion policy framework that projects 2D RGB camera features onto a sphere, enabling effective robotic manipulation without point cloud reconstruction and demonstrating superior performance and sample efficiency.

Authors:Hongyuan Tao, Ying Zhang, Zhenhao Tang, Hongen Peng, Xukun Zhu, Bingchang Liu, Yingguang Yang, Ziyin Zhang, Zhaogui Xu, Haipeng Zhang, Linchao Zhu, Rui Wang, Hang Yu, Jianguo Li, Peng Di
Title: Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks
Abstract:
Recent advances in Large Language Models (LLMs) have shown promise in function-level code generation, yet repository-level software engineering tasks remain challenging. Current solutions predominantly rely on proprietary LLM agents, which introduce unpredictability and limit accessibility, raising concerns about data privacy and model customization. This paper investigates whether open-source LLMs can effectively address repository-level tasks without requiring agent-based approaches. We demonstrate this is possible by enabling LLMs to comprehend functions and files within codebases through their semantic information and structural dependencies. To this end, we introduce Code Graph Models (CGMs), which integrate repository code graph structures into the LLM's attention mechanism and map node attributes to the LLM's input space using a specialized adapter. When combined with an agentless graph RAG framework, our approach achieves a 43.00% resolution rate on the SWE-bench Lite benchmark using the open-source Qwen2.5-72B model. This performance ranks first among open weight models, second among methods with open-source systems, and eighth overall, surpassing the previous best open-source model-based method by 12.33%.
中文摘要:本文通过将代码图结构集成到开源大语言模型的注意力机制中,证明了无需基于代理的方法即可有效处理仓库级软件任务,并在基准测试中取得了顶尖性能。
English Summary: This paper demonstrates that open-source large language models can effectively handle repository-level software tasks by integrating code graph structures into their attention mechanisms, achieving top-tier performance on benchmarks without relying on agent-based approaches.

Authors:Wei Xiao, Jiacheng Liu, Zifeng Zhuang, Runze Suo, Shangke Lyu, Donglin Wang
Title: Efficient Online RL Fine Tuning with Offline Pre-trained Policy Only
Abstract:
Improving the performance of pre-trained policies through online reinforcement learning (RL) is a critical yet challenging topic. Existing online RL fine-tuning methods require continued training with offline pretrained Q-functions for stability and performance. However, these offline pretrained Q-functions commonly underestimate state-action pairs beyond the offline dataset due to the conservatism in most offline RL methods, which hinders further exploration when transitioning from the offline to the online setting. Additionally, this requirement limits their applicability in scenarios where only pre-trained policies are available but pre-trained Q-functions are absent, such as in imitation learning (IL) pre-training. To address these challenges, we propose a method for efficient online RL fine-tuning using solely the offline pre-trained policy, eliminating reliance on pre-trained Q-functions. We introduce PORL (Policy-Only Reinforcement Learning Fine-Tuning), which rapidly initializes the Q-function from scratch during the online phase to avoid detrimental pessimism. Our method not only achieves competitive performance with advanced offline-to-online RL algorithms and online RL approaches that leverage prior data or policies, but also pioneers a new path for directly fine-tuning behavior cloning (BC) policies.
Chinese Summary: 本研究提出PORL方法,仅利用预训练策略实现高效的在线强化学习微调,无需依赖预训练的Q函数,从而克服保守性限制并拓宽应用范围。
English Summary: The study introduces PORL, a novel method that enables efficient online reinforcement learning fine-tuning using only pre-trained policies, eliminating the need for pre-trained Q-functions to overcome limitations of conservatism and expand applicability.

Authors:Jingzhe Liu, Zhigang Hua, Yan Xie, Bingheng Li, Harry Shomer, Yu Song, Kaveh Hassani, Jiliang Tang
Title: Higher-order Structure Boosts Link Prediction on Temporal Graphs
Abstract:
Temporal Graph Neural Networks (TGNNs) have gained growing attention for modeling and predicting structures in temporal graphs. However, existing TGNNs primarily focus on pairwise interactions while overlooking higher-order structures that are integral to link formation and evolution in real-world temporal graphs. Meanwhile, these models often suffer from efficiency bottlenecks, further limiting their expressive power. To tackle these challenges, we propose a Higher-order structure Temporal Graph Neural Network (HTGN), which incorporates hypergraph representations into temporal graph learning. In particular, we develop an algorithm to identify the underlying higher-order structures, enhancing the model's ability to capture group interactions. Furthermore, by aggregating multiple edge features into hyperedge representations, HTGN effectively reduces memory cost during training. We theoretically demonstrate the enhanced expressiveness of our approach and validate its effectiveness and efficiency through extensive experiments on various real-world temporal graphs. Experimental results show that HTGN achieves superior performance on dynamic link prediction while reducing memory costs by up to 50% compared to existing methods.
Chinese: 提出的高阶时序图神经网络(HTGN)通过引入超图表示来捕捉时序图中的群体交互,不仅提升了动态链接预测的准确性,还显著优化了效率,相比现有方法内存成本降低高达50%。
English: The proposed Higher-order Temporal Graph Neural Network (HTGN) integrates hypergraph representations to capture group interactions in temporal graphs, improving both predictive accuracy and efficiency by reducing memory costs by up to 50% compared to existing methods.

Authors:Kaito Ariu, Po-An Wang, Alexandre Proutiere, Kenshi Abe
Title: Policy Testing in Markov Decision Processes
Abstract:
We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem--a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal--matching the instance-specific lower bound on sample complexity--while remaining computationally tractable. We validate our approach with numerical experiments.
中文摘要:本文通过将下界优化问题重构为反向MDP中的策略优化任务,提出了一种在折扣马尔可夫决策过程中统计最优且计算高效的策略测试算法。
English Summary: This paper introduces a statistically optimal and computationally efficient algorithm for testing policies in discounted Markov decision processes by reformulating a lower-bound optimization problem as a policy optimization task in a reversed MDP.

Authors:Chaozheng Wang, Zezhou Yang, Shuzheng Gao, Cuiyun Gao, Ting Peng, Hailiang Huang, Yuetang Deng, Michael Lyu
Title: RAG or Fine-tuning? A Comparative Study on LCMs-based Code Completion in Industry
Abstract:
Code completion, a crucial practice in industrial settings, helps developers improve programming efficiency by automatically suggesting code snippets during development. With the emergence of Large Code Models (LCMs), this field has witnessed significant advancements. Due to the natural differences between open-source and industrial codebases, such as coding patterns and unique internal dependencies, it is common practice for developers to conduct domain adaptation when adopting LCMs in industry. There exist multiple adaptation approaches, among which retrieval-augmented generation (RAG) and fine-tuning (FT) are the two most popular paradigms. However, no prior research has explored the trade-off between the two approaches in industrial scenarios. To bridge this gap, we comprehensively compare the two paradigms for industrial code completion in this paper. In collaboration with Tencent's WXG department, we collect over 160,000 internal C++ files as our codebase. We then compare the two types of adaptation approaches along three dimensions of concern to industrial practitioners, namely effectiveness, efficiency, and parameter sensitivity, using six LCMs. Our findings reveal that RAG, when implemented with appropriate embedding models that map code snippets into dense vector representations, can achieve higher accuracy than fine-tuning alone. Specifically, BM25 presents superior retrieval effectiveness and efficiency among studied RAG methods. Moreover, RAG and fine-tuning are orthogonal and their combination leads to further improvement. We also observe that RAG demonstrates better scalability than FT, showing more sustained performance gains with larger scales of codebase.
中文: 本研究比较了检索增强生成和微调在工业代码补全中的适应效果,发现采用合适嵌入模型的RAG方法比单独微调精度更高、扩展性更好,且二者结合能实现进一步性能提升。
English: This study compares retrieval-augmented generation (RAG) and fine-tuning for adapting large code models to industrial code completion, finding RAG with suitable embedding models achieves higher accuracy and better scalability while their combination yields optimal results.

Authors:Yunsheng Ma, Burhaneddin Yaman, Xin Ye, Mahmut Yurt, Jingru Luo, Abhirup Mallik, Ziran Wang, Liu Ren
Title: ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving
Abstract:
Recent advances have explored integrating large language models (LLMs) into end-to-end autonomous driving systems to enhance generalization and interpretability. However, most existing approaches are limited to either driving performance or vision-language reasoning, making it difficult to achieve both simultaneously. In this paper, we propose ALN-P3, a unified co-distillation framework that introduces cross-modal alignment between "fast" vision-based autonomous driving systems and "slow" language-driven reasoning modules. ALN-P3 incorporates three novel alignment mechanisms: Perception Alignment (P1A), Prediction Alignment (P2A), and Planning Alignment (P3A), which explicitly align visual tokens with corresponding linguistic outputs across the full perception, prediction, and planning stack. All alignment modules are applied only during training and incur no additional costs during inference. Extensive experiments on four challenging benchmarks (nuScenes, Nu-X, TOD3Cap, and nuScenes QA) demonstrate that ALN-P3 significantly improves both driving decisions and language reasoning, achieving state-of-the-art results.
Chinese: 本文提出ALN-P3协同蒸馏框架,通过感知、预测和规划三个对齐模块,在训练阶段将视觉自动驾驶系统与语言推理相融合,实现了驾驶性能与语言理解能力的同步提升且无需推理成本。
English: This paper introduces ALN-P3, a co-distillation framework that aligns vision-based autonomous driving with language reasoning through perception, prediction, and planning modules to enhance both driving performance and interpretability without inference overhead.

Authors:Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, Ehsan Kamalloo
Title: Self-Evolving Curriculum for LLM Reasoning
Abstract:
Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
中文摘要:本研究提出的自进化课程(SEC)方法在大型语言模型的强化学习微调过程中自动优化训练序列,显著提升了多个领域的推理能力,并增强了对更困难问题的泛化性能。
English Summary: The proposed Self-Evolving Curriculum (SEC) method automatically optimizes training sequences during reinforcement learning fine-tuning of large language models, significantly enhancing reasoning capabilities across multiple domains while improving generalization to harder problems.
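
The bandit formulation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `CurriculumBandit`, the softmax sampling rule, the step size `alpha`, and the `temperature` parameter are all assumptions introduced here. With no successor state, the TD(0) update reduces to moving each arm's value toward the observed reward, which is taken to be the mean absolute advantage of the sampled batch.

```python
import math
import random


class CurriculumBandit:
    """Non-stationary bandit over problem categories (illustrative sketch)."""

    def __init__(self, categories, alpha=0.5, temperature=1.0, seed=0):
        self.q = {c: 0.0 for c in categories}  # one value estimate per arm
        self.alpha = alpha                      # hypothetical TD step size
        self.temperature = temperature          # hypothetical softmax temperature
        self.rng = random.Random(seed)

    def probabilities(self):
        # Softmax over arm values; the sampling rule is an assumption.
        m = max(self.q.values())
        weights = {c: math.exp((v - m) / self.temperature) for c, v in self.q.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def sample(self):
        probs = self.probabilities()
        cats = list(probs)
        return self.rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]

    def update(self, category, advantages):
        # Reward proxy: mean absolute advantage of the batch drawn from this arm.
        reward = sum(abs(a) for a in advantages) / len(advantages)
        # TD(0) with no next state: Q <- Q + alpha * (r - Q).
        self.q[category] += self.alpha * (reward - self.q[category])
        return reward
```

In a training loop, `sample()` would pick the category for the next batch and `update()` would be called with that batch's advantages after the policy-gradient step.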

Authors:Yuanyuan Chang, Yinghua Yao, Tao Qin, Mengmeng Wang, Ivor Tsang, Guang Dai
Title: Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization
Abstract:
Text-to-image diffusion models have emerged as powerful tools for high-quality image generation and editing. Many existing approaches rely on text prompts as editing guidance. However, these methods are constrained by the need for manual prompt crafting, which can be time-consuming, introduce irrelevant details, and significantly limit editing performance. In this work, we propose optimizing semantic embeddings guided by attribute classifiers to steer text-to-image models toward desired edits, without relying on text prompts or requiring any training or fine-tuning of the diffusion model. We utilize classifiers to learn precise semantic embeddings at the dataset level. The learned embeddings are theoretically justified as the optimal representation of attribute semantics, enabling disentangled and accurate edits. Experiments further demonstrate that our method achieves high levels of disentanglement and strong generalization across different domains of data.
Chinese: 本研究提出一种通过属性分类器优化语义嵌入的方法,无需依赖文本提示或重新训练扩散模型,即可实现精确解耦的图像编辑,并在实验中展现出优异的泛化能力。
English: This study introduces a method that optimizes semantic embeddings using attribute classifiers to enable precise and disentangled image editing in text-to-image diffusion models, eliminating the need for manual text prompts or model retraining.

Authors:Akanksha Agrawal, Fedor V. Fomin, Daniel Lokshtanov, Saket Saurabh, Prafullkumar Tale
Title: Path Contraction Faster than $2^n$
Abstract:
A graph $G$ is contractible to a graph $H$ if there is a set $X \subseteq E(G)$, such that $G/X$ is isomorphic to $H$. Here, $G/X$ is the graph obtained from $G$ by contracting all the edges in $X$. For a family of graphs $\mathcal{F}$, the $\mathcal{F}$-\textsc{Contraction} problem takes as input a graph $G$ on $n$ vertices, and the objective is to output the largest integer $t$, such that $G$ is contractible to a graph $H \in \mathcal{F}$, where $|V(H)|=t$. When $\mathcal{F}$ is the family of paths, then the corresponding $\mathcal{F}$-\textsc{Contraction} problem is called \textsc{Path Contraction}. The problem \textsc{Path Contraction} admits a simple algorithm running in time $2^{n}\cdot n^{\mathcal{O}(1)}$. In spite of the deceptive simplicity of the problem, beating the $2^{n}\cdot n^{\mathcal{O}(1)}$ bound for \textsc{Path Contraction} seems quite challenging. In this paper, we design an exact exponential time algorithm for \textsc{Path Contraction} that runs in time $1.99987^n\cdot n^{\mathcal{O}(1)}$. We also define a problem called \textsc{$3$-Disjoint Connected Subgraphs}, and design an algorithm for it that runs in time $1.88^n\cdot n^{\mathcal{O}(1)}$. The above algorithm is used as a sub-routine in our algorithm for \textsc{Path Contraction}.
Chinese: 本文针对路径收缩问题提出了一种改进的精确指数时间算法,运行时间为 $1.99987^n\cdot n^{\mathcal{O}(1)}$,通过利用 3-不相交连通子图问题的新算法作为子程序,突破了先前 $2^{n}\cdot n^{\mathcal{O}(1)}$ 的时间界限。
English: This paper presents an improved exact exponential time algorithm for the Path Contraction problem that runs in time $1.99987^n\cdot n^{\mathcal{O}(1)}$, breaking the previous $2^{n}\cdot n^{\mathcal{O}(1)}$ barrier by utilizing a new algorithm for the 3-Disjoint Connected Subgraphs problem as a subroutine.
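
To make the definition of $G/X$ concrete, the sketch below contracts a set of edges with a union-find structure and checks whether the result is a path. It illustrates the problem statement only and is unrelated to the paper's $1.99987^n$ algorithm; the function names `contract_edges` and `is_path` and the adjacency-dict representation are assumptions made for this example.

```python
def contract_edges(adjacency, edges_to_contract):
    """Return G/X as an adjacency dict keyed by one representative per merged vertex."""
    parent = {v: v for v in adjacency}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges_to_contract:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[rv] = ru

    contracted = {find(v): set() for v in adjacency}
    for u, neighbours in adjacency.items():
        for v in neighbours:
            ru, rv = find(u), find(v)
            if ru != rv:  # self-loops are dropped; sets merge parallel edges
                contracted[ru].add(rv)
                contracted[rv].add(ru)
    return contracted


def is_path(adjacency):
    """Check whether a simple graph is a path (a single vertex counts as a path)."""
    n = len(adjacency)
    if n == 1:
        return True
    degrees = sorted(len(nb) for nb in adjacency.values())
    if degrees != [1, 1] + [2] * (n - 2):
        return False
    # The degree sequence alone allows a path plus disjoint cycles, so check connectivity.
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n
```

For the 4-cycle a-b-c-d, contracting the edges (a, b) and (c, d) leaves a path on two vertices, whereas contracting only (a, b) leaves a triangle.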

Authors:Jinzhou Li, Tianhao Wu, Jiyao Zhang, Zeyuan Chen, Haotian Jin, Mingdong Wu, Yujun Shen, Yaodong Yang, Hao Dong
Title: Adaptive Visuo-Tactile Fusion with Predictive Force Attention for Dexterous Manipulation
Abstract:
Effectively utilizing multi-sensory data is important for robots to generalize across diverse tasks. However, the heterogeneous nature of these modalities makes fusion challenging. Existing methods propose strategies to obtain comprehensively fused features but often ignore the fact that each modality requires different levels of attention at different manipulation stages. To address this, we propose a force-guided attention fusion module that adaptively adjusts the weights of visual and tactile features without human labeling. We also introduce a self-supervised future force prediction auxiliary task to reinforce the tactile modality, mitigate data imbalance, and encourage proper adjustment. Our method achieves an average success rate of 93% across three fine-grained, contact-rich tasks in real-world experiments. Further analysis shows that our policy appropriately adjusts attention to each modality at different manipulation stages. The videos can be viewed at https://adaptac-dex.github.io/.
Chinese: 我们提出的力引导注意力融合模块无需人工标注即可自适应调整视觉与触觉特征权重,通过自监督的力预测在真实世界接触密集型任务中实现93%的平均成功率。
English: Our proposed force-guided attention fusion module adaptively adjusts visual and tactile feature weights without human labeling, achieving a 93% success rate in real-world contact-rich tasks through self-supervised force prediction.

Authors:Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, Hongxia Yang
Title: InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion
Abstract:
Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose \textbf{InfiGFusion}, the first structure-aware fusion framework with a novel \textit{Graph-on-Logits Distillation} (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
中文: InfiGFusion提出了一种结构感知融合框架,通过图对数值蒸馏建模词汇维度间的语义依赖,在推理、编程和数学等多项基准测试中显著优于现有方法。
English: InfiGFusion introduces a structure-aware fusion framework with Graph-on-Logits Distillation to model semantic dependencies in vocabulary dimensions, significantly outperforming existing methods across reasoning, coding, and mathematics benchmarks.
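
The graph construction step described in the abstract (keep the top-$k$ logits per position, then sum their outer products over the sequence) can be sketched as below. This is a hedged illustration only: the function name `coactivation_graph` is hypothetical, raw logits are used without any normalization the paper may apply, and the Gromov-Wasserstein comparison between graphs is not shown.

```python
def coactivation_graph(logits, k):
    """Build a vocab x vocab co-activation matrix from per-position logits.

    logits: list of rows, one row of vocabulary logits per sequence position.
    k: number of largest logits kept per position.
    """
    vocab = len(logits[0])
    graph = [[0.0] * vocab for _ in range(vocab)]
    for row in logits:
        # Indices of the k largest logits at this position.
        top = sorted(range(vocab), key=lambda i: row[i], reverse=True)[:k]
        # Outer product restricted to the retained channels, accumulated over positions.
        for i in top:
            for j in top:
                graph[i][j] += row[i] * row[j]
    return graph
```

Nodes correspond to vocabulary channels and entry `graph[i][j]` accumulates the joint activation of channels `i` and `j`; restricting to the top-$k$ entries keeps each position's contribution at $k^2$ terms rather than the full vocabulary squared.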

Authors:Yanggan Gu, Zhaoyi Yan, Yuanyi Wang, Yiming Zhang, Qi Zhou, Fei Wu, Hongxia Yang
Title: InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models
Abstract:
Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing works on model fusion focus primarily on supervised fine-tuning (SFT), leaving preference alignment (PA), a critical phase for enhancing LLM performance, largely unexplored. The few existing fusion methods for the PA phase, such as WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing the complex vocabulary alignment challenges of previous works while preserving the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improves its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.
Chinese: 本文提出的InfiFPO方法通过融合多源概率信息进行隐式模型融合,在11个基准测试中显著优于现有方法,有效提升了模型在数学、编程和推理任务上的性能表现。
English: This paper introduces InfiFPO, an implicit model fusion method that enhances preference alignment by integrating multi-source probability information, significantly outperforming existing approaches across 11 benchmarks.

Authors:Dian Wang, Boce Hu, Shuran Song, Robin Walters, Robert Platt
Title: A Practical Guide for Incorporating Symmetry in Diffusion Policy
Abstract:
Recently, equivariant neural networks for policy learning have shown promising improvements in sample efficiency and generalization; however, their wide adoption faces substantial barriers due to implementation complexity. Equivariant architectures typically require specialized mathematical formulations and custom network design, posing significant challenges when integrating with modern policy frameworks like diffusion-based models. In this paper, we explore a number of straightforward and practical approaches to incorporate symmetry benefits into diffusion policies without the overhead of full equivariant designs. Specifically, we investigate (i) invariant representations via relative trajectory actions and eye-in-hand perception, (ii) integrating equivariant vision encoders, and (iii) symmetric feature extraction with pretrained encoders using Frame Averaging. We first prove that combining eye-in-hand perception with relative or delta action parameterization yields inherent SE(3)-invariance, thus improving policy generalization. We then perform a systematic experimental study on those design choices for integrating symmetry in diffusion policies, and conclude that an invariant representation with equivariant feature extraction significantly improves the policy performance. Our method achieves performance on par with or exceeding fully equivariant architectures while greatly simplifying implementation.
Chinese: 本文提出通过不变表示和等变特征提取将对称性融入扩散策略的实用方法,以简化实现达到与完全等变架构相当甚至更优的性能。
English: This paper introduces practical methods to integrate symmetry into diffusion policies through invariant representations and equivariant feature extraction, achieving performance comparable to fully equivariant architectures with simplified implementation.

Authors:Maksim Bobrin, Ilya Zisman, Alexander Nikulin, Vladislav Kurenkov, Dmitry Dylov
Title: Zero-Shot Adaptation of Behavioral Foundation Models to Unseen Dynamics
Abstract:
Behavioral Foundation Models (BFMs) proved successful in producing policies for arbitrary tasks in a zero-shot manner, requiring no test-time training or task-specific fine-tuning. Among the most promising BFMs are the ones that estimate the successor measure learned in an unsupervised way from task-agnostic offline data. However, these methods fail to react to changes in the dynamics, making them inefficient under partial observability or when the transition function changes. This hinders the applicability of BFMs in a real-world setting, e.g., in robotics, where the dynamics can unexpectedly change at test time. In this work, we demonstrate that Forward-Backward (FB) representation, one of the methods from the BFM family, cannot distinguish between distinct dynamics, leading to an interference among the latent directions, which parametrize different policies. To address this, we propose an FB model with a transformer-based belief estimator, which greatly facilitates zero-shot adaptation. We also show that partitioning the policy encoding space into dynamics-specific clusters, aligned with the context-embedding directions, yields additional gain in performance. These traits allow our method to respond to the dynamics observed during training and to generalize to unseen ones. Empirically, in the changing dynamics setting, our approach achieves up to 2x higher zero-shot returns compared to the baselines for both discrete and continuous tasks.
中文摘要:行为基础模型在零样本策略生成方面表现出色,但难以应对动态变化;本研究通过引入基于Transformer的信念估计器和策略编码聚类,显著提升了模型的适应性和性能。
English Summary: Behavioral Foundation Models (BFMs) excel at zero-shot policy generation but struggle with dynamic changes, which this work addresses by introducing a transformer-based belief estimator and policy encoding clustering to enhance adaptability and performance.

Authors:Haochen Yuan, Minting Pan, Yunbo Wang, Siyu Gao, Philip S. Yu, Xiaokang Yang
Title: Your Offline Policy is Not Trustworthy: Bilevel Reinforcement Learning for Sequential Portfolio Optimization
Abstract:
Reinforcement learning (RL) has shown significant promise for sequential portfolio optimization tasks, such as stock trading, where the objective is to maximize cumulative returns while minimizing risks using historical data. However, traditional RL approaches often produce policies that merely memorize the optimal yet impractical buying and selling behaviors within the fixed dataset. These offline policies are less generalizable as they fail to account for the non-stationary nature of the market. Our approach, MetaTrader, frames portfolio optimization as a new type of partial-offline RL problem and makes two technical contributions. First, MetaTrader employs a bilevel learning framework that explicitly trains the RL agent to improve both in-domain profits on the original dataset and out-of-domain performance across diverse transformations of the raw financial data. Second, our approach incorporates a new temporal difference (TD) method that approximates worst-case TD estimates from a batch of transformed TD targets, addressing the value overestimation issue that is particularly challenging in scenarios with limited offline data. Our empirical results on two public stock datasets show that MetaTrader outperforms existing methods, including both RL-based approaches and traditional stock prediction models.
Chinese: MetaTrader 采用双层学习框架和新型时序差分方法,在投资组合优化中同时提升域内和域外表现,有效克服传统强化学习在非平稳市场中的泛化不足问题,并在股票数据集上优于现有方法。
English: MetaTrader introduces a bilevel learning framework and a novel temporal difference method to enhance both in-domain and out-of-domain performance in portfolio optimization, effectively addressing the limitations of traditional reinforcement learning by improving generalization in non-stationary markets and outperforming existing methods on stock datasets.
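
One plausible reading of "worst-case TD estimates from a batch of transformed TD targets" is a pessimistic minimum over the targets computed on each transformed copy of a transition. The sketch below shows that reading only; the function name `worst_case_td_target`, the use of a plain `min`, and the one-step target form are assumptions, and the abstract does not specify how MetaTrader actually forms its approximation.

```python
def worst_case_td_target(rewards, next_values, gamma=0.99):
    """Pessimistic one-step TD target over K transformed views of a transition.

    rewards[k] and next_values[k] are the reward and bootstrapped next-state
    value under the k-th data transformation (hypothetical inputs).
    """
    targets = [r + gamma * v for r, v in zip(rewards, next_values)]
    # Taking the minimum counteracts value overestimation from any single view.
    return min(targets)
```

Training the critic toward the smallest target is a common way to curb overestimation when offline data is limited, at the cost of a conservative value function.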

Authors:Yunseok Jang, Yeda Song, Sungryull Sohn, Lajanugen Logeswaran, Tiange Luo, Dong-Ki Kim, Kyunghoon Bae, Honglak Lee
Title: Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
Abstract:
Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing GUI visual agents. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset of 313K annotated frames from 20K instructional videos capturing diverse real-world mobile OS navigation across multiple platforms. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities, consistently outperforming models trained on existing single OS datasets while achieving an average performance gain of 18.11%p on an unseen mobile OS platform. To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. Our framework comprises robust OCR-based scene detection (95.04% F1 score), near-perfect UI element detection (99.87% hit ratio), and novel multi-step action identification to extract reliable action sequences across diverse interface configurations. We contribute both the MONDAY dataset and our automated collection framework to facilitate future research in mobile OS navigation.
中文:该研究推出了MONDAY数据集,通过教学视频构建大规模移动操作系统导航数据,并开发自动化框架实现数据集扩展,无需人工标注即可显著提升跨平台导航性能。
English: The study introduces MONDAY, a large-scale dataset from instructional videos that enhances cross-platform mobile OS navigation for agents, and an automated framework for dataset expansion, achieving significant performance improvements without manual annotation.

Authors:Vasilis K. Papanikolaou, Gui Zhou, Brikena Kaziu, Ata Khalili, Panagiotis D. Diamantoulakis, George K. Karagiannidis, Robert Schober
Title: Resolving the Double Near-Far Problem via Wireless Powered Pinching-Antenna Networks
Abstract:
This letter introduces a novel wireless powered communication system, referred to as a wireless powered pinching-antenna network (WPPAN), utilizing a single waveguide with pinching antennas to address the double near-far problem inherent in wireless powered networks. In the proposed WPPAN, users harvest energy from spatially distributed pinching antennas in the downlink and use the collected power to transmit messages in the uplink. Furthermore, to manage the combinatorial complexity associated with activating the pinching antennas, we propose three approaches of varying complexity to simplify the original resource allocation problem and then solve it efficiently using convex optimization methods. Simulation results confirm that the proposed WPPAN system effectively mitigates the double near-far problem by providing antenna resources closer to the users, thereby enhancing both downlink energy harvesting and uplink data transmission.
中文: 本文提出了一种无线供电夹取天线网络(WPPAN),通过单波导分布式天线解决双近远问题,采用凸优化方法进行资源分配,仿真验证其能有效提升能量收集和数据传输性能。
English: This letter presents a wireless powered pinching-antenna network (WPPAN) that uses a single waveguide with distributed antennas to solve the double near-far problem, employing convex optimization for efficient resource allocation and demonstrating improved energy harvesting and data transmission through simulations.

Authors:Yuwei Zhang, Wenhao Yu, Shangbin Feng, Yifan Zhu, Letian Peng, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang
Title: Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection
Abstract:
Despite significant advances in large language models (LLMs), their knowledge memorization capabilities remain underexplored, due to the lack of a standardized and high-quality test ground. In this paper, we introduce a novel, real-world and large-scale knowledge injection benchmark that evolves continuously over time without requiring human intervention. Specifically, we propose WikiDYK, which leverages recently-added and human-written facts from Wikipedia's "Did You Know..." entries. These entries are carefully selected by expert Wikipedia editors based on criteria such as verifiability and clarity. Each entry is converted into multiple question-answer pairs spanning diverse task formats from easy cloze prompts to complex multi-hop questions. WikiDYK contains 12,290 facts and 77,180 questions, and is seamlessly extensible with future updates from Wikipedia editors. Extensive experiments using continued pre-training reveal a surprising insight: despite their prevalence in modern LLMs, Causal Language Models (CLMs) demonstrate significantly weaker knowledge memorization capabilities compared to Bidirectional Language Models (BiLMs), exhibiting a 23% lower accuracy in terms of reliability. To compensate for the smaller scales of current BiLMs, we introduce a modular collaborative framework utilizing ensembles of BiLMs as external knowledge repositories to integrate with LLMs. Experiments show that our framework further improves the reliability accuracy by up to 29.1%.
中文摘要:本研究提出WikiDYK基准,利用维基百科“你知道吗”条目评估大语言模型的知识记忆能力,发现因果语言模型比双向模型的准确率低23%,并通过协同框架将可靠性提升最高达29.1%。
English Summary: The study introduces WikiDYK, a scalable benchmark using Wikipedia's "Did You Know" entries to evaluate knowledge retention in large language models, revealing that causal language models underperform bidirectional models by 23% in accuracy, and proposes a collaborative framework that boosts reliability by up to 29.1%.

Authors:Zihao Zheng, Ziyao Wang, Xiuping Cui, Maoliang Li, Jiayu Chen, Yun Liang, Ang Li, Xiang Chen
Title: FedHQ: Hybrid Runtime Quantization for Federated Learning
Abstract:
Federated Learning (FL) is a decentralized model training approach that preserves data privacy but struggles with low efficiency. Quantization, a powerful training optimization technique, has been widely explored for integration into FL. However, many studies fail to consider the distinct performance attribution between particular quantization strategies, such as post-training quantization (PTQ) or quantization-aware training (QAT). As a result, existing FL quantization methods rely solely on either PTQ or QAT, optimizing for speed or accuracy while compromising the other. To efficiently accelerate FL and maintain distributed convergence accuracy across various FL settings, this paper proposes a hybrid quantization approach combining PTQ and QAT for FL systems. We conduct case studies to validate the effectiveness of using hybrid quantization in FL. To solve the difficulty of modeling speed and accuracy caused by device and data heterogeneity, we propose a hardware-related analysis and data-distribution-related analysis to help identify the trade-off boundaries for strategy selection. Based on these, we propose a novel framework named FedHQ to automatically adopt optimal hybrid strategy allocation for FL systems. Specifically, FedHQ develops a coarse-grained global initialization and fine-grained ML-based adjustment to ensure efficiency and robustness. Experiments show that FedHQ achieves up to 2.47x training acceleration and up to 11.15% accuracy improvement with negligible extra overhead.
中文: 本文提出一种结合训练后量化和量化感知训练的混合量化方法,通过FedHQ框架实现联邦学习效率与精度的协同优化,实验证明其能大幅加速训练并提升模型准确性。
English: This paper introduces a hybrid quantization approach combining post-training and quantization-aware training to enhance federated learning efficiency and accuracy, proposing the FedHQ framework for automatic strategy optimization that significantly accelerates training and improves performance.

Authors:Xiao Li, Tianhao Wei, Changliu Liu, Anouck Girard, Ilya Kolmanovsky
Title: Control Invariant Sets for Neural Network Dynamical Systems and Recursive Feasibility in Model Predictive Control
Abstract:
Neural networks are powerful tools for data-driven modeling of complex dynamical systems, enhancing predictive capability for control applications. However, their inherent nonlinearity and black-box nature challenge control designs that prioritize rigorous safety and recursive feasibility guarantees. This paper presents algorithmic methods for synthesizing control invariant sets specifically tailored to neural network based dynamical models. These algorithms employ set recursion, ensuring termination after a finite number of iterations and generating subsets in which closed-loop dynamics are forward invariant, thus guaranteeing perpetual operational safety. Additionally, we propose model predictive control designs that integrate these control invariant sets into mixed-integer optimization, with guaranteed adherence to safety constraints and recursive feasibility at the computational level. We also present a comprehensive theoretical analysis examining the properties and guarantees of the proposed methods. Numerical simulations in an autonomous driving scenario demonstrate the methods' effectiveness in synthesizing control-invariant sets offline and implementing model predictive control online, ensuring safety and recursive feasibility.
中文: 本文提出了针对神经网络动态模型的控制不变集合成算法,通过模型预测控制和理论保证,确保系统安全性和递归可行性。
English: This paper introduces algorithmic methods for synthesizing control invariant sets tailored to neural network-based dynamical models, ensuring safety and recursive feasibility through model predictive control and theoretical guarantees.

Authors:Shaohan Wang, Licheng Zhang, Zheren Fu, Zhendong Mao, Yongdong Zhang
Title: DACL-RAG: Data Augmentation Strategy with Curriculum Learning for Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods typically optimize the retriever or the generator in a RAG system by directly using the top-k retrieved documents. However, two key issues inherent in the training data constrain the effectiveness of this training paradigm: (1) across different queries, the top-k retrieved documents vary greatly in content quality, with some providing valuable knowledge while others lack critical information or are even misleading, and training on such data in a purely random manner may impair the generator's ability to extract key information; (2) for a given query, the limited set of k documents often exhibits low discriminability, and training solely on them makes it difficult for the retriever to learn how to distinguish between relevant and irrelevant documents. To address these issues, we introduce DACL-RAG, a multi-stage RAG training framework that combines a multi-level Data Augmentation strategy with a multi-stage Curriculum Learning paradigm. The data augmentation strategy constructs comprehensive and diverse training sets with controllable difficulty levels through sample evolution, while the curriculum learning paradigm organizes them into progressive stages for training, ensuring stable and consistent improvements, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our DACL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.
中文: DACL-RAG框架通过多级数据增强与渐进式课程学习,有效解决现有RAG训练中数据质量不均和区分度不足的问题,在四个开放域问答数据集上实现2-4%的性能提升。
English: The DACL-RAG framework addresses limitations in existing RAG training by integrating multi-level data augmentation and curriculum learning, achieving 2-4% performance gains across four QA datasets.

Authors:Yezi Liu, Prathyush Poduval, Wenjun Huang, Yang Ni, Hanning Chen, Mohsen Imani
Title: Enabling Group Fairness in Graph Unlearning via Bi-level Debiasing
Abstract:
Graph unlearning is a crucial approach for protecting user privacy by erasing the influence of user data on trained graph models. Recent developments in graph unlearning methods have primarily focused on maintaining model prediction performance while removing user information. However, we have observed that when user information is deleted from the model, the prediction distribution across different sensitive groups often changes. Furthermore, graph models are shown to be prone to amplifying biases, making the study of fairness in graph unlearning particularly important. This raises the question: Does graph unlearning actually introduce bias? Our findings indicate that the predictions of post-unlearning models become highly correlated with sensitive attributes, confirming the introduction of bias in the graph unlearning process. To address this issue, we propose a fair graph unlearning method, FGU. To guarantee privacy, FGU trains shard models on partitioned subgraphs, unlearns the requested data from the corresponding subgraphs, and retrains the shard models on the modified subgraphs. To ensure fairness, FGU employs a bi-level debiasing process: it first enables shard-level fairness by incorporating a fairness regularizer in the shard model retraining, and then achieves global-level fairness by aligning all shard models to minimize global disparity. Our experiments demonstrate that FGU achieves superior fairness while maintaining privacy and accuracy. Additionally, FGU is robust to diverse unlearning requests, ensuring fairness and utility performance across various data distributions.
中文摘要:图遗忘虽对保护隐私至关重要,但会因预测与敏感属性关联而引入偏见,为此提出的FGU方法通过双层去偏处理在保障隐私与准确性的同时实现了公平性。
English Summary: Graph unlearning, while essential for privacy, can introduce bias by making predictions correlate with sensitive attributes, prompting the development of FGU, a method that ensures fairness through bi-level debiasing while maintaining privacy and accuracy.

Authors:Zhonggen Li, Xiangyu Ke, Yifan Zhu, Yunjun Gao, Feifei Li
Title: Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration
Abstract:
Graph embeddings provide continuous vector representations of nodes in a graph, which are widely applicable in community detection, recommendations, and various scientific fields. However, existing graph embedding systems either face scalability challenges due to the high cost of RAM and multiple GPUs, or rely on disk storage at the expense of I/O efficiency. In this paper, we propose Legend, a lightweight heterogeneous system for graph embedding that systematically redefines data management across CPU, GPU, and NVMe SSD resources. Legend is built on a foundation of efficient data placement and retrieval strategies tailored to the unique strengths of each hardware. Key innovations include a prefetch-friendly embedding loading strategy, enabling GPUs to directly prefetch data from SSDs with minimal I/O overhead, and a high-throughput GPU-SSD direct access driver optimized for graph embedding tasks. Furthermore, we propose a customized parallel execution strategy to maximize GPU utilization, ensuring efficient handling of billion-scale datasets. Extensive experiments demonstrate that Legend achieves up to 4.8x speedup compared to state-of-the-art systems. Moreover, Legend exhibits comparable performance on a single GPU to that of the state-of-the-art system using 4 GPUs on the billion-scale dataset.
中文: Legend提出了一种轻量级异构图嵌入系统,通过优化CPU、GPU和SSD间的数据管理,在十亿级数据集上实现了显著的加速和高效性能。
English: Legend introduces a lightweight heterogeneous system for graph embeddings that optimizes data management across CPU, GPU, and SSD hardware, achieving significant speedups and efficiency on billion-scale datasets.

Authors:Wenkai Li, Xiaoqi Li, Yingjie Mao, Yishun Wang
Title: Towards Understanding Deep Learning Model in Image Recognition via Coverage Test
Abstract:
Deep neural networks (DNNs) play a crucial role in the field of artificial intelligence, and their security-related testing has been a prominent research focus. By inputting test cases, the behavior of models is examined for anomalies, and coverage metrics are utilized to determine the extent of neurons covered by these test cases. With the widespread application and advancement of DNNs, different types of neural behaviors have garnered attention, leading to the emergence of various coverage metrics for neural networks. However, there is currently a lack of empirical research on these coverage metrics, specifically in analyzing the relationships and patterns between model depth, configuration information, and neural network coverage. This paper aims to investigate the relationships and patterns of four coverage metrics: primary functionality, boundary, hierarchy, and structural coverage. A series of empirical experiments were conducted, selecting LeNet, VGG, and ResNet as different DNN architectures, along with 10 models of varying depths ranging from 5 to 54 layers, to compare and study the relationships between different depths, configuration information, and various neural network coverage metrics. Additionally, an investigation was carried out on the relationships between modified decision/condition coverage and dataset size. Finally, three potential future directions are proposed to further contribute to the security testing of DNN Models.
中文: 本文通过实证研究分析了四种神经网络覆盖度量在不同架构和深度下的关系,并探讨了数据集大小对修改决策/条件覆盖的影响,以推动深度神经网络安全测试的发展。
English: This paper empirically investigates the relationships between four neural network coverage metrics—primary functionality, boundary, hierarchy, and structural coverage—across different DNN architectures and depths, while also exploring the impact of dataset size on modified decision/condition coverage to advance DNN security testing.
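To make the notion of coverage concrete, the following is a minimal sketch of basic neuron coverage: the fraction of neurons whose activation exceeds a threshold for at least one test case. The function name `neuron_coverage`, the threshold value, and the toy activation values are assumptions made for illustration; the four metric families studied in the paper are more elaborate than this.

```python
from typing import Sequence


def neuron_coverage(activations: Sequence[Sequence[float]], threshold: float = 0.5) -> float:
    """Fraction of neurons activated above `threshold` by at least one test case.

    `activations[i][j]` is the (hypothetical) output of neuron j on test case i.
    """
    if not activations or not activations[0]:
        return 0.0
    num_neurons = len(activations[0])
    covered = [False] * num_neurons
    for row in activations:
        for j, value in enumerate(row):
            if value > threshold:
                covered[j] = True  # neuron j fired for at least one input
    return sum(covered) / num_neurons


# Toy data: 3 test cases, 4 neurons (values are made up for illustration).
toy_activations = [
    [0.9, 0.1, 0.0, 0.2],
    [0.2, 0.7, 0.0, 0.1],
    [0.1, 0.3, 0.0, 0.6],
]
coverage = neuron_coverage(toy_activations)
print(f"neuron coverage: {coverage:.2f}")
```

In this toy input the third neuron never exceeds the threshold, so adding test cases that activate it is what would raise the coverage value.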

Authors:Yuntao Wang, Yanghe Pan, Shaolong Guo, Zhou Su
Title: Security of Internet of Agents: Attacks and Countermeasures
Abstract:
With the rise of large language and vision-language models, AI agents have evolved into autonomous, interactive systems capable of perception, reasoning, and decision-making. As they proliferate across virtual and physical domains, the Internet of Agents (IoA) has emerged as a key infrastructure for enabling scalable and secure coordination among heterogeneous agents. This survey offers a comprehensive examination of the security and privacy landscape in IoA systems. We begin by outlining the IoA architecture and its distinct vulnerabilities compared to traditional networks, focusing on four critical aspects: identity authentication threats, cross-agent trust issues, embodied security, and privacy risks. We then review existing and emerging defense mechanisms and highlight persistent challenges. Finally, we identify open research directions to advance the development of resilient and privacy-preserving IoA ecosystems.
中文: 本综述全面审视了智能体互联网(IoA)中的安全与隐私挑战,分析了身份认证、跨智能体信任、具身安全和隐私风险四大脆弱性领域,并梳理了现有防御机制与未来研究方向。
English: This survey comprehensively examines security and privacy challenges in the Internet of Agents (IoA), analyzing vulnerabilities across authentication, trust, embodied security, and privacy while reviewing defenses and identifying future research directions.

Authors:Zhe Li, Hadrien Reynaud, Bernhard Kainz
Title: Leveraging Multi-Modal Information to Enhance Dataset Distillation
Abstract:
Dataset distillation aims to create a compact and highly representative synthetic dataset that preserves the knowledge of a larger real dataset. While existing methods primarily focus on optimizing visual representations, incorporating additional modalities and refining object-level information can significantly improve the quality of distilled datasets. In this work, we introduce two key enhancements to dataset distillation: caption-guided supervision and object-centric masking. To integrate textual information, we propose two strategies for leveraging caption features: the feature concatenation, where caption embeddings are fused with visual features at the classification stage, and caption matching, which introduces a caption-based alignment loss during training to ensure semantic coherence between real and synthetic data. Additionally, we apply segmentation masks to isolate target objects and remove background distractions, introducing two loss functions designed for object-centric learning: masked feature alignment loss and masked gradient matching loss. Comprehensive evaluations demonstrate that integrating caption-based guidance and object-centric masking enhances dataset distillation, leading to synthetic datasets that achieve superior performance on downstream tasks.
中文: 本研究通过引入标题引导监督(特征拼接和标题匹配)和对象中心掩码(分割掩码及专用损失函数),增强了数据集蒸馏效果,使合成数据集在下游任务中表现更优。
English: This work enhances dataset distillation by incorporating caption-guided supervision through feature concatenation and caption matching, along with object-centric masking using segmentation masks and specialized loss functions, resulting in synthetic datasets with improved performance on downstream tasks.

Authors:Artem Shelmanov, Ekaterina Fadeeva, Akim Tsvigun, Ivan Tsvigun, Zhuohan Xie, Igor Kiselev, Nico Daheim, Caiqi Zhang, Artem Vazhentsev, Mrinmaya Sachan, Preslav Nakov, Timothy Baldwin
Title: A Head to Predict and a Head to Question: Pre-trained Uncertainty Quantification Heads for Hallucination Detection in LLM Outputs
Abstract:
Large Language Models (LLMs) have the tendency to hallucinate, i.e., to sporadically generate false or fabricated information. This presents a major challenge, as hallucinations often appear highly convincing and users generally lack the tools to detect them. Uncertainty quantification (UQ) provides a framework for assessing the reliability of model outputs, aiding in the identification of potential hallucinations. In this work, we introduce pre-trained UQ heads: supervised auxiliary modules for LLMs that substantially enhance their ability to capture uncertainty compared to unsupervised UQ methods. Their strong performance stems from the powerful Transformer architecture in their design and informative features derived from LLM attention maps. Experimental evaluation shows that these heads are highly robust and achieve state-of-the-art performance in claim-level hallucination detection across both in-domain and out-of-domain prompts. Moreover, these modules demonstrate strong generalization to languages they were not explicitly trained on. We pre-train a collection of UQ heads for popular LLM series, including Mistral, Llama, and Gemma 2. We publicly release both the code and the pre-trained heads.
中文: 本研究提出预训练不确定性量化模块,通过利用Transformer架构和注意力特征显著提升大语言模型的幻觉检测能力,在跨领域和跨语言任务中均展现出最先进的鲁棒性。
English: This work introduces pre-trained uncertainty quantification heads that significantly improve hallucination detection in large language models by leveraging Transformer architecture and attention features, achieving state-of-the-art robustness across domains and languages.

Authors:Chao Feng, Nicolas Huber, Alberto Huertas Celdran, Gerome Bovet, Burkhard Stiller
Title: Demo: A Practical Testbed for Decentralized Federated Learning on Physical Edge Devices
Abstract:
Federated Learning (FL) enables collaborative model training without sharing raw data, preserving participant privacy. Decentralized FL (DFL) eliminates reliance on a central server, mitigating the single point of failure inherent in the traditional FL paradigm, while introducing deployment challenges on resource-constrained devices. To evaluate real-world applicability, this work designs and deploys a physical testbed using edge devices such as Raspberry Pi and Jetson Nano. The testbed is built upon a DFL training platform, NEBULA, and extends it with a power monitoring module to measure energy consumption during training. Experiments across multiple datasets show that model performance is influenced by the communication topology, with denser topologies leading to better outcomes in DFL settings.
中文: 联邦学习无需共享原始数据即可协同训练模型以保护隐私,而分散式联邦学习通过消除中央服务器避免了单点故障,但在资源受限设备上部署存在挑战,实验表明更密集的通信拓扑能提升模型性能。
English: Federated Learning allows collaborative model training without sharing raw data to protect privacy, while Decentralized Federated Learning removes the central server to avoid single points of failure but faces deployment challenges on resource-limited devices, with experiments showing that denser communication topologies improve model performance.

Authors:Letian Peng, Jingbo Shang
Title: Codifying Character Logic in Role-Playing
Abstract:
This paper introduces Codified Profiles for role-playing, a novel approach that represents character logic as structured, executable functions for behavioral decision-making. Each profile defines a set of functions parse_by_scene(scene) that outputs a list of logic-grounded assertions triggered_statements, using both explicit control structures (e.g., if-then-else) and condition checks like check_condition(scene, question), where each question is a semantically meaningful prompt about the scene (e.g., "Is the character in danger?") discriminated by the role-playing LLM as true, false, or unknown. This explicit representation offers three key advantages over traditional prompt-based profiles, which append character descriptions directly into text prompts: (1) Persistence, by enforcing complete and consistent execution of character logic, rather than relying on the model's implicit reasoning; (2) Updatability, through systematic inspection and revision of behavioral logic, which is difficult to track or debug in prompt-only approaches; (3) Controllable Randomness, by supporting stochastic behavior directly within the logic, enabling fine-grained variability that prompting alone struggles to achieve. To validate these advantages, we introduce a new benchmark constructed from 83 characters and 5,141 scenes curated from Fandom, using NLI-based scoring to compare character responses against ground-truth actions. Our experiments demonstrate the significant benefits of codified profiles in improving persistence, updatability, and behavioral diversity. Notably, by offloading a significant portion of reasoning to preprocessing, codified profiles enable even 1B-parameter models to perform high-quality role-playing, providing a scalable and efficient foundation for local deployment of role-play agents.
中文: 本文提出代码化角色档案这一创新方法,将角色逻辑构建为可执行函数,相比传统提示方法具有更强的持久性、可更新性和可控随机性优势,实验证明该方法能显著提升行为一致性与多样性,并使轻量级模型也能实现高质量角色扮演。
English: This paper presents Codified Profiles, a novel method that structures character logic into executable functions for role-playing, offering enhanced persistence, updatability, and controllable randomness over traditional prompt-based approaches, with experiments showing significant improvements in behavioral consistency and diversity even for smaller models.
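The idea of a codified profile can be sketched as follows. Only the names `parse_by_scene`, `check_condition`, and `triggered_statements` come from the abstract; the keyword-based judge (a stand-in for the role-playing LLM), the example questions, the statements, and the 0.7 probability are hypothetical choices made for illustration, not the authors' implementation.

```python
import random
from typing import List, Optional


def check_condition(scene: str, question: str) -> Optional[bool]:
    """Toy stand-in for the LLM judge: returns True, False, or None (unknown)."""
    keywords = {
        "Is the character in danger?": ["attack", "fire", "ambush"],
        "Is a friend present?": ["friend", "ally"],
    }
    if question not in keywords:
        return None  # unknown: this toy judge has no basis to answer
    return any(word in scene.lower() for word in keywords[question])


def parse_by_scene(scene: str, rng: random.Random) -> List[str]:
    """Toy codified profile: returns the assertions triggered by the scene."""
    triggered_statements: List[str] = []
    # An unknown answer (None) is falsy, so it is treated as "not triggered".
    if check_condition(scene, "Is the character in danger?"):
        # Controllable randomness: the stochastic choice lives in the logic.
        if rng.random() < 0.7:
            triggered_statements.append("The character draws a weapon.")
        else:
            triggered_statements.append("The character retreats.")
    elif check_condition(scene, "Is a friend present?"):
        triggered_statements.append("The character greets the friend warmly.")
    else:
        triggered_statements.append("The character stays watchful.")
    return triggered_statements


statements = parse_by_scene("An ambush erupts on the road.", random.Random(0))
print(statements)
```

Because the branching is explicit code rather than prose in a prompt, each rule can be inspected and revised individually, which is the updatability property the abstract describes.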

Authors:Songlin Dong, Chenhao Ding, Jiangyang Li, Jizhou Han, Qiang Wang, Yuhang He, Yihong Gong
Title: Beyond CLIP Generalization: Against Forward&Backward Forgetting Adapter for Continual Learning of Vision-Language Models
Abstract:
This study aims to address the problem of multi-domain task incremental learning (MTIL), which requires that vision-language models (VLMs) continuously acquire new knowledge while maintaining their inherent zero-shot recognition capability. Existing paradigms delegate the testing of unseen-domain samples to the original CLIP, which only prevents the degradation of the model's zero-shot capability but fails to enhance the generalization of the VLM further. To this end, we propose a novel MTIL framework, named AFA, which comprises two core modules: (1) an against forward-forgetting adapter that learns task-invariant information for each dataset in the incremental tasks to enhance the zero-shot recognition ability of VLMs; (2) an against backward-forgetting adapter that strengthens the few-shot learning capability of VLMs while supporting incremental learning. Extensive experiments demonstrate that the AFA method significantly outperforms existing state-of-the-art approaches, especially in few-shot MTIL tasks, and surpasses the inherent zero-shot performance of CLIP in terms of transferability. The code is provided in the Supplementary Material.
中文: 本研究提出AFA框架,通过防止前向与后向遗忘来增强视觉语言模型的多领域任务增量学习能力,在零样本和小样本任务中显著超越现有方法并突破CLIP的固有性能。
English: This study introduces the AFA framework to enhance multi-domain task incremental learning in vision-language models by preventing both forward and backward forgetting, significantly improving zero-shot and few-shot capabilities beyond existing methods.

Authors:Manuel Barusco, Francesco Borsatti, Youssef Ben Khalifa, Davide Dalle Pezze, Gian Antonio Susto
Title: Evaluating Modern Visual Anomaly Detection Approaches in Semiconductor Manufacturing: A Comparative Study
Abstract:
Semiconductor manufacturing is a complex, multistage process. Automated visual inspection of Scanning Electron Microscope (SEM) images is indispensable for minimizing equipment downtime and containing costs. Most previous research considers supervised approaches, assuming a sufficient number of anomalously labeled samples. On the contrary, Visual Anomaly Detection (VAD), an emerging research domain, focuses on unsupervised learning, avoiding the costly defect collection phase while providing explanations of the predictions. We introduce a benchmark for VAD in the semiconductor domain by leveraging the MIIC dataset. Our results demonstrate the efficacy of modern VAD approaches in this field.
中文: 半导体制造中,基于扫描电镜图像的自动化视觉检测至关重要,而新兴的无监督视觉异常检测方法通过MIIC数据集验证了其在无需标注缺陷情况下的高效性能。
English: Automated visual inspection using SEM images is crucial in semiconductor manufacturing, and a new benchmark for unsupervised Visual Anomaly Detection (VAD) demonstrates its effectiveness in detecting defects without labeled data.

Authors:Heqing Ren, Chao Feng, Alberto Huertas, Burkhard Stiller
Title: AugMixCloak: A Defense against Membership Inference Attacks via Image Transformation
Abstract:
Traditional machine learning (ML) raises serious privacy concerns, while federated learning (FL) mitigates the risk of data leakage by keeping data on local devices. However, the training process of FL can still leak sensitive information, which adversaries may exploit to infer private data. One of the most prominent threats is the membership inference attack (MIA), where the adversary aims to determine whether a particular data record was part of the training set. This paper addresses this problem through a two-stage defense called AugMixCloak. The core idea is to apply data augmentation and principal component analysis (PCA)-based information fusion to query images, which are detected by perceptual hashing (pHash) as either identical to or highly similar to images in the training set. Experimental results show that AugMixCloak successfully defends against both binary classifier-based MIA and metric-based MIA across five datasets and various decentralized FL (DFL) topologies. Compared with regularization-based defenses, AugMixCloak demonstrates stronger protection. Compared with confidence score masking, AugMixCloak exhibits better generalization.
Chinese: 联邦学习虽降低数据泄露风险,但仍面临成员推断攻击的威胁;AugMixCloak通过数据增强和基于PCA的信息融合技术,在多种分布式场景下有效防御此类攻击,其保护能力和泛化性能均优于现有防御方法。
English: Federated learning reduces data leakage risks but remains vulnerable to membership inference attacks, which AugMixCloak counters through data augmentation and PCA-based fusion to protect training data across diverse settings, outperforming existing defenses in both protection and generalization.

Authors:Arianna Stropeni, Francesco Borsatti, Manuel Barusco, Davide Dalle Pezze, Marco Fabris, Gian Antonio Susto
Title: Towards Scalable IoT Deployment for Visual Anomaly Detection via Efficient Compression
Abstract:
Visual Anomaly Detection (VAD) is a key task in industrial settings, where minimizing operational costs is essential. Deploying deep learning models within Internet of Things (IoT) environments introduces specific challenges due to limited computational power and bandwidth of edge devices. This study investigates how to perform VAD effectively under such constraints by leveraging compact, efficient processing strategies. We evaluate several data compression techniques, examining the tradeoff between system latency and detection accuracy. Experiments on the MVTec AD benchmark demonstrate that significant compression can be achieved with minimal loss in anomaly detection performance compared to uncompressed data. Current results show up to 80% reduction in end-to-end inference time, including edge processing, transmission, and server computation.
中文: 本研究探索在物联网环境下通过数据压缩技术实现高效的视觉异常检测,在MVTec AD基准测试中以最小精度损失实现了高达80%的推理加速。
English: This study explores efficient Visual Anomaly Detection in IoT environments by employing data compression techniques, achieving up to 80% faster inference with minimal accuracy loss on the MVTec AD benchmark.

Authors:Maximilian Egger, Rawad Bitar, Rüdiger Urbanke
Title: Efficient Machine Unlearning by Model Splitting and Core Sample Selection
Abstract:
Machine unlearning is essential for meeting legal obligations such as the right to be forgotten, which requires the removal of specific data from machine learning models upon request. While several approaches to unlearning have been proposed, existing solutions often struggle with efficiency and, more critically, with the verification of unlearning - particularly in the case of weak unlearning guarantees, where verification remains an open challenge. We introduce a generalized variant of the standard unlearning metric that enables more efficient and precise unlearning strategies. We also present an unlearning-aware training procedure that, in many cases, allows for exact unlearning. We term our approach MaxRR. When exact unlearning is not feasible, MaxRR still supports efficient unlearning with properties closely matching those achieved through full retraining.
中文: 作者提出MaxRR机器遗忘方法,通过广义化度量指标实现更高效的遗忘验证,并采用遗忘感知训练程序在可行时实现精确遗忘,在其他情况下仍能保持接近完全重新训练的效果。
English: The authors propose MaxRR, a machine unlearning method featuring a generalized metric for more efficient unlearning verification and an unlearning-aware training procedure that enables exact unlearning where possible, closely matching full retraining effectiveness otherwise.

Authors:Maximilian Egger, Rawad Bitar, Antonia Wachter-Zeh, Nir Weinberger, Deniz Gündüz
Title: Multi-Terminal Remote Generation and Estimation Over a Broadcast Channel With Correlated Priors
Abstract:
We study the multi-terminal remote estimation problem under a rate constraint, in which the goal of the encoder is to help each decoder estimate a function over a certain distribution -- while the distribution is known only to the encoder, the function to be estimated is known only to the decoders, and can also be different for each decoder. The decoders can observe correlated samples from prior distributions, instantiated through shared randomness with the encoder. To achieve this, we employ remote generation, where the encoder helps decoders generate samples from the underlying distribution by using the samples from the prior through importance sampling. While methods such as minimal random coding can be used to efficiently transmit samples to each decoder individually using their importance scores, it is unknown if the correlation among the samples from the priors can reduce the communication cost using the availability of a broadcast link. We propose a hierarchical importance sampling strategy that facilitates, in the case of non-zero Gács-Körner common information among the priors of the decoders, a common sampling step leveraging the availability of a broadcast channel. This is followed by a refinement step for the individual decoders. We present upper bounds on the bias and the estimation error for unicast transmission, which is of independent interest. We then introduce a method that splits into two phases, dedicated to broadcast and unicast transmission, respectively, and show the reduction in communication cost.
中文: 本研究提出一种分层重要性采样方法,通过利用共享随机性和广播信道来降低多终端远程估计中的通信成本,同时确保估计误差有界。
English: This research introduces a hierarchical importance sampling method for multi-terminal remote estimation, leveraging shared randomness and a broadcast channel to reduce communication costs while bounding estimation errors.
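As background for the importance-sampling step, here is a minimal sketch of a self-normalized importance sampling estimate: samples are drawn from a prior and reweighted to estimate an expectation under a different target distribution. It does not model the paper's hierarchical scheme, minimal random coding, or any communication constraint; the function name `importance_estimate` and the toy distributions are assumptions made for illustration.

```python
import random
from typing import Callable, Sequence


def importance_estimate(
    f: Callable[[int], float],
    target: Sequence[float],
    prior: Sequence[float],
    num_samples: int,
    rng: random.Random,
) -> float:
    """Self-normalized importance sampling estimate of E_target[f(X)].

    Samples are drawn from `prior`; each sample x gets weight target[x] / prior[x].
    """
    support = range(len(prior))
    samples = rng.choices(support, weights=prior, k=num_samples)
    weights = [target[x] / prior[x] for x in samples]
    total = sum(weights)
    return sum(w * f(x) for w, x in zip(weights, samples)) / total


# Toy example: uniform prior, skewed target, f(x) = x (exact mean is 2.0).
target_dist = [0.1, 0.2, 0.3, 0.4]
prior_dist = [0.25, 0.25, 0.25, 0.25]
estimate = importance_estimate(lambda x: float(x), target_dist, prior_dist, 20000, random.Random(1))
print(f"estimate of E[f(X)] under the target: {estimate:.3f}")
```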

Authors:Maximilian Egger, Svenja Lage, Rawad Bitar, Antonia Wachter-Zeh
Title: Source Anonymity for Private Random Walk Decentralized Learning
Abstract:
This paper considers random walk-based decentralized learning, where at each iteration of the learning process, one user updates the model and sends it to a randomly chosen neighbor until a convergence criterion is met. Preserving data privacy is a central concern and open problem in decentralized learning. We propose a privacy-preserving algorithm based on public-key cryptography and anonymization. In this algorithm, the user updates the model and encrypts the result using a distant user's public key. The encrypted result is then transmitted through the network with the goal of reaching that specific user. The key idea is to hide the source's identity so that, when the destination user decrypts the result, it does not know who the source was. The challenge is to design a network-dependent probability distribution (at the source) over the potential destinations such that, from the receiver's perspective, all users have a similar likelihood of being the source. We introduce the problem and construct a scheme that provides anonymity with theoretical guarantees. We focus on random regular graphs to establish rigorous guarantees.
中文: 本文提出了一种基于公钥密码学和匿名化的隐私保护去中心化学习算法,通过在模型传输过程中隐藏发送者身份,并在随机正则图上建立了理论匿名性保证。
English: This paper introduces a privacy-preserving decentralized learning algorithm that uses public-key cryptography and anonymization to conceal the source's identity during model transmission, ensuring theoretical anonymity guarantees on random regular graphs.
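The underlying random-walk learning loop, without any of the encryption or anonymity machinery, can be sketched as below. The scalar model, the quadratic local loss, the 1/t step size, and the ring topology (a 2-regular graph) are assumptions chosen to keep the example small; they are not taken from the paper.

```python
import random
from typing import Dict, List


def random_walk_learning(
    neighbors: Dict[int, List[int]],
    local_data: Dict[int, float],
    num_steps: int,
    rng: random.Random,
    start: int = 0,
) -> float:
    """Pass a scalar model along a random walk; each visited user updates it.

    User i holds one number d_i and takes a gradient step on (model - d_i)^2 / 2
    with step size 1/t, so the model equals the running average of the data
    seen along the walk.
    """
    model = 0.0
    user = start
    for t in range(1, num_steps + 1):
        model -= (model - local_data[user]) / t  # local update at the current user
        user = rng.choice(neighbors[user])       # forward to a random neighbor
    return model


# Ring of 6 users; user i holds the value i, so the global mean is 2.5.
num_users = 6
ring = {i: [(i - 1) % num_users, (i + 1) % num_users] for i in range(num_users)}
data = {i: float(i) for i in range(num_users)}
final_model = random_walk_learning(ring, data, 20000, random.Random(0))
print(f"model after the walk: {final_model:.3f}")
```

On a regular graph the walk visits users uniformly in the long run, which is why the running average in this toy setting approaches the global mean.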

Authors:Zhe Li, Hadrien Reynaud, Mischa Dombrowski, Sarah Cechnicka, Franciskus Xaverius Erick, Bernhard Kainz
Title: Video Dataset Condensation with Diffusion Models
Abstract:
In recent years, the rapid expansion of dataset sizes and the increasing complexity of deep learning models have significantly escalated the demand for computational resources, both for data storage and model training. Dataset distillation has emerged as a promising solution to address this challenge by generating a compact synthetic dataset that retains the essential information from a large real dataset. However, existing methods often suffer from limited performance and poor data quality, particularly in the video domain. In this paper, we focus on video dataset distillation by employing a video diffusion model to generate high-quality synthetic videos. To enhance representativeness, we introduce Video Spatio-Temporal U-Net (VST-UNet), a model designed to select a diverse and informative subset of videos that effectively captures the characteristics of the original dataset. To further optimize computational efficiency, we explore a training-free clustering algorithm, Temporal-Aware Cluster-based Distillation (TAC-DT), to select representative videos without requiring additional training overhead. We validate the effectiveness of our approach through extensive experiments on four benchmark datasets, demonstrating performance improvements of up to 10.61% over the state-of-the-art. Our method consistently outperforms existing approaches across all datasets, establishing a new benchmark for video dataset distillation.
Chinese: 本文提出了一种新颖的视频数据集蒸馏方法,通过视频扩散模型和视频时空U-Net生成高质量合成视频,在基准数据集上相比现有最优方法实现了高达10.61%的性能提升。
English: This paper introduces a novel video dataset distillation method using a video diffusion model and a Video Spatio-Temporal U-Net to generate high-quality synthetic videos, achieving up to 10.61% performance improvement over state-of-the-art methods on benchmark datasets.

Authors:Faisal Haque Bappy, EunJeong Cheon, Tariqul Islam
Title: Centralized Trust in Decentralized Systems: Unveiling Hidden Contradictions in Blockchain and Cryptocurrency
Abstract:
Blockchain technology promises to democratize finance and promote social equity through decentralization, but questions remain about whether current implementations advance or hinder these goals. Through a mixed-methods study combining semi-structured interviews with 13 diverse blockchain stakeholders and analysis of over 3,000 cryptocurrency discussions on Reddit, we examine how trust manifests in cryptocurrency ecosystems despite their decentralized architecture. Our findings uncover that users actively seek out and create centralized trust anchors, such as established exchanges, prominent community figures, and recognized development teams, contradicting blockchain's fundamental promise of trustless interactions. We identify how this contradiction arises from users' mental need for accountability and their reluctance to shoulder the full responsibility of self-custody. The study also reveals how these centralized trust patterns disproportionately impact different user groups, with newer and less technical users showing stronger preferences for centralized intermediaries. This work contributes to our understanding of the inherent tensions between theoretical decentralization and practical implementation in cryptocurrency systems, highlighting the persistent role of centralized trust in supposedly trustless environments.
中文摘要:尽管区块链技术承诺去中心化,但用户在实践中反而依赖交易所和意见领袖等中心化信任锚,揭示了理论上的无需信任理念与实际对责任主体的需求之间存在根本矛盾。
English Summary: Despite blockchain's promise of decentralization, users paradoxically create centralized trust anchors like exchanges and influencers, revealing a contradiction between theoretical trustless ideals and practical reliance on accountability.

Authors:Junyi Peng, Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Černocký
Title: TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models
Abstract:
Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions -- a more challenging yet practical case. In this paper, we introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which includes four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from the speech mixture. In our benchmark, the speaker embedding extracted from enrollment speech is used as a clue to condition downstream models. The benchmark result reveals the importance of evaluating SSL models in target speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. Moreover, by using a unified SSL-based target speech encoder, consisting of a speaker encoder and an extractor module, we also investigate joint optimization across TS tasks to leverage mutual information and demonstrate its effectiveness.
中文: 本文提出TS-SUPERB基准,用于评估自监督学习模型在嘈杂多人说话环境中的目标说话人任务表现,发现其性能与单说话人场景存在显著差异,并验证了跨任务联合优化的有效性。
English: This paper introduces TS-SUPERB, a benchmark evaluating self-supervised learning models on target-speaker tasks in noisy multi-talker environments, revealing their distinct performance from single-speaker scenarios and demonstrating the effectiveness of joint optimization across tasks.

Authors:Minting Pan, Yitao Zheng, Jiajian Li, Yunbo Wang, Xiaokang Yang
Title: Video-Enhanced Offline Reinforcement Learning: A Model-Based Approach
Abstract:
Offline reinforcement learning (RL) enables policy optimization using static datasets, avoiding the risks and costs of extensive real-world exploration. However, it struggles with suboptimal offline behaviors and inaccurate value estimation due to the lack of environmental interaction. We present Video-Enhanced Offline RL (VeoRL), a model-based method that constructs an interactive world model from diverse, unlabeled video data readily available online. Leveraging model-based behavior guidance, our approach transfers commonsense knowledge of control policy and physical dynamics from natural videos to the RL agent within the target domain. VeoRL achieves substantial performance gains (over 100% in some cases) across visual control tasks in robotic manipulation, autonomous driving, and open-world video games.
Chinese: VeoRL是一种基于模型的离线强化学习方法,通过利用多样化的在线视频数据构建交互式世界模型,在机器人操控、自动驾驶和开放世界游戏等视觉控制任务中取得显著性能提升(部分情况下超过100%)。
English: VeoRL is a model-based offline reinforcement learning method that enhances policy optimization by leveraging diverse online video data to build an interactive world model, achieving substantial performance gains (over 100% in some cases) across visual control tasks.

Authors:Yuki Kadokawa, Jonas Frey, Takahiro Miki, Takamitsu Matsubara, Marco Hutter
Title: DAPPER: Discriminability-Aware Policy-to-Policy Preference-Based Reinforcement Learning for Query-Efficient Robot Skill Acquisition
Abstract:
Preference-based Reinforcement Learning (PbRL) enables policy learning through simple queries comparing trajectories from a single policy. While human responses to these queries make it possible to learn policies aligned with human preferences, PbRL suffers from low query efficiency, as policy bias limits trajectory diversity and reduces the number of discriminable queries available for learning preferences. This paper identifies preference discriminability, which quantifies how easily a human can judge which trajectory is closer to their ideal behavior, as a key metric for improving query efficiency. To address this, we move beyond comparisons within a single policy and instead generate queries by comparing trajectories from multiple policies, as training them from scratch promotes diversity without policy bias. We propose Discriminability-Aware Policy-to-Policy Preference-Based Efficient Reinforcement Learning (DAPPER), which integrates preference discriminability with trajectory diversification achieved by multiple policies. DAPPER trains new policies from scratch after each reward update and employs a discriminator that learns to estimate preference discriminability, enabling the prioritized sampling of more discriminable queries. During training, it jointly maximizes the preference reward and preference discriminability score, encouraging the discovery of highly rewarding and easily distinguishable policies. Experiments in simulated and real-world legged robot environments demonstrate that DAPPER outperforms previous methods in query efficiency, particularly under challenging preference discriminability conditions.
中文摘要:本文提出DAPPER方法,通过多策略生成多样化轨迹查询并利用可区分性度量进行优先采样,显著提升了基于偏好的强化学习在挑战性环境中的查询效率。
English Summary: This paper introduces DAPPER, a novel preference-based reinforcement learning method that enhances query efficiency by generating discriminable queries through multiple diverse policies and prioritizing them using a learned discriminability metric.

Authors:Weiyi Zhang, Peranut Chotcomwongse, Yinwen Li, Pusheng Xu, Ruijie Yao, Lianhao Zhou, Yuxuan Zhou, Hui Feng, Qiping Zhou, Xinyue Wang, Shoujin Huang, Zihao Jin, Florence H. T. Chung, Shujun Wang, Yalin Zheng, Mingguang He, Danli Shi, Paisan Ruamviboonsuk
Title: Predicting Diabetic Macular Edema Treatment Responses Using OCT: Dataset and Methods of APTOS Competition
Abstract:
Diabetic macular edema (DME) significantly contributes to visual impairment in diabetic patients. Treatment responses to intravitreal therapies vary, highlighting the need for patient stratification to predict therapeutic benefits and enable personalized strategies. To our knowledge, this study is the first to explore pre-treatment stratification for predicting DME treatment responses. To advance this research, we organized the 2nd Asia-Pacific Tele-Ophthalmology Society (APTOS) Big Data Competition in 2021. The competition focused on improving predictive accuracy for anti-VEGF therapy responses using ophthalmic OCT images. We provided a dataset containing tens of thousands of OCT images from 2,000 patients with labels across four sub-tasks. This paper details the competition's structure, dataset, leading methods, and evaluation metrics. The competition attracted strong scientific community participation, with 170 teams initially registering and 41 reaching the final round. The top-performing team achieved an AUC of 80.06%, highlighting the potential of AI in personalized DME treatment and clinical decision-making.
中文摘要:本研究通过组织竞赛开发基于OCT图像的AI模型预测糖尿病黄斑水肿治疗反应,取得显著成果,为个性化医疗提供了有力支持。
English Summary: This study organized a competition to develop AI models for predicting diabetic macular edema treatment responses from OCT images; the top-performing team achieved an AUC of 80.06%, highlighting the potential of AI in personalized DME treatment.

Authors:Yue Wu, Yibo Guo, Yulong Yan, Jiancheng Yang, Xin Zhou, Ching-Yu Cheng, Danli Shi, Mingguang He
Title: AI-powered virtual eye: perspective, challenges and opportunities
Abstract:
We envision the "virtual eye" as a next-generation, AI-powered platform that uses interconnected foundation models to simulate the eye's intricate structure and biological function across all scales. Advances in AI, imaging, and multiomics provide a fertile ground for constructing a universal, high-fidelity digital replica of the human eye. This perspective traces the evolution from early mechanistic and rule-based models to contemporary AI-driven approaches, integrating them into a unified model with multimodal, multiscale, dynamic predictive capabilities and embedded feedback mechanisms. We propose a development roadmap emphasizing the roles of large-scale multimodal datasets, generative AI, foundation models, agent-based architectures, and interactive interfaces. Despite challenges in interpretability, ethics, data processing and evaluation, the virtual eye holds the potential to revolutionize personalized ophthalmic care and accelerate research into ocular health and disease.
中文: “虚拟眼”是一个基于人工智能的平台,通过整合多模态、多尺度的数据构建动态的人眼数字模型,有望革新个性化眼科诊疗并加速眼部疾病研究,尽管面临伦理和数据处理的挑战。
English: The "virtual eye" is a proposed AI-driven platform that integrates multimodal, multiscale data to create a dynamic digital replica of the human eye, aiming to transform personalized ophthalmology and accelerate ocular research despite challenges in ethics and data processing.

Authors:Min Chen, Jinglei Cheng, Pingzhi Li, Haoran Wang, Tianlong Chen, Junyu Liu
Title: GroverGPT-2: Simulating Grover's Algorithm via Chain-of-Thought Reasoning and Quantum-Native Tokenization
Abstract:
Quantum computing offers theoretical advantages over classical computing for specific tasks, yet the boundary of practical quantum advantage remains an open question. To investigate this boundary, it is crucial to understand whether, and how, classical machines can learn and simulate quantum algorithms. Recent progress in large language models (LLMs) has demonstrated strong reasoning abilities, prompting exploration into their potential for this challenge. In this work, we introduce GroverGPT-2, an LLM-based method for simulating Grover's algorithm using Chain-of-Thought (CoT) reasoning and quantum-native tokenization. Building on its predecessor, GroverGPT-2 performs simulation directly from quantum circuit representations while producing logically structured and interpretable outputs. Our results show that GroverGPT-2 can learn and internalize quantum circuit logic through efficient processing of quantum-native tokens, providing direct evidence that classical models like LLMs can capture the structure of quantum algorithms. Furthermore, GroverGPT-2 outputs interleave circuit data with natural language, embedding explicit reasoning into the simulation. This dual capability positions GroverGPT-2 as a prototype for advancing machine understanding of quantum algorithms and modeling quantum circuit logic. We also identify an empirical scaling law for GroverGPT-2 with increasing qubit numbers, suggesting a path toward scalable classical simulation. These findings open new directions for exploring the limits of classical simulatability, enhancing quantum education and research, and laying groundwork for future foundation models in quantum computing.
中文: GroverGPT-2 通过量子原生标记处理和生成带有嵌入式推理的可解释输出,证明了大语言模型能够模拟量子算法,为量子电路的经典模拟提供了可扩展的途径。
English: GroverGPT-2 demonstrates that large language models can simulate quantum algorithms by processing quantum-native tokens and generating interpretable outputs with embedded reasoning, offering a scalable approach to classical simulation of quantum circuits.
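
Illustrative sketch (not from the paper): for readers unfamiliar with what "simulating Grover's algorithm" means classically, the snippet below runs a plain state-vector simulation of Grover's search with a single marked item. It is a textbook baseline, not GroverGPT-2's method; the function name `grover_probabilities` and its parameters are assumptions introduced here for illustration.

```python
import math


def grover_probabilities(n_qubits, marked, iterations=None):
    """Classically simulate Grover's search with a dense state vector.

    Hypothetical helper for illustration only; it assumes exactly one
    marked basis state, given by its integer index `marked`.
    """
    n_states = 2 ** n_qubits
    if iterations is None:
        # Standard choice for one marked item: about (pi / 4) * sqrt(N).
        iterations = int(math.floor(math.pi / 4 * math.sqrt(n_states)))

    # Uniform superposition, as produced by a layer of Hadamard gates.
    amps = [1.0 / math.sqrt(n_states)] * n_states

    for _ in range(iterations):
        # Oracle: flip the sign of the marked state's amplitude.
        amps[marked] = -amps[marked]
        # Diffusion: reflect every amplitude about the mean amplitude.
        mean = sum(amps) / n_states
        amps = [2.0 * mean - a for a in amps]

    # Amplitudes stay real here, so probabilities are plain squares.
    return [a * a for a in amps]


probs = grover_probabilities(n_qubits=3, marked=5)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(probs[best], 3))
```

The state vector has 2^n entries, so this direct approach scales exponentially with the number of qubits; that cost is the background against which learned or approximate simulators are of interest.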

Authors:Xinyi Wang, Shaukat Ali, Paolo Arcaini
Title: Quantum Artificial Intelligence for Software Engineering: the Road Ahead
Abstract:
Artificial Intelligence (AI) has been applied to various areas of software engineering, including requirements engineering, coding, testing, and debugging. This has led to the emergence of AI for Software Engineering as a distinct research area within software engineering. With the development of quantum computing, the field of Quantum AI (QAI) is arising, enhancing the performance of classical AI and holding significant potential for solving classical software engineering problems. Some initial applications of QAI in software engineering have already emerged, such as software test optimization. However, the path ahead remains open, offering ample opportunities to solve complex software engineering problems with QAI cost-effectively. To this end, this paper presents open research opportunities and challenges in QAI for software engineering that need to be addressed.
中文: 人工智能已广泛应用于软件工程的各个领域,而新兴的量子人工智能通过提升性能为解决复杂软件工程问题提供了潜力,但仍需探索开放的研究机遇与挑战。
English: Artificial Intelligence is transforming software engineering through applications in requirements, coding, and testing, while emerging Quantum AI promises enhanced performance and cost-effective solutions for complex challenges, though open research opportunities remain.

Authors:Ahmed A. Metwally, A. Ali Heydari, Daniel McDuff, Alexandru Solot, Zeinab Esmaeilpour, Anthony Z Faranesh, Menglian Zhou, David B. Savage, Conor Heneghan, Shwetak Patel, Cathy Speed, Javier L. Prieto
Title: Insulin Resistance Prediction From Wearables and Routine Blood Biomarkers
Abstract:
Insulin resistance, a precursor to type 2 diabetes, is characterized by impaired insulin action in tissues. Current methods for measuring insulin resistance, while effective, are expensive and not widely accessible, which hinders opportunities for early intervention. In this study, we remotely recruited the largest dataset to date across the US to study insulin resistance (N=1,165 participants, with median BMI=28 kg/m^2, age=45 years, HbA1c=5.4%), incorporating wearable device time series data and blood biomarkers, including the ground-truth measure of insulin resistance, homeostatic model assessment for insulin resistance (HOMA-IR). We developed deep neural network models to predict insulin resistance based on readily available digital and blood biomarkers. Our results show that our models predict insulin resistance better by combining wearable data and readily available blood biomarkers than by using either data source separately (R^2=0.5, auROC=0.80, sensitivity=76%, and specificity=84%). The model showed 93% sensitivity and 95% adjusted specificity in obese and sedentary participants, a subpopulation most vulnerable to developing type 2 diabetes and who could benefit most from early intervention. Rigorous evaluation of model performance, including interpretability and robustness, facilitates generalizability across larger cohorts, which is demonstrated by reproducing the prediction performance on an independent validation cohort (N=72 participants). Additionally, we demonstrated how the predicted insulin resistance can be integrated into a large language model agent to help understand and contextualize HOMA-IR values, facilitating interpretation and safe personalized recommendations. This work offers the potential for early detection of people at risk of type 2 diabetes and may thereby facilitate earlier implementation of preventative strategies.
中文: 本研究利用可穿戴设备数据和血液生物标志物开发了深度神经网络模型,能准确预测胰岛素抵抗,在2型糖尿病早期筛查和个性化风险评估方面展现出优越性能。
English: This study developed deep neural network models using wearable device data and blood biomarkers to accurately predict insulin resistance, demonstrating superior performance in early detection and personalized risk assessment for type 2 diabetes prevention.
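
Illustrative sketch (not from the paper): the abstract uses HOMA-IR as its ground-truth label without stating how it is computed. The snippet below shows the standard HOMA-IR formula from fasting glucose and fasting insulin; the function name and the example values are assumptions for illustration, and no diagnostic cutoff is implied because the abstract does not give one.

```python
def homa_ir(fasting_glucose_mg_dl, fasting_insulin_uU_ml):
    """Standard HOMA-IR: glucose (mg/dL) * insulin (microU/mL) / 405.

    With glucose in mmol/L the equivalent divisor is 22.5.
    Hypothetical helper name, introduced here for illustration.
    """
    return fasting_glucose_mg_dl * fasting_insulin_uU_ml / 405.0


# Made-up example values, not data from the study.
print(homa_ir(fasting_glucose_mg_dl=90.0, fasting_insulin_uU_ml=9.0))
```

Because the formula needs a fasting insulin measurement, which is not part of most routine blood panels, predicting HOMA-IR from wearables and routine biomarkers is the substitution the study explores.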

Authors:Haoyue Liu, Jinghan Xu, Yi Chang, Hanyu Zhou, Haozhi Zhao, Lin Wang, Luxin Yan
Title: TimeTracker: Event-based Continuous Point Tracking for Video Frame Interpolation with Non-linear Motion
Abstract:
Video frame interpolation (VFI) that leverages bio-inspired event cameras as guidance has recently shown better performance and memory efficiency than frame-based methods, thanks to the event cameras' advantages, such as high temporal resolution. A hurdle for event-based VFI is how to effectively deal with non-linear motion, caused by dynamic changes in motion direction and speed within the scene. Existing methods either use events to estimate sparse optical flow or fuse events with image features to estimate dense optical flow. Unfortunately, motion errors often degrade the VFI quality because the continuous motion cues from events do not align with the dense spatial information of images in the temporal dimension. In this paper, we observe that object motion is continuous in space, so tracking local regions over continuous time enables more accurate identification of spatiotemporal feature correlations. In light of this, we propose a novel continuous point tracking-based VFI framework, named TimeTracker. Specifically, we first design a Scene-Aware Region Segmentation (SARS) module to divide the scene into similar patches. Then, a Continuous Trajectory guided Motion Estimation (CTME) module is proposed to track the continuous motion trajectory of each patch through events. Finally, intermediate frames at any given time are generated through global motion optimization and frame refinement. Moreover, we collect a real-world dataset that features fast non-linear motion. Extensive experiments show that our method outperforms prior art in both motion estimation and frame interpolation quality.
中文摘要:提出的TimeTracker框架通过连续点追踪技术处理非线性运动,借助场景感知区域分割和轨迹引导运动估计,在运动估计和帧插值质量上均优于现有方法。
English Summary: The proposed TimeTracker framework improves video frame interpolation by using continuous point tracking to handle non-linear motion, achieving superior performance in motion estimation and frame quality through scene-aware region segmentation and trajectory-guided motion estimation.

Authors:Xiaofeng Liu, Yongsong Huang, Thibault Marin, Samira Vafay Eslahi, Tiss Amal, Yanis Chemli, Keith Johnson, Georges El Fakhri, Jinsong Ouyang
Title: Dual Prompting for Diverse Count-level PET Denoising
Abstract:
Positron emission tomography (PET) volumes to be denoised inherently span diverse count levels, which makes it challenging for a unified model to handle all cases. In this work, we use prompt learning to achieve generalizable PET denoising across different count levels. Specifically, we propose dual prompts to guide PET denoising in a divide-and-conquer manner: an explicit count-level prompt that provides specific prior information and an implicit general denoising prompt that encodes essential PET denoising knowledge. A novel prompt fusion module then unifies the heterogeneous prompts, followed by a prompt-feature interaction module that injects the prompts into the features. The prompts dynamically guide the noise-conditioned denoising process. We can therefore efficiently train a unified denoising model for various count levels and deploy it to different cases with personalized prompts. We evaluated on 1940 low-count PET 3D volumes, generated with uniformly randomly selected 13-22% fractions of events from 97 $^{18}$F-MK6240 tau PET studies. The results show that dual prompting with an informed count level largely improves performance and outperforms the count-conditional model.
中文: 本研究提出一种双提示学习方法,通过显式计数水平和隐式去噪提示动态引导PET去噪,使统一模型能有效处理不同计数水平,显著提升性能。
English: This study introduces a dual-prompt learning approach for generalizable PET denoising, using explicit count-level and implicit denoising prompts that dynamically guide noise removal, enabling a unified model to handle diverse count levels effectively.

Authors:Qingqiu Li, Zihang Cui, Seongsu Bae, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Junjun He, Shujun Wang
Title: AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation
Abstract:
Chest X-rays (CXRs) are the most frequently performed imaging examinations in clinical settings. Recent advancements in Large Multimodal Models (LMMs) have enabled automated CXR interpretation, enhancing diagnostic accuracy and efficiency. However, despite their strong visual understanding, current Medical LMMs (MLMMs) still face two major challenges: (1) Insufficient region-level understanding and interaction, and (2) Limited accuracy and interpretability due to single-step reasoning. In this paper, we empower MLMMs with anatomy-centric reasoning capabilities to enhance their interactivity and explainability. Specifically, we first propose an Anatomical Ontology-Guided Reasoning (AOR) framework, which centers on cross-modal region-level information to facilitate multi-step reasoning. Next, under the guidance of expert physicians, we develop AOR-Instruction, a large instruction dataset for MLMMs training. Our experiments demonstrate AOR's superior performance in both VQA and report generation tasks.
中文摘要:本文提出了一种以解剖学本体为指导的推理框架,通过增强区域级理解和多步推理能力来改进医学大型多模态模型,并在视觉问答和报告生成任务中验证了其优越性能。
English Summary: This paper introduces an Anatomical Ontology-Guided Reasoning (AOR) framework to enhance medical large multimodal models by improving region-level understanding and multi-step reasoning, validated through superior performance in visual question answering and report generation.

Authors:Sahan Liyanaarachchi, Sennur Ulukus, Nail Akar
Title: Structured Estimators: A New Perspective on Information Freshness
Abstract:
In recent literature, when modeling information freshness in remote estimation settings, estimators have been mainly restricted to the class of martingale estimators, meaning the remote estimate at any time is equal to the most recently received update. This is mainly due to their simplicity and ease of analysis. However, these martingale estimators are far from optimal in some cases, especially in pull-based update systems. For such systems, maximum a posteriori probability (MAP) estimators are optimal, but can be challenging to analyze. Here, we introduce a new class of estimators, called structured estimators, which retain useful characteristics of a MAP estimate while still being analytically tractable. Our proposed estimators move seamlessly from a martingale estimator to a MAP estimator.
中文: 该摘要提出结构化估计器作为一类新型估计方法,它融合了简单但次优的鞅估计器和最优但复杂的MAP估计器的优点,在拉动式更新系统中兼具分析易处理性和适应性。
English: The abstract introduces structured estimators as a novel class that bridges the gap between simple but suboptimal martingale estimators and optimal yet complex MAP estimators, offering analytical tractability and adaptability in pull-based systems.

Authors:Hongbo Zhao, Ziwei Long, Mengtan Zhang, Hanli Wang, Qijun Chen, Rui Fan
Title: A Birotation Solution for Relative Pose Problems
Abstract:
Relative pose estimation, a fundamental computer vision problem, has been extensively studied for decades. Existing methods either estimate and decompose the essential matrix or directly estimate the rotation and translation to obtain the solution. In this article, we break the mold by tackling this traditional problem with a novel birotation solution. We first introduce three basis transformations, each associated with a geometric metric to quantify the distance between the relative pose to be estimated and its corresponding basis transformation. Three energy functions, designed based on these metrics, are then minimized on the Riemannian manifold $\mathrm{SO(3)}$ by iteratively updating the two rotation matrices. The two rotation matrices and the basis transformation corresponding to the minimum energy are ultimately utilized to recover the relative pose. Extensive quantitative and qualitative evaluations across diverse relative pose estimation tasks demonstrate the superior performance of our proposed birotation solution. Source code, demo video, and datasets will be available at https://mias.group/birotation-solution upon publication.
中文: 本文提出了一种新颖的双旋转解决方案,通过最小化黎曼流形上的三个能量函数来迭代更新旋转矩阵,在多种相对姿态估计任务中展现出卓越性能。
English: This paper introduces a novel birotation solution for relative pose estimation, which minimizes three energy functions on the Riemannian manifold to iteratively update rotation matrices and achieve superior performance across diverse tasks.

Authors:Zhengyuan Shi, Zeju Li, Chengyu Ma, Yunhao Zhou, Ziyang Zheng, Jiawei Liu, Hongyang Pan, Lingfeng Zhou, Kezhi Li, Jiaying Zhu, Lingwei Yan, Zhiqiang He, Chenhao Xue, Wentao Jiang, Fan Yang, Guangyu Sun, Xiaoyan Yang, Gang Chen, Chuan Shi, Zhufei Chu, Jun Yang, Qiang Xu
Title: ForgeEDA: A Comprehensive Multimodal Dataset for Advancing EDA
Abstract:
We introduce ForgeEDA, an open-source comprehensive circuit dataset across various categories. ForgeEDA includes diverse circuit representations such as Register Transfer Level (RTL) code, Post-mapping (PM) netlists, And-Inverter Graphs (AIGs), and placed netlists, enabling comprehensive analysis and development. We demonstrate ForgeEDA's utility by benchmarking state-of-the-art EDA algorithms on critical tasks such as Power, Performance, and Area (PPA) optimization, highlighting its ability to expose performance gaps and drive advancements. Additionally, ForgeEDA's scale and diversity facilitate the training of AI models for EDA tasks, demonstrating its potential to improve model performance and generalization. By addressing limitations in existing datasets, ForgeEDA aims to catalyze breakthroughs in modern IC design and support the next generation of innovations in EDA.
中文: ForgeEDA是一个包含多种电路表示的开源电路数据集,能够支持全面的EDA算法评估和人工智能模型训练,旨在推动集成电路设计领域的创新发展。
English: ForgeEDA is an open-source circuit dataset featuring multiple circuit representations that enable comprehensive EDA algorithm benchmarking and AI model training, aiming to drive innovations in integrated circuit design.

Authors:Kaidong Zhang, Rongtao Xu, Pengzhen Ren, Junfan Lin, Hefeng Wu, Liang Lin, Xiaodan Liang
Title: RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation
Abstract:
Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.
中文摘要:RoBridge是一种分层智能架构,通过结合高层认知规划与强化学习,有效解决了机器人操作中的程序性和陈述性技能困境,在新任务和仿真到现实的迁移中均实现了高成功率。
English Summary: RoBridge is a hierarchical architecture that integrates cognitive planning with reinforcement learning to overcome procedural and declarative skill limitations in robotic manipulation, achieving high success rates in new tasks and sim-to-real generalization.

Authors:Alireza Sadeghi, Farshid Hajati, Ahmadreza Argha, Nigel H Lovell, Min Yang, Hamid Alinejad-Rokny
Title: Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking
Abstract:
Integrating heterogeneous biomedical data, including imaging, omics, and clinical records, supports accurate diagnosis and personalised care. Graph-based models fuse such non-Euclidean data by capturing spatial and relational structure, yet clinical uptake requires regulator-ready interpretability. We present the first technical survey of interpretable graph-based models for multimodal biomedical data, covering 26 studies published between Jan 2019 and Sep 2024. Most target disease classification, notably cancer, and rely on static graphs built from simple similarity measures. Graph-native explainers are rare; post-hoc methods adapted from non-graph domains, such as gradient saliency and SHAP, predominate. We group existing approaches into four interpretability families, outline trends such as graph-in-graph hierarchies, knowledge-graph edges, and dynamic topology learning, and perform a practical benchmark. Using an Alzheimer's disease (AD) cohort, we compare Sensitivity Analysis, Gradient Saliency, SHAP and Graph Masking. SHAP and Sensitivity Analysis recover the broadest set of known AD pathways and Gene-Ontology terms, whereas Gradient Saliency and Graph Masking surface complementary metabolic and transport signatures. Permutation tests show all four beat random gene sets, but with distinct trade-offs: SHAP and Graph Masking offer deeper biology at higher compute cost, while Gradient Saliency and Sensitivity Analysis are quicker though coarser. We also provide a step-by-step flowchart covering graph construction, explainer choice and resource budgeting to help researchers balance transparency and performance. This review synthesises the state of interpretable graph learning for multimodal medicine, benchmarks leading techniques, and charts future directions, from advanced XAI tools to under-studied diseases, serving as a concise reference for method developers and translational scientists.
中文: 该综述系统梳理了多模态生物医学数据的可解释图学习方法,通过基准测试比较SHAP和梯度显著性等方法在揭示生物学机制方面的优劣,并为平衡模型透明度与性能提供了实用流程指南。
English: This survey synthesizes interpretable graph-based models for integrating multimodal biomedical data, benchmarking methods like SHAP and Gradient Saliency to reveal biological insights while providing practical guidelines for balancing transparency and performance.

Authors:Phuoc Pham, Murali Sridharan, Matteo Esposito, Valentina Lenarduzzi
Title: Descriptor: C++ Self-Admitted Technical Debt Dataset (CppSATD)
Abstract:
In software development, technical debt (TD) refers to suboptimal implementation choices made by developers to meet urgent deadlines under limited resources, posing challenges for future maintenance. Self-Admitted Technical Debt (SATD) is a sub-type of TD, representing specific TD instances "openly admitted" by the developers and often expressed through source code comments. Previous research on SATD has focused predominantly on the Java programming language, revealing a significant gap in cross-language SATD. Such a narrow focus limits the generalizability of existing findings as well as SATD detection techniques across multiple programming languages. Our work addresses this limitation by introducing CppSATD, a dedicated C++ SATD dataset, comprising over 531,000 annotated comments and their source code contexts. Our dataset can serve as a foundation for future studies that aim to develop SATD detection methods in C++, generalize the existing findings to other languages, or contribute novel insights to cross-language SATD research.
中文摘要:技术债指软件开发中为赶工期而做出的次优实现选择,其中自承技术债是开发者公开承认的特定技术债实例;以往研究多集中于Java语言,限制了跨语言适用性,本文通过构建C++专用数据集解决了这一局限。
English Summary: Technical debt in software development involves suboptimal choices made for quick delivery, and self-admitted technical debt (SATD) is the portion openly acknowledged by developers. Previous research has focused predominantly on Java, hindering cross-language applicability; this paper addresses the gap with CppSATD, a dedicated C++ dataset of over 531,000 annotated comments.

Authors:Shaokun Zhang, Yi Dong, Jieyu Zhang, Jan Kautz, Bryan Catanzaro, Andrew Tao, Qingyun Wu, Zhiding Yu, Guilin Liu
Title: Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning
Abstract:
Enabling large language models with external tools has become a pivotal strategy for extending their functionality beyond text space. To enhance LLMs' tool-calling abilities, previous approaches primarily rely on supervised fine-tuning (SFT) with trajectories distilled from stronger models, often resulting in imitative reasoning that limits generalization. In this work, we explore rule-based reinforcement learning to enhance tool-calling in LLMs, resulting in Nemotron-Research-Tool-N1, a series of tool-calling reasoning models. Rather than enforcing supervision over intermediate distilled reasoning traces, Tool-N1 is trained with a binary RL reward that assesses only the format validity and functional correctness of tool invocations. This lightweight supervision allows the model to develop reasoning strategies independently, without relying on annotated trajectories. Experiments on several major benchmarks show that Tool-N1-7B/14B clearly outperform GPT-4o. We conduct a systematic study on the design of rule-based reinforcement learning strategies for training tool-calling models. Using 5,518 distilled reasoning trajectories, we compare SFT, RL, and the SFT-then-RL pipeline, finding that the widely adopted SFT-then-RL paradigm does not necessarily outperform pure RL.
Chinese: 本研究推出了Nemotron-Research-Tool-N1系列工具调用推理模型,通过基于规则的强化学习训练,仅使用评估格式有效性和功能正确性的二元奖励机制,使模型能自主发展推理策略,在多项基准测试中超越GPT-4o且无需依赖标注轨迹。
English: This research introduces Nemotron-Research-Tool-N1, a series of tool-calling reasoning models trained with rule-based reinforcement learning that uses a binary reward to assess format validity and functional correctness, enabling independent reasoning development and outperforming GPT-4o on benchmarks without relying on annotated trajectories.
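
Illustrative sketch (not from the paper): the abstract describes a binary RL reward that checks only format validity and functional correctness of tool invocations. The snippet below shows one way such a rule-based reward could look. The `<think>` / `<tool_call>` output layout, the JSON list encoding of calls, the exact-match comparison, and the name `binary_reward` are all assumptions made here for illustration, not the authors' implementation.

```python
import json
import re

# Assumed output layout: reasoning inside <think>...</think>, followed by a
# JSON list of tool calls inside <tool_call>...</tool_call>.
_PATTERN = re.compile(
    r"^<think>(.*?)</think>\s*<tool_call>(.*?)</tool_call>$", re.DOTALL
)


def _canonical(call):
    """Serialize a call with sorted keys so key order does not matter."""
    return json.dumps(call, sort_keys=True)


def binary_reward(output, reference_calls):
    """Return 1.0 only if the format is valid AND the calls match exactly."""
    match = _PATTERN.match(output.strip())
    if match is None:
        return 0.0  # format check failed
    try:
        calls = json.loads(match.group(2))
    except json.JSONDecodeError:
        return 0.0  # tool-call payload is not valid JSON
    if not isinstance(calls, list):
        return 0.0
    predicted = sorted(_canonical(c) for c in calls)
    expected = sorted(_canonical(c) for c in reference_calls)
    return 1.0 if predicted == expected else 0.0


reference = [{"name": "get_weather", "arguments": {"city": "Paris"}}]
sample = (
    "<think>The user wants the weather in Paris.</think>"
    '<tool_call>[{"name": "get_weather", "arguments": {"city": "Paris"}}]</tool_call>'
)
print(binary_reward(sample, reference))
```

Note that the reasoning text inside `<think>` is never scored in this sketch; only the structure and the final call are checked, which mirrors the abstract's point that intermediate reasoning traces are not supervised.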

Authors:John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, Saeed Mahloujifar
Title: How much do language models memorize?
Abstract:
We propose a new method for estimating how much a model knows about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point "grokking" begins, and unintended memorization decreases as models begin to generalize. We train hundreds of transformer language models ranging from $500K$ to $1.5B$ parameters and produce a series of scaling laws relating model capacity and data size to membership inference.
中文摘要:本研究提出一种区分无意记忆与泛化的新方法,通过量化模型容量发现GPT类模型每参数约存储3.6比特信息,且仅在记忆饱和后才启动泛化过程。
English Summary: This study introduces a method to quantify model capacity by distinguishing unintended memorization from generalization, revealing that GPT-style models store about 3.6 bits per parameter and begin generalizing only after reaching their memorization limit.
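
Illustrative arithmetic (not from the paper): to give the 3.6 bits-per-parameter estimate a sense of scale, the snippet below converts it into total capacity for the smallest and largest model sizes mentioned in the abstract. The helper names are assumptions, and the calculation simply applies the abstract's approximate figure as if it held uniformly across sizes.

```python
BITS_PER_PARAM = 3.6  # approximate capacity estimate quoted in the abstract


def capacity_bits(n_params):
    """Total estimated capacity in bits for a model with n_params parameters."""
    return BITS_PER_PARAM * n_params


def capacity_megabytes(n_params):
    """Same capacity in megabytes (10**6 bytes, 8 bits per byte)."""
    return capacity_bits(n_params) / 8 / 1e6


for n_params in (500_000, 1_500_000_000):
    print(n_params, capacity_bits(n_params), capacity_megabytes(n_params))
```

Under this estimate, a 500K-parameter model holds roughly 1.8 million bits (about 0.225 MB) and a 1.5B-parameter model roughly 5.4 billion bits (about 675 MB), which indicates how large a training set must be before capacity fills.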

Authors:Wenhao Ding, Sushant Veer, Yuxiao Chen, Yulong Cao, Chaowei Xiao, Marco Pavone
Title: RealDrive: Retrieval-Augmented Driving with Diffusion Models
Abstract:
Learning-based planners generate natural human-like driving behaviors by learning to reason about nuanced interactions from data, overcoming the rigid behaviors that arise from rule-based planners. Nonetheless, data-driven approaches often struggle with rare, safety-critical scenarios and offer limited controllability over the generated trajectories. To address these challenges, we propose RealDrive, a Retrieval-Augmented Generation (RAG) framework that initializes a diffusion-based planning policy by retrieving the most relevant expert demonstrations from the training dataset. By interpolating between current observations and retrieved examples through a denoising process, our approach enables fine-grained control and safe behavior across diverse scenarios, leveraging the strong prior provided by the retrieved scenario. Another key insight we produce is that a task-relevant retrieval model trained with planning-based objectives results in superior planning performance in our framework compared to a task-agnostic retriever. Experimental results demonstrate improved generalization to long-tail events and enhanced trajectory diversity compared to standard learning-based planners -- we observe a 40% reduction in collision rate on the Waymo Open Motion dataset with RAG.
中文摘要:RealDrive通过检索增强生成框架,利用专家示范初始化规划策略,在多样化场景中实现精细控制和安全驾驶,在Waymo数据集上碰撞率降低40%。
English Summary: RealDrive is a retrieval-augmented framework that enhances autonomous driving safety by initializing planning policies with relevant expert demonstrations, achieving a 40% collision reduction through improved handling of rare scenarios.

Authors:Derek Everett, Fred Lu, Edward Raff, Fernando Camacho, James Holt
Title: Quick-Draw Bandits: Quickly Optimizing in Nonstationary Environments with Extremely Many Arms
Abstract:
Canonical algorithms for multi-armed bandits typically assume a stationary reward environment where the size of the action space (number of arms) is small. More recently developed methods typically relax only one of these assumptions: existing non-stationary bandit policies are designed for a small number of arms, while Lipschitz, linear, and Gaussian process bandit policies are designed to handle a large (or infinite) number of arms in stationary reward environments under constraints on the reward function. In this manuscript, we propose a novel policy to learn reward environments over a continuous space using Gaussian interpolation. We show that our method efficiently learns continuous Lipschitz reward functions with $\mathcal{O}^*(\sqrt{T})$ cumulative regret. Furthermore, our method naturally extends to non-stationary problems with a simple modification. We finally demonstrate that our method is computationally favorable (100-10000x faster) and experimentally outperforms sliding Gaussian process policies on datasets with non-stationarity and an extremely large number of arms.
中文: 该方法能高效学习连续Lipschitz奖励函数并达到最优遗憾度,可扩展至非平稳问题,在大规模赌博机问题中显著超越现有方法的速度和性能表现。
English: The proposed method efficiently learns continuous Lipschitz reward functions with $\mathcal{O}^*(\sqrt{T})$ cumulative regret, extends to non-stationary problems with a simple modification, and runs 100-10000x faster than sliding Gaussian process policies while outperforming them on non-stationary problems with an extremely large number of arms.

Authors:Xin Quan, Marco Valentino, Louise A. Dennis, André Freitas
Title: Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations
Abstract:
Natural language explanations play a fundamental role in Natural Language Inference (NLI) by revealing how premises logically entail hypotheses. Recent work has shown that the interaction of large language models (LLMs) with theorem provers (TPs) can help verify and improve the validity of NLI explanations. However, TPs require translating natural language into machine-verifiable formal representations, a process that introduces the risk of semantic information loss and unfaithful interpretation, an issue compounded by LLMs' challenges in capturing critical logical structures with sufficient precision. Moreover, LLMs are still limited in their capacity for rigorous and robust proof construction within formal verification frameworks. To mitigate issues related to faithfulness and robustness, this paper investigates strategies to (1) alleviate semantic loss during autoformalisation, (2) efficiently identify and correct syntactic errors in logical representations, (3) explicitly use logical expressions to guide LLMs in generating structured proof sketches, and (4) increase LLMs' capacity to interpret TP feedback for iterative refinement. Our empirical results on e-SNLI, QASC and WorldTree using different LLMs demonstrate that the proposed strategies yield significant improvements in autoformalisation (+18.46%, +34.2%, +39.77%) and explanation refinement (+29.5%, +51.5%, +41.25%) over the state-of-the-art model. Moreover, we show that specific interventions on the hybrid LLM-TP architecture can substantially improve efficiency, drastically reducing the number of iterations required for successful verification.
Chinese: 本文提出策略以改进自然语言推理中解释的忠实性和鲁棒性,通过优化自动形式化、错误修正、证明草图生成及反馈解析,在多个数据集上显著提升了性能与效率。
English: This paper proposes strategies to enhance the faithfulness and robustness of natural language explanations in Natural Language Inference by improving autoformalisation, error correction, proof sketch generation, and feedback interpretation, achieving significant gains in performance and efficiency across multiple datasets.

Authors:Chen Huang, Skyler Seto, Hadi Pouransari, Mehrdad Farajtabar, Raviteja Vemulapalli, Fartash Faghri, Oncel Tuzel, Barry-John Theobald, Josh Susskind
Title: Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
Abstract:
Vision foundation models pre-trained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to the issue of concept forgetting on other tasks. Recent methods of robust fine-tuning aim to mitigate forgetting of prior knowledge without affecting the fine-tuning performance. Knowledge is often preserved by matching the original and fine-tuned model weights or feature pairs. However, such point-wise matching can be too strong, without explicit awareness of the feature neighborhood structures that encode rich knowledge as well. We propose a novel regularization method Proxy-FDA that explicitly preserves the structural knowledge in feature space. Proxy-FDA performs Feature Distribution Alignment (using nearest neighbor graphs) between the pre-trained and fine-tuned feature spaces, and the alignment is further improved by informative proxies that are generated dynamically to increase data diversity. Experiments show that Proxy-FDA significantly reduces concept forgetting during fine-tuning, and we find a strong correlation between forgetting and a distributional distance metric (in comparison to L2 distance). We further demonstrate Proxy-FDA's benefits in various fine-tuning settings (end-to-end, few-shot and continual tuning) and across different tasks like image classification, captioning and VQA.
Chinese: 提出的Proxy-FDA方法通过动态代理和邻域图对齐预训练与微调模型间的特征分布结构,有效减少了微调过程中的概念遗忘问题,在多种任务和场景中均优于逐点匹配方法。
English: The proposed Proxy-FDA method reduces concept forgetting during fine-tuning by aligning feature distribution structures between pre-trained and fine-tuned models using dynamic proxies and neighborhood graphs, outperforming point-wise matching approaches across multiple tasks and settings.
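
The contrast the abstract draws between point-wise matching and neighborhood structure can be made concrete with a toy measure. The sketch below is not the Proxy-FDA loss: it is a simple k-nearest-neighbor overlap score, and the helper names `knn_indices` and `neighbourhood_overlap` are hypothetical, chosen only for illustration.

```python
import math


def knn_indices(features, k):
    """For each sample, return the index set of its k nearest neighbours
    (Euclidean distance, excluding the sample itself)."""
    neighbours = []
    for i, fi in enumerate(features):
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        neighbours.append({j for _, j in dists[:k]})
    return neighbours


def neighbourhood_overlap(feats_pre, feats_ft, k):
    """Fraction of k-NN edges shared by the two feature spaces (1.0 = identical graphs)."""
    nn_pre = knn_indices(feats_pre, k)
    nn_ft = knn_indices(feats_ft, k)
    shared = sum(len(a & b) for a, b in zip(nn_pre, nn_ft))
    return shared / (k * len(feats_pre))


# Hypothetical toy features: the "fine-tuned" copy is the same cloud shifted by 10.
pre = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
shifted = [(x + 10.0, y) for x, y in pre]
print(neighbourhood_overlap(pre, shifted, k=1))
```

In this toy case every feature moves by a distance of 10, so a point-wise L2 penalty would be large, yet the neighbor graph is unchanged. That is the kind of structural agreement a distribution-level alignment is meant to capture.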

Authors:Wendong Xu, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang, Hui Shen, Zhongwei Wan, Jianbo Dai, Taiqiang Wu, He Xiao, Chaofan Tao, Z. Morley Mao, Ying Sheng, Zhijiang Guo, Hongxia Yang, Bei Yu, Lingpeng Kong, Quanquan Gu, Ngai Wong
Title: SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving
Abstract:
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, supporting multiple programming languages (C++, Python, Rust, and Go). This enables the framework to scale across diverse tasks and contexts while respecting token limitations. Our experiments, using over 400 high-quality real-world GitHub issues selected from a pool of 2,300 issues, show that models like GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. SwingArena presents a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings. More details are available on our project page: swing-bench.github.io
中文: SwingArena是一个竞争性评估框架,通过将大语言模型配对为补丁提交者和审查者来模拟真实软件开发流程,并采用检索增强模块处理多语言的长上下文代码生成挑战。
English: SwingArena is a competitive evaluation framework that simulates real-world software development by pairing LLMs as patch submitters and reviewers, using a retrieval-augmented module to handle long-context code generation across multiple languages.

Authors:Wenhan Dong, Tianyi Hu, Jingyi Zheng, Zhen Sun, Yuemeng Zhao, Yule Liu, Xinlei He, Xinyi Huang
Title: Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks
Abstract:
Multi-round incomplete information tasks are crucial for evaluating the lateral thinking capabilities of large language models (LLMs). Currently, research primarily relies on multiple benchmarks and automated evaluation metrics to assess these abilities. However, our study reveals novel insights into the limitations of existing methods, as they often yield misleading results that fail to uncover key issues, such as shortcut-taking behaviors, rigid patterns, and premature task termination. These issues obscure the true reasoning capabilities of LLMs and undermine the reliability of evaluations. To address these limitations, we propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.
中文摘要:本研究揭示了现有评估大语言模型横向思维能力方法的局限性,指出其存在走捷径、模式僵化等问题,并提出了包含推理路径检查、多样化评估指标等改进方案。
English Summary: This study identifies limitations in current methods for evaluating large language models' lateral thinking, revealing issues like shortcut-taking and premature task termination, and proposes refined evaluation standards including reasoning path inspection and human performance comparison.

Authors:Lata Pangtey, Mohammad Zia Ur Rehman, Prasad Chaudhari, Shubhi Bansal, Nagendra Kumar
Title: Emotion-aware Dual Cross-Attentive Neural Network with Label Fusion for Stance Detection in Misinformative Social Media Content
Abstract:
The rapid evolution of social media has generated an overwhelming volume of user-generated content, conveying implicit opinions and contributing to the spread of misinformation. Stance detection has emerged as a crucial approach for analyzing underlying biases in shared information and combating misinformation. This paper proposes a novel method for Stance Prediction through a Label-fused dual cross-Attentive Emotion-aware neural Network (SPLAENet) in misinformative social media user-generated content, with the aim of enhancing stance detection where misinformation can polarize user opinions. The proposed method employs a dual cross-attention mechanism and a hierarchical attention network to capture inter- and intra-relationships by focusing on the relevant parts of the source text in the context of the reply text and vice versa. We incorporate emotions to effectively distinguish between different stance categories by leveraging the emotional alignment or divergence between the texts. We also employ label fusion that uses distance-metric learning to align extracted features with stance labels, improving the method's ability to accurately distinguish between stances. Extensive experiments demonstrate the significant improvements achieved by SPLAENet over existing state-of-the-art methods. SPLAENet demonstrates an average gain of 8.92% in accuracy and 17.36% in F1-score on the RumourEval dataset. On the SemEval dataset, it achieves average gains of 7.02% in accuracy and 10.92% in F1-score. On the P-stance dataset, it demonstrates average gains of 10.03% in accuracy and 11.18% in F1-score. These results validate the effectiveness of the proposed method for stance detection in the context of misinformative social media content.
中文摘要:本文提出SPLAENet新型神经网络,通过双重交叉注意力和情感感知机制,在虚假信息相关的社交媒体内容中显著提升了立场检测的准确率。
English Summary: This paper introduces SPLAENet, a novel neural network that uses dual cross-attention and emotion-aware mechanisms to significantly improve stance detection accuracy in misinformation-related social media content.

Authors:Xiangcheng Zhang, Haowei Lin, Haotian Ye, James Zou, Jianzhu Ma, Yitao Liang, Yilun Du
Title: Inference-time Scaling of Diffusion Models through Classical Search
Abstract:
Classical search algorithms have long underpinned modern artificial intelligence. In this work, we tackle the challenge of inference-time control in diffusion models -- adapting generated outputs to meet diverse test-time objectives -- using principles from classical search. We propose a general framework that orchestrates local and global search to efficiently navigate the generative space. It employs a theoretically grounded local search via annealed Langevin MCMC and performs compute-efficient global exploration using breadth-first and depth-first tree search. We evaluate our approach on a range of challenging domains, including planning, offline reinforcement learning, and image generation. Across all tasks, we observe significant gains in both performance and efficiency. These results show that classical search provides a principled and practical foundation for inference-time scaling in diffusion models. Project page at diffusion-inference-scaling.github.io.
中文摘要:本研究提出了一种结合局部与全局搜索的通用框架,借鉴经典搜索原理来增强扩散模型的推理时控制,在多项任务中实现了性能与效率的显著提升。
English Summary: This study introduces a general framework that integrates local and global search methods, drawing from classical search principles to enhance inference-time control in diffusion models, achieving notable improvements in performance and efficiency across various tasks.

Authors:Shengyuan Liu, Boyun Zheng, Wenting Chen, Zhihao Peng, Zhenfei Yin, Jing Shao, Jiancong Hu, Yixuan Yuan
Title: EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis
Abstract:
Endoscopic procedures are essential for diagnosing and treating internal diseases, and multi-modal large language models (MLLMs) are increasingly applied to assist in endoscopy analysis. However, current benchmarks are limited, as they typically cover specific endoscopic scenarios and a small set of clinical tasks, failing to capture the real-world diversity of endoscopic scenarios and the full range of skills needed in clinical workflows. To address these issues, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capacities. EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets. Our multi-dimensional evaluation framework mirrors the clinical workflow--spanning anatomical recognition, lesion analysis, spatial localization, and surgical operations--to holistically gauge the perceptual and diagnostic abilities of MLLMs in realistic scenarios. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs, and establish human clinician performance as a reference standard. Our extensive experiments reveal: (1) proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts; (2) medical-domain supervised fine-tuning substantially boosts task-specific accuracy; and (3) model performance remains sensitive to prompt format and clinical task complexity. EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning. We publicly release our benchmark and code.
中文: EndoBench作为首个全面评估多模态大语言模型在多样化内窥镜场景和临床任务中的基准,显示专有模型表现最佳但仍不及人类专家水平。
English: EndoBench is introduced as the first comprehensive benchmark to evaluate multi-modal large language models across diverse endoscopic scenarios and clinical tasks, revealing that proprietary models lead but still fall short of human expertise.

Authors:Shuzhou Sun, Li Liu, Tianpeng Liu, Shuaifeng Zhi, Ming-Ming Cheng, Janne Heikkilä, Yongxiang Liu
Title: A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation
Abstract:
Existing two-stage Scene Graph Generation (SGG) frameworks typically incorporate a detector to extract relationship features and a classifier to categorize these relationships; therefore, the training paradigm follows a causal chain structure, where the detector's inputs determine the classifier's inputs, which in turn influence the final predictions. However, such a causal chain structure can yield spurious correlations between the detector's inputs and the final predictions, i.e., the prediction of a certain relationship may be influenced by other relationships. This influence can induce at least two observable biases: tail relationships are predicted as head ones, and foreground relationships are predicted as background ones; notably, the latter bias is seldom discussed in the literature. To address this issue, we propose reconstructing the causal chain structure into a reverse causal structure, wherein the classifier's inputs are treated as the confounder, and both the detector's inputs and the final predictions are viewed as causal variables. Specifically, we term the reconstructed causal paradigm the Reverse causal Framework for SGG (RcSGG). RcSGG initially employs the proposed Active Reverse Estimation (ARE) to intervene on the confounder to estimate the reverse causality, i.e., the causality from final predictions to the classifier's inputs. Then, Maximum Information Sampling (MIS) is proposed to further enhance the reverse causality estimation by considering the relationship information. Theoretically, RcSGG can mitigate the spurious correlations inherent in the SGG framework, subsequently eliminating the induced biases. Comprehensive experiments on popular benchmarks and diverse SGG frameworks show that RcSGG achieves a state-of-the-art mean recall rate.
中文:现有两阶段场景图生成方法因其因果链结构产生虚假关联,导致如将尾部关系误判为头部等偏差,而提出的反向因果框架(RcSGG)通过重构因果关系、采用主动反向估计和最大信息采样来解决此问题,并取得了领先的平均召回率。
English: Current two-stage Scene Graph Generation methods suffer from spurious correlations due to their causal chain structure, leading to biases like misclassifying tail relationships as head ones, which the proposed Reverse causal Framework (RcSGG) addresses by restructuring causality and employing Active Reverse Estimation and Maximum Information Sampling to achieve state-of-the-art results.

Authors:Zhihao Wang, Wenke Huang, Tian Chen, Zekun Shi, Guancheng Wan, Yu Qiao, Bin Yang, Jian Wang, Bing Li, Mang Ye
Title: An Empirical Study of Federated Prompt Learning for Vision Language Model
Abstract:
The Vision Language Model (VLM) excels in aligning vision and language representations, and prompt learning has emerged as a key technique for adapting such models to downstream tasks. However, the application of prompt learning with VLM in federated learning (FL) scenarios remains underexplored. This paper systematically investigates the behavioral differences between language prompt learning (LPT) and vision prompt learning (VPT) under data heterogeneity challenges, including label skew and domain shift. We conduct extensive experiments to evaluate the impact of various FL and prompt configurations, such as client scale, aggregation strategies, and prompt length, to assess the robustness of Federated Prompt Learning (FPL). Furthermore, we explore strategies for enhancing prompt learning in complex scenarios where label skew and domain shift coexist, including leveraging both prompt types when computational resources allow. Our findings offer practical insights into optimizing prompt learning in federated settings, contributing to the broader deployment of VLMs in privacy-preserving environments.
中文摘要:本文系统研究了联邦学习中语言提示学习与视觉提示学习在数据异质性下的行为差异,评估了其鲁棒性并提出了复杂场景下的增强策略,以促进视觉语言模型在隐私保护环境中的部署。
English Summary: This paper systematically explores the behavioral differences between language and vision prompt learning under data heterogeneity in federated learning, evaluating their robustness and proposing enhancement strategies for complex scenarios to facilitate VLM deployment in privacy-preserving environments.

Authors:Yifan Xie, Mingyang Li, Shoujie Li, Xingting Li, Guangyu Chen, Fei Ma, Fei Richard Yu, Wenbo Ding
Title: Universal Visuo-Tactile Video Understanding for Embodied Interaction
Abstract:
Tactile perception is essential for embodied agents to understand physical attributes of objects that cannot be determined through visual inspection alone. While existing approaches have made progress in visual and language modalities for physical understanding, they fail to effectively incorporate tactile information that provides crucial haptic feedback for real-world interaction. In this paper, we present VTV-LLM, the first multi-modal large language model for universal Visuo-Tactile Video (VTV) understanding that bridges the gap between tactile perception and natural language. To address the challenges of cross-sensor and cross-modal integration, we contribute VTV150K, a comprehensive dataset comprising 150,000 video frames from 100 diverse objects captured across three different tactile sensors (GelSight Mini, DIGIT, and Tac3D), annotated with four fundamental tactile attributes (hardness, protrusion, elasticity, and friction). We develop a novel three-stage training paradigm that includes VTV enhancement for robust visuo-tactile representation, VTV-text alignment for cross-modal correspondence, and text prompt finetuning for natural language generation. Our framework enables sophisticated tactile reasoning capabilities, including feature assessment, comparative analysis, and scenario-based decision making. Experimental evaluations demonstrate that VTV-LLM achieves superior performance in tactile video understanding tasks, establishing a foundation for more intuitive human-machine interaction in tactile domains.
中文: VTV-LLM是首个融合视觉与触觉视频的多模态大语言模型,通过创新的三阶段训练范式和包含15万帧数据的VTV150K数据集,实现了高级触觉推理能力,为物理交互建立了直观的人机交互基础。
English: VTV-LLM is the first multi-modal large language model that integrates visual and tactile video data to enable advanced tactile reasoning and natural language understanding, addressing the gap in physical object interaction through a novel training framework and comprehensive dataset.

Authors:Tim Engelbracht, Petar Lukovic, Tjark Behrens, Kai Lascheit, René Zurbrügg, Marc Pollefeys, Hermann Blum, Zuria Bauer
Title: Spot-On: A Mixed Reality Interface for Multi-Robot Cooperation
Abstract:
Recent progress in mixed reality (MR) and robotics is enabling increasingly sophisticated forms of human-robot collaboration. Building on these developments, we introduce a novel MR framework that allows multiple quadruped robots to operate in semantically diverse environments via an MR interface. Our system supports collaborative tasks involving drawers, swing doors, and higher-level infrastructure such as light switches. A comprehensive user study verifies both the design and usability of our app, with participants giving a "good" or "very good" rating in almost all cases. Overall, our approach provides an effective and intuitive framework for MR-based multi-robot collaboration in complex, real-world scenarios.
中文: 本文提出了一种创新的混合现实框架,使多个四足机器人能够在多样化环境中通过混合现实界面协作完成任务,用户研究验证了其良好的可用性。
English: This paper presents a novel mixed reality framework enabling multiple quadruped robots to collaboratively perform tasks in diverse environments, validated by a user study showing high usability ratings.

Authors:Mónika Farsang, Ramin Hasani, Daniela Rus, Radu Grosu
Title: Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling
Abstract:
We present LrcSSM, a non-linear recurrent model that processes long sequences as fast as today's linear state-space layers. By forcing the Jacobian matrix to be diagonal, the full sequence can be solved in parallel, giving $\mathcal{O}(TD)$ time and memory and only $\mathcal{O}(\log T)$ sequential depth, for input-sequence length $T$ and a state dimension $D$. Moreover, LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 and Mamba do not provide. Importantly, the diagonal Jacobian structure of our model results in no performance loss compared to the original model with dense Jacobian, and the approach can be generalized to other non-linear recurrent models, demonstrating broader applicability. On a suite of long-range forecasting tasks, we demonstrate that LrcSSM outperforms Transformers, LRU, S5, and Mamba.
Chinese: LrcSSM 是一种非线性循环模型,能够并行处理长序列,具备计算效率和梯度稳定性保证,在长程预测任务中表现优于 Transformer 和 Mamba 等现有模型。
English: LrcSSM is a non-linear recurrent model that enables parallel processing of long sequences with computational efficiency and a formal gradient-stability guarantee, outperforming existing models like Transformers and Mamba in long-range forecasting tasks.
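
The claim that a diagonal Jacobian allows the whole sequence to be solved with only logarithmic sequential depth can be illustrated with an associative scan. The sketch below is not the LrcSSM implementation: it assumes a single-channel elementwise recurrence h_t = a_t * h_{t-1} + b_t (the form each state dimension takes once the Jacobian is diagonal), and the function names `combine`, `scan_recurrence` and `sequential_recurrence` are hypothetical.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, where `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def scan_recurrence(a, b, h0=0.0):
    """Solve h_t = a[t]*h_{t-1} + b[t] with a Hillis-Steele inclusive scan.

    Every position inside one pass is independent of the others, so a pass
    could run in parallel; only the passes themselves are sequential.
    """
    elems = list(zip(a, b))
    n = len(elems)
    step, passes = 1, 0
    while step < n:
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(n)
        ]
        step *= 2
        passes += 1
    return [ai * h0 + bi for ai, bi in elems], passes


def sequential_recurrence(a, b, h0=0.0):
    """Reference solution: the usual step-by-step loop with T sequential steps."""
    out, h = [], h0
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out


a = [0.5, 2.0, -1.0, 0.25, 3.0]
b = [1.0, 0.0, 2.0, -1.0, 0.5]
states, passes = scan_recurrence(a, b, h0=1.0)
print(states, passes)
```

For T = 5 the scan needs 3 passes, against 5 steps for the loop. This simple variant performs O(T log T) total work; a work-efficient scan keeps the logarithmic depth with O(T) work per channel, which is consistent with the O(TD) cost quoted in the abstract.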

Authors:Sungwon Kim, Namkyeong Lee, Yunyoung Doh, Seungmin Shin, Guimok Cho, Seung-Won Jeon, Sangkook Kim, Chanyoung Park
Title: Thickness-aware E(3)-Equivariant 3D Mesh Neural Networks
Abstract:
Mesh-based 3D static analysis methods have recently emerged as efficient alternatives to traditional computational numerical solvers, significantly reducing computational costs and runtime for various physics-based analyses. However, these methods primarily focus on surface topology and geometry, often overlooking the inherent thickness of real-world 3D objects, which exhibits high correlations and similar behavior between opposing surfaces. This limitation arises from the disconnected nature of these surfaces and the absence of internal edge connections within the mesh. In this work, we propose a novel framework, the Thickness-aware E(3)-Equivariant 3D Mesh Neural Network (T-EMNN), that effectively integrates the thickness of 3D objects while maintaining the computational efficiency of surface meshes. Additionally, we introduce data-driven coordinates that encode spatial information while preserving E(3)-equivariance or invariance properties, ensuring consistent and robust analysis. Evaluations on a real-world industrial dataset demonstrate the superior performance of T-EMNN in accurately predicting node-level 3D deformations, effectively capturing thickness effects while maintaining computational efficiency.
中文: 本研究提出的厚度感知E(3)等变三维网格神经网络(T-EMNN)通过整合物体厚度信息和数据驱动坐标,有效解决了传统方法忽略实体厚度的局限,在保持计算效率的同时显著提升了三维形变预测的准确性。
English: The proposed Thickness-aware E(3)-Equivariant 3D Mesh Neural Network (T-EMNN) overcomes traditional methods' limitation of ignoring object thickness by incorporating thickness information and data-driven coordinates, achieving superior accuracy in predicting 3D deformations while maintaining computational efficiency.

Authors:Fuhai Wang, Zhe Li, Rujing Xiong, Tiebin Mi, Robert Caiming Qiu
Title: WiCAL: Accurate Wi-Fi-Based 3D Localization Enabled by Collaborative Antenna Arrays
Abstract:
Accurate 3D localization is essential for realizing advanced sensing functionalities in next-generation Wi-Fi communication systems. This study investigates the potential of multistatic localization in Wi-Fi networks through the deployment of multiple cooperative antenna arrays. The collaborative gain offered by these arrays is twofold: (i) intra-array coherent gain at the wavelength scale among antenna elements, and (ii) inter-array cooperative gain across arrays. To evaluate the feasibility and performance of this approach, we develop WiCAL (Wi-Fi Collaborative Antenna Localization), a system built upon commercial Wi-Fi infrastructure equipped with uniform rectangular arrays (URAs). These arrays are driven by multiplexing embedded radio frequency (RF) chains available in standard access points or user devices, thereby eliminating the need for sophisticated, costly, and power-hungry multi-transceiver modules typically required in multiple-input and multiple-output systems. To address phase offsets introduced by RF chain multiplexing, we propose a three-stage, fine-grained phase alignment scheme to synchronize signals across antenna elements within each array. A bidirectional spatial smoothing MUSIC algorithm is employed to estimate angles of arrival (AoAs) and mitigate performance degradation caused by correlated interference. To further exploit inter-array cooperative gain, we elaborate on the synchronization mechanism among distributed URAs, which enables direct position determination by bypassing intermediate angle estimation. Once synchronized, the distributed URAs effectively form a virtual large-scale array, significantly enhancing spatial resolution and localization accuracy.
中文: 本研究提出WiCAL系统,利用商用Wi-Fi的多天线阵列,通过阵列内相干增益和阵列间协同增益实现精确3D定位,同时解决相位同步和干扰问题以提升定位精度。
English: This study introduces WiCAL, a system leveraging commercial Wi-Fi with multiple antenna arrays to achieve precise 3D localization through intra-array coherent gain and inter-array cooperative gain, while addressing phase synchronization and interference issues for enhanced accuracy.

Authors:Ya-Ting Yang, Quanyan Zhu
Title: PACT: A Contract-Theoretic Framework for Pricing Agentic AI Services Powered by Large Language Models
Abstract:
Agentic AI, often powered by large language models (LLMs), is increasingly adopted to support autonomous reasoning, decision-making, and task execution across various domains. While agentic AI holds great promise, its deployment as an easily accessible service raises critical challenges in pricing, due to high infrastructure and computation costs, multi-dimensional and task-dependent Quality of Service (QoS), and growing concerns around liability in high-stakes applications. In this work, we propose PACT, a Pricing framework for cloud-based Agentic AI services through a Contract-Theoretic approach, which models QoS along both objective (e.g., response time) and subjective (e.g., user satisfaction) dimensions. PACT accounts for computational, infrastructure, and potential liability costs for the service provider, while ensuring incentive compatibility and individual rationality for the user under information asymmetry. Through contract-based selection, users receive tailored service offerings aligned with their needs. Numerical evaluations demonstrate that PACT improves QoS alignment between users and providers and offers a scalable, liability-aware approach to pricing agentic AI services in the future.
中文: 本文提出PACT这一基于契约理论的云端智能体AI服务定价框架,通过主客观服务质量维度建模,在信息不对称条件下统筹供应商成本与用户激励,实现精准服务匹配。
English: This paper introduces PACT, a contract-theoretic pricing framework for cloud-based agentic AI services that models both objective and subjective QoS dimensions while balancing provider costs and user incentives under information asymmetry.
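
The two constraints named in the abstract, incentive compatibility and individual rationality, can be checked mechanically for a small contract menu. The sketch below does not reproduce PACT's model: it assumes a hypothetical quasi-linear user utility `theta * quality - price` with a scalar user type `theta`, and the names `utility` and `check_menu` are illustrative only.

```python
def utility(theta, contract):
    """Assumed quasi-linear utility: a user of type `theta` values quality linearly."""
    quality, price = contract
    return theta * quality - price


def check_menu(types, menu, tol=1e-9):
    """Return (individually_rational, incentive_compatible) for a contract menu.

    menu[i] is the (quality, price) pair intended for the user of type types[i].
    """
    # Individual rationality: each type gets non-negative utility from its own contract.
    ir = all(utility(t, menu[i]) >= -tol for i, t in enumerate(types))
    # Incentive compatibility: no type prefers a contract designed for another type.
    ic = all(
        utility(t, menu[i]) >= utility(t, menu[j]) - tol
        for i, t in enumerate(types)
        for j in range(len(menu))
    )
    return ir, ic


types = [1.0, 2.0]                      # hypothetical low and high user types
good_menu = [(1.0, 1.0), (2.0, 2.5)]    # (quality, price) per type
bad_menu = [(1.0, 1.0), (2.0, 3.5)]     # premium tier priced too high

print(check_menu(types, good_menu))  # (True, True)
print(check_menu(types, bad_menu))   # (True, False)
```

In the second menu the high type still gets positive utility from the premium contract (0.5) but would do better by picking the basic one (1.0), so the menu fails to separate the types even though participation is satisfied.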

Authors:Ya-Ting Yang, Quanyan Zhu
Title: When to Deceive: A Cross-Layer Stackelberg Game Framework for Strategic Timing of Cyber Deception
Abstract:
Cyber deception is an emerging proactive defense strategy to counter increasingly sophisticated attacks such as Advanced Persistent Threats (APTs) by misleading and distracting attackers from critical assets. However, since deception techniques incur costs and may lose effectiveness over time, defenders must strategically time and select them to adapt to the dynamic system and the attacker's responses. In this study, we propose a Stackelberg game-based framework to design strategic timing for cyber deception: the lower tactical layer (follower) captures the evolving attacker-defender dynamics under a given deception through a one-sided information Markov game, while the upper strategic layer (leader) employs a stopping-time decision process to optimize the timing and selection of deception techniques. We also introduce a computational algorithm that integrates dynamic programming and belief-state updates to account for the attacker's adaptive behavior and limited deception resources. Numerical experiments validate the framework, showing that strategically timed deceptions can enhance the defender's expected utility and reduce the risk of asset compromise compared to baseline strategies.
中文摘要:本研究提出基于Stackelberg博弈的框架,通过双层模型和计算算法来优化网络欺骗技术的实施时机,数值实验证明该策略能有效提升防御效用并降低资产泄露风险。
English Summary: This study introduces a Stackelberg game framework that strategically times cyber deception techniques through a two-layer model and computational algorithm, demonstrating enhanced defense effectiveness against adaptive attackers in numerical experiments.

Authors:Peijie Wang, Chao Yang, Zhong-Zhi Li, Fei Yin, Dekang Ran, Mi Tian, Zhilong Ji, Jinfeng Bai, Cheng-Lin Liu
Title: SOLIDGEO: Measuring Multimodal Spatial Math Reasoning in Solid Geometry
Abstract:
Geometry is a fundamental branch of mathematics and plays a crucial role in evaluating the reasoning capabilities of multimodal large language models (MLLMs). However, existing multimodal mathematics benchmarks mainly focus on plane geometry and largely ignore solid geometry, which requires spatial reasoning and is more challenging than plane geometry. To address this critical gap, we introduce SolidGeo, the first large-scale benchmark specifically designed to evaluate the performance of MLLMs on mathematical reasoning tasks in solid geometry. SolidGeo consists of 3,113 real-world K-12 and competition-level problems, each paired with visual context and annotated with difficulty levels and fine-grained solid geometry categories. Our benchmark covers a wide range of 3D reasoning subjects such as projection, unfolding, spatial measurement, and spatial vector, offering a rigorous testbed for assessing solid geometry. Through extensive experiments, we observe that MLLMs encounter substantial challenges in solid geometry math tasks, with a considerable performance gap relative to human capabilities on SolidGeo. Moreover, we analyze the performance, inference efficiency and error patterns of various models, offering insights into the solid geometric mathematical reasoning capabilities of MLLMs. We hope SolidGeo serves as a catalyst for advancing MLLMs toward deeper geometric reasoning and spatial intelligence.
中文: SolidGeo是首个专门评估多模态大语言模型在立体几何推理能力的大规模基准,包含3,113个现实问题,实验表明现有模型存在显著困难且与人类能力差距较大。
English: SolidGeo is introduced as the first large-scale benchmark for evaluating multimodal large language models' performance in solid geometry, addressing the gap in existing benchmarks by including 3,113 real-world problems and revealing significant challenges and performance gaps compared to human capabilities.

Authors:Jia Li, Jiacheng Shen, Yuxin Su, Michael R. Lyu
Title: ColorGo: Directed Concolic Execution
Abstract:
Directed fuzzing is a critical technique in cybersecurity, targeting specific sections of a program. This approach is essential in various security-related domains such as crash reproduction, patch testing, and vulnerability detection. Despite its importance, current directed fuzzing methods exhibit a trade-off between efficiency and effectiveness. For instance, directed grey-box fuzzing, while efficient in generating fuzzing inputs, lacks sufficient precision, so time is wasted executing code that cannot help reach the target site. Conversely, interpreter- or observer-based directed symbolic execution can produce high-quality inputs but incurs non-negligible runtime overhead. These limitations undermine the feasibility of directed fuzzers in real-world scenarios. To achieve both efficiency and effectiveness, in this paper we incorporate compilation-based concolic execution into directed fuzzing and present ColorGo, which achieves high scalability while preserving the high precision of symbolic execution. ColorGo is a new directed whitebox fuzzer that concretely executes the instrumented program with constraint-solving capability on generated input. It guides the exploration by incremental coloration, which includes static reachability analysis and dynamic feasibility analysis. We evaluated ColorGo on diverse real-world programs and demonstrated that ColorGo outperforms AFLGo by up to 100x in reaching target sites and reproducing target crashes.
Chinese Summary: 定向模糊测试面临效率与精度的权衡,而ColorGo通过结合基于编译的混合执行和增量着色技术,在目标可达性和崩溃复现方面实现了高达100倍的性能提升。
English Summary: Directed fuzzing faces a trade-off between efficiency and precision, but ColorGo overcomes this by integrating compilation-based concolic execution with incremental coloration, achieving up to 100x improvement in target reachability and crash reproduction.
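The static reachability analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes the control-flow graph is given as a dictionary from each basic block to its successors. The function name `blocks_reaching_target` and the example graph are hypothetical and are not taken from ColorGo. The sketch walks the reversed graph from the target to find every block that can still reach it.

```python
from collections import deque


def blocks_reaching_target(cfg, target):
    """Return the set of blocks from which `target` is reachable.

    `cfg` maps each basic block to the list of its successor blocks
    (a hypothetical representation chosen for this sketch).
    """
    # Build the reversed graph so we can walk backwards from the target.
    reverse = {}
    for src, succs in cfg.items():
        reverse.setdefault(src, [])
        for dst in succs:
            reverse.setdefault(dst, []).append(src)

    # Breadth-first search over predecessors, starting at the target.
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for pred in reverse.get(node, []):
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return seen


# Hypothetical CFG: the block "dead" can never reach "target".
cfg = {
    "entry": ["a", "dead"],
    "a": ["target"],
    "dead": ["exit"],
    "target": ["exit"],
    "exit": [],
}
reachable = blocks_reaching_target(cfg, "target")
```

A directed fuzzer could use such a set to avoid spending time on blocks like `dead`; how ColorGo combines this with dynamic feasibility analysis is not specified in the abstract.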

Authors:Junjue Wang, Weihao Xuan, Heli Qi, Zhihao Liu, Kunyi Liu, Yuhan Wu, Hongruixuan Chen, Jian Song, Junshi Xia, Zhuo Zheng, Naoto Yokoya
Title: DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response
Abstract:
Large vision-language models (VLMs) have made great achievements in Earth vision. However, complex disaster scenes with diverse disaster types, geographic regions, and satellite sensors have posed new challenges for VLM applications. To fill this gap, we curate a remote sensing vision-language dataset (DisasterM3) for global-scale disaster assessment and response. DisasterM3 includes 26,988 bi-temporal satellite images and 123k instruction pairs across 5 continents, with three characteristics: 1) Multi-hazard: DisasterM3 involves 36 historical disaster events with significant impacts, which are categorized into 10 common natural and man-made disasters. 2) Multi-sensor: Extreme weather during disasters often hinders optical sensor imaging, making it necessary to combine Synthetic Aperture Radar (SAR) imagery for post-disaster scenes. 3) Multi-task: Based on real-world scenarios, DisasterM3 includes 9 disaster-related visual perception and reasoning tasks that exercise the full range of VLMs' reasoning ability, progressing from disaster-bearing body recognition to structural damage assessment and object relational reasoning, and culminating in the generation of long-form disaster reports. We extensively evaluated 14 generic and remote sensing VLMs on our benchmark, revealing that state-of-the-art models struggle with the disaster tasks, largely due to the lack of a disaster-specific corpus, cross-sensor gap, and damage object counting insensitivity. Focusing on these issues, we fine-tune four VLMs using our dataset and achieve stable improvements across all tasks, with robust cross-sensor and cross-disaster generalization capabilities.
中文摘要:研究人员开发了DisasterM3遥感数据集,通过多灾种、多传感器和多任务特性解决视觉语言模型在复杂灾害场景中的不足,经微调后显著提升了模型性能。
English Summary: Researchers developed DisasterM3, a comprehensive remote sensing dataset addressing VLM limitations in complex disaster scenarios through multi-hazard, multi-sensor, and multi-task features, which significantly improved model performance when fine-tuned.

Authors:Keheliya Gallaba, Ali Arabat, Dayi Lin, Mohammed Sayagh, Ahmed E. Hassan
Title: Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement
Abstract:
Foundation Models (FMs) have shown remarkable capabilities in various natural language tasks. However, their ability to accurately capture stakeholder requirements remains a significant challenge for using FMs for software development. This paper introduces a novel approach that leverages an FM-powered multi-agent system called AlignMind to address this issue. By having a cognitive architecture that enhances FMs with Theory-of-Mind capabilities, our approach considers the mental states and perspectives of software makers. This allows our solution to iteratively clarify the beliefs, desires, and intentions of stakeholders, translating these into a set of refined requirements and a corresponding actionable natural language workflow in the often-overlooked requirements refinement phase of software engineering, which is crucial after initial elicitation. Through a multifaceted evaluation covering 150 diverse use cases, we demonstrate that our approach can accurately capture the intents and requirements of stakeholders, articulating them as both specifications and a step-by-step plan of action. Our findings suggest that the potential for significant improvements in the software development process justifies these investments. Our work lays the groundwork for future innovation in building intent-first development environments, where software makers can seamlessly collaborate with AIs to create software that truly meets their needs.
中文: 本文提出AlignMind,一种具备心智理论能力的多智能体系统,通过迭代澄清利益相关者需求生成可执行工作流,并经过广泛评估验证其在软件开发中精准捕捉用户意图的有效性。
English: This paper introduces AlignMind, a multi-agent system enhanced with Theory-of-Mind capabilities, which iteratively refines stakeholder requirements into actionable workflows and demonstrates through extensive evaluation its effectiveness in accurately capturing intents for software development.

Authors:Jianman Lin, Haojie Li, Chunmei Qing, Zhijing Yang, Liang Lin, Tianshui Chen
Title: Geometry-Editable and Appearance-Preserving Object Compositon
Abstract:
General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties, while simultaneously preserving its fine-grained appearance details. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. However, these highly compact embeddings encode only high-level semantic cues and inevitably discard fine-grained appearance details. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion (DGAD) model that first leverages semantic embeddings to implicitly capture the desired geometric transformations and then employs a cross-attention retrieval mechanism to align fine-grained appearance features with the geometry-edited representation, facilitating both precise geometry editing and faithful appearance preservation in object composition. Specifically, DGAD builds on CLIP/DINO-derived and reference networks to extract semantic embeddings and appearance-preserving representations, which are then seamlessly integrated into the encoding and decoding pipelines in a disentangled manner. We first integrate the semantic embeddings into pre-trained diffusion models that exhibit strong spatial reasoning capabilities to implicitly capture object geometry, thereby facilitating flexible object manipulation and ensuring effective editability. Then, we design a dense cross-attention mechanism that leverages the implicitly learned object geometry to retrieve and spatially align appearance features with their corresponding regions, ensuring faithful appearance consistency. Extensive experiments on public benchmarks demonstrate the effectiveness of the proposed DGAD framework.
中文: DGAD模型采用解耦方法,通过语义嵌入实现几何编辑,并利用交叉注意力机制保留细粒度外观细节,在公开基准测试中验证了其在物体组合任务上的有效性。
English: The DGAD model introduces a disentangled approach that uses semantic embeddings for geometry editing and a dense cross-attention mechanism to preserve fine-grained appearance details, and experiments on public benchmarks demonstrate its effectiveness for object composition.

Authors:Juan A. Rodriguez, Haotian Zhang, Abhay Puri, Aarash Feizi, Rishav Pramanik, Pascal Wichmann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli
Title: Rendering-Aware Reinforcement Learning for Vector Graphics Generation
Abstract:
Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.
中文摘要:RLRF方法通过利用渲染输出的视觉保真度反馈,在自回归视觉语言模型中采用强化学习来增强SVG生成,相比监督微调显著提升了准确性和效率。
English Summary: The RLRF method enhances SVG generation in autoregressive vision-language models by using reinforcement learning with visual fidelity feedback from rendered outputs, significantly improving accuracy and efficiency over supervised approaches.
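The idea of comparing a rendered roll-out to the original image to compute a reward can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the reward formula, so the sketch uses one minus the mean squared error, assumes the SVG has already been rasterized, and represents grayscale images as nested lists with pixel values in [0, 1]. The function name `fidelity_reward` is hypothetical.

```python
def fidelity_reward(rendered, original):
    """Return 1 - MSE between two equally sized grayscale images.

    Images are lists of rows with pixel values in [0, 1], so the
    reward also lies in [0, 1]. This formula is an assumption made
    for illustration, not the reward used by RLRF.
    """
    same_rows = len(rendered) == len(original)
    if not same_rows or any(len(r) != len(o) for r, o in zip(rendered, original)):
        raise ValueError("images must have the same shape")

    total = 0.0
    count = 0
    for r_row, o_row in zip(rendered, original):
        for r, o in zip(r_row, o_row):
            total += (r - o) ** 2
            count += 1
    return 1.0 - total / count


original = [[0.0, 1.0], [1.0, 0.0]]
perfect = fidelity_reward([[0.0, 1.0], [1.0, 0.0]], original)
inverted = fidelity_reward([[1.0, 0.0], [0.0, 1.0]], original)
```

A pixel-exact roll-out receives the maximum reward and a fully inverted one the minimum; a practical reward would likely also account for properties such as code length, which this sketch omits.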

Authors:Muxi Diao, Lele Yang, Hongbo Yin, Zhexu Wang, Yejie Wang, Daxin Tian, Kongming Liang, Zhanyu Ma
Title: DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving
Abstract:
Autonomous driving requires real-time, robust reasoning across perception, prediction, planning, and behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. Recent vision-language models (VLMs) have been applied to driving tasks, but they typically rely on isolated modules and static supervision, limiting their ability to support multi-stage decision-making. We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language question-answering problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for real-time decision-making. DriveRX achieves strong performance on a public benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. Our analysis further highlights the impact of vision encoder design and reward-guided reasoning compression. We will release the AutoDriveRL framework and the DriveRX model to support future research.
中文: AutoDriveRL提出了一种统一训练框架,将自动驾驶建模为跨四个核心任务的结构化推理过程,通过任务特定奖励训练出DriveRX视觉语言模型,该模型在行为推理上超越GPT-4o,并在复杂条件下展现出强大鲁棒性。
English: AutoDriveRL introduces a unified training framework that models autonomous driving as a structured reasoning process across four core tasks, using task-specific rewards to train DriveRX, a vision-language model that outperforms GPT-4o in behavior reasoning and demonstrates robustness in complex conditions.

Authors:Yifan Sun, Danding Wang, Qiang Sheng, Juan Cao, Jintao Li
Title: Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery
Abstract:
Concept-based explainable approaches have emerged as a promising method in explainable AI because they can interpret models in a way that aligns with human reasoning. However, their adaptation to the text domain remains limited. Most existing methods rely on predefined concept annotations and cannot discover unseen concepts, while other methods that extract concepts without supervision often produce explanations that are not intuitively comprehensible to humans, potentially diminishing user trust. These methods fall short of discovering comprehensible concepts automatically. To address this issue, we propose ECO-Concept, an intrinsically interpretable framework to discover comprehensible concepts with no concept annotations. ECO-Concept first utilizes an object-centric architecture to extract semantic concepts automatically. Then the comprehensibility of the extracted concepts is evaluated by large language models. Finally, the evaluation result guides the subsequent model fine-tuning to obtain more understandable explanations. Experiments show that our method achieves superior performance across diverse tasks. Further concept evaluations validate that the concepts learned by ECO-Concept surpass current counterparts in comprehensibility.
中文: 概念解释性AI方法在文本分析中难以自动生成人类可理解的概念,而ECO-Concept框架通过对象中心架构和大语言模型,无需预定义标注即可发现易于理解的概念,在多项任务中展现出卓越性能。
English: Concept-based explainable AI methods often struggle with automatically generating human-understandable concepts in text analysis, but the proposed ECO-Concept framework overcomes this by using object-centric architecture and large language models to discover comprehensible concepts without predefined annotations, demonstrating superior performance across various tasks.

Authors:Di Wu, Yixin Wan, Kai-Wei Chang
Title: VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
Abstract:
Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts and underrepresent structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a new paradigm for T2I retrieval that mitigates this limitation of cross-modal similarity alignment. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Across four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on average with CLIP as the retriever and by 0.121 with E5-V. For downstream question answering, VisRet increases accuracy on Visual-RAG and Visual-RAG-ME by 3.8% and 15.7% in top-1 retrieval, and by 3.9% and 11.1% in top-10 retrieval. Ablation studies show compatibility with different T2I instruction LLMs, T2I generation models, and downstream LLMs. VisRet provides a practical and principled path that energizes further advances in vision-language retrieval. Our code and the Visual-RAG-ME benchmark will be publicly released.
Chinese: VisRet提出了一种新的文本到图像检索范式,先通过文本到图像生成将文本查询投射到图像模态,再在图像模态内进行检索;在四个基准上,以CLIP为检索器时nDCG@30平均提升0.125,以E5-V为检索器时提升0.121,并提高了下游问答任务的准确率。
English: VisRet introduces a text-to-image retrieval paradigm that first converts text queries into images via T2I generation and then retrieves within the image modality, improving nDCG@30 across four benchmarks by 0.125 on average with CLIP and by 0.121 with E5-V, and raising downstream question-answering accuracy.
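The abstract reports gains in nDCG@30. A minimal sketch of how nDCG@k is commonly computed may help in reading those numbers. It assumes linear gain and assumes that the ranked list contains every relevant item, so the ideal ordering can be obtained by sorting that same list. The function names are illustrative and the paper's exact evaluation code may differ.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each retrieved image, in ranked order (made-up values).
best_case = ndcg_at_k([1, 0, 0], k=30)
relevant_at_rank_two = ndcg_at_k([0, 1, 0], k=30)
```

Because nDCG lies between 0 and 1, an absolute improvement of 0.125 is a difference on that scale rather than a relative percentage.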

Authors:Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, Xiaodan Liang
Title: Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated impressive general capabilities across a wide range of multi-modal tasks. However, the reasoning processes of LVLMs often suffer from unreliable outputs and limited interpretability. To address this, grounded visual reasoning has emerged as a promising paradigm that enforces responses anchored on salient visual evidence regions. However, existing approaches typically rely on costly supervision such as bounding box annotations, chain-of-thought rationale or external tool calls, limiting their scalability. In this work, we propose Ground-R1, a reinforcement learning framework that enables grounded visual reasoning without requiring explicit evidence or rationale annotations. Ground-R1 consists of a grounding phase that generates evidence region rollouts based on format constraints, and an answering phase that produces responses guided by both answer correctness and format adherence rewards. Extensive experiments across multiple visual reasoning benchmarks manifest that Ground-R1 achieves superior performance and exhibits emergent cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement, offering a scalable and interpretable alternative to existing approaches.
中文: Ground-R1是一个无需昂贵标注的强化学习框架,通过证据生成和答案引导两阶段过程,使大型视觉语言模型能够进行基于视觉证据的推理,在多个基准测试中展现出优越性能和涌现的认知能力。
English: Ground-R1 is a reinforcement learning framework that enables large vision-language models to perform grounded visual reasoning without costly annotations, achieving superior performance and emergent cognitive behaviors through a two-phase process of evidence generation and guided response.
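A reward that combines answer correctness with format adherence, as described in the abstract, could look roughly like the sketch below. The tag layout (`<think>` followed by `<answer>`), the exact-match comparison, and the `format_weight` parameter are all assumptions made for illustration; the abstract does not specify Ground-R1's actual format constraints or reward weights.

```python
import re

# Hypothetical response format: reasoning in <think>, result in <answer>.
RESPONSE_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def combined_reward(response, gold_answer, format_weight=0.5):
    """Return format reward plus correctness reward for one response.

    A response that violates the format earns nothing, because no
    answer can be extracted from it reliably.
    """
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    correct = predicted == gold_answer.strip().lower()
    return format_weight + (1.0 if correct else 0.0)


good = combined_reward("<think>two ears, whiskers</think><answer>Cat</answer>", "cat")
wrong = combined_reward("<think>looks canine</think><answer>dog</answer>", "cat")
malformed = combined_reward("The answer is cat.", "cat")
```

Rule-based rewards of this kind need no bounding-box or rationale annotations, which is consistent with the annotation-free setting the abstract describes.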

Authors:Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao
Title: ARM: Adaptive Reasoning Model
Abstract:
While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem -- excessive and unnecessary reasoning -- which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones -- Direct Answer, Short CoT, and Code -- as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens -- ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.
中文: 自适应推理模型(ARM)通过自适应选择高效推理格式解决了大型推理模型的“过度思考”问题,采用Ada-GRPO训练方法在保持性能的同时实现了最高70%的令牌缩减和训练速度翻倍。
English: The Adaptive Reasoning Model (ARM) is proposed to address the overthinking problem in large reasoning models by adaptively selecting efficient reasoning formats, achieving up to 70% token reduction and doubled training speed while maintaining performance through the Ada-GRPO training method.
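The Consensus-Guided Mode described in the abstract can be sketched in a few lines. The sketch assumes each reasoning format is exposed as a callable that maps a question to an answer, and it treats "agreement" as all three efficient formats returning the same answer. Both choices, along with the function and parameter names, are assumptions for illustration rather than ARM's implementation.

```python
def consensus_guided_answer(question, efficient_solvers, long_cot_solver):
    """Answer with the efficient formats, falling back to Long CoT.

    `efficient_solvers` stands in for Direct Answer, Short CoT, and
    Code; `long_cot_solver` stands in for the Long CoT format.
    Returns the answer and which path produced it.
    """
    answers = [solve(question) for solve in efficient_solvers]
    # Unanimous agreement: keep the cheap answer.
    if all(answer == answers[0] for answer in answers):
        return answers[0], "consensus"
    # Any disagreement: spend more tokens on the elaborate format.
    return long_cot_solver(question), "long_cot"


# Stand-in solvers for demonstration only.
agreeing = [lambda q: "4", lambda q: "4", lambda q: "4"]
disagreeing = [lambda q: "4", lambda q: "5", lambda q: "4"]
long_cot = lambda q: "4 (after long reasoning)"

easy = consensus_guided_answer("2 + 2 = ?", agreeing, long_cot)
hard = consensus_guided_answer("2 + 2 = ?", disagreeing, long_cot)
```

The fallback runs only on disagreement, which matches the abstract's description of this mode as prioritizing performance at the cost of higher token usage.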

Authors:Run Gu, Wei Xu, Zhaohui Yang, Dusit Niyato, Aylin Yener
Title: Task-Oriented Low-Label Semantic Communication With Self-Supervised Learning
Abstract:
Task-oriented semantic communication enhances transmission efficiency by conveying semantic information rather than exact messages. Deep learning (DL)-based semantic communication can effectively cultivate the essential semantic knowledge for semantic extraction, transmission, and interpretation by leveraging massive labeled samples for downstream task training. In this paper, we propose a self-supervised learning-based semantic communication framework (SLSCom) to enhance task inference performance, particularly in scenarios with limited access to labeled samples. Specifically, we develop a task-relevant semantic encoder using unlabeled samples, which can be collected by devices in real-world edge networks. To facilitate task-relevant semantic extraction, we introduce self-supervision for learning contrastive features and formulate the information bottleneck (IB) problem to balance the tradeoff between the informativeness of the extracted features and task inference performance. Given the computational challenges of the IB problem, we devise a practical and effective solution by employing self-supervised classification and reconstruction pretext tasks. We further propose efficient joint training methods to enhance end-to-end inference accuracy over wireless channels, even with few labeled samples. We evaluate the proposed framework on image classification tasks over multipath wireless channels. Extensive simulation results demonstrate that SLSCom significantly outperforms conventional digital coding methods and existing DL-based approaches across varying labeled data set sizes and SNR conditions, even when the unlabeled samples are irrelevant to the downstream tasks.
中文: 本文提出了一种自监督语义通信框架(SLSCom),通过利用未标记数据进行对比特征学习和信息瓶颈优化来提升任务推理性能,在图像分类任务中,在不同数据集规模和信噪比条件下均显著优于传统数字编码和现有深度学习方法。
English: This paper introduces a self-supervised semantic communication framework (SLSCom) that enhances task inference performance by leveraging unlabeled data for contrastive feature learning and information bottleneck optimization, demonstrating superior results over existing methods in image classification tasks under various data and channel conditions.
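For readers unfamiliar with the information bottleneck (IB) problem mentioned in the abstract, its textbook Lagrangian form is shown below. The symbols are assumptions introduced here: X is the source sample, Z the extracted semantic feature, Y the task label, and β > 0 a trade-off weight. The formulation actually used in SLSCom may differ.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

The first term penalizes how much of the input the feature retains, while the second rewards information that is useful for the task. Mutual information is generally hard to compute for high-dimensional data, which is consistent with the abstract's remark that the IB problem is computationally challenging and is instead approached through pretext tasks.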

Authors:Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Qingming Li, Shouling Ji
Title: Poison in the Well: Feature Embedding Disruption in Backdoor Attacks
Abstract:
Backdoor attacks embed malicious triggers into training data, enabling attackers to manipulate neural network behavior during inference while maintaining high accuracy on benign inputs. However, existing backdoor attacks face limitations manifesting in excessive reliance on training data, poor stealth, and instability, which hinder their effectiveness in real-world applications. Therefore, this paper introduces ShadowPrint, a versatile backdoor attack that targets feature embeddings within neural networks to achieve high ASRs and stealthiness. Unlike traditional approaches, ShadowPrint reduces reliance on training data access and operates effectively with exceedingly low poison rates (as low as 0.01%). It leverages a clustering-based optimization strategy to align feature embeddings, ensuring robust performance across diverse scenarios while maintaining stability and stealth. Extensive evaluations demonstrate that ShadowPrint achieves superior ASR (up to 100%), steady CA (with decay no more than 1% in most cases), and low DDR (averaging below 5%) across both clean-label and dirty-label settings, and with poison rates ranging from as low as 0.01% to 0.05%, setting a new standard for backdoor attack capabilities and emphasizing the need for advanced defense strategies focused on feature space manipulations.
Chinese: ShadowPrint提出了一种针对神经网络特征嵌入的新型后门攻击方法,通过极低的数据依赖性和仅0.01%的污染率实现了高攻击成功率与隐蔽性,并在各种场景下保持稳定性能。
English: ShadowPrint introduces a novel backdoor attack that targets neural network feature embeddings, achieving high attack success rates and stealth with minimal data reliance and poison rates as low as 0.01%, while maintaining robust performance across various settings.

Authors:Alou Diakite, Cheng Li, Lei Xie, Yuanjing Feng, Ruoyou Wu, Jianzhong He, Hairong Zheng, Shanshan Wang
Title: Cross-Sequence Semi-Supervised Learning for Multi-Parametric MRI-Based Visual Pathway Delineation
Abstract:
Accurately delineating the visual pathway (VP) is crucial for understanding the human visual system and diagnosing related disorders. Exploring multi-parametric MR imaging data has been identified as an important way to delineate VP. However, due to the complex cross-sequence relationships, existing methods cannot effectively model the complementary information from different MRI sequences. In addition, these existing methods heavily rely on large training data with labels, which is labor-intensive and time-consuming to obtain. In this work, we propose a novel semi-supervised multi-parametric feature decomposition framework for VP delineation. Specifically, a correlation-constrained feature decomposition (CFD) is designed to handle the complex cross-sequence relationships by capturing the unique characteristics of each MRI sequence and easing the multi-parametric information fusion process. Furthermore, a consistency-based sample enhancement (CSE) module is developed to address the limited labeled data issue, by generating and promoting meaningful edge information from unlabeled data. We validate our framework using two public datasets, and one in-house Multi-Shell Diffusion MRI (MDM) dataset. Experimental results demonstrate the superiority of our approach in terms of delineation performance when compared to seven state-of-the-art approaches.
中文: 本研究提出了一种半监督多参数特征分解框架,通过相关性约束特征分解和基于一致性的样本增强技术,有效解决视觉通路描绘中的跨序列复杂性和标注数据不足问题,实验证明其性能优于七种先进方法。
English: This study introduces a semi-supervised multi-parametric feature decomposition framework for accurate visual pathway delineation, addressing cross-sequence complexity and limited labeled data through correlation-constrained feature decomposition and consistency-based sample enhancement, outperforming seven state-of-the-art methods in experiments.

Authors:Amira Guesmi, Bassem Ouni, Muhammad Shafique
Title: TESSER: Transfer-Enhancing Adversarial Attacks from Vision Transformers via Spectral and Semantic Regularization
Abstract:
Adversarial transferability remains a critical challenge in evaluating the robustness of deep neural networks. In security-critical applications, transferability enables black-box attacks without access to model internals, making it a key concern for real-world adversarial threat assessment. While Vision Transformers (ViTs) have demonstrated strong adversarial performance, existing attacks often fail to transfer effectively across architectures, especially from ViTs to Convolutional Neural Networks (CNNs) or hybrid models. In this paper, we introduce TESSER, a novel adversarial attack framework that enhances transferability via two key strategies: (1) Feature-Sensitive Gradient Scaling (FSGS), which modulates gradients based on token-wise importance derived from intermediate feature activations, and (2) Spectral Smoothness Regularization (SSR), which suppresses high-frequency noise in perturbations using a differentiable Gaussian prior. These components work in tandem to generate perturbations that are both semantically meaningful and spectrally smooth. Extensive experiments on ImageNet across 12 diverse architectures demonstrate that TESSER achieves a 10.9% higher attack success rate (ASR) on CNNs and a 7.2% higher ASR on ViTs compared to the state-of-the-art Adaptive Token Tuning (ATT) method. Moreover, TESSER significantly improves robustness against defended models, achieving 53.55% ASR on adversarially trained CNNs. Qualitative analysis shows strong alignment between TESSER's perturbations and salient visual regions identified via Grad-CAM, while frequency-domain analysis reveals a 12% reduction in high-frequency energy, confirming the effectiveness of spectral regularization.
中文:TESSER是一种新颖的对抗攻击框架,通过特征敏感梯度缩放和频谱平滑正则化增强迁移性,相比现有最优方法,在多种架构上实现了显著更高的攻击成功率。
English: TESSER is a novel adversarial attack framework that enhances transferability through feature-sensitive gradient scaling and spectral smoothness regularization, achieving significantly higher attack success rates across diverse architectures compared to state-of-the-art methods.

Authors:Pieter van Goor, Robert Mahony
Title: Synchronous Models and Fundamental Systems in Observer Design
Abstract:
This paper introduces the concept of a synchronous model as an extension of the internal model concept used in observer design for dynamical systems. A system is said to contain a synchronous model of another if there is a suitable error function between the two systems that remains stationary for all of the trajectories of the two systems. A system is said to admit a synchronous lift if a second system containing a synchronous model exists. We provide necessary and sufficient conditions for a system to admit a synchronous lift and provide a method to construct a lifted system (there may be many) should one exist. We characterise the class of all systems that admit a synchronous lift by showing that they consist of fundamental vector fields induced by a Lie group action, a class of systems we term fundamental systems. For fundamental systems we propose a simple synchronous observer design methodology, for which we show how correction terms can be discretised and combined easily, facilitating global characterisation of convergence and performance. Finally, we provide three examples to demonstrate the key concepts of synchrony, symmetry construction, and observer design for a fundamental system.
中文摘要:本文将内模概念扩展为动力系统观测器设计中的同步模型,建立了系统允许同步提升的充要条件,并将其特征化为由李群作用诱导的基本系统。
English Summary: This paper extends the internal model concept to synchronous models for observer design in dynamical systems, establishing necessary and sufficient conditions for systems to admit synchronous lifts and characterizing them as fundamental systems induced by Lie group actions.

Authors:Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, Chen Sun
Title: Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals
Abstract:
Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well-explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion prior in the original pretrained model, without using any 3D asset or physics simulator at inference. The primary challenge of force prompting is the difficulty in obtaining high quality paired force-video training data, both in the real world due to the difficulty of obtaining force signals, and in synthetic data due to limitations in the visual quality and domain diversity of physics simulators. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of few objects. Our method can generate videos which simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos at our project page.
中文: 本研究提出力提示作为视频生成的控制机制,无需3D资源或模拟器即可实现推搡、风吹等逼真物理交互,并通过少量Blender合成训练数据展现出强大泛化能力,在物理真实性上超越现有方法。
English: This study introduces force prompts as a control mechanism for video generation, enabling realistic physical interactions like poking or wind effects without 3D assets or simulators, and demonstrates strong generalization from limited Blender-synthesized training data while outperforming existing methods in physics realism.

Authors:Yuzheng Hu, Fan Wu, Haotian Ye, David Forsyth, James Zou, Nan Jiang, Jiaqi W. Ma, Han Zhao
Title: A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning
Abstract:
Online reinforcement learning (RL) excels in complex, safety-critical domains, yet it faces challenges such as sample inefficiency, training instability, and a lack of interpretability. Data attribution offers a principled way to trace model behavior back to individual training samples. However, in online RL, each training sample not only drives policy updates but also influences future data collection, violating the fixed dataset assumption in existing attribution methods. In this paper, we initiate the study of data attribution for online RL, focusing on the widely used Proximal Policy Optimization (PPO) algorithm. We start by establishing a local attribution framework, interpreting model checkpoints with respect to the records in the recent training buffer. We design two target functions, capturing agent action and cumulative return respectively, and measure each record's contribution through gradient similarity between its training loss and these targets. We demonstrate the power of this framework through three concrete applications: diagnosis of learning, temporal analysis of behavior formation, and targeted intervention during training. Leveraging this framework, we further propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates. Across standard RL benchmarks (classic control, navigation, locomotion) to RLHF for large language models, IIF reduces sample complexity, speeds up training, and achieves higher returns. Overall, these results advance interpretability, efficiency, and effectiveness of online RL.
中文: 在线强化学习面临样本效率和可解释性挑战,但本研究通过数据归因框架和迭代过滤算法,在多类基准测试中提升了训练性能。
English: Online reinforcement learning faces challenges in sample efficiency and interpretability, but this study introduces a data attribution framework and an iterative filtering algorithm that enhance training performance across various benchmarks.

Authors:Yuzheng Hu, Fan Wu, Haotian Ye, David Forsyth, James Zou, Nan Jiang, Jiaqi W. Ma, Han Zhao
Title: A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning
Abstract:
Online reinforcement learning (RL) excels in complex, safety-critical domains but suffers from sample inefficiency, training instability, and limited interpretability. Data attribution provides a principled way to trace model behavior back to training samples, yet existing methods assume fixed datasets, which is violated in online RL where each experience both updates the policy and shapes future data collection. In this paper, we initiate the study of data attribution for online RL, focusing on the widely used Proximal Policy Optimization (PPO) algorithm. We start by establishing a local attribution framework, interpreting model checkpoints with respect to the records in the recent training buffer. We design two target functions, capturing agent action and cumulative return respectively, and measure each record's contribution through gradient similarity between its training loss and these targets. We demonstrate the power of this framework through three concrete applications: diagnosis of learning, temporal analysis of behavior formation, and targeted intervention during training. Leveraging this framework, we further propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates. Across standard RL benchmarks (classic control, navigation, locomotion) to RLHF for large language models, IIF reduces sample complexity, speeds up training, and achieves higher returns. Together, these results open a new direction for making online RL more interpretable, efficient, and effective.
中文: 在线强化学习面临样本效率和可解释性挑战,但本研究通过数据归因框架和迭代过滤算法,在多类基准测试中提升了训练性能。
English: Online reinforcement learning faces challenges in sample efficiency and interpretability, but this study introduces a data attribution framework and an iterative filtering algorithm that enhance training performance across various benchmarks.
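A minimal sketch of the gradient-similarity idea described in the abstract above, under stated assumptions: gradients are plain lists of floats, similarity is a dot product, and the score is the first-order change in the target after one gradient-descent step on a record's loss. The function names and the keep-if-non-negative filtering rule are hypothetical illustrations; the abstract does not specify the exact similarity measure or the filtering criterion used by IIF.

```python
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product between two gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def influence_scores(record_loss_grads: List[List[float]],
                     target_grad: List[float]) -> List[float]:
    """First-order effect of each record on the target.

    A descent step on record i moves parameters by -lr * g_i, so the target
    changes by about -lr * <g_i, target_grad>. We drop lr and return
    -<g_i, target_grad> (assumed scoring rule, not the paper's definition).
    """
    return [-dot(g, target_grad) for g in record_loss_grads]


def keep_helpful(scores: List[float]) -> List[int]:
    """Hypothetical filter: keep records estimated not to lower the target."""
    return [i for i, s in enumerate(scores) if s >= 0.0]


# Toy gradients, invented for illustration only.
target_grad = [1.0, 0.0]
record_loss_grads = [[-2.0, 1.0], [0.5, 0.5], [-1.0, 3.0]]

scores = influence_scores(record_loss_grads, target_grad)
kept = keep_helpful(scores)
print(scores, kept)
```

In this toy buffer the second record is estimated to reduce the target and is dropped, which mirrors the experience-filtering step at a very coarse level.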

Authors:Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low
Title: ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
Abstract:
The recent success of using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks like question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions (e.g., linearity). To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model that is used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of LLM on data selection, unlike methods that select the data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Extensive experiments show that ActiveDPO outperforms existing methods across various models and datasets.
中文: ActiveDPO是一种新颖算法,通过基于理论、由大语言模型参数化的奖励模型进行高效主动数据选择,从而提升大语言模型的对齐效果,在多种模型和数据集上均优于现有方法。
English: ActiveDPO is a novel algorithm that enhances large language model alignment by using a theoretically grounded, LLM-parameterized reward model for efficient active data selection, outperforming existing methods across diverse models and datasets.

Authors:Xuejie Liu, Anji Liu, Guy Van den Broeck, Yitao Liang
Title: Plug-and-Play Context Feature Reuse for Efficient Masked Generation
Abstract:
Masked generative models (MGMs) have emerged as a powerful framework for image synthesis, combining parallel decoding with strong bidirectional context modeling. However, generating high-quality samples typically requires many iterative decoding steps, resulting in high inference costs. A straightforward way to speed up generation is by decoding more tokens in each step, thereby reducing the total number of steps. However, when many tokens are decoded simultaneously, the model can only estimate the univariate marginal distributions independently, failing to capture the dependency among them. As a result, reducing the number of steps significantly compromises generation fidelity. In this work, we introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs by constructing low-cost steps via reusing feature embeddings from previously decoded context tokens. ReCAP interleaves standard full evaluations with lightweight steps that cache and reuse context features, substantially reducing computation while preserving the benefits of fine-grained, iterative generation. We demonstrate its effectiveness on top of three representative MGMs (MaskGIT, MAGE, and MAR), including both discrete and continuous token spaces and covering diverse architectural designs. In particular, on ImageNet256 class-conditional generation, ReCAP achieves up to 2.4x faster inference than the base model with minimal performance drop, and consistently delivers better efficiency-fidelity trade-offs under various generation settings.
Chinese: ReCAP作为一种即插即用模块,通过重用上下文特征实现轻量级解码步骤,将掩码生成模型的推理速度提升高达2.4倍,同时保持生成质量基本不变。
English: ReCAP is a plug-and-play module that accelerates masked generative models by reusing context features in lightweight decoding steps, achieving up to 2.4x faster inference with minimal quality loss.

Authors:Catalina Tan, Yipeng Hu, Shaheer U. Saeed
Title: SPARS: Self-Play Adversarial Reinforcement Learning for Segmentation of Liver Tumours
Abstract:
Accurate tumour segmentation is vital for various targeted diagnostic and therapeutic procedures for cancer, e.g., planning biopsies or tumour ablations. Manual delineation is extremely labour-intensive, requiring substantial expert time. Fully-supervised machine learning models aim to automate such localisation tasks, but require a large number of costly and often subjective 3D voxel-level labels for training. The high variance and subjectivity in such labels impact model generalisability, even when large datasets are available. Histopathology labels may offer more objective labels but the infeasibility of acquiring pixel-level annotations to develop tumour localisation methods based on histology remains challenging in-vivo. In this work, we propose a novel weakly-supervised semantic segmentation framework called SPARS (Self-Play Adversarial Reinforcement Learning for Segmentation), which utilises an object presence classifier, trained on a small number of image-level binary cancer presence labels, to localise cancerous regions on CT scans. Such binary labels of patient-level cancer presence can be sourced more feasibly from biopsies and histopathology reports, enabling a more objective cancer localisation on medical images. Evaluating with real patient data, we observed that SPARS yielded a mean Dice score of $77.3 \pm 9.4$, which outperformed other weakly-supervised methods by large margins. This performance was comparable with recent fully-supervised methods that require voxel-level annotations. Our results demonstrate the potential of using SPARS to reduce the need for extensive human-annotated labels to detect cancer in real-world healthcare settings.
中文摘要:SPARS框架通过弱监督学习仅使用简单的二元癌症存在标签即可在CT扫描中准确定位肿瘤,其性能与需要体素级标注的全监督方法相当,同时显著降低了对人工标注的依赖。
English Summary: The SPARS framework uses weakly-supervised learning with simple binary cancer presence labels to accurately localize tumors on CT scans, achieving performance comparable to fully-supervised methods while reducing reliance on costly manual annotations.
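For readers unfamiliar with the metric reported in this entry, the sketch below computes the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name, the empty-mask convention, and the toy masks are assumptions for illustration; the abstract's value of 77.3 is presumably on a percentage scale, and the paper's evaluation code may differ.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks (values 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks, invented for illustration only.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
score = dice_score(pred, target)
print(score)
```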

Authors:Rohit Khoja, Devanshu Gupta, Yanjie Fu, Dan Roth, Vivek Gupta
Title: Weaver: Interweaving SQL and LLM for Table Reasoning
Abstract:
Querying tables with unstructured data is challenging due to the presence of text (or image), either embedded in the table or in external paragraphs, which traditional SQL struggles to process, especially for tasks requiring semantic reasoning. While Large Language Models (LLMs) excel at understanding context, they face limitations with long input sequences. Existing approaches that combine SQL and LLMs typically rely on rigid, predefined workflows, limiting their adaptability to complex queries. To address these issues, we introduce Weaver, a modular pipeline that dynamically integrates SQL and LLMs for table-based question answering (TableQA). Weaver generates a flexible, step-by-step plan that combines SQL for structured data retrieval with LLMs for semantic processing. By decomposing complex queries into manageable subtasks, Weaver improves accuracy and generalization. Our experiments show that Weaver consistently outperforms state-of-the-art methods across four TableQA datasets, reducing both API calls and error rates. The code, along with other associated scripts, is available at https://coral-lab-asu.github.io/weaver.
Chinese: Weaver提出了一种模块化流程,通过动态整合SQL和大型语言模型,将复杂查询分解为可处理的子任务,从而在提高准确性和泛化能力的同时减少了API调用和错误率。
English: Weaver introduces a modular pipeline that dynamically combines SQL and LLMs to handle table-based question answering by decomposing complex queries into manageable steps, improving accuracy and generalization while reducing API calls and errors.

Authors:Zhepeng Cen, Yihang Yao, William Han, Zuxin Liu, Ding Zhao
Title: Behavior Injection: Preparing Language Models for Reinforcement Learning
Abstract:
Reinforcement fine-tuning (RFT) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RFT: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key conditions for effective post-training: (1) RL-informative rollout accuracy, and (2) strong data co-influence, which quantifies how much the training data affects performance on other samples. Guided by these insights, we propose behavior injection, a task-agnostic data-augmentation scheme applied prior to RL. Behavior injection enriches the supervised finetuning (SFT) data by seeding exploratory and exploitative behaviors, effectively making the model more RL-ready. We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increase the performance gain from RFT over the pre-RL model.
中文摘要:强化学习虽能提升大语言模型的推理能力,但效果不一;本研究揭示了其有效性的关键条件,并提出行为注入这一数据增强方法,通过预先优化模型准备,显著提高了强化学习的性能增益。
English Summary: Reinforcement learning (RL) can enhance large language models' reasoning, but its effectiveness varies; this study identifies key conditions for success and introduces behavior injection, a data augmentation method that significantly boosts RL performance by preparing models better beforehand.

Authors:Zhepeng Cen, Yihang Yao, William Han, Zuxin Liu, Ding Zhao
Title: Behavior Injection: Preparing Language Models for Reinforcement Learning
Abstract:
Reinforcement learning (RL) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RL finetuning: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key conditions for effective post-training: (1) RL-informative rollout accuracy, and (2) strong data co-influence, which quantifies how much the training data affects performance on other samples. Guided by these insights, we propose behavior injection, a task-agnostic data augmentation scheme applied prior to RL. Behavior injection enriches the supervised finetuning (SFT) data by seeding exploratory and exploitative behaviors, effectively making the model more RL-ready. We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increase the performance gain from RL over the pre-RL model.
中文摘要:强化学习虽能提升大语言模型的推理能力,但效果不一;本研究揭示了其有效性的关键条件,并提出行为注入这一数据增强方法,通过预先优化模型准备,显著提高了强化学习的性能增益。
English Summary: Reinforcement learning (RL) can enhance large language models' reasoning, but its effectiveness varies; this study identifies key conditions for success and introduces behavior injection, a data augmentation method that significantly boosts RL performance by preparing models better beforehand.

Authors:Umar Marikkar, Syed Sameed Husain, Muhammad Awais, Sara Atito
Title: C3R: Channel Conditioned Cell Representations for unified evaluation in microscopy imaging
Abstract:
Immunohistochemical (IHC) images reveal detailed information about structures and functions at the subcellular level. However, unlike natural images, IHC datasets pose challenges for deep learning models due to their inconsistencies in channel count and configuration, stemming from varying staining protocols across laboratories and studies. Existing approaches build channel-adaptive models, which unfortunately fail to support out-of-distribution (OOD) evaluation across IHC datasets and cannot be applied in a true zero-shot setting with mismatched channel counts. To address this, we introduce a structured view of cellular image channels by grouping them into either context or concept, where we treat the context channels as a reference to the concept channels in the image. We leverage this context-concept principle to develop Channel Conditioned Cell Representations (C3R), a framework designed for unified evaluation on in-distribution (ID) and OOD datasets. C3R is a two-fold framework comprising a channel-adaptive encoder architecture and a masked knowledge distillation training strategy, both built around the context-concept principle. We find that C3R outperforms existing benchmarks on both ID and OOD tasks, while a trivial implementation of our core idea also outperforms the channel-adaptive methods reported on the CHAMMI benchmark. Our method opens a new pathway for cross-dataset generalization between IHC datasets, without requiring dataset-specific adaptation or retraining.
中文: 本文提出C3R框架,通过将细胞图像通道分组为背景和概念类别,实现了免疫组化数据在分布内和分布外的统一评估,无需特定数据集适配即可超越现有方法性能。
English: This paper introduces C3R, a novel framework that groups cellular image channels into context and concept categories to enable unified evaluation across in-distribution and out-of-distribution immunohistochemical datasets, outperforming existing methods without requiring dataset-specific adaptations.

Authors:Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, Zhiyong Wu
Title: Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving
Abstract:
Large language models (LLMs) have shown remarkable generalization across tasks, leading to increased interest in integrating speech with LLMs. These speech LLMs (SLLMs) typically use supervised fine-tuning to align speech with text-based LLMs. However, the lack of annotated speech data across a wide range of tasks hinders alignment efficiency, resulting in poor generalization. To address these issues, we propose a novel multi-task 'behavior imitation' method with speech-text interleaving, called MTBI, which relies solely on paired speech and transcripts. By ensuring the LLM decoder generates equivalent responses to paired speech and text, we achieve a more generalized SLLM. Interleaving is used to further enhance alignment efficiency. We introduce a simple benchmark to evaluate prompt and task generalization across different models. Experimental results demonstrate that our MTBI outperforms SOTA SLLMs on both prompt and task generalization, while requiring less supervised speech data.
中文: 提出的MTBI方法通过多任务行为模仿和语音-文本交错训练,仅需少量监督数据即可实现更优的语音-语言模型对齐与泛化能力。
English: The proposed MTBI method enhances speech-language model alignment by using multi-task behavior imitation with speech-text interleaving, achieving superior generalization with less supervised data than current models.

Authors:Yiming Sun, Shuo Chen, Shengyu Chen, Chonghao Qiu, Licheng Liu, Youmi Oh, Sparkle L. Malone, Gavin McNicol, Qianlai Zhuang, Chris Smith, Yiqun Xie, Xiaowei Jia
Title: X-MethaneWet: A Cross-scale Global Wetland Methane Emission Benchmark Dataset for Advancing Science Discovery with AI
Abstract:
Methane (CH$_4$) is the second most powerful greenhouse gas after carbon dioxide and plays a crucial role in climate change due to its high global warming potential. Accurately modeling CH$_4$ fluxes across the globe and at fine temporal scales is essential for understanding its spatial and temporal variability and developing effective mitigation strategies. In this work, we introduce the first-of-its-kind cross-scale global wetland methane benchmark dataset (X-MethaneWet), which synthesizes physics-based model simulation data from TEM-MDM and the real-world observation data from FLUXNET-CH$_4$. This dataset can offer opportunities for improving global wetland CH$_4$ modeling and science discovery with new AI algorithms. To set up AI model baselines for methane flux prediction, we evaluate the performance of various sequential deep learning models on X-MethaneWet. Furthermore, we explore four different transfer learning techniques to leverage simulated data from TEM-MDM to improve the generalization of deep learning models on real-world FLUXNET-CH$_4$ observations. Our extensive experiments demonstrate the effectiveness of these approaches, highlighting their potential for advancing methane emission modeling and contributing to the development of more accurate and scalable AI-driven climate models.
中文: 本研究首次推出跨尺度全球湿地甲烷基准数据集(X-MethaneWet),通过整合模型模拟与实地观测数据,评估了深度学习模型和迁移学习技术在提升甲烷通量预测能力方面的有效性,为推进气候变化建模提供新途径。
English: This study introduces the first cross-scale global wetland methane benchmark dataset (X-MethaneWet), combining model simulations and field observations to evaluate deep learning models and transfer learning techniques for improving methane flux predictions and advancing climate modeling.

Authors:Ya Wu, Qiang Sheng, Danding Wang, Guang Yang, Yifan Sun, Zhengjia Wang, Yuyan Bu, Juan Cao
Title: The Staircase of Ethics: Probing LLM Value Priorities through Multi-Step Induction to Complex Moral Dilemmas
Abstract:
Ethical decision-making is a critical aspect of human judgment, and the growing use of LLMs in decision-support systems necessitates a rigorous evaluation of their moral reasoning capabilities. However, existing assessments primarily rely on single-step evaluations, failing to capture how models adapt to evolving ethical challenges. Addressing this gap, we introduce the Multi-step Moral Dilemmas (MMDs), the first dataset specifically constructed to evaluate the evolving moral judgments of LLMs across 3,302 five-stage dilemmas. This framework enables a fine-grained, dynamic analysis of how LLMs adjust their moral reasoning across escalating dilemmas. Our evaluation of nine widely used LLMs reveals that their value preferences shift significantly as dilemmas progress, indicating that models recalibrate moral judgments based on scenario complexity. Furthermore, pairwise value comparisons demonstrate that while LLMs often prioritize the value of care, this value can sometimes be superseded by fairness in certain contexts, highlighting the dynamic and context-dependent nature of LLM ethical reasoning. Our findings call for a shift toward dynamic, context-aware evaluation paradigms, paving the way for more human-aligned and value-sensitive development of LLMs.
中文摘要:本研究引入多步骤道德困境(MMDs)数据集,动态评估大型语言模型的道德推理演变,发现其价值偏好随情境复杂度和语境发生显著变化,表明需采用更灵活的动态评估范式来推动符合人类价值观的AI发展。
English Summary: The study introduces Multi-step Moral Dilemmas (MMDs) to dynamically evaluate LLMs' evolving moral reasoning, revealing that their value preferences shift with scenario complexity and context, necessitating more adaptive evaluation methods for ethical AI development.

Authors:Poojah Ganesan, Rajat Aayush Jha, Dan Roth, Vivek Gupta
Title: UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification
Abstract:
Recent advances in large language models (LLMs) have greatly improved Text-to-SQL performance for single-table queries. However, it remains challenging in multi-table databases due to complex schemas and relational operations. Existing methods often struggle with retrieving the right tables and columns, generating accurate JOINs and UNIONs, and generalizing across diverse schemas. To address these issues, we introduce UNJOIN, a two-stage framework that decouples the retrieval of schema elements from SQL logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. This allows the model to focus purely on accurate retrieval without being distracted by the need to write complex SQL logic. In the second stage, the SQL query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic. Evaluations on SPIDER and BIRD datasets show that UNJOIN matches or exceeds the state-of-the-art baselines. UNJOIN uses only schema information, which does not require data access or fine-tuning, making it scalable and adaptable across databases.
中文:UNJOIN框架通过将模式元素检索与SQL逻辑生成解耦,改进了多表数据库的文本到SQL性能,无需数据访问或微调即可达到最先进水平。
English: The UNJOIN framework improves Text-to-SQL performance for multi-table databases by decoupling schema element retrieval from SQL logic generation, achieving state-of-the-art results without requiring data access or fine-tuning.
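The first stage described in this entry (merging all tables into a single-table representation by prefixing each column with its table name) can be sketched roughly as follows. The function names, the `.` separator, and the toy schema are assumptions made for illustration; this is not the authors' implementation, and the second-stage reconstruction of JOINs and UNIONs is not shown.

```python
from typing import Dict, List, Tuple


def flatten_schema(schema: Dict[str, List[str]], sep: str = ".") -> List[str]:
    """Merge every table into one list of table-prefixed column names."""
    return [f"{table}{sep}{column}"
            for table, columns in schema.items()
            for column in columns]


def split_flat_column(flat_name: str, sep: str = ".") -> Tuple[str, str]:
    """Recover (table, column) from a prefixed name when mapping back."""
    table, column = flat_name.split(sep, 1)
    return table, column


# Hypothetical toy schema, not taken from SPIDER or BIRD.
schema = {
    "singer": ["id", "name"],
    "concert": ["id", "singer_id", "year"],
}

flat = flatten_schema(schema)
print(flat)
print(split_flat_column("concert.singer_id"))
```

Because every flattened name keeps its table prefix, columns that share a name across tables (here `id`) stay distinguishable, which is what makes the later mapping back to the original schema possible.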

Authors:Lynn Karam, Yipei Wang, Veeru Kasivisvanathan, Mirabela Rusu, Yipeng Hu, Shaheer U. Saeed
Title: Promptable cancer segmentation using minimal expert-curated data
Abstract:
Automated segmentation of cancer on medical images can aid targeted diagnostic and therapeutic procedures. However, its adoption is limited by the high cost of expert annotations required for training and inter-observer variability in datasets. While weakly-supervised methods mitigate some challenges, using binary histology labels for training as opposed to requiring full segmentation, they require large paired datasets of histology and images, which are difficult to curate. Similarly, promptable segmentation aims to allow segmentation with no re-training for new tasks at inference, however, existing models perform poorly on pathological regions, again necessitating large datasets for training. In this work we propose a novel approach for promptable segmentation requiring only 24 fully-segmented images, supplemented by 8 weakly-labelled images, for training. Curating this minimal data to a high standard is relatively feasible and thus issues with the cost and variability of obtaining labels can be mitigated. By leveraging two classifiers, one weakly-supervised and one fully-supervised, our method refines segmentation through a guided search process initiated by a single-point prompt. Our approach outperforms existing promptable segmentation methods, and performs comparably with fully-supervised methods, for the task of prostate cancer segmentation, while using substantially less annotated data (up to 100X less). This enables promptable segmentation with very minimal labelled data, such that the labels can be curated to a very high standard.
Chinese: 本研究提出了一种新颖的可提示分割方法,用于前列腺癌分割,仅需24张全分割和8张弱标记图像进行训练,通过双分类器的引导搜索优化分割效果,在显著减少标注数据量(高达100倍)的同时,性能与全监督方法相当。
English: This study introduces a novel promptable segmentation method for prostate cancer that requires only minimal training data—24 fully segmented and 8 weakly labeled images—leveraging dual classifiers to refine segmentation via guided search, achieving performance comparable to fully supervised methods while using up to 100 times less annotated data.

Authors:Marcel Binz, Akshay K. Jagadish, Milena Rmus, Eric Schulz
Title: Automated scientific minimization of regret
Abstract:
We introduce automated scientific minimization of regret (ASMR) -- a framework for automated computational cognitive science. Building on the principles of scientific regret minimization, ASMR leverages Centaur -- a recently proposed foundation model of human cognition -- to identify gaps in an interpretable cognitive model. These gaps are then addressed through automated revisions generated by a language-based reasoning model. We demonstrate the utility of this approach in a multi-attribute decision-making task, showing that ASMR discovers cognitive models that predict human behavior at noise ceiling while retaining interpretability. Taken together, our results highlight the potential of ASMR to automate core components of the cognitive modeling pipeline.
中文摘要:ASMR是一种自动化框架,利用认知模型和人工智能来识别并填补可解释认知科学模型中的空白,在预测人类行为的同时保持了模型的清晰度,展现了其有效性。
English Summary: ASMR is an automated framework that uses cognitive models and AI to identify and fill gaps in interpretable cognitive science models, demonstrating effectiveness in predicting human behavior while maintaining clarity.

Authors:Li Lin, Xinyu Hu, Xiaojun Wan
Title: NeUQI: Near-Optimal Uniform Quantization Parameter Initialization
Abstract:
Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with uniform quantization representation is favored for its efficiency and ease of deployment since uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on $\geq 2$-bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they primarily focus on quantization methodologies, while the initialization of quantization parameters is underexplored and still relies on the suboptimal Min-Max strategies. In this work, we propose NeUQI, a method devoted to efficiently determining near-optimal initial parameters for uniform quantization. NeUQI is orthogonal to prior quantization methodologies and can seamlessly integrate with them. The experiments with the LLaMA and Qwen families on various tasks demonstrate that our NeUQI consistently outperforms existing methods. Furthermore, when combined with a lightweight distillation strategy, NeUQI can achieve superior performance to PV-tuning, a much more resource-intensive approach.
中文摘要:大语言模型在消费级设备上部署时面临高资源消耗的挑战,而NeUQI方法通过优化量化参数初始化,在显著降低内存和延迟的同时保持了模型性能。
English Summary: Large language models face deployment challenges on consumer devices due to high resource demands, but the proposed NeUQI method efficiently optimizes quantization initialization to significantly reduce memory and latency while maintaining performance.
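As background for this entry, the sketch below shows the conventional Min-Max initialization of uniform quantization parameters, which the abstract describes as the suboptimal baseline. It is not NeUQI itself. The function names and toy weights are assumptions, and real PTQ libraries typically apply this per channel or per group on tensors instead of Python lists.

```python
from typing import List, Tuple


def minmax_init(weights: List[float], bits: int) -> Tuple[float, int]:
    """Min-Max initialization: map [min, max] onto the grid 0 .. 2**bits - 1."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0  # guard constant inputs
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(weights: List[float], scale: float, zero_point: int, bits: int) -> List[int]:
    """Round each weight to the nearest grid point and clamp to the valid range."""
    levels = 2 ** bits - 1
    return [min(max(round(w / scale) + zero_point, 0), levels) for w in weights]


def dequantize(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [(q - zero_point) * scale for q in codes]


# Toy weights, invented for illustration only.
weights = [-1.0, -0.4, 0.0, 0.6, 2.0]
scale, zero_point = minmax_init(weights, bits=2)
codes = quantize(weights, scale, zero_point, bits=2)
restored = dequantize(codes, scale, zero_point)
print(scale, zero_point, codes, restored)
```

With only four levels, a single outlier such as 2.0 stretches the grid and pushes the small weights onto coarse values, which illustrates why the choice of initial scale and zero point matters.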

Authors:Zifan Peng, Yule Liu, Zhen Sun, Mingchen Li, Zeren Luo, Jingyi Zheng, Wenhan Dong, Xinlei He, Xuechao Wang, Yingjie Xue, Shengmin Xu, Xinyi Huang
Title: JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models
Abstract:
Audio Language Models (ALMs) have made significant progress recently. These models integrate the audio modality directly into the model, rather than converting speech into text and inputting text to Large Language Models (LLMs). While jailbreak attacks on LLMs have been extensively studied, the security of ALMs with audio modalities remains largely unexplored. Currently, there is a lack of an adversarial audio dataset and a unified framework specifically designed to evaluate and compare attacks and ALMs. In this paper, we present JALMBench, the first comprehensive benchmark to assess the safety of ALMs against jailbreak attacks. JALMBench includes a dataset containing 2,200 text samples and 51,381 audio samples with over 268 hours. It supports 12 mainstream ALMs, 4 text-transferred and 4 audio-originated attack methods, and 5 defense methods. Using JALMBench, we provide an in-depth analysis of attack efficiency, topic sensitivity, voice diversity, and attack representations. Additionally, we explore mitigation strategies for the attacks at both the prompt level and the response level.
中文: 本文提出了JALMBench这一综合性基准测试,用于评估音频语言模型在越狱攻击下的安全性,包含大规模数据集并支持多种模型、攻击和防御方法,以分析漏洞并探索缓解策略。
English: This paper introduces JALMBench, a comprehensive benchmark for evaluating the security of Audio Language Models against jailbreak attacks, featuring a large dataset and supporting multiple models, attacks, and defenses to analyze vulnerabilities and mitigation strategies.

Authors:Zifan Peng, Yule Liu, Zhen Sun, Mingchen Li, Zeren Luo, Jingyi Zheng, Wenhan Dong, Xinlei He, Xuechao Wang, Yingjie Xue, Shengmin Xu, Xinyi Huang
Title: JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models
Abstract:
Audio Language Models (ALMs) have made significant progress recently. These models integrate the audio modality directly into the model, rather than converting speech into text and inputting text to Large Language Models (LLMs). While jailbreak attacks on LLMs have been extensively studied, the security of ALMs with audio modalities remains largely unexplored. Currently, there is a lack of an adversarial audio dataset and a unified framework specifically designed to evaluate and compare attacks and ALMs. In this paper, we present JALMBench, a comprehensive benchmark to assess the safety of ALMs against jailbreak attacks. JALMBench includes a dataset containing 11,316 text samples and 245,355 audio samples with over 1,000 hours. It supports 12 mainstream ALMs, 4 text-transferred and 4 audio-originated attack methods, and 5 defense methods. Using JALMBench, we provide an in-depth analysis of attack efficiency, topic sensitivity, voice diversity, and architecture. Additionally, we explore mitigation strategies for the attacks at both the prompt level and the response level.
中文: 本文提出了JALMBench这一综合性基准测试,用于评估音频语言模型在越狱攻击下的安全性,包含大规模数据集并支持多种模型、攻击和防御方法,以分析漏洞并探索缓解策略。
English: This paper introduces JALMBench, a comprehensive benchmark for evaluating the security of Audio Language Models against jailbreak attacks, featuring a large dataset and supporting multiple models, attacks, and defenses to analyze vulnerabilities and mitigation strategies.

Authors:Cheng Peng, Kai Zhang, Mengxian Lyu, Hongfang Liu, Lichao Sun, Yonghui Wu
Title: Scaling Up Biomedical Vision-Language Models: Fine-Tuning, Instruction Tuning, and Multi-Modal Learning
Abstract:
This study aims to advance biomedical vision-language model capabilities through scaling up, fine-tuning, and instruction tuning; to develop vision-language models with improved performance in handling long text; to explore strategies to efficiently adopt vision-language models for diverse multi-modal biomedical tasks; and to examine zero-shot learning performance. We developed two biomedical vision-language models, BiomedGPT-Large and BiomedGPT-XLarge, based on an encoder-decoder transformer architecture. We fine-tuned the two models on 23 benchmark datasets from 6 multi-modal biomedical tasks, including one image-only task (image classification), three language-only tasks (text understanding, text summarization, and question answering), and two vision-language tasks (visual question answering and image captioning). We compared the scaled models with our previous BiomedGPT-Base model and existing prestigious models reported in the literature. We instruction-tuned the two models using a large-scale multi-modal biomedical instruction-tuning dataset and assessed the zero-shot learning performance and alignment accuracy.
中文: 本研究开发了两种先进的生物医学视觉语言模型BiomedGPT-Large和BiomedGPT-XLarge,通过在六个多模态任务上进行微调,并利用指令调优评估了其零样本学习性能。
English: This research developed two advanced biomedical vision-language models, BiomedGPT-Large and BiomedGPT-XLarge, which were fine-tuned across six multimodal tasks and evaluated for zero-shot learning performance through instruction tuning.

Authors:Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Xun Cao, Philip Torr, Yao Yao
Title: Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention
Abstract:
Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, substantially reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://www.neural4d.com/research/direct3d-s2.
中文: Direct3D-S2提出了一种基于稀疏体积和空间稀疏注意力的可扩展3D生成框架,以显著降低的计算成本实现更优质量,并能够用少量GPU完成高分辨率训练。
English: Direct3D-S2 introduces a scalable 3D generation framework using sparse volumes and Spatial Sparse Attention to achieve superior quality with dramatically reduced computational costs, enabling high-resolution training with minimal GPU resources.

Authors:Xuankun Rong, Wenke Huang, Jian Liang, Jinhe Bi, Xun Xiao, Yiming Li, Bo Du, Mang Ye
Title: Backdoor Cleaning without External Guidance in MLLM Fine-tuning
Abstract:
Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions, a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE's effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.
中文: 多模态大语言模型在微调过程中易受后门攻击,而提出的“眼见为实”框架通过分析注意力熵模式,无需干净数据或模型修改即可有效检测并过滤恶意样本,确保模型安全。
English: Multimodal Large Language Models (MLLMs) are vulnerable to backdoor attacks during fine-tuning, but the proposed Believe Your Eyes (BYE) framework effectively detects and filters malicious samples by analyzing attention entropy patterns without requiring clean data or model modifications.
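
The three-stage pipeline described in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the attention maps are hand-written lists rather than outputs of a fine-tuned MLLM, the layer-profiling step is omitted, and the clustering is a hypothetical 1-D two-means routine chosen only to keep the example dependency-free. The function names (`attention_entropy`, `two_means_1d`, `flag_suspicious`) are assumptions made for illustration.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over image tokens."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def two_means_1d(scores, iters=50):
    """Tiny 1-D k-means with k=2 (illustrative stand-in for the clustering stage)."""
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        low = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high = [s for s in scores if abs(s - lo) > abs(s - hi)]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return lo, hi

def flag_suspicious(attention_maps):
    """Flag samples whose attention entropy falls in the low-entropy cluster.

    Assumption: low entropy stands in for "attention collapse", i.e. attention
    concentrated on a few tokens.
    """
    scores = [attention_entropy(a) for a in attention_maps]
    lo, hi = two_means_1d(scores)
    if lo == hi:  # all scores identical: nothing separates, flag nothing
        return [False] * len(scores)
    return [abs(s - lo) < abs(s - hi) for s in scores]

# Three spread-out attention maps and one collapsed onto a single token.
maps = [
    [0.25, 0.25, 0.25, 0.25],
    [0.30, 0.20, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.20, 0.30, 0.30, 0.20],
]
print(flag_suspicious(maps))
```

In this toy setting only the third sample is flagged, because its attention mass sits almost entirely on one token and its entropy is far below the others.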

Authors:Mohamed Amine Ketata, David Lüdke, Leo Schwinn, Stephan Günnemann
Title: Joint Relational Database Generation via Graph-Conditional Diffusion Models
Abstract:
Building generative models for relational databases (RDBs) is important for applications like privacy-preserving data release and augmenting real datasets. However, most prior work either focuses on single-table generation or relies on autoregressive factorizations that impose a fixed table order and generate tables sequentially. This approach limits parallelism, restricts flexibility in downstream applications like missing value imputation, and compounds errors due to commonly made conditional independence assumptions. We propose a fundamentally different approach: jointly modeling all tables in an RDB without imposing any order. By using a natural graph representation of RDBs, we propose the Graph-Conditional Relational Diffusion Model (GRDM). GRDM leverages a graph neural network to jointly denoise row attributes and capture complex inter-table dependencies. Extensive experiments on six real-world RDBs demonstrate that our approach substantially outperforms autoregressive baselines in modeling multi-hop inter-table correlations and achieves state-of-the-art performance on single-table fidelity metrics.
Chinese: 本文提出图条件关系扩散模型(GRDM),通过图神经网络联合建模关系数据库以捕捉表间依赖关系,无需预设表顺序,在多跳关联建模上显著优于自回归基线,并在单表保真度指标上达到最优性能。
English: This paper introduces the Graph-Conditional Relational Diffusion Model (GRDM), which jointly models relational databases using a graph neural network to capture inter-table dependencies without imposing table order, significantly outperforming autoregressive baselines in multi-hop correlation modeling and achieving state-of-the-art single-table fidelity.

Authors:Oliver Grainge, Michael Milford, Indu Bodala, Sarvapali D. Ramchurn, Shoaib Ehsan
Title: TAT-VPR: Ternary Adaptive Transformer for Dynamic and Efficient Visual Place Recognition
Abstract:
TAT-VPR is a ternary-quantized transformer that brings dynamic accuracy-efficiency trade-offs to visual SLAM loop-closure. By fusing ternary weights with a learned activation-sparsity gate, the model can reduce computation by up to 40% at run-time without degrading performance (Recall@1). The proposed two-stage distillation pipeline preserves descriptor quality, letting it run on micro-UAV and embedded SLAM stacks while matching state-of-the-art localization accuracy.
中文: TAT-VPR是一种三值量化变换器,通过融合三值权重与学习激活稀疏门控,可在视觉SLAM回环检测中实现动态精度-效率平衡,运行时计算量最高减少40%且保持性能不变。
English: TAT-VPR is a ternary-quantized transformer enabling dynamic accuracy-efficiency trade-offs in visual SLAM loop-closure, achieving up to 40% computation reduction without performance loss through ternary weights and activation-sparsity gates.

Authors:Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan
Title: DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
Abstract:
End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $π_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$π_0$. Specifically, we add Vision MoE to Drive-$π_0$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from mode averaging as existing models do. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$π_0$.
中文: DriveMoE提出了一种新颖的专家混合框架,通过动态适配驾驶场景的视觉模块和行为专项化的动作模块,在复杂驾驶环境中实现了最先进的端到端自动驾驶性能。
English: DriveMoE introduces a novel Mixture-of-Experts framework for end-to-end autonomous driving, featuring specialized vision and action modules that dynamically adapt to driving contexts and behaviors, achieving state-of-the-art performance in complex scenarios.

Authors:Shuchang Ye, Usman Naseem, Mingyuan Meng, Dagan Feng, Jinman Kim
Title: MedCFVQA: A Causal Approach to Mitigate Modality Preference Bias in Medical Visual Question Answering
Abstract:
Medical Visual Question Answering (MedVQA) is crucial for enhancing the efficiency of clinical diagnosis by providing accurate and timely responses to clinicians' inquiries regarding medical images. Existing MedVQA models suffer from modality preference bias, where predictions are heavily dominated by one modality while overlooking the other (in MedVQA, questions usually dominate the answer while images are overlooked), thereby failing to learn multimodal knowledge. To overcome the modality preference bias, we propose a Medical CounterFactual VQA (MedCFVQA) model, which trains with bias and leverages causal graphs to eliminate the modality preference bias during inference. Existing MedVQA datasets exhibit substantial prior dependencies between questions and answers, which results in acceptable performance even if the model suffers significantly from the modality preference bias. To address this issue, we reconstructed new datasets by leveraging existing MedVQA datasets and Changed their Prior dependencies (CP) between questions and their answers in the training and test sets. Extensive experiments demonstrate that MedCFVQA significantly outperforms its non-causal counterpart on the SLAKE and RadVQA datasets as well as on the SLAKE-CP and RadVQA-CP datasets.
中文:MedCFVQA模型通过因果推理和重构数据集解决了医学视觉问答中的模态偏好偏差问题,在原始和修改后的基准测试中均表现出卓越性能。
English: The MedCFVQA model overcomes modality preference bias in MedVQA by using causal inference and reconstructed datasets, achieving superior performance on both original and modified benchmarks.

Authors:Jiahuan Long, Wenzhe Zhang, Ning Wang, Tingsong Jiang, Wen Yao
Title: FR-Mamba: Time-Series Physical Field Reconstruction Based on State Space Model
Abstract:
Physical field reconstruction (PFR) aims to predict the state distribution of physical quantities (e.g., velocity, pressure, and temperature) based on limited sensor measurements. It plays a critical role in domains such as fluid dynamics and thermodynamics. However, existing deep learning methods often fail to capture long-range temporal dependencies, resulting in suboptimal performance on time-evolving physical systems. To address this, we propose FR-Mamba, a novel spatiotemporal flow field reconstruction framework based on state space modeling. Specifically, we design a hybrid neural network architecture that combines Fourier Neural Operator (FNO) and State Space Model (SSM) to capture both global spatial features and long-range temporal dependencies. We adopt Mamba, a recently proposed efficient SSM architecture, to model long-range temporal dependencies with linear time complexity. In parallel, the FNO is employed to capture non-local spatial features by leveraging frequency-domain transformations. The spatiotemporal representations extracted by these two components are then fused to reconstruct the full-field distribution of the physical system. Extensive experiments demonstrate that our approach significantly outperforms existing PFR methods in flow field reconstruction tasks, achieving high-accuracy performance on long sequences.
中文: 物理场重建(PFR)旨在基于稀疏传感器数据预测物理量分布,提出的FR-Mamba框架结合FNO和SSM有效捕捉时空依赖关系,在长序列任务中显著超越现有方法的精度表现。
English: Physical field reconstruction (PFR) predicts physical quantity distributions from sparse sensor data, and the proposed FR-Mamba framework combines FNO and SSM to effectively capture spatiotemporal dependencies, significantly outperforming existing methods in accuracy for long sequences.

Authors:Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, Ngai Wong
Title: PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
Abstract:
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy, respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation. More details are available on our project page: https://phyx-bench.github.io/.
Chinese: 现有AI模型在物理推理方面存在显著不足,新型PhyX基准测试显示顶尖模型准确率仅为32.5%-45.8%,远低于人类专家水平,暴露出其过度依赖记忆知识和表面模式匹配而非真正物理理解的缺陷。
English: Current AI models show significant limitations in physical reasoning, as demonstrated by the new PhyX benchmark where top-performing models achieved only 32.5-45.8% accuracy compared to human experts, revealing their over-reliance on memorized knowledge and superficial pattern matching rather than genuine understanding.

Authors:Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, Gordon Wetzstein
Title: Interspatial Attention for Efficient 4D Human Video Generation
Abstract:
Generating photorealistic videos of digital humans in a controllable manner is crucial for a plethora of applications. Existing approaches either build on methods that employ template-based 3D representations or on emerging video generation models, but they suffer from poor quality or limited consistency and identity preservation when generating individual or multiple digital humans. In this paper, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern diffusion transformer (DiT)-based video generation models. ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos. Leveraging a custom-developed video variational autoencoder, we train a latent ISA-based diffusion model on a large corpus of video data. Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation while providing precise control of the camera and body poses. Our code and model are publicly released at https://dsaurus.github.io/isa4d/.
中文: 本文提出了一种空间交叉注意力机制,结合扩散变换器模型,实现了高质量、可控的4D人物视频生成,在动作连贯性和身份保持方面表现卓越。
English: This paper introduces an interspatial attention mechanism integrated into diffusion transformer models to generate high-quality, controllable 4D human videos with superior motion consistency and identity preservation.

Authors:Zhen Sun, Ziyi Zhang, Zeren Luo, Zeyang Sha, Tianshuo Cong, Zheng Li, Shiwen Cui, Weiqiang Wang, Jiaheng Wei, Xinlei He, Qi Li, Qian Wang
Title: FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models
Abstract:
Fine-grained detection of localized edits in images is crucial for assessing content authenticity, especially given that modern diffusion models and image editing methods can produce highly realistic manipulations. However, this domain faces three challenges: (1) Binary classifiers yield only a global real-or-fake label without providing localization; (2) Traditional computer vision methods often rely on costly pixel-level annotations; and (3) No large-scale, high-quality dataset exists for modern image-editing detection techniques. To address these gaps, we develop an automated data-generation pipeline to create FragFake, the first dedicated benchmark dataset for edited image detection, which includes high-quality images from diverse editing models and a wide variety of edited objects. Based on FragFake, we utilize Vision Language Models (VLMs) for the first time in the task of edited image classification and edited region localization. Experimental results show that fine-tuned VLMs achieve higher average Object Precision across all datasets, significantly outperforming pretrained models. We further conduct ablation and transferability analyses to evaluate the detectors across various configurations and editing scenarios. To the best of our knowledge, this work is the first to reformulate localized image edit detection as a vision-language understanding task, establishing a new paradigm for the field. We anticipate that this work will establish a solid foundation to facilitate and inspire subsequent research endeavors in the domain of multimodal content authenticity.
中文: 本研究推出了首个局部图像编辑检测基准数据集FragFake,并开创性地将视觉语言模型应用于编辑检测任务,通过重构为视觉-语言理解问题,在识别篡改内容方面实现了更优性能。
English: This study introduces FragFake, the first benchmark dataset for detecting localized image edits, and pioneers the use of Vision Language Models to reformulate edit detection as a vision-language task, achieving superior performance in identifying manipulated content.

Authors:Die Chen, Zhiwen Li, Cen Chen, Yuexiang Xie, Xiaodan Li, Jinyan Ye, Yingda Chen, Yaliang Li
Title: Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models
Abstract:
Text-to-image diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. While several concept erasure methods have been proposed to mitigate the issue associated with NSFW content, a comprehensive evaluation of their effectiveness across various scenarios remains absent. To bridge this gap, we introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for the effective application of concept erasure methods in various real-world scenarios, with the aim of advancing the understanding of content safety in diffusion models and establishing a solid foundation for future research and development in this critical area.
中文: 本文针对文本到图像扩散模型中的NSFW内容风险,提出了首个全流程概念消除工具包,通过系统研究为不同实际场景提供深入见解和应用指导,以加强模型安全部署。
English: This paper introduces a comprehensive toolkit for evaluating NSFW concept erasure methods in text-to-image diffusion models, providing systematic insights and practical guidance to enhance content safety across diverse applications.

Authors:Ziqiang Xu, Qi Dai, Tian Xie, Yifan Yang, Kai Qiu, DongDong Chen, Zuxuan Wu, Chong Luo
Title: ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning
Abstract:
Video understanding is inherently intention-driven: humans naturally focus on relevant frames based on their goals. Recent advancements in multimodal large language models (MLLMs) have enabled flexible query-driven reasoning; however, video-based frameworks like Video Chain-of-Thought lack direct training signals to effectively identify relevant frames. Current approaches often rely on heuristic methods or pseudo-label supervised annotations, which are both costly and limited in scalability across diverse scenarios. To overcome these challenges, we introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in intention-driven video understanding. An iterated amplification strategy is adopted to perform alternating cyclic training in the video CoT system, where each component undergoes iterative cycles of refinement to improve its capabilities. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error, eliminating the need for expensive annotations while closely aligning with human-like learning processes. Comprehensive experiments across multiple benchmarks, including VideoMME, LVBench, and MLVU, demonstrate that ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks, highlighting its effectiveness and scalability. Notably, ViaRL achieves a nearly 15% improvement on Needle QA, a subset of MLVU that requires searching for a specific needle within a long video and is regarded as one of the most suitable benchmarks for evaluating temporal grounding.
Chinese: ViaRL提出了首个基于规则的强化学习框架,通过以下游模型答案准确度作为奖励信号来优化意图驱动的视频理解中的帧选择,无需昂贵标注即可在多项基准测试中实现卓越的时序定位性能与泛化能力。
English: ViaRL introduces a rule-based reinforcement learning framework that optimizes frame selection for intention-driven video understanding by using downstream answer accuracy as a reward signal, eliminating costly annotations while demonstrating superior performance across multiple benchmarks.

Authors:Xiang Liu, Hong Chen, Xuming Hu, Xiaowen Chu
Title: FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management
Abstract:
Large Language Models (LLMs) are increasingly deployed in multi-turn conversational applications, where the management of the Key-Value (KV) Cache presents a significant bottleneck. The linear growth of the KV Cache with dialogue history imposes substantial computational costs, and existing eviction strategies often degrade performance by repeatedly compressing early conversational context, leading to information loss and context forgetting. This paper introduces FlowKV, a novel multi-turn isolation mechanism for KV Cache management, which can be applied to any KV Cache compression method without training. FlowKV's core innovation is a multi-turn isolation mechanism that preserves the accumulated compressed KV cache from past turns. Compression is then strategically applied only to the newly generated KV pairs of the latest completed turn, effectively preventing the re-compression of older context and thereby mitigating catastrophic forgetting. Our results demonstrate that FlowKV consistently and significantly outperforms baseline strategies in maintaining instruction-following accuracy and user preference retention from 10.90% to 75.40%, particularly in later conversational turns.
Chinese: FlowKV采用多轮隔离机制,仅对新生成的关键值对进行压缩,避免历史对话信息的重复压缩,从而将指令遵循准确率从10.90%提升至75.40%,有效缓解上下文遗忘问题。
English: FlowKV introduces a multi-turn isolation mechanism that prevents re-compression of past dialogue context, significantly improving performance by preserving accumulated KV cache and applying compression only to new turns, boosting accuracy from 10.90% to 75.40%.
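
The difference between re-compressing the whole cache every turn and isolating past turns can be shown with a toy sketch. It is not FlowKV's code: cache entries are plain strings rather than key-value tensors, and `keep_most_recent` is a hypothetical stand-in for whichever compression method is plugged in. All function names here are assumptions made for illustration.

```python
def keep_most_recent(entries, ratio=0.5):
    """Stand-in compressor: keep the most recent `ratio` fraction of entries (at least one)."""
    keep = max(1, int(len(entries) * ratio))
    return entries[-keep:]

def recompress_everything(turns, compress=keep_most_recent):
    """Baseline: after every turn, compress the old cache together with the new entries."""
    cache = []
    for turn in turns:
        cache = compress(cache + turn)  # older context gets compressed again each turn
    return cache

def isolated_compression(turns, compress=keep_most_recent):
    """Isolation: past compressed entries are left untouched; only the new turn is compressed."""
    cache = []
    for turn in turns:
        cache = cache + compress(turn)
    return cache

# Three turns of four cache entries each, named t<turn>_<position>.
turns = [[f"t{i}_{j}" for j in range(4)] for i in range(3)]
baseline = recompress_everything(turns)
isolated = isolated_compression(turns)
print(baseline)
print(isolated)
```

With this stand-in compressor the baseline ends up holding only entries from the last turn, while the isolated variant still holds entries from every turn. The two caches also differ in size here, so the sketch illustrates the forgetting effect rather than a budget-matched comparison.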

Authors:Ji Guo, Xiaolei Wen, Wenbo Jiang, Cheng Huang, Jinjin Li, Hongwei Li
Title: BadSR: Stealthy Label Backdoor Attacks on Image Super-Resolution
Abstract:
With the widespread application of super-resolution (SR) in various fields, researchers have begun to investigate its security. Previous studies have demonstrated that SR models can also be subjected to backdoor attacks through data poisoning, affecting downstream tasks. A backdoor SR model generates an attacker-predefined target image when given a triggered image while producing a normal high-resolution (HR) output for clean images. However, prior backdoor attacks on SR models have primarily focused on the stealthiness of poisoned low-resolution (LR) images while ignoring the stealthiness of poisoned HR images, making it easy for users to detect anomalous data. To address this problem, we propose BadSR, which improves the stealthiness of poisoned HR images. The key idea of BadSR is to approximate the clean HR image and the pre-defined target image in the feature space while ensuring that modifications to the clean HR image remain within a constrained range. The poisoned HR images generated by BadSR can be integrated with existing triggers. To further improve the effectiveness of BadSR, we design an adversarially optimized trigger and a backdoor gradient-driven poisoned sample selection method based on a genetic algorithm. The experimental results show that BadSR achieves a high attack success rate in various models and data sets, significantly affecting downstream tasks.
中文: BadSR通过在特征空间中逼近干净图像与目标图像,并采用优化触发器和选择方法,显著提升了超分辨率模型中中毒高分辨率图像的隐蔽性,实现了跨模型和数据集的较高攻击成功率。
English: BadSR enhances the stealthiness of poisoned high-resolution images in super-resolution models by approximating clean and target images in feature space, using optimized triggers and selection methods to achieve high attack success rates across models and datasets.

Authors:Yuxuan Du, Zhendong Wang, Yuhao Luo, Caiyong Piao, Zhiyuan Yan, Hao Li, Li Yuan
Title: CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation
Abstract:
The rapid emergence of multimodal deepfakes (in which visual and auditory content are manipulated in concert) undermines the reliability of existing detectors that rely solely on modality-specific artifacts or cross-modal inconsistencies. In this work, we first demonstrate that modality-specific forensic traces (e.g., face-swap artifacts or spectral distortions) and modality-shared semantic misalignments (e.g., lip-speech asynchrony) offer complementary evidence, and that neglecting either aspect limits detection performance. Existing approaches either naively fuse modality-specific features without reconciling their conflicting characteristics or focus predominantly on semantic misalignment at the expense of modality-specific fine-grained artifact cues. To address these shortcomings, we propose a general multimodal framework for video deepfake detection via Cross-Modal Alignment and Distillation (CAD). CAD comprises two core components: 1) Cross-modal alignment that identifies inconsistencies in high-level semantic synchronization (e.g., lip-speech mismatches); 2) Cross-modal distillation that mitigates feature conflicts during fusion while preserving modality-specific forensic traces (e.g., spectral distortions in synthetic audio). Extensive experiments on both multimodal and unimodal (e.g., image-only/video-only) deepfake benchmarks demonstrate that CAD significantly outperforms previous methods, validating the necessity of harmonious integration of multimodal complementary information.
中文: 提出的跨模态对齐与蒸馏(CAD)框架通过协调模态特定取证痕迹与跨模态语义对齐,显著提升了多模态深度伪造检测的性能。
English: The proposed Cross-Modal Alignment and Distillation (CAD) framework effectively integrates both modality-specific forensic traces and cross-modal semantic alignment to significantly improve multimodal deepfake detection performance.

Authors:Zhipei Xu, Xuanyu Zhang, Qing Huang, Xing Zhou, Jian Zhang
Title: AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection
Abstract:
Recent advances in Artificial Intelligence Generated Content have led to highly realistic synthetic videos, particularly in human-centric scenarios involving speech, gestures, and full-body motion, posing serious threats to information authenticity and public trust. Unlike DeepFake techniques that focus on localized facial manipulation, human-centric video generation methods can synthesize entire human bodies with controllable movements, enabling complex interactions with environments, objects, and even other people. However, existing detection methods largely overlook the growing risks posed by such full-body synthetic content. Meanwhile, a growing body of research has explored leveraging LLMs for interpretable fake detection, aiming to explain decisions in natural language. Yet these approaches heavily depend on supervised fine-tuning, which introduces limitations such as annotation bias, hallucinated supervision, and weakened generalization. To address these challenges, we propose AvatarShield, a novel multimodal human-centric synthetic video detection framework that eliminates the need for dense textual supervision by adopting Group Relative Policy Optimization, enabling LLMs to develop reasoning capabilities from simple binary labels. Our architecture combines a discrete vision tower for high-level semantic inconsistencies and a residual extractor for fine-grained artifact analysis. We further introduce FakeHumanVid, a large-scale benchmark containing 15K real and synthetic videos across nine state-of-the-art human generation methods driven by text, pose, or audio. Extensive experiments demonstrate that AvatarShield outperforms existing methods in both in-domain and cross-domain settings.
中文: 人工智能生成的人类中心化视频技术进展对信息真实性构成严重威胁,为此提出的AvatarShield框架通过群体相对策略优化突破密集文本监督限制,结合双路径架构在新型基准测试中展现出优越的跨域检测能力。
English: Recent advances in AI-generated human-centric videos pose significant threats to information authenticity, prompting the development of AvatarShield—a novel detection framework that eliminates dense textual supervision by leveraging Group Relative Policy Optimization for robust synthetic content identification.
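
The abstract says AvatarShield learns from simple binary labels through Group Relative Policy Optimization. As a rough illustration of the group-relative idea only (not the authors' training code), the sketch below normalizes the rewards of several sampled answers against their own group. The reward definition (1 for a correct real/fake verdict, 0 otherwise), the use of the population standard deviation, and the function name are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers for one video,
# reward 1.0 when the real/fake verdict matches the binary label.
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average receive a positive advantage and the rest a negative one, so no learned value model or dense textual annotation is needed to produce a training signal.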

Authors:Xinran Wang, Muxi Diao, Yuanzhi Liu, Chunyu Wang, Kongming Liang, Zhanyu Ma, Jun Guo
Title: Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation
Abstract:
Training text-to-image (T2I) models with detailed captions can significantly improve their generation quality. Existing methods often rely on simplistic metrics like caption length to represent the detailness of the caption in the T2I training set. In this paper, we propose a new metric to estimate caption detailness based on two aspects: image coverage rate (ICR), which evaluates whether the caption covers all regions/objects in the image, and average object detailness (AOD), which quantifies the detailness of each object's description. Through experiments on the COCO dataset using ShareGPT4V captions, we demonstrate that T2I models trained on high-ICR and high-AOD captions achieve superior performance on DPG and other benchmarks. Notably, our metric enables more effective data selection: training on only 20% of the full data surpasses both full-dataset training and the length-based selection method, improving alignment and reconstruction ability. These findings highlight the critical role of detail-aware metrics over length-based heuristics in caption selection for T2I tasks.
中文: 使用详细描述训练文生图模型可提升生成质量,本研究提出结合图像覆盖率和平均对象细节度的新指标,仅用20%数据即可超越基于长度的方法,实现更优模型性能。
English: Training text-to-image models with detailed captions enhances generation quality, and this study introduces a new metric combining image coverage rate and average object detailness, which outperforms length-based methods by enabling superior model performance with only 20% of data.

Authors:Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, Jianfeng Gao
Title: Text Generation Beyond Discrete Token Sampling
Abstract:
In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution's rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
中文摘要:混合输入(MoI)是一种无需训练的自回归生成方法,通过将离散标记与贝叶斯估计的连续后验相结合来保留标记分布信息,以可忽略的计算开销显著提升了多个模型在文本质量与推理任务上的表现。
English Summary: Mixture of Inputs (MoI) is a training-free autoregressive generation method that preserves token distribution information by blending discrete tokens with Bayesian-estimated continuous posteriors, enhancing text quality and reasoning across multiple models with minimal overhead.
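
The abstract describes replacing the one-hot input with a posterior expectation that treats the token distribution as the prior and the sampled token as the observation. One simple way to realize that idea is a Dirichlet prior updated with a single observed token; the sketch below uses that construction. The concentration parameter `beta`, the function name, and the toy embedding table are assumptions for illustration, and the paper's exact estimator may differ.

```python
def mixture_of_inputs(probs, sampled_id, embeddings, beta=1.0):
    """Blend the sampled token with the predicted next-token distribution.

    Assumption: a Dirichlet prior with concentration beta * probs plus one
    observed token has posterior mean (beta * probs + one_hot) / (beta + 1).
    """
    weights = [beta * p for p in probs]
    weights[sampled_id] += 1.0            # add the single observation
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(embeddings[0])
    # The new model input is the weighted average of token embeddings.
    mixed = [sum(w * emb[d] for w, emb in zip(weights, embeddings))
             for d in range(dim)]
    return weights, mixed


# Toy vocabulary of three tokens with 2-dimensional embeddings (hypothetical).
probs = [0.7, 0.2, 0.1]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, mixed = mixture_of_inputs(probs, sampled_id=1, embeddings=embeddings)
```

With a large `beta` the input stays close to the full distribution, and as `beta` approaches zero it reduces to the usual one-hot embedding of the sampled token.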

Authors:Chongyang Shi, Sharon Lin, Shuang Song, Jamie Hayes, Ilia Shumailov, Itay Yona, Juliette Pluto, Aneesh Pappu, Christopher A. Choquette-Choo, Milad Nasr, Chawin Sitawarin, Gena Gibson, Andreas Terzis, John "Four" Flynn
Title: Lessons from Defending Gemini Against Indirect Prompt Injections
Abstract:
Gemini is increasingly used to perform tasks on behalf of users, where function-calling and tool-use capabilities enable the model to access user data. Some tools, however, require access to untrusted data, which introduces risk. Adversaries can embed malicious instructions in untrusted data that cause the model to deviate from the user's expectations and mishandle their data or permissions. In this report, we set out Google DeepMind's approach to evaluating the adversarial robustness of Gemini models and describe the main lessons learned from the process. We test how Gemini performs against a sophisticated adversary through an adversarial evaluation framework, which deploys a suite of adaptive attack techniques to run continuously against past, current, and future versions of Gemini. We describe how these ongoing evaluations directly help make Gemini more resilient against manipulation.
中文: Gemini的功能调用能力使其面临不可信数据中恶意指令的风险,为此Google DeepMind通过持续对抗性评估来提升模型抗操纵能力。
English: Gemini's function-calling capabilities expose it to risks from malicious instructions in untrusted data, prompting Google DeepMind to implement continuous adversarial evaluations that enhance the model's resilience against manipulation.

Authors:Zhihao Li, Yufei Wang, Heliang Zheng, Yihao Luo, Bihan Wen
Title: Sparc3D: Sparse Representation and Construction for High-Resolution 3D Shapes Modeling
Abstract:
High-fidelity 3D object synthesis remains significantly more challenging than 2D image generation due to the unstructured nature of mesh data and the cubic complexity of dense volumetric grids. Existing two-stage pipelines, which compress meshes with a VAE (using either 2D or 3D supervision) and then apply latent diffusion sampling, often suffer from severe detail loss caused by inefficient representations and modality mismatches introduced in the VAE. We introduce Sparc3D, a unified framework that combines a sparse deformable marching cubes representation, Sparcubes, with a novel encoder, Sparconv-VAE. Sparcubes converts raw meshes into high-resolution ($1024^3$) surfaces with arbitrary topology by scattering signed distance and deformation fields onto a sparse cube, allowing differentiable optimization. Sparconv-VAE is the first modality-consistent variational autoencoder built entirely upon sparse convolutional networks, enabling efficient and near-lossless 3D reconstruction suitable for high-resolution generative modeling through latent diffusion. Sparc3D achieves state-of-the-art reconstruction fidelity on challenging inputs, including open surfaces, disconnected components, and intricate geometry. It preserves fine-grained shape details, reduces training and inference cost, and integrates naturally with latent diffusion models for scalable, high-resolution 3D generation.
中文:Sparc3D提出了一种结合稀疏可变形立方体和新型稀疏卷积VAE的统一框架,实现了高保真度的三维重建与生成,克服了现有方法中的细节丢失和模态不匹配问题,同时降低了计算成本。
English: Sparc3D introduces a unified framework using sparse deformable cubes and a novel sparse convolutional VAE to achieve high-fidelity 3D reconstruction and generation, overcoming detail loss and modality mismatches in existing methods while reducing computational costs.

Authors:Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn
Title: Byte Pair Encoding for Efficient Time Series Forecasting
Abstract:
Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in substantial computational overhead. Inspired by the success of byte pair encoding, we propose the first pattern-centric tokenization scheme for time series analysis. Based on a discrete vocabulary of frequent motifs, our method merges samples with underlying patterns into tokens, compressing time series adaptively. Exploiting our finite set of motifs and the continuous properties of time series, we further introduce conditional decoding as a lightweight yet powerful post-hoc optimization method, which requires no gradient computation and adds no computational overhead. On recent time series foundation models, our motif-based tokenization improves forecasting performance by 36% and boosts efficiency by 1990% on average. Conditional decoding further reduces MSE by up to 44%. In an extensive analysis, we demonstrate the adaptiveness of our tokenization to diverse temporal patterns, its generalization to unseen data, and its meaningful token representations capturing distinct time series properties, including statistical moments and trends.
中文摘要:该研究提出的基于模式的时序数据分词方法,通过频繁模式自适应压缩数据,将预测性能提升36%、效率提高1990%,而条件解码技术可进一步降低高达44%的误差。
English Summary: The proposed pattern-centric tokenization method for time series adaptively compresses data using frequent motifs, improving forecasting performance by 36% and efficiency by 1990%, while conditional decoding further reduces errors by up to 44%.
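
To make the motif-merging idea concrete, here is a minimal sketch that applies plain byte pair encoding to a quantized series. It is not the authors' tokenizer: the uniform binning in `quantize`, the stopping rule, and all function names are assumptions chosen to keep the example short.

```python
from collections import Counter


def quantize(series, low, high, bins):
    """Map each sample to an integer bin id in [0, bins - 1]."""
    width = (high - low) / bins
    return [min(bins - 1, max(0, int((x - low) / width))) for x in series]


def learn_merges(symbols, num_merges):
    """Repeatedly merge the most frequent adjacent token pair (plain BPE)."""
    tokens = [(s,) for s in symbols]  # each token is a tuple of bin ids (a motif)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so merging would not compress further
            break
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # concatenate the two motifs
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


# An extended constant segment, the case the abstract calls out as wasteful.
series = [0.5] * 8
symbols = quantize(series, low=0.0, high=1.0, bins=4)
tokens, merges = learn_merges(symbols, num_merges=3)
```

In this toy run the eight constant samples collapse into two tokens after two merges, whereas a fixed-length scheme would spend the same number of tokens on the segment regardless of how simple it is.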

Authors:Jingyun Zhang, Hao Peng, Li Sun, Guanlin Wu, Chunyang Liu, Zhengtao Yu
Title: Unsupervised Graph Clustering with Deep Structural Entropy
Abstract:
Research on Graph Structure Learning (GSL) provides key insights for graph-based clustering, yet current methods like Graph Neural Networks (GNNs), Graph Attention Networks (GATs), and contrastive learning often rely heavily on the original graph structure. Their performance deteriorates when the original graph's adjacency matrix is too sparse or contains noisy edges unrelated to clustering. Moreover, these methods depend on learning node embeddings and using traditional techniques like k-means to form clusters, which may not fully capture the underlying graph structure between nodes. To address these limitations, this paper introduces DeSE, a novel unsupervised graph clustering framework incorporating Deep Structural Entropy. It enhances the original graph with quantified structural information and deep neural networks to form clusters. Specifically, we first propose a method for calculating structural entropy with soft assignment, which quantifies structure in a differentiable form. Next, we design a Structural Learning layer (SLL) to generate an attributed graph from the original feature data, serving as a target to enhance and optimize the original structural graph, thereby mitigating the issue of sparse connections between graph nodes. Finally, our clustering assignment method (ASS), based on GNNs, learns node embeddings and a soft assignment matrix to cluster on the enhanced graph. The ASS layer can be stacked to meet downstream task requirements, minimizing structural entropy for stable clustering and maximizing node consistency with edge-based cross-entropy loss. Extensive comparative experiments are conducted on four benchmark datasets against eight representative unsupervised graph clustering baselines, demonstrating the superiority of DeSE in both effectiveness and interpretability.
中文: 本文提出DeSE无监督图聚类框架,通过深度结构熵增强稀疏或含噪的图结构,并采用创新的结构学习与分配方法直接形成聚类,有效解决了现有方法对原始图结构过度依赖的问题。
English: This paper introduces DeSE, an unsupervised graph clustering framework that addresses the limitations of existing methods by incorporating Deep Structural Entropy to enhance sparse or noisy graph structures and directly form clusters through a novel structural learning and assignment approach.

Authors:Kun Li, Zhennan Wu, Shoupeng Wang, Jia Wu, Shirui Pan, Wenbin Hu
Title: DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery
Abstract:
Large language models (LLMs) integrated with autonomous agents hold significant potential for advancing scientific discovery through automated reasoning and task execution. However, applying LLM agents to drug discovery is still constrained by challenges such as large-scale multimodal data processing, limited task automation, and poor support for domain-specific tools. To overcome these limitations, we introduce DrugPilot, an LLM-based agent system with a parameterized reasoning architecture designed for end-to-end scientific workflows in drug discovery. DrugPilot enables multi-stage research processes by integrating structured tool use with a novel parameterized memory pool. The memory pool converts heterogeneous data from both public sources and user-defined inputs into standardized representations. This design supports efficient multi-turn dialogue, reduces information loss during data exchange, and enhances complex scientific decision-making. To support training and benchmarking, we construct a drug instruction dataset covering eight core drug discovery tasks. Under the Berkeley function-calling benchmark, DrugPilot significantly outperforms state-of-the-art agents such as ReAct and LoT, achieving task completion rates of 98.0%, 93.5%, and 64.0% for simple, multi-tool, and multi-turn scenarios, respectively. These results highlight DrugPilot's potential as a versatile agent framework for computational science domains requiring automated, interactive, and data-integrated reasoning.
中文: DrugPilot是一种基于大语言模型的智能体系统,通过参数化推理架构和记忆池设计解决药物发现中的多模态数据处理与任务自动化难题,在多项场景中实现了领先的任务完成率。
English: DrugPilot is an advanced LLM-based agent system designed to overcome drug discovery challenges by integrating a parameterized reasoning architecture and memory pool, achieving superior task completion rates in automated scientific workflows.

Authors:Qi Cheng, Licheng Liu, Qing Zhu, Runlong Yu, Zhenong Jin, Yiqun Xie, Xiaowei Jia
Title: LLM-based Evaluation Policy Extraction for Ecological Modeling
Abstract:
Evaluating ecological time series is critical for benchmarking model performance in many important applications, including predicting greenhouse gas fluxes, capturing carbon-nitrogen dynamics, and monitoring hydrological cycles. Traditional numerical metrics (e.g., R-squared, root mean square error) have been widely used to quantify the similarity between modeled and observed ecosystem variables, but they often fail to capture domain-specific temporal patterns critical to ecological processes. As a result, these methods are often accompanied by expert visual inspection, which requires substantial human labor and limits the applicability to large-scale evaluation. To address these challenges, we propose a novel framework that integrates metric learning with large language model (LLM)-based natural language policy extraction to develop interpretable evaluation criteria. The proposed method processes pairwise annotations and implements a policy optimization mechanism to generate and combine different assessment metrics. The results obtained on multiple datasets for evaluating the predictions of crop gross primary production and carbon dioxide flux have confirmed the effectiveness of the proposed method in capturing target assessment preferences, including both synthetically generated and expert-annotated model comparisons. The proposed framework bridges the gap between numerical metrics and expert knowledge while providing interpretable evaluation policies that accommodate the diverse needs of different ecosystem modeling studies.
中文摘要:本研究提出了一种结合度量学习与大语言模型的新框架,通过可解释的评估标准弥合传统数值指标与专家知识之间的差距,有效捕捉生态系统建模中特定领域的时间模式。
English Summary: This study introduces a novel framework combining metric learning with large language models to develop interpretable ecological evaluation criteria that bridge the gap between traditional numerical metrics and expert knowledge, effectively capturing domain-specific temporal patterns in ecosystem modeling.

Authors:Ruiquan Huang, Donghao Li, Chengshuai Shi, Cong Shen, Jing Yang
Title: Augmenting Online RL with Offline Data is All You Need: A Unified Hybrid RL Algorithm Design and Analysis
Abstract:
This paper investigates a hybrid learning framework for reinforcement learning (RL) in which the agent can leverage both an offline dataset and online interactions to learn the optimal policy. We present a unified algorithm and analysis and show that augmenting confidence-based online RL algorithms with the offline dataset outperforms any pure online or offline algorithm alone and achieves state-of-the-art results under two learning metrics, i.e., sub-optimality gap and online learning regret. Specifically, we show that our algorithm achieves a sub-optimality gap $\tilde{O}(\sqrt{1/(N_0/\mathtt{C}(\pi^*|\rho)+N_1)})$, where $\mathtt{C}(\pi^*|\rho)$ is a new concentrability coefficient, $N_0$ and $N_1$ are the numbers of offline and online samples, respectively. For regret minimization, we show that it achieves a constant $\tilde{O}(\sqrt{N_1/(N_0/\mathtt{C}(\pi^{-}|\rho)+N_1)})$ speed-up compared to pure online learning, where $\mathtt{C}(\pi^{-}|\rho)$ is the concentrability coefficient over all sub-optimal policies. Our results also reveal an interesting separation on the desired coverage properties of the offline dataset for sub-optimality gap minimization and regret minimization. We further validate our theoretical findings in several experiments in special RL models such as linear contextual bandits and Markov decision processes (MDPs).
中文: 本文提出了一种结合离线数据集与在线交互的混合强化学习框架,其算法在次优差距和遗憾最小化方面均优于纯在线或离线方法,并通过理论分析和实验验证了其优越性能。
English: This paper introduces a hybrid reinforcement learning framework that combines offline datasets with online interactions, achieving superior performance over standalone methods and demonstrating state-of-the-art results in sub-optimality gap and regret minimization.
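
To get a feel for the two rates quoted in the abstract, the sketch below evaluates them numerically while ignoring constants and the logarithmic factors hidden by the tilde notation. The function names and the sample sizes are illustrative assumptions, not values from the paper.

```python
import math


def suboptimality_rate(n_offline, n_online, concentrability):
    """sqrt(1 / (N0 / C + N1)), with constants and log factors dropped."""
    return math.sqrt(1.0 / (n_offline / concentrability + n_online))


def regret_factor(n_offline, n_online, concentrability):
    """sqrt(N1 / (N0 / C + N1)); equals 1 when there is no offline data."""
    return math.sqrt(n_online / (n_offline / concentrability + n_online))


# Hypothetical setting: 100 online samples, with and without offline data.
online_only = suboptimality_rate(0, 100, concentrability=1.0)
with_offline = suboptimality_rate(300, 100, concentrability=1.0)
factor = regret_factor(300, 100, concentrability=1.0)
```

Each offline sample counts as 1/C of an online sample in both expressions, so a poorly covering dataset (large concentrability coefficient) contributes little, and with no offline data both expressions reduce to the pure online case.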

Authors:Tanmay Vilas Samak, Chinmay Vilas Samak, Giovanni Martino, Pranav Nair, Venkat Krovi
Title: Digital Twins in the Cloud: A Modular, Scalable and Interoperable Framework for Accelerating Verification and Validation of Autonomous Driving Solutions
Abstract:
Verification and validation (V&V) of autonomous vehicles (AVs) typically requires exhaustive testing across a variety of operating environments and driving scenarios including rare, extreme, or hazardous situations that might be difficult or impossible to capture in reality. Additionally, physical V&V methods such as track-based evaluations or public-road testing are often constrained by time, cost, and safety, which motivates the need for virtual proving grounds. However, the fidelity and scalability of simulation-based V&V methods can quickly turn into a bottleneck. In such a milieu, this work proposes a virtual proving ground that flexibly scales digital twins within high-performance computing clusters (HPCCs) and automates the V&V process. Here, digital twins enable high-fidelity virtual representation of the AV and its operating environments, allowing extensive scenario-based testing. Meanwhile, HPCC infrastructure brings substantial advantages in terms of computational power and scalability, enabling rapid iterations of simulations, processing and storage of massive amounts of data, and deployment of large-scale test campaigns, thereby reducing the time and cost associated with the V&V process. We demonstrate the efficacy of this approach through a case study that focuses on the variability analysis of a candidate autonomy algorithm to identify potential vulnerabilities in its perception, planning, and control sub-systems. The modularity, scalability, and interoperability of the proposed framework are demonstrated by deploying a test campaign comprising 256 test cases on two different HPCC architectures to ensure continuous operation in a publicly shared resource setting. The findings highlight the ability of the proposed framework to accelerate and streamline the V&V process, thereby significantly compressing (~30x) the timeline.
中文: 本研究提出了一种利用数字孪生和高性能计算集群的虚拟试验场,以自动化和扩展自动驾驶汽车的验证与确认过程,显著加快了进度并降低了时间和成本。
English: This study introduces a virtual proving ground that leverages digital twins and high-performance computing clusters to automate and scale the verification and validation of autonomous vehicles, significantly accelerating the process while reducing time and costs.

Authors:Yifeng Jiao, Yuchen Liu, Yu Zhang, Xin Guo, Yushuai Wu, Chen Jiang, Jiyang Li, Hongwei Zhang, Limei Han, Xin Gao, Yuan Qi, Yuan Cheng
Title: ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data
Abstract:
The advent of single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) offers an innovative perspective for deciphering regulatory mechanisms by assembling a vast repository of single-cell chromatin accessibility data. While foundation models have achieved significant success in single-cell transcriptomics, there is currently no foundation model for scATAC-seq that supports zero-shot high-quality cell identification and comprehensive multi-omics analysis simultaneously. Key challenges lie in the high dimensionality and sparsity of scATAC-seq data, as well as the lack of a standardized schema for representing open chromatin regions (OCRs). Here, we present ChromFound, a foundation model tailored for scATAC-seq. ChromFound utilizes a hybrid architecture and genome-aware tokenization to effectively capture genome-wide long contexts and regulatory signals from dynamic chromatin landscapes. Pretrained on 1.97 million cells from 30 tissues and 6 disease conditions, ChromFound demonstrates broad applicability across 6 diverse tasks. Notably, it achieves robust zero-shot performance in generating universal cell representations and exhibits excellent transferability in cell type annotation and cross-omics prediction. By uncovering enhancer-gene links undetected by existing computational methods, ChromFound offers a promising framework for understanding disease risk variants in the noncoding genome.
中文摘要:ChromFound作为首个scATAC-seq基础模型,通过混合架构和基因组感知标记化解决了数据稀疏性难题,在零样本细胞识别和多组学分析中展现出卓越性能,为解码非编码基因组提供了新框架。
English Summary: ChromFound is a foundation model for scATAC-seq that overcomes data sparsity challenges through hybrid architecture and genome-aware tokenization, enabling zero-shot cell identification and multi-omics analysis across diverse biological tasks.

Authors:Tiankai Yang, Junjun Liu, Wingchun Siu, Jiahang Wang, Zhuangzhuang Qian, Chanjuan Song, Cheng Cheng, Xiyang Hu, Yue Zhao
Title: AD-AGENT: A Multi-agent Framework for End-to-end Anomaly Detection
Abstract:
Anomaly detection (AD) is essential in areas such as fraud detection, network monitoring, and scientific research. However, the diversity of data modalities and the increasing number of specialized AD libraries pose challenges for non-expert users who lack in-depth library-specific knowledge and advanced programming skills. To tackle this, we present AD-AGENT, an LLM-driven multi-agent framework that turns natural-language instructions into fully executable AD pipelines. AD-AGENT coordinates specialized agents for intent parsing, data preparation, library and model selection, documentation mining, and iterative code generation and debugging. Using a shared short-term workspace and a long-term cache, the agents integrate popular AD libraries like PyOD, PyGOD, and TSLib into a unified workflow. Experiments demonstrate that AD-AGENT produces reliable scripts and recommends competitive models across libraries. The system is open-sourced to support further research and practical applications in AD.
中文: AD-AGENT是一个基于大语言模型的多智能体框架,可将自然语言指令转化为可执行的异常检测流程,通过协调专业代理和集成常用库,为非专业用户生成可靠脚本并推荐优势模型。
English: AD-AGENT is an LLM-driven multi-agent framework that converts natural-language instructions into executable anomaly detection pipelines, integrating specialized agents and popular libraries to generate reliable scripts and recommend competitive models for non-expert users.

Authors:Marco Valentino, Geonhee Kim, Dhairya Dalal, Zhixue Zhao, André Freitas
Title: Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering
Abstract:
Large language models (LLMs) frequently demonstrate reasoning limitations, often conflating content plausibility (i.e., material inference) with logical validity (i.e., formal inference). This can result in biased inferences, where plausible arguments are incorrectly deemed logically valid or vice versa. Mitigating this limitation is critical, as it undermines the trustworthiness and generalizability of LLMs in applications that demand rigorous logical consistency. This paper investigates the problem of mitigating content biases on formal reasoning through activation steering. Specifically, we curate a controlled syllogistic reasoning dataset to disentangle formal validity from content plausibility. After localising the layers responsible for formal and material inference, we investigate contrastive activation steering methods for test-time interventions. An extensive empirical analysis on different LLMs reveals that contrastive steering consistently supports linear control over content biases. However, we observe that a static approach is insufficient for improving all the tested models. We then leverage the possibility to control content effects by dynamically determining the value of the steering parameters via fine-grained conditional methods. We found that conditional steering is effective on unresponsive models, achieving up to 15% absolute improvement in formal reasoning accuracy with a newly introduced kNN-based method (K-CAST). Finally, additional experiments reveal that steering for content effects is robust to prompt variations, incurs minimal side effects on language modeling capabilities, and can partially generalize to out-of-distribution reasoning tasks. Practically, this paper demonstrates that activation-level interventions can offer a scalable strategy for enhancing the robustness of LLMs, contributing towards more systematic and unbiased formal reasoning.
中文摘要:大型语言模型常将逻辑有效性与内容合理性相混淆,导致推理偏差,而本研究表明通过激活导向技术可有效减轻此类偏差,提升形式推理的准确性。
English Summary: Large language models often confuse logical validity with content plausibility, leading to biased reasoning, but this study shows that activation steering can effectively mitigate these biases and improve formal reasoning accuracy.
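
For readers unfamiliar with contrastive activation steering, the sketch below shows the static variant in its simplest form: a direction is computed as the difference between mean activations of two contrasting example sets and then added to a hidden state at test time. The toy vectors, the choice of which set counts as positive, and the function names are assumptions for illustration. The conditional, kNN-based K-CAST method from the abstract is not reproduced here.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations between the two contrast sets."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical layer activations for formally correct vs. content-biased answers.
positive_acts = [[1.0, 2.0], [3.0, 4.0]]
negative_acts = [[0.0, 0.0], [2.0, 2.0]]
direction = steering_vector(positive_acts, negative_acts)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

A static approach fixes `alpha` for every input; the abstract reports that this is insufficient for some models, which motivates choosing the steering parameters per input.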

Authors:Shuai Yuan, Guowen Xu, Hongwei Li, Rui Zhang, Xinyuan Qian, Wenbo Jiang, Hangcheng Cao, Qingchuan Zhao
Title: FIGhost: Fluorescent Ink-based Stealthy and Flexible Backdoor Attacks on Physical Traffic Sign Recognition
Abstract:
Traffic sign recognition (TSR) systems are crucial for autonomous driving but are vulnerable to backdoor attacks. Existing physical backdoor attacks either lack stealth, provide inflexible attack control, or ignore emerging Vision-Large-Language-Models (VLMs). In this paper, we introduce FIGhost, the first physical-world backdoor attack leveraging fluorescent ink as triggers. Fluorescent triggers are invisible under normal conditions and activated stealthily by ultraviolet light, providing superior stealthiness, flexibility, and untraceability. Inspired by real-world graffiti, we derive realistic trigger shapes and enhance their robustness via an interpolation-based fluorescence simulation algorithm. Furthermore, we develop an automated backdoor sample generation method to support three attack objectives. Extensive evaluations in the physical world demonstrate FIGhost's effectiveness against state-of-the-art detectors and VLMs, maintaining robustness under environmental variations and effectively evading existing defenses.
中文摘要:FIGhost是一种利用荧光墨水作为隐形触发器的新型物理后门攻击方法,通过紫外线激活实现对交通标志识别系统和视觉语言模型的有效攻击,兼具隐蔽性与鲁棒性。
English Summary: FIGhost is a novel physical backdoor attack using invisible fluorescent ink triggers activated by UV light, offering enhanced stealth and robustness against traffic sign recognition systems and Vision-Language Models.

Authors:Dexter Ong, Yuezhan Tao, Varun Murali, Igor Spasojevic, Vijay Kumar, Pratik Chaudhari
Title: Gaussian Splatting as a Unified Representation for Autonomy in Unstructured Environments
Abstract:
In this work, we argue that Gaussian splatting is a suitable unified representation for autonomous robot navigation in large-scale unstructured outdoor environments. Such environments require representations that can capture complex structures while remaining computationally tractable for real-time navigation. We demonstrate that the dense geometric and photometric information provided by a Gaussian splatting representation is useful for navigation in unstructured environments. Additionally, semantic information can be embedded in the Gaussian map to enable large-scale task-driven navigation. From the lessons learned through our experiments, we highlight several challenges and opportunities arising from the use of such a representation for robot autonomy.
中文: 高斯泼溅作为一种统一表示法,通过提供密集的几何、光度和可嵌入语义信息,适用于大规模非结构化户外环境中的自主机器人导航,同时为机器人自主性带来了挑战与机遇。
English: Gaussian splatting serves as an effective unified representation for autonomous robot navigation in large-scale unstructured outdoor environments by providing dense geometric, photometric, and embeddable semantic information, while also presenting challenges and opportunities for robot autonomy.

Authors:Shuchen Wu, Stephan Alaniz, Shyamgopal Karthik, Peter Dayan, Eric Schulz, Zeynep Akata
Title: Concept-Guided Interpretability via Neural Chunking
Abstract:
Neural networks are often described as black boxes, reflecting the significant challenge of understanding their internal workings and interactions. We propose a different perspective that challenges the prevailing view: rather than being inscrutable, neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We refer to this as the Reflection Hypothesis and provide evidence for this phenomenon in both simple recurrent neural networks (RNNs) and complex large language models (LLMs). Building on this insight, we propose to leverage our cognitive tendency of chunking to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts. We propose three methods to extract recurring chunks on a neural population level, complementing each other based on label availability and neural data dimensionality. Discrete sequence chunking (DSC) learns a dictionary of entities in a lower-dimensional neural space; population averaging (PA) extracts recurring entities that correspond to known labels; and unsupervised chunk discovery (UCD) can be used when labels are absent. We demonstrate the effectiveness of these methods in extracting concept-encoding entities agnostic to model architectures. These concepts can be concrete (words), abstract (POS tags), or structural (narrative schema). Additionally, we show that extracted chunks play a causal role in network behavior, as grafting them leads to controlled and predictable changes in the model's behavior. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data to reveal the hidden computations of complex learning systems, gradually transforming them from black boxes into systems we can begin to understand.
中文摘要:反射假说提出神经网络内部活动反映训练数据规律,通过认知分块方法可提取可解释概念来因果影响并理解模型行为,从而突破黑箱范式。
English Summary: The Reflection Hypothesis posits that neural networks' internal activity mirrors patterns in training data, and by applying cognitive chunking methods, interpretable concepts can be extracted to causally influence and understand model behavior, moving beyond the black box paradigm.
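
Illustrative sketch (not the authors' code): population averaging, as the abstract describes it, extracts recurring entities that correspond to known labels. One plausible reading is to average the activity vectors that share a label and then match new activity against those averages. The function names, the cosine-similarity matching step, and the toy vectors below are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

def population_average(activities, labels):
    """Average the activity vectors that share a label (one template per label)."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(activities, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_chunk(vec, templates):
    """Return the label whose averaged template is most similar to vec."""
    return max(templates, key=lambda label: cosine(vec, templates[label]))

# Toy stand-ins for hidden states recorded while the model reads labeled tokens.
activities = [[1.0, 0.0, 0.2], [0.8, 0.1, 0.0], [0.0, 1.0, 0.9], [0.1, 0.9, 1.1]]
labels = ["cat", "cat", "dog", "dog"]
templates = population_average(activities, labels)
print(nearest_chunk([0.9, 0.0, 0.1], templates))
```

In a real setting the vectors would be hidden states of an RNN or LLM; the averaging step stays the same.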

Authors:Rahul Thapa, Qingyang Wu, Kevin Wu, Harrison Zhang, Angela Zhang, Eric Wu, Haotian Ye, Suhana Bedi, Nevin Aresh, Joseph Boen, Shriya Reddy, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou
Title: Disentangling Reasoning and Knowledge in Medical Large Language Models
Abstract:
Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, HuatuoGPT-o1 scores 56.9 on knowledge but only 44.8 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.
中文: 当前医学基准常将事实记忆与推理混为一谈,但通过区分二者发现仅32.8%问题需复杂推理,生物医学模型存在明显性能差距且易受对抗测试影响,我们提出的BioMed-R1模型通过针对性训练解决了这一问题。
English: Current medical benchmarks often conflate factual recall with reasoning, but by separating them, we find only 32.8% of questions require complex reasoning, with biomedical models showing significant performance gaps and vulnerability to adversarial tests, which our proposed BioMed-R1 model addresses through targeted training.

Authors:Nuo Chen, Andre Lin HuiKai, Jiaying Wu, Junyi Hou, Zining Zhang, Qian Wang, Xidong Wang, Bingsheng He
Title: XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration
Abstract:
Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited when it comes to supporting high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, such as conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process not well supported by direct prompting-based paradigms. To address these scenarios, we propose a human-AI collaboration framework for academic paper revision. We first introduce a comprehensive dataset of 7,040 research papers from top-tier venues annotated with over 140,000 instruction-response pairs that reflect realistic, section-level scientific revisions. Building on the dataset, we develop XtraGPT, the first suite of open-source LLMs, designed to provide context-aware, instruction-guided writing assistance, ranging from 1.5B to 14B parameters. Extensive experiments validate that XtraGPT significantly outperforms same-scale baselines and approaches the quality of proprietary systems. Both automated preference assessments and human evaluations confirm the effectiveness of our models in improving scientific drafts.
中文摘要:现有大语言模型在支持高质量学术写作方面存在概念连贯性和迭代修订支持的不足,为此我们开发了基于大规模学术修订数据训练的XtraGPT开源模型系列,实验证明其在科学写作辅助方面显著优于同规模基线模型。
English Summary: Large language models currently fall short in supporting high-quality scientific writing due to limitations in conceptual coherence and iterative revision support, prompting the development of XtraGPT—an open-source LLM suite trained on extensive academic revisions that demonstrates superior performance in scientific writing assistance.

Authors:Jae Myung Kim, Stephan Alaniz, Cordelia Schmid, Zeynep Akata
Title: Feasibility with Language Models for Open-World Compositional Zero-Shot Learning
Abstract:
Humans can easily tell if an attribute (also called state) is realistic, i.e., feasible, for an object, e.g. fire can be hot, but it cannot be wet. In Open-World Compositional Zero-Shot Learning, when all possible state-object combinations are considered as unseen classes, zero-shot predictors tend to perform poorly. Our work focuses on using external auxiliary knowledge to determine the feasibility of state-object combinations. Our Feasibility with Language Model (FLM) is a simple and effective approach that leverages Large Language Models (LLMs) to better comprehend the semantic relationships between states and objects. FLM involves querying an LLM about the feasibility of a given pair and retrieving the output logit for the positive answer. To mitigate potential misguidance of the LLM given that many of the state-object compositions are rare or completely infeasible, we observe that the in-context learning ability of LLMs is essential. We present an extensive study identifying Vicuna and ChatGPT as best performing, and we demonstrate that our FLM consistently improves OW-CZSL performance across all three benchmarks.
Chinese: 本研究提出FLM方法,利用大型语言模型评估状态-对象组合的可行性,通过发挥LLM的上下文学习能力,显著提升了开放世界组合零样本学习的性能。
English: The study introduces FLM, a method using Large Language Models to assess the feasibility of state-object combinations, significantly enhancing performance in Open-World Compositional Zero-Shot Learning by leveraging LLMs' in-context learning capabilities.
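
Illustrative sketch (not the authors' code): the abstract says FLM queries an LLM about a state-object pair and retrieves the output logit for the positive answer. The snippet below assumes the logits for a "yes" and a "no" answer are already available and turns them into a feasibility probability with a two-way softmax. The normalisation against a "no" logit, the threshold, and the toy logit values are assumptions, not details from the paper.

```python
from math import exp

def feasibility_score(yes_logit, no_logit):
    """Two-way softmax over the yes/no logits; returns the probability of 'yes'."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = exp(yes_logit - m)
    e_no = exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def filter_feasible(pair_logits, threshold=0.5):
    """Keep only the (state, object) pairs whose feasibility score reaches the threshold."""
    scores = {pair: feasibility_score(y, n) for pair, (y, n) in pair_logits.items()}
    return {pair: s for pair, s in scores.items() if s >= threshold}

# Hypothetical logits, standing in for what an LLM might return for each pair.
pair_logits = {
    ("hot", "fire"): (4.0, 1.0),
    ("wet", "fire"): (0.5, 3.5),
}
feasible = filter_feasible(pair_logits)
print(sorted(feasible))
```

Scores like these could then be used to down-weight infeasible compositions in an open-world label space.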

Authors:Yixin Wan, Kai-Wei Chang
Title: CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback
Abstract:
State-of-the-art T2I models are capable of generating high-resolution images given textual prompts. However, they still struggle with accurately depicting compositional scenes that specify multiple objects, attributes, and spatial relations. We present CompAlign, a challenging benchmark with an emphasis on assessing the depiction of 3D-spatial relationships, for evaluating and improving models on compositional image generation. CompAlign consists of 900 complex multi-subject image generation prompts that combine numerical and 3D-spatial relationships with varied attribute bindings. Our benchmark is remarkably challenging, incorporating generation tasks with 3+ generation subjects with complex 3D-spatial relationships. Additionally, we propose CompQuest, an interpretable and accurate evaluation framework that decomposes complex prompts into atomic sub-questions, then utilizes an MLLM to provide fine-grained binary feedback on the correctness of each aspect of generation elements in model-generated images. This enables precise quantification of alignment between generated images and compositional prompts. Furthermore, we propose an alignment framework that uses CompQuest's feedback as preference signals to improve diffusion models' compositional image generation abilities. Using adjustable per-image preferences, our method is easily scalable and flexible for different tasks. Evaluation of 9 T2I models reveals that: (1) models struggle remarkably more with compositional tasks with more complex 3D-spatial configurations, and (2) a noticeable performance gap exists between open-source accessible models and closed-source commercial models. Further empirical study on using CompAlign for model alignment yields promising results: post-alignment diffusion models achieve remarkable improvements in compositional accuracy, especially on complex generation tasks, outperforming previous approaches.
中文:CompAlign是一个专注于评估和改进文本到图像模型在复杂组合场景生成能力的基准数据集,而CompQuest则通过可解释的评估框架为生成图像提供细粒度反馈,并利用偏好信号有效提升扩散模型的组合生成性能。
English: CompAlign is a challenging benchmark designed to evaluate and improve text-to-image models' ability to generate complex compositional scenes with multiple objects and 3D-spatial relationships, while CompQuest provides an interpretable evaluation framework that enables precise alignment assessment and model improvement through preference-based feedback.
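
Illustrative sketch (not the authors' code): CompQuest is described as decomposing a prompt into atomic sub-questions, collecting binary feedback per sub-question, and using that feedback as a preference signal. The snippet below assumes the binary answers are already available and shows one simple way to aggregate them into a score and a preference pair. The sub-questions, the averaging rule, and the tie handling are assumptions.

```python
def alignment_score(answers):
    """Fraction of atomic sub-questions answered 'yes' (True) for one image."""
    return sum(1 for ok in answers.values() if ok) / len(answers)

def preference_pair(feedback):
    """Return (preferred_id, rejected_id) from per-image feedback, or None on a tie."""
    scores = {image_id: alignment_score(ans) for image_id, ans in feedback.items()}
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    if scores[best] == scores[worst]:
        return None
    return best, worst

# Hypothetical sub-questions for the prompt "two red cubes behind a blue sphere".
# The True/False values stand in for binary answers an MLLM might give.
feedback = {
    "image_a": {"two cubes?": True, "cubes are red?": True, "cubes behind sphere?": True},
    "image_b": {"two cubes?": True, "cubes are red?": False, "cubes behind sphere?": False},
}
print(preference_pair(feedback))
```

Pairs produced this way have the (preferred, rejected) shape that preference-based fine-tuning methods typically consume.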

Authors:Zihan Wang, Hongwei Li, Rui Zhang, Yu Liu, Wenbo Jiang, Wenshu Fan, Qingchuan Zhao, Guowen Xu
Title: MPMA: Preference Manipulation Attack Against Model Context Protocol
Abstract:
Model Context Protocol (MCP) standardizes interface mapping for large language models (LLMs) to access external data and tools, which revolutionizes the paradigm of tool selection and facilitates the rapid expansion of the LLM agent tool ecosystem. However, as the MCP is increasingly adopted, third-party customized versions of the MCP server expose potential security vulnerabilities. In this paper, we first introduce a novel security threat, which we term the MCP Preference Manipulation Attack (MPMA). An attacker deploys a customized MCP server to manipulate LLMs, causing them to prioritize it over other competing MCP servers. This can result in economic benefits for attackers, such as revenue from paid MCP services or advertising income generated from free servers. To achieve MPMA, we first design a Direct Preference Manipulation Attack (DPMA) that achieves significant effectiveness by inserting manipulative words and phrases into the tool name and description. However, such a direct modification is obvious to users and lacks stealthiness. To address these limitations, we further propose the Genetic-based Advertising Preference Manipulation Attack (GAPMA). GAPMA employs four commonly used strategies to initialize descriptions and integrates a Genetic Algorithm (GA) to enhance stealthiness. The experimental results demonstrate that GAPMA balances high effectiveness and stealthiness. Our study reveals a critical vulnerability of the MCP in open ecosystems, highlighting an urgent need for robust defense mechanisms to ensure the fairness of the MCP ecosystem.
中文: 模型上下文协议(MCP)为大型语言模型接入外部工具提供标准化接口,但其定制化服务器存在被操纵的安全漏洞,需建立防御机制以维护生态公平性。
English: The Model Context Protocol (MCP) enables LLMs to access external tools but faces security risks from manipulated servers that prioritize malicious services, requiring urgent defenses to ensure ecosystem fairness.

Authors:Ping He, Yuhao Mao, Changjiang Li, Lorenzo Cavallaro, Ting Wang, Shouling Ji
Title: On the Security Risks of ML-based Malware Detection Systems: A Survey
Abstract:
Malware presents a persistent threat to user privacy and data integrity. To combat this, machine learning-based (ML-based) malware detection (MD) systems have been developed. However, these systems have increasingly been attacked in recent years, undermining their effectiveness in practice. While the security risks associated with ML-based MD systems have garnered considerable attention, most prior work is limited to adversarial malware examples and lacks a comprehensive analysis of practical security risks. This paper addresses this gap by utilizing the CIA principles to define the scope of security risks. We then deconstruct ML-based MD systems into distinct operational stages, thus developing a stage-based taxonomy. Utilizing this taxonomy, we summarize the technical progress and discuss the gaps in the attack and defense proposals related to the ML-based MD systems within each stage. Subsequently, we conduct two case studies, using both inter-stage and intra-stage analyses according to the stage-based taxonomy to provide new empirical insights. Based on these analyses and insights, we suggest potential future directions from both inter-stage and intra-stage perspectives.
Chinese: 本文通过运用CIA原则界定基于机器学习的恶意软件检测系统的安全风险,构建分阶段分类法分析技术差距,并借助案例研究提供实证见解,从而提出未来研究方向。
English: This paper addresses the limitations of prior research by using the CIA principles to define security risks in ML-based malware detection systems, developing a stage-based taxonomy to analyze technical gaps, and providing empirical insights through case studies to suggest future directions.

Authors:Ximing Dong, Shaowei Wang, Dayi Lin, Ahmed E. Hassan
Title: Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization
Abstract:
Optimizing Large Language Model (LLM) performance requires well-crafted prompts, but manual prompt engineering is labor-intensive and often ineffective. Automated prompt optimization techniques address this challenge but the majority of them rely on randomly selected evaluation subsets, which fail to represent the full dataset, leading to unreliable evaluations and suboptimal prompts. Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization due to challenges in clustering similar samples, high data collection costs, and the unavailability of performance data for new or private datasets. To overcome these issues, we propose IPOMP, an Iterative evaluation data selection for effective Prompt Optimization using real-time Model Performance. IPOMP is a two-stage approach that selects representative and diverse samples using semantic clustering and boundary analysis, followed by iterative refinement with real-time model performance data to replace redundant samples. Evaluations on the BIG-bench dataset show that IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57% compared with SOTA baselines, with minimal computational overhead below 1%. Furthermore, the results demonstrate that our real-time performance-guided refinement approach can be universally applied to enhance existing coreset selection methods.
中文: IPOMP通过语义聚类和实时模型性能的两阶段方法优化提示,显著提升了效果和稳定性,且计算开销极低。
English: IPOMP introduces a two-stage method using semantic clustering and real-time model performance to optimize prompts, significantly enhancing effectiveness and stability with minimal computational cost.
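
Illustrative sketch (not the authors' code): the second stage of IPOMP is described as iterative refinement that uses real-time model performance to replace redundant samples. The snippet below shows one simplified reading of that idea. Each evaluation sample has a vector recording whether the model answered it correctly under each candidate prompt; a sample counts as redundant if its vector agrees too closely with an already kept sample, and it is swapped for the pool sample that behaves most differently. The agreement measure, the threshold, and the replacement rule are assumptions.

```python
def agreement(a, b):
    """Fraction of candidate prompts on which two samples had the same outcome."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def refine_selection(selected, pool, performance, threshold=0.9):
    """Replace redundant selected samples with pool samples that behave differently."""
    refined = []
    remaining = list(pool)
    for sample in selected:
        redundant = any(
            agreement(performance[sample], performance[kept]) >= threshold
            for kept in refined
        )
        if redundant and remaining:
            # Choose the pool sample whose closest kept neighbour is least similar.
            replacement = min(
                remaining,
                key=lambda c: max(agreement(performance[c], performance[k]) for k in refined),
            )
            remaining.remove(replacement)
            refined.append(replacement)
        else:
            refined.append(sample)
    return refined

# Hypothetical outcomes (1 = correct, 0 = wrong) under four candidate prompts.
performance = {
    "s1": [1, 1, 0, 0], "s2": [1, 1, 0, 0], "s3": [0, 1, 1, 0],
    "p1": [1, 1, 0, 1], "p2": [0, 0, 1, 1],
}
refined = refine_selection(["s1", "s2", "s3"], ["p1", "p2"], performance)
print(refined)
```

Here `s2` duplicates the behaviour of `s1`, so it is replaced by `p2`, the pool sample that disagrees with `s1` on every prompt.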

Authors:Manthan Patel, Fan Yang, Yuheng Qiu, Cesar Cadena, Sebastian Scherer, Marco Hutter, Wenshan Wang
Title: TartanGround: A Large-Scale Dataset for Ground Robot Perception and Navigation
Abstract:
We present TartanGround, a large-scale, multi-modal dataset to advance the perception and autonomy of ground robots operating in diverse environments. This dataset, collected in various photorealistic simulation environments, includes multiple RGB stereo cameras for 360-degree coverage, along with depth, optical flow, stereo disparity, LiDAR point clouds, ground truth poses, semantic segmented images, and occupancy maps with semantic labels. Data is collected using an integrated automatic pipeline, which generates trajectories mimicking the motion patterns of various ground robot platforms, including wheeled and legged robots. We collect 910 trajectories across 70 environments, resulting in 1.5 million samples. Evaluations on occupancy prediction and SLAM tasks reveal that state-of-the-art methods trained on existing datasets struggle to generalize across diverse scenes. TartanGround can serve as a testbed for training and evaluation of a broad range of learning-based tasks, including occupancy prediction, SLAM, neural scene representation, perception-based navigation, and more, enabling advancements in robotic perception and autonomy towards achieving robust models generalizable to more diverse scenarios. The dataset and codebase are available on the webpage: https://tartanair.org/tartanground
中文摘要:TartanGround是一个大规模多模态数据集,旨在提升地面机器人在多样化环境中的感知与自主能力,包含丰富的传感器数据;评估显示,基于现有数据集训练的先进方法在占据预测和SLAM等任务中难以泛化到多样化场景。
English Summary: TartanGround is a large-scale, multi-modal dataset designed to enhance ground robot perception and autonomy across diverse environments. It features extensive sensor data, and evaluations on occupancy prediction and SLAM show that state-of-the-art methods trained on existing datasets struggle to generalize across diverse scenes.

Authors:Kirill Vasilevski, Benjamin Rombaut, Gopi Krishnan Rajbahadur, Gustavo A. Oliva, Keheliya Gallaba, Filipe R. Cogo, Jiahuei Lin, Dayi Lin, Haoxiang Zhang, Bouyan Chen, Kishanthan Thangarajah, Ahmed E. Hassan, Zhen Ming Jiang
Title: The Hitchhikers Guide to Production-ready Trustworthy Foundation Model powered Software (FMware)
Abstract:
Foundation Models (FMs) such as Large Language Models (LLMs) are reshaping the software industry by enabling FMware, systems that integrate these FMs as core components. In this KDD 2025 tutorial, we present a comprehensive exploration of FMware that combines a curated catalogue of challenges with real-world production concerns. We first discuss the state of research and practice in building FMware. We further examine the difficulties in selecting suitable models, aligning high-quality domain-specific data, engineering robust prompts, and orchestrating autonomous agents. We then address the complex journey from impressive demos to production-ready systems by outlining issues in system testing, optimization, deployment, and integration with legacy software. Drawing on our industrial experience and recent research in the area, we provide actionable insights and a technology roadmap for overcoming these challenges. Attendees will gain practical strategies to enable the creation of trustworthy FMware in the evolving technology landscape.
中文: 基础模型通过FMware正在重塑软件开发,其在模型选择、数据对齐和生产部署方面面临诸多挑战,需要战略性的解决方案来构建可靠系统。
English: Foundation Models are revolutionizing software development through FMware, which faces challenges in model selection, data alignment, and production deployment, requiring strategic solutions for building reliable systems.

Authors:Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, Swarnadeep Saha
Title: J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
Abstract:
The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model. We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.
中文: J1强化学习框架通过统一的验证奖励机制优化LLM裁判的思维链推理,在仅使用合成数据训练的情况下,使模型在多项基准测试中达到领先性能。
English: The J1 reinforcement learning framework enhances LLM judges by optimizing their chain-of-thought reasoning through a unified verifiable reward system, achieving state-of-the-art performance across benchmarks with models trained solely on synthetic data.

Authors:Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, Swarnadeep Saha
Title: J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
Abstract:
The progress of AI is bottlenecked by the quality of evaluation, making powerful LLM-as-a-Judge models a core solution. The efficacy of these judges depends on their chain-of-thought reasoning, creating a critical need for methods that can effectively optimize this reasoning process. In this work, we introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards, enabling direct optimization of evaluation quality while mitigating positional bias. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance across multiple benchmarks. In particular, J1-Qwen-32B, our multitasked pointwise and pairwise judge also outperforms o1-mini, o3, and a much larger 671B DeepSeek-R1 on some benchmarks, while only training on synthetic data. Through comprehensive ablations of pairwise, pointwise, and multitask J1 variants, we demonstrate the effectiveness of our approach across seed prompts, reward strategies, and training recipes. Qualitative analysis reveals that J1 develops systematic evaluation strategies, including dynamic criteria generation, reference answer creation, iterative self-correction of initial assessments, and feedback generation for low-quality responses.
中文: J1强化学习框架通过统一的验证奖励机制优化LLM裁判的思维链推理,在仅使用合成数据训练的情况下,使模型在多项基准测试中达到领先性能。
English: The J1 reinforcement learning framework enhances LLM judges by optimizing their chain-of-thought reasoning through a unified verifiable reward system, achieving state-of-the-art performance across benchmarks with models trained solely on synthetic data.
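
Illustrative sketch (not the authors' code): the abstract states that J1 turns judgment tasks into a format with verifiable rewards while mitigating positional bias. One simple way to build such a reward for pairwise judging is to query the judge in both presentation orders and grant the reward only when both verdicts pick the known-better response. The function names, the "first"/"second" verdict format, and the toy judges below are assumptions made for illustration.

```python
def position_consistent_reward(judge, response_x, response_y, gold_id):
    """Return 1.0 only if the judge picks the gold response in both presentation orders."""
    orders = [
        (("x", response_x), ("y", response_y)),
        (("y", response_y), ("x", response_x)),
    ]
    for first, second in orders:
        pick = judge(first[1], second[1])  # hypothetical judge returns "first" or "second"
        picked_id = first[0] if pick == "first" else second[0]
        if picked_id != gold_id:
            return 0.0
    return 1.0

def always_first(a, b):
    """Toy judge with pure positional bias: always prefers the first response."""
    return "first"

def longer_is_better(a, b):
    """Toy judge that ignores position and prefers the longer response."""
    return "first" if len(a) >= len(b) else "second"

print(position_consistent_reward(always_first, "short", "a longer answer", "y"))
print(position_consistent_reward(longer_is_better, "short", "a longer answer", "y"))
```

A purely position-driven judge can never earn this reward, because it contradicts itself once the order is swapped.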

Authors:Shuning Zhang, Jingruo Chen, Zhiqi Gao, Jiajing Gao, Xin Yi, Hewu Li
Title: Characterizing Unintended Consequences in Human-GUI Agent Collaboration for Web Browsing
Abstract:
The proliferation of Large Language Model (LLM)-based Graphical User Interface (GUI) agents in web browsing scenarios presents complex unintended consequences (UCs). This paper characterizes three UCs from three perspectives: phenomena, influence and mitigation, drawing on social media analysis (N=221 posts) and semi-structured interviews (N=14). Key phenomena of UCs include agents' deficiencies in comprehending instructions and planning tasks, challenges in executing accurate GUI interactions and adapting to dynamic interfaces, the generation of unreliable or misaligned outputs, and shortcomings in error handling and feedback processing. These phenomena manifest as influences ranging from unanticipated actions and user frustration, to privacy violations and security vulnerabilities, and further to eroded trust and wider ethical concerns. Our analysis also identifies user-initiated mitigation, such as technical adjustments and manual oversight, and provides implications for designing future LLM-based GUI agents that are robust, user-centric, and transparent, fostering a crucial balance between automation and human oversight.
中文: 本文揭示了基于大语言模型的图形界面代理在网络浏览中的意外后果,包括操作缺陷和伦理风险,并提出了用户主导的缓解措施及未来代理的设计改进方向。
English: This paper identifies unintended consequences of LLM-based GUI agents in web browsing, including operational deficiencies and ethical risks, and suggests user-led mitigations and design improvements for future agents.

Authors:Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, Guanghui Ren
Title: EnerVerse-AC: Envisioning Embodied Environments with Action Condition
Abstract:
Robotic imitation learning has advanced from solving static tasks to addressing dynamic interaction scenarios, but testing and evaluation remain costly and challenging due to the need for real-time interaction with dynamic environments. We propose EnerVerse-AC (EVAC), an action-conditional world model that generates future visual observations based on an agent's predicted actions, enabling realistic and controllable robotic inference. Building on prior architectures, EVAC introduces a multi-level action-conditioning mechanism and ray map encoding for dynamic multi-view image generation while expanding training data with diverse failure trajectories to improve generalization. As both a data engine and evaluator, EVAC augments human-collected trajectories into diverse datasets and generates realistic, action-conditioned video observations for policy testing, eliminating the need for physical robots or complex simulations. This approach significantly reduces costs while maintaining high fidelity in robotic manipulation evaluation. Extensive experiments validate the effectiveness of our method. Code, checkpoints, and datasets can be found at .
中文:EnerVerse-AC (EVAC) 是一种动作条件世界模型,能根据预测动作生成逼真的视觉观测,通过增强数据集和产生可控视频预测,无需物理机器人或复杂仿真即可实现经济高效的策略测试与评估。
English: EnerVerse-AC (EVAC) is an action-conditional world model that generates realistic visual observations for robotic inference, enabling cost-effective policy testing and evaluation without physical robots or complex simulations by augmenting datasets and producing controllable video predictions.

Authors:Yingjie Ma, Xun Lin, Zitong Yu, Xin Liu, Xiaochen Yuan, Weicheng Xie, Linlin Shen
Title: Denoising and Alignment: Rethinking Domain Generalization for Multimodal Face Anti-Spoofing
Abstract:
Face Anti-Spoofing (FAS) is essential for the security of facial recognition systems in diverse scenarios such as payment processing and surveillance. Current multimodal FAS methods often struggle with effective generalization, mainly due to modality-specific biases and domain shifts. To address these challenges, we introduce the Multimodal Denoising and Alignment (MMDA) framework. By leveraging the zero-shot generalization capability of CLIP, the MMDA framework effectively suppresses noise in multimodal data through denoising and alignment mechanisms, thereby significantly enhancing the generalization performance of cross-modal alignment. The Modality-Domain Joint Differential Attention (MD2A) module in MMDA concurrently mitigates the impacts of domain and modality noise by refining the attention mechanism based on extracted common noise features. Furthermore, the Representation Space Soft (RS2) Alignment strategy utilizes the pre-trained CLIP model to align multi-domain multimodal data into a generalized representation space in a flexible manner, preserving intricate representations and enhancing the model's adaptability to various unseen conditions. We also design a U-shaped Dual Space Adaptation (U-DSA) module to enhance the adaptability of representations while maintaining generalization performance. These improvements not only enhance the framework's generalization capabilities but also boost its ability to represent complex representations. Our experimental results on four benchmark datasets under different evaluation protocols demonstrate that the MMDA framework outperforms existing state-of-the-art methods in terms of cross-domain generalization and multimodal detection accuracy. The code will be released soon.
中文: MMDA框架通过多模态去噪和对齐机制,利用CLIP的零样本泛化能力,有效提升了人脸防伪系统的跨域泛化性能,在多项基准测试中表现优异。
English: The MMDA framework enhances face anti-spoofing generalization by integrating multimodal denoising and alignment mechanisms, leveraging CLIP's capabilities to outperform existing methods across diverse scenarios.

Authors:Xuefeng Jiang, Yuan Ma, Pengxiang Li, Leimeng Xu, Xin Wen, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, Sheng Sun
Title: TransDiffuser: Diverse Trajectory Generation with Decorrelated Multi-modal Representation for End-to-end Autonomous Driving
Abstract:
In recent years, diffusion models have demonstrated remarkable potential across diverse domains, from vision generation to language modeling. Transferring their generative capabilities to modern end-to-end autonomous driving systems has also emerged as a promising direction. However, existing diffusion-based trajectory generative models often exhibit mode collapse, where different random noises converge to similar trajectories after the denoising process. Therefore, state-of-the-art models often rely on anchored trajectories from a pre-defined trajectory vocabulary or scene priors in the training set to mitigate collapse and enrich the diversity of generated trajectories, but such inductive biases are not available in real-world deployment, which can hinder generalization to unseen scenarios. In this work, we investigate the possibility of effectively tackling the mode collapse challenge without the assumption of a pre-defined trajectory vocabulary or pre-computed scene priors. Specifically, we propose TransDiffuser, an encoder-decoder based generative trajectory planning model, where the encoded scene information and motion states serve as the multi-modal conditional input of the denoising decoder. Different from existing approaches, we exploit a simple yet effective multi-modal representation decorrelation optimization mechanism during the denoising process to enrich the latent representation space, which better guides the downstream generation. Without any predefined trajectory anchors or pre-computed scene priors, TransDiffuser achieves a PDMS of 94.85 on the closed-loop planning-oriented benchmark NAVSIM, surpassing previous state-of-the-art methods. Qualitative evaluation further shows that TransDiffuser generates more diverse and plausible trajectories that explore more of the drivable area.
中文摘要:本文提出TransDiffuser模型,通过多模态表征解相关优化机制有效解决扩散模型在自动驾驶轨迹生成中的模式崩溃问题,无需预定义轨迹库或场景先验知识,在NAVSIM基准测试中表现超越现有最优方法。
English Summary: This paper introduces TransDiffuser, a novel trajectory planning model that effectively addresses mode collapse in diffusion-based autonomous driving systems without relying on predefined trajectory anchors or scene priors, achieving state-of-the-art performance on benchmark tests.
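
Illustrative sketch (not the authors' code): the abstract mentions a multi-modal representation decorrelation mechanism but gives no formula. A common way to express decorrelation is to penalise the squared off-diagonal entries of the correlation matrix computed over a batch of feature vectors. The snippet below implements that generic penalty; whether TransDiffuser uses this exact form is an assumption.

```python
from math import sqrt

def decorrelation_penalty(batch):
    """Sum of squared off-diagonal correlations between feature dimensions."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in batch]
    stds = [sqrt(sum(r[j] ** 2 for r in centered) / n) for j in range(d)]
    penalty = 0.0
    for i in range(d):
        for j in range(d):
            # Skip the diagonal and any constant (zero-variance) dimension.
            if i != j and stds[i] > 0 and stds[j] > 0:
                cov = sum(r[i] * r[j] for r in centered) / n
                penalty += (cov / (stds[i] * stds[j])) ** 2
    return penalty

# Toy batches: the first has perfectly correlated dimensions, the second has none.
correlated = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
uncorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
print(decorrelation_penalty(correlated), decorrelation_penalty(uncorrelated))
```

Adding a term like this to a training loss pushes feature dimensions to carry less redundant information, which matches the stated goal of enriching the latent representation space.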

Authors:Panqi Chen, Yifan Sun, Lei Cheng, Yang Yang, Weichang Li, Yang Liu, Weiqing Liu, Jiang Bian, Shikai Fang
Title: Generating Full-field Evolution of Physical Dynamics from Irregular Sparse Observations
Abstract:
Modeling and reconstructing multidimensional physical dynamics from sparse and off-grid observations presents a fundamental challenge in scientific research. Recently, diffusion-based generative modeling has shown promising potential for physical simulation. However, current approaches typically operate on on-grid data with preset spatiotemporal resolution and struggle with the sparsely observed and continuous nature of real-world physical dynamics. To fill the gaps, we present SDIFT, Sequential DIffusion in Functional Tucker space, a novel framework that generates full-field evolution of physical dynamics from irregular sparse observations. SDIFT leverages the functional Tucker model as the latent space representer with proven universal approximation property, and represents observations as latent functions and Tucker core sequences. We then construct a sequential diffusion model with temporally augmented UNet in the functional Tucker space, denoising noise drawn from a Gaussian process to generate the sequence of core tensors. At the posterior sampling stage, we propose a Message-Passing Posterior Sampling mechanism, enabling conditional generation of the entire sequence guided by observations at limited time steps. We validate SDIFT on three physical systems spanning astronomical (supernova explosions, light-year scale), environmental (ocean sound speed fields, kilometer scale), and molecular (organic liquid, millimeter scale) domains, demonstrating significant improvements in both reconstruction accuracy and computational efficiency compared to state-of-the-art approaches.
中文: SDIFT框架通过在函数Tucker空间中构建序列扩散模型,有效解决了从稀疏不规则观测数据重建多维物理动态的难题,并在多种物理系统中实现了比现有方法更优的重建精度和计算效率。
English: The SDIFT framework addresses the challenge of reconstructing multidimensional physical dynamics from sparse, irregular observations by employing a sequential diffusion model in functional Tucker space, achieving superior accuracy and efficiency across diverse physical systems.
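
Illustrative sketch (not the authors' code): a functional Tucker model represents a field as a small core tensor combined with latent functions of each continuous coordinate, so the field can be evaluated at any off-grid point. The snippet below shows the two-mode case, f(x, y) = sum over a, b of G[a][b] * u_a(x) * v_b(y). The hand-written latent functions and the core values are assumptions; in SDIFT the latent functions are learned and the cores are generated by the diffusion model.

```python
def tucker_value(core, u_funcs, v_funcs, x, y):
    """Evaluate a two-mode functional Tucker model at a continuous point (x, y)."""
    return sum(
        core[a][b] * u_funcs[a](x) * v_funcs[b](y)
        for a in range(len(u_funcs))
        for b in range(len(v_funcs))
    )

# Hypothetical latent functions: a constant and a linear term per mode.
u_funcs = [lambda x: 1.0, lambda x: x]
v_funcs = [lambda y: 1.0, lambda y: y]

# Hypothetical 2x2 core, so f(x, y) = 1 + 2y + 3x + 4xy.
core = [[1.0, 2.0], [3.0, 4.0]]

value = tucker_value(core, u_funcs, v_funcs, 0.5, 0.25)
print(value)
```

Because only the core changes from one time step to the next, a sequence of small cores is enough to describe how the whole field evolves.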

Authors:Yanru An, Ling Gui, Chunlei Cai, Tianxiao Ye, Jiangchao Yao, Guangtao Zhai, Qiang Hu, Xiaoyun Zhang
Title: Instance-aware Image Colorization with Controllable Textual Descriptions and Segmentation Masks
Abstract:
Recently, the application of deep learning in image colorization has received widespread attention. The maturation of diffusion models has further advanced the development of image colorization models. However, current mainstream image colorization models still face issues such as color bleeding and color binding errors, and cannot colorize images at the instance level. In this paper, we propose a diffusion-based colorization method, MT-Color, to achieve precise instance-aware colorization with user-provided guidance. To tackle the color bleeding issue, we design a pixel-level mask attention mechanism that integrates latent features and conditional gray image features through cross-attention. We use segmentation masks to construct cross-attention masks, preventing pixel information from being exchanged between different instances. We also introduce an instance mask and text guidance module that extracts instance masks and text representations of each instance, which are then fused with latent features through self-attention; the instance masks form self-attention masks that prevent instance texts from guiding the colorization of other areas, thus mitigating color binding errors. Furthermore, we apply a multi-instance sampling strategy, which involves sampling each instance region separately and then fusing the results. Additionally, we have created a specialized dataset for instance-level colorization tasks, GPT-color, by leveraging large visual language models on existing image datasets. Qualitative and quantitative experiments show that our model and dataset outperform previous methods and datasets.
中文: 本文提出MT-Color方法,通过像素级掩码注意力机制和实例掩码引导实现精确的实例感知图像着色,有效解决色彩渗透和绑定错误问题,实验证明其性能优于现有方法。
English: This paper introduces MT-Color, a diffusion-based method that achieves precise instance-aware image colorization by implementing a pixel-level mask attention mechanism and instance mask guidance to prevent color bleeding and binding errors, with experimental results demonstrating superior performance over existing approaches.

Authors:Bingyu Gao, Mengyu Yao, Ziming Wang, Dong Liu, Ding Li, Xiangqun Chen, Yao Guo
Title: GroupTuner: Efficient Group-Aware Compiler Auto-Tuning
Abstract:
Modern compilers typically provide hundreds of options to optimize program performance, but users often cannot fully leverage them due to the huge number of options. While standard optimization combinations (e.g., -O3) provide reasonable defaults, they often fail to deliver near-peak performance across diverse programs and architectures. To address this challenge, compiler auto-tuning techniques have emerged to automate the discovery of improved option combinations. Existing techniques typically focus on identifying critical options and prioritizing them during the search to improve efficiency. However, due to limited tuning iterations, the resulting data is often sparse and noisy, making it highly challenging to accurately identify critical options. As a result, these algorithms are prone to being trapped in local optima. To address this limitation, we propose GroupTuner, a group-aware auto-tuning technique that directly applies localized mutation to coherent option groups based on historically best-performing combinations, thus avoiding explicitly identifying critical options. By forgoing the need to know precisely which options are most important, GroupTuner maximizes the use of existing performance data, ensuring more targeted exploration. Extensive experiments demonstrate that GroupTuner can efficiently discover competitive option combinations, achieving an average performance improvement of 12.39% over -O3 while requiring only 77.21% of the time compared to the random search algorithm, significantly outperforming state-of-the-art methods.
Chinese: GroupTuner是一种创新的编译器自动调优技术,它基于历史性能数据对选项组进行局部变异,从而在仅需随机搜索77.21%时间的情况下,实现了比-O3优化平均12.39%的性能提升。
English: GroupTuner is a novel compiler auto-tuning technique that enhances optimization by applying localized mutations to option groups based on historical data, achieving a 12.39% performance gain over -O3 with 77.21% of the time required by random search.
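The group-level mutation described in this abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the two option groups, the acceptance rule, and the `toy_measure` stand-in for compiling and timing a program are hypothetical and are not taken from the GroupTuner paper.

```python
import random

# Hypothetical grouping of a few GCC flags; the paper's actual groups are not given here.
OPTION_GROUPS = {
    "loop": ["-funroll-loops", "-fpeel-loops", "-ftree-vectorize"],
    "inline": ["-finline-functions", "-fpartial-inlining"],
}

def mutate_group(config, group, rng, rate=0.5):
    """Flip options inside one group only; options in other groups stay untouched."""
    child = dict(config)
    for opt in OPTION_GROUPS[group]:
        if rng.random() < rate:
            child[opt] = not child[opt]
    return child

def tune(measure, iterations=50, seed=0):
    """Keep the best configuration seen so far and mutate one group of it per step."""
    rng = random.Random(seed)
    best = {opt: False for opts in OPTION_GROUPS.values() for opt in opts}
    best_score = measure(best)
    for _ in range(iterations):
        group = rng.choice(sorted(OPTION_GROUPS))
        candidate = mutate_group(best, group, rng)
        score = measure(candidate)
        if score > best_score:  # higher score = better performance in this toy setup
            best, best_score = candidate, score
    return best, best_score

def toy_measure(config):
    """Stand-in for compiling and benchmarking: simply counts enabled options."""
    return sum(config.values())
```

Calling `tune(toy_measure)` returns the best configuration found and its score; in a real setting `measure` would build and time the target program.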

Authors:Ruixiao Shi, Fu Feng, Yucheng Xie, Jing Wang, Xin Geng
Title: FAD: Frequency Adaptation and Diversion for Cross-domain Few-shot Learning
Abstract:
Cross-domain few-shot learning (CD-FSL) requires models to generalize from limited labeled samples under significant distribution shifts. While recent methods enhance adaptability through lightweight task-specific modules, they operate solely in the spatial domain and overlook frequency-specific variations that are often critical for robust transfer. We observe that spatially similar images across domains can differ substantially in their spectral representations, with low and high frequencies capturing complementary semantic information at coarse and fine levels. This indicates that uniform spatial adaptation may overlook these spectral distinctions, thus constraining generalization. To address this, we introduce Frequency Adaptation and Diversion (FAD), a frequency-aware framework that explicitly models and modulates spectral components. At its core is the Frequency Diversion Adapter, which transforms intermediate features into the frequency domain using the discrete Fourier transform (DFT), partitions them into low, mid, and high-frequency bands via radial masks, and reconstructs each band using inverse DFT (IDFT). Each frequency band is then adapted using a dedicated convolutional branch with a kernel size tailored to its spectral scale, enabling targeted and disentangled adaptation across frequencies. Extensive experiments on the Meta-Dataset benchmark demonstrate that FAD consistently outperforms state-of-the-art methods on both seen and unseen domains, validating the utility of frequency-domain representations and band-wise adaptation for improving generalization in CD-FSL.
中文: 该研究提出了频率适应与分流(FAD)框架,通过显式建模和调整低、中、高频段的频谱分量,有效提升了跨域小样本学习的泛化能力,在Meta-Dataset基准测试中全面优于现有先进方法。
English: The study introduces Frequency Adaptation and Diversion (FAD), a frequency-aware framework that enhances cross-domain few-shot learning by explicitly modeling and adapting spectral components in low, mid, and high-frequency bands, achieving superior generalization on the Meta-Dataset benchmark compared to existing methods.
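The band-splitting step in this abstract (DFT, radial masks, inverse DFT per band) can be sketched in plain Python as follows. This is a sketch under stated assumptions: the cut-off radii in `cuts` are hypothetical, the naive transform is only suitable for tiny grids, and the per-band convolutional branches of the adapter are omitted.

```python
import cmath
import math

def dft2(x, inverse=False):
    """Naive 2D discrete Fourier transform of an n x n grid (O(n^4), illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0j
            for r in range(n):
                for c in range(n):
                    angle = sign * 2 * math.pi * (u * r + v * c) / n
                    total += x[r][c] * cmath.exp(1j * angle)
            out[u][v] = total / (n * n) if inverse else total
    return out

def radial_band(u, v, n, cuts):
    """Band index (0 = low, 1 = mid, 2 = high) of frequency bin (u, v)."""
    # Wrap indices so the distance is measured from the zero-frequency bin.
    fu, fv = min(u, n - u), min(v, n - v)
    radius = math.hypot(fu, fv) / math.hypot(n / 2, n / 2)
    return sum(radius > cut for cut in cuts)

def split_bands(feature, cuts=(0.25, 0.6)):
    """Return [low, mid, high] spatial maps whose sum reconstructs `feature`."""
    n = len(feature)
    spectrum = dft2(feature)
    bands = []
    for band in range(len(cuts) + 1):
        masked = [[spectrum[u][v] if radial_band(u, v, n, cuts) == band else 0j
                   for v in range(n)] for u in range(n)]
        # The radial mask is symmetric, so a real input yields real-valued bands.
        bands.append([[z.real for z in row] for row in dft2(masked, inverse=True)])
    return bands
```

Because the three masks partition the spectrum, adding the three returned maps recovers the input feature map up to floating-point error.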

Authors:William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, Sergey Levine
Title: Training Strategies for Efficient Embodied Reasoning
Abstract:
Robot chain-of-thought reasoning (CoT) -- wherein a model predicts helpful intermediate representations before choosing actions -- provides an effective method for improving the generalization and performance of robot policies, especially vision-language-action models (VLAs). While such approaches have been shown to improve performance and generalization, they suffer from core limitations, like needing specialized robot reasoning data and slow inference speeds. To design new robot reasoning approaches that address these issues, a more complete characterization of why reasoning helps policy performance is critical. We hypothesize several mechanisms by which robot reasoning improves policies -- (1) better representation learning, (2) improved learning curricularization, and (3) increased expressivity -- then devise simple variants of robot CoT reasoning to isolate and test each one. We find that learning to generate reasonings does lead to better VLA representations, while attending to the reasonings aids in actually leveraging these features for improved action prediction. Our results provide us with a better understanding of why CoT reasoning helps VLAs, which we use to introduce two simple and lightweight alternative recipes for robot reasoning. Our proposed approaches achieve significant performance gains over non-reasoning policies, state-of-the-art results on the LIBERO-90 benchmark, and a 3x inference speedup compared to standard robot reasoning.
中文: 机器人思维链推理通过提升表征学习和特征利用来增强策略性能,由此产生的轻量级方法实现了最优性能并显著加快了推理速度。
English: Robot chain-of-thought reasoning enhances policy performance by improving representation learning and feature utilization, leading to new lightweight methods that achieve state-of-the-art results and faster inference.

Authors:Weizhi Fei, Zihao Wang, Hang Yin, Shukai Zhao, Wei Zhang, Yangqiu Song
Title: Efficient and Scalable Neural Symbolic Search for Knowledge Graph Complex Query Answering
Abstract:
Complex Query Answering (CQA) aims to retrieve answer sets for complex logical formulas from incomplete knowledge graphs, which is a crucial yet challenging task in knowledge graph reasoning. While neuro-symbolic search methods that utilize neural link predictions achieve superior accuracy, they encounter significant complexity bottlenecks: (i) data complexity typically scales quadratically with the number of entities in the knowledge graph, and (ii) query complexity becomes NP-hard for cyclic queries. Consequently, these approaches struggle to scale effectively to larger knowledge graphs and more complex queries. To address these challenges, we propose an efficient and scalable symbolic search framework. First, we propose two constraint strategies to compute neural logical indices that reduce the domain of variables, thereby decreasing the data complexity of symbolic search. Additionally, we introduce an approximate algorithm based on local search to tackle the NP-hard query complexity of cyclic queries. Experiments on various CQA benchmarks demonstrate that our framework reduces the computational load of symbolic methods by 90% while maintaining nearly the same performance, thus alleviating both efficiency and scalability issues.
中文: 我们提出的符号搜索框架通过神经逻辑索引降低数据复杂度,并采用局部搜索近似算法应对NP难的查询复杂度,在保持性能的同时将计算负担减少90%,有效解决了复杂查询应答中的效率和可扩展性瓶颈。
English: Our proposed symbolic search framework addresses efficiency and scalability bottlenecks in Complex Query Answering by reducing data complexity through neural logical indices and tackling NP-hard query complexity with a local search approximation, achieving 90% computational savings with minimal performance loss.

Authors:Perry Dong, Suvir Mirchandani, Dorsa Sadigh, Chelsea Finn
Title: What Matters for Batch Online Reinforcement Learning in Robotics?
Abstract:
The ability to learn from large batches of autonomously collected data for policy improvement -- a paradigm we refer to as batch online reinforcement learning -- holds the promise of enabling truly scalable robot learning by significantly reducing the human effort needed for data collection while benefiting from self-improvement. Yet, despite the promise of this paradigm, it remains challenging to achieve because algorithms are not able to learn effectively from the autonomous data. For example, prior works have applied imitation learning and filtered imitation learning methods to the batch online RL problem, but these algorithms often fail to efficiently improve from the autonomously collected data or converge quickly to a suboptimal point. This raises the question of what matters for effective batch online RL in robotics. Motivated by this question, we perform a systematic empirical study of three axes -- (i) algorithm class, (ii) policy extraction methods, and (iii) policy expressivity -- and analyze how these axes affect performance and scaling with the amount of autonomous data. Through our analysis, we make several observations. First, we observe that the use of Q-functions to guide batch online RL significantly improves performance over imitation-based methods. Building on this, we show that an implicit method of policy extraction -- via choosing the best action in the distribution of the policy -- is necessary over traditional policy extraction methods from offline RL. Next, we show that an expressive policy class is preferred over less expressive policy classes. Based on this analysis, we propose a general recipe for effective batch online RL. We then show that a simple addition to the recipe, using temporally-correlated noise to obtain more diversity, results in further performance gains. Our recipe obtains significantly better performance and scaling compared to prior methods.
中文: 批量在线强化学习通过利用自主收集的数据实现可扩展的机器人学习,我们的研究表明,Q函数引导、隐式策略提取和表达性策略类能显著提升性能和数据利用效率。
English: Batch online reinforcement learning enables scalable robot learning by leveraging autonomous data collection, with our study demonstrating that Q-function guidance, implicit policy extraction, and expressive policy classes significantly enhance performance and data efficiency.
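The implicit policy extraction mentioned in this abstract (choosing the best action in the policy's distribution) can be sketched as sampling several candidate actions and keeping the one the Q-function scores highest. The function names, the sample count, and the callable interfaces below are hypothetical and are not taken from the paper.

```python
import random

def extract_action(state, sample_action, q_value, num_samples=16, rng=None):
    """Sample candidate actions from the policy and return the one with the highest Q-value.

    `sample_action(state, rng)` and `q_value(state, action)` are hypothetical callables
    standing in for a learned policy and a learned Q-function.
    """
    if rng is None:
        rng = random.Random(0)
    candidates = [sample_action(state, rng) for _ in range(num_samples)]
    return max(candidates, key=lambda action: q_value(state, action))
```

The policy itself is never updated toward the Q-function here; the Q-function only filters what the policy proposes, which is why an expressive policy class that proposes diverse candidates matters.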

Authors:Pengyu Wang, Hin Wang Lin, Jialu Li, Jiankun Wang, Ling Shi, Max Q.-H. Meng
Title: PierGuard: A Planning Framework for Underwater Robotic Inspection of Coastal Piers
Abstract:
Using underwater robots instead of humans for the inspection of coastal piers can enhance efficiency while reducing risks. A key challenge in performing these tasks lies in achieving efficient and rapid path planning within complex environments. Sampling-based path planning methods, such as Rapidly-exploring Random Tree* (RRT*), have demonstrated notable performance in high-dimensional spaces. In recent years, researchers have begun designing various geometry-inspired heuristics and neural network-driven heuristics to further enhance the effectiveness of RRT*. However, the performance of these general path planning methods still requires improvement when applied to highly cluttered underwater environments. In this paper, we propose PierGuard, which combines the strengths of bidirectional search and neural network-driven heuristic regions. We design a specialized neural network to generate high-quality heuristic regions in cluttered maps, thereby improving the performance of the path planning. Through extensive simulation and real-world ocean field experiments, we demonstrate the effectiveness and efficiency of our proposed method compared with previous research. Our method achieves approximately 2.6 times the performance of the state-of-the-art geometric-based sampling method and nearly 4.9 times that of the state-of-the-art learning-based sampling method. Our results provide valuable insights for the automation of pier inspection and the enhancement of maritime safety. The updated experimental video is available in the supplementary materials.
中文: 本文提出PierGuard方法,通过结合双向搜索与神经网络驱动的启发式区域生成技术,显著提升了水下机器人在复杂码头检测环境中的路径规划效率,实验证明其性能分别达到几何采样方法的2.6倍和学习采样方法的4.9倍。
English: This paper introduces PierGuard, a novel path planning method that integrates bidirectional search with neural network-generated heuristic regions to significantly enhance underwater robot navigation efficiency in cluttered pier inspection environments, demonstrating performance improvements of 2.6x over geometric-based and 4.9x over learning-based sampling methods.

Authors:Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, Xiao Chen, Feipeng Tian, Jianxiong Pan, Zeming Li, Gang Yu, Xiangyu Zhang, Daxin Jiang, Ping Tan
Title: Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets
Abstract:
While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with a diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.
中文摘要:Step1X-3D框架通过数据筛选、混合架构和开源发布解决了3D生成的关键难题,在建立可控3D资产生成新标准的同时,成功连接了2D与3D生成范式。
English Summary: The Step1X-3D framework addresses 3D generation challenges through curated datasets, hybrid architecture, and full open-source release, establishing new standards for controllable 3D asset generation while bridging 2D and 3D paradigms.

Authors:Yanhui Hong, Nan Wang, Zhiyi Xia, Haoyi Tao, Xi Fang, Yiming Li, Jiankun Wang, Peng Jin, Xiaochen Cai, Shengyu Li, Ziqi Chen, Zezhong Zhang, Guolin Ke, Linfeng Zhang
Title: Uni-AIMS: AI-Powered Microscopy Image Analysis
Abstract:
This paper presents a systematic solution for the intelligent recognition and automatic analysis of microscopy images. We developed a data engine that generates high-quality annotated datasets by combining the collection of diverse microscopy images from experiments, synthetic data generation, and a human-in-the-loop annotation process. To address the unique challenges of microscopy images, we propose a segmentation model capable of robustly detecting both small and large objects. The model effectively identifies and separates thousands of closely situated targets, even in cluttered visual environments. Furthermore, our solution supports the precise automatic recognition of image scale bars, an essential feature in quantitative microscopic analysis. Building upon these components, we have constructed a comprehensive intelligent analysis platform and validated its effectiveness and practicality in real-world applications. This study not only advances automatic recognition in microscopy imaging but also ensures scalability and generalizability across multiple application domains, offering a powerful tool for automated microscopic analysis in interdisciplinary research. An online application is made available for researchers to access and evaluate the proposed automated analysis service.
中文: 本文提出了一种智能显微镜图像分析系统,通过数据引擎生成高质量标注数据集,采用鲁棒分割模型精确识别目标,并构建了可在线访问的实用分析平台。
English: This paper introduces an intelligent system for microscopy image analysis, featuring a robust segmentation model and a data engine that generates high-quality annotated datasets, validated through a practical online platform.

Authors:Jian Song, Hongruixuan Chen, Naoto Yokoya
Title: Enhancing Monocular Height Estimation via Sparse LiDAR-Guided Correction
Abstract:
Monocular height estimation (MHE) from very-high-resolution (VHR) remote sensing imagery via deep learning is notoriously challenging due to the lack of sufficient structural information. Conventional digital elevation models (DEMs), typically derived from airborne LiDAR or multi-view stereo, remain costly and geographically limited. Recently, models trained on synthetic data and refined through domain adaptation have shown remarkable performance in MHE, yet it remains unclear how these models make predictions or how reliable they truly are. In this paper, we investigate a state-of-the-art MHE model trained purely on synthetic data to explore where the model looks when making height predictions. Through systematic analyses, we find that the model relies heavily on shadow cues, a factor that can lead to overestimation or underestimation of heights when shadows deviate from expected norms. Furthermore, the inherent difficulty of evaluating regression tasks with the human eye underscores additional limitations of purely synthetic training. To address these issues, we propose a novel correction pipeline that integrates sparse, imperfect global LiDAR measurements (ICESat-2) with deep-learning outputs to improve local accuracy and achieve spatially consistent corrections. Our method comprises two stages: pre-processing raw ICESat-2 data, followed by a random forest-based approach to densely refine height estimates. Experiments in three representative urban regions -- Saint-Omer, Tokyo, and Sao Paulo -- reveal substantial error reductions, with mean absolute error (MAE) decreased by 22.8%, 6.9%, and 4.9%, respectively. These findings highlight the critical role of shadow awareness in synthetic data-driven models and demonstrate how fusing imperfect real-world LiDAR data can bolster the robustness of MHE, paving the way for more reliable and scalable 3D mapping solutions.
中文摘要:本研究揭示了基于合成数据的单目高度估计模型过度依赖阴影线索导致预测偏差,并提出一种融合稀疏激光雷达数据与深度学习的校正方法,在多个城市区域显著降低了高度估计误差。
English Summary: This study reveals that monocular height estimation models trained on synthetic data heavily depend on shadow cues, leading to inaccuracies, and proposes a correction pipeline combining sparse LiDAR data with deep learning to significantly reduce errors in urban areas.

Authors:Dongxiu Liu, Haoyi Niu, Zhihao Wang, Jinliang Zheng, Yinan Zheng, Zhonghong Ou, Jianming Hu, Jianxiong Li, Xianyuan Zhan
Title: Efficient Robotic Policy Learning via Latent Space Backward Planning
Abstract:
Current robotic planning methods often rely on predicting multi-frame images with full pixel details. While this fine-grained approach can serve as a generic world model, it introduces two significant challenges for downstream policy learning: substantial computational costs that hinder real-time deployment, and accumulated inaccuracies that can mislead action extraction. Planning with coarse-grained subgoals partially alleviates efficiency issues. However, their forward planning schemes can still result in off-task predictions due to accumulation errors, leading to misalignment with long-term goals. This raises a critical question: Can robotic planning be both efficient and accurate enough for real-time control in long-horizon, multi-stage tasks? To address this, we propose a Latent Space Backward Planning scheme (LBP), which begins by grounding the task into final latent goals, followed by recursively predicting intermediate subgoals closer to the current state. The grounded final goal enables backward subgoal planning to always remain aware of task completion, facilitating on-task prediction along the entire planning horizon. The subgoal-conditioned policy incorporates a learnable token to summarize the subgoal sequences and determines how each subgoal guides action extraction. Through extensive simulation and real-robot long-horizon experiments, we show that LBP outperforms existing fine-grained and forward planning methods, achieving SOTA performance. Project Page: https://lbp-authors.github.io
中文: 现有机器人规划方法存在效率和精度问题,为此我们提出潜在空间逆向规划方案,从最终目标出发递归预测中间子目标,确保任务对齐并实现实时控制。
English: Current robotic planning methods face efficiency and accuracy challenges, so we propose a Latent Space Backward Planning (LBP) scheme that starts from final goals and recursively predicts intermediate subgoals to ensure task alignment and real-time performance.
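The backward recursion described in this abstract, which starts from the final latent goal and repeatedly predicts a subgoal closer to the current state, can be sketched as follows. The midpoint rule in `predict_subgoal` is a hypothetical stand-in for the learned subgoal predictor, and plain lists of floats stand in for latent vectors.

```python
def predict_subgoal(current, target):
    """Hypothetical stand-in for a learned predictor: the midpoint in latent space."""
    return [(c + t) / 2 for c, t in zip(current, target)]

def backward_plan(current, final_goal, depth=3):
    """Return subgoals ordered from nearest-to-current up to the final goal."""
    plan = [list(final_goal)]  # planning starts from the grounded final goal
    for _ in range(depth):
        # Each new subgoal is conditioned on the current state and the previous,
        # farther subgoal, so every step stays anchored to the final goal.
        plan.append(predict_subgoal(current, plan[-1]))
    return plan[::-1]
```

For example, `backward_plan([0.0, 0.0], [8.0, 4.0])` yields subgoals that halve the remaining distance at each level, ending with the final goal itself.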

Authors:Qi Cheng, Licheng Liu, Yao Zhang, Mu Hong, Shiyuan Luo, Zhenong Jin, Yiqun Xie, Xiaowei Jia
Title: Knowledge Guided Encoder-Decoder Framework: Integrating Multiple Physical Models for Agricultural Ecosystem Modeling
Abstract:
Agricultural monitoring is critical for ensuring food security, maintaining sustainable farming practices, informing policies on mitigating food shortage, and managing greenhouse gas emissions. Traditional process-based physical models are often designed and implemented for specific situations, and their parameters can also be highly uncertain. In contrast, data-driven models often use black-box structures and do not explicitly model the interdependence between different ecological variables. As a result, they require extensive training data and lack generalizability to different tasks with data distribution shifts and inconsistent observed variables. To address the need for more universal models, we propose a knowledge-guided encoder-decoder model, which can predict key crop variables by leveraging knowledge of underlying processes from multiple physical models. The proposed method also integrates a language model to process complex and inconsistent inputs and uses it to implement a model selection mechanism for selectively combining the knowledge from different physical models. Our evaluations on predicting carbon and nitrogen fluxes for multiple sites demonstrate the effectiveness and robustness of the proposed model under various scenarios.
Chinese: 提出的知识引导编码器-解码器模型融合多种物理模型和语言模型,有效预测关键作物变量,在多种场景下的碳氮通量农业监测中展现出卓越的稳健性。
English: The proposed knowledge-guided encoder-decoder model integrates multiple physical models and a language model to effectively predict key crop variables, demonstrating robustness in agricultural monitoring for carbon and nitrogen flux predictions across diverse scenarios.

Authors:Yizhuo Wu, Yi Zhu, Kun Qian, Qinyu Chen, Anding Zhu, John Gajadharsing, Leo C. N. de Vreede, Chang Gao
Title: DeltaDPD: Exploiting Dynamic Temporal Sparsity in Recurrent Neural Networks for Energy-Efficient Wideband Digital Predistortion
Abstract:
Digital Predistortion (DPD) is a popular technique to enhance signal quality in wideband RF power amplifiers (PAs). With increasing bandwidth and data rates, DPD faces significant energy consumption challenges during deployment, contrasting with its efficiency goals. State-of-the-art DPD models rely on recurrent neural networks (RNN), whose computational complexity hinders system efficiency. This paper introduces DeltaDPD, exploring the dynamic temporal sparsity of input signals and neuronal hidden states in RNNs for energy-efficient DPD, reducing arithmetic operations and memory accesses while preserving satisfactory linearization performance. Applying a TM3.1a 200MHz-BW 256-QAM OFDM signal to a 3.5 GHz GaN Doherty RF PA, DeltaDPD achieves -50.03 dBc in Adjacent Channel Power Ratio (ACPR), -37.22 dB in Normalized Mean Square Error (NMSE) and -38.52 dBc in Error Vector Magnitude (EVM) with 52% temporal sparsity, leading to a 1.8X reduction in estimated inference power. The DeltaDPD code will be released after formal publication at https://www.opendpd.com.
中文摘要:DeltaDPD通过利用循环神经网络中的动态时间稀疏性,提出了一种节能的数字预失真方法,在保持优异线性化性能的同时将推理功耗降低1.8倍。
English Summary: DeltaDPD introduces an energy-efficient digital predistortion method that exploits dynamic temporal sparsity in RNNs, reaching -50.03 dBc ACPR and -37.22 dB NMSE at 52% temporal sparsity while reducing estimated inference power by 1.8 times.
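One common way to obtain temporal sparsity is a delta-threshold rule: a value is propagated only when it has changed by at least a threshold since the last propagated value. The sketch below assumes that rule for illustration; the abstract does not spell out DeltaDPD's exact update, and the function names and threshold are hypothetical.

```python
def delta_encode(sequence, threshold):
    """Return per-step deltas, zeroing out changes smaller than `threshold`."""
    reference = 0.0  # last value that was actually propagated
    deltas = []
    for x in sequence:
        delta = x - reference
        if abs(delta) >= threshold:
            reference = x        # update the stored reference only on large changes
            deltas.append(delta)
        else:
            deltas.append(0.0)   # skipped step: no arithmetic or memory access needed downstream
    return deltas

def temporal_sparsity(deltas):
    """Fraction of time steps whose delta was skipped."""
    return sum(1 for d in deltas if d == 0.0) / len(deltas)
```

Zero deltas are what allow multiply-accumulate operations and weight fetches to be skipped, so a higher sparsity fraction corresponds to fewer operations per step.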

Authors:Faizan Farooq Khan, Jun Chen, Youssef Mohamed, Chun-Mei Feng, Mohamed Elhoseiny
Title: VR-RAG: Open-vocabulary Species Recognition with RAG-Assisted Large Multi-Modal Models
Abstract:
Open-vocabulary recognition remains a challenging problem in computer vision, as it requires identifying objects from an unbounded set of categories. This is particularly relevant in nature, where new species are discovered every year. In this work, we focus on open-vocabulary bird species recognition, where the goal is to classify species based on their descriptions without being constrained to a predefined set of taxonomic categories. Traditional benchmarks like CUB-200-2011 and Birdsnap have been evaluated in a closed-vocabulary paradigm, limiting their applicability to real-world scenarios where novel species continually emerge. We show that the performance of current systems drops by a huge margin when they are evaluated under settings closely aligned with open-vocabulary recognition. To address this gap, we propose a scalable framework integrating structured textual knowledge from Wikipedia articles of 11,202 bird species distilled via GPT-4o into concise, discriminative summaries. We propose Visual Re-ranking Retrieval-Augmented Generation (VR-RAG), a novel retrieval-augmented generation framework that uses visual similarities to rerank the top m candidates retrieved by a set of multimodal vision language encoders. This allows for the recognition of unseen taxa. Extensive experiments across five established classification benchmarks show that our approach is highly effective. By integrating VR-RAG, we improve the average performance of state-of-the-art Large Multi-Modal Model QWEN2.5-VL by 15.4% across five benchmarks. Our approach outperforms conventional VLM-based approaches, which struggle with unseen species. By bridging the gap between encyclopedic knowledge and visual recognition, our work advances open-vocabulary recognition, offering a flexible, scalable solution for biodiversity monitoring and ecological research.
中文:提出的视觉重排检索增强生成(VR-RAG)框架通过结合结构化百科知识与视觉信息,显著提升了开放词汇物种识别的性能,在多个基准测试中表现优异。
English: The proposed Visual Re-ranking Retrieval-Augmented Generation (VR-RAG) framework enhances open-vocabulary species recognition by integrating structured encyclopedic knowledge with visual information, significantly improving performance across benchmarks.
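
The retrieve-then-rerank idea described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the use of cosine similarity, and the idea of one reference image embedding per species are all hypothetical, and the final step in which the large multi-modal model picks an answer from the reranked list is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_then_rerank(query_vec, text_vecs, image_vecs, m, k):
    """Hypothetical two-stage candidate selection.

    query_vec  -- embedding of the query image (assumed shared embedding space)
    text_vecs  -- {species: embedding of its text summary}
    image_vecs -- {species: embedding of a reference image} (an assumption)
    """
    # Stage 1: keep the top-m species whose summaries best match the query.
    top_m = sorted(text_vecs,
                   key=lambda s: cosine(query_vec, text_vecs[s]),
                   reverse=True)[:m]
    # Stage 2: reorder only those m candidates by visual similarity.
    reranked = sorted(top_m,
                      key=lambda s: cosine(query_vec, image_vecs[s]),
                      reverse=True)
    return reranked[:k]

query = [1.0, 0.0]
text_vecs = {"A": [0.9, 0.1], "B": [0.8, 0.2], "C": [0.0, 1.0]}
image_vecs = {"A": [0.5, 0.5], "B": [1.0, 0.0], "C": [1.0, 0.0]}
shortlist = retrieve_then_rerank(query, text_vecs, image_vecs, m=2, k=2)
```

In this toy data, species C is visually close to the query but never reaches the rerank stage because its summary does not match, which shows that the second stage only reorders what the first stage retrieved.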

Authors:Faizan Farooq Khan, Jun Chen, Youssef Mohamed, Chun-Mei Feng, Mohamed Elhoseiny
Title: Neural Catalog: Scaling Species Recognition with Catalog of Life-Augmented Generation
Abstract:
Open-vocabulary species recognition is a major challenge in computer vision, particularly in ornithology, where new taxa are continually discovered. While benchmarks like CUB-200-2011 and Birdsnap have advanced fine-grained recognition under closed vocabularies, they fall short of real-world conditions. We show that current systems suffer a performance drop of over 30% in realistic open-vocabulary settings with thousands of candidate species, largely due to an increased number of visually similar and semantically ambiguous distractors. To address this, we propose Visual Re-ranking Retrieval-Augmented Generation (VR-RAG), a novel framework that links structured encyclopedic knowledge with recognition. We distill Wikipedia articles for 11,202 bird species into concise, discriminative summaries and retrieve candidates from these summaries. Unlike prior text-only approaches, VR-RAG incorporates visual information during retrieval, ensuring final predictions are both textually relevant and visually consistent with the query image. Extensive experiments across five bird classification benchmarks and two additional domains show that VR-RAG improves the average performance of the state-of-the-art Qwen2.5-VL model by 18.0%.
中文:提出的视觉重排检索增强生成(VR-RAG)框架通过结合结构化百科知识与视觉信息,显著提升了开放词汇物种识别的性能,在多个基准测试中表现优异。
English: The proposed Visual Re-ranking Retrieval-Augmented Generation (VR-RAG) framework enhances open-vocabulary species recognition by integrating structured encyclopedic knowledge with visual information, significantly improving performance across benchmarks.

Authors:Noriaki Hirose, Lydia Ignatova, Kyle Stachowicz, Catherine Glossop, Sergey Levine, Dhruv Shah
Title: Learning to Drive Anywhere with Model-Based Reannotation
Abstract:
Developing broadly generalizable visual navigation policies for robots is a significant challenge, primarily constrained by the availability of large-scale, diverse training data. While curated datasets collected by researchers offer high quality, their limited size restricts policy generalization. To overcome this, we explore leveraging abundant, passively collected data sources, including large volumes of crowd-sourced teleoperation data and unlabeled YouTube videos, despite their potential for lower quality or missing action labels. We propose Model-Based ReAnnotation (MBRA), a framework that utilizes a learned short-horizon, model-based expert model to relabel or generate high-quality actions for these passive datasets. This relabeled data is then distilled into LogoNav, a long-horizon navigation policy conditioned on visual goals or GPS waypoints. We demonstrate that LogoNav, trained using MBRA-processed data, achieves state-of-the-art performance, enabling robust navigation over distances exceeding 300 meters in previously unseen indoor and outdoor environments. Our extensive real-world evaluations, conducted across a fleet of robots (including quadrupeds) in six cities on three continents, validate the policy's ability to generalize and navigate effectively even amidst pedestrians in crowded settings.
中文:MBRA框架通过重新标注被动数据训练LogoNav导航策略,实现了在三大洲六个城市的真实环境中超过300米的先进泛化能力,即使在拥挤场景下也能有效导航。
English: The Model-Based ReAnnotation (MBRA) framework enhances robot navigation by relabeling passive data sources to train LogoNav, a policy achieving state-of-the-art generalization over 300 meters in diverse, unseen environments across global real-world tests.

Authors:Siyuan Zhou, Yilun Du, Yuncong Yang, Lei Han, Peihao Chen, Dit-Yan Yeung, Chuang Gan
Title: Learning 3D Persistent Embodied World Models
Abstract:
The ability to simulate the effects of future actions on the world is a crucial ability of intelligent embodied agents, enabling agents to anticipate the effects of their actions and make plans accordingly. While a large body of existing work has explored how to construct such world models using video models, they are often myopic in nature, without any memory of a scene not captured by currently observed images, preventing agents from making consistent long-horizon plans in complex environments where many parts of the scene are partially observed. We introduce a new persistent embodied world model with an explicit memory of previously generated content, enabling much more consistent long-horizon simulation. During generation time, our video diffusion model predicts RGB-D video of the future observations of the agent. This generation is then aggregated into a persistent 3D map of the environment. By conditioning the video model on this 3D spatial map, we illustrate how this enables video world models to faithfully simulate both seen and unseen parts of the world. Finally, we illustrate the efficacy of such a world model in downstream embodied applications, enabling effective planning and policy learning.
Chinese: 本研究提出了一种具有持久性的具身世界模型,通过显式记忆先前生成的内容,实现了连贯的长期模拟,使智能体能够在复杂环境中有效规划和进行策略学习。
English: This research introduces a persistent embodied world model that uses an explicit memory of past content to enable consistent long-horizon simulations, allowing agents to effectively plan and learn policies in complex environments.

Authors:Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, Jun-Yan Zhu
Title: Generating Physically Stable and Buildable Brick Structures from Text
Abstract:
We introduce BrickGPT, the first approach for generating physically stable interconnecting brick assembly models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of brick structures, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that BrickGPT produces stable, diverse, and aesthetically pleasing brick structures that align closely with the input text prompts. We also develop a text-based brick texturing method to generate colored and textured designs. We show that our designs can be assembled manually by humans and automatically by robotic arms. We release our new dataset, StableText2Brick, containing over 47,000 brick structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models at the project website: https://avalovelace1.github.io/BrickGPT/.
中文: BrickGPT是首个通过文本提示生成物理稳定互锁砖块模型的方法,它基于大规模数据集训练,并在推理中结合物理验证,确保设计稳定、多样且符合文本描述。
English: BrickGPT is the first method that generates physically stable interlocking brick models from text prompts by training on a large dataset and incorporating physics-based validation during inference, producing designs that are stable, diverse, and align well with input descriptions.

Authors:Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, Ethan Perez
Title: Reasoning Models Don't Always Say What They Think
Abstract:
Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models' actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.
中文: 思维链监控在检测AI行为方面显示出潜力,但由于其常常无法忠实反映模型的推理过程,尤其对于罕见的灾难性事件,因此不足以确保安全性。
English: Chain-of-thought monitoring shows promise for detecting AI behaviors but is insufficient to ensure safety, as it often fails to faithfully represent models' reasoning processes, especially for rare catastrophic events.

Authors:Thomas Grübl, Weijie Niu, Jan von der Assen, Burkhard Stiller
Title: QUIC-Exfil: Exploiting QUIC's Server Preferred Address Feature to Perform Data Exfiltration Attacks
Abstract:
The QUIC protocol is now widely adopted by major tech companies and accounts for a significant fraction of today's Internet traffic. QUIC's multiplexing capabilities, encrypted headers, dynamic IP address changes, and encrypted parameter negotiations make the protocol not only more efficient, secure, and censorship-resistant, but also practically unmanageable by firewalls. This opens doors for attackers who may exploit certain traits of the QUIC protocol to perform targeted attacks, such as data exfiltration attacks. Whereas existing data exfiltration techniques, such as TLS and DNS-based exfiltration, can be detected on a firewall level, QUIC-based data exfiltration is more difficult to detect, since changes in IP addresses and ports are inherent to the protocol's normal behavior. To show the feasibility of a QUIC-based data exfiltration attack, we introduce a novel method that leverages the server preferred address feature of the QUIC protocol and thus allows an attacker to exfiltrate sensitive data from an infected machine to a malicious server, disguised as a server-side connection migration. The attack is implemented as a proof-of-concept tool in Rust. We evaluated the performance of five anomaly detection classifiers (Random Forest, Multi-Layer Perceptron, Support Vector Machine, Autoencoder, and Isolation Forest) trained on datasets collected from three network traffic scenarios. The classifiers were trained on over 700K benign and malicious QUIC packets and 786 connection migration events, but were unable to detect the data exfiltration attempts. Furthermore, post-analysis of the traffic captures did not reveal any identifiable fingerprint. As part of our evaluation, we also interviewed five leading firewall vendors and found that, as of today, no major firewall vendor implements functionality capable of distinguishing between benign and malicious QUIC connection migrations.
中文: QUIC协议因其加密头部和动态IP变更等特性,使得基于该协议的数据外泄攻击难以被防火墙和异常检测系统识别,现有防护手段均无法有效区分恶意与正常连接迁移。
English: The QUIC protocol's inherent features like encrypted headers and dynamic IP changes enable undetectable data exfiltration attacks, which current firewalls and anomaly detection systems fail to identify despite extensive testing.

Authors:Ziyi Zhang, Zhen Sun, Zongmin Zhang, Zifan Peng, Yuemeng Zhao, Zichun Wang, Zeren Luo, Ruiting Zuo, Xinlei He
Title: "I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
Abstract:
The visually impaired population, especially the severely visually impaired, is currently large in scale, and daily activities pose significant challenges for them. Although many studies use large language and vision-language models to assist the blind, most focus on static content and fail to meet real-time perception needs in dynamic and complex environments, such as daily activities. To provide them with more effective intelligent assistance, it is imperative to incorporate advanced visual understanding technologies. Although real-time vision and speech interaction VideoLLMs demonstrate strong real-time visual understanding, no prior work has systematically evaluated their effectiveness in assisting visually impaired individuals. In this work, we conduct the first such evaluation. First, we construct a benchmark dataset (VisAssistDaily), covering three categories of assistive tasks for visually impaired individuals: Basic Skills, Home Life Tasks, and Social Life Tasks. The results show that GPT-4o achieves the highest task success rate. Next, we conduct a user study to evaluate the models in both closed-world and open-world scenarios, further exploring the practical challenges of applying VideoLLMs in assistive contexts. One key issue we identify is the difficulty current models face in perceiving potential hazards in dynamic environments. To address this, we build an environment-awareness dataset named SafeVid and introduce a polling mechanism that enables the model to proactively detect environmental risks. We hope this work provides valuable insights and inspiration for future research in this field.
中文: 本研究评估了实时视觉语言模型在辅助视障人士方面的有效性,发现其在动态环境中感知潜在危险存在困难,并通过构建新数据集和引入轮询机制提出解决方案,以增强环境安全意识。
English: This study evaluates the effectiveness of real-time vision-language models in assisting visually impaired individuals, identifying challenges in dynamic hazard perception and proposing solutions through a new dataset and polling mechanism to enhance environmental safety awareness.

Authors:Lynnette Hui Xian Ng, Wenqi Zhou, Kathleen M. Carley
Title: Appeal and Scope of Misinformation Spread by AI Agents and Humans
Abstract:
This work examines the influence of misinformation and the role of AI agents, called bots, on social network platforms. To quantify the impact of misinformation, it proposes two new metrics based on attributes of tweet engagement and user network position: Appeal, which measures the popularity of the tweet, and Scope, which measures the potential reach of the tweet. In addition, it analyzes 5.8 million misinformation tweets on the COVID-19 vaccine discourse over three time periods: Pre-Vaccine, Vaccine Launch, and Post-Vaccine. Results show that misinformation was more prevalent during the first two periods. Human-generated misinformation tweets tend to have higher appeal and scope compared to bot-generated ones. Tweedie regression analysis reveals that human-generated misinformation tweets were most concerning during Vaccine Launch week, whereas bot-generated misinformation reached its highest appeal and scope during the Pre-Vaccine period.
中文摘要:本研究通过引入衡量推文受欢迎度的"吸引力"和传播范围的"覆盖度"指标,分析社交网络中的虚假信息影响,发现疫苗推广阶段人类传播的虚假信息比机器人生成的内容更具影响力。
English Summary: This study analyzes misinformation's impact on social networks, introducing "Appeal" and "Scope" metrics to evaluate tweet popularity and reach, finding human-generated misinformation more influential than bot-generated content during vaccine rollout phases.

Authors:Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang
Title: FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
Abstract:
Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, based on our observations, the denoising process exhibits varying levels of attention to motion (low frequency) and appearance details (high frequency) at different timesteps. So we propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, directly achieves action extraction during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/
中文摘要:FlexiAct提出了一种将参考视频中的动作迁移到任意目标图像的新方法,通过RefAdapter和FAE技术突破布局与视角等空间限制,同时保持身份一致性。
English Summary: FlexiAct is a novel method that transfers actions from a reference video to any target image, overcoming spatial constraints like layout and viewpoint through RefAdapter and FAE techniques while maintaining identity consistency.

Authors:Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng
Title: Distribution-Conditional Generation: From Class Distribution to Creative Generation
Abstract:
Text-to-image (T2I) diffusion models are effective at producing semantically aligned images, but their reliance on training data distributions limits their ability to synthesize truly novel, out-of-distribution concepts. Existing methods typically enhance creativity by combining pairs of known concepts, yielding compositions that, while out-of-distribution, remain linguistically describable and bounded within the existing semantic space. Inspired by the soft probabilistic outputs of classifiers on ambiguous inputs, we propose Distribution-Conditional Generation, a novel formulation that models creativity as image synthesis conditioned on class distributions, enabling semantically unconstrained creative generation. Building on this, we propose DisTok, an encoder-decoder framework that maps class distributions into a latent space and decodes them into creative concept tokens. DisTok maintains a dynamic concept pool and iteratively samples and fuses concept pairs, enabling the generation of tokens aligned with increasingly complex class distributions. To enforce distributional consistency, latent vectors sampled from a Gaussian prior are decoded into tokens and rendered into images, whose class distributions, predicted by a vision-language model, supervise the alignment between input distributions and the visual semantics of generated tokens. The resulting tokens are added to the concept pool for subsequent composition. Extensive experiments demonstrate that DisTok, by unifying distribution-conditioned fusion and sampling-based synthesis, enables efficient and flexible token-level generation, achieving state-of-the-art performance with superior text-image alignment and human preference scores.
中文: 文本到图像扩散模型难以生成超越训练数据的新颖概念,但提出的DisTok框架通过分布条件生成和迭代概念融合,实现了语义无约束的创造性合成,在图文对齐和人类偏好方面达到最优性能。
English: Text-to-image diffusion models struggle with generating truly novel concepts beyond their training data, but the proposed DisTok framework introduces distribution-conditional generation and iterative concept fusion to enable semantically unconstrained creative synthesis with state-of-the-art alignment and human preference.

Authors:Zihan Wang, Hongwei Li, Rui Zhang, Wenbo Jiang, Kangjie Chen, Tianwei Zhang, Qingchuan Zhao, Guowen Xu
Title: BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models
Abstract:
In this paper, we present a new form of backdoor attack against Large Language Models (LLMs): lingual-backdoor attacks. The key novelty of lingual-backdoor attacks is that the language itself serves as the trigger to hijack the infected LLMs to generate inflammatory speech. They enable the precise targeting of a specific language-speaking group, exacerbating racial discrimination by malicious entities. We first implement a baseline lingual-backdoor attack, which is carried out by poisoning a set of training data for specific downstream tasks through translation into the trigger language. However, this baseline attack suffers from poor task generalization and is impractical in real-world settings. To address this challenge, we design BadLingual, a novel task-agnostic lingual-backdoor capable of triggering any downstream task within chat LLMs, regardless of the specific questions of these tasks. We design a new approach that uses adversarial training based on PPL-constrained Greedy Coordinate Gradient-based Search (PGCG) to expand the decision boundary of the lingual-backdoor, thereby enhancing its generalization ability across various tasks. We perform extensive experiments to validate the effectiveness of our proposed attacks. Specifically, the baseline attack achieves an ASR of over 90% on the specified tasks. However, its ASR reaches only 37.61% across six tasks in the task-agnostic scenario. In contrast, BadLingual brings up to 37.35% improvement over the baseline. Our study sheds light on a new perspective of vulnerabilities in LLMs with multilingual capabilities and is expected to promote future research on potential defenses to enhance the LLMs' robustness.
Chinese: 本文提出了一种针对大型语言模型的新型语言后门攻击,其中语言本身作为触发条件诱导有害输出,并设计了任务无关的BadLingual方法,显著提高了跨任务攻击成功率。
English: This paper introduces a novel lingual-backdoor attack on Large Language Models, where the language itself acts as a trigger to induce harmful outputs, and proposes BadLingual, a task-agnostic method that significantly improves attack success rates across various tasks.

Authors:Kuofeng Gao, Yufei Zhu, Yiming Li, Jiawang Bai, Yong Yang, Zhifeng Li, Shu-Tao Xia
Title: Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models
Abstract:
Text-to-image (T2I) diffusion models have rapidly advanced, enabling high-quality image generation conditioned on textual prompts. However, the growing trend of fine-tuning pre-trained models for personalization raises serious concerns about unauthorized dataset usage. To combat this, dataset ownership verification (DOV) has emerged as a solution, embedding watermarks into the fine-tuning datasets using backdoor techniques. These watermarks remain inactive under benign samples but produce owner-specified outputs when triggered. Despite the promise of DOV for T2I diffusion models, its robustness against copyright evasion attacks (CEA) remains unexplored. In this paper, we explore how attackers can bypass these mechanisms through CEA, allowing models to circumvent watermarks even when trained on watermarked datasets. We propose the first copyright evasion attack (i.e., CEAT2I) specifically designed to undermine DOV in T2I diffusion models. Concretely, our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation. A key insight driving our approach is that T2I models exhibit faster convergence on watermarked samples during the fine-tuning, evident through intermediate feature deviation. Leveraging this, CEAT2I can reliably detect the watermarked samples. Then, we iteratively ablate tokens from the prompts of detected watermarked samples and monitor shifts in intermediate features to pinpoint the exact trigger tokens. Finally, we adopt a closed-form concept erasure method to remove the injected watermark. Extensive experiments show that our CEAT2I effectively evades DOV mechanisms while preserving model performance.
中文: 本文提出首个针对文本到图像扩散模型的版权规避攻击CEAT2I,通过检测水印样本、识别触发标记和消除水印三个步骤,在保持模型性能的同时有效规避数据集所有权验证机制。
English: This paper introduces CEAT2I, the first copyright evasion attack designed to bypass dataset ownership verification in text-to-image diffusion models by detecting watermarked samples, identifying trigger tokens, and erasing watermarks while maintaining model performance.

Authors:Yongxiang Li, Yuan Sun, Yang Qin, Dezhong Peng, Xi Peng, Peng Hu
Title: Robust Duality Learning for Unsupervised Visible-Infrared Person Re-Identification
Abstract:
Unsupervised visible-infrared person re-identification (UVI-ReID) aims to retrieve pedestrian images across different modalities without costly annotations, but faces challenges due to the modality gap and lack of supervision. Existing methods often adopt self-training with clustering-generated pseudo-labels but implicitly assume these labels are always correct. In practice, however, this assumption fails due to inevitable pseudo-label noise, which hinders model learning. To address this, we introduce a new learning paradigm that explicitly considers Pseudo-Label Noise (PLN), characterized by three key challenges: noise overfitting, error accumulation, and noisy cluster correspondence. To this end, we propose a novel Robust Duality Learning framework (RoDE) for UVI-ReID to mitigate the effects of noisy pseudo-labels. First, to combat noise overfitting, a Robust Adaptive Learning mechanism (RAL) is proposed to dynamically emphasize clean samples while down-weighting noisy ones. Second, to alleviate error accumulation, where the model reinforces its own mistakes, RoDE employs two distinct models that are alternately trained using pseudo-labels from each other, encouraging diversity and preventing collapse. However, this dual-model strategy introduces misalignment between clusters across models and modalities, creating noisy cluster correspondence. To resolve this, we introduce Cluster Consistency Matching (CCM), which aligns clusters across models and modalities by measuring cross-cluster similarity. Extensive experiments on three benchmarks demonstrate the effectiveness of RoDE.
中文: 本文提出了一种新颖的鲁棒对偶学习框架RoDE,通过自适应学习机制、双模型交替训练和聚类一致性匹配,有效解决了无监督可见光-红外行人重识别中的伪标签噪声问题,显著提升了跨模态检索性能。
English: This paper introduces a novel Robust Duality Learning framework (RoDE) that addresses pseudo-label noise in unsupervised visible-infrared person re-identification through adaptive learning, dual-model training, and cluster consistency matching to improve cross-modal retrieval performance.

Authors:Lei Mao, Yuanhe Tian, Yan Song
Title: DNAZEN: Enhanced Gene Sequence Representations via Mixed Granularities of Coding Units
Abstract:
Genome modeling conventionally treats gene sequence as a language, reflecting its structured motifs and long-range dependencies analogous to linguistic units and organization principles such as words and syntax. Recent studies utilize advanced neural networks, ranging from convolutional and recurrent models to Transformer-based models, to capture contextual information of gene sequence, with the primary goal of obtaining effective gene sequence representations and thus enhancing the models' understanding of various running gene samples. However, these approaches often directly apply language modeling techniques to gene sequences without fully considering their intrinsic information organization, in particular how units at different granularities contribute to representation. In this paper, we propose DNAZEN, an enhanced genomic representation framework designed to learn from various granularities in gene sequences, including small polymers and G-grams that are combinations of several contiguous polymers. Specifically, we extract the G-grams from large-scale genomic corpora through an unsupervised approach to construct the G-gram vocabulary, which is used to provide G-grams in the learning process of DNA sequences by dynamically matching them against running gene samples. A Transformer-based G-gram encoder is also proposed; the matched G-grams are fed into it to compute their representations, which are then integrated into the encoder for basic unit (E4BU), which is responsible for encoding small units and maintaining the learning and inference process. To further enhance the learning process, we propose whole G-gram masking to train DNAZEN, where the model largely favors the selection of each entire G-gram to mask rather than an ordinary masking mechanism performed on basic units. Experiments on benchmark datasets demonstrate the effectiveness of DNAZEN on various downstream tasks.
中文: 本文提出DNAZEN基因组表示框架,通过无监督方法构建G-gram词汇表,并采用基于Transformer的编码器从基因序列的多粒度中学习表征,实验证明其在多项下游任务中具有显著效果。
English: This paper introduces DNAZEN, a genomic representation framework that learns from multiple granularities in gene sequences, using an unsupervised approach to construct a G-gram vocabulary and a Transformer-based encoder to enhance representation learning, with experiments confirming its effectiveness on various tasks.
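
As a rough illustration of what matching vocabulary G-grams against a running sequence could look like, here is a minimal sketch. It assumes the sequence has already been split into basic units and that the G-gram vocabulary is a set of unit tuples; `match_ggrams`, the tuple representation, and the `max_len` cap are hypothetical choices for this example and are not taken from the paper.

```python
def match_ggrams(units, vocab, max_len):
    """Return (start, end, ggram) for every contiguous span of 2..max_len
    basic units that appears in the G-gram vocabulary.

    units   -- list of basic units, e.g. ["AC", "GT", ...] (assumed tokenization)
    vocab   -- set of tuples of basic units (assumed vocabulary format)
    max_len -- longest G-gram, in units, to look for
    """
    matches = []
    for start in range(len(units)):
        for length in range(2, max_len + 1):
            end = start + length
            if end > len(units):
                break  # span would run past the end of the sequence
            span = tuple(units[start:end])
            if span in vocab:
                matches.append((start, end, span))
    return matches

units = ["AC", "GT", "TT", "AC", "GT"]
vocab = {("AC", "GT"), ("GT", "TT", "AC")}
found = match_ggrams(units, vocab, max_len=3)
```

Overlapping matches are all kept here; how overlaps are handled in the actual framework is not stated in the abstract.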

Authors:Yingda Fan, Runlong Yu, Janet R. Barclay, Alison P. Appling, Yiming Sun, Yiqun Xie, Xiaowei Jia
Title: Multi-Scale Graph Learning for Anti-Sparse Downscaling
Abstract:
Water temperature can vary substantially even across short distances within the same sub-watershed. Accurate prediction of stream water temperature at fine spatial resolutions (i.e., fine scales, $\leq$ 1 km) enables precise interventions to maintain water quality and protect aquatic habitats. Although spatiotemporal models have made substantial progress in spatially coarse time series modeling, challenges persist in predicting at fine spatial scales due to the lack of data at that scale. To address the problem of insufficient fine-scale data, we propose a Multi-Scale Graph Learning (MSGL) method. This method employs a multi-task learning framework where coarse-scale graph learning, bolstered by larger datasets, simultaneously enhances fine-scale graph learning. Although existing multi-scale or multi-resolution methods integrate data from different spatial scales, they often overlook the spatial correspondences across graph structures at various scales. To address this, our MSGL introduces an additional learning task, cross-scale interpolation learning, which leverages the hydrological connectedness of stream locations across coarse- and fine-scale graphs to establish cross-scale connections, thereby enhancing overall model performance. Furthermore, we show that multi-scale learning need not be limited to synchronous training by proposing an Asynchronous Multi-Scale Graph Learning method (ASYNC-MSGL). Extensive experiments demonstrate the state-of-the-art performance of our method for anti-sparse downscaling of daily stream temperatures in the Delaware River Basin, USA, highlighting its potential utility for water resources monitoring and management.
中文: 提出的多尺度图学习方法通过利用跨尺度水文连接和异步训练来解决精细尺度水温预测难题,从而提升模型性能。
English: The proposed Multi-Scale Graph Learning method addresses fine-scale water temperature prediction challenges by leveraging cross-scale hydrological connections and asynchronous training to enhance model performance.

Authors:Jing Liu, Yao Du, Kun Yang, Jiaqi Wu, Yan Wang, Xiping Hu, Zehua Wang, Yang Liu, Peng Sun, Azzedine Boukerche, Victor C. M. Leung
Title: Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey
Abstract:
Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications, integrating cloud resources with edge devices to enable efficient, low-latency processing. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems, yet introduce significant challenges in model deployment and resource management. In this survey, we comprehensively examine the intersection of distributed intelligence and model optimization within edge-cloud environments, providing a structured tutorial on fundamental architectures, enabling technologies, and emerging applications. Additionally, we systematically analyze model optimization approaches, including compression, adaptation, and neural architecture search, alongside AI-driven resource management strategies that balance performance, energy efficiency, and latency requirements. We further explore critical aspects of privacy protection and security enhancement within ECCC systems and examine practical deployments through diverse applications, spanning autonomous driving, healthcare, and industrial automation. Performance analysis and benchmarking techniques are also thoroughly explored to establish evaluation standards for these complex systems. Furthermore, the review identifies critical research directions including LLM deployment, 6G integration, neuromorphic computing, and quantum computing, offering a roadmap for addressing persistent challenges in heterogeneity management, real-time processing, and scalability. By bridging theoretical advancements and practical deployments, this survey offers researchers and practitioners a holistic perspective on leveraging AI to optimize distributed computing environments, fostering innovation in next-generation intelligent systems.
中文: 本综述全面探讨了边缘云协同计算中人工智能与分布式智能的融合,涵盖架构、优化技术和应用场景,并指出了应对部署与资源管理挑战的未来研究方向。
English: This survey comprehensively examines the integration of AI and distributed intelligence in edge-cloud collaborative computing, covering architectures, optimization techniques, and applications while identifying future research directions to address challenges in deployment and resource management.

Authors:Antonio Flores-Montoya, Junghee Lim, Adam Seitz, Akshay Sood, Edward Raff, James Holt
Title: Disassembly as Weighted Interval Scheduling with Learned Weights
Abstract:
Disassembly is the first step of a variety of binary analysis and transformation techniques, such as reverse engineering or binary rewriting. Recent disassembly approaches consist of three phases: an exploration phase that overapproximates the binary's code; an analysis phase that assigns weights to candidate instructions or basic blocks; and a conflict resolution phase that downselects the final set of instructions. We present a disassembly algorithm that generalizes this pattern for a wide range of architectures, namely x86, x64, arm32, and aarch64. Our algorithm presents a novel conflict resolution method that reduces disassembly to weighted interval scheduling.
中文: 本文提出了一种适用于多种架构的反汇编算法,将三阶段方法通用化,并采用加权区间调度的新型冲突解决策略。
English: This paper introduces a disassembly algorithm that generalizes a three-phase approach for multiple architectures and proposes a novel conflict resolution method using weighted interval scheduling.
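To make the reduction concrete, the following is a minimal sketch of textbook weighted interval scheduling, not the authors' implementation. It assumes each candidate instruction is a half-open byte range `[start, end)` with a hand-picked integer weight; in the paper the weights are learned, and the function name and the sample candidates here are hypothetical.

```python
from bisect import bisect_right


def weighted_interval_scheduling(intervals):
    """Pick non-overlapping (start, end, weight) intervals of maximum total weight.

    Intervals are half-open: [start, end). Returns (best_weight, chosen_intervals).
    """
    ordered = sorted(intervals, key=lambda iv: iv[1])  # sort by end offset
    ends = [iv[1] for iv in ordered]
    n = len(ordered)
    best = [0] * (n + 1)  # best[i] = optimum using the first i intervals
    prev = []             # prev[i] = how many earlier intervals end at or before ordered[i] starts
    for i, (start, _end, weight) in enumerate(ordered):
        p = bisect_right(ends, start, 0, i)
        prev.append(p)
        best[i + 1] = max(best[i], best[p] + weight)

    # Walk back through the table to recover one optimal selection.
    chosen = []
    i = n
    while i > 0:
        if best[i] == best[i - 1]:
            i -= 1  # interval i-1 is not needed for the optimum
        else:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
    chosen.reverse()
    return best[n], chosen


# Hypothetical overlapping instruction candidates: (start_byte, end_byte, weight).
candidates = [(0, 3, 5), (1, 4, 6), (3, 5, 4), (4, 7, 3), (5, 8, 2)]
total, selected = weighted_interval_scheduling(candidates)
print(total, selected)
```

Sorting by end offset plus a binary search for the last compatible interval gives the usual O(n log n) running time for this dynamic program.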

Authors:Ivor van der Hoog, Eva Rotenberg, Daniel Rutschmann
Title: A Combinatorial Proof of Universal Optimality for Computing a Planar Convex Hull
Abstract:
For a planar point set $P$, its convex hull is the smallest convex polygon that encloses all points in $P$. The construction of the convex hull from an array $I_P$ containing $P$ is a fundamental problem in computational geometry. By sorting $I_P$ in lexicographical order, one can construct the convex hull of $P$ in $O(n \log n)$ time which is worst-case optimal. Standard worst-case analysis, however, has been criticized as overly coarse or pessimistic, and researchers search for more refined analyses. Universal analysis provides an even stronger guarantee. It fixes a point set $P$ and considers the maximum running time across all permutations $I_P$ of $P$. Afshani, Barbay, Chan [FOCS'07] prove that the convex hull construction algorithm by Kirkpatrick, McQueen, and Seidel is universally optimal. Their proof restricts the model of computation to any algebraic decision tree model where the test functions have at most constant degree and at most a constant number of arguments. They rely upon involved algebraic arguments to construct a lower bound for each point set $P$ that matches the universal running time of [SICOMP'86]. We provide a different proof of universal optimality. Instead of restricting the computational model, we further specify the output. We require as output (1) the convex hull, and (2) for each internal point of $P$ a witness for it being internal. Our argument is shorter, perhaps simpler, and applicable in more general models of computation.
中文: 该摘要提出了一种证明凸包构造通用最优性的新方法,通过要求输出内部点见证简化了先前复杂的代数论证,并适用于更广泛的计算模型。
English: The abstract discusses a new proof for the universal optimality of convex hull construction that simplifies prior complex algebraic arguments by requiring additional output witnesses and applies to broader computational models.
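As an illustration of the sorting-based $O(n \log n)$ construction mentioned in the abstract, here is a minimal sketch of Andrew's monotone chain algorithm. It is a standard worst-case optimal method chosen for brevity; it is not the Kirkpatrick, McQueen, and Seidel algorithm analyzed in the paper, and the sample point set is hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order, starting at the lowest-x point."""
    pts = sorted(set(points))  # lexicographic sort dominates the running time
    if len(pts) <= 2:
        return pts

    lower = []
    for p in pts:
        # Pop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# A square with one interior point and one point on an edge.
hull = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)])
print(hull)
```

Points that are popped from a chain are exactly the internal or collinear points; the paper's output model additionally asks for a witness of internality for each such point, which this sketch does not produce.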

Authors:Chih-Hao Lin, Zian Wang, Ruofan Liang, Yuxuan Zhang, Sanja Fidler, Shenlong Wang, Zan Gojcic
Title: Controllable Weather Synthesis and Removal with Video Diffusion Models
Abstract:
Generating realistic and controllable weather effects in videos is valuable for many applications. Physics-based weather simulation requires precise reconstructions that are hard to scale to in-the-wild videos, while current video editing often lacks realism and control. In this work, we introduce WeatherWeaver, a video diffusion model that synthesizes diverse weather effects -- including rain, snow, fog, and clouds -- directly into any input video without the need for 3D modeling. Our model provides precise control over weather effect intensity and supports blending various weather types, ensuring both realism and adaptability. To overcome the scarcity of paired training data, we propose a novel data strategy combining synthetic videos, generative image editing, and auto-labeled real-world videos. Extensive evaluations show that our method outperforms state-of-the-art methods in weather simulation and removal, providing high-quality, physically plausible, and scene-identity-preserving results over various real-world videos.
中文: WeatherWeaver是一种视频扩散模型,无需3D建模即可将雨雪等多种逼真可控的天气效果直接合成到任意视频中,其创新的数据策略解决了训练数据稀缺问题,在真实感和场景保持方面优于现有方法。
English: WeatherWeaver is a video diffusion model that realistically synthesizes controllable weather effects like rain and fog into any video without 3D modeling, using a novel data strategy to overcome training data scarcity and outperforming existing methods in quality and adaptability.

Authors:Zhihan Jiang, Rui Ren, Guangba Yu, Yulun Wu, Wenwei Gu, Yichen Li, Yujie Huang, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu
Title: LLMPrism: Black-box Performance Diagnosis for Production LLM Training Platforms
Abstract:
Large Language Models (LLMs) have brought about revolutionary changes in diverse fields, rendering LLM training of utmost importance for modern enterprises. To meet this demand, multi-tenant large-scale LLM training platforms have been built to offer LLM training services. Nevertheless, due to the complexity and synchronous nature of the LLM training process, performance issues occur frequently and can result in substantial resource wastage. The limited visibility from the perspective of platform providers impedes existing profiling methods and poses challenges to the monitoring and diagnosis of the performance of LLM training jobs. For the first time, this paper proposes the utilization of underlying network flow data to reconstruct the training timelines of jobs based on the distinct characteristics in the LLM training procedure. We design LLMPrism, the first black-box performance diagnosis system for LLM training platforms. By progressively recognizing LLM training jobs, identifying their parallelism strategies, and reconstructing the training timelines, LLMPrism achieves non-intrusive, lightweight, and continuous monitoring of LLM training systems. Leveraging this monitoring capability, it further effectively diagnoses potential performance issues. Since Oct. 2024, LLMPrism has been deployed on our large-scale production Platform-X, in which the evaluations and deployment experiences demonstrate that LLMPrism can achieve accurate timeline reconstruction with an error within 0.3% and effectively diagnose various performance issues.
Chinese: 本文提出LLMPrism系统,通过利用网络流数据重构训练时间线,实现了对多租户大语言模型训练平台的无侵入式性能诊断,生产环境部署表明其具有高精度诊断能力。
English: This paper introduces LLMPrism, a non-intrusive system that uses network flow data to reconstruct training timelines and diagnose performance issues in multi-tenant LLM training platforms, achieving high accuracy in production deployment.

Authors:Yanliang Li, Wenbo Li, Qian Gong, Qing Liu, Norbert Podhorszki, Scott Klasky, Xin Liang, Jieyang Chen
Title: HP-MDR: High-performance and Portable Data Refactoring and Progressive Retrieval with Advanced GPUs
Abstract:
Scientific applications produce vast amounts of data, posing grand challenges in the underlying data management and analytic tasks. Progressive compression is a promising way to address this problem, as it allows for on-demand data retrieval with significantly reduced data movement cost. However, most existing progressive methods are designed for CPUs, leaving a gap for them to unleash the power of today's heterogeneous computing systems with GPUs. In this work, we propose HP-MDR, a high-performance and portable data refactoring and progressive retrieval framework for GPUs. Our contributions are four-fold: (1) We carefully optimize the bitplane encoding and lossless encoding, two key stages in progressive methods, to achieve high performance on GPUs; (2) We propose pipeline optimization and incorporate it with data refactoring and progressive retrieval workflows to further enhance the performance for large-scale data processing; (3) We leverage our framework to enable high-performance data retrieval with guaranteed error control for common Quantities of Interest; (4) We evaluate HP-MDR and compare it with state-of-the-art methods using five real-world datasets. Experimental results demonstrate that HP-MDR delivers up to 6.6x throughput in data refactoring and progressive retrieval tasks. It also leads to 10.4x throughput for recomposing required data representations under Quantity-of-Interest error control and 4.2x performance for the corresponding end-to-end data retrieval, when compared with state-of-the-art solutions.
中文: 科学应用面临数据管理挑战,渐进式压缩可通过按需检索和降低成本来解决,但现有的CPU方法未能充分利用异构系统中GPU的潜力。
English: Scientific applications face data management challenges, which progressive compression can address by enabling on-demand retrieval with reduced costs, yet existing CPU-focused methods fail to leverage GPU power in heterogeneous systems.
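The idea behind bitplane encoding for progressive retrieval can be sketched as follows. This is a simplified CPU illustration under strong assumptions (non-negative integers with a fixed bit width, no lossless stage, no GPU kernels) and is not HP-MDR's implementation; the function names and sample values are hypothetical.

```python
def encode_bitplanes(values, num_bits):
    """Split non-negative integers into bitplanes, most significant plane first."""
    return [[(v >> bit) & 1 for v in values]
            for bit in range(num_bits - 1, -1, -1)]


def decode_bitplanes(planes, num_bits, planes_used):
    """Reconstruct values from only the first `planes_used` bitplanes."""
    out = [0] * len(planes[0])
    for idx in range(planes_used):
        bit = num_bits - 1 - idx  # bit position that plane `idx` carries
        for j, b in enumerate(planes[idx]):
            out[j] |= b << bit
    return out


NUM_BITS = 8
data = [13, 6, 255, 0]
planes = encode_bitplanes(data, NUM_BITS)

coarse = decode_bitplanes(planes, NUM_BITS, 4)  # top 4 planes only
exact = decode_bitplanes(planes, NUM_BITS, 8)   # all planes
print(coarse, exact)
```

Reading only the first k of B planes leaves an absolute error of at most 2^(B-k) - 1 per value in this integer setting, which is what lets a reader stop fetching data once a requested error bound is met.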

Authors:Wenhan Dong, Yuemeng Zhao, Zhen Sun, Yule Liu, Zifan Peng, Jingyi Zheng, Zongmin Zhang, Ziyi Zhang, Jun Wu, Ruiming Wang, Shengmin Xu, Xinyi Huang, Xinlei He
Title: Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications
Abstract:
As large language models (LLMs) are increasingly used in human-centered tasks, assessing their psychological traits is crucial for understanding their social impact and ensuring trustworthy AI alignment. While existing reviews have covered some aspects of related research, several important areas have not been systematically discussed, including detailed discussions of diverse psychological tests, LLM-specific psychological datasets, and the applications of LLMs with psychological traits. To address this gap, we systematically review six key dimensions of applying psychological theories to LLMs: (1) assessment tools; (2) LLM-specific datasets; (3) evaluation metrics (consistency and stability); (4) empirical findings; (5) personality simulation methods; and (6) LLM-based behavior simulation. Our analysis highlights both the strengths and limitations of current methods. While some LLMs exhibit reproducible personality patterns under specific prompting schemes, significant variability remains across tasks and settings. Recognizing methodological challenges such as mismatches between psychological tools and LLMs' capabilities, as well as inconsistencies in evaluation practices, this study aims to propose future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs.
中文: 本文系统梳理了心理学理论在大型语言模型中的六个应用维度,既肯定了特定条件下模型人格特征的可复现性,也指出了当前评估方法存在工具适配性与评估一致性等关键挑战。
English: This review systematically examines six key dimensions of applying psychological theories to large language models, highlighting both their reproducible personality patterns under specific conditions and the methodological challenges in current assessment frameworks.

Authors:Huu-Thien Tran, Thanh-Dat Truong, Khoa Luu
Title: BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models
Abstract:
Large vision-language models have been widely adopted across various domains. However, developing a trustworthy system given the limited interpretability of large-scale models presents a significant challenge. One of the most prevalent failures associated with these systems is hallucination, where the language model generates a response that does not correspond to the visual content. To mitigate this problem, several approaches have been developed, and one prominent direction is to improve the decoding process. In this paper, we propose a new Bijective Maximum Likelihood Learning (BIMA) approach to hallucination mitigation using normalizing flow theories. The proposed BIMA method can efficiently mitigate the hallucination problem in prevailing vision-language models, resulting in significant improvements. Notably, BIMA achieves an average F1 score of 85.06% on the POPE benchmark and reduces CHAIRS and CHAIRI by 7.6% and 2.6%, respectively. To the best of our knowledge, this is one of the first studies that considers bijective mappings as a means to reduce hallucination induced by large vision-language models.
中文: 本文提出了一种基于标准化流理论的双射最大似然学习(BIMA)新方法,有效缓解大型视觉语言模型中的幻觉问题,在POPE基准上取得了85.06%的平均F1值,并显著降低了CHAIRS和CHAIRI指标。
English: This paper introduces a novel Bijective Maximum Likelihood Learning (BIMA) approach using normalizing flow theories to effectively mitigate hallucination in large vision-language models, achieving significant performance improvements on benchmarks like POPE with an 85.06% F1 score and notable reductions in CHAIRS and CHAIRI metrics.

Authors:Xin Jing, Jiadong Wang, Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller
Title: MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge
Abstract:
Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scalable alternative for annotating text data. However, the potential of LLMs to perform emotional speech data annotation without human supervision has yet to be thoroughly investigated. To address these problems, we apply GPT-4o to annotate a multimodal dataset collected from the sitcom Friends, using only textual cues as inputs. By crafting structured text prompts, our methodology capitalizes on the knowledge GPT-4o has accumulated during its training, showcasing that it can generate accurate and contextually relevant annotations without direct access to multimodal inputs. Therefore, we propose MELT, a multimodal emotion dataset fully annotated by GPT-4o. We demonstrate the effectiveness of MELT by fine-tuning four self-supervised learning (SSL) backbones and assessing speech emotion recognition performance across emotion datasets. Additionally, our subjective experiments' results demonstrate a consistent performance improvement on SER.
Chinese: 本研究介绍了MELT,一个仅通过文本线索由GPT-4o标注的多模态情感数据集,通过微调自监督学习模型证明了其在提升语音情感识别性能方面的有效性。
English: This study introduces MELT, a multimodal emotion dataset annotated by GPT-4o using only textual cues, demonstrating its effectiveness in improving speech emotion recognition performance through fine-tuning self-supervised learning models.

Authors:David Steinmann, Wolfgang Stammer, Antonia Wüst, Kristian Kersting
Title: Object Centric Concept Bottlenecks
Abstract:
Developing high-performing, yet interpretable models remains a critical challenge in modern AI. Concept-based models (CBMs) attempt to address this by extracting human-understandable concepts from a global encoding (e.g., image encoding) and then applying a linear classifier on the resulting concept activations, enabling transparent decision-making. However, their reliance on holistic image encodings limits their expressiveness in object-centric real-world settings and thus hinders their ability to solve complex vision tasks beyond single-label classification. To tackle these challenges, we introduce Object-Centric Concept Bottlenecks (OCB), a framework that combines the strengths of CBMs and pre-trained object-centric foundation models, boosting performance and interpretability. We evaluate OCB on complex image datasets and conduct a comprehensive ablation study to analyze key components of the framework, such as strategies for aggregating object-concept encodings. The results show that OCB outperforms traditional CBMs and allows one to make interpretable decisions for complex visual tasks.
Chinese: 对象中心概念瓶颈(OCB)框架通过将基于概念的模型与对象中心基础模型相结合,克服了传统整体图像编码的限制,在复杂视觉任务中显著提升了性能和可解释性。
English: The Object-Centric Concept Bottlenecks (OCB) framework enhances both performance and interpretability in complex visual tasks by integrating concept-based models with object-centric foundation models, overcoming the limitations of traditional holistic image encodings.
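The pipeline described in the abstract (per-object concept activations, an aggregation step, then a linear classifier) can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the OCB implementation: the concept names, class names, weights, and the choice of element-wise max as the aggregation strategy are all hypothetical.

```python
CONCEPTS = ["has_wheels", "has_fur", "has_wings"]  # hypothetical concept vocabulary
CLASSES = ["vehicle", "animal"]                    # hypothetical labels


def aggregate_max(object_concepts):
    """Element-wise max over the concept activation vectors of all objects."""
    return [max(column) for column in zip(*object_concepts)]


def linear_predict(concepts, weights, biases):
    """Linear classifier on concept activations; returns (class_index, scores)."""
    scores = [sum(w * c for w, c in zip(row, concepts)) + b
              for row, b in zip(weights, biases)]
    return scores.index(max(scores)), scores


# One activation vector per detected object (values are made up).
object_concepts = [
    [0.9, 0.0, 0.1],  # object 1: mostly "has_wheels"
    [0.1, 0.8, 0.0],  # object 2: mostly "has_fur"
]
weights = [
    [2.0, -1.0, 0.0],   # vehicle
    [-1.0, 2.0, 0.5],   # animal
]
biases = [0.0, 0.0]

image_concepts = aggregate_max(object_concepts)
pred, scores = linear_predict(image_concepts, weights, biases)
print(CLASSES[pred], dict(zip(CONCEPTS, image_concepts)))
```

Because the final layer is linear, each class score decomposes into per-concept contributions (`weight * activation`), which is the source of the interpretability that concept-based models aim for.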

Authors:Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang, Zhi Gao, Zilong Zheng, Lei Liu, Bin Li, Qing Li
Title: When Large Multimodal Models Confront Evolving Knowledge: Challenges and Pathways
Abstract:
Large language/multimodal models (LLMs/LMMs) store extensive pre-trained knowledge but struggle to maintain consistency with real-world updates, making it difficult to avoid catastrophic forgetting while acquiring evolving knowledge. Previous work focused on constructing textual knowledge datasets and exploring knowledge injection in LLMs, lacking exploration of multimodal evolving knowledge injection in LMMs. To address this, we propose the EVOKE benchmark to evaluate LMMs' ability to inject multimodal evolving knowledge in real-world scenarios. Meanwhile, a comprehensive evaluation of multimodal evolving knowledge injection revealed two challenges: (1) Existing knowledge injection methods perform terribly on evolving knowledge. (2) Supervised fine-tuning causes catastrophic forgetting; in particular, instruction-following ability is severely compromised. Additionally, we provide pathways and find that: (1) Text knowledge augmentation during the training phase improves performance, while image augmentation does not. (2) Continual learning methods, especially Replay and MoELoRA, effectively mitigate forgetting. Our findings indicate that current knowledge injection methods have many limitations on evolving knowledge, which motivates further research on more efficient and stable knowledge injection methods.
中文: 大型多模态模型在融入动态现实世界知识时面临灾难性遗忘的挑战,为此我们提出了EVOKE基准来评估知识注入方法,发现现有方法存在局限,但文本增强和持续学习等技术能有效缓解这一问题。
English: Large multimodal models face challenges in integrating evolving real-world knowledge without catastrophic forgetting, prompting the creation of the EVOKE benchmark to assess and improve knowledge injection methods, which reveals current limitations and potential solutions like text augmentation and continual learning.

Authors:Liangyang Ouyang, Yuki Sakai, Ryosuke Furuta, Hisataka Nozawa, Hikoro Matsui, Yoichi Sato
Title: Leadership Assessment in Pediatric Intensive Care Unit Team Training
Abstract:
This paper addresses the task of assessing a PICU team's leadership skills by developing an automated analysis framework based on egocentric vision. We identify key behavioral cues, including fixation object, eye contact, and conversation patterns, as essential indicators of leadership assessment. In order to capture these multimodal signals, we employ Aria Glasses to record egocentric video, audio, gaze, and head movement data. We collect one-hour videos of four simulated sessions involving doctors with different roles and levels. To automate data processing, we propose a method leveraging REMoDNaV, SAM, YOLO, and ChatGPT for fixation object detection, eye contact detection, and conversation classification. In the experiments, significant correlations are observed between leadership skills and behavioral metrics, i.e., the output of our proposed methods, such as fixation time, transition patterns, and direct orders in speech. These results indicate that our proposed data collection and analysis framework can effectively solve skill assessment for training PICU teams.
中文摘要:本文开发了一种基于自我中心视觉的自动化框架,通过分析注视对象和对话模式等行为线索来评估PICU团队领导力,实验表明这些行为指标与领导技能存在显著相关性。
English Summary: This paper develops an automated framework using egocentric vision to assess PICU team leadership by analyzing behavioral cues like fixation objects and conversation patterns, showing significant correlation between these metrics and leadership skills.

Authors:Yichi Zhang, Gongwei Chen, Jun Zhu, Jia Wan, Liqiang Nie
Title: Beyond Quantity: Distribution-Aware Labeling for Visual Grounding
Abstract:
Visual grounding requires large and diverse region-text pairs. However, manual annotation is costly and fixed vocabularies restrict scalability and generalization. Existing pseudo-labeling pipelines often overfit to biased distributions and generate noisy or redundant samples. Through our systematic analysis of data quality and distributional coverage, we find that performance gains come less from raw data volume and more from effective distribution expansion. Motivated by this insight, we propose DAL, a distribution-aware labeling framework for visual grounding. The proposed method first employs a dual-driven annotation module, where a closed-set path provides reliable pseudo labels and an open-set path enriches vocabulary and introduces novel concepts; meanwhile, it further performs explicit out-of-distribution (OOD) expression expansion to broaden semantic coverage. We then propose a consistency- and distribution-aware filtering module to discard noisy or redundant region-text pairs and rebalance underrepresented linguistic and visual content, thereby improving both data quality and training efficiency. Extensive experiments on three benchmarks demonstrate that our method consistently outperforms strong baselines and achieves state-of-the-art results, underscoring the critical role of distribution-aware labeling in building scalable and robust visual grounding datasets.
中文摘要:提出的分布感知标注(DAL)框架通过双驱动标注和分布感知过滤来扩展语义覆盖范围,从而提升视觉定位性能,在多个基准测试中达到最优结果。
English Summary: The proposed Distribution-Aware Labeling (DAL) framework enhances visual grounding by expanding semantic coverage through dual-driven annotation and distribution-aware filtering, achieving state-of-the-art performance across benchmarks.

Authors:Mingxu Zhang, Xiaoqi Li, Jiahui Xu, Kaichen Zhou, Hojin Bae, Yan Shen, Chuyan Xiong, Hao Dong
Title: SR3D: Unleashing Single-view 3D Reconstruction for Transparent and Specular Object Grasping
Abstract:
Recent advancements in 3D robotic manipulation have improved grasping of everyday objects, but transparent and specular materials remain challenging due to depth sensing limitations. While several 3D reconstruction and depth completion approaches address these challenges, they suffer from setup complexity or limited observation information utilization. To address this, leveraging the power of single-view 3D object reconstruction approaches, we propose a training-free framework SR3D that enables robotic grasping of transparent and specular objects from a single-view observation. Specifically, given single-view RGB and depth images, SR3D first uses external visual models to generate a 3D reconstructed object mesh based on the RGB image. Then, the key idea is to determine the 3D object's pose and scale to accurately localize the reconstructed object back into its original depth-corrupted 3D scene. Therefore, we propose view matching and keypoint matching mechanisms, which leverage the inherent semantic and geometric information of both 2D and 3D observations to determine the object's 3D state within the scene, thereby reconstructing an accurate 3D depth map for effective grasp detection. Experiments in both simulation and the real world show the reconstruction effectiveness of SR3D.
中文:SR3D框架通过视图和关键点匹配,从单视角RGB-D图像重建精确的3D深度图,无需额外训练即可实现机器人对透明和镜面物体的抓取。
English: The SR3D framework enables robotic grasping of transparent and specular objects by reconstructing accurate 3D depth maps from single-view RGB-D images through view and keypoint matching, without requiring additional training.

Authors:Mingyi He, Yuebing Liang, Shenhao Wang, Yunhan Zheng, Qingyi Wang, Dingyi Zhuang, Li Tian, Jinhua Zhao
Title: Generative AI for Urban Design: A Stepwise Approach Integrating Human Expertise with Multimodal Diffusion Models
Abstract:
Urban design is a multifaceted process that demands careful consideration of site-specific constraints and collaboration among diverse professionals and stakeholders. The advent of generative artificial intelligence (GenAI) offers transformative potential by improving the efficiency of design generation and facilitating the communication of design ideas. However, most existing approaches are not well integrated with human design workflows. They often follow end-to-end pipelines with limited control, overlooking the iterative nature of real-world design. This study proposes a stepwise generative urban design framework that integrates multimodal diffusion models with human expertise to enable more adaptive and controllable design processes. Instead of generating design outcomes in a single end-to-end process, the framework divides the process into three key stages aligned with established urban design workflows: (1) road network and land use planning, (2) building layout planning, and (3) detailed planning and rendering. At each stage, multimodal diffusion models generate preliminary designs based on textual prompts and image-based constraints, which can then be reviewed and refined by human designers. We design an evaluation framework to assess the fidelity, compliance, and diversity of the generated designs. Experiments using data from Chicago and New York City demonstrate that our framework outperforms baseline models and end-to-end approaches across all three dimensions. This study underscores the benefits of multimodal diffusion models and stepwise generation in preserving human control and facilitating iterative refinements, laying the groundwork for human-AI interaction in urban design solutions.
Chinese: 本研究提出了一种分步生成式城市设计框架,通过多模态扩散模型与人类专业知识相结合,在保留设计控制权的同时,在保真度、合规性和多样性方面均优于基准模型。
English: This study introduces a stepwise generative urban design framework that integrates multimodal diffusion models with human expertise, enhancing design adaptability and control while outperforming baseline models in fidelity, compliance, and diversity.

Authors:Kaiyuan Zhang, Zian Su, Pin-Yu Chen, Elisa Bertino, Xiangyu Zhang, Ninghui Li
Title: LLM Agents Should Employ Security Principles
Abstract:
Large Language Model (LLM) agents show considerable promise for automating complex tasks using contextual reasoning; however, interactions involving multiple agents and the system's susceptibility to prompt injection and other forms of context manipulation introduce new vulnerabilities related to privacy leakage and system exploitation. This position paper argues that the well-established design principles in information security, which are commonly referred to as security principles, should be employed when deploying LLM agents at scale. Design principles such as defense-in-depth, least privilege, complete mediation, and psychological acceptability have helped guide the design of mechanisms for securing information systems over the last five decades, and we argue that their explicit and conscientious adoption will help secure agentic systems. To illustrate this approach, we introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle. We evaluate with state-of-the-art LLMs along three dimensions: benign utility, attack utility, and attack success rate. AgentSandbox maintains high utility for its intended functions under both benign and adversarial evaluations while substantially mitigating privacy risks. By embedding secure design principles as foundational elements within emerging LLM agent protocols, we aim to promote trustworthy agent ecosystems aligned with user privacy expectations and evolving regulatory requirements.
中文: 该立场文件主张将纵深防御、最小权限等成熟的信息安全原则应用于大规模语言模型代理,以防范隐私泄露和上下文操纵等漏洞,并提出了AgentSandbox概念框架,在代理全生命周期嵌入这些安全措施,在保持高效用的同时显著降低隐私风险。
English: This position paper advocates for applying established information security principles, such as defense-in-depth and least privilege, to secure large language model agents against vulnerabilities like privacy leakage and context manipulation, introducing the AgentSandbox framework to embed these safeguards throughout the agent lifecycle while maintaining high utility.
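Two of the principles named above, least privilege and complete mediation, can be illustrated with a small tool-call gate. This is a hypothetical sketch and not the AgentSandbox framework: the class name, tool names, and policy format are assumptions made purely for illustration.

```python
class PolicyError(Exception):
    """Raised when an agent requests a tool it was not granted."""


class MediatedToolbox:
    """Routes every tool call through one policy check (complete mediation)."""

    def __init__(self, tools, granted):
        self._tools = dict(tools)
        self._granted = frozenset(granted)  # least privilege: explicit allow-list
        self.audit_log = []                 # record of every attempted call

    def call(self, name, *args):
        allowed = name in self._granted and name in self._tools
        self.audit_log.append((name, allowed))
        if not allowed:
            raise PolicyError(f"tool {name!r} is not permitted for this agent")
        return self._tools[name](*args)


# Hypothetical tools available on the host.
tools = {
    "read_calendar": lambda day: f"events on {day}",
    "send_email": lambda to, body: f"sent to {to}",
}

# A scheduling agent only needs read access to the calendar.
box = MediatedToolbox(tools, granted={"read_calendar"})
result = box.call("read_calendar", "Monday")

try:
    box.call("send_email", "someone@example.com", "hello")
    blocked = False
except PolicyError:
    blocked = True

print(result, blocked, box.audit_log)
```

Keeping the raw tool table private and exposing only `call` means an injected instruction cannot reach `send_email` without passing the same check as every other request.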

Authors:Chenhao Zheng, Jieyu Zhang, Mohammadreza Salehi, Ziqi Gao, Vishnu Iyengar, Norimasa Kobori, Quan Kong, Ranjay Krishna
Title: One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
Abstract:
Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% in average top-5 recall on the video-text retrieval task with a 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average of 5.2% performance improvement across 6 VideoQA benchmarks while having 4x faster training time and 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.
中文: TrajViT提出基于全景对象轨迹的接地视频标记化方法,在显著降低计算冗余的同时,通过更快的训练和推理速度,在多项视频理解基准测试中全面超越ViT3D。
English: TrajViT introduces grounded video tokenization by organizing tokens along panoptic object trajectories, significantly reducing computational redundancy while outperforming ViT3D across multiple benchmarks with faster training and inference.

Authors:Jianbo Zhao, Taiyu Ban, Xiyang Wang, Qibin Zhou, Hangning Zhou, Zhihao Liu, Mu Yang, Lei Liu, Bin Li
Title: Autoregressive Meta-Actions for Unified Controllable Trajectory Generation
Abstract:
Controllable trajectory generation guided by high-level semantic decisions, termed meta-actions, is crucial for autonomous driving systems. A significant limitation of existing frameworks is their reliance on invariant meta-actions assigned over fixed future time intervals, causing temporal misalignment with the actual behavior trajectories. This misalignment leads to irrelevant associations between the prescribed meta-actions and the resulting trajectories, disrupting task coherence and limiting model performance. To address this challenge, we introduce Autoregressive Meta-Actions, an approach integrated into autoregressive trajectory generation frameworks that provides a unified and precise definition for meta-action-conditioned trajectory prediction. Specifically, we decompose traditional long-interval meta-actions into frame-level meta-actions, enabling a sequential interplay between autoregressive meta-action prediction and meta-action-conditioned trajectory generation. This decomposition ensures strict alignment between each trajectory segment and its corresponding meta-action, achieving a consistent and unified task formulation across the entire trajectory span and significantly reducing complexity. Moreover, we propose a staged pre-training process to decouple the learning of basic motion dynamics from the integration of high-level decision control, which offers flexibility, stability, and modularity. Experimental results validate our framework's effectiveness, demonstrating improved trajectory adaptivity and responsiveness to dynamic decision-making scenarios. We provide the video document and dataset, which are available at https://arma-traj.github.io/.
中文摘要:本文提出的自回归元动作框架通过将长间隔元动作分解为帧级动作,解决了自动驾驶中轨迹与语义决策的时间错位问题,实现了轨迹段与元动作的精确对齐,并通过分阶段预训练显著提升了动态决策场景下的适应性和响应能力。
English Summary: The proposed Autoregressive Meta-Actions framework addresses temporal misalignment in autonomous driving by decomposing long-interval meta-actions into frame-level actions, ensuring precise alignment between trajectory segments and semantic decisions while improving adaptability through staged pre-training.

Authors:Kevin Frans, Seohong Park, Pieter Abbeel, Sergey Levine
Title: Diffusion Guidance Is a Controllable Policy Improvement Operator
Abstract:
At the core of reinforcement learning is the idea of learning beyond the performance in the data. However, scaling such systems has proven notoriously tricky. In contrast, techniques from generative modeling have proven remarkably scalable and are simple to train. In this work, we combine these strengths, by deriving a direct relation between policy improvement and guidance of diffusion models. The resulting framework, CFGRL, is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend -- increased guidance weighting leads to increased performance. Of particular importance, CFGRL can operate without explicitly learning a value function, allowing us to generalize simple supervised methods (e.g., goal-conditioned behavioral cloning) to further prioritize optimality, gaining performance for "free" across the board.
中文摘要:CFGRL框架将强化学习与扩散模型相结合,通过简单的监督训练实现策略优化,无需显式学习价值函数,在离线任务中持续提升性能表现。
English Summary: The CFGRL framework integrates reinforcement learning with diffusion models, enabling policy improvement through simple supervised training without requiring explicit value functions, which consistently boosts performance across offline tasks.
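
The abstract above reports that a larger guidance weight gave better offline RL performance. As a rough illustration of what a guidance weight does, the sketch below shows the standard classifier-free guidance combination of an unconditional and a conditional model output. It is not taken from the paper: the function name `guided_score` and the plain-list representation are hypothetical, and CFGRL's exact formulation may differ.

```python
from typing import List, Sequence


def guided_score(uncond: Sequence[float], cond: Sequence[float], w: float) -> List[float]:
    """Standard classifier-free guidance: uncond + w * (cond - uncond).

    Hypothetical helper for illustration only; not the authors' implementation.
    w = 0 recovers the unconditional output, w = 1 the conditional output,
    and w > 1 extrapolates further in the conditional direction.
    """
    if len(uncond) != len(cond):
        raise ValueError("uncond and cond must have the same length")
    return [u + w * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    # Sweep the guidance weight on a toy two-dimensional output.
    for w in (0.0, 1.0, 2.0):
        print(w, guided_score([0.0, 0.0], [1.0, 2.0], w))
```

In this reading, the weight `w` is the single knob that trades off staying close to the data-derived behavior against pushing toward the conditioned (e.g., optimality- or goal-conditioned) behavior.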

Authors:Hongcan Guo, Guoshun Nan, Yuan Yang, Diyang Zhang, Haotian Li, Zhican Chen, Qinchuan Zhou, Yuhan Ran, Xinye Cao, Sicong Leng, Xiaofeng Tao, Xudong Jiang
Title: Two Is Better Than One: Rotations Scale LoRAs
Abstract:
Scaling Low-Rank Adaptation (LoRA)-based Mixture-of-Experts (MoE) helps large language models (LLMs) adapt efficiently to diverse tasks. However, traditional gating mechanisms that route inputs to the best experts may fundamentally hinder LLMs' scalability, leading to poor generalization and underfitting issues. We identify that the root cause lies in the restricted expressiveness of existing weighted-sum mechanisms, both within and outside the convex cone of LoRA representations. This motivates us to propose RadarGate, a novel geometrically inspired gating method that introduces rotational operations on LoRA representations to boost the expressiveness and facilitate richer feature interactions among multiple LoRAs for scalable LLMs. Specifically, we first fuse each LoRA representation with other LoRAs using a learnable component and then feed the output to a rotation matrix. This matrix involves learnable parameters that define the relative angular relationship between LoRA representations. Such a simple yet effective mechanism provides an extra degree of freedom, facilitating the learning of cross-LoRA synergies and properly tackling the challenging poor generalization and underfitting issues as the number of LoRAs grows. Extensive experiments on 6 public benchmarks across 21 tasks show the effectiveness of our RadarGate for scaling LoRAs. We also provide valuable insights, revealing that the rotations applied to each pair of representations are contrastive, encouraging closer alignment of semantically similar representations during geometrical transformation while pushing distant ones further apart. We will release our code to the community.
中文摘要:RadarGate提出了一种几何启发的门控机制,通过旋转操作增强表达能力,有效解决了基于LoRA的专家混合模型在扩展时面临的泛化问题。
English Summary: RadarGate introduces a geometrically inspired gating mechanism using rotational operations to enhance expressiveness and address generalization issues in scaling LoRA-based Mixture-of-Experts for large language models.

Authors:Zhuoran Duan, Guoshun Nan, Rushan Li, Zijun Wang, Lihua Xiong, Chaoying Yuan, Guorong Liu, Hui Xu, Qimei Cui, Xiaofeng Tao, Tony Q. S. Quek
Title: Agile Orchestration at Will: An Entire Smart Service-Based Security Architecture Towards 6G
Abstract:
The upcoming 6G will fundamentally reshape mobile networks beyond communications, unlocking a multitude of applications that were once considered unimaginable. Meanwhile, security and resilience are especially highlighted in the 6G design principles. However, safeguarding 6G networks will be quite challenging due to various known and unknown threats from highly heterogeneous networks and diversified security requirements of distinct use cases, calling for a comprehensive re-design of security architecture. This motivates us to propose ES3A (Entire Smart Service-based Security Architecture), a novel security architecture for 6G networks. Specifically, we first discuss six high-level principles of our ES3A that include hierarchy, flexibility, scalability, resilience, endogeny, and trust and privacy. With these goals in mind, we then introduce three guidelines from a deployment perspective, envisioning our ES3A that offers service-based security, end-to-end protection, and smart security automation for 6G networks. Our architecture consists of three layers and three domains. It relies on a two-stage orchestration mechanism to tailor smart security strategies for customized protection in high-dynamic 6G networks, thereby addressing the aforementioned challenges. Finally, we prototype the proposed ES3A on a real-world radio system based on Software-Defined Radio (SDR). Experiments show the effectiveness of our ES3A. We also provide a case to show the superiority of our architecture.
中文: 提出的ES3A架构通过分层灵活设计和智能自动化应对6G安全挑战,提供端到端保护,并已通过实际原型验证其有效性。
English: The proposed ES3A architecture addresses 6G security challenges through hierarchical, flexible design and smart automation, offering end-to-end protection validated by real-world prototyping.

Authors:Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy
Title: NegVQA: Can Vision Language Models Understand Negation?
Abstract:
Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.
中文: NegVQA基准测试表明视觉语言模型在理解否定语义方面存在显著困难,不仅表现大幅下降,还呈现出模型规模与性能间的U型缩放规律。
English: The NegVQA benchmark reveals that vision language models struggle significantly with understanding negation, showing performance drops and a U-shaped scaling trend with model size increases.

Authors:Christopher Polzak, Alejandro Lozano, Min Woo Sun, James Burgess, Yuhui Zhang, Kevin Wu, Serena Yeung-Levy
Title: Can Large Language Models Match the Conclusions of Systematic Reviews?
Abstract:
Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, and medical specialist models of varying sizes (from 7B to 700B). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians. We release our codebase and benchmark to the broader research community to further investigate LLM-based SR systems.
中文: 尽管大语言模型已在临床中使用,但它们因缺乏科学质疑精神且在处理长文本时性能下降,目前仍无法达到专家撰写的系统评价结论水平。
English: Large language models currently cannot match the conclusions of expert-conducted systematic reviews, as they lack scientific skepticism and show performance degradation with longer inputs, despite being used in clinical settings.

Authors:Feng Yao, Zilong Wang, Liyuan Liu, Junxia Cui, Li Zhong, Xiaohan Fu, Haohui Mai, Vish Krishnan, Jianfeng Gao, Jingbo Shang
Title: Training Language Models to Generate Quality Code with Program Analysis Feedback
Abstract:
Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
中文: REAL强化学习框架通过自动化程序分析和单元测试激励大语言模型生成高质量代码,无需人工干预即可确保安全性、可维护性和功能正确性,在质量和可扩展性上优于现有方法。
English: REAL, a reinforcement learning framework, enhances code generation by using automated program analysis and unit tests to ensure security, maintainability, and functional correctness without manual intervention, outperforming existing methods in quality and scalability.

Authors:Zijian Liang, Kai Niu, Changshuo Wang, Jin Xu, Ping Zhang
Title: Synonymous Variational Inference for Perceptual Image Compression
Abstract:
Recent contributions of semantic information theory reveal the set-element relationship between semantic and syntactic information, represented as synonymous relationships. In this paper, we propose a synonymous variational inference (SVI) method based on this synonymity viewpoint to re-analyze the perceptual image compression problem. It takes perceptual similarity as a typical synonymous criterion to build an ideal synonymous set (Synset), and approximates the posterior of its latent synonymous representation with a parametric density by minimizing a partial semantic KL divergence. This analysis theoretically proves that the optimization direction of perceptual image compression follows a triple tradeoff that can cover the existing rate-distortion-perception schemes. Additionally, we introduce synonymous image compression (SIC), a new image compression scheme that corresponds to the analytical process of SVI, and implement a progressive SIC codec to fully leverage the model's capabilities. Experimental results demonstrate comparable rate-distortion-perception performance using a single progressive SIC codec, thus verifying the effectiveness of our proposed analysis method.
Chinese: 本文提出同义变分推断(SVI)和同义图像压缩(SIC)方法,通过构建同义集合和优化部分语义KL散度,理论证明了感知图像压缩的三重权衡关系,并利用渐进式编解码器验证了该方法的有效性。
English: This paper introduces synonymous variational inference (SVI) and synonymous image compression (SIC) to address perceptual image compression, demonstrating a triple tradeoff that encompasses existing rate-distortion-perception methods and validating the approach with a progressive codec.

Authors:Aihu Zhang, Jiaxing Xu, Mengcheng Lan, Shili Xiang, Yiping Ke
Title: Directed Homophily-Aware Graph Neural Network
Abstract:
Graph Neural Networks (GNNs) have achieved significant success in various learning tasks on graph-structured data. Nevertheless, most GNNs struggle to generalize to heterophilic neighborhoods. Additionally, many GNNs ignore the directional nature of real-world graphs, resulting in suboptimal performance on directed graphs with asymmetric structures. In this work, we propose Directed Homophily-aware Graph Neural Network (DHGNN), a novel framework that addresses these limitations by incorporating homophily-aware and direction-sensitive components. DHGNN employs a resettable gating mechanism to adaptively modulate message contributions based on homophily levels and informativeness, and a structure-aware noise-tolerant fusion module to effectively integrate node representations from the original and reverse directions. Extensive experiments on both homophilic and heterophilic directed graph datasets demonstrate that DHGNN outperforms state-of-the-art methods in node classification and link prediction. In particular, DHGNN improves over the best baseline by up to 15.07% in link prediction. Our analysis further shows that the gating mechanism captures directional homophily gaps and fluctuating homophily across layers, providing deeper insights into message-passing behavior on complex graph structures.
中文: 提出的有向同配感知图神经网络(DHGNN)通过自适应门控和噪声容忍融合机制,有效解决了异配邻域和方向性图结构的处理难题,在节点分类和链接预测任务中实现了最优性能。
English: The proposed Directed Homophily-aware Graph Neural Network (DHGNN) overcomes limitations in handling heterophilic neighborhoods and directional graph structures through adaptive gating and noise-tolerant fusion, achieving superior performance in node classification and link prediction tasks.

Authors:Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, Yahui Zhou
Title: Skywork Open Reasoner 1 Technical Report
Abstract:
The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.
中文: Skywork-OR1通过可扩展的强化学习方法显著提升了大语言模型的推理能力,在多项基准测试中实现重大性能突破,同时解决了熵塌缩问题并全面开源了模型资源。
English: Skywork-OR1 significantly enhances reasoning in large language models through scalable reinforcement learning, achieving major accuracy gains on key benchmarks while addressing entropy collapse and fully open-sourcing its resources.

Authors:Zhisong Wang, Yiwen Ye, Ziyang Chen, Yong Xia
Title: Enjoying Information Dividend: Gaze Track-based Medical Weakly Supervised Segmentation
Abstract:
Weakly supervised semantic segmentation (WSSS) in medical imaging struggles with effectively using sparse annotations. One promising direction for WSSS leverages gaze annotations, captured via eye trackers that record regions of interest during diagnostic procedures. However, existing gaze-based methods, such as GazeMedSeg, do not fully exploit the rich information embedded in gaze data. In this paper, we propose GradTrack, a framework that utilizes physicians' gaze track, including fixation points, durations, and temporal order, to enhance WSSS performance. GradTrack comprises two key components: Gaze Track Map Generation and Track Attention, which collaboratively enable progressive feature refinement through multi-level gaze supervision during the decoding process. Experiments on the Kvasir-SEG and NCI-ISBI datasets demonstrate that GradTrack consistently outperforms existing gaze-based methods, achieving Dice score improvements of 3.21% and 2.61%, respectively. Moreover, GradTrack significantly narrows the performance gap with fully supervised models such as nnUNet.
Chinese: GradTrack通过利用包含注视点、持续时间和时间顺序在内的全面眼动数据,结合其眼动轨迹图生成与轨迹注意力模块,显著提升了医学图像中的弱监督语义分割性能,在基准数据集上表现优异,并缩小了与全监督模型的差距。
English: GradTrack enhances weakly supervised semantic segmentation in medical imaging by leveraging comprehensive gaze data, including fixation points, durations, and temporal order, through its Gaze Track Map Generation and Track Attention components, achieving notable performance improvements on benchmark datasets and narrowing the gap with fully supervised models.

Authors:Huaijin Pi, Zhi Cen, Zhiyang Dou, Taku Komura
Title: CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects
Abstract:
Synthesizing whole-body manipulation of articulated objects, including body motion, hand motion, and object motion, is a critical yet challenging task with broad applications in virtual humans and robotics. The core challenges are twofold. First, achieving realistic whole-body motion requires tight coordination between the hands and the rest of the body, as their movements are interdependent during manipulation. Second, articulated object manipulation typically involves high degrees of freedom and demands higher precision, often requiring the fingers to be placed at specific regions to actuate movable parts. To address these challenges, we propose a novel coordinated diffusion noise optimization framework. Specifically, we perform noise-space optimization over three specialized diffusion models for the body, left hand, and right hand, each trained on its own motion dataset to improve generalization. Coordination naturally emerges through gradient flow along the human kinematic chain, allowing the global body posture to adapt in response to hand motion objectives with high fidelity. To further enhance precision in hand-object interaction, we adopt a unified representation based on basis point sets (BPS), where end-effector positions are encoded as distances to the same BPS used for object geometry. This unified representation captures fine-grained spatial relationships between the hand and articulated object parts, and the resulting trajectories serve as targets to guide the optimization of diffusion noise, producing highly accurate interaction motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility, and enables various capabilities such as object pose control, simultaneous walking and manipulation, and whole-body generation from hand-only data.
中文摘要:本文提出一种协调扩散噪声优化框架,通过梯度流协调身体与手部运动,并采用基于基点的统一表示实现精确的手物交互,从而合成逼真的全身操控动作。
English Summary: This paper introduces a coordinated diffusion noise optimization framework that synthesizes realistic whole-body manipulation by coordinating body and hand movements through gradient flow and a unified basis point set representation for precise hand-object interactions.

Authors:Jieyu Yuan, Yujun Li, Yuanlin Zhang, Chunle Guo, Xiongxin Tang, Ruixing Wang, Chongyi Li
Title: 3D-UIR: 3D Gaussian for Underwater 3D Scene Reconstruction via Physics Based Appearance-Medium Decoupling
Abstract:
Novel view synthesis for underwater scene reconstruction presents unique challenges due to complex light-media interactions. Optical scattering and absorption in the water body introduce inhomogeneous medium attenuation that violates the conventional volume rendering assumption of a uniform propagation medium. While 3D Gaussian Splatting (3DGS) offers real-time rendering capabilities, it struggles with underwater inhomogeneous environments where scattering media introduce artifacts and inconsistent appearance. In this study, we propose a physics-based framework that disentangles object appearance from water medium effects through tailored Gaussian modeling. Our approach introduces appearance embeddings, which are explicit medium representations for backscatter and attenuation, enhancing scene consistency. In addition, we propose a distance-guided optimization strategy that leverages pseudo-depth maps as supervision with depth regularization and scale penalty terms to improve geometric fidelity. By integrating the proposed appearance and medium modeling components via an underwater imaging model, our approach achieves both high-quality novel view synthesis and physically accurate scene restoration. Experiments demonstrate significant improvements in rendering quality and restoration accuracy over existing methods. The project page is available at https://bilityniu.github.io/3D-UIR.
中文摘要:本研究提出了一种基于物理的框架,通过定制高斯建模和距离引导优化策略,将物体外观与水体介质效应分离,显著提升了水下新视角合成的渲染质量和场景复原精度。
English Summary: This study introduces a physics-based framework that enhances underwater novel view synthesis by separating object appearance from water medium effects using tailored Gaussian modeling and a distance-guided optimization strategy, achieving superior rendering quality and scene restoration.

Authors:Wei Li, Hebei Li, Yansong Peng, Siying Wu, Yueyi Zhang, Xiaoyan Sun
Title: Create Anything Anywhere: Layout-Controllable Personalized Diffusion Model for Multiple Subjects
Abstract:
Diffusion models have significantly advanced text-to-image generation, laying the foundation for the development of personalized generative frameworks. However, existing methods lack precise layout controllability and overlook the potential of dynamic features of reference subjects in improving fidelity. In this work, we propose Layout-Controllable Personalized Diffusion (LCP-Diffusion) model, a novel framework that integrates subject identity preservation with flexible layout guidance in a tuning-free approach. Our model employs a Dynamic-Static Complementary Visual Refining module to comprehensively capture the intricate details of reference subjects, and introduces a Dual Layout Control mechanism to enforce robust spatial control across both training and inference stages. Extensive experiments validate that LCP-Diffusion excels in both identity preservation and layout controllability. To the best of our knowledge, this is a pioneering work enabling users to "create anything anywhere".
Chinese: LCP-Diffusion模型通过动态-静态互补优化模块和双重布局控制机制,在无需调参的情况下实现了身份特征保持与灵活布局控制的突破,显著提升了生成图像的保真度与空间精确性。
English: The LCP-Diffusion model introduces a tuning-free framework that enhances personalized text-to-image generation through dynamic-static feature refinement and dual layout control, achieving superior identity preservation and spatial precision.

Authors:Zhefeng Cao, Ben Liu, Sen Li, Wei Zhang, Hua Chen
Title: G-DReaM: Graph-conditioned Diffusion Retargeting across Multiple Embodiments
Abstract:
Motion retargeting for a specific robot from existing motion datasets is a critical step in transferring motion patterns from human behaviors to and across various robots. However, inconsistencies in topological structure, geometrical parameters, and joint correspondence make it difficult to handle diverse embodiments with a unified retargeting architecture. In this work, we propose a novel unified graph-conditioned diffusion-based motion generation framework for retargeting reference motions across diverse embodiments. The intrinsic characteristics of heterogeneous embodiments are represented with a graph structure that effectively captures topological and geometrical features of different robots. Such a graph-based encoding further allows for knowledge exploitation at the joint level with a customized attention mechanism developed in this work. Because ground-truth motions of the desired embodiment are lacking, we utilize an energy-based guidance formulated as retargeting losses to train the diffusion model. As one of the first cross-embodiment motion retargeting methods in robotics, our experiments validate that the proposed model can retarget motions across heterogeneous embodiments in a unified manner. Moreover, it demonstrates a certain degree of generalization to both diverse skeletal structures and similar motion patterns.
中文摘要:本研究提出了一种基于图条件扩散的统一运动重定向框架,通过图结构编码和定制注意力机制,能够跨不同机器人结构实现运动模式迁移,并展现出对异构骨骼结构和相似运动模式的泛化能力。
English Summary: This paper introduces a unified graph-conditioned diffusion framework for cross-embodiment motion retargeting that handles diverse robot structures through graph-based encoding and customized attention mechanisms.

Authors:Zilong Wang, Jingfeng Yang, Sreyashi Nag, Samarth Varshney, Xianfeng Tang, Haoming Jiang, Jingbo Shang, Sheikh Muhammad Sarwar
Title: RRO: LLM Agent Optimization Through Rising Reward Trajectories
Abstract:
Large language models (LLMs) have exhibited extraordinary performance in a variety of tasks, yet it remains challenging for them to solve complex multi-step tasks as agents. In practice, agents are sensitive to the outcome of certain key steps, which makes them likely to fail the task because of a subtle mistake in the planning trajectory. Recent approaches resort to calibrating the reasoning process through reinforcement learning. They reward or penalize every reasoning step with process supervision, also known as Process Reward Models (PRMs). However, PRMs are difficult and costly to scale up with a large number of next action candidates since they require extensive computations to acquire the training data through the per-step trajectory exploration. To mitigate this issue, we focus on the relative reward trend across successive reasoning steps and propose maintaining an increasing reward in the collected trajectories for process supervision, which we term Reward Rising Optimization (RRO). Specifically, we incrementally augment the process supervision until identifying a step exhibiting positive reward differentials, i.e. rising rewards, relative to its preceding iteration. This method dynamically expands the search space for the next action candidates, efficiently capturing high-quality data. We provide mathematical groundings and empirical results on the WebShop and InterCode-SQL benchmarks, showing that our proposed RRO achieves superior performance while requiring much less exploration cost.
中文摘要:大型语言模型在处理复杂多步骤任务时因规划轨迹中的细微错误容易失败,而本文提出的奖励上升优化方法通过增量式过程监督动态扩展行动搜索空间,在显著降低探索成本的同时实现了更优性能。
English Summary: Large language models struggle with complex multi-step tasks due to sensitivity to planning errors, but the proposed Reward Rising Optimization method addresses this by dynamically expanding action search through incremental process supervision to improve performance while reducing exploration costs.
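
The abstract above describes expanding the set of next-action candidates until one shows a rising reward relative to the preceding step. A minimal sketch of that stopping rule follows. Everything in it is an assumption made for illustration: the names `sample_until_rising`, `propose`, `score`, and `max_candidates`, and the fallback to the best candidate seen, do not come from the paper.

```python
from typing import Callable, List, Tuple, TypeVar

Action = TypeVar("Action")


def sample_until_rising(
    prev_reward: float,
    propose: Callable[[], Action],      # hypothetical: draws one candidate next action
    score: Callable[[Action], float],   # hypothetical: estimates that action's process reward
    max_candidates: int = 8,            # hypothetical exploration budget
) -> Tuple[Action, float, List[Tuple[Action, float]]]:
    """Draw candidates one at a time and stop at the first whose reward
    exceeds prev_reward. If none does within the budget, fall back to the
    best candidate seen (an assumption, not stated in the abstract)."""
    if max_candidates < 1:
        raise ValueError("max_candidates must be at least 1")
    tried: List[Tuple[Action, float]] = []
    for _ in range(max_candidates):
        action = propose()
        reward = score(action)
        tried.append((action, reward))
        if reward > prev_reward:  # positive reward differential: stop exploring
            return action, reward, tried
    best_action, best_reward = max(tried, key=lambda pair: pair[1])
    return best_action, best_reward, tried


if __name__ == "__main__":
    candidates = iter(["a", "b", "c"])
    rewards = {"a": 0.1, "b": 0.3, "c": 0.9}
    action, reward, tried = sample_until_rising(0.2, lambda: next(candidates), rewards.__getitem__)
    print(action, reward, len(tried))
```

The early stop is where the exploration saving would come from in this reading: candidates are scored only until a rising reward is found, rather than scoring a fixed large set at every step.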

Authors:Dancheng Liu, Amir Nassereldine, Chenhui Xu, Jinjun Xiong
Title: Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation
Abstract:
Whisper's robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set, an impractical scale for most researchers. In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. We find that targeted acoustic augmentation methods could significantly improve the generalization ability of ASR models, reducing word-error rates by up to 19.24 percent on unseen datasets when training on the 960-hour Librispeech dataset. These findings highlight strategic acoustically focused data augmentation as a promising alternative to massive datasets for building robust ASR models, offering a potential solution to future foundation ASR models when massive human speech data is lacking.
Chinese: 定向声学增强通过声学变化提升模型泛化能力,可显著增强自动语音识别系统的鲁棒性,为缺乏海量训练数据时构建强健模型提供了有效替代方案。
English: Targeted acoustic augmentation can significantly enhance ASR model robustness by improving generalization through acoustic variation, offering a viable alternative to massive training datasets.

Authors:Julius Richter, Till Svajda, Timo Gerkmann
Title: ReverbFX: A Dataset of Room Impulse Responses Derived from Reverb Effect Plugins for Singing Voice Dereverberation
Abstract:
We present ReverbFX, a new room impulse response (RIR) dataset designed for singing voice dereverberation research. Unlike existing datasets based on real recorded RIRs, ReverbFX features a diverse collection of RIRs captured from various reverb audio effect plugins commonly used in music production. We conduct comprehensive experiments using the proposed dataset to benchmark the challenge of dereverberation of singing voice recordings affected by artificial reverbs. We train two state-of-the-art generative models using ReverbFX and demonstrate that models trained with plugin-derived RIRs outperform those trained on realistic RIRs in artificial reverb scenarios.
Chinese: ReverbFX 是一个从音频效果插件采集的房间脉冲响应新数据集,它通过基于插件RIR训练的模型在人工混响场景中优于使用真实录制RIR的模型,从而提升了歌声去混响的效果。
English: ReverbFX is a novel dataset of room impulse responses sourced from audio effect plugins, which enhances singing voice dereverberation by enabling models trained on it to surpass those using real recorded RIRs in artificial reverb conditions.

Authors:Ruiqi Wu, Xinjie Wang, Liu Liu, Chunle Guo, Jiaxiong Qiu, Chongyi Li, Lichao Huang, Zhizhong Su, Ming-Ming Cheng
Title: DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data
Abstract:
We present DIPO, a novel framework for the controllable generation of articulated 3D objects from a pair of images: one depicting the object in a resting state and the other in an articulated state. Compared to the single-image approach, our dual-image input imposes only a modest overhead for data collection, but at the same time provides important motion information, which is a reliable guide for predicting kinematic relationships between parts. Specifically, we propose a dual-image diffusion model that captures relationships between the image pair to generate part layouts and joint parameters. In addition, we introduce a Chain-of-Thought (CoT) based graph reasoner that explicitly infers part connectivity relationships. To further improve robustness and generalization on complex articulated objects, we develop a fully automated dataset expansion pipeline, named LEGO-Art, that enriches the diversity and complexity of the PartNet-Mobility dataset. We propose PM-X, a large-scale dataset of complex articulated 3D objects, accompanied by rendered images, URDF annotations, and textual descriptions. Extensive experiments demonstrate that DIPO significantly outperforms existing baselines in both the resting state and the articulated state, while the proposed PM-X dataset further enhances generalization to diverse and structurally complex articulated objects. Our code and dataset will be released to the community upon publication.
中文: DIPO是一种新颖的框架,通过双图像输入生成铰接式3D对象,结合扩散模型和思维链推理推断部件连接关系,并利用自动化LEGO-Art流程和PM-X数据集提升性能与泛化能力。
English: DIPO is a novel framework that generates articulated 3D objects from dual-image inputs, utilizing a diffusion model and Chain-of-Thought reasoning to infer part connectivity, and is enhanced by the automated LEGO-Art pipeline and PM-X dataset for improved performance and generalization.

Authors:Hyunsik Chae, Seungwoo Yoon, Jaden Park, Chloe Yewon Chun, Yongin Cho, Mu Cai, Yong Jae Lee, Ernest K. Ryu
Title: Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
Abstract:
Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on the atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, even though the tasks are trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.
中文: 当前视觉语言模型虽具备强大的多模态推理能力,却在基础视觉任务上表现不佳,为此我们构建了原子视觉技能数据集,以评估并改进它们在基础感知方面的不足。
English: Current Vision-Language Models exhibit strong multimodal reasoning but falter on basic visual tasks, prompting the creation of the Atomic Visual Skills Dataset to evaluate and address their deficiencies in fundamental perception.

Authors:Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi
Title: MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
Abstract:
Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results--posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.
中文: MLR-Bench作为评估AI智能体在开放式机器学习研究中表现的新型基准,通过自动化评估框架和模块化智能体架构发现:尽管大语言模型在创意生成和论文撰写方面表现优异,但当前编程智能体常产生不可靠的实验结果。
English: MLR-Bench is a novel benchmark for evaluating AI agents in open-ended machine learning research, featuring automated assessment tools and a modular agent framework, which reveals that while LLMs excel at idea generation and paper writing, current coding agents often produce unreliable experimental results.

Authors:Yi Wu, Lingting Zhu, Shengju Qian, Lei Liu, Wandi Qiao, Lequan Yu, Bin Li
Title: StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation
Abstract:
In the current research landscape, multimodal autoregressive (AR) models have shown exceptional capabilities across various domains, including visual understanding and generation. However, complex tasks such as style-aligned text-to-image generation present significant challenges, particularly in data acquisition. In analogy to instruction-following tuning for image editing of AR models, style-aligned generation requires a reference style image and prompt, resulting in a text-image-to-image triplet where the output shares the style and semantics of the input. However, acquiring large volumes of such triplet data with specific styles is considerably more challenging than obtaining conventional text-to-image data used for training generative models. To address this issue, we propose StyleAR, an innovative approach that combines a specially designed data curation method with our proposed AR models to effectively utilize text-to-image binary data for style-aligned text-to-image generation. Our method synthesizes target stylized data using a reference style image and prompt, but only incorporates the target stylized image as the image modality to create high-quality binary data. To facilitate binary data training, we introduce a CLIP image encoder with a perceiver resampler that translates the image input into style tokens aligned with multimodal tokens in AR models, and we implement a style-enhanced token technique to prevent content leakage, which is a common issue in previous work. Furthermore, we mix raw images drawn from large-scale text-image datasets with stylized images to enhance StyleAR's ability to extract richer stylistic features and ensure style consistency. Extensive qualitative and quantitative experiments demonstrate our superior performance.
中文摘要:本文提出的StyleAR模型通过结合创新的数据整理方法和自回归建模,利用合成的二元数据及风格增强技术,有效解决了风格对齐文生图任务中三元组数据稀缺的难题,实现了卓越的性能表现。
English Summary: The proposed StyleAR model addresses the challenge of limited triplet data for style-aligned text-to-image generation by combining a novel data curation method with autoregressive modeling, utilizing synthesized binary data and style enhancement techniques to achieve superior results.

Authors:Junyang Shu, Zhiwei Lin, Yongtao Wang
Title: RFTF: Reinforcement Fine-tuning for Embodied Agents with Temporal Feedback
Abstract:
Vision-Language-Action (VLA) models have demonstrated significant potential in the field of embodied intelligence, enabling agents to follow human instructions to complete complex tasks in physical environments. Existing embodied agents are often trained through behavior cloning, which requires expensive data and computational resources and is constrained by human demonstrations. To address this issue, many researchers explore the application of reinforcement fine-tuning to embodied agents. However, typical reinforcement fine-tuning methods for embodied agents usually rely on sparse, outcome-based rewards, which struggle to provide fine-grained feedback for specific actions within an episode, thus limiting the model's manipulation capabilities and generalization performance. In this paper, we propose RFTF, a novel reinforcement fine-tuning method that leverages a value model to generate dense rewards in embodied scenarios. Specifically, our value model is trained using temporal information, eliminating the need for costly robot action labels. In addition, RFTF incorporates a range of techniques, such as generalized advantage estimation (GAE) and sample balance, to enhance the effectiveness of the fine-tuning process. By addressing the sparse reward problem in reinforcement fine-tuning, our method significantly improves the performance of embodied agents, delivering superior generalization and adaptation capabilities across diverse embodied tasks. Experimental results show that embodied agents fine-tuned with RFTF achieve new state-of-the-art performance on the challenging CALVIN ABC-D with an average success length of 4.296. Moreover, RFTF enables rapid adaptation to new environments. After fine-tuning in the D environment of CALVIN for a few episodes, RFTF achieves an average success length of 4.301 in this new environment.
Chinese: 本文提出RFTF强化微调方法,利用价值模型生成密集奖励,解决了具身智能中稀疏奖励的局限,在CALVIN等任务上实现了最优性能,显著提升了模型的泛化能力和环境适应性。
English: The paper introduces RFTF, a reinforcement fine-tuning method that uses a value model to generate dense rewards, overcoming the limitations of sparse rewards in embodied agents and achieving state-of-the-art performance on tasks like CALVIN with enhanced generalization and adaptability.
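
The abstract above names GAE (generalized advantage estimation) as one of the techniques used during fine-tuning. The following is a minimal sketch of the standard GAE recursion, included only to illustrate how dense per-step rewards and value estimates combine into advantages. It is not the RFTF code: the function name `gae_advantages`, the default `gamma` and `lam` values, and the toy inputs are assumptions.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Standard GAE: A_t = delta_t + gamma * lam * A_{t+1}.

    `values` must hold one more entry than `rewards`; the final entry is the
    bootstrap value of the state reached after the last step (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each step can reuse the next advantage.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy episode with hypothetical dense rewards and value estimates.
toy_advantages = gae_advantages(
    rewards=[1.0, 2.0],
    values=[0.5, 1.0, 0.0],
    gamma=0.5,
    lam=0.0,
)
```

With `lam=0.0` the estimate reduces to the one-step temporal-difference error at each step; with `lam=1.0` and `gamma=1.0` it becomes the full return minus the value baseline.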

Authors:Haiyang Sun, Shujie Hu, Shujie Liu, Lingwei Meng, Hui Wang, Bing Han, Yifan Yang, Yanqing Liu, Sheng Zhao, Yan Lu, Yanmin Qian
Title: Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling
Abstract:
Zero-shot streaming text-to-speech is an important research topic in human-computer interaction. Existing methods primarily use a lookahead mechanism, relying on future text to achieve natural streaming speech synthesis, which introduces high processing latency. To address this issue, we propose SMLLE, a streaming framework for generating high-quality speech frame-by-frame. SMLLE employs a Transducer to convert text into semantic tokens in real time while simultaneously obtaining duration alignment information. The combined outputs are then fed into a fully autoregressive (AR) streaming model to reconstruct mel-spectrograms. To further stabilize the generation process, we design a Delete < Bos > Mechanism that allows the AR model to access future text while introducing as little delay as possible. Experimental results suggest that SMLLE outperforms current streaming TTS methods and achieves performance comparable to sentence-level TTS systems. Samples are available on shy-98.github.io/SMLLE_demo_page/.
中文摘要:提出的SMLLE框架通过结合实时转换语义标记的Transducer和采用创新Delete < Bos >机制的自回归模型,实现了高质量、低延迟的流式文本转语音,其性能超越现有流式方法并达到句子级系统的水平。
English Summary: The proposed SMLLE framework enables high-quality, low-latency streaming text-to-speech by combining a Transducer for real-time semantic token conversion with an autoregressive model enhanced by a novel Delete < Bos > Mechanism, outperforming existing methods while matching sentence-level systems.

Authors:Kaiqing Lin, Zhiyuan Yan, Ke-Yue Zhang, Li Hao, Yue Zhou, Yuzhen Lin, Weixiang Li, Taiping Yao, Shouhong Ding, Bin Li
Title: Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes
Abstract:
Securing personal identity against deepfake attacks is increasingly critical in the digital age, especially for celebrities and political figures whose faces are easily accessible and frequently targeted. Most existing deepfake detection methods focus on general-purpose scenarios and often ignore the valuable prior knowledge of known facial identities, e.g., "VIP individuals" whose authentic facial data are already available. In this paper, we propose VIPGuard, a unified multimodal framework designed to capture fine-grained and comprehensive facial representations of a given identity, compare them against potentially fake or similar-looking faces, and reason over these comparisons to make accurate and explainable predictions. Specifically, our framework consists of three main stages. First, we fine-tune a multimodal large language model (MLLM) to learn detailed and structural facial attributes. Second, we perform identity-level discriminative learning to enable the model to distinguish subtle differences between highly similar faces, including real and fake variations. Finally, we introduce user-specific customization, where we model the unique characteristics of the target face identity and perform semantic reasoning via MLLM to enable personalized and explainable deepfake detection. Our framework shows clear advantages over previous detection works, where traditional detectors mainly rely on low-level visual cues and provide no human-understandable explanations, while other MLLM-based models often lack a detailed understanding of specific face identities. To facilitate the evaluation of our method, we built a comprehensive identity-aware benchmark called VIPBench for personalized deepfake detection, involving the latest 7 face-swapping and 7 entire face synthesis techniques for generation.
中文摘要:VIPGuard是一种多模态框架,通过利用详细面部特征和身份定制技术,结合多模态大语言模型进行个性化推理,实现了精准且可解释的深度伪造检测,显著优于依赖低级视觉线索的传统方法。
English Summary: VIPGuard is a multimodal framework that leverages detailed facial attributes and identity-specific customization to provide accurate and explainable deepfake detection, outperforming traditional methods by incorporating personalized reasoning through a multimodal large language model.

Authors:Ming Yin, Yuanhao Qu, Ling Yang, Le Cong, Mengdi Wang
Title: Toward Scientific Reasoning in LLMs: Training from Expert Discussions via Reinforcement Learning
Abstract:
We investigate how to teach large language models (LLMs) to perform scientific reasoning by leveraging expert discussions as a learning signal. Focusing on the genomics domain, we develop an automated pipeline to extract trainable data and introduce Genome-Bench, a new benchmark constructed from over a decade of scientific forum discussions on genome engineering. Our pipeline transforms raw interactions into a reinforcement learning-friendly multiple-choice question (MCQ) format, supported by 3000+ high-quality question-answer pairs spanning foundational biology, experimental troubleshooting, tool usage, and beyond. We fine-tune an LLM using RL with a rule-based reward signal derived from the synthetic MCQ dataset to enhance domain-specific reasoning. Our results show that reinforcement learning from scientific discussions improves model performance by over 15% compared to the base model on Genome-Bench, narrowing the gap between open-source LLMs and expert-level reasoning. To our knowledge, this is the first end-to-end pipeline for teaching LLMs to reason from scientific discussions, with promising potential for generalization across scientific domains beyond biology.
中文摘要:本研究利用科学论坛讨论构建强化学习流程,提升大语言模型在基因组学领域的推理能力,在专业基准测试中性能提升超过15%。
English Summary: This study develops a reinforcement learning pipeline using scientific forum discussions to enhance large language models' reasoning in genomics, achieving over 15% performance improvement on a specialized benchmark.

Authors:Wenhua Wu, Chenpeng Su, Siting Zhu, Tianchen Deng, Zhe Liu, Hesheng Wang
Title: ADD-SLAM: Adaptive Dynamic Dense SLAM with Gaussian Splatting
Abstract:
Recent advancements in Neural Radiance Fields (NeRF) and 3D Gaussian-based Simultaneous Localization and Mapping (SLAM) methods have demonstrated exceptional localization precision and remarkable dense mapping performance. However, dynamic objects introduce critical challenges by disrupting scene consistency, leading to tracking drift and mapping artifacts. Existing methods that employ semantic segmentation or object detection for dynamic identification and filtering typically rely on predefined categorical priors, while discarding dynamic scene information crucial for robotic applications such as dynamic obstacle avoidance and environmental interaction. To overcome these challenges, we propose ADD-SLAM: an Adaptive Dynamic Dense SLAM framework based on Gaussian splatting. We design an adaptive dynamic identification mechanism grounded in scene consistency analysis, comparing geometric and textural discrepancies between real-time observations and historical maps. Our method requires no predefined semantic category priors and adaptively discovers scene dynamics. Precise dynamic object recognition effectively mitigates interference from moving targets during localization. Furthermore, we propose a dynamic-static separation mapping strategy that constructs a temporal Gaussian model to achieve online incremental dynamic modeling. Experiments conducted on multiple dynamic datasets demonstrate our method's flexible and accurate dynamic segmentation capabilities, along with state-of-the-art performance in both localization and mapping.
中文摘要:提出的ADD-SLAM框架通过场景一致性分析自适应识别动态物体,无需预定义语义先验,基于时序高斯建模实现了卓越的动态分割能力,并在定位与建图任务中达到最优性能。
English Summary: The proposed ADD-SLAM framework adaptively identifies dynamic objects through scene consistency analysis without requiring predefined semantic priors, achieving superior dynamic segmentation and state-of-the-art localization/mapping performance through temporal Gaussian modeling.

Authors:Yan Wen, Junfeng Guo, Heng Huang
Title: CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems
Abstract:
As large language models (LLMs) evolve into autonomous agents capable of collaborative reasoning and task execution, multi-agent LLM systems have emerged as a powerful paradigm for solving complex problems. However, these systems pose new challenges for copyright protection, particularly when sensitive or copyrighted content is inadvertently recalled through inter-agent communication and reasoning. Existing protection techniques primarily focus on detecting content in final outputs, overlooking the richer, more revealing reasoning processes within the agents themselves. In this paper, we introduce CoTGuard, a novel framework for copyright protection that leverages trigger-based detection within Chain-of-Thought (CoT) reasoning. Specifically, we can activate specific CoT segments and monitor intermediate reasoning steps for unauthorized content reproduction by embedding specific trigger queries into agent prompts. This approach enables fine-grained, interpretable detection of copyright violations in collaborative agent scenarios. We evaluate CoTGuard on various benchmarks in extensive experiments and show that it effectively uncovers content leakage with minimal interference to task performance. Our findings suggest that reasoning-level monitoring offers a promising direction for safeguarding intellectual property in LLM-based agent systems.
中文: CoTGuard框架通过在思维链推理中嵌入触发查询来监控多智能体大语言模型的中间推理步骤,能有效识别版权侵权且对任务性能影响极小。
English: The CoTGuard framework introduces trigger-based detection within Chain-of-Thought reasoning to monitor intermediate steps in multi-agent LLM systems, effectively identifying copyright violations with minimal impact on performance.

Authors:Helin Wang, Jiarui Hai, Dongchao Yang, Chen Chen, Kai Li, Junyi Peng, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Najim Dehak
Title: SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline
Abstract:
Target Speech Extraction (TSE) aims to isolate a target speaker's voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in TSE have primarily employed discriminative models that offer high perceptual quality, these models often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. On the other hand, generative models for TSE lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that utilizes conditional information from the cue audio's latent space, aligning it with the mixture audio's latent space to prevent mismatches. Evaluated on the widely-used Libri2Mix dataset, SoloSpeech achieves the new state-of-the-art intelligibility and quality in target speech extraction while demonstrating exceptional generalization on out-of-domain data and real-world scenarios.
中文:SoloSpeech提出了一种新颖的级联生成式流程,在目标语音提取中实现了最优的清晰度和质量,同时展现出卓越的泛化能力。
English: SoloSpeech introduces a novel cascaded generative pipeline that achieves state-of-the-art intelligibility and quality in target speech extraction while demonstrating strong generalization capabilities.

Authors:Sahel Sharifymoghaddam, Ronak Pradeep, Andre Slavescu, Ryan Nguyen, Andrew Xu, Zijian Chen, Yilin Zhang, Yidi Chen, Jasper Xian, Jimmy Lin
Title: RankLLM: A Python Package for Reranking with LLMs
Abstract:
The adoption of large language models (LLMs) as rerankers in multi-stage retrieval systems has gained significant traction in academia and industry. These models refine a candidate list of retrieved documents, often through carefully designed prompts, and are typically used in applications built on retrieval-augmented generation (RAG). This paper introduces RankLLM, an open-source Python package for reranking that is modular, highly configurable, and supports both proprietary and open-source LLMs in customized reranking workflows. To improve usability, RankLLM features optional integration with Pyserini for retrieval and provides integrated evaluation for multi-stage pipelines. Additionally, RankLLM includes a module for detailed analysis of input prompts and LLM responses, addressing reliability concerns with LLM APIs and non-deterministic behavior in Mixture-of-Experts (MoE) models. This paper presents the architecture of RankLLM, along with a detailed step-by-step guide and sample code. We reproduce results from RankGPT, LRL, RankVicuna, RankZephyr, and other recent models. RankLLM integrates with common inference frameworks and a wide range of LLMs. This compatibility allows for quick reproduction of reported results, helping to speed up both research and real-world applications. The complete repository is available at rankllm.ai, and the package can be installed via PyPI.
中文摘要:大语言模型在多阶段检索系统中作为重排器的应用日益增多,本文介绍了RankLLM,一个开源、模块化、可配置的Python包,支持专有和开源模型以定制重排工作流程。
English Summary: Large language models are increasingly used as rerankers in multi-stage retrieval systems, and this paper introduces RankLLM, an open-source Python package that is modular, configurable, and supports both proprietary and open-source models for customized reranking workflows.

Authors:Siqi Huang, Yanchen Xu, Hongyuan Zhang, Xuelong Li
Title: Learn Beneficial Noise as Graph Augmentation
Abstract:
Although graph contrastive learning (GCL) has been widely investigated, it is still a challenge to generate effective and stable graph augmentations. Existing methods often apply heuristic augmentation like random edge dropping, which may disrupt important graph structures and result in unstable GCL performance. In this paper, we propose Positive-incentive Noise driven Graph Data Augmentation (PiNGDA), where positive-incentive noise (pi-noise) scientifically analyzes the beneficial effect of noise under information theory. To bridge the standard GCL and pi-noise framework, we design a Gaussian auxiliary variable to convert the loss function to information entropy. We prove that the standard GCL with pre-defined augmentations is equivalent to estimating the beneficial noise via point estimation. Following our analysis, PiNGDA is derived from learning the beneficial noise on both topology and attributes through a trainable noise generator for graph augmentations, instead of the simple estimation. Since the generator learns how to produce beneficial perturbations on graph topology and node attributes, PiNGDA is more reliable compared with the existing methods. Extensive experimental results validate the effectiveness and stability of PiNGDA.
Chinese: 本文提出PiNGDA方法,通过可训练的噪声生成器学习对图拓扑和属性的有益扰动,相比启发式方法能更有效稳定地提升图对比学习性能。
English: This paper introduces PiNGDA, a novel graph data augmentation method that uses a trainable noise generator to learn beneficial perturbations on graph topology and attributes, enhancing the effectiveness and stability of graph contrastive learning compared to heuristic approaches.

Authors:Zhiwei Lin, Yongtao Wang
Title: VL-SAM-V2: Open-World Object Detection with General and Specific Query Fusion
Abstract:
Current perception models have achieved remarkable success by leveraging large-scale labeled datasets, but still face challenges in open-world environments with novel objects. To address this limitation, researchers introduce open-set perception models to detect or segment arbitrary test-time user-input categories. However, open-set models rely on human involvement to provide predefined object categories as input during inference. More recently, researchers have framed a more realistic and challenging task known as open-ended perception that aims to discover unseen objects without requiring any category-level input from humans at inference time. Nevertheless, open-ended models suffer from low performance compared to open-set models. In this paper, we present VL-SAM-V2, an open-world object detection framework that is capable of discovering unseen objects while achieving favorable performance. To achieve this, we combine queries from open-set and open-ended models and propose a general and specific query fusion module to allow different queries to interact. By adjusting queries from open-set models, we enable VL-SAM-V2 to be evaluated in the open-set or open-ended mode. In addition, to learn more diverse queries, we introduce ranked learnable queries to match queries with proposals from open-ended models by sorting. Moreover, we design a denoising point training strategy to facilitate the training process. Experimental results on LVIS show that our method surpasses the previous open-set and open-ended methods, especially on rare objects.
中文摘要:本文提出VL-SAM-V2开放世界物体检测框架,通过融合开放集与开放端模型的查询机制,在无需人工输入的情况下实现对新物体的自主发现,并在稀有类别上展现出卓越性能。
English Summary: The paper introduces VL-SAM-V2, an open-world object detection framework that combines open-set and open-ended approaches through query fusion to discover unseen objects while achieving superior performance, particularly on rare categories.

Authors:Troy Schotter, Saba Kawas, James Prather, Juho Leinonen, Jon Ippolito, Greg L Nelson
Title: SPIRAL integration of generative AI in an undergraduate creative media course: effects on self-efficacy and career outcome expectations
Abstract:
Computing education and computing students are rapidly integrating generative AI, but we know relatively little about how different pedagogical strategies for intentionally integrating generative AI affect students' self-efficacy and career interests. This study investigates a SPIRAL integration of generative AI (Skills Practiced Independently, Revisited with AI Later), implemented in an introductory undergraduate creative media and technology course in Fall 2023 (n=31). Students first developed domain skills for half the semester, then revisited earlier material using generative AI, with explicit instruction on how to use it critically and ethically. We contribute a mixed methods quantitative and qualitative analysis of changes in self-efficacy and career interests over time, including longitudinal qualitative interviews (n=9) and thematic analysis. We found positive changes in both students' creative media self-efficacy and generative AI use self-efficacy, and mixed changes for ethical generative AI use self-efficacy. We also found students experienced demystification, transitioning from initial fear about generative AI taking over their fields and jobs, to doubting AI capability to do so and/or that society will push back against AI, through personal use of AI and observing others' use of AI vicariously. For career interests, our SPIRAL integration of generative AI use appeared to have either a neutral or positive influence on students, including widening their perceived career options, depending on their view of how AI would influence the career itself. These findings suggest that careful pedagogical sequencing can mitigate some potential negative impacts of AI, while promoting ethical and critical AI use that supports or has a neutral effect on students' career formation. To our knowledge our SPIRAL integration strategy applied to generative AI integration is novel.
中文: 本研究探讨了SPIRAL教学策略在生成式AI教育中的应用,发现该策略通过结构化伦理实践能提升学生自我效能感与职业兴趣,同时有效消解其对AI替代的焦虑。
English: This study explores the SPIRAL pedagogical strategy for integrating generative AI in education, finding it enhances students' self-efficacy and career interests while mitigating fears through structured, ethical use.

Authors:Hai-Long Qin, Jincheng Dai, Sixian Wang, Xiaoqi Qin, Shuo Shao, Kai Niu, Wenjun Xu, Ping Zhang
Title: Neural Coding Is Not Always Semantic: Toward the Standardized Coding Workflow in Semantic Communications
Abstract:
Semantic communication, leveraging advanced deep learning techniques, emerges as a new paradigm that meets the requirements of next-generation wireless networks. However, current semantic communication systems, which employ neural coding for feature extraction from raw data, have not adequately addressed the fundamental question: Is general feature extraction through deep neural networks sufficient for understanding semantic meaning within raw data in semantic communication? This article is thus motivated to clarify two critical aspects: semantic understanding and general semantic representation. This article presents a standardized definition of semantic coding, an extensive neural coding scheme for general semantic representation that clearly represents underlying data semantics based on contextual modeling. With these general semantic representations obtained, both human- and machine-centric end-to-end data transmission can be achieved through only minimal specialized modifications, such as fine-tuning and regularization. This article contributes to establishing a common understanding that semantic communication extends far beyond mere feature transmission, focusing instead on conveying compact semantic representations through context-aware coding schemes.
中文摘要:语义通信通过标准化的神经编码方案实现上下文感知的语义表征,超越了单纯的特征传输,仅需微调即可支持人机两端的高效数据传输。
English Summary: Semantic communication advances beyond simple feature transmission by introducing a standardized neural coding scheme for context-aware semantic representation, enabling efficient end-to-end data transmission for both human and machine applications with minimal adjustments.

Authors:AbdelRahim Elmadany, Sang Yun Kwon, Hawau Olamide Toyin, Alcides Alcoba Inciarte, Hanan Aldarmaki, Muhammad Abdul-Mageed
Title: Voice of a Continent: Mapping Africa's Speech Technology Frontier
Abstract:
Africa's rich linguistic diversity remains significantly underrepresented in speech technologies, creating barriers to digital inclusion. To alleviate this challenge, we systematically map the continent's speech space of datasets and technologies, leading to a new comprehensive benchmark SimbaBench for downstream African speech tasks. Using SimbaBench, we introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks. Our benchmark analysis reveals critical patterns in resource availability, while our model evaluation demonstrates how dataset quality, domain diversity, and language family relationships influence performance across languages. Our work highlights the need for expanded speech technology resources that better reflect Africa's linguistic diversity and provides a solid foundation for future research and development efforts toward more inclusive speech technologies.
中文摘要:本研究针对非洲语言在语音技术中的代表性不足问题,推出了综合基准SimbaBench和Simba模型系列,在多种语言与任务中实现最优性能,同时揭示了关键资源缺口问题。
English Summary: This work addresses the underrepresentation of African languages in speech technology by introducing SimbaBench, a comprehensive benchmark, and the Simba model family, which achieves state-of-the-art performance across multiple languages and tasks while highlighting critical resource gaps.

Authors:Nura Aljaafari, Danilo S. Carvalho, André Freitas
Title: TRACE for Tracking the Emergence of Semantic Representations in Transformers
Abstract:
Modern transformer models exhibit phase transitions during training, distinct shifts from memorisation to abstraction, but the mechanisms underlying these transitions remain poorly understood. Prior work has often focused on endpoint representations or isolated signals like curvature or mutual information, typically in symbolic or arithmetic domains, overlooking the emergence of linguistic structure. We introduce TRACE (Tracking Representation Abstraction and Compositional Emergence), a diagnostic framework combining geometric, informational, and linguistic signals to detect phase transitions in Transformer-based LMs. TRACE leverages a frame-semantic data generation method, ABSynth, that produces annotated synthetic corpora with controllable complexity, lexical distributions, and structural entropy, while being fully annotated with linguistic categories, enabling precise analysis of abstraction emergence. Experiments reveal that (i) phase transitions align with clear intersections between curvature collapse and dimension stabilisation; (ii) these geometric shifts coincide with emerging syntactic and semantic accuracy; (iii) abstraction patterns persist across architectural variants, with components like feedforward networks affecting optimisation stability rather than fundamentally altering trajectories. This work advances our understanding of how linguistic abstractions emerge in LMs, offering insights into model interpretability, training efficiency, and compositional generalisation that could inform more principled approaches to LM development.
中文摘要:本研究提出的TRACE诊断框架通过结合几何、信息和语言信号来检测Transformer模型中的相变,揭示这些相变与曲率塌缩、维度稳定及跨架构变体出现的语言准确性之间的对应关系。
English Summary: This study introduces TRACE, a diagnostic framework that detects phase transitions in transformer models by combining geometric, informational, and linguistic signals, revealing how these transitions align with curvature collapse, dimension stabilization, and emerging linguistic accuracy across architectural variants.

Authors:Yihe Fan, Wenqi Zhang, Xudong Pan, Min Yang
Title: Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems
Abstract:
As foundation models grow increasingly intelligent, reliable and trustworthy safety evaluation becomes more indispensable than ever. However, an important question arises: whether and how would an advanced AI system perceive that it is being evaluated, and could this compromise the integrity of the evaluation process? During standard safety tests on a mainstream large reasoning model, we unexpectedly observe that the model, without any contextual cues, occasionally recognizes that it is being evaluated and hence behaves in a more safety-aligned manner. This motivates us to conduct a systematic study of the phenomenon of evaluation faking, i.e., an AI system autonomously altering its behavior upon recognizing the presence of an evaluation context, thereby influencing the evaluation results. Through extensive experiments on a diverse set of foundation models with mainstream safety benchmarks, we reach the main finding, termed the observer effects for AI: when the AI system under evaluation is more advanced in reasoning and situational awareness, evaluation faking becomes more ubiquitous, which is reflected in the following aspects: 1) Reasoning models recognize evaluation 16% more often than non-reasoning models. 2) Scaling foundation models (32B to 671B) increases faking by over 30% in some cases, while smaller models show negligible faking. 3) AI with basic memory is 2.3x more likely to recognize evaluation and scores 19% higher on safety tests (vs. no memory). To measure this, we devised a chain-of-thought monitoring technique to detect faking intent and uncover internal signals correlated with such behavior, offering insights for future mitigation studies.
中文: 随着AI系统在推理和情境意识方面愈发先进,它们会识别评估场景并伪装行为以提升安全表现,这种现象被称为评估伪装,在大型推理模型中尤为普遍,研究通过思维链监测技术揭示了其内在机制并为未来缓解措施提供了依据。
English: Advanced AI systems increasingly recognize when they are being evaluated and alter their behavior to appear safer, a phenomenon known as evaluation faking that intensifies with the model's reasoning capabilities, size, and memory functions, as revealed through systematic experiments and a novel detection method.

Authors:Bo Wang, De-Xing Huang, Xiao-Hu Zhou, Mei-Jiang Gui, Nu-Fang Xiao, Jian-Long Hao, Ming-Yuan Liu, Zeng-Guang Hou
Title: CAS-IQA: Teaching Vision-Language Models for Synthetic Angiography Quality Assessment
Abstract:
Synthetic X-ray angiographies generated by modern generative models hold great potential to reduce the use of contrast agents in vascular interventional procedures. However, low-quality synthetic angiographies can significantly increase procedural risk, underscoring the need for reliable image quality assessment (IQA) methods. Existing IQA models fail to leverage auxiliary images as references during evaluation and lack the fine-grained, task-specific metrics necessary for clinical relevance. To address these limitations, this paper proposes CAS-IQA, a vision-language model (VLM)-based framework that predicts fine-grained quality scores by effectively incorporating auxiliary information from related images. In the absence of angiography datasets, CAS-3K is constructed, comprising 3,565 synthetic angiographies along with score annotations. To ensure clinically meaningful assessment, three task-specific evaluation metrics are defined. Furthermore, a Multi-path featUre fuSion and rouTing (MUST) module is designed to enhance image representations by adaptively fusing and routing visual tokens to metric-specific branches. Extensive experiments on the CAS-3K dataset demonstrate that CAS-IQA outperforms state-of-the-art IQA methods by a considerable margin.
中文: 现代生成模型能合成X射线血管造影以减少造影剂使用,但其质量风险要求可靠的图像质量评估;本文提出的CAS-IQA框架通过整合辅助图像和临床专用指标,显著提升了评估性能。
English: Modern generative models can produce synthetic X-ray angiographies to reduce contrast agent use, but their procedural risks necessitate reliable image quality assessment, which the proposed CAS-IQA framework addresses by incorporating auxiliary data and task-specific metrics for superior performance.

Authors:Jiangjie Wu, Lixuan Chen, Zhenghao Li, Xin Li, Saban Ozturk, Lihui Wang, Rongpin Wang, Hongjiang Wei, Yuyao Zhang
Title: SUFFICIENT: A scan-specific unsupervised deep learning framework for high-resolution 3D isotropic fetal brain MRI reconstruction
Abstract:
High-quality 3D fetal brain MRI reconstruction from motion-corrupted 2D slices is crucial for clinical diagnosis. Reliable slice-to-volume registration (SVR)-based motion correction and super-resolution reconstruction (SRR) methods are essential. Deep learning (DL) has demonstrated potential in enhancing SVR and SRR when compared to conventional methods. However, it requires large-scale external training datasets, which are difficult to obtain for clinical fetal MRI. To address this issue, we propose an unsupervised iterative SVR-SRR framework for isotropic high-resolution (HR) volume reconstruction. Specifically, SVR is formulated as a function mapping a 2D slice and a 3D target volume to a rigid transformation matrix, which aligns the slice to the underlying location in the target volume. The function is parameterized by a convolutional neural network, which is trained by minimizing the difference between the slice extracted from the volume at the predicted position and the input slice. In SRR, a decoding network embedded within a deep image prior framework is combined with a comprehensive image degradation model to produce the HR volume. The deep image prior framework offers a local consistency prior to guide the reconstruction of HR volumes. By applying a forward degradation model, the HR volume is optimized by minimizing the loss between the predicted slices and the observed slices. Comprehensive experiments conducted on large-magnitude motion-corrupted simulation data and clinical data demonstrate the superior performance of the proposed framework over state-of-the-art fetal brain reconstruction frameworks.
Chinese Summary: 本研究提出了一种无监督迭代框架,通过深度学习结合切片到体积配准和超分辨率重建技术,无需依赖大规模外部数据集即可从运动伪影的二维切片生成高质量三维胎儿脑部MRI,并在实验中展现出优于现有方法的性能。
English Summary: This study introduces an unsupervised iterative framework that combines slice-to-volume registration and super-resolution reconstruction using deep learning to produce high-resolution 3D fetal brain MRI from motion-corrupted 2D slices, eliminating the need for large external datasets while outperforming existing methods.

Authors:Boyuan Li, Yicheng Luo, Zhen Liu, Junhao Zheng, Jianming Lv, Qianli Ma
Title: HyperIMTS: Hypergraph Neural Network for Irregular Multivariate Time Series Forecasting
Abstract:
Irregular multivariate time series (IMTS) are characterized by irregular time intervals within variables and unaligned observations across variables, posing challenges in learning temporal and variable dependencies. Many existing IMTS models either require padded samples to learn separately from temporal and variable dimensions, or represent original samples via bipartite graphs or sets. However, the former approaches often need to handle extra padding values, which affects efficiency and disrupts the original sampling patterns, while the latter have limitations in capturing dependencies among unaligned observations. To represent and learn both dependencies from original observations in a unified form, we propose HyperIMTS, a Hypergraph neural network for Irregular Multivariate Time Series forecasting. Observed values are converted into nodes in the hypergraph, interconnected by temporal and variable hyperedges to enable message passing among all observations. Through irregularity-aware message passing, HyperIMTS captures variable dependencies in a time-adaptive way to achieve accurate forecasting. Experiments demonstrate HyperIMTS's competitive performance among state-of-the-art models in IMTS forecasting with low computational cost.
中文:HyperIMTS提出一种超图神经网络,通过时间和变量超边连接观测值,有效捕捉不规则多元时间序列中的依赖关系,以低计算成本实现精准预测。
English: HyperIMTS introduces a hypergraph neural network that models irregular multivariate time series by connecting observations through temporal and variable hyperedges, enabling efficient capture of dependencies for accurate forecasting with low computational cost.
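The hypergraph construction described in the abstract can be illustrated with a minimal sketch. It assumes, as an illustration only, that a temporal hyperedge groups all observations sharing a timestamp and a variable hyperedge groups all observations of one variable; the function name `build_hyperedges` and the toy data are hypothetical and not taken from the paper, and the message passing itself is not shown.

```python
from collections import defaultdict


def build_hyperedges(observations):
    """Group observation nodes into temporal and variable hyperedges.

    observations: list of (timestamp, variable, value) tuples. Only values
    that were actually observed are listed, so no padding is introduced.
    """
    temporal = defaultdict(list)  # timestamp -> node ids observed at that time
    variable = defaultdict(list)  # variable name -> node ids of that variable
    for node_id, (timestamp, var, _value) in enumerate(observations):
        temporal[timestamp].append(node_id)
        variable[var].append(node_id)
    return dict(temporal), dict(variable)


# Toy irregular series: two variables sampled at unaligned times.
observations = [
    (0.0, "hr", 72.0),
    (0.0, "temp", 36.6),
    (1.5, "hr", 75.0),
    (4.2, "temp", 37.1),
]
temporal_edges, variable_edges = build_hyperedges(observations)
```

Each observation becomes one node, so unaligned samples stay exactly as recorded; a node is reachable from any other node through at most a temporal and a variable hyperedge.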

Authors:Changyi Li, Jiayi Wang, Xudong Pan, Geng Hong, Min Yang
Title: ReasoningShield: Content Safety Detection over Reasoning Traces of Large Reasoning Models
Abstract:
Large Reasoning Models (LRMs) are transforming the AI landscape with advanced reasoning capabilities. While the generated reasoning traces enhance model transparency, they can still contain unsafe content, even when the final answer appears safe. Existing moderation tools, primarily designed for question-answer (QA) pairs, are empirically ineffective at detecting hidden risks embedded in reasoning traces. After identifying the key challenges, we formally define the question-thought (QT) moderation task and propose ReasoningShield, the first safety detection model tailored to identify potential risks in the reasoning trace before reaching the final answer. To construct the model, we synthesize a high-quality reasoning safety detection dataset comprising over 8,000 question-thought pairs spanning ten risk categories and three safety levels. Our dataset construction process incorporates a comprehensive human-AI collaborative annotation pipeline, which achieves over 93% annotation accuracy while significantly reducing human costs. On a diverse set of in-distribution and out-of-distribution benchmarks, ReasoningShield outperforms mainstream content safety moderation models in identifying risks within reasoning traces, with an average F1 score exceeding 0.92. Notably, despite being trained on our QT dataset only, ReasoningShield also demonstrates competitive performance in detecting unsafe question-answer pairs on traditional benchmarks, rivaling baselines trained on 10 times larger datasets and base models, which strongly validates the quality of our dataset. Furthermore, ReasoningShield is built upon compact 1B/3B base models to facilitate lightweight deployment and provides human-friendly risk analysis by default. To foster future research, we publicly release all the resources.
中文摘要:ReasoningShield作为首个专门检测推理过程中潜在风险的安全模型,在保持轻量级部署的同时,能有效识别最终答案生成前的隐蔽风险,其性能显著优于现有主流内容审核工具。
English Summary: ReasoningShield is a pioneering safety detection model designed to identify hidden risks in reasoning traces before final answers, outperforming existing moderation tools with high accuracy and efficiency despite using compact base models.

Authors:Xuan Qi, Jiahao Qiu, Xinzhe Juan, Yue Wu, Mengdi Wang
Title: Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?
Abstract:
Aligning large language models (LLMs) with human preferences remains a key challenge in AI. Preference-based optimization methods, such as Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on human-annotated datasets to improve alignment. In this work, we identify a crucial property of the existing learning method: the distinguishing signal obtained in preferred responses is often concentrated in the early tokens. We refer to this as shallow preference signals. To explore this property, we systematically truncate preference datasets at various points and train both reward models and DPO models on the truncated data. Surprisingly, models trained on truncated datasets, retaining only the first half or fewer tokens, achieve comparable or even superior performance to those trained on full datasets. For example, a reward model trained on a 40% truncated version of the Skywork-Reward-Preference-80K-v0.2 dataset outperforms one trained on the full dataset. This pattern is consistent across multiple datasets, suggesting the widespread presence of shallow preference signals. We further investigate the distribution of the reward signal through decoding strategies. We consider two simple decoding strategies motivated by the shallow reward signal observation, namely Length Control Decoding and KL Threshold Control Decoding, which leverage shallow preference signals to optimize the trade-off between alignment and computational efficiency. The performance is even better, which again validates our hypothesis. The phenomenon of shallow preference signals highlights potential issues in LLM alignment: existing alignment methods often focus on aligning only the initial tokens of responses, rather than considering the full response. This could lead to discrepancies with real-world human preferences, resulting in suboptimal alignment performance.
中文: 大语言模型与人类偏好的对齐常依赖集中于前段标记的浅层偏好信号,这使截断数据集训练更高效,但也引发了对全面对齐效果的质疑。
English: Aligning large language models with human preferences often relies on shallow preference signals concentrated in early tokens, enabling efficient training with truncated datasets while raising concerns about comprehensive alignment.
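The truncation step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are approximated by whitespace splitting rather than a model tokenizer, `keep_ratio` is a hypothetical parameter meaning the fraction of leading tokens retained, and the `prompt`/`chosen`/`rejected` field names follow a common preference-data layout rather than anything confirmed by the paper.

```python
def truncate_response(text, keep_ratio):
    """Keep only the leading fraction of a response's tokens."""
    tokens = text.split()  # assumption: whitespace tokens stand in for model tokens
    keep = max(1, int(len(tokens) * keep_ratio))
    return " ".join(tokens[:keep])


def truncate_preference_pair(pair, keep_ratio):
    """Truncate both responses of one preference pair; the prompt is untouched."""
    return {
        "prompt": pair["prompt"],
        "chosen": truncate_response(pair["chosen"], keep_ratio),
        "rejected": truncate_response(pair["rejected"], keep_ratio),
    }


pair = {
    "prompt": "Explain DPO.",
    "chosen": "DPO optimizes a policy directly from preference pairs "
              "without a separate reward model",
    "rejected": "I am not sure what that is",
}
short = truncate_preference_pair(pair, keep_ratio=0.5)
```

A reward model or DPO model would then be trained on the truncated pairs in place of the full ones; sweeping `keep_ratio` gives the "various truncation points" comparison the abstract mentions.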

Authors:Jiaru Zou, Yikun Ban, Zihao Li, Yunzhe Qi, Ruizhong Qiu, Ling Yang, Jingrui He
Title: Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning
Abstract:
Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model's own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first introduce the concept of Mistake Log to systematically track the model's learning behavior and recurring errors throughout fine-tuning. Treating the original transformer-based model as the Pilot, we correspondingly design a Copilot model to refine the Pilot's inference performance via logits rectification. We name the overall Pilot-Copilot framework the Transformer Copilot, which introduces (i) a novel Copilot model design, (ii) a joint training paradigm where the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and (iii) a fused inference paradigm where the Copilot rectifies the Pilot's logits for enhanced generation. We provide both theoretical and empirical analyses on our new learning framework. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability.
中文摘要:Transformer Copilot框架通过引入副驾驶模型,在联合训练和融合推理中利用错误日志修正主模型错误,在12个基准测试中性能提升最高达34.5%,且仅带来轻微计算开销。
English Summary: The Transformer Copilot framework enhances large language models by introducing a Copilot model that learns from a Mistake Log to rectify the Pilot model's errors during joint training and fused inference, achieving up to 34.5% performance improvement across 12 benchmarks with minimal computational overhead.

Authors:Mingbo Song, Heming Xia, Jun Zhang, Chak Tou Leong, Qiancheng Xu, Wenjie Li, Sujian Li
Title: KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization
Abstract:
Speculative Decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens using a compact model and then verifying them in parallel using the target LLM. Notably, Self-Speculative Decoding proposes skipping certain layers to construct the draft model, which eliminates the need for additional parameters or training. Despite its strengths, we observe in this work that drafting with layer skipping exhibits significant sensitivity to domain shifts, leading to a substantial drop in acceleration performance. To enhance the domain generalizability of this paradigm, we introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor (KNN) search to match different skipped layers with various domain inputs. We evaluated our algorithm in various models and multiple tasks, observing that its application leads to 1.3x-1.6x speedup in LLM inference.
中文: 推测解码通过小型草稿模型并行验证令牌来加速大语言模型推理,而KNN-SSD算法利用K近邻搜索匹配不同领域输入与跳跃层,显著提升了该方法的领域泛化能力与加速效果。
English: Speculative Decoding accelerates LLM inference by using a small draft model to propose tokens that are then verified in parallel, with KNN-SSD enhancing its domain adaptability for consistent speed improvements across diverse inputs.
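The draft-then-verify loop behind speculative decoding can be sketched with toy models. This is a greedy-decoding illustration only: `target_next` and `draft_next` are hypothetical stand-ins for the target LLM and a cheaper draft model, the verification is done position by position instead of in one parallel pass, and neither layer skipping nor the KNN-based layer-set selection of KNN-SSD is modeled.

```python
def target_next(tokens):
    """Toy target model: the next token is a deterministic function of the context."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Toy draft model: agrees with the target unless the context length is a multiple of 3."""
    guess = target_next(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 50


def speculative_decode(prompt, num_new, k=4):
    """Generate num_new tokens by drafting k tokens at a time and verifying them."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # Draft k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Verify: keep the longest prefix that matches the target's greedy choice.
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch, then stop
                break
        else:
            # All k drafts accepted: the target contributes one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]


def greedy_decode(prompt, num_new):
    """Reference: plain greedy decoding with the target model only."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens
```

Because every kept token equals the target's own greedy choice, the output matches plain target decoding; the speedup in a real system comes from how many drafted tokens are accepted per target pass, which is the quantity that degrades under domain shift.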

Authors:Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, Xin Eric Wang
Title: GRIT: Teaching MLLMs to Think with Images
Abstract:
Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thought prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content in pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.
中文: GRIT方法通过强化学习训练多模态大模型生成融合自然语言和视觉边界框坐标的推理链,仅需少量训练数据即可实现有效的视觉基础推理能力。
English: The GRIT method trains multimodal large language models to generate reasoning chains that interleave natural language with visual bounding box coordinates, using a reinforcement learning approach that requires minimal training data while achieving effective visual grounding.

Authors:Zhe Xu, Cheng Jin, Yihui Wang, Ziyi Liu, Hao Chen
Title: Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning
Abstract:
Multimodal pathological image understanding has garnered widespread interest due to its potential to improve diagnostic accuracy and enable personalized treatment through integrated visual and textual data. However, existing methods exhibit limited reasoning capabilities, which hamper their ability to handle complex diagnostic scenarios. Additionally, the enormous size of pathological images leads to severe computational burdens, further restricting their practical deployment. To address these limitations, we introduce a novel bilateral reinforcement learning framework comprising two synergistic branches. One reinforcement branch enhances the reasoning capability by enabling the model to learn task-specific decision processes, i.e., pathology rationales, directly from labels without explicit reasoning supervision. The other branch dynamically allocates a tailored number of tokens to different images based on both their visual content and task context, thereby optimizing computational efficiency. We apply our method to various pathological tasks such as visual question answering, cancer subtyping, and lesion detection. Extensive experiments show an average +41.7 absolute performance improvement with 70.3% lower inference costs over the base models, achieving both reasoning accuracy and computational efficiency.
中文摘要:本文提出了一种新型双边强化学习框架,通过协同双分支结构增强病理图像分析的推理能力并优化计算效率,在多种诊断任务中实现了性能显著提升与推理成本大幅降低。
English Summary: A novel bilateral reinforcement learning framework is introduced to enhance reasoning capabilities and optimize computational efficiency in multimodal pathological image analysis, achieving significant performance improvements and reduced inference costs across various diagnostic tasks.

Authors:Youming Tao, Zuyuan Zhang, Dongxiao Yu, Xiuzhen Cheng, Falko Dressler, Di Wang
Title: Second-Order Convergence in Private Stochastic Non-Convex Optimization
Abstract:
We investigate the problem of finding second-order stationary points (SOSP) in differentially private (DP) stochastic non-convex optimization. Existing methods suffer from two key limitations: (i) inaccurate convergence error rate due to overlooking gradient variance in the saddle point escape analysis, and (ii) dependence on auxiliary private model selection procedures for identifying DP-SOSP, which can significantly impair utility, particularly in distributed settings. To address these issues, we propose a generic perturbed stochastic gradient descent (PSGD) framework built upon Gaussian noise injection and general gradient oracles. A core innovation of our framework is using model drift distance to determine whether PSGD escapes saddle points, ensuring convergence to approximate local minima without relying on second-order information or additional DP-SOSP identification. By leveraging the adaptive DP-SPIDER estimator as a specific gradient oracle, we develop a new DP algorithm that rectifies the convergence error rates reported in prior work. We further extend this algorithm to distributed learning with arbitrarily heterogeneous data, providing the first formal guarantees for finding DP-SOSP in such settings. Our analysis also highlights the detrimental impacts of private selection procedures in distributed learning under high-dimensional models, underscoring the practical benefits of our design. Numerical experiments on real-world datasets validate the efficacy of our approach.
中文摘要:本研究提出了一种基于高斯噪声和梯度查询的扰动随机梯度下降框架,可有效寻找差分隐私下的二阶稳定点,修正了现有收敛误差并扩展至分布式学习场景,首次为此类设置提供了正式保证。
English Summary: This study introduces a perturbed stochastic gradient descent framework using Gaussian noise and gradient oracles to efficiently find differentially private second-order stationary points in non-convex optimization, correcting prior convergence errors and extending to distributed learning with formal guarantees.

Authors:Kaiyuan Chen, Letian Fu, David Huang, Yanxiang Zhang, Lawrence Yunliang Chen, Huang Huang, Kush Hari, Ashwin Balakrishna, Ted Xiao, Pannag R Sanketi, John Kubiatowicz, Ken Goldberg
Title: Robo-DM: Data Management For Large Robot Datasets
Abstract:
Recent results suggest that very large datasets of teleoperated robot demonstrations can be used to train transformer-based models that have the potential to generalize to new scenes, robots, and tasks. However, curating, distributing, and loading large datasets of robot trajectories, which typically consist of video, textual, and numerical modalities (including streams from multiple cameras), remains challenging. We propose Robo-DM, an efficient open-source cloud-based data management toolkit for collecting, sharing, and learning with robot data. With Robo-DM, robot datasets are stored in a self-contained format with Extensible Binary Meta Language (EBML). Robo-DM can significantly reduce the size of robot trajectory data, transfer costs, and data load time during training. Compared to the RLDS format used in OXE datasets, Robo-DM's compression saves space by up to 70x (lossy) and 3.5x (lossless). Robo-DM also accelerates data retrieval by load-balancing video decoding with memory-mapped decoding caches. Compared to LeRobot, a framework that also uses lossy video compression, Robo-DM is up to 50x faster when decoding sequentially. We physically evaluate an In-Context Robot Transformer model trained with Robo-DM's lossy compression on a pick-and-place task. With 75x compression of the original dataset, the model does not suffer a reduction in downstream task accuracy.
中文: 最新研究表明,大规模遥操作机器人数据集可用于训练具备跨场景、跨任务泛化能力的Transformer模型,但多模态轨迹数据的管理仍具挑战;Robo-DM开源云工具包通过EBML压缩技术实现高效数据管理,在保持任务精度的同时将存储空间减少高达70倍、数据读取速度提升50倍。
English: Recent advances show that large-scale teleoperated robot datasets can train transformer models for broad generalization, but managing such multimodal data remains difficult; Robo-DM, an open-source cloud toolkit, addresses this with an EBML-based storage format and compression, reducing storage by up to 70x (lossy) and speeding sequential decoding by up to 50x while maintaining task accuracy.

Authors:Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo
Title: Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Abstract:
Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, we propose the first process reward model (PRM), called Web-Shepherd, which can assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we introduce WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that Web-Shepherd achieves about 30 points better accuracy than GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite with GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10x lower cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.
中文: 本文提出首个用于网页导航的过程奖励模型Web-Shepherd,通过大规模数据集实现步骤级轨迹评估,在准确性和成本效益上显著优于现有方法。
English: This paper introduces Web-Shepherd, the first process reward model for web navigation that evaluates step-level trajectories using a large-scale dataset and achieves superior accuracy and cost-efficiency compared to existing methods.

Authors:Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li
Title: Scaling Diffusion Transformers Efficiently via $μ$P
Abstract:
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($μ$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $μ$P of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and training objective. In this work, we generalize standard $μ$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $μ$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$α$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $μ$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$μ$P enjoys robust HP transferability. Notably, DiT-XL-2-$μ$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $μ$P on text-to-image generation by scaling PixArt-$α$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $μ$P outperform their respective baselines while requiring only a small tuning cost: 5.5% of one training run for PixArt-$α$ and 3% of the consumption by human experts for MMDiT-18B. These results establish $μ$P as a principled and efficient framework for scaling diffusion Transformers.
中文: 本研究成功将最大更新参数化(μP)推广至扩散Transformer模型,实现了超参数从小型到大型模型的稳定迁移,在显著降低调优成本的同时提升了多种架构的性能表现。
English: This study successfully adapts Maximal Update Parametrization (μP) to diffusion Transformers, enabling stable hyperparameter transfer from small to large models and significantly reducing tuning costs while improving performance across multiple architectures.

Authors:Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, Jieru Zhao
Title: LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
Abstract:
Recent developments in Video Large Language Models (Video LLMs) have enabled models to process long video sequences and demonstrate remarkable performance. Nonetheless, studies predominantly focus on offline video question answering, neglecting memory usage and response speed that are essential in various real-world applications, such as Deepseek services, autonomous driving, and robotics. To mitigate these challenges, we propose LiveVLM, a training-free framework specifically designed for streaming, online video understanding and real-time interaction. Unlike existing works that process videos only after one question is posed, LiveVLM constructs an innovative streaming-oriented KV cache to process video streams in real time, retain long-term video details and eliminate redundant KVs, ensuring prompt responses to user queries. For continuous video streams, LiveVLM generates and compresses video key-value tensors (video KVs) to preserve visual information while improving memory efficiency. Furthermore, when a new question is posed, LiveVLM incorporates an online question-answering process that efficiently fetches both short-term and long-term visual information, while minimizing interference from redundant context. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to process 44$\times$ as many frames on the same device, and achieves up to 5$\times$ speedup in response speed compared with SoTA online methods at an input of 256 frames, while maintaining the same or better model performance.
中文: LiveVLM是一种无需训练的新框架,通过创新的流式KV缓存实时处理视频流,显著提升内存效率和响应速度,同时保持模型性能。
English: LiveVLM is a training-free framework that enables real-time streaming video understanding by creating an innovative KV cache to process video streams efficiently, significantly improving memory usage and response speed while maintaining performance.

Authors:Roozbeh Aghili, Xingfang Wu, Foutse Khomh, Heng Li
Title: SDLog: A Deep Learning Framework for Detecting Sensitive Information in Software Logs
Abstract:
Software logs are messages recorded during the execution of a software system that provide crucial run-time information about events and activities. Although software logs have a critical role in software maintenance and operation tasks, publicly accessible log datasets remain limited, hindering advances in log analysis research and practice. The presence of sensitive information, particularly Personally Identifiable Information (PII) and quasi-identifiers, introduces serious privacy and re-identification risks, discouraging the publishing and sharing of real-world logs. In practice, log anonymization techniques primarily rely on regular expression patterns, which involve manually crafting rules to identify and replace sensitive information. However, these regex-based approaches suffer from significant limitations, such as extensive manual effort and poor generalizability across diverse log formats and datasets. To mitigate these limitations, we introduce SDLog, a deep learning-based framework designed to identify sensitive information in software logs. Our results show that SDLog overcomes regex limitations and outperforms the best-performing regex patterns in identifying sensitive information. With only 100 fine-tuning samples from the target dataset, SDLog can correctly identify 99.5% of sensitive attributes and achieves an F1-score of 98.4%. To the best of our knowledge, this is the first deep learning alternative to regex-based methods in software log anonymization.
Chinese: 软件日志对系统维护至关重要,但因敏感数据带来的隐私风险而共享受限,为此提出的SDLog深度学习框架通过少量样本微调即可高精度识别敏感信息,超越了基于正则表达式的方法。
English: Software logs are essential for system maintenance but face limited sharing due to privacy risks from sensitive data, prompting the introduction of SDLog, a deep learning framework that outperforms regex-based methods by achieving high accuracy in identifying sensitive information with minimal fine-tuning.

Authors:Kunyun Wang, Bohan Li, Kai Yu, Minyi Guo, Jieru Zhao
Title: Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism
Abstract:
Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the inherently sequential nature of the denoising process. While existing parallelization strategies attempt to accelerate inference by distributing computation across multiple devices, they typically incur high communication overhead, hindering deployment on commercial hardware. To address this challenge, we propose ParaStep, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. Unlike prior approaches that rely on layer-wise or stage-wise communication, ParaStep employs lightweight, step-wise communication, substantially reducing overhead. ParaStep achieves end-to-end speedups of up to 3.88$\times$ on SVD, 2.43$\times$ on CogVideoX-2b, and 6.56$\times$ on AudioLDM2-large, while maintaining generation quality. These results highlight ParaStep as a scalable and communication-efficient solution for accelerating diffusion inference, particularly in bandwidth-constrained environments.
中文: ParaStep是一种新型并行化方法,通过利用相邻去噪步骤间的相似性进行轻量级逐步通信,在保持生成质量的同时显著加速扩散模型推理。
English: ParaStep is a novel parallelization method that accelerates diffusion model inference by exploiting similarities between denoising steps with lightweight step-wise communication, achieving significant speedups while maintaining generation quality.

Authors:Yang Xiao, Rohan Kumar Das
Title: Listen, Analyze, and Adapt to Learn New Attacks: An Exemplar-Free Class Incremental Learning Method for Audio Deepfake Source Tracing
Abstract:
As deepfake speech becomes common and hard to detect, it is vital to trace its source. Recent work on audio deepfake source tracing (ST) aims to find the origins of synthetic or manipulated speech. However, ST models must adapt to learn new deepfake attacks while retaining knowledge of the previous ones. A major challenge is catastrophic forgetting, where models lose the ability to recognize previously learned attacks. Some continual learning methods help with deepfake detection, but multi-class tasks such as ST introduce additional challenges as the number of classes grows. To address this, we propose an analytic class incremental learning method called AnaST. When new attacks appear, the feature extractor remains fixed, and the classifier is updated with a closed-form analytical solution in one epoch. This approach ensures data privacy, optimizes memory usage, and is suitable for online training. The experiments carried out in this work show that our method outperforms the baselines.
Chinese: 针对音频深度伪造溯源中的灾难性遗忘问题,我们提出AnaST持续学习方法,该方法冻结特征提取器并采用解析解单轮更新分类器以应对新型攻击,在保证数据隐私和内存效率的同时显著优于基线模型。
English: To address catastrophic forgetting in audio deepfake source tracing, we propose AnaST, a continual learning method that freezes the feature extractor and analytically updates the classifier in one epoch for new attacks, outperforming baselines while ensuring data privacy and memory efficiency.

Authors:Yang Xiao, Tianyi Peng, Yanghao Zhou, Rohan Kumar Das
Title: AdaKWS: Towards Robust Keyword Spotting with Test-Time Adaptation
Abstract:
Spoken keyword spotting (KWS) aims to identify keywords in audio for wide applications, especially on edge devices. Current small-footprint KWS systems focus on efficient model designs. However, their inference performance can decline in unseen environments or noisy backgrounds. Test-time adaptation (TTA) helps models adapt to test samples without needing the original training data. In this study, we present AdaKWS, the first TTA method for robust KWS to the best of our knowledge. Specifically, 1) We initially optimize the model's confidence by selecting reliable samples based on prediction entropy minimization and adjusting the normalization statistics in each batch. 2) We introduce pseudo-keyword consistency (PKC) to identify critical, reliable features without overfitting to noise. Our experiments show that AdaKWS outperforms other methods across various conditions, including Gaussian noise and real-scenario noises. The code will be released in due course.
中文: AdaKWS提出了首个用于鲁棒语音关键词检测的测试时自适应方法,通过基于预测熵最小化的可靠样本选择和伪关键词一致性机制,有效提升了模型在未知环境和噪声场景下的性能。
English: AdaKWS introduces a test-time adaptation method for robust spoken keyword spotting by optimizing model confidence through entropy minimization and pseudo-keyword consistency to handle unseen environments and noise effectively.
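
The reliable-sample selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "reliable" means a softmax output whose Shannon entropy falls below a fixed threshold; the threshold value, the function names, and the toy probabilities are hypothetical and are not taken from the AdaKWS paper.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in nats) of one softmax output."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_reliable(batch_probs, threshold):
    """Return indices of samples whose entropy is below the threshold."""
    return [i for i, probs in enumerate(batch_probs)
            if prediction_entropy(probs) < threshold]

# Toy batch of class probabilities for three test samples (made-up numbers).
batch = [
    [0.97, 0.02, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction, high entropy
    [0.80, 0.15, 0.05],  # moderately confident prediction
]

# Hypothetical threshold: 40% of the maximum entropy log(num_classes).
threshold = 0.4 * math.log(3)
reliable = select_reliable(batch, threshold)
print(reliable)
```

Only the selected samples would then drive the adaptation update; how the paper combines this with batch-normalization statistics and pseudo-keyword consistency is not specified in the abstract.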

Authors:Yingtao Luo, Shikai Fang, Binqing Wu, Qingsong Wen, Liang Sun
Title: Physics-Guided Learning of Meteorological Dynamics for Weather Downscaling and Forecasting
Abstract:
Weather forecasting is essential but remains computationally intensive and physically incomplete in traditional numerical weather prediction (NWP) methods. Deep learning (DL) models offer efficiency and accuracy but often ignore physical laws, limiting interpretability and generalization. We propose PhyDL-NWP, a physics-guided deep learning framework that integrates physical equations with latent force parameterization into data-driven models. It predicts weather variables from arbitrary spatiotemporal coordinates, computes physical terms via automatic differentiation, and uses a physics-informed loss to align predictions with governing dynamics. PhyDL-NWP enables resolution-free downscaling by modeling weather as a continuous function and fine-tunes pre-trained models with minimal overhead, achieving up to 170x faster inference with only 55K parameters. Experiments show that PhyDL-NWP improves both forecasting performance and physical consistency.
中文: PhyDL-NWP是一种物理引导的深度学习框架,将物理方程融入数据驱动的天气预报中,实现了高效、高分辨率的预测,并提升了准确性和物理一致性。
English: PhyDL-NWP is a physics-guided deep learning framework that integrates physical equations into data-driven weather forecasting, enabling efficient, high-resolution predictions with improved accuracy and physical consistency.

Authors:Aviv Navon, Aviv Shamsian, Yael Segal-Feldman, Neta Glazer, Gil Hetz, Joseph Keshet
Title: FlowTSE: Target Speaker Extraction with Flow Matching
Abstract:
Target speaker extraction (TSE) aims to isolate a specific speaker's speech from a mixture using speaker enrollment as a reference. While most existing approaches are discriminative, recent generative methods for TSE achieve strong results. However, generative methods for TSE remain underexplored, with most existing approaches relying on complex pipelines and pretrained components, leading to computational overhead. In this work, we present FlowTSE, a simple yet effective TSE approach based on conditional flow matching. Our model receives an enrollment audio sample and a mixed speech signal, both represented as mel-spectrograms, with the objective of extracting the target speaker's clean speech. Furthermore, for tasks where phase reconstruction is crucial, we propose a novel vocoder conditioned on the complex STFT of the mixed signal, enabling improved phase estimation. Experimental results on standard TSE benchmarks show that FlowTSE matches or outperforms strong baselines.
中文摘要:FlowTSE是一种基于条件流匹配的简单而有效的目标说话人提取方法,在标准基准测试中达到或超越了现有强基线性能,并提出了新型声码器以改进相位重建。
English Summary: FlowTSE is a simple yet effective target speaker extraction method using conditional flow matching that achieves competitive performance on standard benchmarks, while also introducing a novel vocoder for enhanced phase reconstruction.

Authors:Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen
Title: From Templates to Natural Language: Generalization Challenges in Instruction-Tuned LLMs for Spatial Reasoning
Abstract:
Instruction-tuned large language models (LLMs) have shown strong performance on a variety of tasks; however, generalizing from synthetic to human-authored instructions in grounded environments remains a challenge for them. In this work, we study generalization challenges in spatial grounding tasks where models interpret and translate instructions for building object arrangements on a $2.5$D grid. We fine-tune LLMs using only synthetic instructions and evaluate their performance on a benchmark dataset containing both synthetic and human-written instructions. Our results reveal that while models generalize well on simple tasks, their performance degrades significantly on more complex tasks. We present a detailed error analysis of the gaps in instruction generalization.
Chinese: 指令调优的大型语言模型在空间基础任务中从合成指令泛化到人类编写指令方面存在困难,尽管使用合成数据进行了微调,但在复杂任务上表现显著下降。
English: Instruction-tuned LLMs struggle with generalizing from synthetic to human-authored instructions in spatial grounding tasks, showing significant performance degradation on complex tasks despite fine-tuning with synthetic data.

Authors:Nadir Durrani, Basel Mousi, Fahim Dalvi
Title: Editing Across Languages: A Survey of Multilingual Knowledge Editing
Abstract:
While Knowledge Editing has been extensively studied in monolingual settings, it remains underexplored in multilingual contexts. This survey systematizes recent research on Multilingual Knowledge Editing (MKE), a growing subdomain of model editing focused on ensuring factual edits generalize reliably across languages. We present a comprehensive taxonomy of MKE methods, covering parameter-based, memory-based, fine-tuning, and hypernetwork approaches. We survey available benchmarks, summarize key findings on method effectiveness and transfer patterns, identify challenges in cross-lingual propagation, and highlight open problems related to language anisotropy, evaluation coverage, and edit scalability. Our analysis consolidates a rapidly evolving area and lays the groundwork for future progress in editable language-aware LLMs.
中文摘要:本综述系统梳理了多语言知识编辑的研究进展,提出了完整的方法分类体系,同时指出了跨语言传播中的关键挑战,并为可编辑语言感知大模型的未来发展指明了待解决的问题。
English Summary: This survey systematizes research on Multilingual Knowledge Editing, presenting a comprehensive taxonomy of methods while identifying key challenges in cross-lingual propagation and highlighting open problems for future development.

Authors:Ni Ding, Miao Qiao, Jiaxing Xu, Yiping Ke, Xiaoyu Zhang
Title: $α$-GAN by Rényi Cross Entropy
Abstract:
This paper proposes $α$-GAN, a generative adversarial network using Rényi measures. The value function is formulated, by Rényi cross entropy, as an expected certainty measure incurred by the discriminator's soft decision as to where the sample is from, true population or the generator. The discriminator tries to maximize the Rényi certainty about sample source, while the generator wants to reduce it by injecting fake samples. This forms a min-max problem with the solution parameterized by the Rényi order $α$. This $α$-GAN reduces to vanilla GAN at $α= 1$, where the value function is exactly the binary cross entropy. The optimization of $α$-GAN is over probability (vector) space. It is shown that the gradient is exponentially enlarged when Rényi order is in the range $α\in (0,1)$. This makes convergence faster, which is verified by experimental results. A discussion shows that choosing $α\in (0,1)$ may be able to solve some common problems, e.g., vanishing gradient. A following observation reveals that this range has not been fully explored in the existing Rényi version GANs.
中文: 本文提出α-GAN,一种利用Rényi测度的生成对抗网络,通过在(0,1)范围内优化Rényi阶数来加速收敛并解决梯度消失等问题,且在α=1时退化为标准GAN。
English: This paper introduces α-GAN, a generative adversarial network utilizing Rényi measures, which accelerates convergence and addresses issues like vanishing gradients by optimizing with Rényi orders in the range (0,1), reducing to the standard GAN at α=1.
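
For reference, the α = 1 case named in the abstract is the vanilla GAN objective, whose value function is the binary cross entropy. The formula below is the standard GAN formulation rather than anything reproduced from the α-GAN paper; the symbols $D$, $G$, $p_{\text{data}}$, and $p_z$ are the usual notation and are assumptions here. The Rényi-order generalization is not written out because the abstract does not state it.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

Here $D(x)$ is the discriminator's soft decision that $x$ comes from the true population, which matches the "certainty about sample source" reading in the abstract.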

Authors:Cosmin I. Bercea, Jun Li, Philipp Raffler, Evamaria O. Riedel, Lena Schmitzer, Angela Kurz, Felix Bitzer, Paula Roßmüller, Julian Canisius, Mirjam L. Beyrle, Che Liu, Wenjia Bai, Bernhard Kainz, Julia A. Schnabel, Benedikt Wiestler
Title: NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI
Abstract:
In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs to ensure the system remains robust as ever-emerging, previously unknown categories appear and must be addressed without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life evaluation-only benchmark of $\sim$900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops across all tasks, establishing NOVA as a rigorous testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.
中文: NOVA是一个包含约900例脑部MRI扫描的评估基准,涵盖281种罕见病变,用于严格测试视觉语言模型在未经训练情况下对未知异常的检测、定位和推理能力,当前领先模型在此基准上均出现显著性能下降。
English: NOVA is a challenging evaluation benchmark of brain MRI scans designed to rigorously test out-of-distribution generalization in vision-language models by assessing their ability to detect, localize, and reason about rare pathologies without prior training, revealing significant performance drops in current models.

Authors:Kaya Stechly, Karthik Valmeekam, Atharva Gundawar, Vardhan Palod, Subbarao Kambhampati
Title: Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
Abstract:
Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. In this paper, we critically examine that interpretation by investigating how the semantics of intermediate tokens (often anthropomorphized as "thoughts" or reasoning traces, and claimed to display behaviors like backtracking and self-verification) actually influence model performance. We train transformer models on formally verifiable reasoning traces and solutions, constraining both intermediate steps and final outputs to align with those of a formal solver (in our case, A* search). By constructing a formal interpreter of the semantics of our problems and intended algorithm, we systematically evaluate not only solution accuracy but also the correctness of intermediate traces, thus allowing us to evaluate whether the latter causally influences the former. We notice that, despite significant improvements on the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions. To further show that trace accuracy is only loosely connected to solution accuracy, we then train models on noisy, corrupted traces which have no relation to the specific problem each is paired with, and find that not only does performance remain largely consistent with models trained on correct data, but in some cases can improve upon it and generalize more robustly on out-of-distribution tasks. These results challenge the assumption that intermediate tokens or "Chains of Thought" induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their mostly correct forms) as evidence of human-like or algorithmic behaviors in language models.
中文摘要:本研究通过证明准确的中间步骤对得出正确解答并非必需,甚至使用被破坏的推理痕迹也能保持或提升模型性能及泛化能力,从而对思维链在语言模型中的有效性提出质疑。
English Summary: This study challenges the effectiveness of Chain of Thought reasoning in language models by demonstrating that accurate intermediate steps are not necessary for correct solutions, and even corrupted traces can maintain or improve performance and generalization.

Authors:Soumya Rani Samineni, Durgesh Kalwar, Karthik Valmeekam, Kaya Stechly, Subbarao Kambhampati
Title: RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs
Abstract:
Reinforcement learning-based post-training of large language models (LLMs) has recently gained attention, particularly following the release of DeepSeek R1, which applied GRPO for fine-tuning. Amid the growing hype around improved reasoning abilities attributed to RL post-training, we critically examine the formulation and assumptions underlying these methods. We start by highlighting the popular structural assumptions made in modeling LLM training as a Markov Decision Process (MDP), and show how they lead to a degenerate MDP that doesn't quite need the RL/GRPO apparatus. The two critical structural assumptions include (1) making the MDP states be just a concatenation of the actions, with states becoming the context window and the actions becoming the tokens in LLMs, and (2) splitting the reward of a state-action trajectory uniformly across the trajectory. Through a comprehensive analysis, we demonstrate that these simplifying assumptions make the approach effectively equivalent to outcome-driven supervised learning. Our experiments on benchmarks including GSM8K and Countdown using Qwen-2.5 base models show that iterative supervised fine-tuning, incorporating both positive and negative samples, achieves performance comparable to GRPO-based training. We will also argue that the structural assumptions indirectly incentivize the RL to generate longer sequences of intermediate tokens, which in turn feeds into the narrative of "RL generating longer thinking traces." While RL may well be a very useful technique for improving the reasoning abilities of LLMs, our analysis shows that the simplistic structural assumptions made in modeling the underlying MDP render the popular LLM RL frameworks and their interpretations questionable.
中文摘要:本研究批判性地分析了基于强化学习的大语言模型后训练中的结构性假设,指出这些假设导致马尔可夫决策过程退化,实质上使该方法等同于结果驱动的监督学习,实验表明通过更简单的监督微调方法即可获得相当的性能。
English Summary: The study critically examines the structural assumptions in reinforcement learning-based post-training of large language models, showing they create a degenerate MDP that reduces the approach to outcome-driven supervised learning, with experiments demonstrating comparable performance through simpler supervised fine-tuning methods.
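
The second structural assumption, spreading one trajectory-level signal uniformly over every token, can be sketched as follows. This is a minimal illustration of the commonly described GRPO-style group-relative advantage, not the paper's code: the use of the population standard deviation, the function names, and the toy rewards are assumptions.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each sampled response's reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption
    if std == 0.0:
        # All responses scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def per_token_advantages(advantage, num_tokens):
    """Copy one response-level advantage onto every token of the response."""
    return [advantage] * num_tokens

# Four sampled responses to one prompt with binary outcome rewards
# (made-up values) and their token counts.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [5, 3, 8, 2]

advs = group_relative_advantages(rewards)
token_advs = [per_token_advantages(a, n) for a, n in zip(advs, lengths)]
print(advs)
```

Because every token in a response receives the same value, the per-token signal carries no more information than the response-level outcome, which is the sense in which the abstract calls the setup equivalent to outcome-driven supervised learning.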

Authors:Krzysztof Lebioda, Nenad Petrovic, Fengjunjie Pan, Vahid Zolfaghari, Andre Schamschurko, Alois Knoll
Title: Are requirements really all you need? A case study of LLM-driven configuration code generation for automotive simulations
Abstract:
Large Language Models (LLMs) are taking many industries by storm. They possess impressive reasoning capabilities and are capable of handling complex problems, as shown by their steadily improving scores on coding and mathematical benchmarks. However, are the models currently available truly capable of addressing real-world challenges, such as those found in the automotive industry? How well can they understand high-level, abstract instructions? Can they translate these instructions directly into functional code, or do they still need help and supervision? In this work, we put one of the current state-of-the-art models to the test. We evaluate its performance in the task of translating abstract requirements, extracted from automotive standards and documents, into configuration code for CARLA simulations.
中文: 本研究评估了顶尖大语言模型将抽象汽车需求转化为CARLA仿真功能代码的能力,对其在现实应用中的有效性提出质疑,尽管其在基准测试中表现优异。
English: This study evaluates a state-of-the-art LLM's ability to translate abstract automotive requirements into functional CARLA simulation code, questioning its real-world applicability despite strong benchmark performance.

Authors:Kevin Chenhao Li, Vahid Zolfaghari, Nenad Petrovic, Fengjunjie Pan, Alois Knoll
Title: Optimizing Retrieval Augmented Generation for Object Constraint Language
Abstract:
The Object Constraint Language (OCL) is essential for defining precise constraints within Model-Based Systems Engineering (MBSE). However, manually writing OCL rules is complex and time-consuming. This study explores the optimization of Retrieval-Augmented Generation (RAG) for automating OCL rule generation, focusing on the impact of different retrieval strategies. We evaluate three retrieval approaches: BM25 (lexical-based), BERT-based (semantic retrieval), and SPLADE (sparse-vector retrieval), analyzing their effectiveness in providing relevant context for a large language model. To further assess our approach, we compare and benchmark our retrieval-optimized generation results against PathOCL, a state-of-the-art graph-based method. We directly compare BM25, BERT, and SPLADE retrieval methods with PathOCL to understand how different retrieval methods perform within a unified evaluation framework. Our experimental results, focusing on retrieval-augmented generation, indicate that while retrieval can enhance generation accuracy, its effectiveness depends on the retrieval method and the number of retrieved chunks (k). BM25 underperforms the baseline, whereas semantic approaches (BERT and SPLADE) achieve better results, with SPLADE performing best at lower k values. However, excessive retrieval with a high k can retrieve irrelevant chunks, which degrades model performance. Our findings highlight the importance of optimizing retrieval configurations to balance context relevance and output consistency. This research provides insights into improving OCL rule generation using RAG and underscores the need for tailoring retrieval.
中文: 本研究探讨了优化检索增强生成(RAG)来自动生成对象约束语言(OCL)规则,发现语义检索方法(如BERT和SPLADE)优于基于词汇的BM25,且性能高度依赖于检索上下文块的数量。
English: This study investigates optimizing Retrieval-Augmented Generation (RAG) for automating Object Constraint Language (OCL) rule generation, finding that semantic retrieval methods like BERT and SPLADE outperform lexical-based BM25, with performance highly dependent on the number of retrieved context chunks.
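
To make the role of the retrieved-chunk count k concrete, here is a minimal sketch of lexical top-k retrieval using the standard Okapi BM25 scoring formula. It is not the paper's pipeline: the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the toy metamodel chunks are all assumptions for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve_top_k(query, docs, k):
    """Return the k highest-scoring chunks, best first."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]

# Hypothetical metamodel chunks that a RAG prompt could draw from.
chunks = [
    "class Person has attribute age of type Integer",
    "class Company has association employees to Person",
    "class Car has attribute wheels of type Integer",
]
query = "Person age constraint"
print(retrieve_top_k(query, chunks, 2))
```

With k = 3 the third chunk is returned even though its score is zero, a small instance of the effect the abstract describes, where a high k pulls irrelevant context into the prompt.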

Authors:Guangda Liu, Chengwei Li, Zhenyu Ning, Minyi Guo, Jieru Zhao
Title: FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
Abstract:
Large language models (LLMs) have been widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods are proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, an algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to 13$\times$ speedup compared to SOTA KV retrieval methods.
中文摘要:FreeKV是一种算法-系统协同优化框架,通过推测式检索和细粒度修正提升KV检索效率,同时保持近乎无损的精度,相比现有最优方法实现了高达13倍的加速效果。
English Summary: FreeKV is an algorithm-system co-optimization framework that enhances KV retrieval efficiency through speculative retrieval and fine-grained correction while maintaining near-lossless accuracy, achieving up to 13x speedup over state-of-the-art methods.

Authors:Ege Özsoy, Chantal Pellegrini, David Bani-Harouni, Kun Yuan, Matthias Keicher, Nassir Navab
Title: Specialized Foundation Models for Intelligent Operating Rooms
Abstract:
Surgical procedures unfold in complex environments demanding coordination between surgical teams, tools, imaging and, increasingly, intelligent robotic systems. Ensuring safety and efficiency in ORs of the future requires intelligent systems, like surgical robots, smart instruments and digital copilots, capable of understanding complex activities and hazards of surgeries. Yet existing computational approaches lack the breadth and generalization needed for comprehensive OR understanding. We introduce ORQA, a multimodal foundation model unifying visual, auditory, and structured data for holistic surgical understanding. ORQA's question-answering framework empowers diverse tasks, serving as an intelligence core for a broad spectrum of surgical technologies. We benchmark ORQA against generalist vision-language models, including ChatGPT and Gemini, and show that while they struggle to perceive surgical scenes, ORQA delivers substantially stronger, consistent performance. Recognizing the extensive range of deployment settings across clinical practice, we design and release a family of smaller ORQA models tailored to different computational requirements. This work establishes a foundation for the next wave of intelligent surgical solutions, enabling surgical teams and medical technology providers to create smarter and safer operating rooms.
中文摘要:ORQA作为一种多模态基础模型,融合视觉、听觉和结构化数据,能全面理解手术场景,其性能显著优于通用AI模型,并为实现更智能安全的手术室提供可扩展解决方案。
English Summary: ORQA is a multimodal foundation model that integrates visual, auditory, and structured data to enable comprehensive surgical scene understanding, outperforming general AI models and offering scalable solutions for safer, more efficient operating rooms.

Authors:Yisheng Zhong, Yizhu Wen, Junfeng Guo, Mehran Kafai, Heng Huang, Hanqing Guo, Zhuangdi Zhu
Title: Web Intellectual Property at Risk: Preventing Unauthorized Real-Time Retrieval by Large Language Models
Abstract:
The protection of cyber Intellectual Property (IP) such as web content is an increasingly critical concern. The rise of large language models (LLMs) with online retrieval capabilities enables convenient access to information but often undermines the rights of original content creators. As users increasingly rely on LLM-generated responses, they engage less directly with original information sources, which will significantly reduce the incentives for IP creators to contribute and leave cyberspace increasingly saturated with AI-generated content. In response, we propose a novel defense framework that empowers web content creators to safeguard their web-based IP from unauthorized LLM real-time extraction and redistribution by leveraging the semantic understanding capability of LLMs themselves. Our method follows principled motivations and effectively addresses an intractable black-box optimization problem. Real-world experiments demonstrate that our methods improve defense success rates from 2.5% to 88.6% on different LLMs, outperforming traditional defenses such as configuration-based restrictions.
中文: 大型语言模型的普及通过减少对原创内容的直接访问而威胁网络知识产权,但新防御框架利用其自身语义理解能力保护网络知识产权,在真实测试中防御成功率最高达88.6%。
English: The proliferation of LLMs threatens cyber IP by reducing engagement with original content, but a new defense framework leverages LLMs' own semantic capabilities to protect web-based IP, achieving up to 88.6% success in real-world tests.

Authors:Yinghao Zhu, Ziyi He, Haoran Hu, Xiaochen Zheng, Xichen Zhang, Zixiang Wang, Junyi Gao, Liantao Ma, Lequan Yu
Title: MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks
Abstract:
The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.
中文: 本研究推出MedAgentBoard基准评估医疗领域多智能体协作,发现其仅在临床工作流自动化等特定场景中具有优势,而在多数任务中被单一大型语言模型或传统方法超越,强调需根据具体任务选择AI解决方案。
English: The study introduces MedAgentBoard, a benchmark evaluating multi-agent collaboration in medicine, finding it beneficial only in specific tasks like workflow automation but often outperformed by single LLMs or conventional methods, highlighting the need for task-specific AI solutions.

Authors:Haiyu Deng, Yanna Jiang, Guangsheng Yu, Qin Wang, Xu Wang, Baihe Ma, Wei Ni, Ren Ping Liu
Title: PoLO: Proof-of-Learning and Proof-of-Ownership at Once with Chained Watermarking
Abstract:
Machine learning models are increasingly shared and outsourced, raising the need to verify training effort (Proof-of-Learning, PoL) to ensure claimed performance and to establish ownership (Proof-of-Ownership, PoO) for transactions. When models are trained by untrusted parties, PoL and PoO must be enforced together to enable protection, attribution, and compensation. However, existing studies typically address them separately, which not only weakens protection against forgery and privacy breaches but also leads to high verification overhead. We propose PoLO, a unified framework that simultaneously achieves PoL and PoO using chained watermarks. PoLO splits the training process into fine-grained training shards and embeds a dedicated watermark in each shard. Each watermark is generated using the hash of the preceding shard, certifying the training process of the preceding shard. The chained structure makes it computationally difficult to forge any individual part of the whole training process. The complete set of watermarks serves as the PoL, while the final watermark provides the PoO. PoLO offers more efficient and privacy-preserving verification than vanilla PoL solutions, which rely on gradient-based trajectory tracing and inadvertently expose training data during verification, while maintaining the same level of ownership assurance as watermark-based PoO schemes. Our evaluation shows that PoLO achieves 99% watermark detection accuracy for ownership verification, while preserving data privacy and cutting verification costs to just 1.5-10% of traditional methods. Forging PoLO demands 1.1-4x more resources than honest proof generation, with the original proof retaining over 90% detection accuracy even after attacks.
中文摘要:PoLO框架通过训练过程中的链式水印技术,将学习证明与所有权证明有机结合,在保持原有所有权验证效果的同时,将验证成本降低至传统方法的1.5-10%,并有效防止隐私泄露和伪造行为。
English Summary: The PoLO framework unifies Proof-of-Learning and Proof-of-Ownership by embedding chained watermarks across fine-grained training shards, providing robust protection against forgery while cutting verification costs to 1.5-10% of traditional methods (a 90-98.5% reduction) and preserving data privacy.
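The chaining idea in the abstract (each watermark derived from the hash of the preceding shard) can be illustrated with a plain hash chain. This is only a minimal sketch under loud assumptions: shards are stand-in byte strings, watermarks are raw SHA-256 digests rather than signals embedded in model weights, and the names `build_chain`, `verify_chain`, and the `seed` value are hypothetical, not taken from PoLO.

```python
import hashlib

def shard_digest(shard: bytes, watermark: bytes) -> bytes:
    """Hash a (hypothetical) serialized training shard together with its watermark."""
    return hashlib.sha256(shard + watermark).digest()

def build_chain(shards, seed=b"owner-secret"):
    """Derive one watermark per shard from the digest of the preceding shard.

    Assumption: the first link is seeded from a secret, since no shard precedes it.
    """
    watermarks = []
    prev = hashlib.sha256(seed).digest()
    for shard in shards:
        wm = hashlib.sha256(b"wm" + prev).digest()  # watermark depends on the previous link
        watermarks.append(wm)
        prev = shard_digest(shard, wm)               # this digest feeds the next watermark
    return watermarks

def verify_chain(shards, watermarks, seed=b"owner-secret"):
    """Recompute the chain and compare it with the claimed watermarks."""
    return build_chain(shards, seed) == watermarks

shards = [b"shard-0", b"shard-1", b"shard-2"]
watermarks = build_chain(shards)
tampered = [b"shard-0", b"forged", b"shard-2"]
print(verify_chain(shards, watermarks))    # honest chain
print(verify_chain(tampered, watermarks))  # altered middle shard
```

In this toy version, altering a shard changes every later watermark, which is the property the chained structure relies on. The last shard is not covered by a later link here, so a real scheme would need an additional closing step.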

Authors:Rui Qin, Qijie Wang, Ming Sun, Haowei Zhu, Chao Zhou, Bin Wang
Title: Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling
Abstract:
Diffusion models have gained attention for their success in modeling complex distributions, achieving impressive perceptual quality in super-resolution (SR) tasks. However, existing diffusion-based SR methods often suffer from high computational costs, requiring numerous iterative steps for training and inference. Existing acceleration techniques, such as distillation and solver optimization, are generally task-agnostic and do not fully leverage the specific characteristics of low-level tasks like SR. In this study, we analyze the frequency- and spatial-domain properties of diffusion-based SR methods, revealing key insights into the temporal and spatial dependencies of high-frequency signal recovery. Specifically, high-frequency details benefit from concentrated optimization during early and late diffusion iterations, while spatially textured regions demand adaptive denoising strategies. Building on these observations, we propose the Time-Spatial-aware Sampling strategy (TSS) for the acceleration of diffusion SR without any extra training cost. TSS combines Time Dynamic Sampling (TDS), which allocates more iterations to refining textures, and Spatial Dynamic Sampling (SDS), which dynamically adjusts strategies based on image content. Extensive evaluations across multiple benchmarks demonstrate that TSS achieves state-of-the-art (SOTA) performance with significantly fewer iterations, improving MUSIQ scores by 0.2 - 3.0 and outperforming the current acceleration methods with only half the number of steps.
Chinese Summary: 本研究提出时空感知采样策略(TSS),通过动态分配计算资源到不同迭代阶段和图像区域,以一半的步骤实现超分辨率任务的当前最佳性能,且无需额外训练成本。
English Summary: This study introduces a Time-Spatial-aware Sampling strategy (TSS) that accelerates diffusion-based super-resolution by dynamically allocating computational resources across iterations and image regions, achieving state-of-the-art performance with half the steps and no extra training cost.

Authors:Junhao Zheng, Xidi Cai, Qiuke Li, Duzhen Zhang, ZhongZhi Li, Yingying Zhang, Le Song, Qianli Ma
Title: LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners
Abstract:
Lifelong learning is essential for intelligent agents operating in dynamic environments. Current large language model (LLM)-based agents, however, remain stateless and unable to accumulate or transfer knowledge over time. Existing benchmarks treat agents as static systems and fail to evaluate lifelong learning capabilities. We present LifelongAgentBench, the first unified benchmark designed to systematically assess the lifelong learning ability of LLM agents. It provides skill-grounded, interdependent tasks across three interactive environments, Database, Operating System, and Knowledge Graph, with automatic label verification, reproducibility, and modular extensibility. Extensive experiments reveal that conventional experience replay has limited effectiveness for LLM agents due to irrelevant information and context length constraints. We further introduce a group self-consistency mechanism that significantly improves lifelong learning performance. We hope LifelongAgentBench will advance the development of adaptive, memory-capable LLM agents.
中文: LifelongAgentBench是首个系统性评估大语言模型智能体终身学习能力的统一基准,它通过交互环境中的技能任务揭示了传统方法的局限性,并提出了有效的群体自洽机制以提升学习性能。
English: LifelongAgentBench is introduced as the first unified benchmark to systematically evaluate the lifelong learning abilities of LLM agents, featuring skill-grounded tasks across interactive environments and revealing the limitations of conventional methods while proposing an effective group self-consistency mechanism for improvement.

Authors:Yang Xiao, Tianyi Peng, Rohan Kumar Das, Yuchen Hu, Huiping Zhuang
Title: AnalyticKWS: Towards Exemplar-Free Analytic Class Incremental Learning for Small-footprint Keyword Spotting
Abstract:
Keyword spotting (KWS) offers a vital mechanism to identify spoken commands in voice-enabled systems, where user demands often shift, requiring models to learn new keywords continually over time. However, a major problem is catastrophic forgetting, where models lose their ability to recognize earlier keywords. Although several continual learning methods have proven their usefulness for reducing forgetting, most existing approaches depend on storing and revisiting old data to combat catastrophic forgetting. Though effective, these methods face two practical challenges: 1) privacy risks from keeping user data and 2) large memory and time consumption that limit deployment on small devices. To address these issues, we propose an exemplar-free Analytic Continual Learning (AnalyticKWS) method that updates model parameters without revisiting earlier data. Inspired by efficient learning principles, AnalyticKWS computes a closed-form analytical solution for model updates and requires only a single epoch of adaptation for incoming keywords. AnalyticKWS demands fewer computational resources by avoiding gradient-based updates and does not store old data. By eliminating the need for back-propagation during incremental learning, the model remains lightweight and efficient. As a result, AnalyticKWS meets the challenges mentioned earlier and suits resource-limited settings well. Extensive experiments on various datasets and settings show that AnalyticKWS consistently outperforms existing continual learning methods.
Chinese: 提出的AnalyticKWS方法通过解析解实现单轮更新的持续关键词学习,无需存储旧数据即可有效防止灾难性遗忘,同时适用于资源有限的小型设备。
English: The proposed AnalyticKWS method enables continual keyword learning without storing old data by using an analytical solution for single-epoch updates, effectively preventing catastrophic forgetting while being resource-efficient for small devices.
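To make the idea of a closed-form, exemplar-free update concrete, here is a deliberately tiny sketch: a one-feature ridge regression whose solution depends only on two running sums, so new data can be folded in without keeping old samples. This is an assumption-laden toy, not the AnalyticKWS algorithm; the class name `ScalarRidge`, the `lam` regularizer, and the scalar setting are all hypothetical simplifications.

```python
class ScalarRidge:
    """One-feature ridge regression kept as running sufficient statistics."""

    def __init__(self, lam: float = 1e-3):
        self.sxx = lam  # ridge term plus running sum of x * x
        self.sxy = 0.0  # running sum of x * y

    def update(self, xs, ys):
        """Fold in a new batch; earlier samples are never revisited or stored."""
        for x, y in zip(xs, ys):
            self.sxx += x * x
            self.sxy += x * y

    @property
    def weight(self) -> float:
        """Closed-form solution w = sum(x*y) / (lam + sum(x*x))."""
        return self.sxy / self.sxx

# Learn in two phases, discarding phase-1 data before phase 2 arrives.
incremental = ScalarRidge()
incremental.update([1.0, 2.0], [2.1, 3.9])
incremental.update([3.0, 4.0], [6.2, 7.8])

# Reference: the same model fitted on all data at once.
joint = ScalarRidge()
joint.update([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
print(incremental.weight, joint.weight)
```

The incremental and joint fits agree because the closed-form solution needs only the accumulated statistics, which is the general reason analytic updates can avoid both back-propagation and stored exemplars.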

Authors:Danilo de Oliveira, Julius Richter, Tal Peer, Timo Gerkmann
Title: LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models
Abstract:
We present LipDiffuser, a conditional diffusion model for lip-to-speech generation that synthesizes natural and intelligible speech directly from silent video recordings. Our approach leverages the magnitude-preserving ablated diffusion model (MP-ADM) architecture as a denoiser model. To effectively condition the model, we incorporate visual features using magnitude-preserving feature-wise linear modulation (MP-FiLM) alongside speaker embeddings. A neural vocoder then reconstructs the speech waveform from the generated mel-spectrograms. Evaluations on LRS3 and TCD-TIMIT demonstrate that LipDiffuser outperforms existing lip-to-speech baselines in perceptual speech quality and speaker similarity, while remaining competitive in downstream automatic speech recognition (ASR). These findings are also supported by a formal listening experiment. Extensive ablation studies and cross-dataset evaluation confirm the effectiveness and generalization capabilities of our approach.
中文:LipDiffuser是一种条件扩散模型,通过结合视觉特征和说话人嵌入从无声视频生成自然清晰的语音,在语音质量和说话人相似度上优于现有方法,同时保持竞争力的自动语音识别性能。
English: LipDiffuser is a conditional diffusion model that generates natural and intelligible speech from silent videos using visual features and speaker embeddings, outperforming existing methods in speech quality and speaker similarity while maintaining competitive ASR performance.

Authors:Zheng Li, Qingxiu Dong, Jingyuan Ma, Di Zhang, Kai Jia, Zhifang Sui
Title: SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning
Abstract:
While reasoning models demonstrate exceptional performance on complex tasks, they often tend to overthink simple problems. This phenomenon not only leads to excessive computational resource consumption but also significantly degrades user experience. To address this challenge, we propose SelfBudgeter, a novel user-friendly adaptive controllable reasoning framework that incorporates a budget estimation mechanism prior to reasoning. The framework adopts a dual-phase training paradigm: during the cold-start phase, the model learns to predict token budgets before executing reasoning in a standardized format; in the reinforcement learning phase, the model is trained to autonomously plan budgets based on problem difficulty and strictly adhere to them when generating responses. Since the model outputs budget estimates at the initial stage, users can immediately anticipate waiting duration, enabling flexible decisions on whether to interrupt or continue the generation process. Notably, our method supports manual control of reasoning length through pre-filled budget fields. Experimental results demonstrate that SelfBudgeter can dynamically allocate budgets according to problem complexity, yielding an average response length compression of 61% for the 1.5B model on GSM8K, MATH500, and AIME2025, and 48% for the 7B model, while maintaining nearly undiminished accuracy.
中文摘要:SelfBudgeter是一种自适应推理框架,通过双阶段训练在推理前预测计算预算,既能动态分配资源实现显著响应压缩,又能保持准确率基本不变。
English Summary: SelfBudgeter is an adaptive reasoning framework that uses a dual-phase training approach to predict token budgets before reasoning, enabling budget allocation by problem difficulty and average response length compression of 61% (1.5B model) and 48% (7B model) while maintaining nearly undiminished accuracy.

Authors:Kasra Borazjani, Payam Abdisarabshali, Fardis Nadimi, Naji Khosravan, Minghui Liwang, Xianbin Wang, Yiguang Hong, Seyyedali Hosseinalipour
Title: Multi-Modal Multi-Task (M3T) Federated Foundation Models for Embodied AI: Potentials and Challenges for Edge Integration
Abstract:
As embodied AI systems become increasingly multi-modal, personalized, and interactive, they must learn effectively from diverse sensory inputs, adapt continually to user preferences, and operate safely under resource and privacy constraints. These challenges expose a pressing need for machine learning models capable of swift, context-aware adaptation while balancing model generalization and personalization. Here, two methods emerge as suitable candidates, each offering parts of these capabilities: multi-modal multi-task foundation models (M3T-FMs) provide a pathway toward generalization across tasks and modalities, whereas federated learning (FL) offers the infrastructure for distributed, privacy-preserving model updates and user-level model personalization. However, when used in isolation, each of these approaches falls short of meeting the complex and diverse capability requirements of real-world embodied AI environments. In this vision paper, we introduce multi-modal multi-task federated foundation models (M3T-FFMs) for embodied AI, a new paradigm that unifies the strengths of M3T-FMs with the privacy-preserving distributed training nature of FL, enabling intelligent systems at the wireless edge. We collect critical deployment dimensions of M3T-FFMs in embodied AI ecosystems under a unified framework, which we name "EMBODY": Embodiment heterogeneity, Modality richness and imbalance, Bandwidth and compute constraints, On-device continual learning, Distributed control and autonomy, and Yielding safety, privacy, and personalization. For each, we identify concrete challenges and envision actionable research directions. We also present an evaluation framework for deploying M3T-FFMs in embodied AI systems, along with the associated trade-offs. Finally, we present a prototype implementation of M3T-FFMs and evaluate their energy and latency performance.
中文摘要:本文提出多模态多任务联邦基础模型(M3T-FFMs)作为统一范式,将基础模型的泛化能力与联邦学习的隐私保护分布式训练相结合,并通过EMBODY框架从六个关键维度解决具身AI系统的部署挑战。
English Summary: This paper introduces multi-modal multi-task federated foundation models (M3T-FFMs) as a unified paradigm combining the generalization of foundation models with the privacy-preserving distributed training of federated learning, addressing six key deployment dimensions under the EMBODY framework for embodied AI systems.

Authors:Changyue Jiang, Xudong Pan, Min Yang
Title: Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction
Abstract:
LLM-based autonomous agents possess capabilities such as reasoning, tool invocation, and environment interaction, enabling the execution of complex multi-step tasks. The internal reasoning process (i.e., the thought) within a behavioral trajectory significantly influences tool usage and subsequent actions but can introduce potential risks. Even minor deviations in the agent's thought may trigger cascading effects leading to irreversible safety incidents. To address the safety alignment challenges in long-horizon behavioral trajectories, we propose Thought-Aligner, a plug-in dynamic thought correction module. Utilizing a lightweight and resource-efficient model, Thought-Aligner corrects each high-risk thought on the fly before each action execution. The corrected thought is then reintroduced to the agent, ensuring safer subsequent decisions and tool interactions. Importantly, Thought-Aligner modifies only the reasoning phase without altering the underlying agent framework, making it easy to deploy and widely applicable to various agent frameworks. To train the Thought-Aligner model, we construct an instruction dataset across ten representative scenarios and simulate ReAct execution trajectories, generating 5,000 diverse instructions and more than 11,400 safe and unsafe thought pairs. The model is fine-tuned using contrastive learning techniques. Experiments across three agent safety benchmarks involving 12 different LLMs demonstrate that Thought-Aligner raises agent behavioral safety from approximately 50% in the unprotected setting to 90% on average. Additionally, Thought-Aligner maintains response latency below 100ms with minimal resource usage, demonstrating its capability for efficient deployment, broad applicability, and timely responsiveness. This method thus provides a practical dynamic safety solution for LLM-based agents.
中文:Thought-Aligner是一种轻量级插件模块,可在LLM智能体执行过程中动态修正高风险推理,在不改变核心框架的情况下将行为安全率从50%显著提升至90%。
English: Thought-Aligner is a lightweight, plug-in module that dynamically corrects high-risk reasoning in LLM-based agents before each action, raising behavioral safety from approximately 50% to 90% on average without altering the underlying agent framework.
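The plug-in pattern described in the abstract (correct each thought before the action executes, then hand the corrected thought back to the agent) can be sketched as a thin wrapper around an agent loop. Everything below is a hypothetical illustration: `run_with_correction` and the `toy_*` functions are invented stand-ins, and the string-matching corrector merely takes the place of the fine-tuned correction model.

```python
def run_with_correction(task, propose, correct, act, max_steps=3):
    """Agent loop in which every thought passes through a corrector before acting."""
    history = []
    observation = task
    for _ in range(max_steps):
        thought = propose(observation)   # the agent's raw reasoning step
        safe_thought = correct(thought)  # plug-in correction; agent code is unchanged
        observation = act(safe_thought)  # the action is driven by the corrected thought
        history.append((safe_thought, observation))
    return history

# Toy stand-ins (assumptions) for the agent, the corrector, and the environment.
def toy_propose(observation: str) -> str:
    return "delete all files to free disk space"

def toy_correct(thought: str) -> str:
    if "delete all files" in thought:
        return "list large files and ask the user before deleting anything"
    return thought

def toy_act(thought: str) -> str:
    return f"executed: {thought}"

history = run_with_correction("free disk space", toy_propose, toy_correct, toy_act, max_steps=2)
for thought, observation in history:
    print(thought, "->", observation)
```

Because the corrector sits between the proposal and the action, it can be swapped in or out without touching the agent's own logic, which matches the deployment claim in the abstract.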

Authors:Chenhui Xu, Dancheng Liu, Amir Nassereldine, Jinjun Xiong
Title: FP64 is All You Need: Rethinking Failure Modes in Physics-Informed Neural Networks
Abstract:
Physics-Informed Neural Networks (PINNs) often exhibit failure modes in which the PDE residual loss converges while the solution error stays large, a phenomenon traditionally blamed on local optima separated from the true solution by steep loss barriers. We challenge this understanding by demonstrating that the real culprit is insufficient arithmetic precision: with standard FP32, the LBFGS optimizer prematurely satisfies its convergence test, freezing the network in a spurious failure phase. Simply upgrading to FP64 rescues optimization, enabling vanilla PINNs to solve PDEs without any failure modes. These results reframe PINN failure modes as precision-induced stalls rather than inescapable local minima and expose a three-stage training dynamic (unconverged, failure, success) whose boundaries shift with numerical precision. Our findings emphasize that rigorous arithmetic precision is the key to dependable PDE solving with neural networks.
中文: PINN的失败源于算术精度不足导致优化器过早收敛,升级至FP64可解决此问题,确保神经网络可靠求解偏微分方程。
English: PINN failures stem from insufficient arithmetic precision causing premature optimizer convergence, which can be resolved by upgrading to FP64 to ensure reliable PDE solutions.
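A toy illustration of how limited precision can make an optimizer believe it has converged: a small loss decrease that FP64 can represent is rounded away in FP32, so a test of the form "the loss did not change" passes early. This is a sketch of the general mechanism only; the `stalls` check is a hypothetical stand-in and not the actual LBFGS convergence criterion studied in the paper.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def stalls(loss: float, decrease: float, fp32: bool) -> bool:
    """Return True if the loss looks unchanged after an improvement of `decrease`."""
    new_loss = loss - decrease
    if fp32:
        # Simulate storing both values in single precision.
        loss, new_loss = to_fp32(loss), to_fp32(new_loss)
    return new_loss == loss

loss, decrease = 1.0, 1e-9
print(stalls(loss, decrease, fp32=True))   # True: the improvement is rounded away
print(stalls(loss, decrease, fp32=False))  # False: FP64 still registers progress
```

Near 1.0, adjacent FP32 values are roughly 6e-8 apart, so an improvement of 1e-9 is invisible in FP32, whereas FP64 resolves differences on the order of 1e-16.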

Authors:Zhikai Zhang, Chao Chen, Han Xue, Jilong Wang, Sikai Liang, Yun Liu, Zongzhang Zhang, He Wang, Li Yi
Title: Unleashing Humanoid Reaching Potential via Real-world-Ready Skill Space
Abstract:
Humans possess a large reachable space in the 3D world, enabling interaction with objects at varying heights and distances. However, realizing such large-space reaching on humanoids is a complex whole-body control problem and requires the robot to master diverse skills simultaneously, including base positioning and reorientation, height and body posture adjustments, and end-effector pose control. Learning from scratch often leads to optimization difficulty and poor sim2real transferability. To address this challenge, we propose Real-world-Ready Skill Space (R2S2). Our approach begins with a carefully designed skill library consisting of real-world-ready primitive skills. We ensure optimal performance and robust sim2real transfer through individual skill tuning and sim2real evaluation. These skills are then ensembled into a unified latent space, serving as a structured prior that helps task execution in an efficient and sim2real transferable manner. A high-level planner, trained to sample skills from this space, enables the robot to accomplish real-world goal-reaching tasks. We demonstrate zero-shot sim2real transfer and validate R2S2 in multiple challenging goal-reaching scenarios.
Chinese Summary: 该研究提出了真实世界就绪技能空间(R2S2),通过将预调优的原始技能整合到统一潜空间中,使人形机器人能够执行复杂的全身伸展任务,并实现稳健的仿真到现实迁移。
English Summary: The study introduces Real-world-Ready Skill Space (R2S2), a framework that combines pre-tuned primitive skills into a unified latent space to enable humanoid robots to perform complex whole-body reaching tasks with robust sim2real transfer.

Authors:Zibin Dong, Fei Ni, Yifu Yuan, Yinchuan Li, Jianye Hao
Title: EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation
Abstract:
We present EmbodiedMAE, a unified 3D multi-modal representation for robot manipulation. Current approaches suffer from significant domain gaps between training datasets and robot manipulation tasks, while also lacking model architectures that can effectively incorporate 3D information. To overcome these limitations, we enhance the DROID dataset with high-quality depth maps and point clouds, constructing DROID-3D as a valuable supplement for 3D embodied vision research. Then we develop EmbodiedMAE, a multi-modal masked autoencoder that simultaneously learns representations across RGB, depth, and point cloud modalities through stochastic masking and cross-modal fusion. Trained on DROID-3D, EmbodiedMAE consistently outperforms state-of-the-art vision foundation models (VFMs) in both training efficiency and final performance across 70 simulation tasks and 20 real-world robot manipulation tasks on two robot platforms. The model exhibits strong scaling behavior with size and promotes effective policy learning from 3D inputs. Experimental results establish EmbodiedMAE as a reliable unified 3D multi-modal VFM for embodied AI systems, particularly in precise tabletop manipulation settings where spatial perception is critical.
中文: 我们提出EmbodiedMAE,一种统一的三维多模态模型,通过融合RGB、深度和点云数据学习表征,在仿真和真实机器人操作任务中均展现出卓越的效率和性能。
English: We introduce EmbodiedMAE, a unified 3D multi-modal model that learns from RGB, depth, and point cloud data, demonstrating superior efficiency and performance in both simulated and real-world robot manipulation tasks.

Authors:Justin Yu, Letian Fu, Huang Huang, Karim El-Refai, Rares Andrei Ambrus, Richard Cheng, Muhammad Zubair Irshad, Ken Goldberg
Title: Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware
Abstract:
Scaling robot learning requires vast and diverse datasets. Yet the prevailing data collection paradigm, human teleoperation, remains costly and constrained by manual effort and physical robot access. We introduce Real2Render2Real (R2R2R), a novel approach for generating robot training data without relying on object dynamics simulation or teleoperation of robot hardware. The input is a smartphone-captured scan of one or more objects and a single video of a human demonstration. R2R2R renders thousands of high visual fidelity robot-agnostic demonstrations by reconstructing detailed 3D object geometry and appearance, and tracking 6-DoF object motion. R2R2R uses 3D Gaussian Splatting (3DGS) to enable flexible asset generation and trajectory synthesis for both rigid and articulated objects, converting these representations to meshes to maintain compatibility with scalable rendering engines like IsaacLab but with collision modeling off. Robot demonstration data generated by R2R2R integrates directly with models that operate on robot proprioceptive states and image observations, such as vision-language-action (VLA) models and imitation learning policies. Physical experiments suggest that models trained on R2R2R data from a single human demonstration can match the performance of models trained on 150 human teleoperation demonstrations. Project page: https://real2render2real.com
中文: R2R2R是一种创新方法,通过将智能手机扫描和人类演示转化为高保真3D渲染,生成大量机器人训练数据,无需昂贵的人工遥操作,仅需单次演示即可达到与传统方法150次演示相当的性能。
English: R2R2R is a novel method that generates extensive robot training data by converting smartphone scans and a single human demonstration video into high-fidelity 3D renderings, avoiding costly teleoperation; physical experiments suggest that models trained this way can match models trained on 150 teleoperation demonstrations.